The authors show that OpenAI's large language model (LLM) GPT performed well when used for title and abstract eligibility screening of scientific articles within a systematic literature review workflow.
Researchers evaluated GPT on screening data from a systematic review on electric vehicle charging infrastructure demand, comprising almost 12 000 records, using the same eligibility criteria as the human screeners. They tested three versions of the model, each tasked with distinguishing relevant from irrelevant records by responding with a relevance probability between 0 and 1.
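To make the setup concrete, the sketch below shows one way to elicit such a relevance probability per record via the OpenAI Python SDK. The prompt wording, criteria text and function name are illustrative assumptions, not the authors' actual protocol.

```python
# A minimal sketch (not the authors' actual prompt) of asking GPT for a
# relevance probability for one record, using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in for the review's eligibility criteria.
CRITERIA = (
    "Include studies that model or measure demand for electric vehicle "
    "charging infrastructure. Exclude opinion pieces and editorials."
)

def relevance_probability(title: str, abstract: str, model: str = "gpt-4") -> float:
    """Ask the model for a 0-1 probability that a record meets the criteria."""
    prompt = (
        f"Eligibility criteria:\n{CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Reply with only a number between 0 and 1: the probability that "
        "this record is relevant under the criteria."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep screening decisions as reproducible as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```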
For the latest GPT-4 model (tested in November 2023) and a probability cut-off of 0.5, the recall rate was 100%, meaning no relevant papers were missed; using the model for screening at this cut-off would have saved 50% of the time that would otherwise have been spent on manual screening. Raising the cut-off threshold can save more time: with a threshold chosen so that recall stays above 95% for GPT-4 (i.e. up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening.
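The trade-off behind these figures is simple: records scoring below the cut-off are excluded without manual review, so a higher cut-off saves more screening work but risks missing relevant papers. This short sketch (with illustrative variable names, not the study's code) shows how recall and work saved follow from a chosen cut-off.

```python
# A sketch of the cut-off trade-off described above. Inputs are assumed:
# `probs` holds the model's relevance probabilities, `relevant` the human
# (gold-standard) labels for the same records.
def screening_stats(probs: list[float], relevant: list[bool], cutoff: float):
    kept = [p >= cutoff for p in probs]  # records still screened manually
    hits = sum(1 for k, r in zip(kept, relevant) if k and r)
    recall = hits / sum(relevant)            # share of relevant records kept
    work_saved = 1 - sum(kept) / len(probs)  # share auto-excluded, never read
    return recall, work_saved

# e.g. recall, saved = screening_stats(probs, relevant, cutoff=0.5)
# In the study, cutoff 0.5 gave recall 1.0 and saved ~50% of the records;
# a higher cutoff kept recall above 0.95 while saving ~75%.
```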
If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy and precision, the savings in work and cost are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study evaluated performance on only one systematic review and one prompt, the authors caution that more testing and methodological development is needed; they outline the next steps for properly evaluating the rigor and effectiveness of LLMs for eligibility screening.