Automating Systematic Literature Review Abstract Screening Using Large Language AI Models: A Development and Validation Study

Speaker(s)

Zdorovtsova N1, Castanon A2, Marsland A1, Bray BD3
1Lane Clark & Peacock LLP, London, UK, 2Lane Clark and Peacock, London, UK, 3LCP Health Analytics, London, UK

OBJECTIVES: Recent advancements in Large Language Models (LLMs) have enabled the automation of various time-consuming and expensive tasks. Screening medical abstracts for systematic literature reviews (SLRs) is one such task. We aimed to develop and validate an approach to using LLMs to automate abstract screening, comparing OpenAI’s GPT-3.5-Turbo and GPT-4-Turbo for this purpose.

METHODS: We developed LLM prompts for SLR abstract screening, drawing on the PICOS (Population, Intervention, Control, Outcome, and Study Design) framework. The LLMs were prompted via API calls in Python to summarise each abstract, make inclusion/exclusion decisions, provide rationales, and report decision confidence levels (on a 1-5 scale). To validate the accuracy of the approach, we evaluated abstracts from three pre-existing SLRs on PubMed, covering various disease areas and including an average of 1,736 abstracts each. Two of the SLRs focused on randomised controlled trials (RCTs), while the third focused on observational studies. For each SLR dataset, 20 runs were performed per LLM, and performance metrics were generated and compared across datasets and LLMs.
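The prompting step described above could be sketched as follows. This is a minimal illustration, not the study's actual code: the PICOS criteria shown are invented placeholder values, and the functions `build_screening_prompt` and `parse_response` are hypothetical names. The prompt text asks the model for the four outputs named in the Methods (summary, inclusion/exclusion decision, rationale, and a 1-5 confidence rating); sending the prompt to the OpenAI API is omitted here.

```python
import re

# Hypothetical PICOS criteria for one SLR (illustrative values only).
PICOS = {
    "Population": "adults with type 2 diabetes",
    "Intervention": "GLP-1 receptor agonists",
    "Control": "placebo or standard care",
    "Outcome": "change in HbA1c",
    "Study Design": "randomised controlled trials",
}

def build_screening_prompt(picos: dict, abstract: str) -> str:
    """Assemble a screening prompt asking the model to summarise the
    abstract, make an include/exclude decision, give a rationale,
    and rate its confidence on a 1-5 scale."""
    criteria = "\n".join(f"- {k}: {v}" for k, v in picos.items())
    return (
        "You are screening abstracts for a systematic literature review.\n"
        f"Inclusion criteria (PICOS):\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Respond in exactly this format:\n"
        "Summary: <one sentence>\n"
        "Decision: INCLUDE or EXCLUDE\n"
        "Rationale: <one sentence>\n"
        "Confidence: <integer 1-5>"
    )

def parse_response(text: str) -> dict:
    """Extract the structured decision and confidence fields
    from the model's reply."""
    decision = re.search(r"Decision:\s*(INCLUDE|EXCLUDE)", text)
    confidence = re.search(r"Confidence:\s*([1-5])", text)
    return {
        "decision": decision.group(1) if decision else None,
        "confidence": int(confidence.group(1)) if confidence else None,
    }
```

In practice the prompt string would be sent to the chosen model via an API call and the reply passed to `parse_response`, with the parsed decisions accumulated per run for the performance comparison.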

RESULTS: The two LLMs demonstrated high levels of accuracy (96.2%-96.9%) in classifying abstracts from SLRs of RCTs, and a lower level of accuracy (<76.3%) for the observational study SLR. GPT-4-Turbo and GPT-3.5-Turbo achieved similar performance in terms of accuracy, but GPT-3.5-Turbo was faster (e.g. 24 versus 260 minutes for 3,618 abstracts) and lower-cost. Across all LLMs and SLR datasets, the false positive rate for classifications significantly exceeded the false negative rate.

CONCLUSIONS: Our results point to the potential of using LLMs to accelerate SLR timelines while retaining accuracy at the abstract screening stage, particularly for SLRs of RCTs. Including LLM-based tools in the SLR workstream could accelerate medical research consolidation. However, users must review the reliability of abstract inclusion decisions and perform sensible technical checks, as is already common in traditional abstract screening procedures.

Code

SA16

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas