Revolutionizing Systematic Reviews: The Precision of LLMs in Screening Observational Studies
Author(s)
Langham J1, Reason T1, Gimblett A1, Malcolm B2, Hill N3
1Estima Scientific Ltd, South Ruislip, LON, UK, 2Bristol Myers Squibb, Middlesex, LON, UK, 3Bristol Myers Squibb, Princeton, NJ, USA
OBJECTIVES: We previously reported the accuracy of GPT-4 in screening titles, abstracts, and full publications for a systematic review of randomized controlled trials, showing specificity and sensitivity of 95.9% and 86.7%, respectively. Our objective was to assess GPT-4’s accuracy in selecting studies for systematic literature reviews (SLRs) of real-world evidence (RWE) compared to traditional double screening by human reviewers.
METHODS: Two case studies were selected in which two human reviewers had screened titles and abstracts, followed by full-text publications. The SLRs had different eligibility criteria: one studied the epidemiology of solid tumor sites harboring NTRK fusion mutations, and the other compared outcomes from oncology therapies available in both intravenous (IV) and subcutaneous (SC) formulations. GPT-4, accessed via a Python API, was used to identify titles and abstracts that fulfilled the eligibility criteria. We compared the screening results of GPT-4 and the human reviewers to determine agreement and successful identification of publications.
RESULTS: The sensitivity and specificity (with 95% confidence intervals) of GPT-4 compared to humans were 91.07% (83.60 to 98.54) and 74.38% (71.56 to 77.20), respectively, in case study 1 (n=977), and 87.50% (64.58 to 110.42) and 85.86% (83.28 to 88.44), respectively, in case study 2 (n=708). GPT-4 required approximately 1 hour to screen 500 titles and abstracts.
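The sensitivity interval of 64.58 to 110.42 exceeds 100%, which is characteristic of a normal-approximation (Wald) interval on a small denominator. As a hedged illustration only: assuming 7 of 8 relevant records were identified in case study 2 (the abstract does not report the underlying counts), the Wald formula p ± 1.96·√(p(1−p)/n) reproduces the reported interval.

```python
# Wald (normal-approximation) 95% CI for a proportion: p +/- 1.96*sqrt(p(1-p)/n).
# The counts (7 of 8) are an assumption chosen to match the reported interval;
# the abstract does not state the underlying 2x2 table.
import math


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) as percentages."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p * 100, (p - half_width) * 100, (p + half_width) * 100


est, lo, hi = wald_ci(7, 8)  # hypothetical sensitivity counts for case study 2
print(f"{est:.2f} ({lo:.2f} to {hi:.2f})")  # → 87.50 (64.58 to 110.42)
```

A Wilson or exact (Clopper-Pearson) interval would keep the bounds within 0-100% for small samples, which may be preferable when relevant-record counts are this low.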
CONCLUSIONS: Searching and screening for observational studies is more difficult than for randomized controlled trials because of poorer adherence to reporting guidelines, which inflates search output. Nevertheless, GPT-4 quickly and accurately summarized relevant study characteristics from the title and abstract to determine study-design eligibility in two diverse RWE SLRs. Further prompt refinement and fine-tuning of GPT-4 would likely increase accuracy, particularly for the more complex decisions. Testing on further SLRs and on full-publication review will be required to improve prompting and demonstrate generalizability.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
EPH28
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas