Revolutionizing Systematic Reviews: The Precision of LLMs in Screening Observational Studies
Author(s)
Langham J1, Reason T1, Gimblett A1, Malcolm B2, Hill N3
1Estima Scientific Ltd, South Ruislip, LON, UK, 2Bristol Myers Squibb, Middlesex, LON, UK, 3Bristol Myers Squibb, Princeton, NJ, USA
OBJECTIVES: We previously reported the accuracy of GPT-4 in screening titles, abstracts, and full publications for a systematic review of randomized controlled trials, showing specificity and sensitivity of 95.9% and 86.7%, respectively. Our objective was to assess GPT-4’s accuracy in selecting studies for systematic literature reviews (SLRs) of real-world evidence (RWE) compared to traditional double screening by human reviewers.
METHODS: Two case studies were selected in which two human reviewers had screened titles and abstracts, followed by full-text publications. The SLRs had different eligibility criteria: one studied the epidemiology of solid tumor sites harboring NTRK fusion mutations, and the other compared outcomes from oncology therapies available in both intravenous (IV) and subcutaneous (SC) formulations. GPT-4, accessed via a Python API, was used to identify titles and abstracts that fulfilled the eligibility criteria. We compared the screening results of GPT-4 and the human reviewers to determine agreement and successful identification of publications.
RESULTS: The sensitivity and specificity (with 95% confidence intervals) of GPT-4 compared to humans were 91.07% (83.60 to 98.54) and 74.38% (71.56 to 77.20), respectively, in case study 1 (n=977), and 87.50% (64.58 to 110.42) and 85.86% (83.28 to 88.44), respectively, in case study 2 (n=708). GPT-4 required approximately 1 hour to screen 500 titles and abstracts.
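The sensitivity interval of 64.58 to 110.42 exceeds 100%, which is characteristic of a normal-approximation (Wald) interval on a small denominator. As a hedged illustration only: assuming 7 of 8 relevant records were identified in case study 2 (the abstract does not report the underlying counts), the Wald formula p ± 1.96·√(p(1−p)/n) reproduces the reported interval.

```python
# Wald (normal-approximation) 95% CI for a proportion: p +/- 1.96*sqrt(p(1-p)/n).
# The counts (7 of 8) are an assumption chosen to match the reported interval;
# the abstract does not state the underlying 2x2 table.
import math


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) as percentages."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p * 100, (p - half_width) * 100, (p + half_width) * 100


est, lo, hi = wald_ci(7, 8)  # hypothetical sensitivity counts for case study 2
print(f"{est:.2f} ({lo:.2f} to {hi:.2f})")  # → 87.50 (64.58 to 110.42)
```

A Wilson or exact (Clopper-Pearson) interval would keep the bounds within 0-100% for small samples, which may be preferable when relevant-record counts are this low.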
CONCLUSIONS: Searching and screening for observational studies is more difficult than for randomized controlled trials because of poorer adherence to reporting guidelines, which inflates search output. Nevertheless, GPT-4 quickly and accurately summarized relevant study characteristics from the title and abstract to determine study-design eligibility in two diverse RWE SLRs. Further prompt refinement and fine-tuning of GPT-4 would likely increase accuracy, particularly for the more complex decisions. Testing on further SLRs and on full-publication review will be required to improve prompting and demonstrate generalizability.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
EPH28
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas