Revolutionizing Systematic Reviews: The Precision of LLMs in Screening Observational Studies

Author(s)

Langham J1, Reason T1, Gimblett A1, Malcolm B2, Hill N3
1Estima Scientific Ltd, South Ruislip, LON, UK, 2Bristol Myers Squibb, Middlesex, LON, UK, 3Bristol Myers Squibb, Princeton, NJ, USA

OBJECTIVES: We previously reported the accuracy of GPT-4 in screening titles, abstracts, and full publications for a systematic review of randomized controlled trials, showing specificity and sensitivity of 95.9% and 86.7%, respectively. Our objective was to assess GPT-4’s accuracy in selecting studies for systematic literature reviews (SLRs) of real-world evidence (RWE) compared to traditional double screening by human reviewers.

METHODS: Two case studies were selected in which two human reviewers had screened titles and abstracts, followed by full-text publications. The SLRs had different criteria; one studied the epidemiology of solid tumor sites harboring NTRK fusion mutations, and the other compared outcomes from oncology therapies that have both an IV and SC formulation. GPT-4 was used via a Python API to identify titles and abstracts that fulfilled the eligibility criteria. We compared the screening results of GPT-4 and the human reviewers to determine agreement and successful identification of publications.
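The screening step described above can be sketched as a small Python function. This is a minimal illustration only, assuming an OpenAI-style chat interface; the prompt wording, the `ask` callable, and the INCLUDE/EXCLUDE convention are assumptions for illustration, not the authors' actual prompt or pipeline.

```python
def screen_record(title: str, abstract: str, criteria: str, ask) -> bool:
    """Return True if the LLM judges a record eligible for the SLR.

    `ask` is any callable mapping a prompt string to the model's text
    reply (e.g. a thin wrapper around an OpenAI chat-completions call);
    it is injected here so the sketch stays self-contained and testable.
    """
    prompt = (
        "You are screening records for a systematic literature review.\n"
        f"Eligibility criteria:\n{criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    # Treat anything that does not clearly start with INCLUDE as an exclusion.
    return ask(prompt).strip().upper().startswith("INCLUDE")
```

In use, `ask` would wrap the GPT-4 API call, and each title/abstract pair in the search output would be passed through `screen_record` to produce the machine-screening decisions that were then compared against the human double-screening results.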

RESULTS: The sensitivity and specificity (with 95% confidence intervals) of GPT-4 compared to humans were 91.07% (83.60 to 98.54) and 74.38% (71.56 to 77.20), respectively, in case study 1 (n=977), and 87.50% (64.58 to 110.42) and 85.86% (83.28 to 88.44), respectively, in case study 2 (n=708). GPT-4 required approximately 1 hour to screen 500 titles and abstracts.
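The reported intervals appear to be Wald (normal-approximation) confidence intervals, which are not truncated at 100% and so can produce an upper bound such as 110.42% when the sample of relevant records is small. A short sketch, assuming Wald intervals and illustrative counts (7 of 8 relevant records found, matching the 87.50% sensitivity in case study 2; these counts are inferred, not stated in the abstract):

```python
from math import sqrt


def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald 95% confidence interval for a proportion, in percent.

    The Wald interval p +/- z * sqrt(p(1-p)/n) is not bounded to
    [0, 100], which is why a limit can exceed 100% for small n.
    """
    p = successes / n
    se = sqrt(p * (1 - p) / n)
    return 100 * (p - z * se), 100 * (p + z * se)


# Illustrative: 7/8 relevant records identified (sensitivity 87.50%).
lo, hi = wald_ci(7, 8)
print(f"{100 * 7 / 8:.2f} ({lo:.2f} to {hi:.2f})")  # 87.50 (64.58 to 110.42)
```

With these assumed counts the interval reproduces the reported 64.58 to 110.42; a score-based interval (e.g. Wilson) would keep the bounds within 0 to 100%.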

CONCLUSIONS: Searching and screening for observational studies is more difficult because of poor adherence to reporting guidelines, which inflates search output. Nevertheless, GPT-4 quickly and accurately summarized relevant study characteristics from titles and abstracts to determine study-design eligibility in two diverse RWE SLRs. Further prompt refinement and fine-tuning of GPT-4 would increase accuracy, particularly for the more complex decisions. Testing on further SLRs and on full-publication review will be required to improve prompting and demonstrate generalizability.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

EPH28

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
