Better Than ChatGPT – AI System for Criteria-Based Screening in HEOR Based on Open-Source Large Language Models
Speaker(s)
ABSTRACT WITHDRAWN
OBJECTIVES: The predominant approach to automating screening in literature reviews, including those used to inform economic modeling, is to train binary classifiers that act as an additional reviewer and/or prioritize unscreened references by their likelihood of inclusion. However, these models have several limitations: inefficient use of information (every exclusion is treated the same way, regardless of its reason), the “cold start” problem, and a lack of flexibility when screening instructions change.
Automatic taggers and classifiers that extract information from abstracts and match it against inclusion criteria (such as study design, population, and interventions) are a promising new approach to screening automation, made possible by the advent of Large Language Models (LLMs) such as GPT-4. However, the cost, processing time, and carbon footprint associated with these models currently make it infeasible to apply them at large scale.
METHODS: We fine-tuned a number of open-source LLMs spanning 3B-13B parameters to see how smaller, specialized models fare against state-of-the-art LLMs and whether they can make it feasible to run these criteria-based classifiers across a vast number of references.
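To illustrate what matching extracted information against inclusion criteria can look like, the following is a minimal sketch and not the authors' system. It assumes a tagger (for example, a fine-tuned LLM) has already produced one label per category for each abstract; the names `Criteria` and `screen`, the category keys, and the rule that missing information does not trigger exclusion are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Criteria:
    # Hypothetical inclusion criteria: the set of accepted labels per category.
    allowed: dict[str, set[str]]


def screen(tags: dict[str, str], criteria: Criteria) -> tuple[bool, list[str]]:
    """Compare the labels extracted from one abstract with the criteria.

    Returns (include, reasons), where reasons lists each violated criterion.
    """
    reasons = []
    for category, allowed in criteria.allowed.items():
        value = tags.get(category)
        if value is None:
            # Assumption: a category the abstract does not report is not
            # treated as grounds for exclusion at the abstract stage.
            continue
        if value not in allowed:
            reasons.append(f"{category}: {value}")
    return (not reasons, reasons)


criteria = Criteria(allowed={
    "study_design": {"RCT"},
    "population": {"adults"},
})
decision = screen({"study_design": "case report", "population": "adults"}, criteria)
print(decision)
```

In this sketch each exclusion carries an explicit reason, and changing the screening instructions means editing `criteria` rather than retraining a classifier.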
RESULTS: On a benchmark of one thousand PubMed abstracts, GPT-4 achieved 75.2% accuracy. A specialized, fine-tuned 2.7B language model achieved 74.4%, while a 7B model achieved 76.1%.
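As a rough aid to reading these figures, the sketch below computes a normal-approximation 95% interval around each reported accuracy. It assumes each of the one thousand abstracts contributes a single independent correct/incorrect judgment, which the abstract does not state, and it ignores that the models were presumably scored on the same abstracts. The function name `accuracy_half_width` is hypothetical.

```python
import math


def accuracy_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies as reported in the abstract; n is the benchmark size.
reported = {"GPT-4": 0.752, "2.7B fine-tuned": 0.744, "7B fine-tuned": 0.761}
n = 1000

for name, acc in reported.items():
    half = accuracy_half_width(acc, n)
    print(f"{name}: {acc:.3f} +/- {half:.3f}")
```

Under these assumptions the half-widths come out at a little under three percentage points, which is larger than the gaps between the three reported accuracies.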
Encouraged by these results, we decided to test whether the benchmark results would hold in a real-world application. We are conducting a prospective study that replicates an existing systematic review used in Health Technology Assessment (HTA). The results of this study will be presented during the meeting.
CONCLUSIONS: Advances in AI enable new approaches to literature review automation. However, the application of LLMs can carry substantial monetary and environmental costs. Through this work we demonstrate that smaller, specialized models can outperform general-purpose ones on tasks specific to criteria-based screening.
Code
MSR2
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas