Assessing the Performance of Large Language Models in Automating Systematic Literature Reviews: Insights From Recent Studies
Speaker(s)
Samur S1, Mody B1, Fleurence R2, Bayraktar E1, Ayer T3, Chhatwal J4
1Value Analytics Labs, Boston, MA, USA, 2NIH, Washington, DC, USA, 3Value Analytics Labs and Georgia Tech, Atlanta, GA, USA, 4Harvard Medical School and Value Analytics Labs, Wilmington, MA, USA
OBJECTIVES: The use of large language models (LLMs) in systematic literature reviews (SLRs) has garnered significant interest; however, their accuracy and reliability compared with those of human reviewers are not well understood. Our objective was to summarize the findings from recent studies that evaluated the performance of LLMs across the tasks of a conventional SLR.
METHODS: We identified and reviewed eight studies published between 2023 and 2024 and conducted in Canada, Germany, Ireland, the UK, and the USA. These studies assessed LLM performance on different SLR tasks, including abstract screening, data extraction, and risk-of-bias assessment. The studies employed models such as ChatGPT, Claude 2, and others, which were evaluated on accuracy, sensitivity, specificity, and the ability to generate reliable summaries.
RESULTS: The studies revealed mixed performance of LLMs. Screening: ChatGPT achieved high accuracy in formulating review questions and screening abstracts, with a sensitivity of 76%. Bias assessment: GPT-4 had a Cohen’s kappa of 0.35 when compared with human reviewers in assessing risk of bias using the ROBINS-I tool. Data extraction: Claude 2 and GPT-4 showed promising data extraction capabilities, with accuracy rates exceeding 96%. Efficiency: The use of LLMs significantly reduced the time required for data extraction and abstract screening compared with traditional methods. Limitations: Key issues included the generation of hallucinated information, inconsistent performance, and the necessity of human oversight to validate findings. Performance across different tasks: While LLMs performed relatively well in specific tasks such as screening and data extraction, they performed less well in bias assessment.
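As an illustration of how the agreement metrics reported above (sensitivity, specificity, Cohen’s kappa) can be derived, the following is a minimal sketch, not drawn from the reviewed studies, that compares hypothetical LLM include/exclude screening decisions against human reviewer labels using scikit-learn; all data values and variable names are assumptions for illustration only.

# Illustrative sketch (hypothetical data): agreement metrics for LLM vs. human screening decisions.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# 1 = include, 0 = exclude (hypothetical screening decisions on ten abstracts)
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
llm_labels   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(human_labels, llm_labels).ravel()
sensitivity = tp / (tp + fn)   # share of human-included records the LLM also included
specificity = tn / (tn + fp)   # share of human-excluded records the LLM also excluded
kappa = cohen_kappa_score(human_labels, llm_labels)  # chance-corrected agreement

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, Kappa: {kappa:.2f}")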
CONCLUSIONS: LLMs show promise in automating SLR tasks, potentially reducing review time and cost. Studies demonstrated high potential for abstract screening and data extraction but faced challenges in reliability and consistency. Human oversight remains necessary to ensure accuracy.
Code
MSR84
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas