Assessing the Performance of Large Language Models in Automating Systematic Literature Reviews: Insights From Recent Studies
Speaker(s)
Samur S1, Mody B1, Fleurence R2, Bayraktar E1, Ayer T3, Chhatwal J4
1Value Analytics Labs, Boston, MA, USA, 2NIH, Washington, DC, USA, 3Value Analytics Labs and Georgia Tech, Atlanta, GA, USA, 4Harvard Medical School and Value Analytics Labs, Wilmington, MA, USA
OBJECTIVES: The use of large language models (LLMs) in systematic literature reviews (SLRs) has garnered significant interest; however, their accuracy and reliability compared with those of human reviewers are not well understood. Our objective was to summarize the findings from recent studies that evaluated the performance of LLMs across the tasks of a conventional SLR.
METHODS: We identified and reviewed eight studies published between 2023 and 2024 and conducted in Canada, Germany, Ireland, the UK, and the USA. These studies assessed LLM performance on different SLR tasks, including abstract screening, data extraction, and risk-of-bias assessment. The studies employed models such as ChatGPT, Claude 2, and others, which were evaluated on accuracy, sensitivity, specificity, and the ability to generate reliable summaries.
RESULTS: The studies revealed mixed performance of LLMs. Screening: ChatGPT achieved high accuracy in formulating review questions and screening abstracts, with a sensitivity of 76%. Bias assessment: GPT-4 had a Cohen’s kappa of 0.35 when compared with human reviewers in assessing risk of bias using the ROBINS-I tool. Data extraction: Claude 2 and GPT-4 showed promising data extraction capabilities, with accuracy rates exceeding 96%. Efficiency: The use of LLMs significantly reduced the time required for data extraction and abstract screening compared with traditional methods. Limitations: Key issues included the generation of hallucinated information, inconsistent performance, and the necessity of human oversight to validate findings. Performance across different tasks: While LLMs performed relatively well in specific tasks such as screening and data extraction, they performed less well in bias assessment.
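As an illustration of how the agreement metrics reported above (sensitivity, specificity, Cohen’s kappa) can be derived, the following is a minimal sketch, not drawn from the reviewed studies, that compares hypothetical LLM include/exclude screening decisions against human reviewer labels using scikit-learn; all data values and variable names are assumptions for illustration only.

# Illustrative sketch (hypothetical data): agreement metrics for LLM vs. human screening decisions.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# 1 = include, 0 = exclude (hypothetical screening decisions on ten abstracts)
human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
llm_labels   = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(human_labels, llm_labels).ravel()
sensitivity = tp / (tp + fn)   # share of human-included records the LLM also included
specificity = tn / (tn + fp)   # share of human-excluded records the LLM also excluded
kappa = cohen_kappa_score(human_labels, llm_labels)  # chance-corrected agreement

print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, Kappa: {kappa:.2f}")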
CONCLUSIONS: LLMs show promise in automating SLR tasks, potentially reducing review time and cost. Studies demonstrated high potential for abstract screening and data extraction but faced challenges in reliability and consistency. Human oversight remains necessary to ensure accuracy.
Code
MSR84
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas