Large Language Models for Risk of Bias Assessment: A Case Study
Speaker(s)
Edwards M1, Bishop E2, Reddish K2, Carr E1, Ferrante di Ruffano L3
1York Health Economics Consortium, York, YOR, UK, 2York Health Economics Consortium, York, UK, 3York Health Economics Consortium, York, NYK, UK
OBJECTIVES: Risk of bias assessment (RoBA) of primary studies is a key part of any systematic review. As a repetitive and structured task, RoBA would initially appear to be well suited to automation or AI support. We assessed the chat interface to Claude 3 Opus for accuracy, consistency, presentation of data, and time savings in the context of RoBA of RCTs for a systematic review.
METHODS: Six RCTs were selected from three reviews conducted by our consultancy over the past five years. Following an initial prompt engineering phase using a report of a seventh RCT, the LLM was used to (1) conduct a fully automated assessment of each paper using the Cochrane RoB 1 tool (Method 1), and (2) supply supporting information only, to facilitate a joint human/LLM assessment using the same tool (Method 2). The results were compared with a fully human assessment (Method 3).
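The difference between the two LLM-based methods can be illustrated with a minimal sketch. The prompt wording, the function name `build_prompt`, and the idea of assembling prompts in code are all assumptions made for illustration; the study used the chat interface, and the actual prompts are not given in this abstract. Only the seven domain names are taken from the Cochrane RoB 1 tool itself.

```python
# Illustrative sketch only: the study's actual prompts are not reported here.
# The domain list follows the Cochrane RoB 1 tool; all wording is hypothetical.
ROB1_DOMAINS = [
    "Random sequence generation",
    "Allocation concealment",
    "Blinding of participants and personnel",
    "Blinding of outcome assessment",
    "Incomplete outcome data",
    "Selective reporting",
    "Other bias",
]


def build_prompt(method: int) -> str:
    """Return a hypothetical prompt for Method 1 or Method 2."""
    if method == 1:
        # Method 1: the model is asked to make the judgement itself.
        task = (
            "For each domain, give a judgement of low, high or unclear "
            "risk of bias, with supporting text from the paper."
        )
    elif method == 2:
        # Method 2: the model only extracts evidence; a human then judges.
        task = (
            "For each domain, quote the passages from the paper that are "
            "relevant to the assessment. Do not give a judgement."
        )
    else:
        raise ValueError("method must be 1 or 2")

    domain_list = "\n".join(f"- {name}" for name in ROB1_DOMAINS)
    return f"{task}\n\nDomains:\n{domain_list}"


if __name__ == "__main__":
    print(build_prompt(2))
```

Keeping the domain list fixed and varying only the task instruction reflects the design described above, in which both methods use the same tool and differ only in whether the model is asked to judge or to supply information.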
RESULTS: Method 1 resulted in very brief answers, with little supporting information provided by the model. Asking for supporting information only (Method 2) resulted in better-quality and more complete data, although no judgement was made by the LLM. Percentage agreement between the three methods varied widely, ranging from 16.7% to 100% across domains. Agreement was lowest on questions relating to treatment allocation, incomplete outcome data and other sources of potential bias. In these instances, the LLM appeared to have misinterpreted the questions, resulting in answers that differed from those of the human assessor. However, there were also a few occasions where the LLM identified information that the human assessor had missed.
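A per-domain agreement percentage of this kind can be computed as in the sketch below. The abstract does not state the exact definition used, so this assumes agreement means that all three methods gave the same judgement for a study. The function name `percent_agreement` and every judgement in `example` are invented for illustration and are not the study's data. The example simply shows that, with six studies, full agreement on one of the six corresponds to 16.7%.

```python
# Minimal sketch, assuming "agreement" means all three methods gave the same
# judgement for a study. The abstract does not define the measure precisely.

def percent_agreement(ratings):
    """Return the percentage of studies with identical judgements, per domain.

    ratings maps a domain name to a list with one tuple per study, each tuple
    holding the judgements from (Method 1, Method 2, Method 3).
    """
    result = {}
    for domain, per_study in ratings.items():
        # A study counts as agreeing when its judgements are all the same.
        agreeing = sum(1 for judgements in per_study if len(set(judgements)) == 1)
        result[domain] = round(100 * agreeing / len(per_study), 1)
    return result


# Hypothetical judgements for six studies; NOT data from the case study.
example = {
    "Random sequence generation": [("low", "low", "low")] * 6,
    "Incomplete outcome data": [("low", "low", "low")] + [("high", "low", "low")] * 5,
}

agreement = percent_agreement(example)

if __name__ == "__main__":
    for domain, pct in agreement.items():
        print(f"{domain}: {pct}%")
```

Simple percentage agreement does not correct for agreement expected by chance, so with only six studies per domain the figures should be read as descriptive.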
CONCLUSIONS: Using LLMs for fully automated RoBA is not recommended at this stage, as such models can misinterpret questions and provide limited or incorrect justification for judgements. However, with suitable prompt engineering and fine-tuning on existing RoBA data, the performance of these models may improve over time.
Code
MSR212
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas