Optimising Performance of Generative Artificial Intelligence (GenAI) in Systematic Literature Review (SLR) Screening Using PICOS Criteria
Author(s)
Miles G1, Giles L2, Kerr BC3, Norman B3, Sibbring GC1
1Prime, Knutsford, Cheshire, UK, 2Prime, Cheshire, UK, 3Prime, London, UK
OBJECTIVES: GenAI has the potential to increase the efficiency and speed of literature reviews through automated screening, summarisation, and data extraction. However, few data are available on the feasibility of genAI-based screening for complex research questions. We compared the performance of genAI-based vs human screening using a previously conducted SLR.
METHODS: The dataset comprised 300 titles/abstracts screened for an SLR of dual- vs triple-inhaled therapy in patients with chronic obstructive pulmonary disease. Query prompts (n=17) based on Population, Intervention, Comparator, Outcome, and Study design (PICOS) criteria from the SLR protocol were used to screen the dataset via a proprietary GPT-4o-based tool. Responses were restricted to "yes", "no", or "unclear". Accuracy, recall, and precision were calculated for genAI vs human review. Additionally, agreement using the mode of genAI responses from 10 runs was evaluated in a subset of 50 records.
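The metric calculation described above can be sketched as follows. This is a minimal illustration, not the authors' actual tool: it assumes per-record responses are simple strings, treats the human decision as the reference standard, counts "yes" as the positive (include) class, and takes the per-record mode across repeated runs; the function names and example data are hypothetical.

```python
from collections import Counter


def confusion_metrics(genai, human, positive="yes"):
    """Accuracy, recall, and precision of genAI responses
    against human review (the reference standard)."""
    tp = sum(g == positive and h == positive for g, h in zip(genai, human))
    fp = sum(g == positive and h != positive for g, h in zip(genai, human))
    fn = sum(g != positive and h == positive for g, h in zip(genai, human))
    correct = sum(g == h for g, h in zip(genai, human))
    return {
        "accuracy": correct / len(human),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


def modal_response(runs):
    """Most frequent response per record across repeated genAI runs
    (runs is a list of per-run response lists, aligned by record)."""
    return [Counter(record).most_common(1)[0][0] for record in zip(*runs)]
```

For example, `confusion_metrics(modal_response(runs), human_decisions)` would give the agreement of the 10-run mode with human review, analogous to the subset analysis described in the methods.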
RESULTS: GenAI showed high agreement with human review on Population, Outcome and Study design questions (respectively, accuracy: 92%, 83% and 83%; recall: 93%, 88%, and 80%; precision: 98%, 92%, and 85%). Agreement was poor for the Intervention/Comparator question (accuracy: 50%, recall: 33%, precision: 91%). The overall inclusion/exclusion decision showed high accuracy and precision (80% and 81%, respectively), but low recall (35%). Using the mode of genAI responses from 10 runs improved agreement with human review for overall inclusion/exclusion decisions vs a single run (accuracy: 54% vs 40%; recall: 42% vs 30%; precision: 78% vs 59%).
CONCLUSIONS: GenAI showed high agreement with human reviewers when screening literature against Population, Outcome, and Study design questions, similar to that expected between human reviewers; however, performance was poor for the complex Intervention/Comparator question. These results support a possible role for genAI assistance with screening of literature search results, with potential for time and cost savings, but also highlight the limitations of GPT-4o and equivalent genAI models.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR201
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas