Optimising Performance of Generative Artificial Intelligence (GenAI) in Systematic Literature Review (SLR) Screening Using PICOS Criteria

Author(s)

Miles G¹, Giles L², Kerr BC³, Norman B³, Sibbring GC¹
¹Prime, Knutsford, Cheshire, UK, ²Prime, Cheshire, CHE, UK, ³Prime, London, London, UK

Presentation Documents

18023336_Prime_GenAI-poster_v10 compressed146608.pdf

OBJECTIVES: GenAI has the potential to increase the efficiency and speed of literature reviews through automated screening, summarisation, and data extraction. However, few data are available on the feasibility of genAI-based screening for complex research questions. We compared the performance of genAI-based vs human screening using a previously conducted SLR.

METHODS: The dataset comprised 300 titles/abstracts screened for an SLR of dual- vs triple-inhaled therapy in patients with chronic obstructive pulmonary disease. Query prompts (n=17) based on Population, Intervention, Comparator, Outcome, and Study design (PICOS) criteria from the SLR protocol were used to screen the dataset via a proprietary GPT-4o-based tool. Responses were restricted to yes, no, or unclear. Accuracy, recall, and precision were calculated for genAI responses relative to human review. Additionally, agreement using the mode of genAI responses from 10 runs was evaluated in a subset of 50 records.
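As an illustration of the agreement metrics named above, the following is a minimal sketch rather than the authors' implementation. It assumes human review is the reference standard and that only a "yes" response counts as a positive (so "unclear" is grouped with "no"); the abstract does not state how "unclear" responses were handled. The function name `screening_metrics` and the example labels are hypothetical.

```python
from typing import Dict, Sequence


def screening_metrics(human: Sequence[str], genai: Sequence[str]) -> Dict[str, float]:
    """Accuracy, recall, and precision of genAI labels against human labels.

    Assumption (not from the abstract): only "yes" is a positive decision;
    "no" and "unclear" are both treated as negative.
    """
    if len(human) != len(genai):
        raise ValueError("human and genai must have the same length")
    if not human:
        raise ValueError("at least one record is required")

    tp = fp = fn = tn = 0
    for h, g in zip(human, genai):
        h_pos = h == "yes"
        g_pos = g == "yes"
        if h_pos and g_pos:
            tp += 1  # both include
        elif g_pos:
            fp += 1  # genAI includes, human does not
        elif h_pos:
            fn += 1  # human includes, genAI does not
        else:
            tn += 1  # both exclude

    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(human),
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }


# Hypothetical labels for five records, for illustration only.
human_labels = ["yes", "yes", "no", "no", "yes"]
genai_labels = ["yes", "no", "no", "unclear", "yes"]
metrics = screening_metrics(human_labels, genai_labels)
```

Under these assumptions, low recall means that genAI missed records the human reviewers included, which is the costlier error in SLR screening.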

RESULTS: GenAI showed high agreement with human review on Population, Outcome, and Study design questions (respectively, accuracy: 92%, 83%, and 83%; recall: 93%, 88%, and 80%; precision: 98%, 92%, and 85%). Agreement was poor for the Intervention/Comparator question (accuracy: 50%; recall: 33%; precision: 91%). The overall inclusion/exclusion decision showed high accuracy and precision (80% and 81%, respectively), but low recall (35%). In the subset of 50 records, using the mode of genAI responses from 10 runs improved agreement with human review for overall inclusion/exclusion decisions vs a single run (accuracy: 54% vs 40%; recall: 42% vs 30%; precision: 78% vs 59%).
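The mode-of-10-runs aggregation can be sketched as a simple majority vote per record. This is a hedged illustration, not the tool's actual logic: the function name `modal_response` is hypothetical, and the tie-breaking rule (falling back to "unclear" when no single response is most frequent) is an assumption, since the abstract does not say how ties were resolved.

```python
from collections import Counter
from typing import Sequence


def modal_response(runs: Sequence[str]) -> str:
    """Return the most frequent response across repeated runs for one record.

    Assumption (not from the abstract): if two or more responses tie for
    most frequent, the record is labelled "unclear".
    """
    if not runs:
        raise ValueError("at least one run is required")
    counts = Counter(runs)
    top = max(counts.values())
    winners = [response for response, n in counts.items() if n == top]
    return winners[0] if len(winners) == 1 else "unclear"


# Hypothetical responses from 10 runs on a single record.
clear_majority = modal_response(["yes"] * 6 + ["no"] * 3 + ["unclear"])
tied = modal_response(["yes"] * 5 + ["no"] * 5)
```

The modal label for each record would then be compared with the human decision in the same way as a single-run label.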

CONCLUSIONS: GenAI showed high agreement with human reviewers for screening literature against Population, Outcome, and Study design questions, similar to that expected between human reviewers; however, performance was poor for the complex Intervention/Comparator question. These results support a possible role for genAI assistance with screening of literature search results, with potential for time and cost savings, but also highlight the limitations of GPT-4o and equivalent genAI models.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

MSR201

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
