Automating Systematic Literature Review (SLR) Updates: A Comparative Validation Study of Artificial Intelligence (AI) Versus Human Screeners

Author(s)

Cichewicz A¹, Pande A¹, Borkowska K², Mittal L³, Wittkopf P⁴, Slim M⁵
¹Evidera, Waltham, MA, USA, ²Evidera, Cracow, Poland, ³Evidera, Bengaluru, India, ⁴Evidera, London, UK, ⁵Evidera, Montreal, QC, Canada

Presentation Documents

ISPOR2024_Cichewicz_MSR22_POSTER139709.pdf

OBJECTIVES: Systematic reviewers are increasingly inundated by the growing body of literature. The time- and resource-intensive nature of conducting SLRs often results in searches being outdated by the time SLRs are completed. With market access strategies relying heavily on the most up-to-date evidence, there has been growing interest in living SLRs and in expediting title/abstract screening with AI-based algorithms. Therefore, we aimed to validate an AI algorithm against human reviewers for SLR updates.

METHODS: Robot Screener was trained on six SLRs evaluating clinical efficacy and safety (CES) or economic burden (EB). An 80% subset of records from each SLR formed the training set, with the remaining 20% constituting a testing set to simulate new records from an SLR update. AI screening decisions were compared with human dual-screening decisions (AI vs dual human). Differences in mean recall, precision, and overall error rates between AI and human screeners were assessed using the Mann-Whitney U test.
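The split and the three metrics named in the methods can be sketched as follows. This is a minimal illustration, not the study's code: the function names, the random shuffle, and the seed are hypothetical, and two definitions are assumptions that the abstract does not spell out, namely that "overall error rate" means the share of records whose decision disagrees with the reference decision, and that a single reference (final include/exclude) decision exists for each record.

```python
import random


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle records and hold out a fraction as a simulated SLR update.

    Hypothetical helper: the abstract only states an 80%/20% split per SLR.
    Returns (training_set, testing_set).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def screening_metrics(decisions, reference):
    """Compare screener decisions with reference decisions (True = include).

    Assumed definitions:
      recall     = TP / (TP + FN)
      precision  = TP / (TP + FP)
      error rate = (FP + FN) / all records
    """
    pairs = list(zip(decisions, reference))
    tp = sum(1 for d, r in pairs if d and r)
    fp = sum(1 for d, r in pairs if d and not r)
    fn = sum(1 for d, r in pairs if not d and r)
    nan = float("nan")
    return {
        "recall": tp / (tp + fn) if (tp + fn) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "error_rate": (fp + fn) / len(pairs) if pairs else nan,
    }


# Toy usage with made-up decisions for five records.
train, test = split_records(range(10))
metrics = screening_metrics(
    decisions=[True, True, False, False, True],
    reference=[True, False, False, True, True],
)
```

Under these definitions a screener that includes liberally tends to keep recall high at the cost of precision, because every unnecessary inclusion counts as a false positive.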

RESULTS: Three CES (3,194 records [testing set=640]) and three EB (8,729 records [testing set=1,729]) SLRs yielded comparable mean [SD] recall rates (AI: 0.82 [0.15] vs dual human: 0.75 [0.23]; p=0.59) and overall error rates (AI: 9.9% [5.9%] vs dual human: 7% [8.4%]; p=0.39). However, AI exhibited significantly lower precision rates (0.50 [0.15] vs 0.85 [0.16]; p=0.008). Similar trends were observed when analyses were stratified by SLR topic.
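The p-values above come from a Mann-Whitney U test on small groups of per-SLR values. The sketch below shows how such a comparison could be computed in plain Python; it is an illustration under stated assumptions, not the authors' analysis. It computes the U statistic and a two-sided p-value by enumerating every way of relabeling the pooled values, which matches the exact test when there are no ties. The function names and the per-SLR precision values are hypothetical, since the abstract reports only means and SDs.

```python
from itertools import combinations


def u_statistic(x, y):
    """U for group x: pairs where x beats y, counting ties as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def mann_whitney_exact(x, y):
    """Return (U, two-sided p) by enumerating all relabelings of the pool.

    Feasible only for small groups: 6 vs 6 values gives 924 relabelings.
    """
    pooled = list(x) + list(y)
    n1, n2 = len(x), len(y)
    center = n1 * n2 / 2  # expected U under the null hypothesis
    u_obs = u_statistic(x, y)
    observed_distance = abs(u_obs - center)
    extreme = total = 0
    for idx in combinations(range(n1 + n2), n1):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(n1 + n2) if i not in chosen]
        total += 1
        # Count relabelings at least as far from the center as the data.
        if abs(u_statistic(group_x, group_y) - center) >= observed_distance - 1e-12:
            extreme += 1
    return u_obs, extreme / total


# Made-up per-SLR precision values, for illustration only.
ai_precision = [0.30, 0.42, 0.48, 0.55, 0.60, 0.65]
human_precision = [0.62, 0.75, 0.88, 0.90, 0.95, 1.00]
u_value, p_value = mann_whitney_exact(ai_precision, human_precision)
```

With only six values per group, the test has limited power, so a non-significant difference in recall or error rate is weaker evidence of equivalence than it may first appear.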

CONCLUSIONS: There were no significant differences between the AI and dual human screeners in recall and overall error rates. Dual screening by human reviewers alone was error-prone, at a rate comparable to that of the AI screener. Our study supports AI’s capability to expedite the screening process for SLR updates while emphasizing the need for continued model refinement to address precision limitations and enhance overall model performance.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR22

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
