Assessing Recall in Abstract Screening: Artificial Intelligence Vs. Human Reviewers

Author(s)

Thurnham J1, Kallmes K2, Holub KJ2
1Nested Knowledge, London, LON, UK, 2Nested Knowledge, St. Paul, MN, USA

OBJECTIVES: Screening for relevant records is a necessary but time-intensive task in the systematic literature review (SLR) process. To reduce human labor input, Artificial Intelligence (AI) has been proposed as a partial or total replacement for human screeners, but concerns exist about the accuracy of AI screening. The most important concern is whether AI tools have lower recall, meaning they would miss more relevant records than human reviewers, leading to incomplete evidence in SLRs. Here, we assess the performance of Nested Knowledge Robot Screener, an AI tool for inclusion/advancement prediction, by comparing the recall and precision of human reviewers with those of Robot Screener in SLRs that employed this AI.

METHODS: Clinical, economic, and mental health SLRs that employed Robot Screener with at least 50 abstract-level human screening decisions in the AutoLit software were included. Human and Robot Screener abstract-level advancement decisions were compared against final, adjudicated advancement decisions to determine recall.

RESULTS: Nineteen SLRs with 8,927 final advanced records were assessed. Human reviewers correctly advanced 8,097/8,580 records, with a recall of 94.4% and a precision of 86.4%. Robot Screener correctly advanced 5,791/5,965 records, with a recall of 97.1% and a precision of 47.3%. In a two-sided chi-squared analysis, Robot Screener's recall was significantly higher than that of human reviewers (p<0.001) and its precision was significantly lower (p<0.001).
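The recall figures and the significance of the recall difference can be reproduced from the counts reported above. The following is a minimal sketch using a standard Pearson chi-squared test on a 2x2 contingency table (correctly advanced vs. missed, for humans vs. Robot Screener); the abstract does not specify the exact implementation or whether a continuity correction was applied, so this is an illustrative recomputation, not the authors' analysis code.

```python
# Recompute recall and the 2x2 chi-squared statistic from the
# counts reported in the abstract (8,097/8,580 human; 5,791/5,965 robot).

def recall(tp: int, total: int) -> float:
    """Fraction of truly relevant records that were correctly advanced."""
    return tp / total

human_tp, human_total = 8097, 8580
robot_tp, robot_total = 5791, 5965

print(f"Human recall: {recall(human_tp, human_total):.1%}")  # ~94.4%
print(f"Robot recall: {recall(robot_tp, robot_total):.1%}")  # ~97.1%

def chi2_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic (no continuity correction)
    for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Rows: human vs. robot; columns: correctly advanced vs. missed.
stat = chi2_2x2(human_tp, human_total - human_tp,
                robot_tp, robot_total - robot_tp)

# The critical value of chi-squared with 1 df at p = 0.001 is ~10.83,
# so stat > 10.83 implies p < 0.001.
print(f"chi2 = {stat:.1f}")
```

Precision is not recomputable from the counts given here, since it also depends on the total number of records each screener advanced, which the abstract does not report.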

CONCLUSIONS: Robot Screener had higher recall and lower precision than human abstract screeners. These findings suggest that Robot Screener may be appropriate as an assistive tool to save time in the SLR screening process without sacrificing comprehensiveness. Limitations include the possibility that the selection of SLRs analyzed is not generalizable and the differing numbers of records screened by humans vs. AI. Further research is necessary to assess the potential time savings of integrating AI screening tools and to characterize the precision/recall tradeoff.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR91

Topic

Health Technology Assessment, Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis, Systems & Structure

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
