Assessing Recall in Abstract Screening: Artificial Intelligence Vs. Human Reviewers
Author(s)
Thurnham J1, Kallmes K2, Holub KJ2
1Nested Knowledge, London, LON, UK, 2Nested Knowledge, St. Paul, MN, USA
OBJECTIVES: Screening for relevant records is a necessary but time-intensive task in the systematic literature review (SLR) process. To reduce human labor input, Artificial Intelligence (AI) has been proposed as a partial or total replacement for human screeners, but concerns exist about the accuracy of AI screening. The most important concern is whether AI tools have lower recall, meaning they would miss more relevant records than human reviewers, leading to incomplete evidence in SLRs. Here, we assess the performance of Nested Knowledge Robot Screener, an AI for inclusion/advancement prediction, by comparing the recall and precision of human reviewers against Robot Screener in SLRs that employed this AI.
METHODS: Clinical, economic, and mental health SLRs that employed Robot Screener and had at least 50 abstract-level human screening decisions in the AutoLit software were included. Human and Robot Screener abstract-level advancement decisions were compared against final, adjudicated advancement decisions to determine recall and precision.
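The comparison described in the methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the dictionary layout, and the five toy records are all hypothetical, and it assumes each screener is scored only on the records it actually screened.

```python
def recall_and_precision(screener_decisions, final_decisions):
    """Score one screener's advance/exclude calls against adjudicated decisions.

    Both arguments map a record ID to True (advanced) or False (excluded).
    Only records present in screener_decisions are scored (an assumption).
    """
    tp = fp = fn = 0
    for record_id, advanced in screener_decisions.items():
        truly_advanced = final_decisions[record_id]
        if advanced and truly_advanced:
            tp += 1  # correctly advanced
        elif advanced and not truly_advanced:
            fp += 1  # advanced, but excluded after adjudication
        elif truly_advanced:
            fn += 1  # missed a record that was ultimately advanced
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return recall, precision


# Hypothetical toy data: five records with adjudicated outcomes.
final = {1: True, 2: True, 3: False, 4: False, 5: True}
screener = {1: True, 2: False, 3: True, 4: False, 5: True}

recall, precision = recall_and_precision(screener, final)
print(f"recall={recall:.3f}, precision={precision:.3f}")
```

Recall is the share of finally advanced records that the screener also advanced; precision is the share of the screener's advanced records that survived adjudication.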
RESULTS: Nineteen SLRs with 8,927 final advanced records were assessed. Human reviewers correctly advanced 8,097/8,580 records, with recall of 94.4% and precision of 86.4%. Robot Screener correctly advanced 5,791/5,965 records, with recall of 97.1% and precision of 47.3%. In a two-sided chi-squared analysis, Robot Screener's recall was significantly higher than that of human reviewers (p<0.001) and its precision was significantly lower (p<0.001).
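The reported recall figures and the direction of the significance test can be checked from the counts above. The sketch below uses only the Python standard library and makes assumptions the abstract does not state: it treats the human and Robot Screener decisions as two independent samples and applies a Pearson chi-squared test on a 2x2 table without continuity correction. The function name is hypothetical.

```python
import math


def chi_squared_2x2(hits_a, total_a, hits_b, total_b):
    """Pearson chi-squared test (1 degree of freedom, no continuity correction)
    comparing two proportions. Returns (statistic, two-sided p-value)."""
    a, b = hits_a, total_a - hits_a  # group A: hits, misses
    c, d = hits_b, total_b - hits_b  # group B: hits, misses
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / (
        (a + b) * (c + d) * (a + c) * (b + d)
    )
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts reported in the results: correctly advanced / finally advanced records.
human_recall = 8097 / 8580
robot_recall = 5791 / 5965

chi2, p = chi_squared_2x2(8097, 8580, 5791, 5965)
print(f"human recall = {human_recall:.1%}, robot recall = {robot_recall:.1%}")
print(f"chi-squared = {chi2:.2f}, p < 0.001: {p < 0.001}")
```

The precision comparison cannot be reproduced the same way, because the abstract does not report the number of records each screener advanced in total.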
CONCLUSIONS: Robot Screener had higher recall and lower precision than human abstract screeners. These findings suggest that Robot Screener may be appropriate as an assistive tool to save time in the SLR screening process without sacrificing comprehensiveness. Limitations include that the SLRs analyzed may not be representative of SLRs in general and that humans and AI screened different numbers of records. Further research is necessary to assess the potential time savings from integrating AI screening tools and the precision/recall tradeoff.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 6, S1 (June 2024)
Code
MSR91
Topic
Health Technology Assessment, Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis, Systems & Structure
Disease
No Additional Disease & Conditions/Specialized Treatment Areas