Improve Analyst Accuracy in Systematic Literature Reviews Using Reliant Tabular and LLM-Based Relevance Scoring

Author(s)

Christoph R Schlegel, PhD, Sam Work, MSc, Marc G Bellemare, PhD;
Reliant AI Inc., Montréal, QC, Canada
OBJECTIVES: LLMs have demonstrated considerable potential to accelerate evidence generation and systematic literature reviews (SLRs). This study assessed the effectiveness of the popular chain-of-thought prompting technique for performing SLRs, compared with human performance.
METHODS: We fine-tuned the Llama 3.1 70B open-source LLM on a synthetic dataset of 5100 points tailored to emphasize conciseness and minimize hallucinations. We embedded this LLM into a data agent that combines retrieval-augmented generation and a proprietary algorithm to produce a relevance score through a series of prompts. The agent was tasked with assessing the relevance of a given document to a textual query.
We used our data agent to replicate Kerr et al.'s SLR workflow and compare the performance of an off-the-shelf GPT-4 agent with our own. In our experiment, an initial 982 abstracts were obtained from PubMed using queries from Kerr et al.'s analysis, and 50 abstracts were identified as "positive", meaning that they were relevant to the SLR.
RESULTS: From the full set of abstracts, our data agent selected 89 for further review. All 50 positive abstracts were among them, yielding an effective recall of 100%. This compares favourably with Kerr et al.'s GPT-4 results (87% recall). A single junior analyst, assessed on the same task, achieved 92% recall.
In collaborative human-AI work, false positives are easily filtered out. False negatives are far more costly, as the human analyst must return to the sources to look for the missing information. A comparative evaluation found the two systems' false positive rates to be comparable (9.1%), demonstrating the substantial value provided by our data agent.
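The headline recall and precision figures follow directly from the reported counts (982 abstracts screened, 50 true positives, 89 selected, all positives captured). A few lines suffice to reproduce them; the 9.1% false positive rate is the study's own comparative metric and is not recomputed here.

```python
# Sanity check of the reported screening counts.
def screening_metrics(n_positives: int, n_selected: int, true_positives: int):
    """Recall, precision, and false-positive count for one screening pass."""
    recall = true_positives / n_positives          # share of positives found
    precision = true_positives / n_selected        # share of selections correct
    false_positives = n_selected - true_positives  # extra abstracts to filter
    return recall, precision, false_positives


recall, precision, fp = screening_metrics(n_positives=50,
                                          n_selected=89,
                                          true_positives=50)
# recall = 1.0 (100%), fp = 39, precision = 50/89 ≈ 0.56
```

The asymmetry discussed above is visible here: the 39 false positives are cheap to discard by hand, whereas any false negative would force a return to the full 982-abstract pool.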
CONCLUSIONS: In this study, we demonstrated that fine-tuning and advanced agent engineering can significantly improve the performance of general-purpose LLMs, paving the way for trustworthy AI tools that researchers and analysts can depend on.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR37

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology
