Is a Large Language Model (LLM) More Accurate Than Human Researchers in Correctly Identifying Diseases in Biomedical Abstracts? A Pilot Study

Author(s)

Edema C, Rutherford L, Martin A, Martin C, Bertuzzi A, King E, Letton W
Crystallise, Stanford-le-Hope, ESS, UK

OBJECTIVES: Artificial intelligence (AI) tools such as large language models (LLMs) are being applied to biomedical research to automate processes and improve efficiency. However, the precision and reliability of AI in research tasks require further investigation. We compared the accuracy and speed of an LLM versus human researchers in correctly identifying diseases in biomedical abstracts.

METHODS: A targeted literature search was conducted to generate a list of 500 biomedical abstracts. Using the Evidence Mapper tool (www.evidencemapper.co.uk), each abstract was indexed separately by researchers and by the LLM against nine predefined disease categories. The OpenAI Python library was used to submit a suitable prompt to the LLM and capture its output. The time taken for each method was recorded. A gold-standard disease categorisation, created independently, served as the reference against which both the researcher and LLM indexing were compared.
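The abstract does not publish the prompt or the full list of nine categories, so the following is only a minimal sketch of how such indexing might be wired up with the OpenAI Python library. The category list, prompt wording, function names, and model choice are all illustrative assumptions, not the study's actual implementation; only three of the nine categories are named in the abstract.

```python
# Illustrative subset of the disease categories; the study used nine
# predefined categories, only some of which are named in the abstract.
CATEGORIES = ["cardiovascular disease", "diabetes mellitus", "infectious diseases"]

def build_prompt(abstract: str) -> str:
    """Build a classification prompt asking the LLM to index one
    biomedical abstract against the predefined disease categories."""
    category_list = "; ".join(CATEGORIES)
    return (
        "Assign the following biomedical abstract to every disease "
        f"category that applies, choosing only from: {category_list}. "
        "Reply with a comma-separated list of matching categories, "
        "or 'none' if no category applies.\n\n"
        f"Abstract:\n{abstract}"
    )

def index_abstract(abstract: str, model: str = "gpt-4o") -> str:
    """Send the prompt via the OpenAI chat completions API and return
    the model's category assignment (requires an API key)."""
    from openai import OpenAI  # imported lazily: optional dependency

    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(abstract)}],
    )
    return response.choices[0].message.content
```

In practice each of the 500 abstracts would be passed through `index_abstract` in a loop and the responses parsed back into category labels for comparison against the gold standard.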

RESULTS: Indexing disease categories in the biomedical abstracts was more than three times faster with the LLM (3 hours) than with researchers (10.4 hours). Overall, indexing by researchers was more accurate, with a mean accuracy of 98% versus 96% for the LLM. Sensitivity for the LLM ranged from 16.67% to 100% and specificity from 87.6% to 100%. For researchers, sensitivity ranged from 0% to 91.4% and specificity from 92.7% to 100%. Mean sensitivity and specificity across the disease categories were 70.7% and 97.2% for the LLM versus 66.6% and 98.8% for the researchers. The LLM was most sensitive at identifying diabetes mellitus and infectious diseases (100%); for researchers, cardiovascular disease had the highest proportion of true positives (91.4%).
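For readers outside epidemiology, the per-category metrics above follow the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch, using hypothetical confusion-matrix counts (the abstract does not report raw counts):

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Compute sensitivity (true-positive rate) and specificity
    (true-negative rate) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Hypothetical counts for one disease category across 500 abstracts.
sens, spec = sensitivity_specificity(tp=32, fn=3, tn=460, fp=5)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")
# → sensitivity=91.4%, specificity=98.9%
```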

CONCLUSIONS: Using LLMs in the evidence synthesis process can save time while achieving an acceptable degree of accuracy compared to humans. Human checking of AI-suggested indexing may be the most cost-effective approach for database indexing and abstract screening for literature reviews.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

PT23

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Geriatrics, No Additional Disease & Conditions/Specialized Treatment Areas
