Is a Large Language Model (LLM) More Accurate Than Human Researchers in Correctly Identifying Diseases in Biomedical Abstracts? A Pilot Study
Author(s)
Edema C, Rutherford L, Martin A, Martin C, Bertuzzi A, King E, Letton W
Crystallise, Stanford-le-Hope, ESS, UK
OBJECTIVES: Artificial intelligence (AI) tools such as large language models (LLMs) are being applied in biomedical research to automate processes and improve efficiency. However, the precision and reliability of AI in research tasks require further investigation. We compared the accuracy and speed of an LLM versus human researchers in correctly identifying diseases in biomedical abstracts.
METHODS: A targeted literature search was conducted to generate a list of 500 biomedical abstracts. Using an Evidence Mapper tool (www.evidencemapper.co.uk), each abstract was indexed separately by researchers and by the LLM against nine predefined disease categories. The OpenAI Python library was used to construct a suitable prompt and obtain the LLM's output. The time taken for each method was recorded. A gold-standard disease category for each abstract, against which both the researcher and LLM indexing were compared, was created independently.
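A minimal sketch of how the OpenAI Python library could be used to index one abstract against predefined disease categories is shown below. This is illustrative only, not the authors' actual code: the category list, model name, and prompt wording are assumptions, since the abstract does not report them.

# Illustrative sketch only; model choice, categories, and prompt are assumed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical disease categories; the study used nine predefined ones.
CATEGORIES = [
    "Cardiovascular disease", "Diabetes mellitus", "Infectious diseases",
    "Oncology", "Respiratory disease", "Neurology", "Mental health",
    "Musculoskeletal disease", "Geriatrics",
]

def index_abstract(abstract_text: str) -> str:
    """Ask the LLM to assign the abstract to one predefined disease category."""
    prompt = (
        "Classify the following biomedical abstract into exactly one of these "
        f"disease categories: {', '.join(CATEGORIES)}.\n\n"
        f"Abstract: {abstract_text}\n\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic output for reproducible indexing
    )
    return response.choices[0].message.content.strip()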
RESULTS: Indexing disease categories in the biomedical abstracts was more than three times faster using the LLM (3 hours) than using researchers (10.4 hours). Overall, indexing by researchers was more accurate, with a mean accuracy of 98% compared with 96% for the LLM. Sensitivity and specificity for the LLM ranged from 16.67% to 100% and from 87.6% to 100%, respectively. For researchers, sensitivity ranged from 0% to 91.4% and specificity from 92.7% to 100%. Mean sensitivity and specificity across the disease categories were 70.7% and 97.2% for the LLM versus 66.6% and 98.8% for the researchers. The LLM was most sensitive at identifying diabetes mellitus and infectious diseases (100%); for researchers, cardiovascular disease had the highest proportion of true positives (91.4%).
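For clarity, the per-category sensitivity and specificity reported above can be computed against the gold standard as in the following sketch. This is an assumed reconstruction of a standard calculation, not the study's code; the function name and data layout are hypothetical.

# Assumed reconstruction of the per-category metric calculation.
from typing import Dict, List

def sensitivity_specificity(gold: List[str], predicted: List[str],
                            category: str) -> Dict[str, float]:
    """Treat `category` as the positive class and all other categories as negative."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == category and p == category)
    fn = sum(1 for g, p in zip(gold, predicted) if g == category and p != category)
    tn = sum(1 for g, p in zip(gold, predicted) if g != category and p != category)
    fp = sum(1 for g, p in zip(gold, predicted) if g != category and p == category)
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }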
CONCLUSIONS: Utilizing LLMs in the evidence synthesis process can save time while maintaining an acceptable degree of accuracy compared with human researchers. Human checking of AI-suggested indexing may be the most cost-effective approach for database indexing and abstract screening in literature reviews.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 6, S1 (June 2024)
Code
PT23
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Geriatrics, No Additional Disease & Conditions/Specialized Treatment Areas