Assessing Efficiencies of AI-Powered Tools in Reporting Health Economics and Outcomes Research (HEOR) Data for Health Technology Assessment (HTA) Submissions
Author(s)
Pandey P1, J S1, Bergemann R2, Pandey R3
1Parexel International, Bengaluru, Karnataka, India, 2Parexel International, Basel, Switzerland, 3Parexel International, Bengaluru, Karnataka, India
OBJECTIVES: To evaluate the quality of various AI-powered tools in summarizing HEOR data and assisting report writing intended for HTA submissions.
METHODS: Three tools (Co-Pilot, Parexel-GPT, ChatGPT-4-Turbo) were supplied with sample tables containing clinical efficacy and epidemiology data extracted from multiple studies and were prompted to compose narrative summaries from the provided tables. Prompts providing clear, unambiguous background information were refined iteratively until the desired outputs were obtained.
A questionnaire was developed using concepts elicited from three subject matter experts, who also assessed its validity, test-retest reliability, and internal consistency. The questionnaire comprised 12 items across four domains (D1: clarity and conciseness [sub-total=15], D2: scientific accuracy [sub-total=10], D3: evidence and impact [sub-total=10], D4: scientific writing expertise [sub-total=25]); each item was rated on a Likert scale (total score=60). Three reviewers used this questionnaire to evaluate the quality of the AI outputs and were blinded to each other's ratings and to the source tool. This abstract presents preliminary results from an ongoing study, set to be completed by August 2024.
RESULTS: Preliminary results revealed that Co-Pilot underperformed across all four domains (median [range]: D1: 7.5 [6, 8]; D2: 2 [2, 5]; D3: 2.5 [2, 5]; D4: 10.5 [8, 12]) compared with Parexel-GPT (median [range]: D1: 9.5 [6, 10]; D2: 5.5 [2, 8]; D3: 6.0 [2, 7]; D4: 16.5 [10, 18]) and ChatGPT-4-Turbo (median [range]: D1: 10.5 [7, 12]; D2: 6.5 [2, 8]; D3: 7 [2, 8]; D4: 16.5 [10, 19]). The overall scores (median [range]) for Co-Pilot, Parexel-GPT, and ChatGPT-4-Turbo were 30.5 [18, 63], 43 [31, 89], and 49 [34, 94], respectively.
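The per-domain summary statistics (median [range]) reported above can be computed with a short script. The reviewer ratings below are illustrative placeholders, not the study's actual data, and the domain labels mirror those defined in the METHODS.

```python
from statistics import median

# Hypothetical ratings from three blinded reviewers for one tool;
# placeholders only, NOT the study's actual scores.
scores = {
    "D1 (clarity and conciseness, max 15)": [6, 8, 8],
    "D2 (scientific accuracy, max 10)": [2, 5, 2],
    "D3 (evidence and impact, max 10)": [2, 5, 3],
    "D4 (scientific writing expertise, max 25)": [8, 12, 11],
}

def summarize(ratings):
    """Return the median, minimum, and maximum of a list of ratings,
    matching the abstract's 'median [min, max]' presentation."""
    return median(ratings), min(ratings), max(ratings)

for domain, ratings in scores.items():
    m, lo, hi = summarize(ratings)
    print(f"{domain}: {m} [{lo}, {hi}]")
```

With three reviewers per tool, the median equals the middle rating after sorting, so these summaries are robust to a single outlying reviewer.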
CONCLUSIONS: AI is advancing rapidly and has demonstrated its ability to augment writing efforts. While the potential benefits of these tools for faster, higher-quality content generation are significant, human input is still required to appropriately calibrate prompts and validate the results.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
HTA59
Topic
Clinical Outcomes, Economic Evaluation, Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Clinical Outcomes Assessment, Value of Information
Disease
No Additional Disease & Conditions/Specialized Treatment Areas