Assessing Efficiencies of AI-Powered Tools in Reporting Health Economics and Outcomes Research (HEOR) Data for Health Technology Assessment (HTA) Submissions
Author(s)
Pandey P1, J S1, Bergemann R2, Pandey R3
1Parexel International, Bengaluru, Karnataka, India, 2Parexel International, Basel, Switzerland, 3Parexel International, Bengaluru, Karnataka, India
OBJECTIVES: To evaluate the quality of various AI-powered tools in summarizing HEOR data and assisting report writing intended for HTA submissions.
METHODS: Three tools (Co-Pilot, Parexel-GPT, ChatGPT-4-Turbo) were supplied with sample tables containing clinical efficacy and epidemiology data extracted from multiple studies and were prompted to compose narrative summaries from the provided tables. Prompts providing clear, unambiguous background information were refined iteratively until the desired outputs were obtained.
A questionnaire was developed using concepts elicited from three subject matter experts, who also assessed its validity, test-retest reliability, and internal consistency. The questionnaire comprised 12 items across four domains (D1: clarity and conciseness [sub-total=15], D2: scientific accuracy [sub-total=10], D3: evidence and impact [sub-total=10], D4: scientific writing expertise [sub-total=25]); each item was rated on a Likert scale (total score=60). Three reviewers used this questionnaire to evaluate the quality of the AI outputs and were blinded to each other's ratings and to the source tool. This abstract presents preliminary results from an ongoing study, set to be completed by August 2024.
RESULTS: Preliminary results revealed that Co-Pilot underperformed across all four domains (median [range]: D1: 7.5 [6, 8]; D2: 2 [2, 5]; D3: 2.5 [2, 5]; D4: 10.5 [8, 12]) compared with Parexel-GPT (median [range]: D1: 9.5 [6, 10]; D2: 5.5 [2, 8]; D3: 6.0 [2, 7]; D4: 16.5 [10, 18]) and ChatGPT-4-Turbo (median [range]: D1: 10.5 [7, 12]; D2: 6.5 [2, 8]; D3: 7 [2, 8]; D4: 16.5 [10, 19]). The overall scores (median [range]) for Co-Pilot, Parexel-GPT, and ChatGPT-4-Turbo were 30.5 [18, 63], 43 [31, 89], and 49 [34, 94], respectively.
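The per-domain summary statistics (median [range]) reported above can be computed with a short script. The reviewer ratings below are illustrative placeholders, not the study's actual data, and the domain labels mirror those defined in the METHODS.

```python
from statistics import median

# Hypothetical ratings from three blinded reviewers for one tool;
# placeholders only, NOT the study's actual scores.
scores = {
    "D1 (clarity and conciseness, max 15)": [6, 8, 8],
    "D2 (scientific accuracy, max 10)": [2, 5, 2],
    "D3 (evidence and impact, max 10)": [2, 5, 3],
    "D4 (scientific writing expertise, max 25)": [8, 12, 11],
}

def summarize(ratings):
    """Return the median, minimum, and maximum of a list of ratings,
    matching the abstract's 'median [min, max]' presentation."""
    return median(ratings), min(ratings), max(ratings)

for domain, ratings in scores.items():
    m, lo, hi = summarize(ratings)
    print(f"{domain}: {m} [{lo}, {hi}]")
```

With three reviewers per tool, the median equals the middle rating after sorting, so these summaries are robust to a single outlying reviewer.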
CONCLUSIONS: AI is advancing rapidly and has demonstrated its ability to augment writing efforts. While the potential benefits of these tools for faster, higher-quality content generation are significant, human input is still required to appropriately calibrate prompts and validate the results.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
HTA59
Topic
Clinical Outcomes, Economic Evaluation, Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Clinical Outcomes Assessment, Value of Information
Disease
No Additional Disease & Conditions/Specialized Treatment Areas