Machine Learning Algorithms for Identifying Undiagnosed Nonalcoholic Steatohepatitis: A Veterans Affairs Health System Study

Author(s)

Onur Baser, MS, PhD¹, Nehir Yapar, MS², Katarzyna Rodchenko, MA, MPH², Munira Mohamed, MPH³, Alexandra Passarelli, MPH², Shuangrui Chen, MS², Erdem Baser, MS, PhD⁴
¹City University of New York, New York, NY, USA, ²Columbia Data Analytics, New York, NY, USA, ³Columbia Data Analytics, Ann Arbor, MI, USA, ⁴Mergen Medical Research, Ankara, Turkey

OBJECTIVES: Nonalcoholic steatohepatitis (NASH) is often undiagnosed in clinical practice despite its increasing prevalence. This analysis aimed to identify patients in the Veterans Affairs (VA) health system who likely had undiagnosed NASH using machine learning algorithms.
METHODS: A retrospective analysis was conducted using the VA dataset of 25 million adult enrollees. The study population was categorized into NASH-positive, non-NASH, and at-risk cohorts. Machine learning models, including logistic regression, naïve Bayes, gradient boosting, and random forest, were developed and compared using receiver operating characteristic (ROC) curves, area under the curve (AUC), and accuracy.
RESULTS: Of 4,223,443 patients meeting inclusion criteria, 4,903 were NASH-positive and 35,528 were non-NASH. The random forest model performed best, achieving an AUC of 83% and accuracy of 90%. This model identified 514,997 patients (12%) from the at-risk cohort as likely to have undiagnosed NASH, approximately 105 times the number of patients initially identified as NASH-positive. Age, obesity, and abnormal liver function test results were the top determinants in assigning NASH probability.
CONCLUSIONS: Machine learning algorithms can effectively identify patients with potential undiagnosed NASH from large at-risk populations using medical claims data. This approach could serve as an initial screening tool to select patients for further diagnostic evaluation and clinical management, potentially improving early detection and treatment of NASH.
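The comparison metrics named in METHODS (AUC and accuracy) can be illustrated with a minimal sketch. This is not the study's code: the labels, the scores, the model names `model_a` and `model_b`, and the 0.5 decision threshold are all made-up assumptions chosen only to show how the two metrics are computed and used to rank candidate models.

```python
from typing import Dict, List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of cases classified correctly when scores at or above
    `threshold` are called positive (the threshold is an assumption)."""
    correct = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return correct / len(labels)


# Hypothetical toy data: 1 = NASH-positive, 0 = non-NASH.
labels: List[int] = [1, 1, 1, 0, 0, 0, 0, 0]
model_scores: Dict[str, List[float]] = {
    "model_a": [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05],
    "model_b": [0.6, 0.3, 0.2, 0.7, 0.5, 0.4, 0.1, 0.05],
}

# Score every candidate model on both metrics, then pick the best by AUC.
results: Dict[str, Tuple[float, float]] = {
    name: (roc_auc(labels, s), accuracy(labels, s))
    for name, s in model_scores.items()
}
best = max(results, key=lambda name: results[name][0])

for name, (auc, acc) in results.items():
    print(f"{name}: AUC={auc:.3f}, accuracy={acc:.3f}")
print("best by AUC:", best)
```

AUC is used here as the ranking criterion because it does not depend on a single threshold. Accuracy does, and it is sensitive to class balance: if the models were evaluated on the labeled cohorts reported above (4,903 NASH-positive and 35,528 non-NASH), a rule that always predicted non-NASH would already reach about 88% accuracy.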

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR89

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Diabetes/Endocrine/Metabolic Disorders (including obesity), STA: Multiple/Other Specialized Treatments

Presentation (CTI)