Machine Learning Algorithms for Identifying Undiagnosed Nonalcoholic Steatohepatitis: A Veterans Affairs Health System Study
Author(s)
Onur Baser, MS, PhD1, Nehir Yapar, MS2, Katarzyna Rodchenko, MA, MPH2, Munira Mohamed, MPH3, Alexandra Passarelli, MPH2, Shuangrui Chen, MS2, Erdem Baser, MS, PhD4
1City University of New York, New York, NY, USA, 2Columbia Data Analytics, New York, NY, USA, 3Columbia Data Analytics, Ann Arbor, MI, USA, 4Mergen Medical Research, Ankara, Turkey
OBJECTIVES: Nonalcoholic steatohepatitis (NASH) is often undiagnosed in clinical practice despite its increasing prevalence. This analysis used machine learning algorithms to identify patients in the Veterans Affairs (VA) health system who likely had undiagnosed NASH.
METHODS: A retrospective analysis was conducted using the VA dataset of 25 million adult enrollees. The study population was divided into NASH-positive, non-NASH, and at-risk cohorts. Machine learning models, including logistic regression, naïve Bayes, gradient boosting, and random forest, were developed and compared using receiver operating characteristic (ROC) curves, area under the curve (AUC), and accuracy.
RESULTS: Of 4,223,443 patients meeting inclusion criteria, 4,903 were NASH-positive and 35,528 were non-NASH. The random forest model performed best, achieving an AUC of 83% and accuracy of 90%. This model identified 514,997 patients (12%) from the at-risk cohort as likely to have undiagnosed NASH, approximately 125 times higher than the number of patients initially identified as NASH-positive. Age, obesity, and abnormal liver function test results were the top determinants in assigning NASH probability.
CONCLUSIONS: Machine learning algorithms can effectively identify patients with potential undiagnosed NASH from large at-risk populations using medical claims data. This approach could serve as an initial screening tool to select patients for further diagnostic evaluation and clinical management, potentially improving early detection and treatment of NASH.
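The model comparison described in METHODS ranks candidate classifiers by AUC and accuracy. The sketch below illustrates how those two metrics can be computed and used to pick a model. It is a minimal illustration only, not the authors' pipeline: the labels, scores, model names (`model_a`, `model_b`), and the 0.5 threshold are all hypothetical toy values, and the AUC is computed with the pairwise (Mann-Whitney) definition rather than any particular library.

```python
from typing import Dict, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Share of cases classified correctly when scores at or above
    `threshold` are called positive (threshold is an assumed value)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: 1 = NASH-positive, 0 = non-NASH.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
model_scores: Dict[str, Sequence[float]] = {
    "model_a": [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05],
    "model_b": [0.7, 0.3, 0.2, 0.8, 0.6, 0.4, 0.1, 0.05],
}

# Evaluate every candidate on both metrics, then select by AUC.
results: Dict[str, Tuple[float, float]] = {
    name: (roc_auc(labels, s), accuracy(labels, s))
    for name, s in model_scores.items()
}
best = max(results, key=lambda name: results[name][0])

for name, (auc, acc) in results.items():
    print(f"{name}: AUC={auc:.3f} accuracy={acc:.3f}")
print("selected:", best)
```

Reporting AUC alongside accuracy matters when classes are imbalanced, as with the 4,903 NASH-positive and 35,528 non-NASH patients here: accuracy depends on the chosen threshold and the class mix, whereas AUC summarizes ranking quality across all thresholds.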
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR89
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Diabetes/Endocrine/Metabolic Disorders (including obesity), STA: Multiple/Other Specialized Treatments