A MACHINE LEARNING MODEL FOR CANCER BIOMARKER IDENTIFICATION IN ELECTRONIC HEALTH RECORDS
Author(s)
Ambwani G, Cohen A, Estévez M, Singh N, Adamson B, Nussbaum NC, Birnbaum B
Flatiron Health, New York, NY, USA
OBJECTIVES Identifying biomarker-defined patient cohorts in electronic health record (EHR) data is important for facilitating real-world outcomes research in precision oncology. Human abstraction is currently needed to find such cohorts because biomarker results are captured in unstructured fields and require interpretation. We aimed to develop a machine learning (ML) classification algorithm that predicts a patient's biomarker status and thereby reduces the volume of manual abstraction.
METHODS Patient records and standard-of-care biomarkers from four diseases in the Flatiron Health EHR-derived database were used for model training and testing: metastatic colorectal cancer (mCRC: KRAS, NRAS, BRAF, MSI), metastatic breast cancer (mBreast: ER, PR, HER2), advanced melanoma (aMel: BRAF, KIT, NRAS, PDL1), and advanced non-small cell lung cancer (aNSCLC: EGFR, ALK, ROS1, KRAS, PDL1). Using abstracted biomarker status as labeled data, we trained a regularized logistic regression model on a normalized term frequency vector derived from patient records. The model identifies patients likely to have a positive biomarker, and those patients are subsequently sent for confirmatory chart abstraction. Sensitivity and abstraction savings (defined as the percentage of patient charts not requiring review) were computed.
RESULTS We randomly selected 18,100 patients (3,291 mCRC, 2,409 mBreast, 1,329 aMel, 11,071 aNSCLC). The median (IQR) recorded biomarker-positive patient proportion across all disease-biomarker pairs was 4.5% (2.0%-23.8%). There were 4,525 patients in the training set and 13,575 in the test set. Across disease-biomarker pairs, the median (IQR) sensitivity was 97.3% (91.9%-99.6%), and the median (IQR) abstraction savings was 64.2% (23.6%-78.8%).
CONCLUSIONS This ML classification model is highly sensitive, permitting more efficient identification of patients' treatment-relevant biomarkers in EHR data. It enables a scalable method for creating biomarker-defined cohorts, reduces the need for costly human chart abstraction, and improves our ability to study real-world outcomes in precision oncology.
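The workflow described in the methods (term frequency features, a regularized logistic regression, and a triage step that sends only flagged charts to abstraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the normalization scheme, the type of regularization, the vocabulary, or the decision threshold, so the sum-to-one normalization, the L2 penalty, the threshold of 0.5, the four-word vocabulary, the toy records, and all function names below are assumptions chosen for illustration.

```python
import math
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count vocabulary terms in a record and scale the counts to sum to 1.
    The normalization scheme is an assumption; the abstract does not specify one."""
    counts = Counter(text.lower().split())
    total = sum(counts[term] for term in vocabulary)
    return [counts[term] / total if total else 0.0 for term in vocabulary]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(x, weights, bias):
    """Predicted probability that the biomarker is positive."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train_logistic_l2(X, y, l2=0.01, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent.
    An L2 penalty on the weights is assumed as the form of regularization."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = score(x, weights, bias) - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        # Average log-loss gradient plus the gradient of the L2 penalty.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def triage_metrics(scores, labels, threshold):
    """Sensitivity among true positives, and abstraction savings defined as
    the fraction of charts NOT flagged for confirmatory review."""
    flagged = [s >= threshold for s in scores]
    true_pos = sum(1 for f, label in zip(flagged, labels) if f and label == 1)
    positives = sum(labels)
    sensitivity = true_pos / positives if positives else float("nan")
    savings = (len(scores) - sum(flagged)) / len(scores)
    return sensitivity, savings

# Hypothetical toy records and vocabulary, not data from the study.
VOCAB = ["mutation", "detected", "wildtype", "negative"]
records = [
    "kras mutation detected",
    "mutation detected mutation",
    "kras wildtype negative",
    "negative negative wildtype",
]
labels = [1, 1, 0, 0]  # stands in for abstracted biomarker status

X = [term_frequency_vector(r, VOCAB) for r in records]
weights, bias = train_logistic_l2(X, labels)

# For brevity the toy example scores its own training records; the study
# evaluated on a separate test set.
scores = [score(x, weights, bias) for x in X]
sensitivity, savings = triage_metrics(scores, labels, threshold=0.5)
print(f"sensitivity={sensitivity:.2f}, abstraction savings={savings:.2f}")
```

In a sketch like this, the threshold controls the trade-off the abstract reports: lowering it flags more charts, which can only raise sensitivity and can only lower abstraction savings.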
Conference/Value in Health Info
2019-05, ISPOR 2019, New Orleans, LA, USA
Value in Health, Volume 22, Issue S1 (2019 May)
Code
PPM8
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Personalized and Precision Medicine