Implementation and Validation of the Use of N-Gram Models for Classifying Social Determinants of Health Status from Real World Unstructured Clinical Notes
Author(s)
Kumar V1, Mummert A2, Darbeloff T1, Rasouliyan L1, Althoff A1, Black D1, Chang S1, Long S1
1OMNY Health, Atlanta, GA, USA, 2OMNY Health, Sacramento, CA, USA
OBJECTIVES: Social determinants of health (SDoH) are increasingly leveraged in health research. However, availability of structured SDoH data is limited in electronic health records. Our objective was to implement and validate the use of n-gram models using unstructured clinical notes to classify patient risk status across five key SDoH domains.
METHODS: Deidentified clinical notes from three hospital systems (2017-2022) in the OMNY Health real-world data platform were examined for the presence of indicative phrases previously published for four SDoH insecurity domains (economics [EI]; housing [HI]; social isolation [SI]; and transportation [TI]). For a fifth domain (undereducation [ED]), n-grams were composed. To measure precision, 50 random positive occurrences (i.e., hits) from each domain-hospital combination were manually annotated for accuracy. Models and included notes were iteratively refined until model precision met a predefined threshold of 80%. After finalization, patient counts and percentages were calculated. Recall was estimated using the overlap between hits and the presence of corresponding ICD-10 diagnostic codes for each SDoH domain.
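The matching and validation steps described in METHODS can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the phrase lists, the function names, and the exact recall formula (patients with both a hit and a corresponding ICD-10 code, divided by all patients with that code) are hypothetical, since the abstract does not specify them.

```python
import re
from collections import defaultdict

# Hypothetical phrase lists for illustration only; the published
# n-grams used in the study are not reproduced here.
DOMAIN_PHRASES = {
    "HI": ["homeless", "lives in a shelter"],
    "TI": ["no transportation", "lack of transportation"],
}

def compile_patterns(domain_phrases):
    """Build one case-insensitive, word-bounded regex per domain."""
    patterns = {}
    for domain, phrases in domain_phrases.items():
        # Longest phrases first so longer n-grams are preferred in the alternation.
        ordered = sorted(phrases, key=len, reverse=True)
        alternation = "|".join(re.escape(p) for p in ordered)
        patterns[domain] = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return patterns

def find_hits(notes, patterns):
    """notes: iterable of (patient_id, note_text).
    Returns a mapping of domain -> set of patient ids with at least one hit."""
    hits = defaultdict(set)
    for patient_id, text in notes:
        for domain, pattern in patterns.items():
            if pattern.search(text):
                hits[domain].add(patient_id)
    return hits

def precision(annotations):
    """annotations: list of booleans from manual review of sampled hits
    (True means the reviewer judged the hit to be correct)."""
    return sum(annotations) / len(annotations)

def estimated_recall(hit_patients, coded_patients):
    """Assumed definition: share of patients carrying a corresponding
    ICD-10 code who also have an n-gram hit."""
    if not coded_patients:
        return float("nan")
    return len(hit_patients & coded_patients) / len(coded_patients)
```

In this sketch, precision would be computed once per domain-hospital combination from the 50 annotated hits, and the phrase lists would be revised and re-run until the 80% threshold is met.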
RESULTS: Clinical notes from 9.34 million patients were included. Overall patient counts and percentages for domain-specific hits were as follows: EI (628K; 6.6%); HI (100K; 1.1%); SI (154K; 1.6%); TI (91K; 1.0%); and ED (9.0K; 0.1%). Precision and recall were as follows: EI (87%, 60%); HI (95%, 52%); SI (82%, 24%); TI (87%, 25%) and ED (90%, 7%).
CONCLUSIONS: We found that 5-10% of patients had positive hits in an SDoH domain when using unstructured clinical notes and n-grams. Drawbacks relative to transformer-based techniques may include lower precision/recall ceilings, due to factors including false-positive hits from negated n-grams and exact term mismatch. Recall may be artificially reduced due to underutilization of available ICD-10 Z codes. This approach offers a method to generate a patient-level indicator of SDoH risk status based on information collected during a clinical encounter that can be leveraged in health research.
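To illustrate how negated n-grams can produce false-positive hits, the sketch below adds a simple negation check in front of a phrase match. This is an assumption-laden example and not part of the study's method: the cue list, the 30-character look-back window, and the function names are all hypothetical choices made for illustration.

```python
import re

# Hypothetical, deliberately small list of negation cues.
NEGATION_CUES = re.compile(r"\b(?:no|not|denies|denied|without)\b", re.IGNORECASE)

def is_negated(text, match_start, window_chars=30):
    """Return True if a negation cue appears shortly before the match,
    within the same sentence fragment."""
    preceding = text[max(0, match_start - window_chars):match_start]
    # Keep only the text after the last sentence-like boundary.
    preceding = re.split(r"[.;\n]", preceding)[-1]
    return bool(NEGATION_CUES.search(preceding))

def non_negated_hit(text, phrase):
    """Return True if the phrase occurs at least once without a preceding negation cue."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        if not is_negated(text, match.start()):
            return True
    return False
```

A plain n-gram match would count "Patient denies being homeless." as a housing hit, whereas this check would discard it. A character window is a crude heuristic and will miss negations that fall outside it or are phrased differently.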
Conference/Value in Health Info
Value in Health, Volume 27, Issue 6, S1 (June 2024)
Code
MSR82
Topic
Economic Evaluation, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Distributed Data & Research Networks, Electronic Medical & Health Records, Novel & Social Elements of Value
Disease
No Additional Disease & Conditions/Specialized Treatment Areas