Implementation and Validation of the Use of N-Gram Models for Classifying Social Determinants of Health Status from Real World Unstructured Clinical Notes

Author(s)

Kumar V1, Mummert A2, Darbeloff T1, Rasouliyan L1, Althoff A1, Black D1, Chang S1, Long S1
1OMNY Health, Atlanta, GA, USA, 2OMNY Health, Sacramento, CA, USA

OBJECTIVES: Social determinants of health (SDoH) are increasingly leveraged in health research. However, availability of structured SDoH data is limited in electronic health records. Our objective was to implement and validate the use of n-gram models using unstructured clinical notes to classify patient risk status across five key SDoH domains.

METHODS: Deidentified clinical notes from three hospital systems (2017-2022) in the OMNY Health real-world data platform were examined for the presence of previously published indicative phrases for four SDoH insecurity domains (economics [EI]; housing [HI]; social isolation [SI]; and transportation [TI]). For a fifth domain (undereducation [ED]), n-grams were composed. To measure precision, 50 random positive occurrences (i.e., hits) from each domain-hospital combination were manually annotated for accuracy. Models and included notes were iteratively refined until model precision met a pre-defined threshold (80%). After finalization, patient counts and percentages were calculated. Recall was estimated using the overlap between hits and presence of corresponding ICD-10 diagnostic codes for each SDoH domain.
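
The matching and validation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the phrase lists, function names, and inputs are hypothetical placeholders, matching is assumed to be case-insensitive exact phrase matching on word boundaries, and recall is assumed to mean the share of patients with a domain ICD-10 code who also have an n-gram hit.

```python
import re

# Hypothetical indicative phrases per domain (placeholders, not the published lists).
PHRASES = {
    "HI": ["homeless", "lives in a shelter"],
    "TI": ["no transportation", "lacks transportation"],
}


def compile_domain_patterns(phrases):
    """Build one case-insensitive, word-bounded regex per domain."""
    return {
        domain: re.compile(
            r"\b(?:" + "|".join(re.escape(p) for p in terms) + r")\b",
            re.IGNORECASE,
        )
        for domain, terms in phrases.items()
    }


def find_hits(note, patterns):
    """Return the set of domains with at least one phrase match in the note."""
    return {domain for domain, pat in patterns.items() if pat.search(note)}


def precision(annotations):
    """Share of manually reviewed hits judged correct (list of booleans)."""
    if not annotations:
        raise ValueError("no annotated hits")
    return sum(annotations) / len(annotations)


def estimated_recall(hit_patients, coded_patients):
    """Assumed definition: share of ICD-10-coded patients who also have a hit."""
    if not coded_patients:
        raise ValueError("no coded patients")
    return len(hit_patients & coded_patients) / len(coded_patients)
```

In this sketch, `precision` would be applied to the 50 annotated hits for each domain-hospital combination, and the phrase list would be refined until the result reaches the 80% threshold.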

RESULTS: Clinical notes from 9.34 million patients were included. Overall patient counts and percentages for domain-specific hits were as follows: EI (628K; 6.6%); HI (100K; 1.1%); SI (154K; 1.6%); TI (91K; 1.0%); and ED (9.0K; 0.1%). Precision and recall, respectively, were as follows: EI (87%, 60%); HI (95%, 52%); SI (82%, 24%); TI (87%, 25%); and ED (90%, 7%).

CONCLUSIONS: We found that 5-10% of patients had positive hits in an SDoH domain when using unstructured clinical notes and n-grams. Drawbacks relative to transformer-based techniques may include lower precision/recall ceilings, due to factors including false-positive hits from negated n-grams and exact term mismatch. Recall may be artificially reduced due to underutilization of available ICD-10 Z codes. This approach offers a method to generate a patient-level indicator of SDoH risk status based on information collected during a clinical encounter that can be leveraged in health research.
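
The two failure modes named above, negated n-grams and exact term mismatch, can be shown with a short Python sketch. The phrase, the negation cues, and the 40-character look-back window are hypothetical choices made for illustration; the negation check is a simple rule-based filter and is not part of the method described in this abstract.

```python
import re

# Hypothetical single n-gram for the housing domain.
PHRASE = re.compile(r"\bhomeless\b", re.IGNORECASE)

# Hypothetical negation cues; a cue counts only if no sentence break
# separates it from the matched phrase.
NEGATION = re.compile(
    r"\b(?:not|no|denies|denied|never)\b[^.;]{0,30}$", re.IGNORECASE
)


def naive_hit(note):
    """Plain n-gram matching: any occurrence of the phrase is a hit."""
    return PHRASE.search(note) is not None


def hit_with_negation_check(note):
    """Count a hit only if some occurrence is not preceded by a negation cue."""
    for match in PHRASE.finditer(note):
        preceding = note[max(0, match.start() - 40):match.start()]
        if not NEGATION.search(preceding):
            return True
    return False
```

With these definitions, "Patient denies being homeless." is a hit under `naive_hit` (a false positive) but not under `hit_with_negation_check`, while "History of homelessness." is missed by both because the exact term does not match.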

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR82

Topic

Economic Evaluation, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Distributed Data & Research Networks, Electronic Medical & Health Records, Novel & Social Elements of Value

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
