Time-Dependent Profiling of Distinct Stages Prior to Breast Cancer Onset Using Free-Text Diagnosis Names
Author(s)
Lorenzo R1, Holmes B1, Green F2, Loving J1
1Syapse, San Francisco, CA, USA, 2Syapse, San Diego, CA, USA
OBJECTIVES: Early detection of breast cancer (BC) is crucial to patient outcomes, so modeling the patient journey prior to BC diagnosis is an important task. Patient diagnoses are often available as free text and are difficult to represent for predictive analytics. We introduce the use of sentence transformers, paired with a novel association step based on unsupervised clustering, to yield highly relevant patient journey representations.
METHODS: We generated a vocabulary of 9,915 diagnoses from patient visits occurring up to one year before a BC diagnosis, inclusive of the BC diagnosis visit. We used the Biomed-Roberta sentence transformer to vectorize these diagnoses. We clustered the resulting vectors, using silhouette scoring to select the optimal number of clusters, and computed the cluster centroids. The centroids were then clustered again to group similar concepts into clinically relevant categories.
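The cluster-number selection described above could look roughly like the following sketch. It is not the authors' code: the abstract does not name the clustering algorithm, so k-means with farthest-first initialization is assumed here, the 2-D points are toy stand-ins for Biomed-Roberta diagnosis embeddings, and the names `kmeans`, `silhouette`, and `toy_vectors` are hypothetical.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the old centroid for an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for name, other in clusters.items() if name != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: three tight 3x3 grids of points standing in for diagnosis embeddings.
offsets = (-0.5, 0.0, 0.5)
centers = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
toy_vectors = [(cx + dx, cy + dy) for cx, cy in centers for dx in offsets for dy in offsets]

# Score each candidate cluster count and keep the one with the highest silhouette.
scores = {}
for k in range(2, 7):
    labels, _ = kmeans(toy_vectors, k)
    scores[k] = silhouette(toy_vectors, labels)
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

In practice a library such as scikit-learn (`KMeans`, `silhouette_score`) would normally replace the hand-written helpers, and the second clustering pass would be run on the centroids returned by the first.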
Patients were selected at either 6 months or 3 weeks before BC diagnosis by randomized, equally weighted patient assignment. Diagnoses up to a year prior were vectorized. We trained an XGBoost model on these vectors to classify the two groups (75/25 train/test split).
RESULTS: Expert review established cluster quality and confirmed that all breast cancer diagnoses fell in a single cluster. In the BC diagnosis cluster, all members were breast-related, and 228/237 were breast cancers; the non-BC members were breast deformities or genetic susceptibility to BC. The maximum silhouette score was 0.87. XGBoost classified 23,521 patients as 6-month or 3-week with an accuracy of 75% and an F-score of 0.73. Clusters relevant to BC diagnosis included limb pain and nausea.
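For reference, the accuracy and F-score of a two-class classifier such as the 6-month versus 3-week model can be computed as in the sketch below. The abstract does not state which F-score variant was used or which class was treated as positive, so binary F1 with label 1 as the positive class is an assumption, the function name `accuracy_and_f1` is hypothetical, and the labels are toy values rather than study data.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy held-out labels (1 = one time-point group, 0 = the other); not study data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```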
CONCLUSIONS: We showed a signal separating patients at critical time points prior to BC diagnosis. This signal was found using the relative positions of patient diagnoses in vector space, demonstrating that valuable insights into patient status and progress can be found using unsupervised clustering. This work, while early, establishes a technique that we are developing toward early-prediction capabilities.
Conference/Value in Health Info
Value in Health, Volume 26, Issue 6, S2 (June 2023)
Code
RWD14
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records
Disease
No Additional Disease & Conditions/Specialized Treatment Areas