Improving Access to German Health Claims Data Through Synthetic Data Generation: A Methodological Comparison of Different Approaches

Speaker(s)

Heidler T1, Schultze M2, Kafatos G3, Behera B4, Lienau C5, Unger A5, Balko V5, Brandenburg J5, Wang Z6, Großer P7, Hilbert A8, Kossack N1, Pignot M2
1WIG2 GmbH, Leipzig, SN, Germany, 2ZEG - Berlin GmbH, Berlin, BE, Germany, 3Amgen Ltd., Uxbridge, UK, 4Amgen Research (Munich) GmbH, Munich, BY, Germany, 5AstraZeneca GmbH, Hamburg, HH, Germany, 6Medicines and Healthcare products Regulatory Agency, London, G, UK, 7Limebit GmbH, Berlin, BE, Germany, 8ai4medicine, Berlin, BE, Germany

OBJECTIVES: In Germany, privacy concerns limit access to health claims data, hindering comprehensive health research. Synthetic data, designed to preserve privacy without compromising data quality, presents a promising solution. However, the literature lacks a thorough evaluation of synthetic data generation methods for health claims data, particularly German data. This study aims to evaluate different methodologies for generating longitudinal relational synthetic databases from health claims data, using systemic lupus erythematosus (SLE) as a test case because of its clinical complexity and data richness. It further aims to establish a framework for evaluating their quality and privacy compliance and for comparing their practical applications and scalability for larger, heterogeneous populations.

METHODS: We are developing a comprehensive framework to evaluate synthetic data generation methods, including Generative Adversarial Networks (GANs), Adversarial Random Forests (ARFs), and Bayesian Networks (BNs). These models will be trained on health claims data from the WIG2 benchmark database, encompassing patients diagnosed with SLE between 2014 and 2021. The generated data will be assessed for privacy, fidelity, scalability, and utility in terms of generating real-world evidence (RWE).
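
The abstract does not name the specific metrics used in the framework. As an illustration only, the sketch below shows two measures commonly used for this kind of assessment: total variation distance between real and synthetic marginals as a fidelity check, and the share of synthetic rows that exactly copy a real row as a crude privacy proxy. The function names, column layout, and toy records are hypothetical and are not taken from the study.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """TVD between the empirical distributions of one categorical column.

    0.0 means identical marginals, 1.0 means no shared categories.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def exact_match_share(real_rows, synthetic_rows):
    """Share of synthetic rows identical to some real row (crude privacy proxy)."""
    real_set = set(real_rows)
    return sum(row in real_set for row in synthetic_rows) / len(synthetic_rows)


# Hypothetical toy data: ICD-10 diagnosis codes (M32.x covers SLE).
real_codes = ["M32.1", "M32.1", "M32.8", "M32.9"]
synth_codes = ["M32.1", "M32.8", "M32.8", "M32.9"]
tvd = total_variation_distance(real_codes, synth_codes)

# Hypothetical toy rows: (sex, age, diagnosis code).
real_rows = [("F", 34, "M32.1"), ("F", 51, "M32.8")]
synth_rows = [("F", 34, "M32.1"), ("M", 47, "M32.9")]
copied = exact_match_share(real_rows, synth_rows)

print(f"fidelity (TVD, lower is better): {tvd:.2f}")
print(f"privacy (exact-copy share, lower is better): {copied:.2f}")
```

A lower TVD indicates closer marginal fidelity, while a lower exact-copy share indicates less direct leakage of real records. Pushing one toward zero can worsen the other, which is the kind of trade-off an evaluation framework would need to quantify.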

RESULTS: We have developed a multifaceted evaluation framework to assess key aspects of synthetic health claims data generation. Initial assessments indicate that synthetic data generation can capture certain aspects of the claims data satisfactorily but becomes particularly complex when replicating a relational database structure. This holistic framework can assess and quantify the trade-offs between scalability, privacy, fidelity, and utility.
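
To make the relational difficulty concrete, the following sketch assumes a hypothetical two-table layout (a patient table and a claims table linked by `patient_id`). The actual schema of the WIG2 benchmark database is not described in the abstract. It shows two checks that a flat, single-table generator does not satisfy automatically: referential integrity between tables and the distribution of claims per patient.

```python
from collections import Counter


def orphan_claims(patients, claims):
    """Return claims whose patient_id has no matching row in the patient table."""
    patient_ids = {p["patient_id"] for p in patients}
    return [c for c in claims if c["patient_id"] not in patient_ids]


def claims_per_patient(patients, claims):
    """Sorted number of claims per patient, counting patients with zero claims."""
    counts = Counter(c["patient_id"] for c in claims)
    return sorted(counts.get(p["patient_id"], 0) for p in patients)


# Hypothetical synthetic tables; all identifiers and codes are illustrative.
synth_patients = [{"patient_id": 1}, {"patient_id": 2}]
synth_claims = [
    {"claim_id": 10, "patient_id": 1, "icd10": "M32.1"},
    {"claim_id": 11, "patient_id": 1, "icd10": "M32.8"},
    {"claim_id": 12, "patient_id": 3, "icd10": "M32.9"},  # no such patient
]

orphans = orphan_claims(synth_patients, synth_claims)
cardinalities = claims_per_patient(synth_patients, synth_claims)

print("orphan claim ids:", [c["claim_id"] for c in orphans])
print("claims per patient:", cardinalities)
```

In practice, the claims-per-patient distribution of the synthetic database would be compared against that of the real database, in addition to confirming that no orphan records exist.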

CONCLUSIONS: High-quality, privacy-compliant synthetic data can promote data access and innovation, particularly for complex diseases like SLE. Establishing a comprehensive evaluation framework for synthetic data is a critical step towards democratizing access to German health claims data. Our ongoing efforts aim to balance privacy with utility, ensuring scalable and practical applications for broader health research.

Code

RWD196

Topic

Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)