Sourcing Real-World Data for Research
Doug Foster, MBA; Nitin Karandikar, MS, Advanced Data Sciences, San Francisco, CA, USA
Introduction
Demand for data in healthcare is at an unprecedented level. Pharmaceutical and biotechnology companies, in particular, have increased their focus on data generated outside of clinical trials, known as real-world data (RWD), to strengthen operations. Research estimates that an average large pharmaceutical company can save $300 million by adopting real-world evidence (RWE) analytics across its whole value chain.1 As a result, every major drug company has a department focused on the use of healthcare data across multiple diseases.2 These departments have turned the industry’s initial interest in using healthcare data for operations into an established practice.
Fortunately, healthcare is a data-rich industry. The US healthcare system is estimated to have created a total of 2314 exabytes of data in 2020.3 This translates to approximately 30% of the world’s data by volume. By 2025, the compound annual growth rate of data for healthcare is projected to reach 36%. For context, that is 6 percentage points faster than manufacturing, 10 percentage points faster than financial services, and 11 percentage points faster than media and entertainment.4
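To make the effect of compounding concrete, the short Python sketch below compares how much a data volume would multiply at a 36% compound annual growth rate versus a rate 6 percentage points lower. The 5-year horizon and the function name are illustrative assumptions, not figures taken from the cited sources.

```python
def growth_multiple(annual_rate: float, years: int) -> float:
    """Return the factor by which a quantity grows at a compound annual rate."""
    return (1.0 + annual_rate) ** years


# Illustrative assumption: a 5-year horizon. The two rates come from the text.
healthcare = growth_multiple(0.36, 5)      # healthcare data, 36% per year
manufacturing = growth_multiple(0.30, 5)   # 6 percentage points slower

print(f"Healthcare: {healthcare:.2f}x, manufacturing: {manufacturing:.2f}x")
```

A seemingly small difference in annual rate widens noticeably once it is compounded over several years.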
Healthcare data are also more accessible than ever before. The historical barriers that created silos (eg, limited incentives to share, security and privacy concerns, and technical inconsistencies) have largely been reduced through either federal legislation or technological advances. One of the most evident examples is the ONC Final Rule implementing the 21st Century Cures Act (2016), which mandates the adoption of standardized application programming interfaces (APIs) to allow individuals to access structured electronic health information using smartphone applications. It also requires that patients be able to access all of their electronic health information, structured or unstructured, at no cost.5 This is not to say that accessing healthcare data is easy; it is just easier.
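As a rough illustration of what structured access through a standardized API can look like, the sketch below parses a record shaped like an HL7 FHIR Patient resource. It assumes the API returns FHIR-style JSON. The sample record is invented for illustration and the helper name summarize_patient is hypothetical; a real integration would retrieve the JSON from a provider's authorized endpoint.

```python
import json

# Invented sample shaped like a FHIR Patient resource (not real patient data).
SAMPLE_PATIENT_JSON = """
{
  "resourceType": "Patient",
  "id": "example-123",
  "name": [{"family": "Doe", "given": ["Jane", "Q"]}],
  "gender": "female",
  "birthDate": "1980-04-12"
}
"""


def summarize_patient(raw_json: str) -> dict:
    """Extract a few demographic fields from a Patient-shaped JSON document."""
    resource = json.loads(raw_json)
    if resource.get("resourceType") != "Patient":
        raise ValueError("expected a Patient resource")
    # A Patient may carry several names; this sketch uses only the first one.
    names = resource.get("name") or [{}]
    first = names[0]
    full_name = " ".join(first.get("given", []) + [first.get("family", "")]).strip()
    return {
        "id": resource.get("id"),
        "name": full_name,
        "gender": resource.get("gender"),
        "birth_date": resource.get("birthDate"),
    }


print(summarize_patient(SAMPLE_PATIENT_JSON))
```

Because every conforming system exposes the same field names, the same parsing logic can in principle be reused across providers, which is the practical benefit of a standardized API.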
Choosing the right data sources is critical to avoid wasting time and money on RWD projects. This article focuses on RWD, which, as defined by the US Food and Drug Administration, means healthcare information derived from multiple sources outside of typical clinical research settings, including electronic health records (EHRs), claims and billing data, product and disease registries, and patient-generated data gathered by personal devices and health applications.6 More specifically, this article summarizes the pros and cons of 2 types of RWD sources: (1) data from entities involved with the delivery of care (“primary stakeholder”), and (2) data from commercial entities (“secondary stakeholder”).
“Primary Stakeholder” Data
Primary stakeholder data sources are the local data stores of those who create healthcare data, including providers, labs, insurance companies, and patients. The data storage systems they use include EHRs for clinical data, picture archiving and communication systems (PACS) for images, lab information systems for lab data, and claims databases for claims, among others (Table).
Table. Data Storage Systems in Healthcare
Primary stakeholder data sources are regarded as the source of truth in healthcare even though they are not always accurate and are frequently incomplete. They are regarded this way because the creators of these data are the ones delivering care to patients, whether that means treatment, diagnosis, or billing. Additionally, the union of all primary stakeholder data sources is the most comprehensive view of the patient that the healthcare system currently offers. Primary stakeholder data sources represent the greatest opportunity for volume and detail and can, in theory, support any type of query, whether broad or narrow in scope.
Sourcing data from primary stakeholder data sources is therefore good for highly customized data pulls. And because they connect to the same systems that providers use, bidirectional integrations with primary stakeholder sources also offer the opportunity to integrate into workflows and audit source files. These types of connections are not uncommon. As just one example, Mayo Clinic had licensed access to its de-identified patient data to 16 companies as of 2020.7
That being said, working with primary stakeholder sources can be difficult and time-consuming. Covered entities are generally careful about sharing patient data, even though an increasing number are doing so. The security, privacy, and technical liabilities that come with taking ownership of the agreement, along with the technical implementation challenges, are not to be underestimated.
Additionally, primary stakeholder data can be extraordinarily heterogeneous and unstructured. Institutions, clinics, and other types of care settings frequently adhere to different data standards, diluting the benefit of any single standard. It is also estimated that the vast majority of primary stakeholder data are unstructured, creating a barrier to syntactic and semantic interoperability.8
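The heterogeneity problem can be illustrated with a minimal Python sketch in which two sites record the same visit using different field names, date formats, and code values, and each must be mapped to a common schema before the data can be pooled. Both site formats, all field names, and the sample records are hypothetical and stand in for whatever conventions real institutions use.

```python
from datetime import date, datetime


def normalize_site_a(record: dict) -> dict:
    """Hypothetical site A: MM/DD/YYYY dates and single-letter sex codes."""
    sex_map = {"F": "female", "M": "male"}
    return {
        "patient_id": record["pt_id"],
        "sex": sex_map.get(record["sex"], "unknown"),
        "visit_date": datetime.strptime(record["visit"], "%m/%d/%Y").date(),
    }


def normalize_site_b(record: dict) -> dict:
    """Hypothetical site B: ISO 8601 dates and spelled-out sex values."""
    return {
        "patient_id": record["patientIdentifier"],
        "sex": record["gender"].lower(),
        "visit_date": date.fromisoformat(record["encounterDate"]),
    }


# Invented records describing the same kind of event in two local formats.
site_a = {"pt_id": "A-001", "sex": "F", "visit": "03/15/2021"}
site_b = {"patientIdentifier": "B-042", "gender": "Female", "encounterDate": "2021-03-15"}

combined = [normalize_site_a(site_a), normalize_site_b(site_b)]
for row in combined:
    print(row)
```

Each additional source with its own conventions requires its own mapping, which is one reason direct integrations with many primary stakeholders become costly.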
Secondary Stakeholder Data
Secondary stakeholder data sources are third parties that aggregate data, or the permissions for the data, and license or grant access to a consumer. Secondary stakeholder sources fall into 3 categories (Figure 1):
• Commercial patient registries
• Data vendors
• Data marketplaces
Figure 1. Primary stakeholder data source share by organization type
Commercial Patient Registries
Patient registries created from EHR integrations have been around for a long time; however, they have gained considerable scale over the past 20 years. Specialty societies, academic centers, government agencies, and patient advocacy groups have led the initiative to aggregate patient data for research purposes, and to provide access for a fee. There are well over 120 registries, 90% of which are offered by specialty societies.9 The registries aggregate data from many different sources, usually EHRs, to create a specialized database for a particular specialty, therapeutic area, indication, or patient profile. The size of these registries ranges from a few patients to tens of millions of patients, and not all are commercial. Examples of specialty society registries that have been commercialized include the American Academy of Ophthalmology’s (AAO) IRIS,10 American College of Cardiology’s (ACC) and Veradigm’s Cardiology Registry (formerly PINNACLE),11 and the American Society of Clinical Oncology’s (ASCO) CancerLinQ, among others.12
Registries are good for retrospective research on specific patient populations or therapeutic areas. In general, these registries are used only for research and not for commercial purposes. There are exceptions, such as when the registry sponsor partners with a for-profit data analytics company to prepare the data for commercial research projects. Verana Health is a good example of a “public-private partnership” in which several nonprofit specialty societies (AAO, the American Academy of Neurology [AAN], and the American Urological Association [AUA]) work with Verana Health to curate and commercialize the data in their registries.
One disadvantage of registries is that the consumer has very little control over the content and curation strategies. Additionally, registries are not always open to the public. Historically they have been used for noncommercial research rather than commercial initiatives, and some maintain that approach. The National Amyotrophic Lateral Sclerosis Registry, for example, reviews applications and permits access only to organizations that attest that their use of the data is aligned with the registry’s mission.
Data Vendors
Increasingly, both for-profit and nonprofit organizations are aggregating data from various healthcare settings, including health systems, claims clearinghouses, and labs, for commercial purposes. These data are de-identified to comply with the Health Insurance Portability and Accountability Act (HIPAA). Because vendors are able to combine single data types across many locations, they offer some of the largest datasets available. Examples of data vendors include Definitive Healthcare, Clarify Health, and IQVIA for claims data, and ConcertAI, Flatiron Health, and others for clinical data.
Data vendors are an efficient data source for retrospective analyses of reasonably well-normalized datasets. The advantage of working with data vendors is that they bring together many disparate data sources into a common database and typically have something for everyone. Replicating their scale and heterogeneity through a home-grown initiative is oftentimes an insurmountable task. They are commercial enterprises, which typically (but not always) means they move quickly to work with data consumers.
The disadvantages of working with a data vendor are that the consumer has very little control over which data are collected or how they are pooled with other data, has no ability to audit the data at the source, and gets no workflow integrations with the data sources. Additionally, the costs tend to be very high (Figure 2).
Figure 2. Healthcare data company formations (cumulative and absolute)
Data Marketplaces
Marketplaces are essentially brokers connecting buyers and sellers of data. These companies aggregate both supply and demand for the data and make the connection. Examples include HealthVerity, Veradigm (an affiliate of multiple EHR companies), and Prognos Health. The benefit of these marketplaces is that they offer consumers a relatively fast (although not inexpensive) way to purchase de-identified data and offer sellers a way to monetize their data assets. They are good for retrospective analysis of reasonably well-normalized and curated datasets.
Marketplaces are a good source of data when speed is a priority. Their datasets may be similar in breadth to those of data vendors and are often ready for an initial analysis through a web portal. This means that analytics can start almost immediately (which in healthcare means a few days or weeks to secure data privacy permissions).
What is gained in speed can be lost in detail. Many of the data marketplace datasets are what-you-see-is-what-you-get. There is essentially no backward compatibility or ability to integrate into provider workflows. Data marketplaces are therefore a good fit for retrospective data that need to be accessed quickly.
Ramifications
Demand for healthcare data is increasing as access gets easier. The chief ramification of this change is that organizations across the healthcare industry can now create and implement a “healthcare data strategy.” A healthcare data strategy sets out how an organization will use empirical evidence to improve the development and launch of products and services to care for patients.
Choosing the right data source and sourcing technology efficiently are paramount when crafting a healthcare data strategy. Key considerations such as data type, quality, latency, and functionality will affect one’s ability to use healthcare data for an organization’s objectives. Each variable brings its own advantages and disadvantages, many of which can be analyzed prospectively.
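One simple way to compare candidate sources against these considerations is a weighted scoring matrix, sketched below in Python. The criteria names mirror the considerations listed above; the weights, the 0-to-5 scores, and the candidate names are placeholders invented for illustration and do not represent an assessment of any actual source.

```python
CRITERIA = ("data_type_fit", "quality", "latency", "functionality")


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of per-criterion scores; weights need not sum to 1."""
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


# Placeholder weights reflecting a hypothetical project's priorities.
weights = {"data_type_fit": 3, "quality": 3, "latency": 2, "functionality": 2}

# Placeholder 0-5 scores for two hypothetical candidate sources.
candidates = {
    "candidate_a": {"data_type_fit": 5, "quality": 3, "latency": 2, "functionality": 4},
    "candidate_b": {"data_type_fit": 3, "quality": 4, "latency": 5, "functionality": 1},
}

ranked = sorted(
    candidates,
    key=lambda name: weighted_score(candidates[name], weights),
    reverse=True,
)
for name in ranked:
    print(name, round(weighted_score(candidates[name], weights), 2))
```

The value of the exercise lies less in the final number than in forcing the organization to state its priorities explicitly before committing to a source.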
Now there are many reliable options—and more on the way—setting the stage for an exciting new era for the biopharma industry.
References
1. Culbertson N. The Skyrocketing Volume of Healthcare Data Makes Privacy Imperative. Forbes. August 6, 2021. Accessed August 31, 2023. https://www.forbes.com/sites/forbestechcouncil/2021/08/06/the-skyrocketing-volume-of-healthcare-data-makes-privacy-imperative/?sh=11b3efb16555
2. Wiederrecht G. The Convergence of Healthcare and Technology. RBC Capital Markets. Accessed August 31, 2023. https://www.rbccm.com/en/gib/healthcare/episode/the_healthcare_data_explosion
3. McKinsey & Company. Creating Value From Next-Generation Real-World Evidence. July 23, 2020. Accessed August 31, 2023. https://www.mckinsey.com/industries/life-sciences/our-insights/creating-value-from-next-generation-real-world-evidence
4. Office of the National Coordinator for Health Information Technology (ONC). Accessed August 31, 2023. https://www.healthit.gov
5. Hirschler B. Big Pharma, Big Data: Why Drugmakers Want Your Health Records. Reuters. March 1, 2018. Accessed August 31, 2023. https://www.reuters.com/article/us-pharmaceuticals-data/big-pharma-big-data-why-drugmakers-want-your-health-records-idUSKCN1GD4MM
6. US Food & Drug Administration. Framework for FDA’s Real-World Evidence Program. December 2018. Accessed August 31, 2023. https://www.fda.gov/media/120060/download
7. Ross C. At Mayo Clinic, sharing patient data with companies fuels AI innovation—and concerns about consent. STAT News. June 3, 2020. Accessed August 31, 2023. https://www.statnews.com/2020/06/03/mayo-clinic-patient-data-fuels-artificial-intelligence-consent-concerns/
8. Hoyt RE, Bernstam EV, Hersh WR. Chapter 1: Overview of health informatics. In: Hoyt RE, Hersh WR, eds. Health Informatics: Practical Guide. 7th ed. Informatics Education; 2018:1-6.
9. CMSS Primer for the Development and Maturation of Specialty Society Clinical Data Registries. Council of Medical Specialty Societies; 2016. https://cmss.org/wp-content/uploads/2016/02/CMSS_Registry_Primer_1.2-1.pdf
10. American Academy of Ophthalmology. IRIS registry. Accessed August 31, 2023. https://www.aao.org/iris-registry
11. American College of Cardiology. Partner registries. Accessed August 31, 2023. https://cvquality.acc.org/NCDR-Home/registries/outpatient-registries
12. American Society of Clinical Oncology. CancerLinQ. Accessed August 31, 2023. https://www.cancerlinq.org/