Next Generation Data Sharing


Harmonize the data ecosystem. An improved data ecosystem will enhance and extend existing work being performed on the NCATS Data Translator system, which integrates clinical and translational data at scale for mechanistic discovery, as well as other emergent systems such as the NIH Commons. We will apply our strengths and existing activities to make data FAIR-TLC: Findable, Accessible, Interoperable, and Reusable, as well as Traceable, Licensable, and Connected. We will help contributors and users develop and apply data standards, Common Data Elements (CDEs), and other commonly used data models and standards such as FHIR and OMOP/OHDSI. We will extend and supplement infrastructure, training, and collaborative environments to enable data to be shared openly, so that groups can collaborate on its harmonization based on specific needs or standards. The data ecosystem will provide CTSA-wide quality assurance reports and data quality assessments, as well as gold-standard datasets and synthetic clinical datasets. Fundamentally, we aim to develop an open-science ethos and unite CTSA community data sharing with broader global efforts.


  • Data inventory & API registry for translational knowledge integration and discovery
  • Support robust data sharing via technologies such as Synapse
  • Work with iDTF/ACT to build consensus on shared data models & ontologies
  • Develop licensing standards and computable data use agreements; permissions navigation
  • Develop data quality assessment standards


The goal is to improve the interchangeability and utility of clinical data across the CTSA network, in partnership with related CTSA programs such as ACT and TIN.


Three main projects have been established:

Clinical Data Harmonization

The Data Harmonization Working Group in CD2H hosted a two-day workshop (May 20-21) in Baltimore on data harmonization and federated query. Presentations with video and slides are available in this folder. The first part of the meeting gave an overview of the common data models prevalent among CTSA hubs, outlining their structure, evolution, governance, and operations. OMOP/OHDSI, PCORnet, and ACT were featured. Additional presentations covered HL7 FHIR, including its integration with i2b2 repositories, its role in the NCATS/FDA/CDC data model harmonization efforts, and its potential as a canonical hub for translational research data interoperability. The importance of maintaining linkages from the traditional common data models into and out of FHIR repositories was emphasized.

The second group of presentations illustrated a federated research query system in South Carolina, built around FHIR queries and data repositories. Colleagues from the CDC gave an overview of the challenges of public health data surveillance and the role that FHIR can play in federated query of clinical organizations. The CTSA ACT consortium and its use of SHRINE for federated query were described in more detail. Efforts in large research consortia to incorporate data elements of importance to translational research, such as the eMERGE efforts with genomics, were highlighted. Finally, practical open-source FHIR servers from Google and the Cerner Bunsen project were described.
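As a rough illustration of what a FHIR-based federated query can look like, the sketch below builds the same FHIR search request for several sites. It is not the South Carolina or ACT implementation. The site base URLs and the function names are hypothetical; the `code` token search parameter and `_summary=count` are standard FHIR search features, and LOINC 718-7 (hemoglobin in blood) is used only as an example code. No network call is made; the sketch only shows how one question is fanned out as one request per site.

```python
from urllib.parse import urlencode

# Hypothetical FHIR base URLs standing in for participating sites.
SITE_BASE_URLS = [
    "https://fhir.site-a.example.org/R4",
    "https://fhir.site-b.example.org/R4",
]


def build_count_query(base_url, loinc_code):
    """Build a FHIR search URL that asks one site how many Observation
    resources carry the given LOINC code."""
    params = {
        # Token search: code system and code separated by "|".
        "code": f"http://loinc.org|{loinc_code}",
        # Ask the server for a count only, not the matching resources.
        "_summary": "count",
    }
    return f"{base_url}/Observation?{urlencode(params)}"


def federated_queries(base_urls, loinc_code):
    """Return one search URL per site; a real system would send each
    request and aggregate the returned counts."""
    return [build_count_query(url, loinc_code) for url in base_urls]


if __name__ == "__main__":
    for url in federated_queries(SITE_BASE_URLS, "718-7"):
        print(url)
```

In this arrangement each site answers from its own repository and returns only an aggregate count, so patient-level data does not have to leave the organization.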

Panels and breakout groups further explored these themes and the practical issues confronting CTSA hubs. A full manuscript about the meeting is forthcoming. Among the concluding points of convergence were:

  • Establish a translational research connection with the HL7 and FHIR development community, such as a Translational Research Working Group
  • Systematically study how traditional common data models can be represented in FHIR, and explicitly identify gaps for components that cannot be represented presently
  • Identify and package FHIR education materials appropriate for CTSA informatics teams with appropriate context, introduction, and brief description
  • Enumerate, describe, and compare existing FHIR-based data repositories from IT vendors and open-source development groups, for application in translational science
  • Convene a task group to examine sustainability strategies for data management infrastructure, and change management across CTSA organizations

All of these task groups have since been created.

For detailed status on these projects go to: 

Health Open Terminologies (HOT) on FHIR

Virtually all clinical and translational science data must be bound to controlled terminologies or ontologies to be interoperable, interpretable, and computable. The integration of data from different, potentially disparate domains requires a federated, shared suite of terminology resources based on a common set of APIs, representational formats, and underpinning semantics. The goal of the HOT-FHIR ecosystem project is to establish a unifying framework and scaffolding that allows terminological resources to be integrated, merged, and extended to meet the requirements of the translational community.

For detailed status on this project go to: 

EHR Data to Human Phenotype Ontology

The Human Phenotype Ontology (HPO) is a freely available, open-source, logically defined vocabulary for describing human abnormal phenotypes. The HPO has become the de facto standard for computational phenotype analysis in genomics and rare disease, and is used by the NIH Undiagnosed Diseases Network, the 100,000 Genomes Project, and many other academic, clinical, and commercial entities. As of February 2019, the HPO contained 14,184 terms.

A phenotype-driven approach opens up entirely new ways of mining EHR data for correlations that might be important in understanding disease pathophysiology, gender or age differences, and biomarkers. Doing so requires careful analysis methods. We expect that many phenotype abnormalities will be highly correlated in all disease states, so identifying such an “obvious” correlation would not be an interesting result. For instance, Abnormal hematocrit and Abnormal hemoglobin level are expected to be highly correlated. Here, we propose adapting the approach used to characterize synergy networks in expression data, which was developed to find gene-gene interactions that are specifically associated with a phenotype (such as a particular cancer). The method is based on an information-theoretic analysis of multivariate synergy that decomposes sets of genes into submodules, each of which contains synergistically interacting genes. The method can be extended to phenotypes to search for pairs of markers (HPO terms) that show mutual information conditional upon the presence of a specific diagnosis (e.g., an ICD9 code, or possibly an eMERGE classification). The result would be a data-driven way of defining pairs of features that show a surprising correlation in the presence of a disease. This might lead to the discovery of potential biomarkers: if one HPO term of a pair is found in a person with a given disease, then “synergy” would suggest that the other HPO term of the pair is more likely to be present than expected by chance. We also believe this might be a good opportunity to engage CTSA hubs in data exploration, or to use this approach and the resulting derived data for DREAM challenges.
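To make the idea concrete, here is a minimal sketch of a pairwise synergy score. It assumes one common bivariate definition from the synergy literature, Syn(A, B; D) = I(A, B; D) − [I(A; D) + I(B; D)], where A and B are presence/absence indicators for two HPO terms and D indicates the diagnosis. The project text does not specify a formula, so this definition, the function names, and the toy data are all illustrative assumptions rather than the project's protocol.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of a sequence of hashable observations."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def synergy(term_a, term_b, diagnosis):
    """Assumed bivariate synergy: information the pair of HPO terms carries
    about the diagnosis beyond the sum of what each term carries alone."""
    pair = list(zip(term_a, term_b))
    return (
        mutual_information(pair, diagnosis)
        - mutual_information(term_a, diagnosis)
        - mutual_information(term_b, diagnosis)
    )


if __name__ == "__main__":
    # Toy data, one entry per patient (1 = present, 0 = absent).
    # The diagnosis is present exactly when one term, but not both, is present.
    term_a = [0, 0, 1, 1]
    term_b = [0, 1, 0, 1]
    diagnosis = [0, 1, 1, 0]
    print(synergy(term_a, term_b, diagnosis))  # positive: the pair is synergistic

    # Redundant terms (both simply mirror the diagnosis) score negative.
    print(synergy(diagnosis, diagnosis, diagnosis))
```

Under this definition, a positive score flags pairs whose joint pattern is informative about the diagnosis even when neither term is informative alone, while redundant pairs such as Abnormal hematocrit and Abnormal hemoglobin level would be expected to score at or below zero. Real EHR data would also need attention to sample size and estimation bias, which this sketch ignores.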

A detailed implementation protocol is available in this Google Doc.

For detailed status on this project go to: