CD2H Phase 2 Proposal
Project Title: Science of translational science research platform
Dave Eichmann, email@example.com, Iowa
Keith Herzog, firstname.lastname@example.org, NU
Kristi Holmes, email@example.com, NU
There are numerous sources of metadata regarding research activity, and CTSA hubs currently duplicate effort in acquiring, linking, and analyzing them. This project provides a shared data platform for hubs to collaboratively manage these resources and avoid redundant effort. In addition to the shared resources, participating CTSA hubs will be provided private schemas for their own use, as well as support in integrating these resources into their local environments.
This proposed project builds upon multiple components completed in the first phase, specifically: a) data aggregation and indexing of research profiles, their ingest into CTSAsearch, and improvements to CTSAsearch by Iowa (http://labs.cd2h.org/search); b) NCATS 4DM, a map of translational science; and c) metadata requirements analysis and ingest from a number of other Phase 1 projects, including educational resources from DIAMOND and N-lighten, development resources from GitHub, and data resources from DataMed (bioCADDIE) and DataCite. This work also builds on the team's related work on data sources, workflows, and reporting, including disambiguated PubMed (U of Iowa), ORCID data and integrations, NIH RePORT, Federal RePORTER, and other data sources and tools.
GitHub repo: N/A
Organizations expend substantial effort maintaining local databases of effectively the same data (people, publications, grants, etc.), and the challenges of scholar disambiguation and of longitudinal data collection and tracking remain unsolved. A shared data environment in the form of a warehouse of research data was strongly endorsed by participants in the most recent PEA Community meeting. Collaborative population and maintenance of common data would reduce local hub effort, improve data quality, and serve as an exemplar of collaborative activity for the CTSA program and NIH programs overall. Hubs have already spent substantial effort on this topic, establishing priorities and developing manual and semi-automated processes that can help guide efforts toward automation.
The 4DM Project (Drug Discovery, Development and Deployment Map) created by NCATS has generated substantial interest in understanding the interdependencies of translational research and the entities involved. The 4DM prototype will be extended to incorporate relevant backing data from the data warehouse, displayed when a user selects a vertex in the visualization graph. Ultimately, hubs can leverage these data for a variety of purposes, including workflows for improved data quality, process efficiency, automation, and benchmarking. We will first examine relevance for longitudinal scholar data tracking and reporting.
Initial steps in this project include:
- Configuration of a core warehouse instance – potentially situated in the NIH/NCATS cloud environment
- Migration of existing schemas into the warehouse and instantiation of maintenance processes to keep them up-to-date
- Configuration of local schemas for each CTSA hub and other interested parties
- Creation of example solutions for ingest/extract using JDBC, REST, and tools such as Teiid (an open source data federation platform)
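The schema layout described in the steps above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes SQLite (via Python's standard library) as a stand-in for the warehouse, since the proposal does not name a database engine, and the table names `publication` and `scholar_publication`, the hub schema names, and both helper functions are hypothetical.

```python
import sqlite3


def build_warehouse(hubs):
    """Create an in-memory stand-in for the warehouse.

    One schema named 'shared' holds the common data; each hub gets its
    own schema for private tables. All names here are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    # Each attached ':memory:' database acts as a separate schema.
    conn.execute("ATTACH DATABASE ':memory:' AS shared")
    conn.execute(
        "CREATE TABLE shared.publication ("
        "pmid INTEGER PRIMARY KEY, title TEXT NOT NULL)"
    )
    for hub in hubs:
        # Schema names cannot be bound as parameters, so validate them.
        if not hub.isidentifier():
            raise ValueError(f"invalid schema name: {hub!r}")
        conn.execute(f"ATTACH DATABASE ':memory:' AS {hub}")
        conn.execute(
            f"CREATE TABLE {hub}.scholar_publication ("
            "scholar TEXT NOT NULL, pmid INTEGER NOT NULL)"
        )
    return conn


def hub_report(conn, hub):
    """Join a hub's private scholar list against the shared publications."""
    if not hub.isidentifier():
        raise ValueError(f"invalid schema name: {hub!r}")
    query = (
        "SELECT s.scholar, p.title "
        f"FROM {hub}.scholar_publication AS s "
        "JOIN shared.publication AS p ON p.pmid = s.pmid "
        "ORDER BY s.scholar, p.pmid"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_warehouse(["iowa", "nu"])
    conn.execute("INSERT INTO shared.publication VALUES (1, 'Example paper')")
    conn.execute("INSERT INTO iowa.scholar_publication VALUES ('Scholar X', 1)")
    print(hub_report(conn, "iowa"))
```

The point of the sketch is the division of responsibility: the shared publication table is maintained once, while each hub keeps only its own linking data and joins against the common tables when it needs a report.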
The PEA working group has the core of such a warehouse in place in support of the data enrichment elements of CTSAsearch and the phase 1 projects enumerated above.
We will then integrate the 4DM user interface and data model with the service interfaces and data model present in CTSAsearch. We also propose several evaluative efforts to assess data quality and currency, workflows, and application. In addition, we intend to explore more fully globally unique identifiers for researchers and their work products, as well as disambiguation of these entities at an acceptable level of data quality and at CTSA scale. We will work with the education committee, the workforce DTF, the CTSA evaluators, and training grant PIs and support staff to establish a clear scope and requirements, and to establish a vehicle for sharing results.
The complexity of the translational landscape is challenging to grasp, both for patients (and their caregivers) and for experts deeply enmeshed in translational science. This landscape study will help to identify data and reporting needs, opportunities for progress, and a clear opportunity for incorporating CD2H initiatives/data into hub-level scholar workflows. A common environment for shared access to data/metadata regarding research reduces effort, improves quality, and can serve as substantial infrastructure for multiple current and future CTSA projects. In particular, it will serve as a means of disseminating data to the community. This work will result in a hybrid 4DM-CTSAsearch tool that can provide significant aid to translational information seekers. It has the potential to greatly benefit local hubs by improving their processes for data collection, analysis, and reporting, and could save the hubs substantial time and effort. Moreover, standard processes open up the possibility of offering data, analysis, and visualizations (e.g., interactive, online dashboards) as a service, and of improved reproducibility overall.
Expected outputs (6 months):
At 6 months, we expect to have completed a landscape analysis of data, tools, and workflows and to have established an operational warehouse with a core set of common data and multiple hubs in active use. We will engage in active requirements gathering with the community and with NCATS, and will leverage the personas to help ensure better UX overall. We will develop a disambiguation environment that is available to the community and in use by multiple CTSA hubs. We will also develop two 4DM prototypes: one wrapping CTSAsearch services, and one mapping 4DM-style user interfaces onto CTSAsearch. Our goal is to generate sufficient feedback from the community to drive decision-making on potential next steps.