CD2H response to the NIH Strategic Plan for Data Science RFI

The CD2H response to the NIH Strategic Plan for Data Science RFI is outlined below.  We welcome ongoing opportunities for discussion and collaboration on these and many other issues facing the data science community and appreciate the opportunity to submit this response.

1) The appropriateness of the goals of the plan and of the strategies and implementation tactics proposed to achieve them

We applaud NIH for prioritizing the development of an effective data science Strategic Plan and for reaching out to experts in the community for feedback and collaboration. We are excited that the Strategic Plan envisions a new data science paradigm within biomedical research. However, the plan currently covers such an ambitious range of issues that it is as yet too general to critique meaningfully, and it is difficult to understand how it will be implemented. While some issues presented within the Strategic Plan seem simple, our experience is that the details, particularly with regard to data integration, are complex and likely to raise critical issues. We encourage the NIH Chief Data Strategist, the Scientific Data Council, and the Data Science Policy Council to engage continuously with the community, so that strategy and implementation details can be realized as part of a collaborative and reasonably nimble implementation plan (perhaps via RFI). Such an implementation plan should be grounded in a landscape analysis so that existing software can be maximally leveraged and available funds devoted to addressing crucial gaps.

While competitions (e.g., challenges, hackathons, and funding projects with overlapping scope) are deemed valuable, it is critical that the outcomes of these efforts meet basic interoperability standards and/or contribute to addressing gaps. A prime example of unclear expectations around such interactions is the “Commons” resources. While pursuing multiple approaches can be critical in creatively addressing long-standing issues, the lack of cross-team understanding around requirements and scope can hinder future interoperability. Other examples are disease- or anatomy-specific resources such as FaceBase and DkNET, all of which could have been built from similar infrastructure components.

Through the Strategic Plan, Implementation Plan, and NIH leadership, we hope to see steps to bridge differences between Institutes and Centers and thus begin to develop some common ground across the broader NIH community. Essential to this effort is building on the Holdren memo and other broad policies to facilitate meaningful data sharing and reuse. To date, shared data has too frequently become siloed within specific systems, and therefore the full value of shared data has not been achieved. The FAIR standard is a first step toward addressing these issues; however, significant work remains to achieve its intent.

A first step within this effort is for NIH to unpack what it and others mean by “FAIR”. Critical to advancing this effort will be funding the “blue collar” work and tools needed for Interoperability and Reusability, arguably the hardest parts of FAIR. Such activities include ontology maintenance, identifier management and mapping tools, licensing, extensive biocuration, and contributor roles/research outputs (see below), which have traditionally fared poorly in grant peer review. We urge NIH to consider the activities needed to achieve the potential of FAIR and to avoid “FAIR-washing”: implied adherence to the principles without actual implementation or validation.

2) Opportunities for NIH to partner in achieving these goals

NIH. Ironically, one of the most important partners is NIH itself. NCBI and other NIH infrastructure resources require modernizing to support current data science needs; however, there are few mechanisms to support collaboration with the community to perform these modernization tasks. Moreover, NCBI has not typically embraced standards that it did not invent, no matter how well adopted or useful. NIH should:

  • Prioritize collaboration using community best practices, standards, or resources
  • Insist on transparent and public development, trackers, and attribution – just like we expect of all our other data science colleagues.

Other federal and international partners. NIH should partner more with FDA, USDA, NSF, EPA, and other federal research infrastructure programs; for example, FDA’s drug knowledgebases and NSF’s CyVerse. Many such programs have a more modern approach to designing and sustaining research infrastructure, and much of that infrastructure is either relevant or fairly domain neutral. The significant resources in Europe (EBI, SIB) and Japan (DDBJ) should be leveraged, not duplicated. For genomic data and standards, the GA4GH is gaining traction and can help with standardization across genomic resources.

Component partners in the Commons. The Commons, as currently conceived and as mentioned repeatedly in the Plan, is not designed to support the Plan’s objectives. The Commons program should be refocused to support interoperability and connectivity of existing resources to meet the following requirements:

Data-ingest needs:

  1. Shared tooling for pushing data into the Commons according to the same models and standards. We and others have preliminary work on some of these types of tools and standards, but these are not currently part of the Commons program.
  2. Tooling for assessing data quality, whether syntactic (identifier syntax, field completion, timestamps, evidence) or biological (no ovary expression in male subjects, data at wrong stage of life, etc.). Again, many such tools exist but are not currently part of the Commons.
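The two kinds of quality check named in item 2 can be illustrated with a minimal sketch. Everything in it is a hypothetical assumption made for illustration, not part of the Commons or of any existing tool: the record fields, the CURIE-style identifier pattern, and the small table of implausible sex/tissue combinations.

```python
import re

# Assumption: identifiers follow a CURIE-like "prefix:local_id" syntax.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.]*:[A-Za-z0-9_.\-]+$")

# Assumption: hypothetical fields every record is expected to carry.
REQUIRED_FIELDS = ("subject_id", "sex", "tissue", "evidence")

# Assumption: illustrative biological rule table, keyed by reported sex.
IMPLAUSIBLE_TISSUES = {
    "male": {"ovary", "uterus"},
    "female": {"testis", "prostate gland"},
}


def check_record(record):
    """Return a list of human-readable quality problems found in one record."""
    problems = []

    # Syntactic check: field completion.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")

    # Syntactic check: identifier syntax.
    subject_id = record.get("subject_id")
    if subject_id and not CURIE_PATTERN.match(subject_id):
        problems.append(f"malformed identifier: {subject_id}")

    # Biological check: tissue must be plausible for the reported sex.
    sex = record.get("sex")
    tissue = record.get("tissue")
    if tissue in IMPLAUSIBLE_TISSUES.get(sex, set()):
        problems.append(f"implausible tissue for {sex} subject: {tissue}")

    return problems
```

Running checks of this kind at ingest time, against rules shared by all contributors, is one way the same standards could be applied to every resource pushing data into the Commons.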

Data access and export needs:

Communities and individual end users need views over the data that are tailored to their needs. User interface tools and APIs need to be built over the Commons to support views that are a) organism specific/scoped (zebrafish, mouse, vertebrates, etc.); b) disease specific/scoped (diabetes, neurological disorders, rare diseases); c) data type specific (genomic data, proteomic data, metabolomic data); and d) context specific/scoped (data generated by a person, a lab, an institution, a funded program, an NIH institute). These views need to interact with the Commons and the workflow engines within it (including provenance and integration aspects) to generate, use, and view derived datasets. This creates an ecosystem approach in which those who clean, improve, or derive data can give those enhancements back for others to use and view. Again, many such resources exist, but they are not included in the Commons.

Collaborate with industry more on data sharing, licensing, cloud solutions, and sustainability – see below.

3) Additional concepts that should be included in the plan

As indicated in the Plan, the NIH has funded data science in the same way it has hypothesis-driven mechanistic science. However, traditional funding vehicles may not always be appropriate for Data Science:

  • Data scientists often require deeper and more durable collaborations – to get data, to reuse and extend infrastructure that takes significant effort to build, to evolve and test others’ algorithms, etc.
  • Traditional outcome measures (e.g., tenure and promotion, grants) are not well suited to the diverse work data scientists do.
  • Review panels currently exclude non-PIs and thereby a large portion of the workforce uniquely suited to evaluate data science proposals (see also evaluation section).
  • Data and compute infrastructure is expensive to build and maintain; it should be improved and sustained as a collaborative community partnership rather than subject to the current funding mechanisms that favor overlapping and competing tools. Unfortunately, although the Plan mentions this difference and Goal 5 is about sustainability, there is virtually no discussion of how sustainability would be implemented.

Incentives and attribution need to be structured to encourage high-quality, continuously improved data; tools and infrastructure are needed to support this. Industry partnerships can improve sustainability but require finding the best balance between openness and revenue. This is a classic tragedy of the commons/anticommons: sustainability requires IT knowledge combined with advanced economics, business intelligence, entrepreneurship, and robust QC pipelines so that commercial interests know what they are getting. Such notions are sprinkled throughout the Plan, but they should be developed more robustly and clearly.

When building data science infrastructure, social barriers are often bigger than technical ones. NIH could build funding into projects (including cross-IC projects) to promote collaboration and interoperability. Infrastructure programs such as NCI’s ITCR have collaboration budgets built in to support integration work. NIH should consider making this a review criterion and prioritizing how such interoperability activities will be achieved and evaluated in infrastructure programs. Further, high-quality data resources often wane in interoperability because they do not have the remit and/or adequate financial support to keep pace with modern standards; the model organism databases (MODs) are just one example.

On a related topic, we recommend that the plan make explicit provision for biocuration, specifically to modernize current processes, improve standards, and expand training. While data “cleaning” is perceived as mundane, data scientists are increasingly coming to grips with the challenges and limits of automating tasks such as generating metadata, recording provenance, and referencing other data. Biocuration and the people who perform this critical work are often undervalued, whether they perform record-by-record curation (such as in the Gene Ontology) or database-by-database curation via ETL mechanisms that harmonize data at scale (such as in the Monarch Initiative, NCATS Data Translator, Bgee, etc.).

4) Performance measures and milestones that could be used to gauge the success of elements of the plan and inform course corrections

We believe it would be beneficial to:

  • Further describe the desired impact of linking the NIH Data Commons and existing, widely-used NIH databases/data repositories.
  • Ensure new NIH data resources are connected to other NIH systems upon implementation.
  • Understand the community and resources to design effective coordination strategies.

We believe that existing and nascent projects would benefit from discussion in the Implementation Plan of backbone work such as shared data formats, API standards, and common ETL and curation tools. It is also unclear how NIH will measure the effectiveness and impact of the infrastructure. We suggest that NIH engage a professional evaluator and combine existing rubrics such as FAIR and FAIR-TLC (doi:10.5281/zenodo.203295) into a best-of-breed instrument. Additionally, data, software, and expertise need to be evaluated together, and infrastructure planned accordingly.

That the Plan proposes different criteria for different types of software and data resources suggests a lack of understanding of the landscape of existing resources and users. The Plan appropriately recognizes clinical data inclusion as a key goal, and resources such as the Translator aim to do exactly this. What is missing from the Plan is an approach to crossing the “chasm of semantic despair” (coined by Chute), on the far side of which we understand how to relate biomedical research to improved clinical processes, outcomes, and insights. The NIH CDE library is promoted, but no effort is described to evaluate and improve it in order to cross the chasm more effectively. Substantial work is needed to evaluate CDEs, LOINC, and clinical terminologies, notably to ensure interoperability with basic research vocabularies and standards.

Goal 5 is about policies for stewardship and sustainability, but there is little detail about how to achieve this. It states that NIH will promote community-guided development of open data-use licenses. We applaud this. We ourselves have been working hard in this area, because data integrators are currently blocked from redistributing seemingly open data, and the answer is not as simple as declaring that everything should be CC0 (see our letter to NIH at: doi:10.6084/m9.figshare.4972709.v1 and our project).

Better metrics will help support stewardship and sustainability. Disproportionate emphasis is placed on publication metrics as a stand-in for value, utility, or impact. It would be better to directly identify, track, and assess the role that datasets and other assets play in biomedical research, improved outcomes, and healthier communities. Through investments in data systems and their interoperability, these connections can be made directly through the systems themselves, enabling researchers to trace impacts and outcomes across the spectrum of biomedical research.

Finally, data science is inherently collaborative. Recognition and proper credit of the data science workforce is critical to catalyzing lasting change. Systems can support recognition of these efforts through better contributor roles that provide a more nuanced view of the work required to support good data practices, as well as better recognition of the diverse array of outputs created in the research process, extending far beyond publication.

5) Any other topic the respondent feels is relevant for NIH to consider in developing this strategic plan

Goal 1 is about “Efficient and Effective Biomedical Research Data Infrastructure”, but focuses almost entirely on cloud storage. While cloud computing is important, in the absence of meaningful integration, simply moving data to the cloud achieves neither efficiency nor effectiveness. By contrast, the work described above to support interoperability and data quality is agnostic to whether or where data is in the cloud. Further, the Plan makes no mention of the technologies that will be needed to support security and to match different types of users to different types of access, something that industry and clinical systems have as part of their modus operandi. The NIH Data Commons is used as an example of working towards efficiency and effectiveness, but the data quality and interoperability issues described above are not adequately addressed in the Commons effort to date. The Commons should focus more on building connectivity between existing systems and data.

Goal 4 is about education. Currently, most research infrastructure programs don’t specifically include trainees, as infrastructure development and the team science required for it have not been considered “science.” This is a missed opportunity, particularly since many universities now have interdisciplinary data science programs that treat such work as science. Similarly, the clinical professions offer little data science training and little understanding of how the data captured in the clinic will be used. If we want to support translational science across all career levels and all professions, we need to build basic data stewardship and information science ethics into all educational programs.

We concur that it would be beneficial to train NIH program staff in data science as well. On the NCATS Data Translator, program staff participate in the technical work of the project; this has been a huge asset in guiding the work and in revealing fundamental areas of synergy and gaps.

We also concur that there are enormous opportunities for libraries, as a central nexus of most university settings, to participate in enriching the data-science ecosystem for biomedical research. However, most library staff are not trained in data science or research. While approaches to developing this capability are beyond the scope of this response, we wholeheartedly agree that they should be prioritized.

Data science is a very young field, and many who have come to it have done so via non-traditional mechanisms. Review panels currently exclude non-PIs and thereby a large portion of the workforce uniquely suited to evaluate data science proposals. Moreover, a more inclusive peer review would ideally go much further than just assessing a proposal’s merit; it could evaluate the implementation as well. Finally, we should leverage new metrics for understanding the degree of collaboration and interoperability across people and resources.