CD2H Phase 2 Proposal
Project Title: Reusable Data best practice portal
CTSA hubs produce many valuable datasets that can be shared across the broader research community. A significant gap exists between data providers and the data discovery portals where researchers look for relevant datasets (including the generic Google Dataset Search and many biomedical-specific data portals). Data providers typically lack sufficient guidance on how to expose their dataset metadata to these portals. This web-based data-sharing best-practice portal will provide interactive widgets that guide providers through the important steps of data sharing, making their datasets discoverable through multiple data portals.
This effort builds on several Phase 1 efforts, including the schema viewer prototyping work done by the Data Discovery team for a Data Sharing Best Practices Portal, which lays the foundation for the metadata-authoring section of the proposed best-practice web portal. We will also leverage the outputs of the (Re)usable Data Project, which provides detailed criteria for evaluating data licensing information (NCATS Translator & the Data Commons) and describes data reusability criteria as they relate to the FAIR Data Principles.
The biomedical and informatics communities have largely endorsed the spirit and basic components of the FAIR Data Principles. Biomedical data producers, including CTSA hubs, need actionable best-practice guidance on how to make their data discoverable and reusable, and on how to bring the practical benefits of data sharing to researchers' own projects as well as to the research community as a whole. This project will target specific steps in the data-sharing process, such as data-hosting, data-licensing and metadata-authoring.
A collaborative guidebook
The data-sharing best practice guidebook will be created in the open and documented on GitHub. Community members will be able to read, edit, and add to the guidebook as it is being created. In addition to providing recommended practices for different scenarios, the guidebook will also document comprehensive information about the different options and standards that might fit users' particular data use-cases.
An interactive web portal
This data-sharing best practice web portal will translate essential principles from the guidebook into interactive and easy-to-follow web widgets. The portal will be particularly useful for CTSA end-users who are new to data-sharing. It will cover essential steps like data-hosting, data-licensing and metadata-authoring, and will provide entry points to the guidebook for more detailed documentation.
A metadata-authoring web portal
Schema.org provides an established way to share metadata about almost any entity across the web to maximize its discoverability. It is currently used by all major search engines (Google, Bing, etc.) and millions of individual websites. The biomedical community has started to adopt the same mechanism to address discoverability issues for reusable data. This adoption is still at an early stage, so the needed tools and documentation are largely incomplete or missing. The proposed metadata-authoring portal is designed to fill this gap and will include the following components:
- A schema editor to help users create their own schemas based on existing core class types from schema.org (e.g. the Dataset schema).
- A schema viewer to render a developed schema (defined in JSON-LD) as user-friendly HTML.
- A JSON-LD based mechanism to associate the appropriate ontology or terminology with a specific schema field.
- A metadata validation tool to provide immediate feedback on whether the selected metadata are valid against the schema and how the metadata will be rendered in search portals (e.g. Google Dataset Search and the CD2H discovery storefront).
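As a rough illustration of the kind of feedback such a validation tool could give, the sketch below checks a schema.org Dataset record written in JSON-LD and wraps it in the script tag that search engines read from a dataset landing page. The lists of required and recommended fields, the function names, the example record, and the placeholder ontology identifier are all assumptions made for this illustration; they are not part of the proposed portal's design.

```python
import json

# Hypothetical field lists: an actual portal would derive these from the
# schema being validated against. They are assumptions for this sketch.
REQUIRED_FIELDS = ("name", "description")
RECOMMENDED_FIELDS = ("license", "identifier", "creator")


def validate_dataset(metadata: dict) -> dict:
    """Return errors and warnings for a schema.org Dataset record."""
    errors = []
    warnings = []
    if metadata.get("@type") != "Dataset":
        errors.append("@type must be 'Dataset'")
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            errors.append(f"missing required field: {field}")
    for field in RECOMMENDED_FIELDS:
        if not metadata.get(field):
            warnings.append(f"missing recommended field: {field}")
    return {"valid": not errors, "errors": errors, "warnings": warnings}


def to_script_tag(metadata: dict) -> str:
    """Wrap JSON-LD metadata in the script tag used to embed it in HTML."""
    body = json.dumps(metadata, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Illustrative record. The "obo" prefix maps a compact identifier to an
# ontology IRI; EXAMPLE_0000001 is a placeholder, not a real ontology term.
example = {
    "@context": [
        "https://schema.org/",
        {"obo": "http://purl.obolibrary.org/obo/"},
    ],
    "@type": "Dataset",
    "name": "Example CTSA hub dataset",
    "description": "A made-up dataset record used only for illustration.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "about": {"@id": "obo:EXAMPLE_0000001"},
}

if __name__ == "__main__":
    print(validate_dataset(example))
    print(to_script_tag(example))
```

Separating errors from warnings is one possible design choice: it lets a provider publish a minimally valid record while still being nudged toward the licensing and attribution fields that make a dataset reusable.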
As the companion to the data-sharing best practice guidebook, this best-practice web portal can be treated as the “tutorial” to the comprehensive “full documentation”. It can also serve as an interactive channel for gathering user feedback during the development of the guidebook. Finally, it can make it easier for data stewards and others to follow best practices for metadata, resulting in higher-quality metadata, enhanced discoverability, and improved dissemination and impact for digital objects (e.g., datasets).
This work can support many of the initiatives related to the People group, such as the discoverability and attribution work happening in CD2H, and make them easier to implement at the institution level. It will also help build hub-level capacity.
Expected outputs (6 months):
- For CTSA hub end-users, we will deliver an interactive best-practice web portal that guides data providers in making their data more discoverable.
- For CD2H, the web portal will be one of the mechanisms for receiving feedback during the development of the best-practice guidebook.