CD2H Phase 2 Proposal
Project Title: menRva: an interdisciplinary open research repository
Matt Carson, email@example.com, NU
Kristi Holmes, firstname.lastname@example.org, NU
Proper collection, indexing, and preservation are vital to the discovery and dissemination of scientific research output. However, many research communities continue to battle the problem of “silos” at the institutional level that hinder discovery of research output. We propose to build infrastructure that can be easily deployed and managed either locally or on a cloud-based platform to collect, record, preserve, and disseminate a wide range of digital works across the translational community to enhance their visibility, promote people and their expertise, support attribution of their work, aid discovery and access by the international scientific community, and support open and FAIR-TLC science. At the same time, we will use this tool to promote good data practice workflows, incorporate standards and persistent identifiers, and account for privacy standards required for translational research.
Data discovery work led by the Scripps team in Phase 1
Examples of other relevant foundational work include:
- A study on data management needs by NU’s Galter and Mudd Libraries determined that an index of records describing researchers’ datasets is widely desired.
- Preliminary efforts over the past year to survey investigators about their needs and perceptions regarding data discovery, preservation, and workflows, and to gather feature requirements
- A survey of the repository landscape across medical schools in the US (forthcoming paper)
- Analysis of user-supplied metadata in a health sciences institutional repository (Pastva, 2018)
- DigitalHub: A Repository Focused on the Future. (Ilik, et al., 2018)
- Almost half of references in reports on new and emerging nondrug health technologies are grey literature (Farrah and Mierzwinski-Urban, 2019)
Infrastructure that can be easily deployed and managed locally to collect, record, preserve, and disseminate a wide range of digital works across the translational workforce* is critical to enhance their visibility, promote people and their expertise, support attribution of their work, aid discovery and access by the international scientific community, and support open and FAIR-TLC science. Such an initiative requires a trusted framework for digital objects; good data practice workflows; incorporation of standards and persistent identifiers; incorporation of privacy considerations; and strategies to support implementation and integration as well as to incentivize individuals participating in such an ecosystem. Here we are developing an integrated, born-interoperable repository and data catalog to empower researchers as they engage in good data practices around research data management, licensing, preservation, credit, discovery, and reuse of these digital artefacts and data at their hubs. This next-generation repository is being built in partnership with CERN on their Invenio platform. Invenio is a safe, secure, scalable, and RESTful architecture that powers the repositories of CERN and many other organizations and is supported by a mature open source community. The software will be made openly available for local-level implementation by hubs, and we are working in parallel to make a cloud instance available for broad use.
*e.g., datasets, protocols, education or engagement materials, technical reports, supplemental materials, survey instruments
We propose development of an integrated, born-interoperable repository and data catalog to empower researchers as they engage in good data practices around research data management, licensing, preservation, credit, discovery, and reuse of these digital artefacts and data at their hubs.
We will leverage Invenio, a software platform created by the European Organization for Nuclear Research (CERN) and used as the group’s main repository for more than 10 years. The Invenio team recently released a modernized, safe, and scalable version of the code. Invenio 3.0 offers a number of advantages:
- Safe: Created with security and long-term preservation in mind.
- Scalable: Invenio is fast. It is designed to manage 100M+ records and petabytes of files, and data can be archived regardless of size.
- RESTful: Born for the web, Invenio is JSON-native and provides RESTful APIs out of the box, allowing apps to be built on top of it.
- Open: Invenio is 100% open source, licensed under the MIT license. Invenio supports open standards for open science.
- A robust community: Invenio has a large team of developers and an active open source community, is used by many organizations, and is offered under a SaaS model by TIND (a CERN spinoff). The underlying technology (Python, Flask) is widely supported.
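To illustrate what a JSON-native, RESTful design makes possible for apps built on top of the repository, the following is a minimal Python sketch of a client that builds a search request and reads record titles from a JSON response. The base URL, the `/api/records` path, the `q` and `size` parameters, and the `hits.hits[].metadata.title` response shape are assumptions made for illustration, not a confirmed description of the menRva deployment, and the sample response is made up.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; a real hub would substitute its own instance.
BASE_URL = "https://repository.example.org"


def build_search_url(query, size=10):
    """Return a search URL for an assumed /api/records endpoint."""
    params = urlencode({"q": query, "size": size})
    return f"{BASE_URL}/api/records?{params}"


def extract_titles(response_text):
    """Pull record titles out of an assumed JSON search response."""
    payload = json.loads(response_text)
    hits = payload.get("hits", {}).get("hits", [])
    return [hit.get("metadata", {}).get("title", "") for hit in hits]


# Made-up response used in place of a live network call.
SAMPLE_RESPONSE = json.dumps({
    "hits": {
        "total": 1,
        "hits": [{"id": "1", "metadata": {"title": "Survey instrument"}}],
    }
})

search_url = build_search_url("survey instrument", size=5)
titles = extract_titles(SAMPLE_RESPONSE)
```

Because the sketch separates URL construction from response parsing, either half could be adapted once the actual endpoint and record schema of a given instance are known.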
Current collaborative efforts with CERN support Next Generation Repositories (NGR) as envisioned by the Confederation of Open Access Repositories (COAR), of which Northwestern’s Galter Library is a member, along with other CTSA institution libraries. The NGR serves as a foundation for a distributed, globally networked infrastructure for scholarly communication and will support deployment of value-added services, making the system more research-centric and supportive of innovation. NU has a strong collaborative relationship with COAR and CERN and is collaborating with several universities and COAR to launch an NGR group in the United States with the aim of platform-agnostic, cross-repository interoperability.
Invenio will be further enhanced with a range of user-focused features such as easy user profile creation, record upload and metadata tagging options, incorporation of persistent identifiers, connection with a site’s authentication system, attribution for contributions, HTML signposting, and robust social and impact components. This system will help make research more findable, accessible, interoperable, and reusable in accordance with the FAIR data principles.
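As one illustration of how signposting and persistent identifiers might fit together, the following Python sketch formats typed links as an HTTP `Link` header value, which is one common way signposting links are exposed to machine clients. The helper name `signposting_link_header` and the example DOI are hypothetical, and nothing here describes how the repository itself will implement the feature.

```python
def signposting_link_header(links):
    """Format (target URL, relation type) pairs as one Link header value."""
    return ", ".join(f'<{url}>; rel="{rel}"' for url, rel in links)


# Made-up identifiers for a single landing page.
example_links = [
    ("https://doi.org/10.1234/example", "cite-as"),
    ("https://creativecommons.org/licenses/by/4.0/", "license"),
]

header_value = signposting_link_header(example_links)
```

A landing page served with such a header would let a harvester find the persistent identifier and the license without parsing the page itself.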
In parallel, we will expand our ongoing landscape assessment to capture what hubs need and want in a local data repository+index. What problems can this tool solve? How can it improve overall research workflow? We will also work with sites to create opportunities to participate in focus groups to examine wireframes and mockups, and later beta versions of the data index. Finally, we will work with other members of the community to identify opportunities to maximize discovery across different tools, maximize local use of these tools, and develop a series of use cases and best practices.
Through this tool, researchers can contribute information about their datasets and other digital objects, tag the records with subject terms and other keywords to maximize discoverability, offer citation and license information for datasets, respond to requests for dataset access, and monitor analytics of the records. By following nationally and internationally recognized metadata standards for data records, researchers can also ensure that their records will be discoverable and interoperable, capable of being shared with meta-repositories through standardized metadata harvesting protocols.
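The kind of record a researcher might contribute can be sketched as a small Python dictionary together with a completeness check. The field names and the list of required fields are hypothetical, chosen only to mirror the contributions named above (descriptive information, keywords, citation, and license); they are not drawn from any specific metadata standard, and the example record is made up.

```python
# Hypothetical required fields, chosen to mirror the contributions named above.
REQUIRED_FIELDS = ("title", "creators", "keywords", "license", "citation")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Made-up example record; the license has not been supplied yet.
record = {
    "title": "Clinic wait-time survey responses",
    "creators": ["Example, Researcher"],
    "keywords": ["survey", "wait time"],
    "license": "",
    "citation": "Example, R. Clinic wait-time survey responses.",
}

gaps = missing_fields(record)
```

A check of this kind could run at deposit time to prompt the researcher for whatever is still missing before the record is shared with other systems.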
While there are other related initiatives ongoing in the data catalog and data repository communities, to date these efforts have been disjointed on campuses and do not offer a “one-stop” simple workflow to support the registration, deposit, management, preservation, and discovery of data assets. These tools are often built on challenging technology stacks and offer little (if any) support for best data practices overall. Moreover, there is a great need for a distributed, globally networked infrastructure on top of which layers of value-added services can be deployed, making it more research-centric, open to and supportive of innovation and attribution, while also collectively managed by the scholarly community.
Also, by collaborating with other repositories and data catalogs, we can build a community of practice and work together to support teams on our shared journey toward good data practices, interoperability, and discoverability.
Expected outputs (6 months):
The deliverables for this project include:
- Version 1 of the repository with key first-round features, NGR updates (in progress and in collaboration), incorporation of metadata best practices (per CD2H), incorporation of attribution work to date, and other features.
- A local NU instance
- An AWS-based instance
- A GitHub repo where software can be downloaded, features requested, etc.
- Ongoing requirements gathering, landscape analysis, and testing – with results disseminated broadly
- Collaborations with other data catalog/repository stewards to identify shared priorities, opportunities for collaboration, and strategies to enable and support interoperability, data discovery, and sustainability and support for local projects.
- Engagement materials/activities (1-pager, demo video on YouTube, conference presentations, etc.)
- Roadmap for next phase of work.
Activities and outcomes to “change the data culture” on academic medical campuses, including:
- A “lingua data” of common terms, processes, and metadata.
- Resources to support federal and private funding agency requirements for data management, openness, and retention.
- Incorporation of appropriate national and local data management standards into routine workflows
- Development of services to collaborate with data owners to develop a workflow for the registration, deposit, management, and preservation of data assets.
- Enhanced training/support opportunities for researchers on policies, practices, and infrastructure for data management, citation, licensing, sharing, and preservation.
Other relevant work:
The European Open Science Cloud (EOSC), especially two new reports (Nov 2018) that provide excellent discussion and a roadmap for the technical and cultural aspects required to support open science:
- Prompting an EOSC in practice, a Report of the Commission 2nd High Level Expert Group on the European Open Science Cloud
- Turning FAIR into reality, a Report of the Commission FAIR Data Expert Group (FAIR Data EG)
Other EOSC publications addressing incentives, skills, etc.
- Recommendation on access to and preservation of scientific information
- Working group report: Providing researchers with the skills and competencies they need to practice Open Science