Data Quality Methods and Tools to Support CTSA Hub Data Sharing

CD2H Phase 2 Proposal

Project Title: Data Quality Methods and Tools to Support CTSA Hub Data Sharing

Point Person:

Kari Stephens, kstephen@uw.edu, University of Washington

Timothy Bergquist, trberg@uw.edu, University of Washington

 

Elevator pitch:

Electronic Health Record (EHR) data must be tested for data quality when being shared for research. Data quality is typically measured in three categories: Conformance, Completeness, and Plausibility (Kahn et al., 2016 eGEMS). Many CTSA institutions have harmonized their EHR data to the Observational Medical Outcomes Partnership (OMOP) data model, yet no publicly available tool with a standard operating procedure (SOP) exists to easily assess and visualize data quality tests, particularly across institutions. This project will launch a publicly available data quality testing tool and SOP, configurable to any database environment and scalable to N OMOP datasets.

 

Project history:

During phase 1, we laid the groundwork to enable DREAM challenges using data from various CTSA sites, including using the UW DQe-c tool to test data quality. Specifically, we upgraded the DQe-c prototype to be optimized for OMOP v5 across multiple database platforms, while keeping it adaptable to other data models (e.g., i2b2, PCORnet CDM). The result of this effort is a more robust working DQe-c prototype, currently being used to evaluate OMOP data under consideration for inclusion in DREAM challenges. Tim Bergquist presented DQe-c by invitation to the OHDSI consortium at a national meeting, where it received significant positive feedback and interest in use outside CD2H. Tim will also present a DQe-c abstract at the Rocky Bioinformatics Conference in December 2018, in both an oral presentation and a poster.

 

This tool was originally developed within the UW CTSA to support the building and maintenance of an OMOP-based regional data network. The DARTNet Institute has partnered with Dr. Stephens in its development and has continued interest in its utility for use with their OMOP repositories. Two peer-reviewed papers detail DQe-c.

 

Estiri, H., Lovins, T., Afzalan, N., & Stephens, K. A. (2016). Applying a Participatory Design Approach to Build EHR-based Data Profiling Tools. AMIA Summits Translational Science Proceedings, July 20, 2016, 60-7. PMCID: PMC5001743

 

Estiri, H., Stephens, K. A., Klann, J. G., & Murphy, S. N. (2018). Exploring Completeness in Clinical Data Research Networks with DQe-c. Journal of the American Medical Informatics Association, 25, 17-24. PMCID: in progress

 

GitHub repo:

 

Project description:

EHR data must be tested for data quality when being shared for research. Data quality is typically measured in three categories: Conformance, Completeness, and Plausibility (Kahn et al., 2016 eGEMS). Harmonized datasets need to conform to an established standard format and vocabulary before any analysis can be done. They need to meet a bare minimum threshold of completeness (i.e., what percentage of values are null or empty). They also need to demonstrate a certain level of plausibility (i.e., do the data make sense for what is expected; are they believable and credible?). To date, most data sharing networks have developed internal protocols and tools to manage data harmonization, but no publicly available tool with a standard operating procedure exists to easily assess and visualize data quality tests across institutions. Therefore, data quality remains a problem that is tackled inconsistently, and only by high-level analytic teams where such teams are available.
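To make the three categories concrete, the sketch below (not DQe-c itself, which is a separate tool) shows one minimal check per category against a toy OMOP-style person table. The column names follow the OMOP CDM person table; the plausibility bounds and toy rows are illustrative assumptions.

```python
# Illustrative sketch of the three data quality categories
# (conformance, completeness, plausibility) on a toy OMOP-style
# "person" table. Thresholds and data are invented for illustration.

def completeness(rows, column):
    """Percentage of non-null values in a column (Completeness)."""
    values = [r.get(column) for r in rows]
    non_null = sum(1 for v in values if v is not None)
    return 100.0 * non_null / len(values) if values else 0.0

def conforms(rows, column, expected_type):
    """All non-null values have the expected type (Conformance)."""
    return all(isinstance(r.get(column), expected_type)
               for r in rows if r.get(column) is not None)

def plausible_birth_years(rows, lo=1900, hi=2024):
    """Birth years fall in a believable range (Plausibility)."""
    return all(lo <= r["year_of_birth"] <= hi
               for r in rows if r.get("year_of_birth") is not None)

persons = [
    {"person_id": 1, "year_of_birth": 1984, "gender_concept_id": 8507},
    {"person_id": 2, "year_of_birth": None, "gender_concept_id": 8532},
    {"person_id": 3, "year_of_birth": 1953, "gender_concept_id": 8507},
    {"person_id": 4, "year_of_birth": 2001, "gender_concept_id": None},
]

print(completeness(persons, "year_of_birth"))   # 75.0
print(conforms(persons, "person_id", int))      # True
print(plausible_birth_years(persons))           # True
```

In practice a tool like DQe-c runs checks of this kind over every table and column of a repository and renders the results as a dashboard, so sites can compare metrics side by side.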

 

Proposed Solution:

We propose to finalize an open-source software tool, DQe-c, that will perform data quality tests on datasets from across a data sharing network and visually present the data quality metrics for cross-institution comparison. We will also offer an easy-to-use set of documentation outlining a standard operating procedure for sites to follow, detailing a base level of data quality tests and instructions for using DQe-c. Deliverables will specifically include:

  1. DQe-c downloadable tool configured to OMOP v5, for both single and multiple instances of data repositories
  2. DQe-c standard operating procedure (SOP) set of documentation, including instructions to configure the tool to other data models
  3. A set of base-level data quality tests operationalized into DQe-c and the SOP, which can also serve as a succinct set of recommended data quality tests for any researcher looking to use electronic health data
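One way to operationalize a base-level test set so that the same definitions drive both the tool and the written SOP is to express the tests as data and loop a measurement function over them. The sketch below is a hypothetical illustration of that idea, not DQe-c's actual design; the table and column names come from the OMOP CDM, and the check names and thresholds are invented.

```python
# Hypothetical sketch: a base-level test set expressed as data, so the
# same definitions can appear verbatim in both the tool and the SOP.
# Check names and thresholds are illustrative assumptions.

BASE_TESTS = [
    {"table": "person", "column": "person_id",
     "check": "completeness", "threshold": 100.0},
    {"table": "person", "column": "year_of_birth",
     "check": "completeness", "threshold": 95.0},
    {"table": "condition_occurrence", "column": "condition_concept_id",
     "check": "in_vocabulary", "threshold": 99.0},
]

def run_tests(tests, measure):
    """Apply a measurement function to each test and flag failures."""
    report = []
    for t in tests:
        observed = measure(t["table"], t["column"], t["check"])
        report.append({**t, "observed": observed,
                       "passed": observed >= t["threshold"]})
    return report

# Stub measurement returning a fixed score, standing in for real
# queries against a site's repository.
report = run_tests(BASE_TESTS, lambda table, col, check: 100.0)
print(all(r["passed"] for r in report))  # True
```

Because each test carries its own threshold, a network can tune thresholds per table or per site while every site still runs the identical code, which is what makes cross-institution comparison meaningful.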

 

Benefit:

Our solutions will enable different sites to assess their data quality in a standardized manner, working with the same code and visualization dashboard to collectively and consistently establish data quality measure thresholds. CTSAs will benefit immediately in the following specific ways: 1) by summarizing our set of tests as a core set of recommendations for data quality testing, CTSAs can both adopt these recommended tests through their own proprietary methods within biomedical informatics cores and/or distribute these recommendations to researchers who receive these datasets; 2) CTSA affiliates will be able to download a tool for use within their own environments to test data quality within a specific data model repository or across multiple repositories.

Expected outputs (6 months):

  • Data quality testing tool (DQe-c) available to CTSA hubs and affiliates
  • Data quality testing tool standard operating procedures and documentation supporting local configuration
  • List of recommended minimum-level data quality tests to begin assuring that data are worthy of sharing