Journal:Extending an open-source tool to measure data quality: Case report on Observational Health Data Science and Informatics (OHDSI)

Full article title Extending an open-source tool to measure data quality: Case report on Observational Health Data Science and Informatics (OHDSI)
Journal BMJ Health & Care Informatics
Author(s) Dixon, Brian E.; Wen, Chen; French, Tony; Williams, Jennifer L.; Duke, Jon D.; Grannis, Shaun J.
Author affiliation(s) Indiana University–Purdue University Indianapolis, Regenstrief Institute, Georgia Tech Research Institute
Primary contact Email: bedixon at regenstrief dot org
Year published 2020
Volume and issue 27 (1)
Article # e100054
DOI 10.1136/bmjhci-2019-100054
ISSN 2632-1009
Distribution license Creative Commons Attribution-NonCommercial 4.0 International
Website https://informatics.bmj.com/content/27/1/e100054
Download https://informatics.bmj.com/content/bmjhci/27/1/e100054.full.pdf (PDF)

Abstract

Introduction: As the health system seeks to leverage large-scale data to inform population outcomes, the informatics community is developing tools for analyzing these data. To support data quality assessment within such a tool, we extended the open-source software Observational Health Data Sciences and Informatics (OHDSI) to incorporate new functions useful for population health.

Methods: We developed and tested methods to measure the completeness, timeliness, and entropy of information. The new data quality methods were applied to over 100 million clinical messages received from emergency department information systems for use in public health syndromic surveillance systems.

Discussion: While completeness and entropy methods were implemented by the OHDSI community, timeliness was not adopted as its context did not fit with the existing OHDSI domains. The case report examines the process and reasons for acceptance and rejection of ideas proposed to an open-source community like OHDSI.

Introduction

Observational research requires an information infrastructure that can gather, integrate, manage, analyze, and apply evidence to decision-making and operations in an enterprise. In healthcare, we currently seek to develop, implement, and operationalize learning health systems in which an expanding universe of electronic health data can be transformed into evidence through observational research and applied to clinical decisions and processes within health systems.[1][2]

Leveraging large-scale health data is challenging because clinical data generally derive from myriad smaller systems across diverse institutions and are captured for various intended uses through varying business processes. The result is variable data quality, limiting the utility of data for decision-making and application. To ensure data are fit for use at both the granular patient-level and the broader aggregate population-level, it is important to assess, monitor, and improve data quality.[3][4]

A growing body of knowledge documents abundant data quality challenges in healthcare. Liaw et al. examined the completeness and accuracy of emergency department information system (EDIS) data for identifying patients with select chronic diseases (e.g., type 2 diabetes mellitus, cardiovascular disease, and chronic obstructive pulmonary disease). They found that information on the target diseases was missing from EDIS discharge summaries in 11%–20% of cases.[5] Furthermore, an audit confirmed just 61% of diagnoses found in a query of the EDIS for the target conditions. Studies among integrated delivery networks and multiple provider organizations show similar results. A study of data from multiple laboratory information systems (LIS) transmitting electronic messages to public health departments found low completeness for a number of data elements critical to surveillance processes.[6]

Given poor data quality in health information systems, researchers as well as national organizations advocate for developing tools to enable standardized assessment, monitoring, and improvement of data quality.[3][4][7][8] For example, in the report from a National Science Foundation workshop on the learning health system, key research questions called for developing methods to curate data, compute fitness-for-use measures from the data themselves, and infer the strength of a data set based on its provenance.[9] Similar questions were posed by the National Academy of Medicine in its report on the role of observational studies in the learning health system.[10]

In this case report, we describe our experience extending an open-source tool, designed to facilitate observational studies, to support assessment of data quality for use cases in public health surveillance. First, we describe the tool and our use case within the discipline of public health. Next, we describe the data quality measurement enhancements we developed for the tool. Finally, we discuss our efforts to integrate the enhancements into the open-source tool for the benefit of others.

Methods

Observational Health Data Sciences and Informatics (OHDSI)

OHDSI (pronounced ‘Odyssey’) is a multistakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics.[11] The OHDSI collaborative consists of researchers and data scientists across academic, industry, and government organizations who seek to standardize observational health data for analysis and develop tools to support large-scale analytics across a range of use cases. The collaborative grew out of the Observational Medical Outcomes Partnership[12][13], with an initial focus on medical product safety surveillance. The OHDSI portfolio also includes work on comparative effectiveness research, as well as personalized risk prediction.[14][15]

To date, the collaborative has produced a body of knowledge on methods for analyzing large-scale health data. These methods have been embodied through a suite of tools available as open access software (available at https://www.ohdsi.org/analytic-tools/) that researchers and industry scientists can leverage in their work. The common data model (CDM), which harmonizes data across electronic medical record systems, is one example.[12] Another example is ACHILLES, which is a profiling tool for database characterization and data quality assessment.[16] Once data have been transformed into the CDM, ACHILLES can profile data characteristics, such as the age of an individual at first observation and gender stratification. The ACHILLES tool operationalizes the Kahn framework[17], a generic framework for data quality that consists of three components: conformance, completeness, and plausibility.

Extending OHDSI in support of syndromic surveillance

Our project sought to extend the OHDSI tools to support syndromic surveillance, an applied area within public health that focuses on monitoring clusters of symptoms and clinical features of an undiagnosed disease or health event in near real time, allowing for early detection as well as rapid response.[18] A public health measure for the U.S. "meaningful use" program, syndromic surveillance has been adopted by a number of state and large city health departments.[19] Despite this adoption, syndromic data quality can be poor and could benefit from monitoring and improvement strategies.[20][21][22]

Based on a thorough review of the literature as well as focus groups with syndromic surveillance experts, we focused on developing three data quality metrics that did not already exist within OHDSI. First, we developed methods for calculating the completeness of key data useful for surveillance, including age, race, and gender. Second, we built methods for measuring the timeliness with which syndromic data had been captured into the OHDSI environment. Third, we developed methods for analyzing the information entropy of the patient’s chief complaint or reason for visit. Each metric was developed and tested using the instance of OHDSI at the Regenstrief Institute. We further sought to commit our code to the OHDSI project, coordinating our development efforts with the OHDSI community.
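The third metric can be illustrated with a small calculation. The following is a minimal sketch that assumes Shannon entropy computed over the empirical distribution of distinct chief complaint strings; this section does not state the exact formula or text normalization used, so the function name, the lower-casing step, and the sample complaints are all hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical chief complaints (not taken from the study data).
complaints = ["chest pain", "Chest Pain", "fever", "cough"]

# Assumed normalization: trim whitespace and lower-case before counting.
normalized = [c.strip().lower() for c in complaints]

entropy_bits = shannon_entropy(normalized)
```

Under this sketch, a field in which every record carries the same text has an entropy of zero, while a field with many equally frequent distinct values has a high entropy.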

Extending OHDSI requires developing scripts to retrieve data from the CDM, developing scripts to analyze the retrieved data, and enhancing the interface that displays the retrieved or analyzed data. Retrieving data from the CDM involves constructing Structured Query Language (SQL) scripts that query the OHDSI data store. At Regenstrief, the OHDSI data store is an Oracle database configured to support the CDM (see Figure 1). Once retrieved, data can be displayed to users in ATLAS, a unified interface for data and analytics. Modifying the ATLAS WebAPI enables developers to simply display data retrieved from the CDM or perform analyses of the data, which are then displayed to the user as reports.



Figure 1. Technical architecture for the data analytics environment. Data are sent from the source hospitals to the health information exchange. The data are replicated at the Regenstrief Institute, where they are extracted, transformed and loaded into the common data model. Once in the OMOP data store, the data can be queried by researchers and assessed for data quality. ETL, extract, transform, load; INPC, Indiana Network for Patient Care; INPCR, INPC for research; PHESS, Public Health Emergency Surveillance System; OHDSI, Observational Health Data Sciences and Informatics; OMOP, Observational Medical Outcomes Partnership

To test the functions we developed for OHDSI, we extracted, transformed, and loaded data from admission, discharge, and transfer messages received from 124 hospitals for the Indiana Public Health Emergency Surveillance System, Indiana’s syndromic surveillance system[23] (see Figure 1). The messages spanned the years 2011–2014 and represented 9,014,601 emergency department encounters for 5,407,055 unique patients. Once transformed into the CDM, the data were loaded into the OHDSI database. The patient’s chief complaint is stored in the CDM as an observation.

The syndromic data were retrieved and analyzed using the ATLAS tool. A cohort was defined as all patients with an encounter between January 1, 2011 and December 31, 2014, where the patient possessed an observation type of "chief complaint" (CONCEPT_ID=38000282). Only the first chief complaint observation for a patient was returned. Once extracted from the OHDSI database, the cohort was analyzed using the added functionality in ATLAS, and the results were made available to users in reports for review.
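The cohort logic described above can be sketched as follows. This is an illustration only, not the SQL used in the project: it runs against an in-memory SQLite table rather than the Oracle data store, it uses a reduced observation table whose column names follow OMOP CDM conventions (an assumption), and it filters on the observation date as a stand-in for the encounter date. The function name and sample rows are hypothetical.

```python
import sqlite3

CHIEF_COMPLAINT_TYPE = 38000282  # concept ID given in the text


def first_chief_complaints(conn, start="2011-01-01", end="2014-12-31"):
    """Map each person_id to its earliest chief complaint in the window.

    Returns {person_id: (observation_date, value_as_string)}.
    """
    rows = conn.execute(
        """
        SELECT person_id, observation_date, value_as_string
        FROM observation
        WHERE observation_type_concept_id = ?
          AND observation_date BETWEEN ? AND ?
        ORDER BY person_id, observation_date, observation_id
        """,
        (CHIEF_COMPLAINT_TYPE, start, end),
    )
    first = {}
    for person_id, obs_date, complaint in rows:
        # Rows arrive in date order, so keep only the first per patient.
        first.setdefault(person_id, (obs_date, complaint))
    return first


# Hypothetical, simplified table and rows for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           observation_id INTEGER PRIMARY KEY,
           person_id INTEGER,
           observation_date TEXT,
           observation_type_concept_id INTEGER,
           value_as_string TEXT)"""
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "2012-03-01", 38000282, "chest pain"),
        (2, 1, "2013-06-15", 38000282, "fever"),
        (3, 2, "2010-12-31", 38000282, "cough"),  # outside the date window
        (4, 2, "2014-01-10", 38000282, "headache"),
        (5, 3, "2012-05-05", 0, "not a chief complaint"),
    ],
)
cohort = first_chief_complaints(conn)
```

Dates are stored as ISO 8601 text so that the `BETWEEN` comparison orders them correctly in SQLite; a production database would use a native date type.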

Functionality developed to facilitate syndromic data quality assessment

Completeness

Prior work[3][6][24] shows that public health agencies strongly desire complete data on age, gender, ethnicity, and race because they are tasked with examining and reporting on health disparities. Therefore, we modified ATLAS to calculate the completeness of these data fields as defined by the CDM. Completeness was measured as the proportion of patients with a corresponding value stored in the OHDSI database for each field. We further modified the ATLAS WebAPI to visualize the completeness measures. Figure 2 depicts completeness of data for race, ethnicity, and gender stratified by age.



Figure 2. Screenshot of the OHDSI ATLAS tool displaying data completeness of the age variable for a population. OHDSI, Observational Health Data Sciences and Informatics
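The completeness measure defined above, the proportion of patients with a stored value for a field, can be sketched in a few lines. This is a minimal illustration rather than the ATLAS implementation: the function name, the record layout, and the sample values are hypothetical, and treating a concept ID of 0 as "missing" is an assumption about how unmapped values are coded.

```python
def completeness(records, field, missing_values=(None, "")):
    """Proportion of records whose `field` holds a non-missing value."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) not in missing_values)
    return present / len(records)


# Hypothetical patient records with illustrative values.
patients = [
    {"person_id": 1, "year_of_birth": 1980, "gender_concept_id": 8507, "race_concept_id": 8527},
    {"person_id": 2, "year_of_birth": 1975, "gender_concept_id": 8532, "race_concept_id": 0},
    {"person_id": 3, "year_of_birth": None, "gender_concept_id": 8507, "race_concept_id": 0},
    {"person_id": 4, "year_of_birth": 2001, "gender_concept_id": 0, "race_concept_id": 8516},
]

# Assumption: a concept ID of 0 means no usable value was recorded.
UNMAPPED = (None, 0)

birth_year_completeness = completeness(patients, "year_of_birth")
gender_completeness = completeness(patients, "gender_concept_id", UNMAPPED)
race_completeness = completeness(patients, "race_concept_id", UNMAPPED)
```

The missing-value markers are passed per field because a value that signals absence in one field (such as 0 for a concept ID) can be legitimate in another (such as an age of 0 for an infant).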



References

  1. Dixon, B.E.; Whipple, E.C.; Lajiness, J.M. et al. (2016). "Utilizing an integrated infrastructure for outcomes research: A systematic review". Health Information and Libraries Journal 33 (1): 7–32. doi:10.1111/hir.12127. PMID 26639793. 
  2. Institute of Medicine (2011). Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. National Academies Press. doi:10.17226/12912. ISBN 9780309225014. 
  3. Dixon, B.E.; Rosenman, M.; Xia, Y. et al. (2013). "A vision for the systematic monitoring and improvement of the quality of electronic health data". Studies in Health Technology and Informatics 192: 884–8. doi:10.3233/978-1-61499-289-9-884. PMID 23920685.
  4. Weiskopf, N.G.; Bakken, S.; Hripcsak, G. et al. (2017). "A Data Quality Assessment Guideline for Electronic Health Record Data Reuse". EGEMS 5 (1): 14. doi:10.5334/egems.218. PMC PMC5983018. PMID 29881734. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5983018.
  5. Liaw, S.-T.; Chen, H.-Y.; Maneze, D. et al. (2012). "Health reform: Is routinely collected electronic information fit for purpose?". Emergency Medicine Australasia 24 (1): 57–63. doi:10.1111/j.1742-6723.2011.01486.x. PMID 22313561.
  6. Dixon, B.E.; Siegel, J.A.; Oemig, T.V. et al. (2013). "Electronic health information quality challenges and interventions to improve public health surveillance data and practice". Public Health Reports 128 (6): 546–53. doi:10.1177/003335491312800614. PMC PMC3804098. PMID 24179266. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3804098.
  7. Martin, E.G.; Law, J.; Ran, W. et al. (2017). "Evaluating the Quality and Usability of Open Data for Public Health Research: A Systematic Review of Data Offerings on 3 Open Data Platforms". Journal of Public Health Management and Practice 23 (4): e5–e13. doi:10.1097/PHH.0000000000000388. PMID 26910872.
  8. Botts, N.; Bouhaddou, O.; Bennett, J. et al. (2014). "Data Quality and Interoperability Challenges for eHealth Exchange Participants: Observations from the Department of Veterans Affairs' Virtual Lifetime Electronic Record Health Pilot Phase". AMIA Annual Symposium Proceedings 2014: 307–14. PMC PMC4419918. PMID 25954333. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4419918. 
  9. Friedman, C.; Rubin, J.; Brown, J. et al. (2015). "Toward a science of learning systems: a research agenda for the high-functioning Learning Health System". JAMIA 22 (1): 43–50. doi:10.1136/amiajnl-2014-002977. PMC PMC4433378. PMID 25342177. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4433378.
  10. Institute of Medicine (2013). Observational Studies in a Learning Health System: Workshop Summary. National Academies Press. doi:10.17226/18438. ISBN 9780309290845. 
  11. Hripcsak, G.; Duke, J.D.; Shah, N.H. et al. (2015). "Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers". Studies in Health Technology and Informatics 216: 574–8. PMC PMC4815923. PMID 26262116. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4815923. 
  12. Overhage, J.M.; Ryan, P.B.; Reich, C.G. et al. (2012). "Validation of a common data model for active safety surveillance research". JAMIA 19 (1): 54–60. doi:10.1136/amiajnl-2011-000376. PMC PMC3240764. PMID 22037893. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3240764.
  13. Stang, P.E.; Ryan, P.B.; Racoosin, J.A. et al. (2010). "Advancing the science for active surveillance: Rationale and design for the Observational Medical Outcomes Partnership". Annals of Internal Medicine 153 (9): 600–6. doi:10.7326/0003-4819-153-9-201011020-00010. PMID 21041580.
  14. Duke, J.D.; Ryan, P.B.; Suchard, M.A. et al. (2017). "Risk of angioedema associated with levetiracetam compared with phenytoin: Findings of the observational health data sciences and informatics research network". Epilepsia 58 (8): e101–e106. doi:10.1111/epi.13828. PMC PMC6632067. PMID 28681416. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6632067.
  15. Boland, M.R.; Shahn, Z.; Madigan, D. et al. (2015). "Birth month affects lifetime disease risk: A phenome-wide method". JAMIA 22 (5): 1042–53. doi:10.1093/jamia/ocv046. PMC PMC4986668. PMID 26041386. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4986668. 
  16. Huser, V.; DeFalco, F.J.; Schuemie, M. et al. (2016). "Multisite Evaluation of a Data Quality Tool for Patient-Level Clinical Data Sets". EGEMS 4 (1): 1239. doi:10.13063/2327-9214.1239. PMC PMC5226382. PMID 28154833. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5226382. 
  17. Kahn, M.G.; Callahan, T.J.; Barnard, J. et al. (2016). "A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data". EGEMS 4 (1): 1244. doi:10.13063/2327-9214.1244. PMC PMC5051581. PMID 27713905. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5051581. 
  18. Hoyt, R.E.; Hersh, W.R. (2018). Health Informatics: Practical Guide (7th ed.). Lulu. ISBN 9781387642410. 
  19. Williams, K.S.; Shah, G.H. (2016). "Electronic Health Records and Meaningful Use in Local Health Departments: Updates From the 2015 NACCHO Informatics Assessment Survey". Journal of Public Health Management and Practice 22 (Suppl. 6): S27–S33. doi:10.1097/PHH.0000000000000460. PMC PMC5050007. PMID 27684614. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5050007. 
  20. Doroshenko, A.; Cooper, D.; Smith, G. et al. (2005). "Evaluation of syndromic surveillance based on National Health Service Direct derived data--England and Wales". MMWR Supplements 54: 117–22. PMID 16177702. 
  21. Buehler, J.W.; Sonricker, A.; Paladini, M. et al. (2008). "Syndromic Surveillance Practice in the United States: Findings from a Survey of State, Territorial, and Selected Local Health Departments". Advances in Disease Surveillance 6: 1–20. http://faculty.washington.edu/lober/www.isdsjournal.org/htdocs/volume6.php.
  22. Ong, M.-S.; Magrabi, F.; Coiera, E. (2013). "Syndromic surveillance for health information system failures: A feasibility study". JAMIA 20 (3): 506–12. doi:10.1136/amiajnl-2012-001144. PMC PMC3628054. PMID 23184193. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3628054.
  23. Grannis, S.; Wade, M.; Gibson, J. et al. (2006). "The Indiana Public Health Emergency Surveillance System: Ongoing progress, early findings, and future directions". AMIA Annual Symposium Proceedings 2006: 304–8. PMC PMC1839268. PMID 17238352. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1839268. 
  24. Dixon, B.E.; Lai, P.T.S.; Grannis, S.J. et al. (2013). "Variation in information needs and quality: Implications for public health surveillance and biomedical informatics". AMIA Annual Symposium Proceedings 2013: 670–9. PMC PMC3900209. PMID 24551368. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900209. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation and grammar.