Journal:Extending an open-source tool to measure data quality: Case report on Observational Health Data Science and Informatics (OHDSI)

Full article title Extending an open-source tool to measure data quality: Case report on Observational Health Data Science and Informatics (OHDSI)
Journal BMJ Health & Care Informatics
Author(s) Dixon, Brian E.; Wen, Chen; French, Tony; Williams, Jennifer L.; Duke, Jon D.; Grannis, Shaun J.
Author affiliation(s) Indiana University–Purdue University Indianapolis, Regenstrief Institute, Georgia Tech Research Institute
Primary contact Email: bedixon at regenstrief dot org
Year published 2020
Volume and issue 27 (1)
Article # e100054
DOI 10.1136/bmjhci-2019-100054
ISSN 2632-1009
Distribution license Creative Commons Attribution-NonCommercial 4.0 International


Introduction: As the health system seeks to leverage large-scale data to inform population outcomes, the informatics community is developing tools for analyzing these data. To support data quality assessment within such a tool, we extended the open-source software Observational Health Data Sciences and Informatics (OHDSI) to incorporate new functions useful for population health.

Methods: We developed and tested methods to measure the completeness, timeliness, and entropy of information. The new data quality methods were applied to over 100 million clinical messages received from emergency department information systems for use in public health syndromic surveillance systems.

Discussion: While completeness and entropy methods were implemented by the OHDSI community, timeliness was not adopted as its context did not fit with the existing OHDSI domains. The case report examines the process and reasons for acceptance and rejection of ideas proposed to an open-source community like OHDSI.


Observational research requires an information infrastructure that can gather, integrate, manage, analyze, and apply evidence to decision-making and operations in an enterprise. In healthcare, we currently seek to develop, implement, and operationalize learning health systems in which an expanding universe of electronic health data can be transformed into evidence through observational research and applied to clinical decisions and processes within health systems.[1][2]

Leveraging large-scale health data is challenging because clinical data generally derive from myriad smaller systems across diverse institutions and are captured for various intended uses through varying business processes. The result is variable data quality, limiting the utility of data for decision-making and application. To ensure data are fit for use at both the granular patient-level and the broader aggregate population-level, it is important to assess, monitor, and improve data quality.[3][4]

A growing body of knowledge documents abundant data quality challenges in healthcare. Liaw et al. examined the completeness and accuracy of emergency department information system (EDIS) data for identifying patients with select chronic diseases (e.g., type 2 diabetes mellitus, cardiovascular disease, and chronic obstructive pulmonary disease). They found that information on the target diseases was missing from EDIS discharge summaries in 11%–20% of cases.[5] Furthermore, an audit confirmed just 61% of the diagnoses found in a query of the EDIS for the target conditions. Studies among integrated delivery networks and multiple provider organizations show similar results. A study of data from multiple laboratory information systems (LIS) transmitting electronic messages to public health departments found low completeness for a number of data elements critical to surveillance processes.[6]

Given poor data quality in health information systems, researchers as well as national organizations advocate for developing tools to enable standardized assessment, monitoring, and improvement of data quality.[3][4][7][8] For example, in the report from a National Science Foundation workshop on the learning health system, key research questions called for developing methods to curate data, compute fitness-for-use measures from the data themselves, and infer the strength of a data set based on its provenance.[9] Similar questions were posed by the National Academy of Medicine in its report on the role of observational studies in the learning health system.[10]

In this case report, we describe our experience extending an open-source tool, designed to facilitate observational studies, to support assessment of data quality for use cases in public health surveillance. First, we describe the tool and our use case within the discipline of public health. Next, we describe the data quality measurement enhancements we developed for the tool. Finally, we discuss our efforts to integrate the enhancements into the open-source tool for the benefit of others.


Observational Health Data Sciences and Informatics (OHDSI)

OHDSI (pronounced ‘Odyssey’) is a multistakeholder, interdisciplinary collaborative to bring out the value of health data through large-scale analytics.[11] The OHDSI collaborative consists of researchers and data scientists across academic, industry, and government organizations who seek to standardize observational health data for analysis and develop tools to support large-scale analytics across a range of use cases. The collaborative grew out of the Observational Medical Outcomes Partnership[12][13], with an initial focus on medical product safety surveillance. The OHDSI portfolio also includes work on comparative effectiveness research, as well as personalized risk prediction.[14][15]

To date, the collaborative has produced a body of knowledge on methods for analyzing large-scale health data. These methods have been embodied in a suite of tools, available as open-access software, that researchers and industry scientists can leverage in their work. The common data model (CDM), which harmonizes data across electronic medical record (EMR) systems, is one example.[12] Another example is ACHILLES, which is a profiling tool for database characterization and data quality assessment.[16] Once data have been transformed into the CDM, ACHILLES can profile data characteristics, such as the age of an individual at first observation and gender stratification. The ACHILLES tool operationalizes the Kahn framework[17], a generic framework for data quality that consists of three components: conformance, completeness, and plausibility.

Extending OHDSI in support of syndromic surveillance

Our project sought to extend the OHDSI tools to support syndromic surveillance, an applied area within public health that focuses on monitoring clusters of symptoms and clinical features of an undiagnosed disease or health event in near real time, allowing for early detection as well as rapid response.[18] A public health measure for the U.S. "meaningful use" program, syndromic surveillance has been adopted by a number of state and large city health departments.[19] Although syndromic surveillance is widely adopted and used, syndromic data quality can be poor and could benefit from monitoring and improvement strategies.[20][21][22]

Based on a thorough review of the literature as well as focus groups with syndromic surveillance experts, we focused on developing three data quality metrics that did not already exist within OHDSI. First, we developed methods for calculating the completeness of key data useful for surveillance, including age, race, and gender. Second, we built methods for measuring the timeliness with which syndromic data had been captured into the OHDSI environment. Third, we developed methods for analyzing the information entropy of the patient’s chief complaint or reason for visit. Each metric was developed and tested using the instance of OHDSI at the Regenstrief Institute. We further sought to commit our code to the OHDSI project, coordinating our development efforts with the OHDSI community.

Extending OHDSI requires developing scripts to retrieve data from the CDM, developing scripts to analyze the retrieved data, and enhancing the interface that displays the retrieved or analyzed data. Retrieving data from the CDM involves constructing Structured Query Language (SQL) scripts that query the OHDSI data store. At Regenstrief, the OHDSI data store is an Oracle database configured to support the CDM (see Figure 1). Once retrieved, data can be displayed to users in ATLAS, a unified interface for data and analytics. Modifying the ATLAS WebAPI enables developers to simply display data retrieved from the CDM or perform analyses of the data, which are then displayed to the user as reports.


Figure 1. Technical architecture for the data analytics environment. Data are sent from the source hospitals to the health information exchange. The data are replicated at the Regenstrief Institute, where they are extracted, transformed and loaded into the common data model. Once in the OMOP data store, the data can be queried by researchers and assessed for data quality. ETL, extract, transform, load; INPC, Indiana Network for Patient Care; INPCR, INPC for research; PHESS, Public Health Emergency Surveillance System; OHDSI, Observational Health Data Sciences and Informatics; OMOP, Observational Medical Outcomes Partnership

To test the functions we developed for OHDSI, we extracted, transformed, and loaded data from admission, discharge, and transfer messages received from 124 hospitals for the Indiana Public Health Emergency Surveillance System, Indiana’s syndromic surveillance system[23] (see Figure 1). The messages spanned the years 2011–2014 and represented 9,014,601 emergency department (ED) encounters for 5,407,055 unique patients. Once transformed into the CDM, the data were loaded into the OHDSI database. The patient’s chief complaint is stored in the CDM as an observation.

The syndromic data were retrieved and analyzed using the ATLAS tool. A cohort was defined as all patients with an encounter between January 1, 2011 and December 31, 2014, where the patient possessed an observation type of "chief complaint" (CONCEPT_ID=38000282). Only the first chief complaint observation for a patient was returned. Once extracted from the OHDSI database, the cohort was analyzed using the added functionality in ATLAS, and the results were made available to users in reports for review.
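The cohort rule described above can be sketched as a single SQL query. The example below is only an illustration, not the project's actual ATLAS cohort definition: it assumes a simplified `observation` table with a handful of CDM-style columns, runs against an in-memory SQLite database rather than the Oracle data store used at Regenstrief, and uses made-up rows (the concept ID 99999999 is a placeholder for "some other observation type"). The window function requires SQLite 3.25 or later.

```python
import sqlite3

CHIEF_COMPLAINT_TYPE = 38000282  # observation type concept ID quoted in the text


def first_chief_complaints(conn, start, end):
    """Return each patient's earliest chief complaint observation in [start, end].

    Dates are ISO-8601 strings, so string comparison matches date order.
    """
    sql = """
        SELECT person_id, observation_date, value_as_string
        FROM (
            SELECT person_id, observation_date, value_as_string,
                   ROW_NUMBER() OVER (
                       PARTITION BY person_id
                       ORDER BY observation_date, observation_id
                   ) AS rn
            FROM observation
            WHERE observation_type_concept_id = ?
              AND observation_date BETWEEN ? AND ?
        )
        WHERE rn = 1
        ORDER BY person_id
    """
    return conn.execute(sql, (CHIEF_COMPLAINT_TYPE, start, end)).fetchall()


def demo_connection():
    """Build a tiny, hypothetical observation table for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE observation (
               observation_id INTEGER PRIMARY KEY,
               person_id INTEGER,
               observation_type_concept_id INTEGER,
               observation_date TEXT,
               value_as_string TEXT)"""
    )
    rows = [
        (1, 101, 38000282, "2011-03-01", "chest pain"),
        (2, 101, 38000282, "2012-07-15", "cough"),     # later complaint, dropped
        (3, 102, 38000282, "2015-01-02", "fever"),     # outside the date window
        (4, 103, 99999999, "2013-05-05", "n/a"),       # placeholder non-complaint type
        (5, 104, 38000282, "2014-12-31", "headache"),
    ]
    conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)
    return conn


cohort = first_chief_complaints(demo_connection(), "2011-01-01", "2014-12-31")
print(cohort)
# [(101, '2011-03-01', 'chest pain'), (104, '2014-12-31', 'headache')]
```

Ties on the same date are broken by `observation_id` here; that tie-break is an assumption, since the text does not say how same-day complaints were ordered.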

Functionality developed to facilitate syndromic data quality assessment


Completeness

Based on prior work[3][6][24], public health agencies strongly desire to have complete data on age, gender, ethnicity, and race. This is because public health agencies are tasked with examining and reporting on health disparities. Therefore, we modified ATLAS to calculate the completeness of these data fields as defined by the CDM. Completeness was measured as the proportion of patients with a corresponding value stored in the OHDSI database for each field. We further modified the ATLAS WebAPI to visualize the completeness measures. Figure 2 depicts completeness of data for race, ethnicity, and gender stratified by age.
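As a rough illustration of the completeness measure defined above (the proportion of patients with a stored value for each field), the following sketch computes it over a list of person records. It is not the code merged into ATLAS. The field names follow CDM-style naming, the records and concept IDs are illustrative, and treating a concept ID of 0 as "missing" is an assumption made for this example.

```python
FIELDS = ["year_of_birth", "gender_concept_id", "race_concept_id", "ethnicity_concept_id"]


def is_missing(value):
    """Treat None, empty strings, and concept ID 0 as missing (an assumption)."""
    return value is None or value == "" or value == 0


def completeness(persons, fields):
    """Return {field: proportion of persons with a non-missing value}."""
    total = len(persons)
    result = {}
    for field in fields:
        present = sum(1 for p in persons if not is_missing(p.get(field)))
        result[field] = present / total if total else float("nan")
    return result


# Hypothetical records for illustration only.
persons = [
    {"year_of_birth": 1980, "gender_concept_id": 8507, "race_concept_id": 8527, "ethnicity_concept_id": 38003564},
    {"year_of_birth": 1975, "gender_concept_id": 8532, "race_concept_id": 0, "ethnicity_concept_id": 0},
    {"year_of_birth": None, "gender_concept_id": 8532, "race_concept_id": 8516, "ethnicity_concept_id": 0},
    {"year_of_birth": 2001, "gender_concept_id": 8507, "race_concept_id": 0, "ethnicity_concept_id": 38003563},
]

print(completeness(persons, FIELDS))
# {'year_of_birth': 0.75, 'gender_concept_id': 1.0, 'race_concept_id': 0.5, 'ethnicity_concept_id': 0.5}
```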


Figure 2. Screenshot of the OHDSI ATLAS tool displaying data completeness of the age variable for a population. OHDSI, Observational Health Data Sciences and Informatics


Timeliness

Timeliness is a critical data quality metric, as timely information about population health is necessary to inform responses to potential disease outbreaks. Therefore, we modified ATLAS to calculate the timeliness of records added to the OHDSI CDM database. Timeliness was measured as the difference, in days, between the date of an observation about a given patient stored in the source EMR system and the date when the observation was created within the CDM data store. This measure essentially represents the "delay" (measured in days) between when data were first generated and when data were added to the OHDSI instance running at Regenstrief. To enhance ATLAS, we added a new data element to the CDM. Specifically, we created a column labelled "row_created_db_time" in the "observation" table. This field enables calculation of the difference between this date timestamp and the observation date. ATLAS was further modified to display the timeliness metric as a line chart visualization that displays the average "delay" over time for observations in the cohort.
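The delay calculation can be sketched in a few lines. This is a minimal example under stated assumptions, not the ATLAS implementation: each observation is represented as a pair of the observation date and the `row_created_db_time` timestamp, the sample values are invented, and averaging by calendar month of the observation is an assumption about how the "average delay over time" series might be bucketed.

```python
from collections import defaultdict
from datetime import date, datetime


def delay_days(observation_date, row_created_db_time):
    """Whole days between the observation date and the row creation timestamp."""
    return (row_created_db_time.date() - observation_date).days


def mean_delay_by_month(observations):
    """Average delay per observation month, as points for a line chart."""
    buckets = defaultdict(list)
    for obs_date, created in observations:
        buckets[obs_date.strftime("%Y-%m")].append(delay_days(obs_date, created))
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}


# Hypothetical (observation_date, row_created_db_time) pairs.
observations = [
    (date(2014, 1, 10), datetime(2014, 1, 12, 8, 30)),   # 2 days
    (date(2014, 1, 20), datetime(2014, 1, 24, 23, 0)),   # 4 days
    (date(2014, 2, 3), datetime(2014, 2, 4, 1, 15)),     # 1 day
]

print(mean_delay_by_month(observations))
# {'2014-01': 3.0, '2014-02': 1.0}
```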

Information entropy

A final characteristic of data quality we developed for OHDSI was information entropy. Information entropy is the average rate at which information is produced by a stochastic source of data. We hypothesized the metric would be useful for monitoring changes in the information communicated by a data source (e.g., hospital, ED) to a health department. Shannon's definition of entropy, when applied to an information source, can determine the minimum channel capacity required to reliably transmit the source as encoded binary digits. The formula can be derived by calculating the mathematical expectation of the amount of information contained in a digit from the information source. We used the metric to examine the amount of information represented in a patient’s chief complaint, which can also be referred to as the reason for visit. If monitored over time, changes in entropy may signal a change in the information coming from a given health facility. Detection of a change might indicate an emerging health threat. Entropy of chief complaints is depicted in Figure 3.
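For a discrete source whose distinct values occur with probabilities p_i, Shannon entropy in bits is H = -Σ p_i log2(p_i). The sketch below applies that formula to a list of chief complaints. It makes an assumption the text does not confirm: each whole complaint string (lowercased and trimmed) is treated as one symbol. The authors may have computed entropy at a different granularity, such as words or characters, and the sample complaints are invented.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probabilities = [c / total for c in counts.values()]
    return sum(-p * math.log2(p) for p in probabilities)


def normalize(complaint):
    """Light normalization so trivially different spellings count as one value."""
    return complaint.strip().lower()


# Hypothetical chief complaints for illustration only.
complaints = ["Chest pain", "cough", "COUGH ", "fever"]
entropy_bits = shannon_entropy(normalize(c) for c in complaints)
print(entropy_bits)
# 1.5
```

Computing this value per facility and per time window would give the kind of series in which a sudden shift could be noticed, which is the monitoring use the paragraph above hypothesizes.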


Figure 3. Information entropy of patient chief complaints aggregated across multiple emergency departments from 2011 through 2014.


Making enhancements in OHDSI available to others

Because OHDSI is a community collaborative built around a set of open-source tools and ideas, we sought to ensure the functionality developed to support syndromic surveillance was available to others. To that end, we engaged with the community when developing each function. Our lead developer (CW) engaged the CDM and Vocabulary Development Working Group, as well as the ATLAS & WebAPI Working Group and the Architecture Working Group, to facilitate discussion and adoption of the new functions. The CDM and architecture groups were necessary because we requested that a new data element be created. New feature requests were submitted to each group. Requests were scheduled for discussion on regular conference calls, which were documented on the working group wiki site.[25] After approval of a change request, CW developed and tested the code locally within the Regenstrief development environment. Investigators BED and SJG reviewed the new functions and reports. Once the code was developed, the OHDSI team reviewed and then merged it into the OHDSI GitHub repository. Our functions were then available to others for immediate use in the next release of the OHDSI tools.

In the end, functions to calculate completeness of certain demographic fields, as well as information entropy of the chief complaint field, were adopted by the OHDSI community. Users with ATLAS and the WebAPI (V.2.3 and higher) can run a full cohort analysis, which generates the completeness and entropy measures as standard reports. The changes extend the existing tool set, as well as more fully operationalize the framework for data quality of Kahn et al.[17] adopted by OHDSI.

Timeliness was ultimately rejected by the OHDSI community and therefore is not part of ATLAS or the WebAPI. The discussion and decision of the OHDSI community for this proposed functionality can be found online.[26] While testing revealed the timeliness measurement could be performed and visualized, the community did not perceive the function as valuable to the broader OHDSI community. Most uses of OHDSI center on observational studies that utilize EMR data extracted retrospectively at regular time intervals (e.g., quarterly) from their source. Therefore, timeliness in most cases will be of little interest since it is a fixed difference between the date of the ETL process and the date of the observation.

While epidemiologists need to monitor the timeliness with which data are reported to public health, this assessment is pertinent to the operational syndromic system and data feeds. Once extracted from Health Level 7 (HL7) messages, transformed to the CDM, and loaded into an OHDSI platform, timeliness also becomes fixed and difficult for the epidemiologist to interpret or act on. In our examinations of timeliness for the millions of encounters, there was a singular, linear trend for timeliness based on the date of the ED visit. It was impossible to detect any kind of broken data feed or system downtime using the timeliness report in ATLAS. Tools to assess timeliness are better suited upstream in the data collection and management process within a public health department.
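A small numeric illustration may help show why a retrospective load produces the linear trend described above. The sketch assumes, purely for illustration, that every record is written to the data store in a single ETL run on one hypothetical load date. Under that assumption the delay is simply the load date minus the visit date, so it falls by exactly one day for each day the visit date advances and carries no information about feed outages.

```python
from datetime import date, timedelta


def delays_for_bulk_load(visit_dates, load_date):
    """Delay in days for each visit when all rows are created on load_date."""
    return [(load_date - d).days for d in visit_dates]


# Hypothetical visits spaced 30 days apart, all loaded on 2015-01-01.
visits = [date(2014, 1, 1) + timedelta(days=30 * i) for i in range(4)]
delays = delays_for_bulk_load(visits, date(2015, 1, 1))
steps = [later - earlier for earlier, later in zip(delays, delays[1:])]

print(delays)  # [365, 335, 305, 275]
print(steps)   # [-30, -30, -30]
```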

Lessons for the broader informatics community

This case illustrates an important theory relevant to biomedical informatics applications: data quality as "fit for use" in a biomedical context. Information science theory defines data quality as a set of dimensions characterizing how well data are fit for use by consumers.[27][28] These dimensions include, among others, accuracy, granularity, completeness, and timeliness. When the context of data use changes, what constitutes good data quality (e.g., which characteristics are important to the user) will change concurrently. This case study illustrates "fit for use" for the data characteristics of completeness and timeliness. With respect to completeness, the context of use for epidemiologists, as well as observational researchers, is the same. In both cases, the user is interested in the proportion of patients or observations with a missing value in the record. Therefore, the OHDSI community saw value in adopting this data characteristic as a component of the OHDSI tool set. Because the contexts of use are different for public health surveillance versus observational research, a timeliness measure did not have value and was therefore rejected from the OHDSI tools.

The case further illustrates the importance of involving a diverse group of end users in the development of system functionality. In this case, the investigators engaged practicing surveillance experts who would presumably be the end users of the new functions in OHDSI in accordance with informatics best practices.[29] However, the team did not engage the existing user base of the OHDSI platform. Initial conversations with key members of OHDSI leadership indicated that all three functions would be of interest to the community. Yet, when conversations moved to actual change proposals, the community identified clear reasons why the timeliness component would not be of interest. The lesson for others is that a broader set of users is necessary to ensure new functions for a system will meet the needs of everyone and not just those for whom a new form, decision support rule, or analysis might be initially targeted to serve.

This project sought to extend an existing open-source platform for use by a new community of users who also care deeply about data quality. There remains high value in adapting existing infrastructure and tools to support expanded use cases rather than creating independent tools for use by a niche group. However, doing so requires careful consideration of new and existing users. Since our project began, OHDSI has begun to more systematically address data quality challenges, as illustrated by the recently released The Book of OHDSI.[30] The book reviews data quality challenges and general data quality theory, and it profiles the tools available in OHDSI for addressing data quality. We are hopeful OHDSI and the book will continue to advance data quality theory and practice. Public health and other subdisciplines in biomedical informatics need this support to transform data into knowledge and action.


The authors thank the epidemiologists in local and state health departments, as well as employees of the National Syndromic Surveillance Program, for their input and feedback on the functionalities developed for assessing surveillance data quality. We further thank the active, engaged members of the OHDSI community for their efforts to review and discuss the ideas and code our team brought to the community.


BED and SJG conceived of and designed the project. JDD contributed to the study concept as well as its execution. CW and TF provided technical guidance on the project. CW created and tested all of the code developed for the project, and she served as the team liaison to the Observational Health Data Sciences and Informatics community. JLW served as the project manager, herding team members to move the project forward. BED wrote the initial draft of the manuscript.


Research reported in this abstract was supported by the National Library of Medicine of the National Institutes of Health under Award Number R21LM012219. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of interest

None declared.


  1. Dixon, B.E.; Whipple, E.C.; Lajiness, J.M. et al. (2016). "Utilizing an integrated infrastructure for outcomes research: A systematic review". Health Information and Libraries Journal 33 (1): 7–32. doi:10.1111/hir.12127. PMID 26639793. 
  2. Institute of Medicine (2011). Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care: Workshop Series Summary. National Academies Press. doi:10.17226/12912. ISBN 9780309225014. 
  3. Dixon, B.E.; Rosenman, M.; Xia, Y. et al. (2013). "A vision for the systematic monitoring and improvement of the quality of electronic health data". Studies in Health Technology and Informatics 192: 884–8. doi:10.3233/978-1-61499-289-9-884. PMID 23920685.
  4. Weiskopf, N.G.; Bakken, S.; Hripcsak, G. et al. (2017). "A Data Quality Assessment Guideline for Electronic Health Record Data Reuse". EGEMS 5 (1): 14. doi:10.5334/egems.218. PMC PMC5983018. PMID 29881734.
  5. Liaw, S.-T.; Chen, H.-Y.; Maneze, D. et al. (2012). "Health reform: Is routinely collected electronic information fit for purpose?". Emergency Medicine Australasia 24 (1): 57–63. doi:10.1111/j.1742-6723.2011.01486.x. PMID 22313561.
  6. Dixon, B.E.; Siegel, J.A.; Oemig, T.V. et al. (2013). "Electronic health information quality challenges and interventions to improve public health surveillance data and practice". Public Health Reports 128 (6): 546–53. doi:10.1177/003335491312800614. PMC PMC3804098. PMID 24179266.
  7. Martin, E.G.; Law, J.; Ran, W. et al. (2017). "Evaluating the Quality and Usability of Open Data for Public Health Research: A Systematic Review of Data Offerings on 3 Open Data Platforms". Journal of Public Health Management and Practice 23 (4): e5-e13. doi:10.1097/PHH.0000000000000388. PMID 26910872. 
  8. Botts, N.; Bouhaddou, O.; Bennett, J. et al. (2014). "Data Quality and Interoperability Challenges for eHealth Exchange Participants: Observations from the Department of Veterans Affairs' Virtual Lifetime Electronic Record Health Pilot Phase". AMIA Annual Symposium Proceedings 2014: 307–14. PMC PMC4419918. PMID 25954333. 
  9. Friedman, C.; Rubin, J.; Brown, J. et al. (2015). "Toward a science of learning systems: a research agenda for the high-functioning Learning Health System". JAMIA 22 (1): 43–50. doi:10.1136/amiajnl-2014-002977. PMC PMC4433378. PMID 25342177.
  10. Institute of Medicine (2013). Observational Studies in a Learning Health System: Workshop Summary. National Academies Press. doi:10.17226/18438. ISBN 9780309290845. 
  11. Hripcsak, G.; Duke, J.D.; Shah, N.H. et al. (2015). "Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers". Studies in Health Technology and Informatics 216: 574–8. PMC PMC4815923. PMID 26262116. 
  12. Overhage, J.M.; Ryan, P.B.; Reich, C.G. et al. (2012). "Validation of a common data model for active safety surveillance research". JAMIA 19 (1): 54–60. doi:10.1136/amiajnl-2011-000376. PMC PMC3240764. PMID 22037893.
  13. Stang, P.E.; Ryan, P.B.; Racoosin, J.A. et al. (2010). "Advancing the science for active surveillance: Rationale and design for the Observational Medical Outcomes Partnership". Annals of Internal Medicine 153 (9): 600-6. doi:10.7326/0003-4819-153-9-201011020-00010. PMID 21041580. 
  14. Duke, J.D.; Ryan, P.B.; Suchard, M.A. et al. (2017). "Risk of angioedema associated with levetiracetam compared with phenytoin: Findings of the observational health data sciences and informatics research network". Epilepsia 58 (8): e101-e106. doi:10.1111/epi.13828. PMC PMC6632067. PMID 28681416. 
  15. Boland, M.R.; Shahn, Z.; Madigan, D. et al. (2015). "Birth month affects lifetime disease risk: A phenome-wide method". JAMIA 22 (5): 1042–53. doi:10.1093/jamia/ocv046. PMC PMC4986668. PMID 26041386. 
  16. Huser, V.; DeFalco, F.J.; Schuemie, M. et al. (2016). "Multisite Evaluation of a Data Quality Tool for Patient-Level Clinical Data Sets". EGEMS 4 (1): 1239. doi:10.13063/2327-9214.1239. PMC PMC5226382. PMID 28154833. 
  17. Kahn, M.G.; Callahan, T.J.; Barnard, J. et al. (2016). "A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data". EGEMS 4 (1): 1244. doi:10.13063/2327-9214.1244. PMC PMC5051581. PMID 27713905.
  18. Hoyt, R.E.; Hersh, W.R. (2018). Health Informatics: Practical Guide (7th ed.). Lulu. ISBN 9781387642410. 
  19. Williams, K.S.; Shah, G.H. (2016). "Electronic Health Records and Meaningful Use in Local Health Departments: Updates From the 2015 NACCHO Informatics Assessment Survey". Journal of Public Health Management and Practice 22 (Suppl. 6): S27–S33. doi:10.1097/PHH.0000000000000460. PMC PMC5050007. PMID 27684614. 
  20. Doroshenko, A.; Cooper, D.; Smith, G. et al. (2005). "Evaluation of syndromic surveillance based on National Health Service Direct derived data--England and Wales". MMWR Supplements 54: 117–22. PMID 16177702. 
  21. Buehler, J.W.; Sonricker, A.; Paladini, M. et al. (2008). "Syndromic Surveillance Practice in the United States: Findings from a Survey of State, Territorial, and Selected Local Health Departments". Advances in Disease Surveillance 6: 1–20.
  22. Ong, M.-S.; Magrabi, F.; Coiera, E. (2013). "Syndromic surveillance for health information system failures: A feasibility study". JAMIA 20 (3): 506-12. doi:10.1136/amiajnl-2012-001144. PMC PMC3628054. PMID 23184193. 
  23. Grannis, S.; Wade, M.; Gibson, J. et al. (2006). "The Indiana Public Health Emergency Surveillance System: Ongoing progress, early findings, and future directions". AMIA Annual Symposium Proceedings 2006: 304–8. PMC PMC1839268. PMID 17238352. 
  24. Dixon, B.E.; Lai, P.T.S.; Grannis, S.J. et al. (2013). "Variation in information needs and quality: Implications for public health surveillance and biomedical informatics". AMIA Annual Symposium Proceedings 2013: 670–9. PMC PMC3900209. PMID 24551368. 
  25. OHDSI. "Welcome to OHDSI". Retrieved 05 December 2017. 
  26. chen-regen (15 August 2017). "Add observation.row_created_db_time column". OHDSI/CommonDataModel. GitHub. Retrieved 22 December 2017. 
  27. Wang, R.Y.; Strong, D.M. (1996). "Beyond Accuracy: What Data Quality Means to Data Consumers". Journal of Management Information Systems 12 (4): 5–33. doi:10.1080/07421222.1996.11518099.
  28. Batini, C.; Cappiello, C.; Francalanci, C. et al. (2009). "Methodologies for data quality assessment and improvement". ACM Computing Surveys 41 (3): 16. doi:10.1145/1541880.1541883. 
  29. Holden, R.J.; Voida, S.; Savoy, A. et al. (2016). "Human Factors Engineering and Human–Computer Interaction: Supporting User Performance and Experience". In Finnell, J.; Dixon, B. Clinical Informatics Study Guide. Springer. pp. 287–307. doi:10.1007/978-3-319-22753-5_13. ISBN 9783319227535.
  30. Abedtash, H.; Ascha, M.; Beno, M. et al. "The Book of OHDSI". GitHub. Retrieved 2019.


This presentation is faithful to the original, with only a few minor changes to presentation and grammar.