Journal:Improving data quality in clinical research informatics tools

From LIMSWiki
Full article title Improving data quality in clinical research informatics tools
Journal Frontiers in Big Data
Author(s) AbuHalimeh, Ahmed
Author affiliation(s) University of Arkansas at Little Rock
Primary contact Email: aaabuhalime at ualr dot edu
Editors Ehrlinger, Lisa
Year published 2022
Volume and issue 5
Article # 871897
DOI 10.3389/fdata.2022.871897
ISSN 2624-909X
Distribution license Creative Commons Attribution 4.0 International


Maintaining data quality is a fundamental requirement for any successful and long-term data management project. Providing high-quality, reliable, and statistically sound data is a primary goal for clinical research informatics. In addition, effective data governance and management are essential to ensuring accurate data counts, reports, and validation. As a crucial step of the clinical research process, it is important to establish and maintain organization-wide standards for data quality management to ensure consistency across all systems designed primarily for cohort identification, allowing users to perform an enterprise-wide search on a clinical research data repository to determine the existence of a set of patients meeting certain inclusion or exclusion criteria. Some of the clinical research tools are referred to as de-identified data tools.

Assessing and improving the quality of data used by clinical research informatics tools are both important and difficult tasks. For an increasing number of users who rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment to preserve the value of the data. In clinical research informatics, better data quality translates into better research results and better patient care. However, achieving high-quality data standards is a major task because of the variety of ways that errors might be introduced in a system and the difficulty of correcting them systematically. Problems with data quality tend to fall into two categories. The first category is related to inconsistency among data resources such as format, syntax, and semantic inconsistencies. The second category is related to poor extract, transform, and load (ETL) and data mapping processes.

In this paper, we describe a real-life case study on assessing and improving data quality within a healthcare organization. The paper compares the results obtained from two de-identified data systems (TranSMART Foundation's i2b2 and Epic's SlicerDicer) and discusses the data quality dimensions specific to the clinical research informatics context, as well as the possible data quality issues between the de-identified systems. This work closes by proposing steps or rules for maintaining data quality among different systems to help data managers, information systems teams, and informaticists at any healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.

Keywords: clinical research data, data quality, research informatics, informatics, management of clinical data


Data is the building block in all research, as research results are only as good as the data upon which the conclusions were formed. However, researchers may receive minimal training on how to use the de-identified data systems and methods common to clinical research today for achieving, assessing, or controlling the quality of research data.[1][2] De-identified data systems are defined as systems or tools that allow users to drag and drop search terms from a hierarchical ontology into a Venn diagram-like interface. Investigators can then perform an initial analysis on the de-identified cohort. However, de-identified data systems have no features to indicate or assist in identifying the quality of data in the system; these systems only provide counts.

Another issue involves the level of knowledge a clinician has about informatics in general and clinical informatics in particular. Without knowledge of informatics concepts, clinicians may not be able to identify quality issues in informatics systems. This requires some background in informatics, the science of how to use data, information, and knowledge to improve human health and the delivery of healthcare services[3], as well as clinical informatics, the application of informatics and information technology to deliver healthcare services. For example, clinicians increasingly need to turn to patient portals, electronic medical records (EMRs), telehealth tools, healthcare apps, and a variety of data reporting tools[3] as part of achieving higher-quality health outcomes.

The case presented in this paper focuses on the quality of data obtained from two de-identified systems: TranSMART Foundation's i2b2 and Epic's SlicerDicer. The purpose of this paper is to discuss the quality of the data (counts) generated from the two systems, understand the potential causes of data quality issues, and propose steps to improve the quality and increase the trust of the generated counts by comparing the accuracy, consistency, validity, and understandability of the outcomes from the two systems. The proposed quality improvement steps are broadly applicable and contribute generic and essential steps for automating data curation and data governance to tackle various data quality problems. These steps should help data managers, information systems teams, and informaticists at a healthcare organization monitor and sustain data quality as part of their business intelligence, data governance, and data democratization processes.

The remainder of this paper is organized as follows. In the following section, we introduce the importance of data quality to clinical research informatics, followed by details of the case study, study method, and materials used. Afterwards, findings are presented and the proposed steps to ensure data quality are discussed. At the end, conclusions are drawn and future work discussed.

Importance of data quality to clinical research informatics

Data quality refers to the degree to which data meets the expectations of data consumers and their intended use of the data.[4][5][6] In clinical research informatics, this depends on the parameters of the study conducted.[1][2]

The significance of data quality lies in how the data is perceived and used by its consumer. Identifying data quality involves two stages: first, highlighting which characteristics (i.e., dimensions) are important (Figure 1) and second, determining how these dimensions affect the population in question.[5][6]

Fig1 AbuHalimeh FrontBigData2022 5.jpg

Fig. 1 De-identified data quality dimensions (DDQD)

This paper focuses on a subset of data quality dimensions, which we term "de-identified data quality dimensions" (DDQD). We think these dimensions, described in Table 1, are critical to maintaining data quality in de-identified systems.

Table 1. Definitions of de-identified data quality dimensions (DDQD)
| Quality dimension | Definition |
|---|---|
| Accuracy | Refers to the degree to which information accurately reflects an event or object described, i.e., how well does a piece of information reflect reality? |
| Completeness | Refers to the extent to which data is not missing and is of sufficient amount for the task at hand, i.e., does it fulfill data consumers' expectations and is the needed amount known? |
| Consistency | Refers to the extent to which data is applicable and helpful to the task at hand, i.e., does information stored in one place match relevant data stored elsewhere? |
| Timeliness | Refers to the extent to which data is sufficiently relevant for the task at hand, i.e., is the most up-to-date data available when you need it? |
| Validity | Refers to whether information conforms to a specific format and follows business rules, i.e., is the information in a specific format, does it follow business rules, or is it in a usable format? |
| Understandability | Refers to the degree to which the data can be comprehended, i.e., can the user understand the data easily? |

High-quality data and data management yield performance and efficiency gains and the ability to extract new understanding. On the other hand, poor clinical research informatics data quality can cause inefficiencies and other problems throughout an organization, affecting the quality of research outcomes, healthcare services, and decision-making.

Quality is not a simple scalar measure but can be defined on multiple dimensions, with each dimension yielding different meanings to different information consumers and processes.[5][6] Each dimension can be measured and assessed differently. Data quality assessment implies providing a value for each dimension that clearly says something about how much of the dimension or quality feature is achieved to enable adequate understanding and management. Data quality and the discipline of informatics are undoubtedly interconnected. Data quality depends on how data are collected, processed, and presented; this is what makes data quality important and sometimes complicated because data collection and processing varies from one study to another. Clinical informatics data can include different data formats and types and can come from different resources.

Case study goals

The primary goal is to compare, identify, and understand discrepancies in a patient count in TranSMART Foundation's i2b2[7] compared to Epic's SlicerDicer.[8] The secondary goal is to create a data dictionary that clinical researchers can easily understand. For example, if they wanted a count of patients with asthma, they would know what diagnoses were used to identify patients, where these diagnoses were captured, and that this count matches existing clinical knowledge.

The case described below is from a healthcare organization that wanted to have the ability to ingest other sources of research-specific data, such as genomic information, and their existing products did not have a way to do that. After deliberation, i2b2 was chosen as the data model for their clinical data warehouse. Prior to going live with users, however, it was essential to validate that the data in their clinical data management system (CDMS) was accurate.



The clinical validation process involved a clinical informatician, data analyst, and extract, transform, and load (ETL) developer.


Many healthcare organizations use at least one of the three major Epic databases: Chronicles, Clarity, and Caboodle. The data source used to feed the i2b2 and SlicerDicer tools was the Caboodle database.


The tools used to perform the study were i2b2 and SlicerDicer:

  • i2b2: Informatics for Integrating Biology and the Bedside (i2b2) is an open-source clinical data warehousing and analytics research platform managed by TranSMART Foundation. i2b2 enables sharing, integration, standardization, and analysis of heterogeneous data from healthcare and clinical research sources.[7]
  • SlicerDicer: SlicerDicer is a self-service reporting tool within Epic systems that gives physicians ready access to clinical data that is customizable by patient populations for data exploration. SlicerDicer allows the user to choose and search a specific patient population to answer questions about diagnoses, demographics, and procedures performed.[8]

Method description

The study was designed to compare, identify, and understand discrepancies in patient counts between i2b2 and SlicerDicer. We achieved this goal by choosing tasks based on the nature of the tools. The first step was to run the same query in both tools to look at patient demographics (e.g., race, ethnicity, gender) and identify where race and ethnicity were aggregated differently in i2b2 than in SlicerDicer, which was more granular. For example, Cuban and Puerto Rican values in SlicerDicer were included in the "Other Hispanic or Latino" category in i2b2. The discrepancies are shown in Table 2.

Table 2. Patient demographic counts.
| Patient counts | i2b2 | SlicerDicer | % difference |
|---|---|---|---|
| American Indian or Alaska Native | 1,434 | 1,579 | 9% |
| Asian | 7,051 | 7,480 | 6% |
| Asian Indian | | 917 | |
| Chinese | | 177 | |
| Filipino | | 148 | |
| Japanese | | 30 | |
| Korean | | 62 | |
| Other Asian | | 6,146 | |
| Black or African American | 238,638 | 242,871 | 2% |
| Native Hawaiian or Other Pacific Islander | 2,990 | 3,430 | 13% |
| Native Hawaiian | | 170 | |
| Guamanian or Chamorro | | 21 | |
| Samoan | | 14 | |
| Other Pacific Islander | | 2,582 | |
| Multiple Race * | | | |
| Other | 99,081 | 107,759 | 8% |
| Unknown | | 31,733 | |
| Decline to Answer | | 176 | |
| White | 659,140 | 670,182 | 2% |
| Hispanic or Latino | 61,237 | 64,354 | 5% |
| Other Hispanic or Latino | | 21,021 | |
| Mexican, Mexican American, or Chicano/a | | 2,263 | |
| Puerto Rican | | 118 | |
| Cuban | | 41 | |
| Non-Hispanic or Latino | 263,119 | 281,091 | 6% |
| Unknown | 733,886 | 769,097 | 5% |
| None of the Above | | 730,480 | |
| Decline to Answer | | 300 | |
| Female | 536,450 | 545,895 | 2% |
| Male | 555,851 | 568,026 | 2% |
| Unknown | 548 | 615 | 11% |
| Other | | 5 | |

* Note that if patients have more than one entry for race data, SlicerDicer counts them in all of the selected fields.
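
The relationship between granular and general categories in Table 2 can be illustrated with a small roll-up. The sketch below is not the organization's actual mapping: the `ROLLUP` dictionary and the `roll_up` function are hypothetical names introduced for illustration, and it assumes the granular Asian rows are SlicerDicer values, which is consistent with their sum matching the SlicerDicer "Asian" total.

```python
from collections import Counter

# Hypothetical roll-up map from granular categories to a general category.
# The real Caboodle-to-i2b2 mapping is not given in the article.
ROLLUP = {
    "Asian Indian": "Asian",
    "Chinese": "Asian",
    "Filipino": "Asian",
    "Japanese": "Asian",
    "Korean": "Asian",
    "Other Asian": "Asian",
}

# Granular counts copied from Table 2 (assumed to be SlicerDicer values).
granular_counts = {
    "Asian Indian": 917,
    "Chinese": 177,
    "Filipino": 148,
    "Japanese": 30,
    "Korean": 62,
    "Other Asian": 6146,
}


def roll_up(counts, mapping):
    """Sum granular counts into their general category.

    Categories missing from the mapping are kept under their own name,
    which makes an incomplete mapping visible in the output.
    """
    totals = Counter()
    for category, n in counts.items():
        totals[mapping.get(category, category)] += n
    return dict(totals)


rolled = roll_up(granular_counts, ROLLUP)
print(rolled)  # {'Asian': 7480}
```

A category left out of the mapping would appear in the result under its own name instead of being folded into the general category, which is one simple way to spot an incomplete mapping.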

The second step was to run the same query to explore diagnoses using J45*, the ICD-10 code for asthma, and E10*, the ICD-10 code for type 1 diabetes. The query results are shown in Table 3.

Table 3. Patient counts based on diagnosis codes.
| Patient counts | i2b2 | SlicerDicer | % difference |
|---|---|---|---|
| Asthma (J45*) | | | |
| Diagnosis | 14,500 | 22,958 | 39.48% |
| Billing diagnosis | 20,429 | 22,265 | 8.25% |
| Type 1 diabetes (E10*) | | | |
| Diagnosis | 1,900 | 2,202 | 13.71% |
| Billing diagnosis | 1,869 | 2,025 | 7.70% |
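
To make the wildcard queries concrete, the following sketch shows one way a pattern such as J45* could be applied to ICD-10 codes and turned into a distinct patient count. It is only an illustration: the rows, the `matches_icd10_pattern` helper, and the `count_patients` helper are hypothetical, and neither tool's internal query logic is described in the article.

```python
def matches_icd10_pattern(code: str, pattern: str) -> bool:
    """Return True if an ICD-10 code matches a pattern like 'J45*'.

    A trailing '*' is treated as "any suffix"; without it, the match is exact.
    """
    code = code.strip().upper()
    pattern = pattern.strip().upper()
    if pattern.endswith("*"):
        return code.startswith(pattern[:-1])
    return code == pattern


# Hypothetical diagnosis rows: (patient_id, icd10_code).
rows = [
    (1, "J45.20"),
    (1, "J45.909"),
    (2, "E10.9"),
    (3, "J44.1"),
    (4, "E10.65"),
    (4, "J45.40"),
]


def count_patients(rows, pattern):
    """Count distinct patients with at least one code matching the pattern."""
    return len({pid for pid, code in rows if matches_icd10_pattern(code, pattern)})


asthma_count = count_patients(rows, "J45*")
t1d_count = count_patients(rows, "E10*")
print(asthma_count, t1d_count)  # 2 2
```

Counting distinct patient identifiers rather than rows matters here: patient 1 has two asthma codes but contributes only once to the count.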

A percentage difference calculator was implemented to find the percent difference between i2b2 counts and SlicerDicer counts for counts greater than zero. The percentage difference expresses the gap between two numbers as a percentage, which makes it useful for estimating the quality of the counts coming from the two tools. The threshold for accepted quality in this study was a difference below two percent.

V1 = i2b2 counts and V2 = SlicerDicer counts; these counts are plugged into the below formula:
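
The formula itself does not appear in this copy of the article. The percentages reported in Tables 2 and 3 are consistent with taking the absolute difference as a share of the SlicerDicer count, that is, |V1 − V2| / V2 × 100. The sketch below assumes that reading; the function name `percent_difference` and the constant `THRESHOLD_PCT` are illustrative, not the author's implementation.

```python
THRESHOLD_PCT = 2.0  # accepted quality in the study: below two percent


def percent_difference(v1: float, v2: float) -> float:
    """Absolute difference between v1 and v2 as a percentage of v2.

    Assumption: v2 (the SlicerDicer count) is the denominator, which is
    what reproduces the percentages in Tables 2 and 3.
    """
    if v2 <= 0:
        raise ValueError("v2 must be greater than zero")
    return abs(v1 - v2) / v2 * 100


# (i2b2 count, SlicerDicer count) pairs copied from Table 3.
table3_pairs = {
    "Asthma, billing diagnosis": (20429, 22265),
    "Type 1 diabetes, diagnosis": (1900, 2202),
    "Type 1 diabetes, billing diagnosis": (1869, 2025),
}

for label, (i2b2_count, slicerdicer_count) in table3_pairs.items():
    pct = percent_difference(i2b2_count, slicerdicer_count)
    status = "within" if pct < THRESHOLD_PCT else "exceeds"
    print(f"{label}: {pct:.2f}% ({status} threshold)")
```

Under this assumption the three rows come out at 8.25%, 13.71%, and 7.70%, matching Table 3. The asthma "Diagnosis" row is left out because the reported 39.48% is reproduced only with a SlicerDicer count of 23,958 (the value quoted alongside the t-test results); the 22,958 printed in Table 3 would give roughly 36.84%.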

A paired t-test was used to investigate the difference between the two counts from i2b2 and SlicerDicer for the same query.

Findings and hypotheses

All the results obtained from comparing the counts between i2b2 and SlicerDicer are listed in Tables 2 and 3.

However, when diagnoses were explored, larger discrepancies were noted. There are two diagnosis fields in i2b2: one for diagnosis and one for billing diagnosis. Using J45* as the ICD-10 code for asthma resulted in 22,265 patients when using the billing diagnosis code in SlicerDicer but only 20,429 in i2b2. The discrepancy using diagnosis was even larger. Patient count results for type 1 diabetes diagnosis code E10* using both diagnosis and billing diagnosis are also shown in Table 3.

The best approach to understanding the reasons for this discrepancy was to look at the diagnosis options in SlicerDicer and build a hypothesis on where the discrepancy might come from. Next, the SQL code for the Caboodle-to-i2b2 ETL process was examined. From these examinations, the following hypotheses were considered:

H0: There is no discrepancy in the data elements used to pull the data.

H1: There is a discrepancy in the data elements used to pull the data.

A paired sample t-test was implemented on the counts obtained from i2b2 and SlicerDicer using different data points. In all cases, the p-value was equal to 0 [P(x ≤ –Infinity) = 0], which means that the chance of a type I error (rejecting a correct H0) is small: 0 (0%). The smaller the p-value, the more it supports H1. For example, results of the paired t-test indicated that there is a significant medium difference between i2b2 (M = 14,500, SD = 0) and SlicerDicer (M = 23,958, SD = 0), t(0) = Infinity, p < 0.001; results of the paired t-test also indicated that there is a significant medium difference between i2b2 (M = 155,434, SD = 0) and SlicerDicer (M = 1,579, SD = 0), t(0) = Infinity, p < 0.001.

Since the p-value < α, H0 is rejected and the i2b2 population's average is considered to be not equal to the SlicerDicer population's average. In other words, the difference between the averages of i2b2 and SlicerDicer is big enough to be statistically significant.

The paired t-test results supported the alternative hypothesis and revealed that there is a discrepancy in the data elements used to pull the data.

Also, the percentage difference calculator results, which were used to estimate the quality of the counts coming from the two tools, showed that a majority exceeded the threshold for accepted quality in this study (below 2%), as shown in Tables 2 and 3. These results provided strong evidence of a crucial quality issue in the counts obtained.

When the SQL code for the Caboodle-to-i2b2 ETL process was examined, it showed that the code only looked at billing and encounter diagnoses, and everything that was not a billing diagnosis was simply labeled "diagnosis." SlicerDicer, and even Caboodle, include other diagnosis sources such as "medical history," "hospital problem," and "problem list." This finding was included in the data dictionary so that researchers would understand what sources i2b2 was using and that, if they wanted data beyond that, they would have to request data from Caboodle.
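
The effect of this labeling rule can be sketched as follows. The actual SQL is not shown in the article, so this is only an illustration under stated assumptions: the source rows, the source names, and the functions `etl_label` and `patients_by_label` are hypothetical.

```python
from collections import defaultdict

# Hypothetical source rows: (patient_id, icd10_code, diagnosis_source).
source_rows = [
    (1, "J45.909", "billing"),
    (1, "J45.909", "encounter"),
    (2, "J45.20", "problem list"),
    (3, "J45.40", "medical history"),
    (4, "J45.909", "encounter"),
    (5, "J45.30", "hospital problem"),
]

# Assumption: the ETL loads only these two sources, as described in the text.
ETL_SOURCES = {"billing", "encounter"}


def etl_label(source):
    """Return the label a row would carry after loading, or None if not loaded."""
    if source not in ETL_SOURCES:
        return None
    return "billing diagnosis" if source == "billing" else "diagnosis"


def patients_by_label(rows):
    """Group distinct patient ids by the label assigned during loading."""
    grouped = defaultdict(set)
    for patient_id, _code, source in rows:
        label = etl_label(source)
        if label is not None:
            grouped[label].add(patient_id)
    return grouped


loaded = patients_by_label(source_rows)
loaded_patients = set().union(*loaded.values())
all_patients = {patient_id for patient_id, _code, _source in source_rows}

print(sorted(loaded["diagnosis"]))             # [1, 4]
print(sorted(all_patients - loaded_patients))  # [2, 3, 5]
```

In this toy data, three of the five patients are visible only through sources the load skips, so a tool that also queries the problem list, medical history, and hospital problems would report a larger count for the same code.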


The revealed discrepancies point to major information quality problems such as data inconsistency and data inaccuracy, both of which affect the believability and the validity of the data. The discrepancies noted above are likely due to several factors. First, SlicerDicer counts patients for every race selected, whereas i2b2 only takes the first race field. This is because two data models were used to map race and ethnicity variables in i2b2 to both the 1997 OMB race categories and the 2003 OMB variables, the latter containing a more granular set of race and ethnicity categories. The mapping was then done to "bundle" the other races into a more general set of categories. An incomplete mapping could explain the reduction in concepts.
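
The first factor can be illustrated with a toy example. The patient records and the two counting functions below are hypothetical; they only contrast "count every recorded race" with "count the first race field," as the two behaviors are described above.

```python
from collections import Counter

# Hypothetical patients with race entries in recorded order.
patients = {
    "p1": ["White"],
    "p2": ["Asian", "White"],
    "p3": ["Black or African American"],
    "p4": ["White", "American Indian or Alaska Native"],
}


def count_every_race(patients):
    """Count a patient once under each distinct race recorded for them."""
    counts = Counter()
    for races in patients.values():
        counts.update(set(races))
    return counts


def count_first_race(patients):
    """Count a patient only under the first race field recorded for them."""
    return Counter(races[0] for races in patients.values() if races)


every = count_every_race(patients)
first = count_first_race(patients)

print(every["White"], first["White"])              # 3 2
print(sum(every.values()), sum(first.values()))    # 6 4
```

With the first approach, category totals can add up to more than the number of patients (six entries for four patients here), so per-category counts from the two approaches differ even when the underlying data is identical.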

Secondly, the purpose of the ETL process is to load the warehouse with integrated and cleansed data. Data quality focuses on the contents of the individual records to ensure the data loaded into the target destination is accurate, reliable, and consistent. As such, the ETL code should be evaluated to ensure the data extracted generally matches what researchers expect. In our case, this means understanding what diagnosis most researchers are interested in: they are expecting encounter diagnosis instead of data that includes problem lists and medical history. Thirdly, a leading cause for data quality issues is formatting differences or conversion errors[9][10], which was also the case with this dataset.

Lastly, data loss could be present in the ETL process, a significant challenge in ETL processes because of the nature of the source systems. Data losses arise from the disparities among the source operational systems. Source systems are diverse and disparate because of the increased amount of data, modification of data formats, and modification and derivation of new data elements.

In general, data integration with heterogeneous systems is not an easy task. This is mainly due to the fact that many data exchange channels must be developed in order to allow an exchange of data between the systems[11] and to solve problems related to the provision of interoperability between systems on the level of data.[12]

Steps to ensure informatics quality

To improve the data quality generated from de-identified systems, which mainly contains counts, and to solve any data quality issues related to the provision of interoperability between the used tools on the level of data, we propose the following steps to better ensure quality results:

  1. Make data “fit for use.” In order to do this, data governance bodies must clearly define major data concepts/variables included in the de-identified systems and standardize their collection and monitoring processes; this can increase clinical data reliability and reduce the inconsistency of data quality among systems involved.[5][6]
  2. Define data elements by developing a data dictionary. This is a fundamental part of any data quality plan; the lack of clear definitions of source data and controlled data collection procedures often raises concerns about the quality of data provided in such environments and, consequently, about the evidence level of related findings.[13] Developing a data dictionary is essential to ensuring data quality, especially in de-identified systems where all data elements are aggregated in a specific way and there are not enough details about each concept. A data dictionary will serve as a guidebook to define the major data concepts. To do this, organizations must determine what metadata is helpful to the researchers when they use the de-identified data systems. In addition, identifying more targeted data concepts and process workflows can help reduce some of the time and effort for researchers when working with large amounts of data, ultimately improving overall data quality.
  3. Apply strong ETL practices. Applying good ETL practices, such as data cleansing mechanisms that get the data into a state that works well with data from other sources, is also essential.
  4. Choose smart ETL architecture. Choose an architecture that allows you to update components of your ETL process when data and systems change or are updated to prevent any data loss and to ensure data integrity and consistency.
  5. Apply data lineage techniques. This will help in understanding where data originated from, when it was loaded, and how it was transformed, all of which is essential for the integrity of the downstream data and the process that moves it to any of the de-identified systems.
  6. Establish rigorous processes for monitoring and cleansing data. Discovering and acting upon suspicious data and taking the appropriate cleansing action is preventative and useful.
  7. Evaluate and revise queries regularly. Users need to revise their queries and refine results as they combine data variables.
  8. Have critical knowledge support available in-house or on-demand. Having a clinical informaticist available can be beneficial to the process. They can ensure that your data reflects what is seen in clinical practice or help explain questionable data with their knowledge of clinical workflows and how that data is collected, especially if your analyst has no clinical background.
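
As an illustration of step 2, a data dictionary entry could record, for each concept, which codes define it and which sources feed its count. The sketch below is a minimal example, not the dictionary built in the case study: the `DictionaryEntry` fields and the `describe` function are assumptions, and the asthma entry is filled in from the ETL findings reported earlier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """One hypothetical data dictionary entry for a concept in a de-identified tool."""
    concept: str
    definition: str
    code_patterns: List[str]
    included_sources: List[str]
    excluded_sources: List[str] = field(default_factory=list)


def describe(entry: DictionaryEntry) -> str:
    """Render an entry as plain text a researcher can read next to a count."""
    lines = [
        f"Concept: {entry.concept}",
        f"Definition: {entry.definition}",
        f"Codes: {', '.join(entry.code_patterns)}",
        f"Counted from: {', '.join(entry.included_sources)}",
    ]
    if entry.excluded_sources:
        lines.append(f"Not counted: {', '.join(entry.excluded_sources)}")
    return "\n".join(lines)


asthma_entry = DictionaryEntry(
    concept="Asthma",
    definition="Patients with at least one ICD-10 diagnosis code matching J45*",
    code_patterns=["J45*"],
    included_sources=["billing diagnosis", "encounter diagnosis"],
    excluded_sources=["medical history", "hospital problem", "problem list"],
)

print(describe(asthma_entry))
```

Recording excluded sources explicitly tells a researcher when a count in one tool is not expected to match another tool and when additional data would have to be requested from the source database.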


The success of any de-identified data tool depends largely on the quality of the data used and the mapping process which is intertwined with the ETL components. By extension, the ETL process is a crucial component in determining the quality of the data generated by an information system.

This case study showed that the discrepancies in the data used in the data pull process led to major information quality issues, such as data inconsistency and data inaccuracy, which in turn affect the believability and the validity of the data, themselves major data quality measures.

From this case study, our biggest contribution is a proposed set of steps that together form guidelines for a methodology of manual and automated procedures and tools used to manage data quality and data governance in a multifaceted, diverse information environment such as healthcare organizations, as well as to enhance data quality among data housed in de-identified data systems.

Future work will focus on more clinical informatics tools such as TriNetX, and other types of medical data, to assess the quality of the counts obtained from those tools.


Author contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


References

  1. Nahm, Meredith (2012), Richesson, Rachel L.; Andrews, James E., eds., "Data Quality in Clinical Research", Clinical Research Informatics (London: Springer London): 175–201, doi:10.1007/978-1-84882-448-5_10, ISBN 978-1-84882-447-8. Retrieved 2022-07-26.
  2. Zozus, Meredith Nahm; Kahn, Michael G.; Weiskopf, Nicole G. (2019), Richesson, Rachel L.; Andrews, James E., eds., "Data Quality in Clinical Research", Clinical Research Informatics (Cham: Springer International Publishing): 213–248, doi:10.1007/978-3-319-98779-8_11, ISBN 978-3-319-98778-1.
  3. American Medical Informatics Association. "Informatics: Research and Practice". Why Informatics?. Retrieved April 2021.
  4. Pipino, Leo L.; Lee, Yang W.; Wang, Richard Y. (1 April 2002). "Data quality assessment". Communications of the ACM 45 (4): 211–218. doi:10.1145/505248.506010. ISSN 0001-0782.
  5. Halimeh, A.A. (December 2011). "Integrating information quality in visual analytics". University of Arkansas Little Rock.
  6. AbuHalimeh, A.; Tudoreanu, M.E. (2014). "Subjective Information Quality in Data Integration: Evaluation and Principles". In Yeoh, William; Talburt, John R.; Zhou, Yinle. Information Quality and Governance for Business Intelligence. Advances in Business Strategy and Competitive Advantage. IGI Global. pp. 44–65. doi:10.4018/978-1-4666-4892-0.ch003. ISBN 978-1-4666-4892-0.
  7. "About Us". Informatics for Integrating Biology and the Bedside (i2b2). TranSMART Foundation. Retrieved April 2021.
  8. "SlicerDicer". Epic UserWeb. Epic. Retrieved April 2021.
  9. Azeroual, Otmane; Saake, Gunter; Abuosba, Mohammad (5 March 2019). "ETL Best Practices for Data Quality Checks in RIS Databases". Informatics 6 (1): 10. doi:10.3390/informatics6010010. ISSN 2227-9709.
  10. Souibgui, Manel; Atigui, Faten; Zammali, Saloua; Cherfi, Samira; Yahia, Sadok Ben (2019). "Data quality in ETL process: A preliminary study". Procedia Computer Science 159: 676–687. doi:10.1016/j.procs.2019.09.223.
  11. Berkhoff, K.; Ebeling, B.; Lübbe, S. (2012). "Integrating Research Information into a Software for Higher Education Administration – Benefits for Data Quality and Accessibility". In Jeffery, K.G.; Dvořák, J. e-Infrastructures for Research and Innovation Linking Information Systems to Improve Scientific Knowledge Production. Proceedings of the 11th International Conference on Current Research Information Systems. euroCRIS. pp. 167–176. ISBN 9788086742335.
  12. Marek, Macura (2014). "Integration Of Data From Heterogeneous Sources Using ETL Technology". Computer Science 15 (2): 109. doi:10.7494/csci.2014.15.2.109. ISSN 1508-2806.
  13. Spengler, Helmut; Gatz, Ingrid; Kohlmayer, Florian; Kuhn, Klaus A.; Prasser, Fabian (1 July 2020). "Improving Data Quality in Medical Research: A Monitoring Architecture for Clinical and Translational Data Warehouses". 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS) (Rochester, MN, USA: IEEE): 415–420. doi:10.1109/CBMS49503.2020.00085. ISBN 978-1-7281-9429-5.


This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance.