Journal:Characterizing and managing missing structured data in electronic health records: Data analysis

Full article title	Characterizing and managing missing structured data in electronic health records: Data analysis
Journal	JMIR Medical Informatics
Author(s)	Beaulieu-Jones, Brett K.; Lavage, Daniel R.; Snyder, John W.; Moore, Jason H.; Pendergrass, Sarah A.; Bauer, Christopher R.
Author affiliation(s)	University of Pennsylvania, Geisinger
Primary contact	Email: cbauer at geisinger dot edu
Year published	2018
Volume and issue	6 (1)
Page(s)	e11
DOI	10.2196/medinform.8960
ISSN	2291-9694
Distribution license	Creative Commons Attribution 4.0 International
Website	http://medinform.jmir.org/2018/1/e11/
Download	http://medinform.jmir.org/2018/1/e11/pdf (PDF)

Abstract

Background: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.

Objective: The objective of this study was to demonstrate how the mechanism of "missingness" can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.

Methods: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on four mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).

Results: Our results showed that several methods, including variations of multivariate imputation by chained equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.

Conclusions: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.

Keywords: imputation, missing data, clinical laboratory test results, electronic health records

Introduction

Justification

Missing data present a challenge to researchers in many fields, and this challenge is growing as datasets increase in size and scope. This is especially problematic for electronic health records (EHRs), where missing values frequently outnumber observed values. EHRs were designed to record and improve patient care and streamline billing rather than act as resources for research^[1]; thus, there are significant challenges to using these data to gain a better understanding of human health. As EHR data become increasingly used as a source of phenotypic information for biomedical research^[2], it is crucial to develop strategies for coping with missing data.

Clinical laboratory assay results are a particularly rich data source within the EHR, but they also tend to have large amounts of missing data. These data may be missing for many different reasons. Some tests are used for routine screening, but screening may be biased. Other tests are only conducted if they are clinically relevant to very specific ailments. Patients may also receive care at multiple health care systems, resulting in information gaps at each institution. Age, sex, socioeconomic status, access to care, and medical conditions can all affect how comprehensive the data are for a given patient. Accounting for the mechanisms that cause data to be missing is critical, since failure to do so can lead to biased conclusions.

Background

Aside from the uncertainty associated with a variable that is not observed, many analytical methods, such as regression or principal components analysis, are designed to operate only on a complete dataset. The easiest way to satisfy this requirement is to remove either the variables or the individuals that have missing values. Eliminating variables is justifiable in many situations, especially if a given variable has a large proportion of missing values, but doing so may restrict the scope and power of a study. Removing individuals with missing data, known as complete-case analysis, is the other option. This is generally not recommended unless the fraction of individuals that will be removed is small enough to be considered trivial, or there is good reason to believe that the absence of a value is due to random chance. If there are systematic differences between individuals with and without observations, complete-case analysis will be biased.
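The bias described above can be illustrated with a minimal simulation. The sketch below is not the authors' code: the cohort, the linear relationship between age and the lab value, and the rule that older patients are tested more often are all assumptions chosen for illustration. It uses only the Python standard library.

```python
import random
import statistics

random.seed(0)

# Hypothetical cohort: a lab value that rises with age (illustrative numbers only).
n = 10_000
ages = [random.uniform(20, 80) for _ in range(n)]
labs = [50 + 0.5 * a + random.gauss(0, 5) for a in ages]

# Assumed mechanism: older patients are tested more often, so the probability
# of observing the lab value rises from 0.1 at age 20 to 0.9 at age 80.
def p_observed(age):
    return 0.1 + 0.8 * (age - 20) / 60

# None marks a lab value that was never recorded.
records = [
    (a, lab if random.random() < p_observed(a) else None)
    for a, lab in zip(ages, labs)
]

# Complete-case analysis: keep only individuals with no missing value.
complete_cases = [(a, lab) for a, lab in records if lab is not None]

full_mean = statistics.fmean(labs)
complete_case_mean = statistics.fmean([lab for _, lab in complete_cases])

print(f"kept {len(complete_cases)} of {n} individuals")
print(f"mean lab value, everyone:       {full_mean:.1f}")
print(f"mean lab value, complete cases: {complete_case_mean:.1f}")
```

Because the retained individuals skew older in this assumed setup, the complete-case mean overestimates the mean of the full cohort; the size of the gap depends entirely on the parameters chosen here.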

An alternative approach is to fill in the missing fields with estimates. This process, called imputation, requires a model that makes assumptions about why only some values were observed. "Missingness" mechanisms fall somewhere on a spectrum spanning three scenarios (Figure 1).

Figure 1: Two general paradigms are commonly used to describe missing data. Missing data are considered ignorable if the probability of observing a variable has no relation to the value of the observed variable and are considered nonignorable otherwise. The second paradigm divides missingness into three categories: missing completely at random (MCAR: the probability of observing a variable is not dependent on its value or other observed values), missing at random (MAR: the probability of observing a variable is not dependent on its own value after conditioning on other observed variables), and missing not at random (MNAR: the probability of observing a variable is dependent on its value, even after conditioning on other observed variables). The x-axis indicates the extent to which a given value being observed depends on the values of other observed variables. The y-axis indicates the extent to which a given value being observed depends on its own value.
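The three categories in the caption are often written as conditions on the probability of the observation pattern. The notation below is an assumption introduced here for illustration (it does not appear in the article): Y is the full data split into observed and missing parts, and R is the vector of indicators recording which entries were observed.

```latex
% Assumed notation (not from the article): Y = (Y_obs, Y_mis) is the full data,
% R is the vector of indicators with R_j = 1 when Y_j is observed.
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = P(R \mid Y_{\mathrm{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) \neq P(R \mid Y_{\mathrm{obs}})
\end{align*}
```

Under this formulation MCAR is a special case of MAR, and MNAR is whatever remains: the observation pattern still depends on the unobserved values after conditioning on everything that was observed.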

When data are missing in a manner completely unrelated to both the observed and unobserved values, they are considered to be missing completely at random (MCAR).^[3]^[4] When data are MCAR, the observed data represent a random sample of the population, but this is rarely encountered in practice. Conversely, data missing not at random (MNAR) refers to a situation where the probability of observing a data point depends on the value of that data point.^[5] In this case, the mechanism responsible for the missing data is biased and should not be considered ignorable.^[6] For example, rheumatoid factor is an antibody detectable in blood, and the concentration of this antibody is correlated with the presence and severity of rheumatoid arthritis. This test is typically performed only for patients with some indication of rheumatoid arthritis. Thus, patients with high rheumatoid factor levels are more likely to have rheumatoid factor measures.
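To make the distinction between mechanisms concrete, the following sketch simulates one always-observed covariate and one partially observed lab value under MCAR, MAR, and MNAR. It is an illustration under stated assumptions rather than the procedure used in the study: the variable names, the logistic observation probabilities, and the simple regression adjustment are all hypothetical choices.

```python
import math
import random
import statistics

random.seed(1)

def logistic(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical standardized data: `age` is always recorded; `value` is a
# lab result correlated with age that may or may not be observed.
n = 20_000
age = [random.gauss(0, 1) for _ in range(n)]
value = [0.6 * a + random.gauss(0, 0.8) for a in age]

# Probability that `value` is observed under each assumed mechanism.
mechanisms = {
    "MCAR": lambda a, v: 0.5,                # depends on nothing
    "MAR": lambda a, v: logistic(1.5 * a),   # depends only on observed age
    "MNAR": lambda a, v: logistic(1.5 * v),  # depends on the value itself
}

def estimates(prob):
    """Return (naive mean, age-adjusted mean) computed from observed cases."""
    obs = [(a, v) for a, v in zip(age, value) if random.random() < prob(a, v)]
    mean_a = statistics.fmean([a for a, _ in obs])
    mean_v = statistics.fmean([v for _, v in obs])
    # Ordinary least squares fit of value on age, using observed cases only.
    sxy = sum((a - mean_a) * (v - mean_v) for a, v in obs)
    sxx = sum((a - mean_a) ** 2 for a, _ in obs)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_a
    # Predict at the mean age of everyone, since age is never missing.
    adjusted = intercept + slope * statistics.fmean(age)
    return mean_v, adjusted

true_mean = statistics.fmean(value)
results = {name: estimates(prob) for name, prob in mechanisms.items()}

print(f"true mean of value: {true_mean:.3f}")
for name, (naive, adjusted) in results.items():
    print(f"{name}: naive mean {naive:.3f}, age-adjusted mean {adjusted:.3f}")
```

In this setup the naive mean is close to the truth only under MCAR. Under MAR, conditioning on age recovers the true mean, whereas under MNAR the estimate remains biased even after adjustment, which mirrors the rheumatoid factor example above.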

References

1. Steinbrook, R. "Health Care and the American Recovery and Reinvestment Act". New England Journal of Medicine 360 (11): 1057–1060. doi:10.1056/NEJMp0900665. PMID 19224738.
2. Flintoft, L. "Disease genetics: Phenome-wide association studies go large". Nature Reviews Genetics 15 (1): 2. doi:10.1038/nrg3637. PMID 24322724.
3. Wells, B.J.; Chagin, K.M.; Nowacki, A.S.; Kattan, M.W. "Strategies for handling missing data in electronic health record derived data". EGEMS 1 (3): 1035. doi:10.13063/2327-9214.1035. PMC PMC4371484. PMID 25848578. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4371484.
4. Bounthavong, M.; Watanabe, J.H.; Sullivan, K.M. "Approach to addressing missing data for electronic medical records and pharmacy claims data research". Pharmacotherapy 35 (4): 380–7. doi:10.1002/phar.1569. PMID 25884526.
5. Bhaskaran, K.; Smeeth, L. "What is the difference between missing completely at random and missing at random?". International Journal of Epidemiology 43 (4): 1336–9. doi:10.1093/ije/dyu080. PMC PMC4121561. PMID 24706730. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4121561.
6. Rubin, D.B. "Inference and missing data". Biometrika 63 (3): 581–592. doi:10.1093/biomet/63.3.581.

Notes

This presentation is faithful to the original, with only a few minor changes to grammar, spelling, and presentation, including the addition of PMCID and DOI when they were missing from the original reference.