Journal:Characterizing and managing missing structured data in electronic health records: Data analysis

From LIMSWiki
Revision as of 17:05, 6 March 2018 by Shawndouglas
Full article title: Characterizing and managing missing structured data in electronic health records: Data analysis
Journal: JMIR Medical Informatics
Author(s): Beaulieu-Jones, Brett K.; Lavage, Daniel R.; Snyder, John W.; Moore, Jason H.; Pendergrass, Sarah A.; Bauer, Christopher R.
Author affiliation(s): University of Pennsylvania, Geisinger
Primary contact: Email: cbauer at geisinger dot edu
Year published: 2018
Volume and issue: 6 (1)
Page(s): e11
DOI: 10.2196/medinform.8960
ISSN: 2291-9694
Distribution license: Creative Commons Attribution 4.0 International
Website: http://medinform.jmir.org/2018/1/e11/
Download: http://medinform.jmir.org/2018/1/e11/pdf (PDF)

Abstract

Background: Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.

Objective: The objective of this study was to demonstrate how the mechanism of "missingness" can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.

Methods: We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on four mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).

Results: Our results showed that several methods, including variations of multivariate imputation by chained equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.

Conclusions: The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.

Keywords: imputation, missing data, clinical laboratory test results, electronic health records
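
The evaluation design summarized in the Methods paragraph above (start from complete cases, simulate missingness, impute, and compare against the known values) can be illustrated with a minimal sketch. This is not the authors' code. It assumes synthetic Gaussian data in place of EHR laboratory values, covers only the missing-completely-at-random mechanism, and uses column-mean imputation as a simple stand-in for the imputation methods the study compares. All function names (`simulate_mcar`, `impute_column_mean`, `rmse_on_masked`) are hypothetical.

```python
import numpy as np


def simulate_mcar(complete, frac, rng):
    """Return a copy of `complete` with roughly `frac` of entries set to NaN.

    Every entry is masked independently with the same probability, which is
    what "missing completely at random" means. Also returns the boolean mask.
    """
    mask = rng.random(complete.shape) < frac
    masked = complete.copy()
    masked[mask] = np.nan
    return masked, mask


def impute_column_mean(data):
    """Replace each NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(data, axis=0)
    filled = data.copy()
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


def rmse_on_masked(complete, imputed, mask):
    """Root mean squared error, computed only over the entries that were masked."""
    diff = complete[mask] - imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic "complete cases": 500 rows, 4 numeric measures (assumed, not EHR data).
rng = np.random.default_rng(0)
complete = rng.normal(loc=100.0, scale=15.0, size=(500, 4))

masked, mask = simulate_mcar(complete, frac=0.2, rng=rng)
imputed = impute_column_mean(masked)
error = rmse_on_masked(complete, imputed, mask)
print(f"RMSE on masked entries: {error:.2f}")
```

Because the true values behind each simulated gap are known, the error is measured only on the masked entries. Swapping `impute_column_mean` for another imputation routine, or `simulate_mcar` for a mechanism whose masking probability depends on observed or unobserved values, would extend the same loop to the other settings named in the abstract.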

Introduction

Justification

Missing data present a challenge to researchers in many fields, and this challenge is growing as datasets increase in size and scope. This is especially problematic for electronic health records (EHRs), where missing values frequently outnumber observed values. EHRs were designed to record and improve patient care and streamline billing rather than act as resources for research[1]; thus, there are significant challenges to using these data to gain a better understanding of human health. As EHR data become increasingly used as a source of phenotypic information for biomedical research[2], it is crucial to develop strategies for coping with missing data.

References

  1. Steinbrook, R. "Health Care and the American Recovery and Reinvestment Act". New England Journal of Medicine 360 (11): 1057–1060. doi:10.1056/NEJMp0900665.
  2. Flintoft, L. "Disease genetics: Phenome-wide association studies go large". Nature Reviews Genetics 15 (1): 2. doi:10.1038/nrg3637.

Notes

This presentation is faithful to the original, with only a few minor changes to grammar, spelling, and presentation, including the addition of PMCID and DOI when they were missing from the original reference.