Journal:Epidemiological data challenges: Planning for a more robust future through data standards

From LIMSWiki
Revision as of 18:51, 27 April 2020 by Shawndouglas (talk | contribs) (Saving and adding more.)
Full article title Epidemiological data challenges: Planning for a more robust future through data standards
Journal Frontiers in Public Health
Author(s) Fairchild, Geoffrey; Tasseff, Byron; Khalsa, Hari; Generous, Nicholas; Daughton, Ashlynn R.;
Velappan, Nileena; Priedhorsky, Reid; Deshpande, Alina
Author affiliation(s) Los Alamos National Laboratory
Primary contact Email: gfairchild at lanl dot gov
Editors Efird, Jimmy T.
Year published 2018
Volume and issue 6
Article # 336
DOI 10.3389/fpubh.2018.00336
ISSN 2296-2565
Distribution license Creative Commons Attribution 4.0 International
Website https://www.frontiersin.org/articles/10.3389/fpubh.2018.00336/full
Download https://www.frontiersin.org/articles/10.3389/fpubh.2018.00336/pdf (PDF)

Abstract

Accessible epidemiological data are of great value for emergency preparedness and response, understanding disease progression through a population, and building statistical and mechanistic disease models that enable forecasting. The status quo, however, renders acquiring and using such data difficult in practice. In many cases, a primary way of obtaining epidemiological data is through the internet, but the methods by which the data are presented to the public often differ drastically among institutions. As a result, there is a strong need for better data sharing practices. This paper identifies, in detail and with examples, the three key challenges one encounters when attempting to acquire and use epidemiological data: (1) interfaces, (2) data formatting, and (3) reporting. These challenges are used to provide suggestions and guidance for improvement as these systems evolve in the future. If these suggested data and interface recommendations were adhered to, epidemiological and public health analysis, modeling, and informatics work would be significantly streamlined, which can in turn yield better public health decision-making capabilities.

Keywords: data, computational epidemiology, public health, disease modeling, informatics, disease surveillance

Introduction

At the heart of disease surveillance and modeling are epidemiological data. These data are generally presented as a time series of cases, T, for a geographic region, G, and for a demographic, D. The type of cases presented may vary depending on the context. For example, T may be a time series of confirmed or suspected cases, or it might be hospitalizations or deaths; in some circumstances, it may be a summation of some combination of these (e.g., confirmed + suspected cases). G is most commonly a political boundary; it might be a country, state/province, county/district, city, or sub-city region, such as a postal code or United States (U.S.) Census Bureau census tract. Depending on the context, D may simply be the entire population of G, or it might be stratified by age, sex, race, education, or other relevant factors.
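As an illustration only, the T, G, and D description above can be sketched as a small Python data structure. The class name `CaseSeries`, its field names, and the helper `combine` are hypothetical and do not come from the article; treating a missing period as zero cases when summing is also an assumption that a real pipeline would need to revisit.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict


@dataclass
class CaseSeries:
    """One time series T for a geographic region G and a demographic D."""
    region: str        # G, e.g., a country, state/province, or postal code
    demographic: str   # D, e.g., "all" or "age 0-4"
    case_type: str     # what T counts: "confirmed", "suspected", "deaths", ...
    counts: Dict[date, int] = field(default_factory=dict)  # T: period start -> count


def combine(a: CaseSeries, b: CaseSeries, case_type: str) -> CaseSeries:
    """Sum two series for the same G and D (e.g., confirmed + suspected)."""
    if (a.region, a.demographic) != (b.region, b.demographic):
        raise ValueError("series must share region and demographic")
    periods = sorted(set(a.counts) | set(b.counts))
    # Assumption: a period missing from one series contributes zero cases.
    total = {p: a.counts.get(p, 0) + b.counts.get(p, 0) for p in periods}
    return CaseSeries(a.region, a.demographic, case_type, total)


# Made-up example values, not real surveillance data.
confirmed = CaseSeries("Exampleland", "all", "confirmed",
                       {date(2018, 1, 7): 3, date(2018, 1, 14): 5})
suspected = CaseSeries("Exampleland", "all", "suspected",
                       {date(2018, 1, 7): 2})
both = combine(confirmed, suspected, "confirmed+suspected")
```

Keeping the case type explicit alongside G and D is one way to avoid silently mixing, for example, confirmed counts with confirmed + suspected counts.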

Epidemiological data have a variety of uses. From a public health perspective, they can be used to gain an understanding of population-level disease progression. This understanding can in turn be used to aid in decision-making and allocation of resources. Recent outbreaks like Ebola and Zika have demonstrated the value of accessible epidemiological data for emergency preparedness and the need for better data sharing.[1] These data may influence vaccine distribution[2], and hospitals can use them to anticipate surge capacity needs during an outbreak, allowing them to obtain extra temporary help if necessary.[3][4]

From a modeler's perspective, high-quality reference data (also commonly referred to as "ground truth data") are needed to enable prediction and forecasting.[5] These data can be used to parameterize compartmental models[6] as well as stochastic agent-based models[7][8][9][10][11], and they can also be used to train and validate machine learning and statistical models.[12][13][14][15][16][17][18][19]

The internet has become the predominant way to publish, share, and collect epidemiological data. While data standards exist for observational studies[20] and clinical research[21], for example, no such standards exist for the publication of the kind of public health-related epidemiological data described above. Despite the strong need to share and consume data, there are many legal, technical, political, and cultural challenges in implementing a standardized epidemiological data framework.[22][23] As a result, the methods by which data are presented to the public often differ significantly among data-sharing institutions (e.g., public health departments, ministries of health, data collection or aggregation services). Moreover, these problems are not unique to epidemiological data; the issues described in this paper are common across many different disciplines.

First, epidemiological data on the internet are presented to the user through a variety of interfaces. These interfaces vary widely not only in their appearance but also in their functionality. Some data are openly available through clear modern web interfaces, complete with well-documented programmer-friendly application programming interfaces (APIs), while others are displayed as static web pages that require error-prone and brittle web scraping. Still others are offered as machine-readable documents (e.g., comma-separated values [CSV], Microsoft Excel, Extensible Markup Language [XML], Adobe PDF). Finally, some necessitate contacting a human, who then prepares and sends the requested data manually.
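To make the contrast between an API and a scraped static page concrete, the following is a minimal sketch using only the Python standard library. Both payloads, the region name, and the counts are invented for illustration, and the function and class names (`from_api`, `from_page`, `TableScraper`) are hypothetical.

```python
import json
from html.parser import HTMLParser

# Hypothetical payloads carrying the same two weekly counts.
API_RESPONSE = (
    '{"region": "Exampleland", "cases": ['
    '{"week": "2018-W01", "count": 12}, {"week": "2018-W02", "count": 17}]}'
)
STATIC_PAGE = """
<table>
  <tr><th>Week</th><th>Cases</th></tr>
  <tr><td>2018-W01</td><td>12</td></tr>
  <tr><td>2018-W02</td><td>17</td></tr>
</table>
"""


def from_api(payload: str) -> dict:
    """Structured response: fields are addressed by name."""
    return {row["week"]: row["count"] for row in json.loads(payload)["cases"]}


class TableScraper(HTMLParser):
    """Collects the text of <td> cells, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            # Header rows contain only <th> cells, so they stay empty and are skipped.
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


def from_page(html: str) -> dict:
    """Scraped page: meaning is inferred from column position in the markup."""
    scraper = TableScraper()
    scraper.feed(html)
    return {week: int(count) for week, count in scraper.rows}
```

Both functions return the same dictionary for these inputs, but `from_page` depends on the column order and markup of the page, so a cosmetic redesign could break it without any change to the underlying data.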

Second, there are many data formats. Data containers (e.g., CSV, JavaScript Object Notation [JSON]) and element formats (e.g., timestamp format, location name format) may differ. Character encodings[24] (e.g., ASCII, UTF-8) and line endings[25] (e.g., \r\n, \n) may also differ. Compounding these issues, formats can change over time (e.g., renaming or reordering spreadsheet columns). More broadly, these challenges are closely tied to schema, data model, and vocabulary standardization.
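The formatting differences listed above can be absorbed, to a degree, by a defensive reader. The sketch below is one possible approach, not a method described in the article: the column names (`date`, `location`, `cases`), the list of date layouts, and the Latin-1 fallback are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import date, datetime

# Assumed, non-exhaustive list of date layouts a publisher might use.
# Note that a value like 07/01/2018 is ambiguous (day/month vs. month/day);
# this sketch assumes day/month.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def decode(raw: bytes) -> str:
    """Try UTF-8 (with optional byte order mark), then fall back to Latin-1."""
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def parse_date(text: str) -> date:
    """Return the first successful parse among the assumed layouts."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def read_cases(raw: bytes) -> list:
    """Parse CSV bytes into (date, location, cases) tuples."""
    text = decode(raw)
    # newline="" leaves \r\n and \n untouched so the csv module handles both.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    # Columns are looked up by name, so reordering them does not break parsing.
    return [(parse_date(r["date"]), r["location"], int(r["cases"])) for r in reader]


# Two made-up files with the same record: different encodings, line endings,
# column order, and date layout.
file_a = "date,location,cases\r\n2018-01-07,S\u00e3o Paulo,4\r\n".encode("utf-8")
file_b = "cases,date,location\n4,07/01/2018,S\u00e3o Paulo\n".encode("latin-1")
```

Lookup by column name tolerates reordering but not renaming; a renamed column would still raise a `KeyError`, which is the kind of breakage that a published, stable schema would avoid.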

Finally, there are differences among institutions in their reporting habits; even within a single institution, there are often reporting nuances among diseases. For example, one context may be reported monthly (e.g., Q fever in Australia), while another context is reported weekly (e.g., influenza in the U.S.) or even more finely (e.g., 2014 West African Ebola outbreak). Furthermore, what is meant by “weekly” in one context may differ from what is meant in another (e.g., CDC epi weeks vs. irregular reporting intervals in Poland, as described later).
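The ambiguity of "weekly" can be shown with two common week-numbering schemes. The sketch below assumes the usual definition of CDC epi (MMWR) weeks, which run Sunday through Saturday with week 1 being the first week that has at least four days in the new calendar year, and compares it with ISO 8601 weeks, which start on Monday. The function names are hypothetical.

```python
from datetime import date, timedelta


def mmwr_week_start(year: int) -> date:
    """Sunday that begins MMWR week 1 of `year`.

    With Sunday-start weeks, the first week with at least four days in the
    year is the week that contains January 4.
    """
    jan4 = date(year, 1, 4)
    # isoweekday(): Monday=1 ... Sunday=7, so "% 7" gives days since Sunday.
    return jan4 - timedelta(days=jan4.isoweekday() % 7)


def mmwr_week(d: date) -> tuple:
    """Return (MMWR year, MMWR week) for date d."""
    year = d.year
    if d >= mmwr_week_start(year + 1):
        year += 1          # late-December days can belong to next year's week 1
    elif d < mmwr_week_start(year):
        year -= 1          # early-January days can belong to last year's final week
    return year, (d - mmwr_week_start(year)).days // 7 + 1


def iso_week(d: date) -> tuple:
    """Return (ISO year, ISO week) for date d (weeks start on Monday)."""
    iso = d.isocalendar()
    return iso[0], iso[1]


d = date(2017, 12, 31)   # a Sunday
print(mmwr_week(d))      # (2018, 1)
print(iso_week(d))       # (2017, 52)
```

The same calendar day falls in week 1 of 2018 under one scheme and week 52 of 2017 under the other, so weekly counts from two sources cannot be aligned safely unless each source states which convention it uses.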

Together, these challenges make large-scale public health data analysis and modeling significantly more difficult and time-consuming. Gathering, cleaning, and eliciting relevant data often require more time than the actual analysis itself. This paper discusses these three key technical challenges involving public health-related epidemiological data, in detail and with examples that were identified through detailed analysis of data deposition practices around the globe. Building from this analysis, we offer a framework of best practices composed of modern standards that should be adhered to when releasing epidemiological data to the public. Such a framework will enable a more robust future for accurate and high-confidence epidemiological data and analysis.


References

  1. Chretien, J.P.; Rivers, C.M.; Johansson, M.A. (2016). "Make Data Sharing Routine to Prepare for Public Health Emergencies". PLoS Medicine 13 (8): e1002109. doi:10.1371/journal.pmed.1002109. PMC PMC4987038. PMID 27529422. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987038. 
  2. Centers for Disease Control and Prevention (2018). "Allocating and Targeting Pandemic Influenza Vaccine During an Influenza Pandemic". U.S. Department of Health and Human Services. https://asprtracie.hhs.gov/technical-resources/resource/2846/guidance-on-allocating-and-targeting-pandemic-influenza-vaccine. 
  3. Nap, R.E.; Andriessen, M.P.; Meessen, N.E. et al. (2007). "Pandemic influenza and hospital resources". Emerging Infectious Diseases 13 (11): 1714–9. doi:10.3201/eid1311.070103. PMC PMC3375786. PMID 18217556. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3375786. 
  4. Hota, S.; Fried, E.; Burry, L. et al. (2010). "Preparing your intensive care unit for the second wave of H1N1 and future surges". Critical Care Medicine 38 (4 Suppl.): e110–9. doi:10.1097/CCM.0b013e3181c66940. PMID 19935417. 
  5. Moran, K.R.; Fairchild, G.; Generous, N. et al. (2016). "Epidemic Forecasting is Messier Than Weather Forecasting: The Role of Human Behavior and Internet Data Streams in Epidemic Forecast". Journal of Infectious Diseases 214 (Suppl. 4): S404–S408. doi:10.1093/infdis/jiw375. PMC PMC5181546. PMID 28830111. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5181546. 
  6. Hethcote, H.W. (2000). "The Mathematics of Infectious Diseases". SIAM Review 42 (4): 599–653. doi:10.1137/S0036144500371907. 
  7. Eubank, S.; Guclu, H.; Kumar, V.S. et al. (2004). "Modelling disease outbreaks in realistic urban social networks". Nature 429 (6988): 180–4. doi:10.1038/nature02541. PMID 15141212. 
  8. Bisset, K.R.; Chen, J.; Feng, X. et al. (2009). "EpiFast: A fast algorithm for large scale realistic epidemic simulations on distributed memory systems". Proceedings of the 23rd international conference on Supercomputing: 430–39. doi:10.1145/1542275.1542336. 
  9. Chao, D.L.; Halstead, S.B.; Halloran, M.E. et al. (2012). "Controlling dengue with vaccines in Thailand". PLoS Neglected Tropical Diseases 6 (10): e1876. doi:10.1371/journal.pntd.0001876. PMC PMC3493390. PMID 23145197. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3493390. 
  10. Grefenstette, J.J.; Brown, S.T.; Rosenfeld, R. et al. (2013). "FRED (a Framework for Reconstructing Epidemic Dynamics): An open-source software system for modeling infectious diseases and control strategies using census-based populations". BMC Public Health 13: 940. doi:10.1186/1471-2458-13-940. PMC PMC3852955. PMID 24103508. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3852955. 
  11. McMahon, B.H.; Manore, C.A.; Hyman, J.M. et al. (2014). "Coupling Vector-host Dynamics with Weather Geography and Mitigation Measures to Model Rift Valley Fever in Africa". Mathematical Modelling of Natural Phenomena 9 (2): 161–77. doi:10.1051/mmnp/20149211. PMC PMC4398965. PMID 25892858. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4398965. 
  12. Viboud, C.; Boëlle, P.Y.; Carrat, F. et al. (2003). "Prediction of the spread of influenza epidemics by the method of analogues". American Journal of Epidemiology 158 (10): 996–1006. doi:10.1093/aje/kwg239. PMID 14607808. 
  13. Polgreen, P.M.; Chen, Y.; Pennock, D.M. et al. (2008). "Using internet searches for influenza surveillance". Clinical Infectious Diseases 47 (11): 1443–8. doi:10.1086/593098. PMID 18954267. 
  14. Ginsberg, J.; Mohebbi, M.H.; Patel, R.S. et al. (2009). "Detecting influenza epidemics using search engine query data". Nature 457 (7232): 1012–4. doi:10.1038/nature07634. PMID 19020500. 
  15. Signorini, A.; Segre, A.M.; Polgreen, P.M. et al. (2011). "The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic". PLoS One 6 (5): e19467. doi:10.1371/journal.pone.0019467. PMC PMC3087759. PMID 21573238. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3087759. 
  16. Shaman, J.; Karspeck, A.; Yang, W. et al. (2013). "Real-time influenza forecasts during the 2012-2013 season". Nature Communications 4: 2837. doi:10.1038/ncomms3837. PMC PMC3873365. PMID 24302074. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3873365. 
  17. Generous, N.; Fairchild, G.; Deshpande, A. et al. (2014). "Global disease monitoring and forecasting with Wikipedia". PLoS Computational Biology 10 (11): e1003892. doi:10.1371/journal.pcbi.1003892. PMC PMC4231164. PMID 25392913. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4231164. 
  18. Hickmann, K.S.; Fairchild, G.; Priedhorsky, R. et al. (2015). "Forecasting the 2013-2014 influenza season using Wikipedia". PLoS Computational Biology 11 (5): e1004239. doi:10.1371/journal.pcbi.1004239. PMC PMC4431683. PMID 25974758. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4431683. 
  19. Fairchild, G.; Del Valle, S.Y.; De Silva, L. et al. (2015). "Eliciting Disease Data from Wikipedia Articles". Proceedings of the 2015 International AAAI Conference on Weblogs and Social Media: 26–33. PMC PMC5511739. PMID 28721308. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5511739. 
  20. STROBE Initiative. "STROBE Statement". University of Bern. https://www.strobe-statement.org/. Retrieved 01 October 2018. 
  21. Clinical Data Interchange Standards Consortium. "CDISC". https://www.cdisc.org/. Retrieved 01 October 2018. 
  22. Pisani, E.; AbouZahr, C. (2010). "Sharing health data: Good intentions are not enough". Bulletin of the World Health Organization 88 (6): 462–6. doi:10.2471/BLT.09.074393. PMC PMC2878150. PMID 20539861. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2878150. 
  23. Edelstein, M.; Sane, J. (17 April 2015). "Overcoming Barriers to Data Sharing in Public Health: A Global Perspective". Chatham House. https://www.chathamhouse.org/publication/overcoming-barriers-data-sharing-public-health-global-perspective. 
  24. Zentgraf, D.C. (27 April 2015). "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text". Kunststube. http://kunststube.net/encoding/. Retrieved 23 August 2016. 
  25. Ministry of Health Israel. "Weekly and Periodic Epidemiological Reports". https://www.health.gov.il/UnitsOffice/HD/PH/epidemiology/Pages/epidemiology_report.aspx?WPID=WPQ7&PN=1. Retrieved 04 September 2016. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.