|website      = [https://www.frontiersin.org/articles/10.3389/frai.2020.00031/full https://www.frontiersin.org/articles/10.3389/frai.2020.00031/full]
|download    = [https://www.frontiersin.org/articles/10.3389/frai.2020.00031/pdf https://www.frontiersin.org/articles/10.3389/frai.2020.00031/pdf] (PDF)
}}
{{ombox
| type      = notice
| image    = [[Image:Emblem-important-yellow.svg|40px]]
| style    = width: 500px;
| text      = This article should be considered a work in progress and incomplete until this notice is removed.
}}


These toxicological problems are mirrored in [[Public health|public]] and [[environmental health]] more generally as huge, complex issues with inadequately curated data and insufficient analytic power. Recent research in toxicology has focused on high-throughput screening to rapidly produce quantitative data on thousands of human biological targets<ref name="ThomasTheNext19">{{cite journal |title=The Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency |journal=Toxicological Sciences |author=Thomas, R.S.; Bahadori, T.; Buckley, T.J. et al. |volume=169 |issue=2 |pages=317–32 |year=2019 |doi=10.1093/toxsci/kfz058 |pmid=30835285 |pmc=PMC6542711}}</ref>, [[data mining]] to identify relevant end-points for building predictive models for adverse toxicological outcomes<ref name="SailiSystems19">{{cite journal |title=Systems Modeling of Developmental Vascular Toxicity |journal=Current Opinion in Toxicology |author=Saili, K.S.; Franzosa, J.A.; Baker, N.C. et al. |volume=15 |issue=1 |pages=55–63 |year=2019 |doi=10.1016/j.cotox.2019.04.004 |pmid=32030360 |pmc=PMC7004230}}</ref>, and application of cutting-edge [[machine learning]] (ML) and [[Artificial intelligence|artificial]] or augmented intelligence (AI) techniques.<ref name="LuechtefeldBig18">{{cite journal |title=Big-data and Machine Learning to Revamp Computational Toxicology and Its Use in Risk Assessment |journal=Toxicology Research |author=Luechtefeld, T.; Rowlands, C.; Hartung, T. |volume=7 |issue=5 |pages=732-744 |year=2018 |doi=10.1039/c8tx00051d |pmid=30310652 |pmc=PMC6116175}}</ref> Collectively, these technologies facilitate enhanced mechanistic insights and may obviate the need for inefficient testing in animal models, but they are still not considered mainstream approaches nor are they widely accepted by regulatory agencies.


Individual research programs generate large data sets, but without centralized coordination, standardized [[reporting]], and common storage structures, the data cannot be effectively combined and used to its full potential. The federal Tox21 research consortium, for example, has to date tested more than 9,000 chemicals to varying degrees in 1,600 assays and demonstrated environmental chemical interactions with critical human and ecologically-relevant targets.<ref name="TiceImprov13">{{cite journal |title=Improving the Human Hazard Characterization of Chemicals: A Tox21 Update |journal=Environmental Health Perspectives |author=Tice, R.R.; Austin, C.P.; Kavlock, R.J. et al. |volume=121 |issue=7 |pages=756–65 |year=2013 |doi=10.1289/ehp.1205784 |pmid=23603828 |pmc=PMC3701992}}</ref> Translational systems approaches are being employed by this and other programs (e.g., Horizon 2020, EUToxRisk, CEFIC LRI, and OpenTox) to produce diverse data streams and predict chemical effects on human health and disease outcomes.<ref name="KleinstreuerPheno14">{{cite journal |title=Phenotypic Screening of the ToxCast Chemical Library to Classify Toxic and Therapeutic Mechanisms |journal=Nature Biotechnology |author=Kleinstreuer, N.C.; Yang, J.; Berg, E.L. et al. |volume=32 |issue=6 |pages=583–91 |year=2014 |doi=10.1038/nbt.2914 |pmid=24837663}}</ref> At the same time, there have been substantial efforts to develop and deploy sensors and satellite systems that yield additional large and complex data sets that provide information about chemical exposures.<ref name="DonsWear17">{{cite journal |title=Wearable Sensors for Personal Monitoring and Estimation of Inhaled Traffic-Related Air Pollution: Evaluation of Methods |journal=Environmental Science and Technology |author=Dons, E.; Laeremans, M.; Orjuela, J.P. et al. |volume=51 |issue=3 |pages=1859-1867 |year=2017 |doi=10.1021/acs.est.6b05782 |pmid=28080048}}</ref><ref name="RingConsensus19">{{cite journal |title=Consensus Modeling of Median Chemical Intake for the U.S. Population Based on Predictions of Exposure Pathways |journal=Environmental Science and Technology |author=Ring, C.L.; Arnot, J.A.; Bennett, D.H. et al. |volume=53 |issue=2 |pages=719–32 |year=2019 |doi=10.1021/acs.est.8b04056 |pmid=30516957 |pmc=PMC6690061}}</ref><ref name="WeichenthalAPict19">{{cite journal |title=A Picture Tells a thousand…exposures: Opportunities and Challenges of Deep Learning Image Analyses in Exposure Science and Environmental Epidemiology |journal=Environmental International |author=Weichenthal, S.; Hatzopoulou, M.; Brauer, M. |volume=122 |pages=3–10 |year=2019 |doi=10.1016/j.envint.2018.11.042 |pmid=30473381}}</ref> Further, epidemiologists are actively developing ML and AI approaches to enhance understanding of chemical exposures and associated disease risks.<ref name="BrandonSim20">{{cite journal |title=Simulating Exposure-Related Behaviors Using Agent-Based Models Embedded With Needs-Based Artificial Intelligence |journal=Journal of Exposure Science and Environmental Epidemiology |author=Brandon, N.; Dionisio, K.L.; Isaacs, K. et al. |volume=30 |issue=1 |pages=184–193 |year=2020 |doi=10.1038/s41370-018-0052-y |pmid=30242268 |pmc=PMC6914672}}</ref> However, in these and other such projects, efforts are largely disconnected from one another and operate independently, despite the clear potential benefits of combining and jointly analyzing such data.


Given the need to integrate and analyze large, multifactorial data sets, researchers in public health and the environmental health sciences (EHS) would greatly benefit from the ability to collect, process, [[Data analysis|analyze]], and make inferences on data using ML and AI. However, in these fields, a general lack of relevant knowledge among many researchers; sparse, distributed, or inaccessible data; and an inadequate framework for sharing and disseminating innovations impede efforts to implement these approaches. Here, we discuss three specific areas with room for improvement in the public health/EHS field: sharing and collecting data, expanding researchers' knowledge base, and recognizing the benefits AI/ML can bring to current problems. Recommendations are provided in each of these areas to facilitate bringing big data to bear on public health and EHS challenges.


==Sharing and collecting data==
===Challenge===
A major hurdle confronting investigators conducting public health and EHS research is a lack of comprehensive human and environmental exposure and effects data that are annotated using [[Controlled vocabulary|controlled vocabularies]]. Addressing this problem is a prerequisite to applying AI and ML, as without sufficient, high-quality data and [[metadata]], the analytic methods themselves are irrelevant. Quantifying environmental exposure, such as from air, water, soil, and food, is difficult both at the micro (localized to individuals and small geographic units) and macro (national and international) levels. For instance, air pollution can vary up to eight-fold within a given city block, but most U.S. cities have only one air quality monitor.<ref name="EDFWhyNew">{{cite |url=https://www.edf.org/airqualitymaps |title=Why new technology is critical for tackling air pollution around the globe |author=Environmental Defense Fund |publisher=Environmental Defense Fund |accessdate=2019}}</ref> [[Epidemiology|Epidemiologic]] studies of air pollution health effects often must rely on disparate data that lack both temporal and spatial specificity and cannot account for the movement of people across different areas of pollution. Without continuous and advanced monitoring, and robust computer modeling methods, illnesses related to transient exposures might not be recognized as part of a significant pattern until substantial adverse health effects have occurred. This is one example where the development of AI tools in the EHS space has been hindered not by the AI technology capacity itself but instead by a lack of reliable, interconnected data.<ref name="NASLeveraging19">{{cite web |url=https://www.nap.edu/catalog/25520/leveraging-artificial-intelligence-and-machine-learning-to-advance-environmental-health-research-and-decisions |title=Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions: Proceedings of a Workshop—in Brief |author=National Academies of Sciences, Engineering, and Medicine |publisher=The National Academies Press |date=2019 |doi=10.17226/25520}}</ref> This is equally true in the medical sector with respect to patient treatment and outcomes. IBM's ambitious partnership with the MD Anderson Cancer Center to develop AI to expedite [[Clinical decision support system|clinical decision-making]] has been at a standstill after years of development due to a lack of standardized, accessible data.<ref name="JaklevicMDAnd17">{{cite web |url=http://www.healthnewsreview.org/2017/02/md-anderson-cancer-centers-ibm-watson-project-fails-journalism-related/ |title=MD Anderson Cancer Center’s IBM Watson project fails, and so did the journalism related to it |author=Jaklevic, M.C. |work=Health News Review |date=23 February 2017}}</ref>
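
The sparse-monitor problem described above is, computationally, an interpolation problem. The following minimal sketch, in Python with purely hypothetical monitor coordinates and PM<sub>2.5</sub> values, estimates exposure at an unmonitored location by inverse-distance weighting of fixed-site readings; it illustrates the kind of gap-filling that sparse networks force on analysts and is not a method attributed to any program cited here.

<syntaxhighlight lang="python">
# Minimal sketch: estimate exposure at an unmonitored location from a few
# fixed-site monitors using inverse-distance weighting (IDW). The monitor
# coordinates and PM2.5 readings below are hypothetical illustration data.
import numpy as np

def idw_estimate(query_xy, monitor_xy, monitor_values, power=2.0):
    """Inverse-distance-weighted estimate at a query point (x, y)."""
    d = np.linalg.norm(monitor_xy - np.asarray(query_xy), axis=1)
    if np.any(d < 1e-9):                      # query sits on a monitor
        return float(monitor_values[np.argmin(d)])
    w = 1.0 / d ** power                      # nearer monitors weigh more
    return float(np.sum(w * monitor_values) / np.sum(w))

# Three hypothetical monitors (positions in km) and their PM2.5 readings (ug/m3)
monitors = np.array([[0.0, 0.0], [4.0, 1.0], [1.0, 5.0]])
pm25 = np.array([8.2, 15.6, 11.3])

print(idw_estimate((2.0, 2.0), monitors, pm25))
</syntaxhighlight>

Even a scheme this simple makes plain how strongly such estimates depend on monitor density, which is precisely the limitation noted above.
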
Even when standardized data are available, finding, accessing, and processing it can be a monumental task. The absence of a uniform framework for openly sharing and storing data means that researchers devote significant time to locating relevant data. Knowledge of where to find data is often highly sector-specific, inhibiting cross-disciplinary research. For example, a climate scientist interested in public health would need knowledge of health-specific data repositories to conduct the search. Rather than waste manual effort and time in locating data, let alone integrating it, coordinated efforts could result in processes that could be automated and simplified. Ethical concerns have been voiced in regard to organizing these types of large repositories.<ref name="IencaConsid18">{{cite journal |title=Considerations for Ethics Review of Big Data Health Research: A Scoping Review |journal=PLoS One |author=Ienca, M.; Farretti, A.; Hurst, S. et al. |volume=13 |issue=10 |at=e0204937 |year=2018 |doi=10.1371/journal.pone.0204937 |pmid=30308031 |pmc=PMC6181558}}</ref> Of these, patient [[Information privacy|data privacy]] is a major concern, and breaches of patient record databases are a constant challenge. Unique patient identifier numbers and other de-identification/anonymization techniques can protect patient privacy, while allowing for meaningful research and analysis.<ref name="ElEmamAnoon15">{{cite journal |title=Anonymising and Sharing Individual Patient Data |journal=BMJ |author=El Emam, K.; Rodgers, S.; Malin, B. |volume=350 |at=h1139 |year=2015 |doi=10.1136/bmj.h1139 |pmid=25794882 |pmc=PMC4707567}}</ref> New [[encryption]]-based techniques allow for predictive modeling while maintaining the privacy of sensitive information, such as the application of homomorphic encryption to patient data in predicting cardiovascular disease.<ref name="BosPrivate14">{{cite journal |title=Private Predictive Analysis on Encrypted Medical Data |journal=Journal of Biomedical Informatics |author=Bos, J.W.; Lauter, K.; Naehrig, M. |volume=50 |pages=234–43 |year=2014 |doi=10.1016/j.jbi.2014.04.003 |pmid=24835616}}</ref> However, inconsistent [[Regulatory compliance|regulations]] and lack of practical protocols around handling sensitive information have resulted in unethical scenarios, where data is being sourced from countries where patients have minimal rights.<ref name="MittelstadtTheEthics16">{{cite journal |title=The Ethics of Big Data: Current and Foreseeable Issues in Biomedical Contexts |journal=Science and Engineering Ethics |author=Mittelstadt, B.D.; Floridi, L. |volume=22 |issue=2 |pages=303–41 |year=2016 |doi=10.1007/s11948-015-9652-2 |pmid=26002496}}</ref> Not only is this problematic from an ethical perspective, it also limits AI innovation to only those who have access to these obscure datasets. Specific tools developed by startups who have the luxury of sourcing data from elsewhere are often acquired by large corporations, making innovation an exclusive pursuit. Thus, the environmental public health field requires a revolution in the collection and organization of environmental exposure and effects data as a first step in democratizing information access and building better models to improve predictions.
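
As a concrete, hedged illustration of the de-identification step discussed above, the short Python sketch below drops direct identifiers and replaces the patient ID with a keyed hash before a record is shared. The field names and key handling are hypothetical; real projects should follow a formal de-identification standard and, where warranted, stronger techniques such as the homomorphic encryption approach cited above.

<syntaxhighlight lang="python">
# Minimal sketch of record pseudonymization prior to data sharing: direct
# identifiers are dropped and the patient ID is replaced with a keyed hash.
# Field names and the secret key are hypothetical placeholders; this is not
# a substitute for a formal de-identification or expert-determination process.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"   # held only by the data custodian

def pseudonymize(record: dict) -> dict:
    token = hmac.new(SECRET_KEY, record["patient_id"].encode(), hashlib.sha256)
    shared = {k: v for k, v in record.items()
              if k not in {"patient_id", "name", "street_address"}}
    shared["patient_token"] = token.hexdigest()  # stable, non-reversible link key
    return shared

record = {"patient_id": "A-1042", "name": "Jane Doe",
          "street_address": "12 Elm St", "zip3": "065", "diagnosis": "asthma"}
print(pseudonymize(record))
</syntaxhighlight>
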
===Recommendations===
Further work is clearly needed in data collection and sharing, but recent attempts in specific sectors are exemplary in aggregating data and developing open, accessible repositories that maintain necessary privacy standards. In 2016, over 50 contributing researchers from global institutions proposed the “Findable, Accessible, Interoperable, and Reusable” [[Journal:The FAIR Guiding Principles for scientific data management and stewardship|(FAIR) Guiding Principles]] for scientific data management and stewardship.<ref name="WilkinsonTheFAIR16">{{cite journal |title=The FAIR Guiding Principles for scientific data management and stewardship |journal=Scientific Data |author=Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. |volume=3 |pages=160018 |year=2016 |doi=10.1038/sdata.2016.18 |pmid=26978244 |pmc=PMC4792175}}</ref> These principles bridge the divide between human-conducted and machine-driven research behaviors. Using FAIR principles, the [[National Institutes of Health]] (NIH) is creating Data Commons, a platform for data management and metadata cataloging for terminologies and [[Ontology (information science)|ontologies]].<ref name="MahonyHigh18">{{cite journal |title=Highlight Report: 'Big Data in the 3R's: Outlook and Recommendations', a Roundtable Summary |journal=Archives of Toxicology |author=Mahony, C.; Currie, R.; Daston, G. et al. |volume=92 |issue=2 |pages=1015–20 |year=2018 |doi=10.1007/s00204-017-2145-0 |pmid=29340744}}</ref> This framework has been one of the key drivers behind new repositories and tools such as the National Toxicology Program's Integrated Chemical Environment (ICE) portal<ref name="BellAnInteg17">{{cite journal |title=An Integrated Chemical Environment to Support 21st-Century Toxicology |journal=Environmental Health Perspectives |author=Bell, S.M.; Phillips, J.; Sedykh, A. et al. |volume=125 |issue=5 |at=054501 |year=2017 |doi=10.1289/EHP1759 |pmid=28557712 |pmc=PMC5644972}}</ref> and the U.S. EPA's CompTox Chemicals Dashboard<ref name="WilliamsTheCompTox17">{{cite journal |title=The CompTox Chemistry Dashboard: A Community Data Resource for Environmental Chemistry |journal=Journal of Cheminformatics |author=Williams, A.J.; Grulke, C.M.; Edwards, J. et al. |volume=9 |issue=1 |at=61 |year=2017 |doi=10.1186/s13321-017-0247-6 |pmid=29185060 |pmc=PMC5705535}}</ref>, which allows FAIR principles to be applied to non-animal ''in vitro'' and ''in silico'' data, along with ''in vivo'' animal data and human exposure information. A collaboration between the U.S. [[Food and Drug Administration]] (FDA), the non-profit [[Clinical Data Interchange Standards Consortium]] (CDISC), and other stakeholders resulted in the development of study data standards for non-clinical and clinical analysis data and metadata<ref name="CDISCStandards">{{cite web |url=https://www.cdisc.org/standards |title=CDISC Standards |author=Clinical Data Interchange Standards Consortium |publisher=Clinical Data Interchange Standards Consortium |accessdate=2019}}</ref> to create common reporting formats.
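
To make the FAIR idea more concrete, the sketch below shows one form a machine-readable dataset description can take: a schema.org/Dataset-style JSON-LD record with a persistent identifier, license, controlled keywords, and measured variables. Every value is a hypothetical placeholder, and the specific metadata profiles required by repositories such as ICE, the CompTox Chemicals Dashboard, or CDISC-conformant submissions will differ.

<syntaxhighlight lang="python">
# Minimal sketch of a machine-readable dataset description along the lines of
# a schema.org/Dataset JSON-LD record, the kind of metadata that supports the
# FAIR "findable" and "reusable" principles. All identifiers and values are
# hypothetical placeholders, not a real published dataset.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example urban PM2.5 exposure panel",
    "identifier": "https://doi.org/10.xxxx/example",   # persistent identifier
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["air pollution", "PM2.5", "exposure"],
    "variableMeasured": [
        {"@type": "PropertyValue", "name": "PM2.5", "unitText": "ug/m3"}
    ],
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data/pm25.csv"
    },
}

print(json.dumps(dataset_metadata, indent=2))
</syntaxhighlight>
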
These concepts are cornerstones of the 2018 U.S. Strategic Roadmap for Modernizing Safety Testing of Chemicals and Medical Products—developed by 16 U.S. federal agencies—which advocates for practices that increase confidence in new data-driven research methods.<ref name="ICCVAM_AStrat18">{{cite web |url=https://ntp.niehs.nih.gov/whatwestudy/niceatm/natl-strategy/index.html |title=A Strategic Roadmap for Establishing New Approaches to Evaluate the Safety of Chemicals and Medical Products in the United States |author=Interagency Coordinating Committee on the Validation of Alternative Methods |publisher=National Toxicology Program, NIH |doi=10.22427/NTP-ICCVAM-ROADMAP2018 |date=January 2018 |accessdate=04 January 2019}}</ref> A significant portion of the work done by these data powerhouses is retrospective data curation, often performed manually (e.g., Kleinstreuer ''et al.''<ref name="KleinstreuerACurate16">{{cite journal |title=A Curated Database of Rodent Uterotrophic Bioactivity |journal=Environmental Health Perspectives |author=Kleinstreuer, N.C.; Ceger, P.C.; Allen, D.G. et al. |volume=124 |issue=5 |pages=556–62 |year=2016 |doi=10.1289/ehp.1510183 |pmid=26431337 |pmc=PMC4858395}}</ref>). Work is ongoing to automate some aspects of the information extraction pipeline, but additional efforts to standardize reporting formats, as well as metadata terminologies in emerging research, could lighten the curation burden on these institutions and streamline data annotation and storage, allocating greater resources to the development of novel applications.
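
The sketch below illustrates one small piece of what such automated curation can look like, assuming hypothetical column aliases and dose units: heterogeneous study exports are mapped to a controlled set of field names and converted to a common unit before storage. It is a toy example of reporting-format harmonization, not the workflow of any specific program named here.

<syntaxhighlight lang="python">
# Minimal sketch of an automated curation step: harmonize heterogeneous study
# exports into one reporting format by mapping column aliases to controlled
# field names and converting doses to a common unit. All mappings are
# hypothetical simplifications.
import pandas as pd

COLUMN_ALIASES = {"chemical": "substance_name", "cas": "casrn",
                  "casrn": "casrn", "dose": "dose_mg_per_kg"}
UNIT_FACTORS = {"mg/kg": 1.0, "ug/kg": 1e-3, "g/kg": 1e3}   # convert to mg/kg

def harmonize(df: pd.DataFrame, dose_unit: str) -> pd.DataFrame:
    out = df.rename(columns={c: COLUMN_ALIASES.get(c.lower(), c.lower())
                             for c in df.columns})
    out["dose_mg_per_kg"] = out["dose_mg_per_kg"] * UNIT_FACTORS[dose_unit]
    return out

raw = pd.DataFrame({"Chemical": ["bisphenol A"], "CAS": ["80-05-7"],
                    "Dose": [500.0]})
print(harmonize(raw, dose_unit="ug/kg"))
</syntaxhighlight>
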
Many of the recent advances in developing openly accessible databases of environmental exposure information have come from the private sector, often in partnership with non-profit organizations and academic institutions. The [https://monarchinitiative.org/ Monarch Initiative] is one such collaboration to apply ontologies, or semantic descriptions, to disease phenotypes and enable intra- and inter-species comparisons and connection to genotypes, pathways, and experimental models. Another example is a pilot project in Oakland, California, between the Environmental Defense Fund (EDF) and Google Earth Outreach, which involved attaching air quality sensors to Google Street View cars. This was recently extended to a partnership with an environmental sensor company (Aclima) to equip Street View cars with mobile air quality sensors in cities around the world.<ref name="BWAclima18">{{cite web |url=https://www.businesswire.com/news/home/20180912005440/en/Aclima-Google-Scale-Air-Quality-Mapping-Places |title=Aclima & Google Scale Air Quality Mapping to More Places Around the World |author=Aclima |work=Business Wire |date=12 September 2018 |accessdate=2019}}</ref> The sensors capture detailed air quality and emissions data at high (street-block level) resolution and temporal frequency. These data will be aggregated and made available on a Google database. Google also recently announced that it will report estimates of city-level greenhouse gas emissions and annual driving, biking, and transit ridership (data gathered via Google Maps and Waze).<ref name="MeyerGoogle18">{{cite web |url=https://www.theatlantic.com/technology/archive/2018/09/google-climate-change-greenhouse-gas-emissions/571144/ |title=Google’s New Tool to Fight Climate Change |work=The Atlantic |author=Meyer, R. |date=25 September 2018 |accessdate=2019}}</ref><ref name="GoogleEnviron">{{cite web |url=https://insights.sustainability.google/ |title=Environmental Insights Explorer |publisher=Google, Inc |accessdate=2019}}</ref> Google has also developed Dataset Search, an initial attempt to apply distributed search to datasets from the environmental and social sciences, government data, and news organizations.<ref name="NoyMaking18">{{cite web |url=https://www.blog.google/products/search/making-it-easier-discover-datasets/ |title=Making it easier to discover datasets |author=Noy, N. |work=The Keyword |publisher=Google, Inc |date=05 September 2018 |accessdate=2019}}</ref> By providing access to data from multiple disciplines via a single platform, researchers can conduct interdisciplinary work fundamental to environmental and public health.<ref name="VincentGoogle18">{{cite web |url=https://www.theverge.com/2018/9/5/17822562/google-dataset-search-service-scholar-scientific-journal-open-data-access |title=Google launches new search engine to help scientists find the datasets they need |author=Vincent, J. |work=The Verge |date=05 September 2018 |accessdate=2019}}</ref> Applying these powerful methods to better curate and integrate diverse sources of data will promote greater understanding of complex and dynamic systems. However, acceptance and implementation of these improvements are not yet widespread, particularly in the public and environmental health sectors. Following the lead of these innovative pilot studies, a greater emphasis needs to be placed on developing the appropriate infrastructure for effective, standardized data collection approaches, common ontologies, and uniform sharing protocols.
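
To give a sense of the aggregation step that such mobile-monitoring campaigns depend on, the following sketch summarizes invented GPS-tagged NO<sub>2</sub> readings into block-scale hourly means, using coarse coordinate rounding as a crude stand-in for a city block. The actual Aclima/Google processing pipeline is not published in this form and is certainly more sophisticated; this is only an illustration of the general approach.

<syntaxhighlight lang="python">
# Minimal sketch: aggregate GPS-tagged mobile sensor readings into block-scale,
# hourly summaries. The readings are invented, and rounding coordinates to
# ~0.001 degrees (on the order of 100 m) is a crude proxy for a city block.
import pandas as pd

readings = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-06-01 08:05", "2019-06-01 08:20",
                                 "2019-06-01 08:40", "2019-06-01 09:10"]),
    "lat": [37.8049, 37.8051, 37.8122, 37.8123],
    "lon": [-122.2708, -122.2711, -122.2653, -122.2652],
    "no2_ppb": [21.0, 24.5, 33.0, 29.5],
})

readings["block"] = (readings["lat"].round(3).astype(str) + "," +
                     readings["lon"].round(3).astype(str))
readings["hour"] = readings["timestamp"].dt.floor("h")

block_hourly = (readings.groupby(["block", "hour"])["no2_ppb"]
                .agg(["mean", "count"]).reset_index())
print(block_hourly)
</syntaxhighlight>
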
==Expanding researchers' knowledge base==
===Challenges===
Public health and EHS researchers are not traditionally trained in scientific computing and data science, and computer scientists do not typically apply their skill sets to EHS problems. This impedes the introduction and utilization of powerful data science techniques in public and environmental health practice and research. The American Association for Public Health, the body responsible for accrediting public health programs in the United States, currently does not include computer skills in the core competencies required of a Master of Public Health (MPH) program. This omission is a disservice both to public health students and to the field of public health research more broadly. Just as being able to read and write are fundamental skills and required baseline competencies for entering a graduate program, computer science will eventually become a similar prerequisite for comprehensive and effective analytical approaches across the biological and toxicological realm. Ultimately, the EHS and public health disciplines need a culture and skill shift. Over the next decade, scientists will need to understand the fundamentals of computer and data science in order to be successful in their field. Having this core computer science competency will be critical to the future success of transdisciplinary research in the EHS and public health disciplines.
===Recommendations===
Current public health and EHS students are interested in these issues, which means that public health schools need to integrate computer and data science into their core curricula. While students interested in big data and public health are not at a total loss for resources, provided they are willing to seek them out, there are major gaps in the academic arena. There are a handful of existing programs that currently provide the skill set needed to apply data science to public health research. The Master of Science in Data Science is an option offered by a number of universities, but few offer, or even consider, the integration of this degree with applications in the field of public health. Within the status quo, students are presented with the option of either an MPH or an MS in Data Science, with little crossover between the two. Even a program such as Harvard's Health Data Science MS, which is offered through the School of Public Health, requires only one epidemiology course and then places the onus on the student to integrate the public health perspective into their own research.
The lack of computer skills in the core competencies of the MPH is one example of an area with obvious and immediate room for improvement. However, the need for cross-disciplinary training and communication is equally relevant from the perspective of computer scientists, who should be educated and engaged in EHS and public health applications. Students from both academic disciplines should be encouraged, if not required, to engage in coursework in the other. This idea of “cross-departmental partnerships” would provide public health students with the technical skills needed to integrate computer science, AI, and ML into their work while providing insight for potential EHS projects to which computer science students could apply their skills. The thoughtful design of such a program also would foster a culture of transdisciplinary research, which will be key to finding solutions to complex environmental and public health problems. The Massachusetts Institute of Technology is laying the necessary groundwork for such a program by creating a new college for AI. With a $1 billion investment, the college is expected to start in the Fall of 2019 with the goal of integrating AI systems across academic fields.<ref name="LohrMIT18">{{cite web |url=https://www.nytimes.com/2018/10/15/technology/mit-college-artificial-intelligence.html |title=M.I.T. Plans College for Artificial Intelligence, Backed by $1 Billion |author=Lohr, S. |work=The New York Times |date=15 October 2018 |accessdate=04 January 2019}}</ref> Continuing education programs, such as the New Approach Methodology Use for Regulatory Application training series ([https://www.pcrm.org/ethical-science/animal-testing-and-alternatives/nura NURA]), are being offered to ensure that environmental public health professionals also begin to develop the skills and expertise needed to leverage big data and implement AI- and ML-based approaches. A highly topical example of the benefit of teaching public health students computational skills is the widely referenced [https://coronavirus.jhu.edu/map.html Johns Hopkins University resource] for tracking the spread of the novel [[SARS-CoV-2]] [[coronavirus]].
While the above focus on cross-discipline training is of paramount importance for future successful applications of AI and ML in the EHS, incentives for cross-disciplinary collaboration would have a more rapid impact and also would inform the integration of computer and data science into EHS curricula. Recognition of this opportunity by current EHS leadership and appropriate investments to achieve this goal would be of substantial benefit.
==Recognizing the benefits of AI and ML to public health and EHS==
===Challenge===
A downstream consequence of the challenges detailed above is that the majority of researchers in the scientific community are still unaware of the benefits that AI and ML could provide when coupled with large, annotated, integrated datasets. A lack of familiarity with AI and ML as tools means that even when presented with examples of effective predictive models, potential end-users may not adopt them, because limited understanding undermines confidence in their utility. Further, without substantial investment in data curation and integration, the ability to apply AI and ML to the creation of such models is severely limited.
===Recommendations===
While these approaches are not yet mainstream, there are many examples of extremely successful implementations of combining big data with AI and ML to build high-performance predictive models, and such case studies should be widely distributed and serve as the catalysts for increasing support in these research areas. Beyond establishing the cyber-infrastructure to generate and store open and accessible data, select researchers and government agencies are developing modeling approaches to effectively leverage those data. Aggregation of scholarly data into more structured computational models, such as quantitative structure-activity relationship (QSAR)-based chemical predictions, demonstrates the efficacy of pipelines that turn decentralized data inputs into centralized models.<ref name="MansouriAnAuto16">{{cite journal |title=An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling |journal=SAR and QSAR in Environmental Research |author=Mansouri, K.; Grulke, C.M.; Richard, A.M. et al. |volume=27 |issue=11 |pages=939–65 |year=2016 |doi=10.1080/1062936X.2016.1253611 |pmid=27885862}}</ref> Keeping these models in siloed communities is counter-productive, as the fundamental methods of model creation rely on open data provided by researchers. Drawing on standards for collaboration and sharing within the computer sciences, the NIH and the National Institute of Standards and Technology (NIST) have organized multiple hackathons and public-private partnerships to automate data extraction efforts and to create computational models that map the biological effects of chemical exposures.<ref name="KleinstreuerPredict18">{{cite journal |title=Predictive Models for Acute Oral Systemic Toxicity: A Workshop to Bridge the Gap From Research to Regulation |journal=Computational Toxicology |author=Kleinstreuer, N.C.; Karmaus, A.; Mansouri, K. et al. |volume=8 |issue=11 |pages=21–24 |year=2018 |doi=10.1016/j.comtox.2018.08.002 |pmid=30320239 |pmc=PMC6178952}}</ref> Models resulting from such enterprises have surpassed the accuracy and efficiency of traditional, manual-labor-driven animal testing.<ref name="BrowneCorrect17">{{cite journal |title=Correction to Screening Chemicals for Estrogen Receptor Bioactivity Using a Computational Model |journal=Environmental Science and Technology |author=Browne, P.; Judson, R.S.; Casey, W. et al. |volume=51 |issue=16 |at=9415 |year=2017 |doi=10.1021/acs.est.7b03317 |pmid=28767231}}</ref> Projects such as the NCATS Biomedical Data Translator are designed to establish cross-cutting infrastructures to facilitate these data integration and modeling efforts.<ref name="AustinDecon19">{{cite journal |title=Deconstructing the Translational Tower of Babel |journal=Clinical and Translational Science |author=Austin, C.P.; Colvis, C.M.; Southall, N.T. et al. |volume=12 |issue=2 |at=85 |year=2019 |doi=10.1111/cts.12595 |pmid=30412342 |pmc=PMC6440561}}</ref>
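
For readers unfamiliar with what a QSAR-style model involves computationally, the sketch below trains and cross-validates a random forest classifier on a purely synthetic descriptor table standing in for curated chemical descriptors and assay outcomes. Real pipelines such as those cited above depend on curated structures, validated descriptors, applicability-domain checks, and external evaluation; this is only a schematic of the modeling step.

<syntaxhighlight lang="python">
# Minimal sketch of a QSAR-style workflow: train a classifier on precomputed
# molecular descriptors to flag chemicals as active/inactive for an assay.
# The descriptor matrix and labels are random stand-ins, not real chemistry.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))          # 200 "chemicals" x 12 "descriptors"
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("5-fold ROC AUC:", scores.round(2))
</syntaxhighlight>
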
If the above-mentioned hurdles can be overcome, big data, AI, and ML represent a huge opportunity for the expansion and application of effective environmental public health research, as discussed at the 2019 National Academies of Sciences workshop on “Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions.”<ref name="NASEmerging19">{{cite web |url=http://nas-sites.org/emergingscience/meetings/ai/ |title=Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions |work=Emerging Science for Environmental Health Decisions |author=National Academies of Sciences, Engineering, and Medicine |date=06 June 2019}}</ref> A number of researchers are already bridging the fields of environmental public health, data science, and AI. The AI for Earth initiative partners Microsoft's data science acumen with researchers who have environmental expertise in the areas of agriculture, climate change, biodiversity, and water. Grants from AI for Earth have funded AI projects on population health modeling, analyzing images for biodiversity, forecasting crop yields, projecting climate-related landslides, modeling carbon sequestration, understanding global pathogen spread, and much more.<ref name="MicrosoftAI">{{cite web |url=https://microsoft.github.io/AIforEarth-Grantees/ |title=Microsoft AI for Earth grantee gallery |publisher=Microsoft, Inc}}</ref> AI and ML have also facilitated a more nuanced and accurate understanding of climate patterns, improving the accuracy of forecasting extreme weather events to greater than 90 percent.<ref name="ChoArtificial18">{{cite web |url=https://blogs.ei.columbia.edu/2018/06/05/artificial-intelligence-climate-environment/ |title=Artificial Intelligence—A Game Changer for Climate Change and the Environment |author=Cho, R. |work=State of the Planet |date=05 June 2018 |accessdate=2019}}</ref>
Researchers already use AI to improve air pollution forecasts<ref name="FontesCanArti14">{{cite journal |title=Can Artificial Neural Networks Be Used to Predict the Origin of Ozone Episodes? |journal=The Science of the Total Environment |author=Fontes, T.; Silva, L.M.; Silva, M.P. et al. |volume=488–489 |pages=197–207 |year=2014 |doi=10.1016/j.scitotenv.2014.04.077 |pmid=24830932}}</ref><ref name="BellingerASystem17">{{cite journal |title=A Systematic Review of Data Mining and Machine Learning for Air Pollution Epidemiology |journal=BMC Public Health |author=Bellinger, C.; Jabbar, M.S.M.; Zaïane, O. et al. |volume=17 |issue=1 |at=907 |year=2017 |doi=10.1186/s12889-017-4914-3 |pmid=29179711 |pmc=PMC5704396}}</ref><ref name="BaiAir18">{{cite journal |title=Air Pollution Forecasts: An Overview |journal=International Journal of Environmental Research and Public Health |author=Bai, L.; Wang, J.; Ma, X. et al. |volume=15 |issue=4 |at=780 |year=2018 |doi=10.3390/ijerph15040780 |pmid=29673227 |pmc=PMC5923822}}</ref>, diagnose disease<ref name="XiongAuto18">{{cite journal |title=Automatic Detection of Mycobacterium Tuberculosis Using Artificial Intelligence |journal=Journal of Thoracic Disease |author=Xiong, Y.; Ba, X. Hou, A. et al. |volume=10 |issue=3 |pages=1936-1940 |year=2018 |doi=10.21037/jtd.2018.01.91 |pmid=29707349 |pmc=PMC5906344}}</ref>, monitor infectious disease<ref name="MilinovichInternet14">{{cite journal |title=Internet-based Surveillance Systems for Monitoring Emerging Infectious Diseases |journal=The Lancet Infectious Diseases |author=Milinovich, G.J.; Williams, G.M.; Clenets, A.C.A. et al. |volume=14 |issue=2 |pages=160–8 |year=2014 |doi=10.1016/S1473-3099(13)70244-5 |pmid=24290841 |pmc=PMC7185571}}</ref><ref name="Nextstrain">{{cite web |url=https://nextstrain.org/ |title=Nextstrain |author=Bedford, T.; Neher, R.}}</ref>, track antibiotic resistance<ref name="LiTracking18">{{cite journal |title=Tracking Antibiotic Resistance Gene Pollution From Different Sources Using Machine-Learning Classification |journal=Microbiome |author=Li, L.-G.; Yin, X.; Zhang, T. |volume=6 |issue=1 |at=93 |year=2018 |doi=10.1186/s40168-018-0480-x |pmid=29793542 |pmc=PMC5966912}}</ref>, assist with computational chemistry<ref name="GohDeep17">{{cite journal |title=Deep Learning for Computational Chemistry |journal=Journal of Computational Chemistry |author=Goh, G.B.; O Hodas, N.; Vishnu, A. |volume=38 |issue=16 |pages=1291–1307 |year=2017 |doi=10.1002/jcc.24764 |pmid=28272810}}</ref>, model exposure and chemical mixtures<ref name="BobbBayesian15">{{cite journal |title=Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures |journal=Biostatistics |author=Bobb, J.F.; Valeri, L.; Henn, B.C. et al. |volume=16 |issue=3 |pages=493–508 |year=2015 |doi=10.1093/biostatistics/kxu058 |pmid=25532525 |pmc=PMC5963470}}</ref><ref name="ParkConstruct17">{{cite journal |title=Construction of Environmental Risk Score Beyond Standard Linear Models Using Machine Learning Methods: Application to Metal Mixtures, Oxidative Stress and Cardiovascular Disease in NHANES |journal=Environmental Health |author=Park, S.K.; Zhao, Z.; Mukherjee, B. |volume=16 |issue=1 |at=102 |year=2017 |doi=10.1186/s12940-017-0310-9 |pmid=28950902 |pmc=PMC5615812}}</ref><ref name="VoPhamEmerging18">{{cite journal |title=Emerging Trends in Geospatial Artificial Intelligence (geoAI): Potential Applications for Environmental Epidemiology |journal=Environmental Health |author=VoPham, T.; Hart, J.E.; Laden, F. 
et al. |volume=17 |issue=1 |at=40 |year=2018 |doi=10.1186/s12940-018-0386-x |pmid=29665858 |pmc=PMC5905121}}</ref>, and improve classification of climate regions.<ref name="LissRedefining14">{{cite journal |title=Redefining Climate Regions in the United States of America Using Satellite Remote Sensing and Machine Learning for Public Health Applications |journal=Geospatial Health |author=Liss, A.; Koch, M.; Naumova, E.N. |volume=8 |issue=3 |pages=S647–59 |year=2014 |doi=10.4081/gh.2014.294 |pmid=25599636}}</ref> These innovative approaches and their successes are just the tip of the iceberg, and they point to the potential benefit of sufficiently resourced investments in the application of AI and ML in environmental public health research.
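
As a hedged illustration of the forecasting use case, the sketch below fits a gradient boosting model to lagged values of a synthetic PM<sub>2.5</sub> series and evaluates it on a held-out period. Operational air quality forecasts additionally draw on meteorology, emissions inventories, and chemical-transport model output, so this should be read only as a schematic of the data-driven approach.

<syntaxhighlight lang="python">
# Minimal sketch of a data-driven air quality forecast: predict day t's PM2.5
# from the previous three days using gradient boosting on a synthetic series.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
t = np.arange(400)
pm25 = 12 + 4 * np.sin(2 * np.pi * t / 7) + rng.normal(scale=2.0, size=t.size)

lags = 3
X = np.column_stack([pm25[i:i - lags] for i in range(lags)])  # days t-3..t-1
y = pm25[lags:]                                               # day t

split = 300
model = GradientBoostingRegressor(random_state=0).fit(X[:split], y[:split])
mae = np.mean(np.abs(model.predict(X[split:]) - y[split:]))
print(f"held-out mean absolute error: {mae:.2f} ug/m3")
</syntaxhighlight>
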
==Conclusions and recommendations==
While grassroots innovation is increasing, top-down influencers within academia, government, industry, and funding bodies need to facilitate the conditions for AI to flourish in the fields of public and environmental health sciences. The philosophy of the “Fourth Industrial Revolution” (i.e., rapid technological advancement) differs markedly from the competitive model of typical academic research, as it requires enhanced interdisciplinary collaboration and universal data sharing and organization. As others such as Rubens ''et al.'' have noted<ref name="RubensPublic14">{{cite journal |title=Public Health in the Twenty-First Century: The Role of Advanced Technologies |journal=Frontiers in Public Health |author=Rubens, M.; Ramamoorthy, V.; Saxena, A. et al. |volume=2 |at=224 |year=2014 |doi=10.3389/fpubh.2014.00224 |pmid=25426484 |pmc=PMC4226139}}</ref>, public health is generally slower than other scientific disciplines to embrace the use of advanced technologies. If we do not collectively aspire to change the framing of public health research and education, the discipline could impede its own progress. Technical talent will gravitate to the open, collective opportunities offered by private enterprise. To prevent this, change should percolate from those currently leading the field. Researchers, educators, and practitioners need to understand the ingenious applications of emerging technologies and foster such opportunities within EHS research. Students should be encouraged, and even required, to explore these fields in core curricula. During this time of unprecedented access to technical domains, cross-disciplinary training in computer science and EHS research will empower students and professionals alike to make meaningful contributions to what is arguably the greatest revolution of their lifetime. We therefore propose actionable recommendations for leaders in the public and environmental health fields to implement and create an environment that will foster the data revolution (Table 1).
[[File:Tab1 Comess FrontArtInt2020 3.jpg|1100px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="1100px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Table 1.''' Challenges and recommendations for fostering the big data revolution in environmental public health, summarized here and detailed in text</blockquote>
|-
|}
|}
==Acknowledgements==
===Author contributions===
SC, AA, and MV formulated the research question/premise, conducted research and literature reviews, and wrote the first draft of the manuscript. NK and RH contributed writing and research to sections of the manuscript. VV, NK, LJ, and RH assisted with conceptualizing the research question and contributed to manuscript editing. All authors contributed to manuscript revision and read and approved the final manuscript.
===Conflict of interest===
LJ is employed by Microsoft Corporation.
The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The reviewer TL declared a past co-authorship with one of the authors NK to the handling editor.


==References==


==Notes==
This presentation is faithful to the original, with only a few minor changes to presentation. Some grammar and paragraph spacing was updated for improved readability. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance. The original URL for Microsoft's AI for Earth is broken; the presumably updated and correct URL is used for this version.


<!--Place all category tags here-->


Full article title: Bringing big data to bear in environmental public health: Challenges and recommendations
Journal: Frontiers in Artificial Intelligence
Author(s): Comess, Saskia; Akbay, Alexia; Vasiliou, Melpomene; Hines, Ronald N.; Joppa, Lucas; Vasiliou, Vasilis; Kleinstreuer, Nicole
Author affiliation(s): Yale University, Symbrosia Inc., U.S. EPA, Microsoft Corporation, National Institute of Environmental Health Sciences
Primary contact (email): vasilis dot vasiliou at yale dot edu and nicole dot kleinstreuer at nih dot gov
Editors: Emmert-Streib, Frank
Year published: 2020
Volume and issue: 3
Page(s): 31
DOI: 10.3389/frai.2020.00031
ISSN: 2624-8212
Distribution license: Creative Commons Attribution 4.0 International
Website: https://www.frontiersin.org/articles/10.3389/frai.2020.00031/full
Download: https://www.frontiersin.org/articles/10.3389/frai.2020.00031/pdf (PDF)

==Abstract==

Understanding the role that the environment plays in influencing public health often involves collecting and studying large, complex data sets. There have been a number of private and public efforts to gather sufficient information and confront significant unknowns in the field of environmental public health, yet there is a persistent and largely unmet need for findable, accessible, interoperable, and reusable (FAIR) data. Even when data are readily available, the ability to create, analyze, and draw conclusions from these data using emerging computational tools, such as augmented intelligence, artificial intelligence (AI), and machine learning, requires technical skills not currently implemented on a programmatic level across research hubs and academic institutions. We argue that collaborative efforts in data curation and storage, scientific computing, and training are of paramount importance to empower researchers within environmental sciences and the broader public health community to apply AI approaches and fully realize their potential. Leaders in the field were asked to prioritize challenges in incorporating big data in environmental public health research, including inconsistent implementation of FAIR principles in data collection and sharing; a lack of skilled data scientists and appropriate cyber-infrastructures; and limited understanding, identification, and communication of benefits. These issues are discussed and actionable recommendations are provided.

Keywords: artificial intelligence, public health, machine learning, open data, environmental health sciences, big data

==Introduction==

Out of the tens of thousands of individual chemicals currently in commerce (and many more mixtures, natural products, and metabolites), fewer than 10 percent have been screened for safety. The United States Environmental Protection Agency's (EPA's) Toxic Substances Control Act (TSCA) Chemical Substance Inventory contains roughly 85,000 chemicals[1], and the European Chemicals Agency (ECHA) Inventory lists over 100,000 unique substances (as of the most recent update in August 2017), of which approximately 22,000 are registered substances with detailed information on structure, usage, or toxicity.[2] Understanding which chemicals in the environment—both with and without safety data—pose a risk to human health requires that we more effectively leverage the data that we already have, and that we take intelligent approaches to generating new data. While the traditional means of collecting chemical safety data (animal models) are laborious and of variable accuracy and human relevance[3], such reference data can still be used to train models for prioritizing and predicting toxicity of new chemicals, provided the data are curated in a computationally accessible format and, ideally, integrated with other lines of evidence providing mechanistic information. This requires significant effort, both in collecting and extracting information and in annotating it appropriately.

These toxicological problems are mirrored in public and environmental health more generally as huge, complex issues with inadequately curated data and insufficient analytic power. Recent research in toxicology has focused on high-throughput screening to rapidly produce quantitative data on thousands of human biological targets[4], data mining to identify relevant end-points for building predictive models for adverse toxicological outcomes[5], and application of cutting-edge machine learning (ML) and artificial or augmented intelligence (AI) techniques.[6] Collectively, these technologies facilitate enhanced mechanistic insights and may obviate the need for inefficient testing in animal models, but they are still not considered mainstream approaches nor are they widely accepted by regulatory agencies.

Individual research programs generate large data sets, but without centralized coordination, standardized reporting, and common storage structures, the data cannot be effectively combined and used to its full potential. The federal Tox21 research consortium, for example, has to date tested more than 9,000 chemicals to varying degrees in 1,600 assays and demonstrated environmental chemical interactions with critical human and ecologically-relevant targets.[7] Translational systems approaches are being employed by this and other programs (e.g., Horizon 2020, EUToxRisk, CEFIC LRI, and OpenTox) to produce diverse data streams and predict chemical effects on human health and disease outcomes.[8] At the same time, there have been substantial efforts to develop and deploy sensors and satellite systems that yield additional large and complex data sets that provide information about chemical exposures.[9][10][11] Further, epidemiologists are actively developing ML and AI approaches to enhance understanding of chemical exposures and associated disease risks.[12] However, in these and other such projects, efforts are largely disconnected from one another and operate independently, despite the clear potential benefits of combining and jointly analyzing such data.

Given the need to integrate and analyze large, multifactorial data sets, researchers in public health and the environmental health sciences (EHS) would greatly benefit from the ability to collect, process, analyze, and make inferences on data using ML and AI. However, in these fields, a general lack of relevant knowledge among many researchers; sparse, distributed, or inaccessible data; and an inadequate framework for sharing and disseminating innovations impede efforts to implement these approaches. Here, we discuss three specific areas with room for improvement in the public health/EHS field: sharing and collecting data, expanding researchers' knowledge base, and recognizing the benefits AI/ML can bring to current problems. Recommendations are provided in each of these areas to facilitate bringing big data to bear on public health and EHS challenges.

Sharing and collecting data

Challenge

A major hurdle confronting investigators conducting public health and EHS research is a lack of comprehensive human and environmental exposure and effects data that are annotated using controlled vocabularies. Addressing this problem is a prerequisite to applying AI and ML, as without sufficient, high-quality data and metadata, the analytic methods themselves are irrelevant. Quantifying environmental exposure, such as from air, water, soil, and food, is difficult both at the micro (localized to individuals and small geographic units) and macro (national and international) levels. For instance, air pollution can vary up to eight-fold within a given city block, but most U.S. cities have only one air quality monitor.[13] Epidemiologic studies of air pollution health effects often must rely on disparate data that lack both temporal and spatial specificity and cannot account for the movement of people across different areas of pollution. Without continuous and advanced monitoring, and robust computer modeling methods, illnesses related to transient exposures might not be recognized as part of a significant pattern until substantial adverse health effects have occurred. This is one example where the development of AI tools in the EHS space has been hindered not by the AI technology capacity itself but instead by a lack of reliable, interconnected data.[14] This is equally true in the medical sector with respect to patient treatment and outcomes. IBM's ambitious partnership with the MD Anderson Cancer Center to develop AI to expedite clinical decision-making has been at a standstill after years of development due to a lack of standardized, accessible data.[15]

Even when standardized data are available, finding, accessing, and processing it can be a monumental task. The absence of a uniform framework for openly sharing and storing data means that researchers devote significant time to locating relevant data. Knowledge of where to find data is often highly sector-specific, inhibiting cross-disciplinary research. For example, a climate scientist interested in public health would need knowledge of health-specific data repositories to conduct the search. Rather than waste manual effort and time in locating data, let alone integrating it, coordinated efforts could result in processes that could be automated and simplified. Ethical concerns have been voiced in regard to organizing these types of large repositories.[16] Of these, patient data privacy is a major concern, and breaches of patient record databases are a constant challenge. Unique patient identifier numbers and other de-identification/anonymization techniques can protect patient privacy, while allowing for meaningful research and analysis.[17] New encryption-based techniques allow for predictive modeling while maintaining the privacy of sensitive information, such as the application of homomorphic encryption to patient data in predicting cardiovascular disease.[18] However, inconsistent regulations and lack of practical protocols around handling sensitive information have resulted in unethical scenarios, where data is being sourced from countries where patients have minimal rights.[19] Not only is this problematic from an ethical perspective, it also limits AI innovation to only those who have access to these obscure datasets. Specific tools developed by startups who have the luxury of sourcing data from elsewhere are often acquired by large corporations, making innovation an exclusive pursuit. Thus, the environmental public health field requires a revolution in the collection and organization of environmental exposure and effects data as a first step in democratizing information access and building better models to improve predictions.

Recommendations

Further work is clearly needed in data collection and sharing, but recent attempts in specific sectors are exemplar in the aggregation of data and development of open, accessible repositories that maintain necessary privacy standards. In 2016, over 50 contributing researchers from global institutions proposed the “Findable, Accessible, Interoperable, and Reusable” (FAIR) Guiding Principles for scientific data management and stewardship.[20] These principles bridge the divide between human-conducted and machine-driven research behaviors. Using FAIR principles, the National Institutes of Health (NIH) is creating Data Commons, a platform for data management, and metadata cataloging for terminologies and ontologies.[21] This framework has been one of the key drivers behind new repositories and tools such as the National Toxicology Program's Integrated Chemical Environment (ICE) portal[22] and the U.S. EPA's CompTox Chemicals Dashboard[23], which allows FAIR principles to be applied to non-animal in vitro and in silico data, along with in vivo animal data and human exposure information. A collaboration between the U.S. Food and Drug Administration (FDA), the non-profit Clinical Data Interchange Standards Consortium (CDISC), and other stakeholders resulted in the development of study data standards for non-clinical and clinical analysis data and metadata[24] to create common reporting formats.

These concepts are cornerstones of the 2018 U.S. Strategic Roadmap for Modernizing Safety Testing of Chemicals and Medical Products—developed by 16 U.S. federal agencies—which advocates for practices that increase confidence in new data-driven research methods.[25] A significant portion of the work done by these data powerhouses is retrospective data curation, often performed manually (e.g., Kleinstreuer et al.[26]). Work is ongoing to automate some aspects of the information extraction pipeline, but additional efforts to standardize reporting formats, as well as metadata terminologies in emerging research, could lighten the curation burden on these institutions and streamline data annotation and storage, allocating greater resources to the development of novel applications.
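
The following toy sketch (hypothetical sources, column names, and units, not the actual workflow of the programs cited above) illustrates the kind of per-source harmonization rules that manual curation currently entails, and that common reporting formats would largely eliminate.

<syntaxhighlight lang="python">
import pandas as pd

# Two notional sources reporting the same measurement with different
# column names and units (µg/m3 vs. mg/m3).
source_a = pd.DataFrame({"SubjectID": ["A1", "A2"], "PM2.5 (ug/m3)": [12.0, 9.5]})
source_b = pd.DataFrame({"participant": ["B7", "B8"], "pm25_mg_per_m3": [0.011, 0.008]})

# Hand-written mapping rules; in practice these accumulate for every new format.
def harmonize_a(df):
    return pd.DataFrame({"subject_id": df["SubjectID"],
                         "pm25_ugm3": df["PM2.5 (ug/m3)"]})

def harmonize_b(df):
    return pd.DataFrame({"subject_id": df["participant"],
                         "pm25_ugm3": df["pm25_mg_per_m3"] * 1000.0})  # mg -> µg

harmonized = pd.concat([harmonize_a(source_a), harmonize_b(source_b)], ignore_index=True)
print(harmonized)
</syntaxhighlight>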

Many of the recent advances in developing openly accessible databases of environmental exposure information have come from the private sector, often in partnership with non-profit organizations and academic institutions. The Monarch Initiative is one such collaboration, applying ontologies, or semantic descriptions, to disease phenotypes to enable intra- and inter-species comparisons and connections to genotypes, pathways, and experimental models. Another example is a pilot project in Oakland, California between the Environmental Defense Fund (EDF) and Google Earth Outreach, which involved attaching air quality sensors to Google Street View cars. This was recently extended to a partnership with an environmental sensor company (Aclima) to equip Street View cars with mobile air quality sensors in cities around the world.[27] The sensors capture detailed air quality and emissions data at high spatial (street-block level) resolution and temporal frequency. These data will be aggregated and made available in a Google database. Google also recently announced that it will report estimates of city-level greenhouse gas emissions and annual driving, biking, and transit ridership (data gathered via Google Maps and Waze).[28][29] Google has also developed Dataset Search, an initial attempt to apply distributed search to datasets from the environmental and social sciences, government data, and news organizations.[30] By providing access to data from multiple disciplines via a single platform, such tools allow researchers to conduct the interdisciplinary work fundamental to environmental and public health.[31] Applying these powerful methods to better curate and integrate diverse sources of data will promote greater understanding of complex and dynamic systems. However, acceptance and implementation of these improvements are not yet widespread, particularly in the public and environmental health sectors. Following the lead of these innovative pilot studies, a greater emphasis needs to be placed on developing the appropriate infrastructure for effective, standardized data collection approaches, common ontologies, and uniform sharing protocols.
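
The following sketch illustrates, on synthetic readings with hypothetical column names, the kind of block-level aggregation that such mobile sensor data make possible; it does not reflect the actual EDF, Aclima, or Google data model.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic drive-by readings: each row is one mobile-sensor measurement
# tagged with a street-block identifier and a timestamp.
n = 1000
readings = pd.DataFrame({
    "block_id": rng.choice(["blk-001", "blk-002", "blk-003"], size=n),
    "timestamp": pd.date_range("2018-06-01", periods=n, freq="min"),
    "no2_ppb": rng.normal(loc=18.0, scale=6.0, size=n).clip(min=0),
})

# Aggregate to block-level daily medians, the kind of product that could be
# published for epidemiologic or planning use.
daily_block_medians = (
    readings
    .assign(date=readings["timestamp"].dt.date)
    .groupby(["block_id", "date"])["no2_ppb"]
    .median()
    .reset_index(name="no2_ppb_median")
)
print(daily_block_medians.head())
</syntaxhighlight>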

Expanding researchers' knowledge base

Challenges

Public health and EHS researchers are not traditionally trained in scientific computing and data science, and computer scientists do not typically apply their skill sets to EHS problems. This impedes the introduction and utilization of powerful data science techniques in public and environmental health practice and research. The American Association for Public Health, the body responsible for accrediting public health programs in the United States, currently does not include computer skills in the core competencies required of a Master of Public Health (MPH) program. This omission is a disservice both to public health students and to the field of public health research more broadly. Just as reading and writing are fundamental skills and required baseline competencies for entering a graduate program, computer science will eventually become a similar prerequisite for comprehensive and effective analytical approaches across the biological and toxicological realm. Ultimately, the EHS and public health disciplines need a culture and skill shift. Over the next decade, scientists will need to understand the fundamentals of computer and data science in order to be successful in their field. Having this core computer science competency will be critical to the future success of transdisciplinary research in the EHS and public health disciplines.

Recommendations

Current public health and EHS students are interested in these issues, which means that public health schools need to integrate computer and data science into their core curricula. While students interested in big data and public health are not at a total loss for resources, provided they are willing to seek them out, there are major gaps in the academic arena. A handful of existing programs currently provide the skill set needed to apply data science to public health research. The Master of Science in Data Science is an option offered by a number of universities, but few offer, or even consider, integration of this degree with applications in the field of public health. Within the status quo, students are presented with the option of either an MPH or an MS in Data Science, with little crossover between the two. Even a program such as Harvard's Health Data Science MS, which is offered through the School of Public Health, requires only one epidemiology course and then places the onus on the student to integrate the public health perspective into their own research.

The lack of computer skills in the core competencies of the MPH is one example of an area with obvious and immediate room for improvement. However, the need for cross-disciplinary training and communication is equally relevant from the perspective of computer scientists, who should be educated about and engaged in EHS and public health applications. Students from both academic disciplines should be encouraged, if not required, to engage in coursework in the other. Such “cross-departmental partnerships” would provide public health students the technical skills needed to integrate computer science, AI, and ML into their work while providing computer science students with insight into potential EHS projects to which they could apply their skills. The thoughtful design of such a program would also foster a culture of transdisciplinary research, which will be key to finding solutions to complex environmental and public health problems. The Massachusetts Institute of Technology is laying the necessary groundwork for such a program by creating a new college for AI. With a $1 billion investment, the college is expected to open in the fall of 2019 with the goal of integrating AI systems across academic fields.[32] Continuing education programs, such as the New Approach Methodology Use for Regulatory Application (NURA) training series, are being offered to ensure that environmental public health professionals also begin to develop the skills and expertise needed to leverage big data and implement AI- and ML-based approaches. A highly topical example of the benefit of teaching public health students computational skills is the widely referenced Johns Hopkins University resource for tracking the spread of the novel coronavirus SARS-CoV-2.

While the above focus on cross-discipline training is of paramount importance for future successful applications of AI and ML in the EHS, incentives for cross-disciplinary collaboration would have a more rapid impact and also would inform the integration of computer and data science into EHS curricula. Recognition of this opportunity by current EHS leadership and appropriate investments to achieve this goal would be of substantial benefit.

Recognizing the benefits of AI and ML to public health and EHS

Challenges

A downstream consequence of the challenges detailed above is that the majority of researchers in the scientific community are still unaware of the benefits that AI and ML could provide when coupled with large, annotated, integrated datasets. A lack of familiarity with AI and ML as tools means that even when presented with examples of effective predictive models, potential end-users may not adopt them due to a lack of understanding, leading to decreased confidence in their utility. Further, without substantial investment in data curation and integration, the ability to apply AI and ML to the creation of such models is severely limited.

Recommendations

While these approaches are not yet mainstream, there are many examples of extremely successful implementations that combine big data with AI and ML to build high-performance predictive models, and such case studies should be widely distributed and serve as catalysts for increasing support in these research areas. Beyond establishing the cyberinfrastructure to generate and store open and accessible data, select researchers and government agencies are developing modeling approaches to effectively leverage those data. Aggregation of scholarly data into more structured computational models, such as quantitative structure–activity relationship (QSAR)-based chemical predictions, demonstrates the efficacy of pipelines that turn decentralized data inputs into centralized models.[33] Keeping these models in siloed communities is counterproductive, as the fundamental methods of model creation rely on open data provided by researchers. Drawing on standards for collaboration and sharing within the computer sciences, the NIH and the National Institute of Standards and Technology (NIST) have organized multiple hackathons and public-private partnerships to automate data extraction efforts and to create computational models that map the biological effects of chemical exposures.[34] Models resulting from such enterprises have surpassed the accuracy and efficiency of traditional, manual labor-driven animal testing.[35] Projects such as the NCATS Biomedical Data Translator are designed to establish cross-cutting infrastructures that facilitate these data integration and modeling efforts.[36]
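
As a minimal, purely illustrative sketch of a QSAR-style modeling pipeline (synthetic descriptors and activity labels, not any of the cited models or datasets), the following snippet trains a random forest classifier to predict a binary bioactivity outcome from numerical molecular descriptors and reports cross-validated performance.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic stand-in for a curated QSAR table: rows are chemicals, columns are
# numerical molecular descriptors (e.g., logP, molecular weight, atom counts).
n_chemicals, n_descriptors = 500, 20
X = rng.normal(size=(n_chemicals, n_descriptors))

# Synthetic binary activity label loosely driven by two descriptors, so the
# model has a real signal to learn.
y = ((0.8 * X[:, 0] - 0.5 * X[:, 3]
      + rng.normal(scale=0.5, size=n_chemicals)) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(f"Cross-validated balanced accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
</syntaxhighlight>

In practice, a defensible QSAR model also depends on curated chemical structures, validated descriptors, and an assessment of the model's applicability domain (e.g., the curation procedures described by Mansouri et al.[33]).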

If the above-mentioned hurdles can be overcome, big data, AI, and ML represent a huge opportunity for the expansion and application of effective environmental public health research, as discussed at a 2019 National Academies of Sciences, Engineering, and Medicine workshop on “Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions.”[37] A number of researchers are already bridging the fields of environmental public health, data science, and AI. The AI for Earth initiative partners Microsoft's data science acumen with researchers who have environmental expertise in the areas of agriculture, climate change, biodiversity, and water. Grants from AI for Earth have funded AI projects on population health modeling, image analysis for biodiversity, crop yield forecasting, projection of climate-related landslides, carbon sequestration modeling, understanding global pathogen spread, and much more.[38] AI and ML have also facilitated a more nuanced and accurate understanding of climate patterns, improving the accuracy of forecasting extreme weather events to greater than 90 percent.[39]

Researchers already use AI to improve air pollution forecasts[40][41][42], diagnose disease[43], monitor infectious disease[44][45], track antibiotic resistance[46], assist with computational chemistry[47], model exposure and chemical mixtures[48][49][50], and improve classification of climate regions.[51] These innovative approaches and their successes are just the tip of the iceberg, and they point to the potential benefit of sufficiently resourced investments in the application of AI and ML in environmental public health research.
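
As one hedged illustration of the air pollution forecasting use case (synthetic data and simple lag features, not the methods of the studies cited above), the following sketch fits a gradient boosting regressor to predict the next day's PM2.5 concentration from recent history.

<syntaxhighlight lang="python">
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

# Synthetic daily PM2.5 series with a seasonal cycle plus noise.
days = pd.date_range("2017-01-01", periods=730, freq="D")
pm25 = 12 + 6 * np.sin(2 * np.pi * days.dayofyear / 365.0) + rng.normal(scale=2.0, size=len(days))
series = pd.Series(pm25, index=days, name="pm25")

# Lag features: yesterday's and last week's concentrations predict today's value.
df = pd.DataFrame({
    "lag1": series.shift(1),
    "lag7": series.shift(7),
    "target": series,
}).dropna()

# Chronological split: train on the first year, evaluate on the remainder.
train, test = df.iloc[:365], df.iloc[365:]
model = GradientBoostingRegressor(random_state=0)
model.fit(train[["lag1", "lag7"]], train["target"])
pred = model.predict(test[["lag1", "lag7"]])
print(f"Mean absolute error on held-out days: {mean_absolute_error(test['target'], pred):.2f} µg/m³")
</syntaxhighlight>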

Conclusions and recommendations

While grassroots innovation is increasing, top-down influencers within academia, government, industry, and funding bodies need to create the conditions for AI to flourish in the public and environmental health sciences. The philosophy of the “Fourth Industrial Revolution” (i.e., rapid technological advancement) differs markedly from the competitive model of typical academic research, as it requires enhanced interdisciplinary collaboration and universal data sharing and organization. As others such as Rubens et al. have noted[52], public health is generally slower than other scientific disciplines to embrace the use of advanced technologies. If we do not collectively aspire to change the framing of public health research and education, the discipline could impede its own progress, and technical talent will gravitate to the open, collective opportunities offered by private enterprise. To prevent this, change should percolate from those currently leading the field. Researchers, educators, and practitioners need to understand the ingenious applications of emerging technologies and foster such opportunities within EHS research. Students should be encouraged, and even required, to explore these fields in core curricula. During this time of unprecedented access to technical domains, cross-disciplinary training in computer science and EHS research will empower students and professionals alike to make meaningful contributions to what is arguably the greatest revolution of their lifetime. We therefore propose actionable recommendations for leaders in the public and environmental health fields to implement in order to create an environment that will foster the data revolution (Table 1).


Tab1 Comess FrontArtInt2020 3.jpg

Table 1. Challenges and recommendations for fostering the big data revolution in environmental public health, summarized here and detailed in text

Acknowledgements

Author contributions

SC, AA, and MV formulated the research question/premise, conducted research and literature reviews, and wrote the first draft of the manuscript. NK and RH contributed writing and research to sections of the manuscript. VV, NK, LJ, and RH assisted with conceptualizing the research question and contributed to manuscript editing. All authors contributed to manuscript revision and read and approved the final manuscript.

Conflict of interest

LJ is employed by Microsoft Corporation.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer TL declared a past co-authorship with one of the authors NK to the handling editor.

References

  1. Environmental Protection Agency. "TSCA Chemical Substance Inventory: About the TSCA Chemical Substance Inventory". Environmental Protection Agency. https://www.epa.gov/tsca-inventory/about-tsca-chemical-substance-inventory. Retrieved 04 January 2019. 
  2. European Chemicals Agency. "EC Inventory". European Chemicals Agency. https://echa.europa.eu/information-on-chemicals/ec-inventory. Retrieved 04 January 2019. 
  3. Hartung, T. (2009). "Toxicology for the Twenty-First Century". Nature 460 (7252): 208–12. doi:10.1038/460208a. PMID 19587762. 
  4. Thomas, R.S.; Bahadori, T.; Buckley, T.J. et al. (2019). "The Next Generation Blueprint of Computational Toxicology at the U.S. Environmental Protection Agency". Toxicological Sciences 169 (2): 317–32. doi:10.1093/toxsci/kfz058. PMC PMC6542711. PMID 30835285. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6542711. 
  5. Saili, K.S.; Franzosa, J.A.; Baker, N.C. et al. (2019). "Systems Modeling of Developmental Vascular Toxicity". Current Opinion in Toxicology 15 (1): 55–63. doi:10.1016/j.cotox.2019.04.004. PMC PMC7004230. PMID 32030360. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7004230. 
  6. Luechtefeld, T.; Rowlands, C.; Hartung, T. (2018). "Big-data and Machine Learning to Revamp Computational Toxicology and Its Use in Risk Assessment". Toxicology Research 7 (5): 732-744. doi:10.1039/c8tx00051d. PMC PMC6116175. PMID 30310652. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6116175. 
  7. Tice, R.R.; Austin, C.P.; Kavlock, R.J. et al. (2013). "Improving the Human Hazard Characterization of Chemicals: A Tox21 Update". Environmental Health Perspectives 121 (7): 756–65. doi:10.1289/ehp.1205784. PMC PMC3701992. PMID 23603828. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3701992. 
  8. Kleinstreuer, N.C.; Yang, J.; Berg, E.L. et al. (2014). "Phenotypic Screening of the ToxCast Chemical Library to Classify Toxic and Therapeutic Mechanisms". Nature Biotechnology 32 (6): 583–91. doi:10.1038/nbt.2914. PMID 24837663. 
  9. Dons, E.; Laeremans, M.; Orjuela, J.P. et al. (2017). "Wearable Sensors for Personal Monitoring and Estimation of Inhaled Traffic-Related Air Pollution: Evaluation of Methods". Environmental Science and Technology 51 (3): 1859-1867. doi:10.1021/acs.est.6b05782. PMID 28080048. 
  10. Ring, C.L.; Arnot, J.A.; Bennett, D.H. et al. (2019). "Consensus Modeling of Median Chemical Intake for the U.S. Population Based on Predictions of Exposure Pathways". Environmental Science and Technology 53 (2): 719–32. doi:10.1021/acs.est.8b04056. PMC PMC6690061. PMID 30516957. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6690061. 
  11. Weichenthal, S.; Hatzopoulou, M.; Brauer, M. (2019). "A Picture Tells a thousand…exposures: Opportunities and Challenges of Deep Learning Image Analyses in Exposure Science and Environmental Epidemiology". Environmental International 122: 3–10. doi:10.1016/j.envint.2018.11.042. PMID 30473381. 
  12. Brandon, N.; Dionisio, K.L.; Isaacs, K. et al. (2020). "Simulating Exposure-Related Behaviors Using Agent-Based Models Embedded With Needs-Based Artificial Intelligence". Journal of Exposure Science and Environmental Epidemiology 30 (1): 184–193. doi:10.1038/s41370-018-0052-y. PMC PMC6914672. PMID 30242268. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6914672. 
  13. Environmental Defense Fund. "Why new technology is critical for tackling air pollution around the globe". Environmental Defense Fund. https://www.edf.org/airqualitymaps. Retrieved 2019. 
  14. National Academies of Sciences, Engineering, and Medicine (2019). "Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions: Proceedings of a Workshop—in Brief". The National Academies Press. doi:10.17226/25520. https://www.nap.edu/catalog/25520/leveraging-artificial-intelligence-and-machine-learning-to-advance-environmental-health-research-and-decisions. 
  15. Jaklevic, M.C. (23 February 2017). "MD Anderson Cancer Center’s IBM Watson project fails, and so did the journalism related to it". Health News Review. http://www.healthnewsreview.org/2017/02/md-anderson-cancer-centers-ibm-watson-project-fails-journalism-related/. 
  16. Ienca, M.; Farretti, A.; Hurst, S. et al. (2018). "Considerations for Ethics Review of Big Data Health Research: A Scoping Review". PLoS One 13 (10): e0204937. doi:10.1371/journal.pone.0204937. PMC PMC6181558. PMID 30308031. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6181558. 
  17. El Emam, K.; Rodgers, S.; Malin, B. (2015). "Anonymising and Sharing Individual Patient Data". BMJ 350: h1139. doi:10.1136/bmj.h1139. PMC PMC4707567. PMID 25794882. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4707567. 
  18. Bos, J.W.; Lauter, K.; Naehrig, M. (2014). "Private Predictive Analysis on Encrypted Medical Data". Journal of Biomedical Informatics 50: 234–43. doi:10.1016/j.jbi.2014.04.003. PMID 24835616. 
  19. Mittelstadt, B.D.; Floridi, L. (2016). "The Ethics of Big Data: Current and Foreseeable Issues in Biomedical Contexts". Science and Engineering Ethics 22 (2): 303–41. doi:10.1007/s11948-015-9652-2. PMID 26002496. 
  20. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC PMC4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  21. Mahony, C.; Currie, R.; Daston, G. et al. (2018). "Highlight Report: 'Big Data in the 3R's: Outlook and Recommendations', a Roundtable Summary". Archives of Toxicology 92 (2): 1015–20. doi:10.1007/s00204-017-2145-0. PMID 29340744. 
  22. Bell, S.M.; Phillips, J.; Sedykh, A. et al. (2017). "An Integrated Chemical Environment to Support 21st-Century Toxicology". Environmental Health Perspectives 125 (5): 054501. doi:10.1289/EHP1759. PMC PMC5644972. PMID 28557712. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5644972. 
  23. Williams, A.J.; Grulke, C.M.; Edwards, J. et al. (2017). "The CompTox Chemistry Dashboard: A Community Data Resource for Environmental Chemistry". Journal of Cheminformatics 9 (1): 61. doi:10.1186/s13321-017-0247-6. PMC PMC5705535. PMID 29185060. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5705535. 
  24. Clinical Data Interchange Standards Consortium. "CDISC Standards". Clinical Data Interchange Standards Consortium. https://www.cdisc.org/standards. Retrieved 2019. 
  25. Interagency Coordinating Committee on the Validation of Alternative Methods (January 2018). "A Strategic Roadmap for Establishing New Approaches to Evaluate the Safety of Chemicals and Medical Products in the United States". National Toxicology Program, NIH. doi:10.22427/NTP-ICCVAM-ROADMAP2018. https://ntp.niehs.nih.gov/whatwestudy/niceatm/natl-strategy/index.html. Retrieved 04 January 2019. 
  26. Kleinstreuer, N.C.; Ceger, P.C.; Allen, D.G. et al. (2016). "A Curated Database of Rodent Uterotrophic Bioactivity". Environmental Health Perspectives 124 (5): 556–62. doi:10.1289/ehp.1510183. PMC PMC4858395. PMID 26431337. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4858395. 
  27. Aclima (12 September 2018). "Aclima & Google Scale Air Quality Mapping to More Places Around the World". Business Wire. https://www.businesswire.com/news/home/20180912005440/en/Aclima-Google-Scale-Air-Quality-Mapping-Places. Retrieved 2019. 
  28. Meyer, R. (25 September 2018). "Google’s New Tool to Fight Climate Change". The Atlantic. https://www.theatlantic.com/technology/archive/2018/09/google-climate-change-greenhouse-gas-emissions/571144/. Retrieved 2019. 
  29. "Environmental Insights Explorer". Google, Inc. https://insights.sustainability.google/. Retrieved 2019. 
  30. Noy, N. (5 September 2018). "Making it easier to discover datasets". The Keyword. Google, Inc. https://www.blog.google/products/search/making-it-easier-discover-datasets/. Retrieved 2019. 
  31. Vincent, J. (5 September 2018). "Google launches new search engine to help scientists find the datasets they need". The Verge. https://www.theverge.com/2018/9/5/17822562/google-dataset-search-service-scholar-scientific-journal-open-data-access. Retrieved 2019. 
  32. Lohr, S. (15 October 2018). "M.I.T. Plans College for Artificial Intelligence, Backed by $1 Billion". The New York Times. https://www.nytimes.com/2018/10/15/technology/mit-college-artificial-intelligence.html. Retrieved 04 January 2019. 
  33. Mansouri, K.; Grulke, C.M.; Richard, A.M. et al. (2016). "An Automated Curation Procedure for Addressing Chemical Errors and Inconsistencies in Public Datasets Used in QSAR Modelling". SAR and QSAR in Environmental Research 27 (11): 939–65. doi:10.1080/1062936X.2016.1253611. PMID 27885862. 
  34. Kleinstreuer, N.C.; Karmaus, A.; Mansouri, K. et al. (2018). "Predictive Models for Acute Oral Systemic Toxicity: A Workshop to Bridge the Gap From Research to Regulation". Computational Toxicology 8 (11): 21–24. doi:10.1016/j.comtox.2018.08.002. PMC PMC6178952. PMID 30320239. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6178952. 
  35. Browne, P.; Judson, R.S.; Casey, W. et al. (2017). "Correction to Screening Chemicals for Estrogen Receptor Bioactivity Using a Computational Model". Environmental Science and Technology 51 (16): 9415. doi:10.1021/acs.est.7b03317. PMID 28767231. 
  36. Austin, C.P.; Colvis, C.M.; Southall, N.T. et al. (2019). "Deconstructing the Translational Tower of Babel". Clinical and Translational Science 12 (2): 85. doi:10.1111/cts.12595. PMC PMC6440561. PMID 30412342. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6440561. 
  37. National Academies of Sciences, Engineering, and Medicine (6 June 2019). "Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions". Emerging Science for Environmental Health Decisions. http://nas-sites.org/emergingscience/meetings/ai/. 
  38. "Microsoft AI for Earth grantee gallery". Microsoft, Inc. https://microsoft.github.io/AIforEarth-Grantees/. 
  39. Cho, R. (5 June 2018). "Artificial Intelligence—A Game Changer for Climate Change and the Environment". State of the Planet. https://blogs.ei.columbia.edu/2018/06/05/artificial-intelligence-climate-environment/. Retrieved 2019. 
  40. Fontes, T.; Silva, L.M.; Silva, M.P. et al. (2014). "Can Artificial Neural Networks Be Used to Predict the Origin of Ozone Episodes?". The Science of the Total Environment 488–489: 197–207. doi:10.1016/j.scitotenv.2014.04.077. PMID 24830932. 
  41. Bellinger, C.; Jabbar, M.S.M.; Zaïane, O. et al. (2017). "A Systematic Review of Data Mining and Machine Learning for Air Pollution Epidemiology". BMC Public Health 17 (1): 907. doi:10.1186/s12889-017-4914-3. PMC PMC5704396. PMID 29179711. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5704396. 
  42. Bai, L.; Wang, J.; Ma, X. et al. (2018). "Air Pollution Forecasts: An Overview". International Journal of Environmental Research and Public Health 15 (4): 780. doi:10.3390/ijerph15040780. PMC PMC5923822. PMID 29673227. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5923822. 
  43. Xiong, Y.; Ba, X.; Hou, A. et al. (2018). "Automatic Detection of Mycobacterium Tuberculosis Using Artificial Intelligence". Journal of Thoracic Disease 10 (3): 1936-1940. doi:10.21037/jtd.2018.01.91. PMC PMC5906344. PMID 29707349. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5906344. 
  44. Milinovich, G.J.; Williams, G.M.; Clements, A.C.A. et al. (2014). "Internet-based Surveillance Systems for Monitoring Emerging Infectious Diseases". The Lancet Infectious Diseases 14 (2): 160–8. doi:10.1016/S1473-3099(13)70244-5. PMC PMC7185571. PMID 24290841. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185571. 
  45. Bedford, T.; Neher, R. "Nextstrain". https://nextstrain.org/. 
  46. Li, L.-G.; Yin, X.; Zhang, T. (2018). "Tracking Antibiotic Resistance Gene Pollution From Different Sources Using Machine-Learning Classification". Microbiome 6 (1): 93. doi:10.1186/s40168-018-0480-x. PMC PMC5966912. PMID 29793542. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5966912. 
  47. Goh, G.B.; O Hodas, N.; Vishnu, A. (2017). "Deep Learning for Computational Chemistry". Journal of Computational Chemistry 38 (16): 1291–1307. doi:10.1002/jcc.24764. PMID 28272810. 
  48. Bobb, J.F.; Valeri, L.; Henn, B.C. et al. (2015). "Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures". Biostatistics 16 (3): 493–508. doi:10.1093/biostatistics/kxu058. PMC PMC5963470. PMID 25532525. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963470. 
  49. Park, S.K.; Zhao, Z.; Mukherjee, B. (2017). "Construction of Environmental Risk Score Beyond Standard Linear Models Using Machine Learning Methods: Application to Metal Mixtures, Oxidative Stress and Cardiovascular Disease in NHANES". Environmental Health 16 (1): 102. doi:10.1186/s12940-017-0310-9. PMC PMC5615812. PMID 28950902. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5615812. 
  50. VoPham, T.; Hart, J.E.; Laden, F. et al. (2018). "Emerging Trends in Geospatial Artificial Intelligence (geoAI): Potential Applications for Environmental Epidemiology". Environmental Health 17 (1): 40. doi:10.1186/s12940-018-0386-x. PMC PMC5905121. PMID 29665858. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5905121. 
  51. Liss, A.; Koch, M.; Naumova, E.N. (2014). "Redefining Climate Regions in the United States of America Using Satellite Remote Sensing and Machine Learning for Public Health Applications". Geospatial Health 8 (3): S647–59. doi:10.4081/gh.2014.294. PMID 25599636. 
  52. Rubens, M.; Ramamoorthy, V.; Saxena, A. et al. (2014). "Public Health in the Twenty-First Century: The Role of Advanced Technologies". Frontiers in Public Health 2: 224. doi:10.3389/fpubh.2014.00224. PMC PMC4226139. PMID 25426484. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4226139. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Some grammar and paragraph spacing were updated for improved readability. In some cases, important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance. The original URL for Microsoft's AI for Earth is broken; the presumably updated and correct URL is used for this version.