Difference between revisions of "Journal:Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators"

From LIMSWiki
Revision as of 22:21, 7 May 2018

Full article title: Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators
Journal: PLOS Computational Biology
Author(s): Barone, Lindsay; Williams, Jason; Micklos, David
Author affiliation(s): Cold Spring Harbor Laboratory
Primary contact: Email: lbarone at cshl dot edu
Editors: Ouellette, Francis
Year published: 2017
Volume and issue: 13(10)
Page(s): e1005755
DOI: 10.1371/journal.pcbi.1005755
ISSN: 1553-7358
Distribution license: Creative Commons Attribution 4.0 International
Website: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005755
Download: http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005755&type=printable (PDF)

Abstract

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

Introduction

Genotypic data based on DNA and RNA sequences have been the major driver of biology’s evolution into a data science. The current Illumina HiSeq X sequencing platform can generate 900 billion nucleotides of raw DNA sequence in under three days, four times the number of annotated nucleotides currently stored in GenBank, the United States “reference library” of DNA sequences.[1][2] In the last decade, a 50,000-fold reduction in the cost of DNA sequencing[3] has led to an accumulation of 9.3 quadrillion (million billion) nucleotides of raw sequence data in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The amount of sequence in the SRA doubled on average every six to eight months from 2007 to 2016.[4][5] It is estimated that by 2025, the storage of human genomes alone will require two to 40 exabytes[5] (an exabyte of storage would hold 100,000 times the printed materials of the U.S. Library of Congress[6]). Beyond genotypic data, big data are flooding biology from all quarters—phenotypic data from agricultural field trials, patient medical records, and clinical trials; image data from microscopy, medical scanning, and museum specimens; interaction data from biochemical, cellular, physiological, and ecological systems; as well as an influx of data from translational fields such as bioengineering, materials science, and biogeography.
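To give a sense of scale for the SRA doubling rate quoted above, the following is a minimal sketch of the arithmetic, not a calculation from the article itself. It assumes a constant doubling time and treats 2007 to 2016 as a nine-year span; the function name `fold_increase` is hypothetical.

```python
def fold_increase(years: float, doubling_months: float) -> float:
    """Fold growth implied by a constant doubling time over a span of years."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


# Assumption: 2007 to 2016 is treated as nine years of steady doubling.
SPAN_YEARS = 2016 - 2007

for months in (6, 8):
    growth = fold_increase(SPAN_YEARS, months)
    print(f"doubling every {months} months: ~{growth:,.0f}-fold")
```

Under these assumptions, a six-month doubling time corresponds to 18 doublings (roughly a 262,000-fold increase), while an eight-month doubling time corresponds to 13.5 doublings (roughly an 11,600-fold increase). The wide gap between the two shows how sensitive the projected growth is to the doubling time.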


Author summary

Our computational needs assessment of 704 principal investigators (PIs), with grants from the National Science Foundation (NSF) Biological Sciences Directorate (BIO), confirmed that biology is awash with big data. Nearly 90% of BIO PIs said they are currently or will soon be analyzing large data sets. They considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. However, a majority of PIs—across bioinformatics and other disciplines, large and small research groups, and four NSF BIO programs—said their institutions are not meeting nine of 13 needs. Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HPC (71%) were the three greatest unmet needs. Hardware is not the problem; data storage and HPC ranked lowest on their list of unmet needs. The problem is the growing gap between the accumulation of big data and researchers’ knowledge about how to use it effectively.

Declarations

Acknowledgements

The authors wish to thank Bob Freeman and Christina Koch of the ACI-REF project for helpful discussions and references during the development of the survey.

Funding

This study is an Education, Outreach and Training (EOT) activity of CyVerse, an NSF-funded project to develop a “cyber universe” to support life sciences research (DBI-0735191 and DBI-1265383). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

The authors have declared that no competing interests exist.

References

  1. "GenBank and WGS Statistics". National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/genbank/statistics/. Retrieved 2017. 
  2. "HiSeq X Series of Sequencing Systems" (PDF). Illumina, Inc. 22 March 2016. https://www.illumina.com/content/dam/illumina-marketing/documents/products/datasheets/datasheet-hiseq-x-ten.pdf. Retrieved 2017. 
  3. Wetterstrand, K. "DNA Sequencing Costs: Data". National Human Genome Research Institute. https://www.genome.gov/sequencingcostsdata/. Retrieved 2017.
  4. "Sequence Read Archive". National Center for Biotechnology Information. 7 November 2017. https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement. Retrieved 2017. 
  5. Stephens, Z.D.; Lee, S.Y.; Faghri, F. et al. (2015). "Big Data: Astronomical or Genomical?". PLOS Biology 13 (7): e1002195. doi:10.1371/journal.pbio.1002195. PMC4494865. PMID 26151137. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4494865.
  6. Johnston, L. (25 April 2012). "A “Library of Congress” Worth of Data: It’s All In How You Define It". The Signal. Library of Congress. https://blogs.loc.gov/thesignal/2012/04/a-library-of-congress-worth-of-data-its-all-in-how-you-define-it/. Retrieved 08 March 2017. 

Notes

This presentation is faithful to the original, with only a few minor changes to formatting. In some cases, important information was missing from the references and has been added.