Journal:Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators

Full article title	Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators
Journal	PLOS Computational Biology
Author(s)	Barone, Lindsay; Williams, Jason; Micklos, David
Author affiliation(s)	Cold Spring Harbor Laboratory
Primary contact	Email: lbarone at cshl dot edu
Editors	Ouellette, Francis
Year published	2017
Volume and issue	13(10)
Page(s)	e1005755
DOI	10.1371/journal.pcbi.1005755
ISSN	1553-7358
Distribution license	Creative Commons Attribution 4.0 International
Website	http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005755
Download	http://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1005755&type=printable (PDF)

Abstract

In a 2016 survey of 704 National Science Foundation (NSF) Biological Sciences Directorate principal investigators (BIO PIs), nearly 90% indicated they are currently or will soon be analyzing large data sets. BIO PIs considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. Previous studies in the United States and Canada emphasized infrastructure needs. However, BIO PIs said the most pressing unmet needs are training in data integration, data management, and scaling analyses for HPC, acknowledging that data science skills will be required to build a deeper understanding of life. This portends a growing data knowledge gap in biology and challenges institutions and funding agencies to redouble their support for computational training in biology.

Author summary

Our computational needs assessment of 704 principal investigators (PIs), with grants from the National Science Foundation (NSF) Biological Sciences Directorate (BIO), confirmed that biology is awash with big data. Nearly 90% of BIO PIs said they are currently or will soon be analyzing large data sets. They considered a range of computational needs important to their work, including high-performance computing (HPC), bioinformatics support, multistep workflows, updated analysis software, and the ability to store, share, and publish data. However, a majority of PIs—across bioinformatics and other disciplines, large and small research groups, and four NSF BIO programs—said their institutions are not meeting nine of 13 needs. Training on integration of multiple data types (89%), on data management and metadata (78%), and on scaling analysis to cloud/HPC (71%) were the three greatest unmet needs. Hardware is not the problem; data storage and HPC ranked lowest on their list of unmet needs. The problem is the growing gap between the accumulation of big data and researchers’ knowledge about how to use it effectively.

Declarations

Acknowledgements

The authors wish to thank Bob Freeman and Christina Koch of the ACI-REF project for helpful discussions and references during the development of the survey.

Funding

This study is an Education, Outreach and Training (EOT) activity of CyVerse, an NSF-funded project to develop a “cyber universe” to support life sciences research (DBI-0735191 and DBI-1265383). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

The authors have declared that no competing interests exist.

References

Notes

This presentation is faithful to the original, with only a few minor formatting changes. In some cases important information was missing from the references, and that information was added.