The FAIR Guiding Principles for scientific data management and stewardship

Full article title: The FAIR Guiding Principles for scientific data management and stewardship
Journal: Scientific Data
Author(s):
Author affiliation(s):
Primary contact: Barend Mons (e-mail)
Year published: 2016
Volume and issue: 3
Page(s): 160018
DOI: 10.1038/sdata.2016.18
ISSN: 2052-4463
Distribution license: Creative Commons Attribution 4.0 International
Website: https://www.nature.com/articles/sdata201618
Download: https://www.nature.com/articles/sdata201618.pdf (PDF)

Abstract

There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This comment article represents the first formal publication of the FAIR Principles, and it includes the rationale behind them as well as some exemplar implementations in the community.

Comment

Supporting discovery through good data management

Good data management is not a goal in itself but rather the key conduit leading to knowledge discovery and innovation, and to subsequent data and knowledge integration and reuse by the community after the data publication process. Unfortunately, the existing digital ecosystem surrounding scholarly data publication prevents us from extracting maximum benefit from our research investments (e.g., Roche et al.[1]). Partially in response to this, science funders, publishers and governmental agencies are beginning to require data management and stewardship plans for data generated in publicly funded experiments. Beyond proper collection, annotation, and archival purposes, data stewardship includes the notion of "long-term care" of valuable digital assets, with the goal that they should be discovered and re-used for downstream investigations, either alone, or in combination with newly generated data. The outcomes from good data management and stewardship, therefore, are high-quality digital publications that facilitate and simplify this ongoing process of discovery, evaluation, and reuse in downstream studies. What constitutes "good data management" is, however, largely undefined, and is generally left as a decision for the data or repository owner. Therefore, bringing some clarity around the goals and desiderata of good data management and stewardship, and defining simple guideposts to inform those who publish and/or preserve scholarly data, would be of great utility.

This article describes four foundational principles — findability, accessibility, interoperability, and reusability — that serve to guide data producers and publishers as they navigate around these obstacles, thereby helping to maximize the added value gained by contemporary, formal scholarly digital publishing. Importantly, it is our intent that the principles apply not only to "data" in the conventional sense, but also to the algorithms, tools, and workflows that led to that data. All scholarly digital research objects[2] — from data to analytical pipelines — benefit from application of these principles, since all components of the research process must be available to ensure transparency, reproducibility, and reusability.

There are numerous and diverse stakeholders who stand to benefit from overcoming these obstacles: researchers wanting to share, get credit for, and reuse each other’s data and interpretations; professional data publishers offering their services; software and tool-builders providing data analysis and processing services such as reusable workflows; funding agencies (private and public) increasingly concerned with long-term data stewardship; and a data science community mining, integrating, and analyzing new and existing data to advance discovery. To facilitate the reading of this manuscript by these diverse stakeholders, we provide definitions for common abbreviations in Box 1. Humans, however, are not the only critical stakeholders in the milieu of scientific data. Similar problems are encountered by the applications and computational agents that we task to undertake data retrieval and analysis on our behalf. These "computational stakeholders" are increasingly relevant, and they demand as much, or more, attention as their importance grows. One of the grand challenges of data-intensive science, therefore, is to improve knowledge discovery through assisting both humans and their computational agents in the discovery of, access to, and integration and analysis of task-appropriate scientific data and other scholarly digital objects.

Box 1: Terms and abbreviations
BD2K — Big Data 2 Knowledge, a trans-NIH initiative established to enable biomedical research as a digital research enterprise, to facilitate discovery and support new knowledge, and to maximise community engagement

DOI — Digital Object Identifier, a code used to permanently and stably identify (usually digital) objects; DOIs provide a standard mechanism for retrieval of metadata about the object, and generally a means to access the data object itself.

FAIR — Findable, Accessible, Interoperable, Reusable

FORCE11 — The Future of Research Communications and e-Scholarship, a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing; initiated in 2011

Interoperability — The ability of data or tools from non-cooperating resources to integrate or work together with minimal effort

JDDCP — Joint Declaration of Data Citation Principles, acknowledging data as a first-class research output and supporting good research practices around data re-use; JDDCP proposes a set of guiding principles for citation of data within scholarly literature, another dataset, or any other research object.

RDF — Resource Description Framework, a globally-accepted framework for data and knowledge representation that is intended to be read and interpreted by machines
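
Two of the Box 1 terms, DOI and RDF, can be illustrated together with a small sketch. The Python below is not taken from the article: the function names are hypothetical, and the choice of the Dublin Core terms vocabulary and the N-Triples serialization is an assumption made only for illustration. It turns this article's DOI into a resolvable identifier and describes the article with a few machine-readable subject-predicate-object statements, using values listed in the infobox above.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object), each already in N-Triples syntax.
Triple = Tuple[str, str, str]

DOI_RESOLVER = "https://doi.org/"       # public DOI resolver
DCTERMS = "http://purl.org/dc/terms/"   # Dublin Core terms namespace (assumed vocabulary)


def doi_to_uri(doi: str) -> str:
    """Turn a bare DOI into a resolvable HTTPS URI."""
    return DOI_RESOLVER + doi


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples expects."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslash, quote, and line breaks."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def describe_article(doi: str, title: str, license_uri: str) -> List[Triple]:
    """Hypothetical helper: build a minimal metadata description as triples."""
    subject = iri(doi_to_uri(doi))
    return [
        (subject, iri(DCTERMS + "title"), literal(title)),
        (subject, iri(DCTERMS + "license"), iri(license_uri)),
        (subject, iri(DCTERMS + "identifier"), literal("doi:" + doi)),
    ]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    triples = describe_article(
        doi="10.1038/sdata.2016.18",
        title="The FAIR Guiding Principles for scientific data management and stewardship",
        license_uri="https://creativecommons.org/licenses/by/4.0/",
    )
    print(to_ntriples(triples))
```

Because every statement uses globally unique identifiers for both the thing described and the property describing it, a program that has never seen this record can still combine it with statements from another source about the same DOI, which is the kind of minimal-effort integration the interoperability definition refers to.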

For certain types of important digital objects, there are well-curated, deeply integrated, special-purpose repositories such as GenBank[3], worldwide Protein Data Bank (wwPDB)[4], and UniProt[5] in the life sciences, and the Space Physics Data Facility (SPDF; http://spdf.gsfc.nasa.gov/) and Set of Identifications, Measurements and Bibliography for Astronomical Data (SIMBAD)[6] in the space sciences.

These foundational and critical core resources are continuously curating and capturing high-value reference datasets and fine-tuning them to enhance scholarly output, provide support for both human and mechanical users, and provide extensive tooling to access their content in rich, dynamic ways. However, not all datasets or even data types can be captured by, or submitted to, these repositories. Many important datasets emerging from traditional, low-throughput bench science don’t fit in the data models of these special-purpose repositories, yet these datasets are no less important with respect to integrative research, reproducibility, and reuse in general. Apparently in response to this, we see the emergence of numerous general-purpose data repositories, at scales ranging from institutional (for example, a single university), to open globally-scoped repositories such as Dataverse[7], FigShare (http://figshare.com), Dryad[8], Mendeley Data (https://data.mendeley.com/), Zenodo (http://zenodo.org/), DataHub (http://datahub.io), DANS (http://www.dans.knaw.nl/), and EUDAT.[9] Such repositories accept a wide range of data types in a wide variety of formats, generally do not attempt to integrate or harmonize the deposited data, and place few restrictions (or requirements) on the descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving away from centralization and is becoming more diverse and less integrated, thereby exacerbating the discovery and re-usability problem for both human and computational stakeholders.

A specific example of these obstacles could be imagined in the domain of gene regulation and expression analysis. Suppose a researcher has generated a dataset of differentially selected polyadenylation sites in a non-model pathogenic organism grown under a variety of environmental conditions that stimulate its pathogenic state. The researcher is interested in comparing the alternatively polyadenylated genes in this local dataset to other examples of alternative polyadenylation as well as the expression levels of these genes — both in this organism and related model organisms — during the infection process. Given that there is no special-purpose archive for differential polyadenylation data and no model organism database for this pathogen, where does the researcher begin?

References

  1. Roche, D.G.; Kruuk, L.E.; Lanfear, R.; Binning, S.A. (2015). "Public data archiving in ecology and evolution: How well are we doing?". PLOS Biology 13: e1002295. doi:10.1371/journal.pbio.1002295. PMC PMC4640582. PMID 26556502. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640582. 
  2. Bechhofer, S.; De Roure, D.; Gamble, M. et al. (2010). "Research objects: Towards exchange and reuse of digital knowledge". Nature Precedings. doi:10.1038/npre.2010.4626.1. 
  3. Benson, D.A.; Cavanaugh, M.; Clark, K. et al. (2013). "GenBank". Nucleic Acids Research 41 (D1): D36-42. doi:10.1093/nar/gks1195. PMC PMC3531190. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531190. 
  4. Berman, H.; Henrick, K.; Nakamura, H. (2003). "Announcing the worldwide Protein Data Bank". Nature Structural Biology 10 (12): 980. doi:10.1038/nsb1203-980. PMID 14634627. 
  5. UniProt Consortium (2015). "UniProt: A hub for protein information". Nucleic Acids Research 43 (D1): D204-12. doi:10.1093/nar/gku989. PMC PMC4384041. PMID 25348405. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4384041. 
  6. Wenger, M.; Ochsenbein, F.; Egret, D. et al. (2000). "The SIMBAD astronomical database: The CDS reference database for astronomical objects". Astronomy and Astrophysics Supplement Series 143 (1): 9–22. doi:10.1051/aas:2000332. 
  7. Crosas, M. (2011). "The Dataverse Network: An open-source application for sharing, discovering and preserving data". D-Lib Magazine 17 (1/2): 2. doi:10.1045/january2011-crosas. 
  8. White, H.C.; Carrier, S.; Thompson, A. et al. (2008). "The Dryad Data Repository: A Singapore Framework metadata architecture in a DSpace environment". DC-2008--Berlin Proceedings 2008: 157–162. http://dcpapers.dublincore.org/pubs/article/view/928. 
  9. Lecarpentier, D.; Wittenburg, P.; Elbers, W. (2013). "EUDAT: A new cross-disciplinary data infrastructure for science". International Journal of Digital Curation 8 (1): 279–287. doi:10.2218/ijdc.v8i1.260. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.