Difference between revisions of "Journal:Ten simple rules for maximizing the recommendations of the NIH data management and sharing plan"

From LIMSWiki
Revision as of 21:14, 22 March 2023

Full article title Ten simple rules for maximizing the recommendations of the NIH data management and sharing plan
Journal PLOS Computational Biology
Author(s) Gonzales, Sara; Carson, Matthew B.; Holmes, Kristi
Author affiliation(s) Northwestern University
Primary contact Email: sara dot gonzales2 at northwestern dot edu
Year published 2022
Volume and issue 18(8)
Article # e1010397
DOI 10.1371/journal.pcbi.1010397
ISSN 1553-7358
Distribution license Creative Commons Attribution 4.0 International
Website https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1010397
Download https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1010397&type=printable (PDF)

Abstract

The National Institutes of Health (NIH) Policy for Data Management and Sharing (DMS Policy) recognizes the NIH’s role as a key steward of the United States' biomedical research and information and seeks to enhance that stewardship through systematic recommendations for the preservation and sharing of research data generated by funded projects. The policy is effective as of January 2023. The recommendations include a requirement for the submission of a data management and sharing plan (DMSP) with funding applications, and while no strict template was provided, the NIH has released supplemental draft guidance on elements to consider when developing such a plan. This article provides 10 key recommendations for creating a DMSP that is both maximally compliant and effective.

Keywords: data management, data sharing

Introduction

Clinical and translational researchers have been aware of the increasing data management requirements of the National Institutes of Health (NIH) since its initial release of policies for data management and sharing in 2003. [1] The initial requirement of submission of a data sharing plan applied to funding applications of $500,000 or more in direct costs per year, and that requirement has evolved over the years in order to accommodate the nuances of managing clinical data, as well as the increasing sophistication of research data management. After releasing a new Draft Data Management and Sharing (DMS) Policy and Supplemental Draft Guidance for comment in November 2019 [2], the NIH incorporated feedback from the community to produce the Final DMS Policy in October 2020. [3] The Final DMS Policy requires a one- to two-page data management and sharing plan (DMSP) to be submitted with the application for all funded research. The intent of the policy, as the policy itself states, is to encourage data sharing to the extent that it is possible. The NIH expects that “researchers are prospectively planning for data sharing, which we anticipate will increasingly lead researchers to integrate data sharing into the routine conduct of research.” The NIH adds that, “[a]ccordingly, we have included in the final DMS Policy an expectation that researchers will maximize appropriate data sharing when developing plans.” [3]

Sharing research data securely and efficiently is a key step toward supporting and advancing translational science, as it allows for savings in researcher time and effort and greater assurance of reproducibility. Concerns with research replicability and reproducibility lie behind the NIH’s guidelines and have been documented extensively in the literature with regard to the larger research community. [4–6] Open science practices, including publication of protocols and sharing of code, go a long way toward enabling research reproducibility. Sharing of the de-identified data from clinical studies, when possible, is also a crucial step.

Data sharing on the level required by the new policy is not new to researchers in certain fields, such as those familiar with the NIH Genomic Data Sharing Policy [7], the Model Organism Sharing Policy [8], and other existing sharing policies in the clinical research sphere where NIH funding is involved. [9] The update to existing practices required by the new policy is the requirement of submission of a DMSP with all NIH-funded research submissions, with an expectation of compliance and adherence to the plan (with allowances made for updates) throughout the lifecycle of funded projects.

The 10 simple rules below are intended to assist researchers in writing a plan that is both compliant with the new data management and sharing requirements and designed to be incorporated as seamlessly as possible into research workflows. The rules are ordered as they pertain to the sections of the Elements of an NIH Data Management and Sharing Plan (DMSP Elements), the NIH’s supplemental guidance document on creating a data management and sharing plan, to demonstrate practical ways to meet the requirements (Fig. 1).


Figure 1. "Ten simple rules for maximizing the recommendations of the NIH data management and sharing plan" mapped against the Elements of an NIH Data Management and Sharing Plan. This mapping shows how the rules in this article map to the recommended elements of an NIH data management and sharing plan as defined in the Supplemental Information to the NIH Policy for Data Management and Sharing: Elements of an NIH Data Management and Sharing Plan.

Rule 1: Describe the data: What is it, how much will be generated, and what is the level of processing?

NOTE: This rule corresponds to DMSP Elements: “Data Types,” point 1.

The DMSP Elements guidance requires a description of the types of data that will be generated in the course of the project, including information about the data’s modality, level of aggregation, and level of processing. [10] Though the project has not yet begun at the time of the DMSP submission (which accompanies the budget justification in the grant application), list the data types the research team anticipates will be created. This can be addressed by describing the following:

  • Modality (or high-level category): List the overall type of data to be created, such as genomic, imaging, text sequences, modeling data, etc.
  • Formats: List the anticipated data formats to be created, such as CSV, TSV, XML, JSON, fMRI files, SAV, SAS, DTA.
  • Amount: To the extent possible, list the number of files expected to be generated and/or their anticipated storage space (terabytes of data, petabytes, etc.).
  • Aggregation: List whether individual or aggregated data provides insights into the research question(s) and also which type (aggregated or non-aggregated) will be shared.
  • Processing: List the anticipated level of processing that will be pursued in the project and also the processing level of data that will be shared.
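
As a rough illustration of the list above, the following minimal Python sketch tabulates anticipated data types and turns file counts into a storage estimate. It is only one possible way to organize this information; the class name, field names, and every figure in it are hypothetical placeholders, not values taken from the NIH guidance.

```python
from dataclasses import dataclass


@dataclass
class DataTypeEstimate:
    """One anticipated data type; all field names here are illustrative."""
    modality: str            # high-level category, e.g. "genomic"
    file_format: str         # e.g. "TSV", "JSON"
    n_files: int             # anticipated number of files
    avg_file_size_mb: float  # rough average size per file, in megabytes
    aggregation: str         # "individual-level" or "aggregated"
    processing: str          # "raw" or "processed"

    @property
    def total_gb(self) -> float:
        # Decimal units: 1 GB = 1000 MB
        return self.n_files * self.avg_file_size_mb / 1000


def summarize(estimates):
    """Build a short plain-text summary suitable for drafting a DMSP."""
    lines = [
        f"{e.modality} ({e.file_format}): {e.n_files} files, "
        f"~{e.total_gb:.1f} GB, {e.aggregation}, {e.processing}"
        for e in estimates
    ]
    total = sum(e.total_gb for e in estimates)
    lines.append(f"Total: ~{total:.1f} GB")
    return "\n".join(lines)


# Hypothetical project figures, for illustration only.
estimates = [
    DataTypeEstimate("genomic", "TSV", 2000, 150.0, "individual-level", "raw"),
    DataTypeEstimate("modeling data", "JSON", 12, 5.0, "aggregated", "processed"),
]
print(summarize(estimates))
```

Keeping the estimate in a small script like this makes it easy to revise the numbers if the plan is updated later in the project.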

Regarding the portions of project data that may be shared, as referenced above, keep in mind that sharing of all data from the project is not required. Subsets of the full dataset may be shared based on what is legally and ethically permitted for sharing (more on this in rules to follow). Subsets can include portions of the data demonstrating the principles outlined in a resulting publication, small representative de-identified subsets, subsets allowing replication of the study, etc.

Rule 2: Choose documentation types from the beginning of the project

NOTE: This rule corresponds to DMSP Elements: “Data Types,” point 2.

The NIH’s DMSP Elements requires, in addition to a description of the project data that will be produced, a description of the portion of project data that will be preserved and shared. [10] Though the project has yet to formally begin, the research team may already have in mind such categories of data, as well as the metadata descriptions that will accompany data throughout its lifecycle, and the types of documentation that will be employed in the project to keep track of the data. Though detailed documentation examples are not required at the time of submission (and would be too lengthy for a one- to two-page data management and sharing plan), it is a good time to consider the documentation that will be used in the project, which may consist of:

  • Metadata documentation: Explain whether the project will describe data using metadata such as the NIH Common Data Elements [11], the MIAME or MINSEQE [12] standards, or other metadata vocabularies that can be found through resources such as the Digital Curation Centre (DCC). [13]
  • Data dictionary: A data dictionary describes aspects of the data at the most granular level. This document is generally maintained in spreadsheet form and outlines details of each variable, including both human-readable and “coded” names, definitions, units of measurement, data types and ranges allowed, and permissible null values. [14]
  • README files: A README contains detailed information about data file formats, as well as data collection methodology, including details on instruments and software used, explanations of relationships between files, and details on quality control (QC) practices. [15] The format is generally a brief explanatory document outlining dataset structures, terminology, and definitions that make research data files easier to understand for secondary users, regardless of where these files are stored.
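
To make the data dictionary idea more concrete, here is a minimal Python sketch of a dictionary kept as CSV and used to check incoming values. The column names, the two variables, their ranges, and the -999 null code are all assumptions made up for illustration; a real project would define its own.

```python
import csv
import io

# Hypothetical data dictionary: one row per variable, with coded and
# human-readable names, definition, unit, type, allowed range, null code.
DATA_DICTIONARY_CSV = """\
coded_name,readable_name,definition,unit,data_type,min,max,null_value
pt_age,Participant age,Age at enrollment,years,int,18,120,-999
sbp,Systolic blood pressure,Seated systolic reading,mmHg,float,50,300,-999
"""


def load_dictionary(text):
    """Index the dictionary rows by coded variable name."""
    return {row["coded_name"]: row for row in csv.DictReader(io.StringIO(text))}


def check_value(dictionary, name, raw):
    """Return None if the raw string is acceptable, else an error message."""
    entry = dictionary.get(name)
    if entry is None:
        return f"{name}: not in data dictionary"
    if raw == entry["null_value"]:
        return None  # permissible null code
    caster = {"int": int, "float": float}[entry["data_type"]]
    try:
        value = caster(raw)
    except ValueError:
        return f"{name}: {raw!r} is not a valid {entry['data_type']}"
    if not float(entry["min"]) <= value <= float(entry["max"]):
        return f"{name}: {value} outside allowed range {entry['min']}-{entry['max']}"
    return None


dictionary = load_dictionary(DATA_DICTIONARY_CSV)
for name, raw in [("pt_age", "42"), ("pt_age", "17"), ("sbp", "abc")]:
    print(name, raw, "->", check_value(dictionary, name, raw) or "ok")
```

Because the dictionary is plain CSV, the same file can be opened as a spreadsheet by collaborators and deposited alongside the data.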

The above-mentioned files will be helpful to have in later stages of the project, enabling compliance when the data-sharing stage nears. For any data that is not planned to be preserved and shared online for legal, ethical, or other reasons, a rationale is requested in the DMSP. Descriptive metadata that provides general information on the content of the files can help reinforce such rationales. In such cases, the types of descriptive files outlined above can serve to represent sensitive datasets without divulging protected information. Moreover, these descriptive files can be made available and discoverable through an institutional, generalist, or discipline-specific repository, with metadata denoting the location of the data and more detailed information about brokering access and use of the data.

Rule 3: Describe the tools and software to be used in the project

NOTE: This rule corresponds to DMSP Elements: “Related Tools, Software, and/or Code”.

The DMSP Elements recommends providing “an indication of whether specialized tools are needed to access or manipulate shared scientific data to support replication or reuse, and name(s) of the needed tool(s) and software.” [10] This requirement accompanies and complements the requirements for sharing information about project data because knowledge of the tools and software used in the project supports reproducibility, which is an underlying motivation of the Final NIH Data Management and Sharing Policy. Reproducibility is “the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results.” [16] Data is just one part of the original materials used in a study; the software and tools used to gather and manipulate the data are equally important. Data scientist and reproducibility expert Victoria Stodden emphasizes the importance of computational reproducibility, that is, providing information about the code, scripts, hardware, software, and implementation details of a study in order to enable full reproducibility, given the integral part that computers and software play in modern science. [17]

In a compliant DMSP, describe the following:

  • Devices that will be used to collect project data;
  • Software or programming languages that will be used to work with the data (e.g., Python, STATA, R);
  • Whether the tools and software are open-source (typically free to use) or proprietary (typically requiring a purchase or license); and
  • If known, how long the tools and software will be usable to access the data (e.g., until a software program’s end-of-life date).
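
For projects that use Python, one way to gather the software details listed above is to record them programmatically. The sketch below relies only on the Python standard library (`platform` and `importlib.metadata`, available in Python 3.8 and later); the function name and the package names passed to it are hypothetical examples, and other languages such as R or STATA would need their own equivalent.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, operating system, and package version details."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Package names are examples only; list whatever the project uses.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

The resulting JSON can be saved next to the data or summarized in a README so that secondary users know which tools and versions were involved.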

Rule 4: Use standard file types, identifiers, and descriptive elements

NOTE: This rule corresponds to DMSP Elements: “Standards”.



References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, grammar, and punctuation. In some cases important information was missing from the references, and that information was added.