Journal:Water, water, everywhere: Defining and assessing data sharing in academia

Full article title: Water, water, everywhere: Defining and assessing data sharing in academia
Journal: PLOS ONE
Author(s): Van Tuyl, S.; Whitmire, Amanda L.
Author affiliation(s): Oregon State University, Stanford University
Primary contact: Email: steve dot vantuyl at oregonstate dot edu
Editors: Ouzounis, Christos A.
Year published: 2016
Volume and issue: 11(2)
Page(s): e0147942
DOI: 10.1371/journal.pone.0147942
ISSN: 1932-6203
Distribution license: Creative Commons Attribution 4.0 International
Website: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147942
Download: http://journals.plos.org/plosone/article/asset?id=10.1371%2Fjournal.pone.0147942.PDF (PDF)

Abstract

Sharing of research data has begun to gain traction in many areas of the sciences in the past few years because of changing expectations from the scientific community, funding agencies, and academic journals. National Science Foundation (NSF) requirements for a data management plan (DMP) went into effect in 2011, with the intent of facilitating the dissemination and sharing of research results. Many projects that were funded during 2011 and 2012 should now have implemented the elements of the data management plans required for their grant proposals. In this paper we define "data sharing" and present a protocol for assessing whether data have been shared and how effective the sharing was. We then evaluate the data sharing practices of researchers funded by the NSF at Oregon State University in two ways: by attempting to discover project-level research data using the associated DMP as a starting point, and by examining data sharing associated with journal articles that acknowledge NSF support. Sharing at both the project level and the journal article level was not carried out in the majority of cases, and when sharing was accomplished, the shared data were often of questionable usability due to access, documentation, and formatting issues. We close the article by offering recommendations for how data producers, journal publishers, data repositories, and funding agencies can facilitate the process of sharing data in a meaningful way.

Introduction

“It is one thing to encourage data deposition and resource sharing through guidelines and policy statements, and quite another to ensure that it happens in practice.”[1]

In 2011, the National Science Foundation (NSF) reaffirmed a longstanding requirement for the dissemination and sharing of research results by adding a requirement for the submission of a data management plan (DMP) with grant proposals.[2] DMPs are intended to explain how researchers will address the requirement that they will “share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.”[3] The expectation that NSF-funded researchers will share data has been in place since at least 1995, the year of the oldest NSF Grant Proposal Guide that we could locate in the NSF online archive[4], but the requirement is likely much older. A memorandum put forth by the White House Office of Science and Technology Policy (OSTP) in 2013 aimed at ensuring public access to the results of federally funded research[5], and the subsequent responses from funding agencies, lend credence to the notion that federal funding agencies are now beginning to take seriously the idea that federally funded data are products that should be managed and shared in order to maximize scientific output from federal investments.

While the NSF does not currently require sharing the dataset that underlies an article at the time of publication, many scientific journals have begun to require or request data sharing as part of the publication process.[6] This move has been motivated by recent high-profile cases of scientific misconduct related to falsified or poorly analyzed data[7] and the increasing acknowledgment among scientific communities that data sharing should be part of the process of communicating research results.[8][9][10][11]

A challenge has arisen, though, in defining data sharing in a way that is useful to a broad spectrum of data producers and consumers. The NSF, for example, has been reluctant to define not only data sharing and data sharing best practices, but even the meaning of data itself, insisting that these definitions should “be determined by the community of interest through the process of peer review and program management,” rather than being mandated.[12] This lack of guidance has caused some level of confusion among researchers trying to share data, and among service providers attempting to offer venues for data sharing. We have begun to see communities of practice offering guidance on best practices for data sharing from individual research domains (for examples, see references [1], [13], [14] and DataONE.org) and from broader efforts such as Force11[15] and Kratz and Strasser.[9] While many of these resources are helpful for understanding how to effectively share data, we have yet to see a rubric for evaluating how well a dataset is shared and assessing where improvements should be made to facilitate more effective sharing.

In this study we set a definition of data sharing and create a rubric for evaluating how well data have been shared at two significant levels: for research projects as a whole and as a dataset that underlies a journal article. We focus on research projects because of the NSF and OSTP focus on project-level requirements for data sharing (as cited above), and on journal articles because these represent a logical and common venue for data sharing.[16] We use our rubric to evaluate data sharing from NSF-funded projects that began after the data management plan requirement went into effect. Likewise, we use our rubric to evaluate data sharing in journal articles that originate from NSF-funded research projects subject to that policy. We conclude by offering guidance on best practices for facilitating data sharing to authors, journals, data repositories, and funding agencies.

Methods

Definition of data sharing

In this paper, we define criteria for assessing the effectiveness of data sharing under the assumption that the goal of data sharing is to make the data usable, with minimal effort, to the new user. We take the position that the bar should be set relatively low for the usability of data in order to facilitate the downstream goals of sharing such as validation of results, reproducibility of research, and use of data for novel science.

The criteria chosen for this research (elaborated below) were developed in consideration of recommendations from the academic literature (e.g. [9], [13], [15]), from best practices identified by organizations focused on data sharing (e.g. [15] and DataONE), and our experiences as data service providers and data users. Based on these sources, we define data sharing as addressing four criteria: Discoverability, Accessibility, Transparency, and Actionability (DATA). Specifically, data should be:

Discoverable — One should be able to get to the data using direct and persistent links.[13][15] Pointing in a general way to a repository or database without providing an identifier or specific location does not ensure the intended dataset will be found. Lack of specificity in location results in a lower discoverability score.
Accessible — The data should be shared via an open platform with minimal barriers to access.[9][13][15] Examples of barriers to access would be having to contact the dataset creator for permission to use the dataset or having to provide an explanation for how or why one wants to use the data.
Transparent — The data should have collocated documentation that describes it sufficiently for an expert to use.[9][13][15] Citing the methods section of an article is not sufficient because articles lack significant details that describe the dataset (definition of headers, units, processing steps, etc.). Relying on a paper for dataset description is also strongly discouraged because most papers are not accessible without a subscription to the parent journal. Likewise, referring to external community standards of practice (SOP) is not likely to be a robust descriptive mechanism over the long term as SOPs change over time and their provenance may not be clearly documented.
Actionable — One should be able to use the data in analytical platforms with minimal reprocessing or reformatting.[13][15] For example, sharing quantitative data as a figure in an article or as a table in a PDF requires burdensome reformatting. These data are not considered actionable.

NSF-funded project data and protocol

We used an advanced search of the NSF Awards Database to identify NSF-funded projects at Oregon State University (OSU) that started on or after the start date of the DMP requirement (18 January 2011) and no later than the end of 2013. Projects with a later start date are not likely to be far enough along to have shared much data. We set an end date parameter of 01 July 2015 in order to exclude ongoing projects that would overlap with this research. While we recognize that projects with a recent end date are not required to have shared any data yet, we wanted to avoid unnecessarily excluding projects that may have shared data during the course of the research. This query resulted in 91 projects. Within this set of search results, the OSU Office of Sponsored Programs was able to provide us with about one-third of the DMPs (N = 33).
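The date constraints above amount to a simple window filter. The sketch below only illustrates that logic in Python; the actual selection was done through the NSF Awards Database advanced search, and the award record fields shown are hypothetical placeholders.

```python
from datetime import date

# Study window taken from the text above; field names are hypothetical.
DMP_REQUIREMENT_START = date(2011, 1, 18)  # DMP requirement took effect
LATEST_PROJECT_START = date(2013, 12, 31)  # projects starting later excluded
LATEST_PROJECT_END = date(2015, 7, 1)      # ongoing projects excluded


def in_study_window(award):
    """Return True if an award's start and end dates fall in the study window."""
    return (DMP_REQUIREMENT_START <= award["start_date"] <= LATEST_PROJECT_START
            and award["end_date"] <= LATEST_PROJECT_END)


# Example use over exported award records:
# selected = [a for a in awards if in_study_window(a)]
```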

The process of attempting to locate datasets based on a DMP started in the most obvious way: by looking where the DMP stated that data would be made available. If a dataset was not found in the location specified in the DMP, we looked in three additional locations (listed below). Given that it can be years between when a proposal is submitted and when a dataset is ready to be shared, we anticipated that there would be deviations from DMPs in the actual venue for data sharing. Our search protocol for discovering datasets associated with these DMPs therefore included looking in the following places, in order:

  1. Location specified in DMP, if applicable
  2. NSF award page (use a simple search; include expired awards)
  3. PI website(s), research group or project website
  4. DataCite Metadata Search for author name (http://search.datacite.org/ui)

We used the DataCite Metadata Search as a catch-all for cases when more directed searches did not yield results. Datasets that have been assigned a digital object identifier (DOI) have, in most cases, been assigned a DOI through the DataCite Registration Agency and are thus searchable via that interface. At this time, an openly accessible, consolidated registry for locating datasets across domains and repository type (for example, federal data centers, institutional repositories and standalone repositories like Dryad or figshare) does not exist. There are repository registries, such as Re3data.org, that facilitate locating a data repository of interest, but there isn’t a mechanism to search across the repositories that are listed therein. We did not target specific databases or repositories unless they were explicitly mentioned in the DMP because it would be too time-intensive to search every known database by hand.
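The discovery steps above form a fallback chain that bottoms out in a DataCite search. The sketch below illustrates only that control flow: the first three steps are represented by hypothetical callables standing in for manual checks, and the DataCite REST endpoint and parameters are assumptions (the study used the search interface at http://search.datacite.org/ui, not an API).

```python
import requests  # third-party HTTP client


def search_datacite(author_name):
    """Search DataCite for records matching an author name.

    Endpoint and parameters are assumptions; the study used the web
    interface at http://search.datacite.org/ui rather than the REST API.
    """
    resp = requests.get("https://api.datacite.org/dois",
                        params={"query": author_name}, timeout=30)
    resp.raise_for_status()
    return resp.json().get("data", [])


def locate_shared_data(check_dmp_location, check_nsf_award_page,
                       check_pi_websites, author_name):
    """Walk the four discovery steps in order, stopping at the first hit.

    The three `check_*` arguments are hypothetical callables that stand in
    for the manual searches described in steps 1-3 of the protocol above.
    """
    for check in (check_dmp_location, check_nsf_award_page, check_pi_websites):
        found = check()
        if found:
            return found
    return search_datacite(author_name)  # step 4: catch-all DataCite search
```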

NSF-funded journal articles data and protocol

We used the Thomson Reuters Web of Science database to identify journal articles produced by OSU faculty and funded by the National Science Foundation in the years 2012, 2013, and 2014. We selected this year range to minimize the number of papers affiliated with NSF-funded projects from before the NSF data management plan mandate went into effect in 2011. This query to Web of Science returned 1,013 journal articles, for which we exported all of the data allowed by the database including, but not limited to, authors, title, journal title, and DOI. From this list of journal articles, we selected a random sample of 120 articles to review.
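A minimal sketch of the sampling step, assuming the Web of Science export has already been saved as a tab-delimited file (the file name, delimiter, and seed are hypothetical; the paper does not describe how the random sample was drawn):

```python
import csv
import random

# Hypothetical tab-delimited Web of Science export containing fields such as
# authors, title, journal title, and DOI.
with open("wos_export.tsv", newline="", encoding="utf-8") as f:
    articles = list(csv.DictReader(f, delimiter="\t"))

random.seed(0)                           # illustrative seed for repeatability
sample = random.sample(articles, k=120)  # 120 of the 1,013 retrieved articles
```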

We reviewed each article to determine whether data were shared with the article using the following steps:

  1. Scan the landing page for the article at the journal website for links or references to shared or supplementary data.
  2. Scan the article for information about shared data in the acknowledgments, supplementary data, appendices, and references.
  3. Scan the methods section for links or references to the datasets used in the paper (including links to repositories, accession numbers, references to data sources, etc.).
  4. Search the entire document for the word “data” and scrutinize all mentions of the word for references to shared data. If the paper is related in some way to simulation or modeling, search the entire document for the words (and variants thereof) “parameter”, “calibration”, and “validation” and scrutinize all mentions of these words for references to shared data.
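Step 4 lends itself to a simple keyword scan. The following sketch assumes the article's full text is already available as a plain string (extracting that text from a publisher PDF or HTML page is a separate problem not addressed here):

```python
import re

# Keywords from step 4; stems cover the variants mentioned in the protocol.
BASE_PATTERN = re.compile(r"\bdata\b", re.IGNORECASE)
MODELING_PATTERNS = [
    re.compile(r"\bparamet\w*", re.IGNORECASE),   # parameter(s), parameterization
    re.compile(r"\bcalibrat\w*", re.IGNORECASE),  # calibration, calibrated
    re.compile(r"\bvalidat\w*", re.IGNORECASE),   # validation, validated
]


def sentences_to_review(full_text, is_modeling_paper=False):
    """Return sentences containing the scan keywords, for manual scrutiny."""
    patterns = [BASE_PATTERN] + (MODELING_PATTERNS if is_modeling_paper else [])
    sentences = re.split(r"(?<=[.!?])\s+", full_text)
    return [s for s in sentences if any(p.search(s) for p in patterns)]
```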

Data sharing evaluation protocol

Evaluations for each element of DATA (Discoverable, Accessible, Transparent, and Actionable) were made for each resource (project or journal article; see the data collection protocols above) to assess the quality of its data sharing. For each DATA element, the resource was assigned a score of insufficient (0), partial (1), or full (2) compliance with that element of data sharing. A final DATA score was assigned by summing the scores of the individual DATA elements. We assessed the data sharing practices of journal articles and funded projects based on the criteria in Table 1.
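In code, the aggregation step is just a validated sum over the four element scores; a minimal sketch (the element names and 0-2 scale follow the protocol above, the function itself is illustrative):

```python
DATA_ELEMENTS = ("Discoverable", "Accessible", "Transparent", "Actionable")
VALID_SCORES = {0, 1, 2}  # insufficient, partial, full compliance


def total_data_score(element_scores):
    """Sum per-element scores (0-2 each) into a total DATA score (0-8).

    `element_scores` maps each DATA element to its assigned score, e.g.
    {"Discoverable": 2, "Accessible": 1, "Transparent": 0, "Actionable": 2}.
    """
    missing = set(DATA_ELEMENTS) - set(element_scores)
    if missing:
        raise ValueError(f"Missing scores for: {sorted(missing)}")
    if any(score not in VALID_SCORES for score in element_scores.values()):
        raise ValueError("Each element score must be 0, 1, or 2.")
    return sum(element_scores[element] for element in DATA_ELEMENTS)
```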

[Image: Journal.pone.0147942.t001.PNG]

Table 1. Scoring criteria for the effectiveness of data sharing.

Results

NSF-funded projects

We reviewed 33 NSF data management plans acquired from the OSU Office of Sponsored Programs and attempted to locate shared data resulting from the associated projects. Of these, eight DMPs (24 percent) were associated with proposals for workshops, equipment, conferences or other activities that were not expected to generate data. That left 25 NSF-funded projects with DMPs available for which shared data could potentially be found.[17]

Of the 25 projects for which we attempted to locate shared datasets and generate a DATA score, 19 (76 percent) had an overall score of 0 (Fig 1). Of the remaining six projects, one each had a score of 2, 5, 6, or 8, and two had a score of 7 (Table 2).

[Image: Journal.pone.0147942.g001.PNG]

Figure 1. Total DATA scores from 25 NSF-funded projects, as located via data management plans.

[Image: Journal.pone.0147942.t002.PNG]

Table 2. Non-zero DATA scores from 25 NSF-funded projects, with element scores shown.

Data availability

All raw and processed data for this paper are shared at ScholarsArchive@OSU - Oregon State University's repository for scholarly materials. Data may be accessed at: http://dx.doi.org/10.7267/N9W66HPQ.

Funding

Publication of this article in an open access journal was funded by the Oregon State University Libraries & Press Open Access Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

The authors have declared that no competing interests exist.

References

  1. Schofield, P.N.; Bubela, T.; Weaver, T. et al. (2009). "Post-publication sharing of data and tools". Nature 461 (7261): 171–3. doi:10.1038/461171a. PMID 19741686.
  2. National Science Foundation (January 2011). "Significant Changes to the GPG". GPG Subject Index. National Science Foundation. http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_sigchanges.jsp. 
  3. National Science Foundation (October 2012). "Chapter VI - Other Post Award Requirements and Considerations, section D.4.b". Proposal and Award Policies and Procedures Guide: Part II - Award & Administration Guide. National Science Foundation. http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/aag_6.jsp#VID4. 
  4. National Science Foundation (17 August 1995). "Grant Proposal Guide". National Science Foundation. http://www.nsf.gov/publications/pub_summ.jsp?ods_key=nsf9527&org=NSF. 
  5. "Memorandum for the heads of executive departments and agencies: Increasing access to the results of federally funded scientific research" (PDF). Executive Office of the President, Office of Science and Technology Policy. 22 February 2013. https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf. 
  6. Sturges, P.; Bamkin, M.; Anders, J.H.S. et al. (2015). "Research data sharing: Developing a stakeholder-driven model for journal policies". Journal of the Association for Information Science and Technology 66 (12): 2445–2455. doi:10.1002/asi.23336. 
  7. The Editorial Board (1 June 2015). "Scientists Who Cheat". The New York Times. The New York Times Company. http://www.nytimes.com/2015/06/01/opinion/scientists-who-cheat.html. 
  8. Martone, M.E. (2014). "Brain and Behavior: We want you to share your data". Brain and Behavior 4 (1): 1–3. doi:10.1002/brb3.192. PMC 3937699. PMID 24653948. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937699.
  9. Kratz, J.; Strasser, C. (2014). "Data publication consensus and controversies". F1000Research 3: 94. doi:10.12688/f1000research.3979.3. PMC 4097345. PMID 25075301. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4097345.
  10. McNutt, M. (2015). "Data, eternal". Science 347 (6217): 7. doi:10.1126/science.aaa5057. PMID 25554763. 
  11. Bloom, T.; Ganley, E.; Winker, M. (2014). "Data Access for the Open Access Literature: PLOS's Data Policy". PLOS Medicine 11 (2): e1001607. doi:10.1371/journal.pmed.1001607. PMC 3934818. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3934818.
  12. National Science Foundation (30 November 2010). "Data Management & Sharing Frequently Asked Questions (FAQs)". National Science Foundation. http://www.nsf.gov/bfa/dias/policy/dmpfaqs.jsp. 
  13. White, E.P.; Baldridge, E.; Brym, Z.T. et al. (2013). "Nine simple ways to make it easier to (re)use your data". IEE Ideas in Ecology and Evolution 6 (2): 1–10. doi:10.4033/iee.2013.6b.6.f.
  14. Kervin, K.E.; Michener, W.K.; Cook, R.B. (2013). "Common Errors in Ecological Data Sharing". Journal of eScience Librarianship 2 (2): e1024. doi:10.7191/jeslib.2013.1024. 
  15. "The FAIR Data Principles - For Comment". Force11. https://www.force11.org/group/fairgroup/fairprinciples. Retrieved 10 July 2015.
  16. Ferguson, L. (3 November 2014). "How and why researchers share data (and why they don't)". Wiley Exchanges. John Wiley & Sons, Inc. https://hub.wiley.com/community/exchanges/discover/blog/2014/11/03/how-and-why-researchers-share-data-and-why-they-dont. Retrieved 07 July 2015. 
  17. Van Tuyl, S.; Whitmire, A.L. (2015). "Data from: Water, water everywhere: Defining and assessing data sharing in academia". ScholarsArchive@OSU - Datasets. Oregon State University. doi:10.7267/N9W66HPQ. https://ir.library.oregonstate.edu/xmlui/handle/1957/57669. 

Notes

This version is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.