Journal:Support Your Data: A research data management guide for researchers
|Full article title||Support Your Data: A research data management guide for researchers|
|Journal||Research Ideas and Outcomes|
|Author(s)||Borghi, John A.; Abrams, Stephen; Lowenberg, Daniella; Simms, Stephanie; Chodacki, John|
|Author affiliation(s)||University of California Curation Center|
|Primary contact||Email: john dot borghi at ucop dot edu|
|Volume and issue||4|
|Distribution license||Creative Commons Attribution 4.0 International|
- 1 Abstract
- 2 Introduction
- 3 Project development
- 4 The Support Your Data materials
- 5 RDM rubric
- 6 One-page guides
- 7 Using the Support Your Data materials
- 8 Next steps
- 9 Supplementary material
- 10 Acknowledgements
- 11 Footnotes
- 12 References
- 13 Notes
Researchers are faced with rapidly evolving expectations about how they should manage and share their data, code, and other research materials. To help them meet these expectations and generally manage and share their data more effectively, we are developing a suite of tools which we are currently referring to as "Support Your Data." These tools, which include a rubric designed to enable researchers to self-assess their current data management practices and a series of short guides that provide actionable information about how to advance practices as necessary or desired, are intended to be easily customizable to meet the needs of researchers working in a variety of institutional and disciplinary contexts.
Keywords: research data management, RDM, data sharing, open data, open science
Research data management (RDM), a term that encompasses activities related to the storage, organization, documentation, and dissemination of data[a], is central to efforts aimed at maximizing the value of scientific investment (e.g., the Holdren memorandum) and addressing concerns related to the integrity of the research process (e.g., Collins and Tabak's discussion on reproducibility). Unfortunately, when surveyed directly, researchers often acknowledge that they lack the skills and experience needed to manage and share their data effectively. This disconnect demonstrates the need for tools that bridge the communication gap that exists between the research community, data service providers, and other local, national, and international data stakeholder groups. The development of one such tool, which we are tentatively referring to as “Support Your Data,” is the subject of this project report.
As demonstrated by visualizations such as the research data lifecycle, RDM is continuous, iterative, and embedded throughout the course of a research project. Well thought out RDM practices make the research process more efficient, facilitate collaboration, and help prevent the loss of data (see Lowndes et al. 2017). Effective RDM is also crucial to establishing the accessibility of data after a project’s conclusion, which is increasingly required by data stakeholders such as research funding agencies and scholarly publishers. Steps must be taken early in the research process to ensure that data can be shared later. For example, the sharing of data from human participants must be approved by an institutional review board (IRB) and described in informed consent documents before any data is collected. More generally, data that are made available are only useful if formatted, documented, and organized in a manner that enables examination and reuse by others. Related guidance (e.g., from Goodman et al.) and standards (e.g., FAIR Guiding Principles) highlight that proper data management is a key factor in enabling effective data sharing, which is itself a key factor in establishing research transparency and reproducibility.
Complementing calls for improved data management and more widespread data sharing by transparency and reproducibility-related initiatives within the research community, RDM has increasingly become a focus for academic libraries. Though offerings vary considerably between institutions, library RDM programs generally emphasize skills training and assisting researchers in complying with data-related policies and mandates. Guidance provided to researchers by library-based data service providers often focuses on topics such as data management planning, metadata and documentation, data organization, storage and backup procedures, and long-term preservation. Though “best practice” documents written by researchers often cover similar topics, they generally do not reference the work of data service providers. A recent effort to bridge these two perspectives through a survey of data management practices in the field of human brain imaging (neuroimaging) demonstrates that many researchers are unaware of or do not make use of library-based RDM resources. Furthermore, their RDM practices are highly variable, often described using hypothesis or workflow-specific terminology, and rooted in immediate and practical concerns (e.g., “I want to prevent the loss of data.”). Therefore, for data service providers, bridging this communication gap and effectively engaging with researchers on the topic of RDM requires not only overcoming differences in language, terminology, and priorities between and within different research areas, but also placing related concepts within the context of a researcher’s day-to-day work with data.
There are several existing tools that bring together the perspectives of data service providers and researchers to evaluate RDM practices. However, because these tools are often oriented towards data service providers, they have not seen widespread adoption by researchers who may have minimal contact with library-based RDM programs. For example, the Data Curation Profiles toolkit, which consists of a structured interview designed to elucidate data-related practices and needs in different academic disciplines, was designed to launch discussions between librarians and researchers and facilitate the development of data services that address the needs of researchers. Other RDM assessment tools draw heavily from the capability maturity model (CMM) framework, which describes practices based on their degree of formality and optimization. A maturity model specific to the management of scientific data characterizes research groups on the basis of how well their procedures related to data acquisition, description, dissemination, and preservation are defined, documented, and generalized. The DMVitals tool combines elements of the Data Curation Profiles and maturity-based tools to systematically assess a researcher’s data management practices and generate customized and actionable recommendations based on institutional and domain standards.
This brief review of the current RDM landscape highlights several significant trends:
- Researchers face an evolving array of expectations related to how they manage and share data. Unfortunately, there is a significant communication gap between researchers and library-based data service providers.
- Overcoming this communication gap requires placing RDM in the context of a researcher’s day-to-day work with data and overcoming differences in language, terminology, and priorities between and within different research communities.
- There is currently no user-friendly guide that allows researchers to assess and advance their own data management practices.
The intention of the Support Your Data project is to address these trends by developing materials that frame activities related to research data management so that they can be easily understood and acted upon by researchers. At present, these materials consist of a rubric designed to allow researchers to self-assess their own RDM practices over the course of a research project and a complementary set of guides that direct researchers towards RDM-related services at their institution and provide actionable information about how to advance their practices as necessary or desired. To meet the needs of researchers in different institutional and disciplinary contexts, all of these materials have been designed to be easily customizable.
The development process for the Support Your Data project drew upon a large number of sources. An initial point of inspiration was the “HowOpenIsIt?” guide developed by SPARC, PLOS, and the Open Access Scholarly Publishers Association (OASPA). The format of this guide, in which a number of topics (e.g., author posting rights, reuse rights) are described on a spectrum from closed to open access, allows for a number of complex and interrelated issues to be presented in a relatively simple and easy to understand manner. This prompted us to consider how to present research data management, a topic sufficiently complex as to be labelled a “wicked problem,” in a similar manner.
A literature search and analysis of existing RDM evaluation tools revealed that the majority were either designed to benchmark RDM services at the institutional level (e.g., the Australian National Data Service's data management framework and the Digital Curation Center's CARDIO effort) or intended to foster communication between researchers and library-based data service providers. For this reason, we decided that our as-yet-unnamed project should focus on developing materials for researchers. Working under the assumption that researchers in different institutional and disciplinary contexts might have a range of RDM-related priorities and access to different levels of RDM-related services, we decided at the outset of the development process that our materials should be developed with an eye towards customization.
One major early difficulty was determining how to describe the research process. While we wanted to draw from the workflow-based organization of visualizations such as the research data lifecycle, we also wanted to avoid presenting the progression of a research project using models or terminology that would be unfamiliar or unappealing to researchers. After conducting an informal survey of what words researchers associate with given activities (e.g., “What term(s) do you use to describe the stage of your research that involves acquiring, accumulating, or measuring data?”) and examining related work on the topic (e.g., Mattern et al. 2015), we decided to focus on describing RDM-related practices rather than project stages. Even so, terminology proved to be a significant problem, as we quickly determined that phrases such as “data management planning” and “data sharing” had significantly different meanings to different audiences. Our efforts to reduce jargon would continue throughout the development process.
As with other RDM evaluation tools, we adopted elements of the capability maturity model framework to describe different data management-related activities on a continuum from “ad hoc” to “refined and optimized.” This early conception of an “RDM Maturity Guide” was described in early blog posts intended to elicit feedback from members of the data services and research communities. However, as the project progressed, we moved away from explicitly referencing the concept of practice maturity. Informal feedback received during the development of a parallel project, in which researchers were asked to provide quantitative RDM maturity ratings for themselves and their field as a whole, revealed that the concept needed constant clarification and that researchers were resistant to the connotation that their practices could be considered “immature.”
The general structure of what would become the Support Your Data rubric was therefore refined to include a series of RDM-related activities described at different levels of definition and optimization. Because the rubric was to be designed to allow researchers to self-assess the current state of their RDM practices, we quickly decided that the rubric should be complemented by a series of short guides designed to provide information about how to advance practices as necessary or desired. In a series of biweekly meetings, we then set out to draft content for these materials. Feedback from the broader community was sought throughout this process through additional blog posts and presentations at research data-focused conferences (e.g., see Borghi et al. 2017 and Borghi et al. 2018).
Initially, development of the content for the rubric and the guides progressed in parallel. Informed by informal surveys of researchers and data service providers (e.g., “What activities do you consider part of ‘planning for data’?”), we reviewed draft materials, worked to clarify language, and added relevant information as necessary. Though the activities described in the rows of the rubric (and expanded upon further in the guides) remained largely consistent throughout the development process, the earliest iterations of the rubric did not use set labels to describe a researcher’s practices related to each activity. This was intentional, as we wanted to resist quantification of a researcher’s practices into a score of their RDM maturity. However, after an initial round of revisions, we determined that the rubric was becoming unbalanced. The lack of labels meant that different activities were being described at different levels of specificity, which made interpretation difficult, thus defeating the entire purpose of the project.
In response, we refined the structure of the rubric further so that a researcher’s RDM-related activities were described using one of four labels (see next section). After taking care that these labels were descriptive and not evaluative, we then completed a draft version of the entire rubric. We decided to use declarative statements to describe each RDM-related activity under each label in order to maximize the degree to which a researcher would identify a description with their own practices. We then proceeded to refine the content and structure of the guides. The materials presented in the next section are the result of this most recent round of revision.
The Support Your Data materials
At present, the Support Your Data materials consist of a rubric designed to allow researchers to self-assess their own RDM practices and a complementary series of one-page guides intended to provide researchers access to RDM-related expertise (including local RDM-related resources) and advance practices as necessary or desired. All of these materials are intended to be customizable in order to meet the needs of researchers in different institutional or disciplinary contexts.
The aim of the Support Your Data project is to be descriptive rather than prescriptive. Neither the rubric nor the guides assumes that every researcher will want, need, or be able to achieve the same level of data management practice. Rather, the intent of these materials is to help researchers understand where they are with regard to RDM and, when appropriate, how to get to where they want or need to be.
A schematic version of the RDM rubric is shown in Table 1. Different RDM-related activities occurring over the course of a research project are represented in separate rows. Though the order from top to bottom loosely follows the progression of a research project, it is very likely that these activities will occur in a different order or simultaneously in a researcher's day-to-day work with data. The six activities described in the rubric (planning, organizing, saving, preparing, analyzing, and sharing) are intentionally general in order to make the rubric applicable to as wide a population as possible. Future versions of the rubric, adapted to specific disciplinary or institutional contexts, could incorporate more, fewer, or altogether different activities.
Proceeding left to right, a series of declarative statements describes each activity in terms of how well the associated practices are designed to foster access to and use of data in the future. The four levels, “ad hoc,” “one-time,” “active and informative,” and “optimized for re-use,” are intended to be descriptive, not prescriptive.
- Ad hoc - Refers to circumstances in which practices are neither standardized nor documented. Every time a researcher has to manage their data they have to design new practices and procedures from scratch.
- One-time - Refers to circumstances in which data management occurs only when it is necessary, such as in direct response to a mandate from a funder or publisher. Practices or procedures implemented at one phase of a project are not designed with later phases in mind.
- Active and informative - Refers to circumstances in which data management is a regular part of the research process. Practices and procedures are standardized, well documented, and well integrated with those implemented at other phases.
- Optimized for re-use - Refers to circumstances in which all data management activities are designed to facilitate the re-use of data in the future.
It should be noted that “re-use” in the context of the Support Your Data project is not necessarily meant as an endorsement of data sharing or other open science practices but is representative of the close link between effective sharing and effective research data management. It is very likely that the person who will need to examine or re-use a given dataset will be the researcher who collected or analyzed it in the first place.
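As a rough illustration of the rubric's structure, the six activities and four levels described above could be represented as follows. This is a hypothetical sketch, not part of the Support Your Data materials: the names `Level`, `ACTIVITIES`, `validate_assessment`, and `gaps` are assumptions introduced here, and the integer ordering of the levels is used only to compare where a researcher is with where they want or need to be, not to produce a score.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four rubric levels, ordered left to right as in the text."""
    AD_HOC = 1
    ONE_TIME = 2
    ACTIVE_AND_INFORMATIVE = 3
    OPTIMIZED_FOR_REUSE = 4


# The six activities named in the rubric, top to bottom.
ACTIVITIES = ("planning", "organizing", "saving",
              "preparing", "analyzing", "sharing")


def validate_assessment(assessment):
    """Check that a self-assessment assigns one Level to each activity."""
    missing = [a for a in ACTIVITIES if a not in assessment]
    unknown = [a for a in assessment if a not in ACTIVITIES]
    if missing or unknown:
        raise ValueError(f"missing={missing}, unknown={unknown}")
    # Level(...) raises ValueError for values outside the four levels.
    return {a: Level(assessment[a]) for a in ACTIVITIES}


def gaps(current, target):
    """Return the activities whose target level is above the current one."""
    current = validate_assessment(current)
    target = validate_assessment(target)
    return {a: (current[a], target[a])
            for a in ACTIVITIES if target[a] > current[a]}


if __name__ == "__main__":
    # Illustrative values only: a researcher who is "ad hoc" everywhere
    # and needs "one-time" sharing practices to meet a mandate.
    current = {a: Level.AD_HOC for a in ACTIVITIES}
    target = dict(current, sharing=Level.ONE_TIME)
    for activity, (now, wanted) in gaps(current, target).items():
        print(f"{activity}: {now.name} -> {wanted.name}")
```

A customized rubric with more, fewer, or different activities would only need a different `ACTIVITIES` tuple in this sketch.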
Preliminary versions of the guides associated with each row of the RDM rubric are available as Suppl. materials 2, 3, 4, 5, 6, and 7. Designed to be easily customizable to fit the terminology, practices, and services associated with different disciplinary and institutional communities, the guides all follow a similar structure.
- Abstract - A brief summary of the contents of the guide.
- What does it mean? - Provides an operational definition of the activity covered by the guide. For some guides (Planning, Organizing), this consists of a sentence or two describing the activity. For others (e.g., Saving, Preparing, Analyzing, Sharing), this involves a more detailed breakdown of what each activity involves in practice.
- Requirements and how to meet them - Provides a brief summary of how to meet expectations or mandates related to each activity. Because data-related requirements and services are highly discipline- and institution-specific, the contents of these sections are designed to be easily customizable.
- Things to think about - Contains notes and recommendations that do not fit into the other sections.
Both the rubric and the guides are intended for easy customization to reflect the terminology, tools, best practices, and services specific to different disciplinary and institutional communities. In the template guides, some suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). Discipline-specific versions may incorporate the jargon, workflow, standards, and priorities of researchers working in a particular domain (e.g., neuroscience). Institution-specific versions may also incorporate links to available data management, curation, and preservation tools and services.
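The customization described above can be pictured as filling placeholders in a template guide. The following is a minimal sketch under stated assumptions: the actual template guides are .odt files with highlighted passages, not Python templates, and the function `render_guide`, the placeholder names `$metadata_standard` and `$data_service`, and the example values are all hypothetical.

```python
from string import Template

# The four sections shared by every guide, in the order given in the text.
SECTIONS = (
    "Abstract",
    "What does it mean?",
    "Requirements and how to meet them",
    "Things to think about",
)


def render_guide(template_sections, discipline, institution):
    """Fill discipline- and institution-specific placeholders in a guide.

    `template_sections` maps each section name to template text, while
    `discipline` and `institution` map placeholder names to the values
    used for a particular community.
    """
    missing = [s for s in SECTIONS if s not in template_sections]
    if missing:
        raise ValueError(f"template is missing sections: {missing}")
    values = {**discipline, **institution}
    rendered = []
    for name in SECTIONS:
        # substitute() raises KeyError if a placeholder has no value, so a
        # point of customization cannot be silently left unfilled.
        body = Template(template_sections[name]).substitute(values)
        rendered.append(f"{name}\n{body}")
    return "\n\n".join(rendered)


if __name__ == "__main__":
    # Illustrative template text and values, not taken from the guides.
    template = {
        "Abstract": "How to organize your data.",
        "What does it mean?": "Arranging files so others can find them.",
        "Requirements and how to meet them":
            "Follow $metadata_standard. For help, contact $data_service.",
        "Things to think about": "Will this still make sense in a year?",
    }
    print(render_guide(
        template,
        discipline={"metadata_standard": "your community's metadata standard"},
        institution={"data_service": "your library's data services team"},
    ))
```

Keeping the discipline-specific and institution-specific values in separate mappings mirrors the yellow and red highlighting in the template guides, so either set can be swapped without touching the other.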
Using the Support Your Data materials
We envision several use cases for the Support Your Data materials. The most likely is one in which these materials are used to facilitate discussion between an individual researcher or research group and a data service provider. In such a case, the researcher or research group can use the RDM rubric to identify the difference between where they are with regard to RDM and where they want or need to be. A data service provider can then use the guides, customized to highlight available services and tools, to provide information about how to move forward. Another probable use case is one in which a particular research community uses these materials as part of a broader effort to improve practices related to data management (including data sharing). In this case, the organization and content of both the RDM rubric and the guides can be customized, with the assistance of data service providers, to include community-specific activities, requirements, and terminology. Though we were careful to ensure that our materials are merely descriptive, such customized versions could be more prescriptive in adhering to institutional or discipline-specific norms or policies.
Though helping researchers respond to evolving expectations related to the management and sharing of their data was a major driving force behind the project, the Support Your Data materials, at least in their current iteration, are not designed to increase compliance with specific policies or requirements. For example, though a researcher using these materials would be directed to local RDM services and tools (e.g., a local DMPTool instance) related to the creation of data management plans (DMPs), neither the rubric nor the “planning for data” guide gives specific guidance on how to comply with the DMP requirements of different funding agencies. However, in helping researchers assess and advance their data management practices, the Support Your Data materials may indirectly help them comply more effectively with data-related requirements throughout the lifecycle of a research project.
Now that we have a complete set of draft materials, the next step of the Support Your Data project is to focus on design and adoption. Moving forward, we will work with internal and external partners on the visual presentation of the materials and to develop pamphlets, postcards, and other collateral. As has been the case throughout the project, we will also continue to invite feedback and explore partnerships with stakeholders interested in developing customized materials.
- Suppl. material 1: A formatted version of the Support Your Data RDM rubric (.odp file)
- Suppl. material 2: A draft guide that corresponds with the "Planning your project" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
- Suppl. material 3: A draft guide that corresponds with the "Organizing your data" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
- Suppl. material 4: A draft guide that corresponds with the "Saving and backing up your data" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
- Suppl. material 5: A draft guide that corresponds with the "Getting your data ready for analysis" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
- Suppl. material 6: A draft guide that corresponds with the "Analyzing your data and handling the outputs" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
- Suppl. material 7: A draft guide that corresponds with the "Sharing and publishing your data" row of the RDM rubric. Suggested points of customization are highlighted in yellow (discipline-specific) and red (institution-specific). (.odt file)
UC Curation Center, California Digital Library
JB drafted the manuscript and led the development of the materials. SA, DL, SS, and JC co-developed the materials and reviewed the manuscript.
Conflicts of interest
The authors declare no conflicts of interest.
- For the purposes of this report we are using the term “data” broadly to refer to the inputs or outputs required to evaluate, reproduce, or build upon the analyses or conclusions of a given research project. This includes, but is not limited to, raw data, processed data, research-related code, and documentation pertaining to study parameters and procedures.
- Holdren, J.P. (22 February 2013). "Increasing Access to the Results of Federally Funded Scientific Research". Office of Science and Technology Policy. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf.
- Collins, F.S.; Tabak, L.A. (2014). "Policy: NIH plans to enhance reproducibility". Nature 505 (7485): 612–13. doi:10.1038/505612a.
- Barone, L.; Williams, J.; Micklos, D. (2017). "Unmet needs for analyzing biological big data: A survey of 704 NSF principal investigators". PLOS Computational Biology 13 (11): e1005755. doi:10.1371/journal.pcbi.1005755. PMC PMC5654259. PMID 29049281. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5654259.
- Federer, L.M.; Lu, Y.L.; Joubert, D.J. et al. (2015). "Biomedical Data Sharing and Reuse: Attitudes and Practices of Clinical and Scientific Research Staff". PLoS One 10 (6): e0129506. doi:10.1371/journal.pone.0129506. PMC PMC4481309. PMID 26107811. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4481309.
- Tenopir, C.; Sandusky, R.J.; Allard, S.; Birch, B. (2014). "Research data management services in academic research libraries and perceptions of librarians". Library & Information Science Research 36 (2): 84–90. doi:10.1016/j.lisr.2013.11.003.
- Carlson, J. (2014). "The use of lifecycle models in developing and supporting data services". In Ray, J.M.. Research Data Management: Practical Strategies for Information Professionals. Purdue University Press. ISBN 9781557536648.
- Cox, A.M.; Tam, W.W.T. (2018). "A critical analysis of lifecycle models of the research process and research data management". Aslib Journal of Information Management 70 (2): 142-57. doi:10.1108/AJIM-11-2017-0251.
- Lowndes, J.S.S.; Best, B.D.; Scarborough, C. et al. (2017). "Our path to better science in less time using open data science tools". Nature Ecology and Evolution 1: 0160. doi:10.1038/s41559-017-0160.
- Meyer, M.N. (2018). "Practical Tips for Ethical Data Sharing". Advances in Methods and Practices in Psychological Science 1 (1): 131-144. doi:10.1177/2515245917747656.
- Goodman, A.; Pepe, A.; Blocker, A.W. et al. (2014). "Ten Simple Rules for the Care and Feeding of Scientific Data". PLoS Computational Biology 10 (4): e1003542. doi:10.1371/journal.pcbi.1003542.
- Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC PMC4792175. PMID 26978244. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4792175.
- Ioannidis, J.P.A. (2014). "How to Make More Published Research True". PLoS Medicine 11 (10): e1001747. doi:10.1371/journal.pmed.1001747.
- Munafò, M.R.; Nosek, B.A.; Bishop, D.V.M. et al. (2017). "A manifesto for reproducible science". Nature Human Behaviour 1: 0021. doi:10.1038/s41562-016-0021.
- Cox, A.M.; Kennan, M.A.; Lyon, L. et al. (2017). "Developments in research data management in academic libraries: Towards an understanding of research data service maturity". Journal of the Association for Information Science and Technology 68 (9): 2182-2200. doi:10.1002/asi.23781.
- Flores, J.R.; Brodeur, J.J.; Daniels, M.G. et al. (2015). "Libraries and the Research Data Management Landscape". In Maclachlan, J.C.; Waraksa, E.A.; Williford, C.. The Process of Discovery: The CLIR Postdoctoral Fellowship Program and the Future of the Academy. Council on Library and Information Resources. ISBN 9781932326529.
- Borghi, J.A.; Van Gulick, A.E. (2018). "Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers". bioRxiv. doi:10.1371/journal.pone.0200562.
- Witt, M.; Carlson, J.; Brandt, D.S.; Cragin, M.H. (2009). "Constructing Data Curation Profiles". International Journal of Digital Curation 4 (3): 93-103. doi:10.2218/ijdc.v4i3.117.
- Paulk, M.C.; Curtis, B.; Chrissis, M.B.; Weber, C.V. (1993). "Capability maturity model, version 1.1". IEEE Software 10 (4): 18-27. doi:10.1109/52.219617.
- Crowston, K.; Qin, J. (2012). "A capability maturity model for scientific data management: Evidence from the literature". Proceedings of the American Society for Information Science and Technology 48 (1): 1-9. doi:10.1002/meet.2011.14504801036.
- Sallans, A.; Lake, S. (2014). "Data management assessment and planning tools". In Ray, J.M.. Research Data Management: Practical Strategies for Information Professionals. Purdue University Press. ISBN 9781557536648.
- "HowOpenIsIt? A Guide for Evaluating the Openness of Journals". New Venture Fund. 2013. https://sparcopen.org/our-work/howopenisit/.
- Awre, C.; Baxter, J.; Clifford, B. et al. (2015). "Research Data Management as a “wicked problem”". Library Review 64 (4/5): 356-371. doi:10.1108/LR-04-2015-0043.
- "Creating a data management framework". Australian National Data Service. 2011. http://www.ands.org.au/guides/creating-a-data-management-framework.
- "Collaborative Assessment of Research Data Infrastructure and Objectives (CARDIO)". Digital Curation Center. 2013. https://cardio.dcc.ac.uk/about/.
- Mattern, E.; Jeng, W.; He, D. (2015). "Using participatory design and visual narrative inquiry to investigate researchers’ data challenges and recommendations for library research data services". Program: electronic library and information systems 49 (4): 408-423. doi:10.1108/PROG-01-2015-0012.
- Borghi, J.A.; Abrams, S.; Chodacki, J. et al. (22 September 2017). "Developing a Data Management Guide for Researchers". Zenodo. doi:10.5281/zenodo.1213384. https://zenodo.org/record/1213384.
- Borghi, J.A.; Abrams, S.; Lowenberg, D. et al. (21 March 2018). "Support Your Data: A Data Management Guide for Researchers". Zenodo. doi:10.5281/zenodo.1204885. https://zenodo.org/record/1204885.
- Nichols, T.E.; Das, S.; Eickhoff, S.B. et al. (2017). "Best practices in data analysis and sharing in neuroimaging using MRI". Nature Neuroscience 20: 299–303. doi:10.1038/nn.4500.
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Footnotes were originally numbered but have been converted to lowercase alpha for this version. The original article lists references alphabetically, but this version—by design—lists them in order of appearance.