
Full article title Risk assessment for scientific data
Journal Data Science Journal
Author(s) Mayernik, Matthew S.; Breseman, Kelsey; Downs, Robert R.; Duerr, Ruth; Garretson, Alexis; Hou, Chung-Yi,
EDGI and ESIP Data Stewardship Committee[a]
Author affiliation(s) National Center for Atmospheric Research, Environmental Data & Governance Initiative, Columbia University,
Ronin Institute for Independent Scholarship, George Mason University
Primary contact Email: mayernik at ucar dot edu
Year published 2020
Volume and issue 19(1)
Article # 10
DOI 10.5334/dsj-2020-010
ISSN 1683-1470
Distribution license Creative Commons Attribution 4.0 International
Website https://datascience.codata.org/articles/10.5334/dsj-2020-010/
Download https://datascience.codata.org/articles/10.5334/dsj-2020-010/galley/944/download/ (PDF)

Abstract

Ongoing stewardship is required to keep data collections and archives in existence. Scientific data collections may face a range of risk factors that could hinder, constrain, or limit current or future data use. Identifying such risk factors is a key step in preventing or minimizing data loss. This paper presents an analysis of the risk factors that scientific data collections may face, along with a data risk assessment matrix that supports assessments aimed at ameliorating those risks. The goals of this work are to inform and enable effective data risk assessment by: a) individuals and organizations who manage data collections, and b) individuals and organizations who want to help reduce the risks associated with data preservation and stewardship. The data risk assessment framework presented in this paper provides a platform from which risk assessments can begin, and a reference point for discussions of data stewardship resource allocations and priorities.

Keywords: risk assessment, data preservation, data stewardship, metadata

Introduction

At the “The Rescue of Data At Risk” workshop held in Boulder, Colorado on September 8 and 9, 2016[b], participants were asked the following question: “How would you define ‘at-risk’ data?” Discussions on this point ranged widely and touched on several challenges, including lack of funding or personnel support for data management, natural and political disasters, and metadata loss. One participant’s organization’s definition of risk, however, stood out: “data were considered to be at-risk unless they had a dedicated plan to not be at-risk.” This simple statement vividly captures the idea that data’s default state is one of risk. In other words, ongoing stewardship is required to keep data collections and archives in existence.

The risk factors that a given data collection or archive may face vary, depending on the data’s characteristics, the data’s current environment, and the priorities and resources available at the time. Many risks can be reduced or eliminated by following best practices codified in certifications and guidelines, such as the CoreTrustSeal Data Repository Certification[1] and the ISO 16363:2012 standard, which defines audit and certification procedures for trustworthy digital repositories.[2] Both the CoreTrustSeal certification and ISO 16363:2012 are based on the ISO 14721:2012 standard, which defines the reference model for an open archival information system (OAIS).[3] These certification processes, however, can be large and complex. Additionally, many of the organizations that hold valuable scientific data collections may not be aware of these standards, even if they potentially have the resources to tackle the challenge.[4] Further, attaining such certifications does not necessarily reduce risks to data that fall outside the scope of a particular certification instrument.

This paper presents an analysis of data risk factors that stakeholders of scientific data collections and archives may face, and a matrix to support data risk assessments to help ameliorate those risks. The three driving questions for this analysis are:

  • How do stakeholders assess what data are at risk?
  • How do stakeholders characterize what risk factors data collections and/or archives face?
  • How do stakeholders make the associated risks more transparent, internally and/or externally?

The goals of this work are to inform and enable effective data risk assessment by: a) individuals and organizations who manage data collections, and b) individuals and organizations who want to help to reduce the risks associated with data preservation and stewardship. Stakeholders for these two activities include producers, stewards, sponsors, and users of data, as well as the management and staff of the institutions that employ them.

Background

This project was coordinated through the Data Stewardship Committee within the Earth Science Information Partners (ESIP), a non-profit organization that exists to support collection, stewardship, and use of earth science data, information, and knowledge.[c] The immediate motivation for the project stemmed from the Data Stewardship Committee members engaging with groups who were undertaking grass-roots “data rescue” initiatives after the 2016 U.S. presidential election. At that time, a number of loosely organized and coordinated efforts were initiated to duplicate data from U.S. government organizations to prevent potential politically motivated data deletion or obfuscation.[5][6] In many cases, these initiatives specifically focused on duplicating government-hosted earth science data.

ESIP Data Stewardship Committee members wrote a white paper to provide the earth science data centers’ perspective on these grassroots “data rescue” activities.[7] That document described essential considerations within the day-to-day work of existing federal and federally-funded earth science data archiving organizations, including data centers’ constant focus on documentation, traceability, and persistence of scientific data. The white paper also provided suggestions for how those grassroots efforts might productively engage with the data centers themselves.

One point that was emphasized in the white paper was that the actual risks faced by the data collections may not be transparent from the outside. In other words, “data rescue” activities may have in fact been duplicating data that were at minimal risk of being lost.[8] This point, and the white paper in general, was well received by people inside and outside of these grass-roots initiatives.[9][10] Questions then came back to the ESIP Data Stewardship Committee about how to understand what data held by government agencies were actually at-risk.

The analysis presented in this paper was initiated in response to these questions. Since then, these grassroots “data rescue” initiatives have had mixed success in sustaining and formalizing their efforts.[11][12][13] The intention of our paper is to enable more effective data risk assessment broadly. Rescuing data after they have been corrupted, deleted, or lost can be time- and effort-intensive, and in some cases it may be impossible.[14] Thus, we aim to provide guidelines to any individual or organization that manages and provides access to scientific data. In turn, these individuals and organizations can better assess the risks that their data face and characterize those risks.

When discussing risk and, in particular, data risk, it is useful to ask "what is the objective that is being challenged by the possible risk factors?" With regard to data, in general, discussions of risk might presume that “risks” threaten the current or future access to data by the potential data users. Currently, continuing public access to and use of scientific data is particularly relevant in light of recent open data and open science initiatives. In this regard, risks for scientific data include factors that could hinder, constrain, or limit current or future data use. Identifying such data use risk factors offers further analysis opportunities to prevent, mitigate, or eliminate the risks.

Data risk assessment

Risk assessment is a regular activity within many organizations. In a general sense, risk management plans are complementary to project management plans (Cervone 2006). Organizational assessment of digital data and information collections is likewise not new (Maemura, Moles & Becker 2017). The analysis presented in this paper builds on prior work in a number of areas: 1) research on data risks, 2) data rescue initiatives within government agencies and specific disciplines, 3) CODATA and RDA working groups and meetings, 4) trusted repository certifications, and 5) the knowledge and experience of the ESIP Data Stewardship Committee members. Table 1 summarizes data risk factors that emerge from these knowledge bases. The list of risk factors shown in Table 1 is not meant to be exhaustive. Rather, it provides a useful illustration of the diverse ways in which data sets, collections, and archives might encounter risks to data usability and accessibility. The rest of this section details further key insights from the five areas of prior work noted above.

Table 1. Risk factors for scientific data collections
Risk factor: Description
1. Lack of use: Data are rarely accessed, deemed "unwanted," and consequently discarded.
2. Loss of funding for archive: The whole archive loses its funding source.
3. Loss of funding for specific datasets: Specific datasets lose their funding source.
4. Loss of knowledge around context or access: Data owners lose the individuals (e.g., through retirement or death) who know how to access the data or who hold the metadata that make the data usable to others.
5. Lack of documentation and metadata: Data cannot be interpreted due to lack of contextual knowledge.
6. Data mislabeling: Data are lost because they are poorly identified (either physically or digitally).
7. Catastrophes: Fires, floods, wars, human conflicts, etc. destroy data and/or their owners.
8. Poor data governance: Uncertain or unknown decision-making processes impede effective data management.
9. Problems with legal status for data ownership and use: Uncertain, unknown, or restrictive legal status limits the possible uses of data.
10. Media deterioration: Physical media deterioration (paper, tape, or digital media) prevents data from being accessed.
11. Missing files: Data files are lost without any known reason.
12. Overdependence on a single service provider: A single point of failure causes problems, particularly if a vital service provider goes out of business.
13. Accidental deletion: Data are accidentally deleted through staff error.
14. Lack of planning: Absence of contingency planning leaves data collections susceptible to unexpected events.
15. Cybersecurity breach: Data are intentionally deleted or corrupted via a security breach, e.g., via malware.
16. Overabundant data: Difficulty dealing with too much data reduces the value or quality of whole collections.
17. Political interference: Data are deleted or made inaccessible due to uncontrollable political decisions.
18. Lack of provenance information: Data cannot be trusted or understood because of missing information about data processing steps or data stewardship chains of trust.
19. File format obsolescence: Data cannot be accessed due to lack of knowledge, equipment, or software for reading a specific file format.
20. Storage hardware breakdown: Data are lost due to a sudden, catastrophic malfunction of storage hardware.
21. Bit rot and data corruption: Digital data on storage hardware gradually become corrupted through an accumulation of non-critical failures (bits flipping) in a storage device.
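One way to make a catalog like Table 1 actionable is to encode the factors as structured data and rank them with a conventional likelihood-by-impact score. The sketch below is illustrative only and is not part of the paper's matrix; the 1-5 scales and the multiplicative scoring rule are common risk-management conventions assumed here, not prescribed by the authors.

```python
# Illustrative encoding of a subset of Table 1's risk factors, with a
# simple likelihood x impact scoring rule (both assumptions, not from
# the paper) to rank which factors deserve attention first.

RISK_FACTORS = {
    1: "Lack of use",
    2: "Loss of funding for archive",
    4: "Loss of knowledge around context or access",
    10: "Media deterioration",
    21: "Bit rot and data corruption",
}

def risk_score(likelihood: int, impact: int) -> int:
    """Score one factor on an assumed 1-5 likelihood x 1-5 impact scale."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be in 1..5")
    return likelihood * impact

def assess(ratings: dict) -> list:
    """Return (factor name, score) pairs, highest risk first."""
    scored = [(RISK_FACTORS[f], risk_score(lik, imp))
              for f, (lik, imp) in ratings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical ratings for one collection: factor id -> (likelihood, impact).
ranked = assess({10: (4, 5), 1: (3, 2), 21: (2, 4)})
```

A ranking like this can feed directly into the kind of resource-allocation discussion the paper describes, since the highest-scoring factors identify where mitigation effort is most urgent.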

Research on data risks

A range of studies have explored the kinds of risks that scientific data may face, and potential ways to mitigate specific risk factors. Many of these studies touch on practices that are typical of scientific data archives. Metadata, for example, can be considered both a risk factor and a mitigation strategy. Insufficient metadata is itself a potential factor that can reduce the discoverability, usability, and preservability of data, particularly in situations where direct human knowledge of the data is absent.[15] In fact, many data rescue projects find that the “rescue” efforts must be targeted much more toward metadata than data.[16][17] This might be the case for a couple of reasons. First, insufficient or missing metadata might prevent data from being usable regardless of the condition of the data themselves. Examples include missing column headers in tabular data that prevent a user from knowing what the data are representing, and insufficient provenance metadata that prevent users from trusting the data due to lack of context about data collection and quality control. Second, metadata are also central to documenting and mitigating risks as they manifest, while preventing risks from becoming problematic in the future (Anderson et al. 2011). For example, documenting data ownership and usage rights is an essential step in mitigating risk factor #9, “Problems with legal status for data ownership and use,” from Table 1.
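A simple automatable check in this spirit is a metadata completeness audit: flag records that lack the fields the paragraph above identifies as critical (variable definitions, provenance, usage rights). The field names below are illustrative assumptions, not a published metadata schema.

```python
# Hedged sketch: flag metadata records missing fields that the risks
# discussed above make critical. The field names are hypothetical,
# not drawn from any particular metadata standard.

REQUIRED_FIELDS = ["title", "variable_definitions", "provenance", "usage_rights"]

def missing_metadata(record: dict) -> list:
    """Return the required fields that are absent or empty in a record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

# Example record: has a title and provenance, but no variable
# definitions or usage rights, so two gaps are reported.
record = {"title": "Sea ice extent, 1979-2019",
          "provenance": "processing chain v3 (hypothetical)"}
gaps = missing_metadata(record)
```

Even a check this small addresses risk factors #5 and #9 from Table 1 before they manifest, which is cheaper than rescuing undocumented data later.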

Different kinds of metadata might be necessary to reduce specific data risks. For example, specifications of file format structures are a critical type of metadata for mitigating risks associated with digital file format obsolescence. Open specifications complement other critical mitigation practices and tools related to file format obsolescence. As one example, keeping rendering software available is an important way to retain access to particular file formats, but this typically also requires maintaining documentation of how the rendering software works.[18]
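Format identification itself can be partially automated by comparing a file's leading "magic" bytes against a registry of documented signatures; real registries (e.g., those used by format-identification tools) are far more complete than the three-entry sketch below, which is an illustration only.

```python
# Illustrative sketch of one automatable obsolescence check: identify a
# file's format from its leading "magic" bytes. Only three well-known
# signatures are included here; a real registry would be much larger.
import os
import tempfile

SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip-compressed data",
}

def identify_format(path: str) -> str:
    """Return a format name if the file starts with a known signature."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown (candidate for format-risk review)"

# Demonstration on a temporary file with a PDF header.
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".dat")
tmp.write(b"%PDF-1.4 minimal example")
tmp.close()
fmt = identify_format(tmp.name)
os.unlink(tmp.name)
```

Files that come back "unknown" are natural candidates for the kind of format-risk review and migration planning discussed above.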

Other risk factors (listed in Table 1) relate to the sustainability and transparency of the archiving organization. These factors are important in ensuring the accessibility of the data and the trustworthiness of the archive. As Yakel et al.[19] note, “[t]rust in the repository is a separate and distinct factor from trust in the data.” For people outside of the repository, “institutional reputation appears to be the strongest structural assurance indicator of trust.”[19] In essence, effective communication about data risks and the steps taken to eliminate problems helps assure users that the archive is trustworthy.[20]

Data that face extreme or unusual risks, however, may not be manageable via typical data curation workflows. Downs and Chen[21] note that dealing with some data risk factors “may well require divergence from regular data curation procedures, as tradeoffs may be necessary.” For example, Gallaher et al.[22] undertook an extensive project to recover, reconstruct, and reprocess data from early satellite missions into modern formats that are usable by modern scientists. This project involved dealing with degrading and fragile magnetic tapes, extracting data from the tapes’ unusual format, and recreating documentation for the data. Additionally, natural disasters, fires, and floods also present unpredictable risk factors to data collections of all kinds. While these kinds of events can be planned for and steps can be taken to prevent the occurrence of some of them (e.g., fires), they can still cause major data loss and/or require significant recovery effort.

Mitigating risks, of whatever kind, takes effort and resources. The time required to create metadata, re-format files, create contingency plans, and communicate these efforts to user communities can be considerable. This time investment can be the greatest barrier to performing risk assessment and mitigation activities.[23] Putting focus on assessment of data risk factors may mean that “certain priorities need to be re-ordered, new skills acquired and taught, resources redirected, and new networks constructed.”[24] It is possible to automate some components of risk assessment,[25] but most of the steps require human effort. This intensive effort is vividly illustrated by the many data rescue initiatives that have taken place within government agencies and other kinds of organizations over the past few decades.
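One component that is straightforward to automate is fixity checking: recomputing checksums and comparing them against a stored manifest to detect silent corruption (Table 1, factor 21). The sketch below simulates this with in-memory objects; a real audit would read files from storage, and the manifest format here is an assumption for illustration.

```python
# Minimal fixity-audit sketch: detect objects whose current SHA-256
# digest no longer matches a previously recorded manifest. The manifest
# layout and the in-memory "archive" are illustrative assumptions.
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def audit(manifest: dict, reader) -> list:
    """Return names of objects whose current checksum differs from the manifest.

    `reader` maps an object name to its current bytes (e.g., a file read).
    """
    return [name for name, digest in manifest.items()
            if sha256_of(reader(name)) != digest]

# Simulated archive: one object silently corrupted after the manifest was made.
objects = {"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"}
manifest = {name: sha256_of(data) for name, data in objects.items()}
objects["b.csv"] = b"4,5,0\n"  # a single changed byte, as in bit rot

damaged = audit(manifest, lambda name: objects[name])
```

Scheduled audits like this catch corruption early, but deciding what to do with a damaged object (restore from a replica, re-derive, or rescue) still requires the human effort the text describes.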

Data rescue initiatives within government agencies and specific disciplines

Legacy data are data collected in the past with different technologies and data formats than those used today. These data often face the largest numbers of risk factors that could lead to data loss. A wide range of government agencies and other organizations have conducted legacy data rescue initiatives to modernize data and make them more accessible and usable for today’s science. Each data rescue project typically faces many different kinds of data risks. For example, a recent satellite data rescue effort had to address the “loss of datasets, reconciliation of actual media contents with metadata available, deviation of the actual data format from expectations or documentation, and retiring expertise.”[26] Data rescue projects typically involve work to prevent future risk factors from manifesting, in addition to modernizing data for accessibility and usability. For example, data rescue projects migrate data to less endangered data formats, and create new metadata and quality control documentation.[27]

CODATA/RDA working groups & meetings

Relevant professional organizations, including the International Council for Science (ICSU) Committee on Data for Science and Technology (CODATA) and the Research Data Alliance (RDA), have also been actively identifying improvements to data stewardship practices that can reduce potential risks to data. For example, the former Data At Risk Task Group (DAR-TG) of CODATA raised awareness about the value of heritage data and described the benefits obtained from several data rescue projects.[24] This group also organized the 2016 “Rescue of Data At Risk” workshop mentioned in the introduction of this paper. That workshop led to a document titled “Guidelines to the Rescue of Data At Risk.”[28] Subsequently, the RDA Data Rescue Interest Group,[29] which grew out of the CODATA DAR-TG, has continued efforts to increase awareness of data rescue projects.

Repository certifications and maturity assessment

Many data repositories have conducted self-assessments and external assessments to document their compliance with the standards for trusted repositories and attain certification of their capabilities and practices for managing data. In addition to emphasizing organizational issues, repository certification instruments, such as ISO 16363[2] and CoreTrustSeal[1] certification, also focus on digital object management and infrastructure capabilities. Engaging in such assessments offers benefits to repositories and their stakeholders. A key benefit is the identification of areas where improvements have been completed or need to be completed to reduce risks to data.[1] In an examination of perceptions of repository certification, Donaldson et al.[26] found that process improvement was often reported by repository staff as a benefit of repository certification.

In addition to (or complementary to) formal certifications, data repositories may conduct data stewardship maturity assessment exercises to help identify data risks and inform data risk mitigation strategies.[30] “Maturity” is used in the sense presented by Peng et al.,[31] referring to the level of performance attained to ensure preservability, accessibility, usability, transparency/traceability, and sustainability of data, along with the level of performance in data quality assurance, data quality control/monitoring, data quality assessment, and data integrity checks. Maturity at the institutional (or archive) level in areas such as policy, funding, and infrastructure does not necessarily translate to comprehensive maturity at the dataset level.[32] Data stewardship maturity assessment should therefore be performed both at the institutional level and at the dataset level. Performing stewardship maturity assessments can be time-consuming and resource-intensive. However, stewardship organizations are encouraged to perform self-assessments using a “stage by stage” or “a la carte” approach.[33] Ultimately, both formal certifications and informal maturity assessments help organizations not only gain self-awareness, but also identify better solutions for their data that might be at risk of being lost or rendered unusable.
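An "a la carte" self-assessment can be as simple as rating each stewardship dimension on a coarse scale and surfacing the weakest areas first. The dimension names below come from the text's summary of the Peng et al. framework; the 1-5 scale, the threshold, and the ranking rule are illustrative assumptions, not part of the published maturity matrix.

```python
# Hedged sketch of a per-dataset maturity self-check: rate each
# stewardship dimension (names from the text; scale and threshold are
# assumptions) and report the lowest-rated dimensions first.

DIMENSIONS = ["preservability", "accessibility", "usability",
              "transparency/traceability", "sustainability"]

def weakest_areas(ratings: dict, threshold: int = 3) -> list:
    """Return dimensions rated below the threshold, lowest score first."""
    low = [(score, dim) for dim, score in ratings.items() if score < threshold]
    return [dim for score, dim in sorted(low)]

# Hypothetical self-assessment for one dataset.
ratings = {"preservability": 4, "accessibility": 2, "usability": 3,
           "transparency/traceability": 1, "sustainability": 3}
flagged = weakest_areas(ratings)
```

A dataset-level pass like this can be repeated stage by stage as resources allow, which is exactly the incremental approach the text recommends for organizations that cannot afford a full assessment at once.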

Developing a data risk assessment matrix

Footnotes

  1. We list EDGI and the ESIP Data Stewardship Committee as authors due to the contributions of many individuals from both organizations to the work described in this paper. The named authors are the individuals involved in each organization who contributed directly to the paper’s text.
  2. The workshop was organized under the auspices of the Research Data Alliance (RDA) and the Committee on Data (CODATA) within the International Science Council.
  3. See https://wiki.esipfed.org/Preservation_and_Stewardship.

References

  1. 1.0 1.1 1.2 CoreTrustSeal Standards and Certification Board (2020). "CoreTrustSeal". https://www.coretrustseal.org/. 
  2. 2.0 2.1 "ISO 16363:2012 - Space data and information transfer systems — Audit and certification of trustworthy digital repositories". International Organization for Standardization. February 2012. https://www.iso.org/standard/56510.html. 
  3. "ISO 14721:2012 - Space data and information transfer systems — Open archival information system (OAIS) — Reference model". International Organization for Standardization. September 2012. https://www.iso.org/standard/57284.html. 
  4. Maemura, E.; Moles, N.; Becker, C. (2017). "Organizational assessment frameworks for digital preservation: A literature review and mapping". JASIST 68 (7): 1619–37. doi:10.1002/asi.23807. 
  5. Dennis, B. (13 December 2016). "Scientists are frantically copying U.S. climate data, fearing it might vanish under Trump". The Washington Post. https://www.washingtonpost.com/news/energy-environment/wp/2016/12/13/scientists-are-frantically-copying-u-s-climate-data-fearing-it-might-vanish-under-trump/. 
  6. Varinsky, D. (11 February 2017). "Scientists across the US are scrambling to save government research in 'Data Rescue' events". Business Insider. https://www.businessinsider.com/data-rescue-government-data-preservation-efforts-2017-2. 
  7. Mayernik, M.S.; Downs, R. R.; Duerr, R. et al. (4 April 2017). "Stronger together: The case for cross-sector collaboration in identifying and preserving at-risk data". FigShare. https://esip.figshare.com/articles/journal_contribution/Stronger_together_the_case_for_cross-sector_collaboration_in_identifying_and_preserving_at-risk_data/4816474/1. 
  8. Lamdan, S. (2018). "Lessons from DataRescue: The Limits of Grassroots Climate Change Data Preservation and the Need for Federal Records Law Reform". University of Pennsylvania Law Review Online 166 (1). https://scholarship.law.upenn.edu/penn_law_review_online/vol166/iss1/12. 
  9. Cornelius, K.B.; Pasquetto, I.V. (2018). "‘What Data?’ Records and Data Policy Coordination During Presidential Transitions". Proceedings from iConference 2018: Transforming Digital Worlds: 155–63. doi:10.1007/978-3-319-78105-1_20. 
  10. McGovern, N.Y. (2017). "Data rescue: Observations from an archivist". ACM SIGCAS Computers and Society 47 (2): 19–26. doi:10.1145/3112644.3112648. 
  11. Allen, L.; Stewart, C.; Wright, S. (2017). "Strategic open data preservation: Roles and opportunities for broader engagement by librarians and the public". College & Research Libraries News 78 (9): 482. doi:10.5860/crln.78.9.482. 
  12. Chodacki, J. (2017). "Data Mirror-Complementing Data Producers". Against the Grain 29 (6): 13. doi:10.7771/2380-176X.7877. 
  13. Janz, M.M. (2017). "Maintaining Access to Public Data: Lessons from Data Refuge". Against the Grain 29 (6): 11. doi:10.7771/2380-176X.7875. 
  14. Pienta, A.M.; Lyle, J. (2017). "Retirement in the 1950s: Rebuilding a Longitudinal Research Database". IASSIST Quarterly 42 (1): 12. doi:10.29173/iq19. 
  15. Michener, W.K.; Brunt, J.W.; Helly, J.J. et al. (1997). "Nongeospatial metadata for the ecological sciences". Ecological Applications 7 (1): 330–42. doi:10.1890/1051-0761(1997)007[0330:NMFTES]2.0.CO;2. 
  16. Knapp, K.R.; Bates, J.J.; Barkstrom, B. et al. (2007). "Scientific Data Stewardship: Lessons Learned from a Satellite–Data Rescue Effort". Bulletin of the American Meteorological Society 88 (9): 1359–62. doi:10.1175/BAMS-88-9-1359. 
  17. Hsu, L.; Lehnert, K.A.; Goodwillie, A. et al. (2015). "Rescue of long-tail data from the ocean bottom to the Moon: IEDA Data Rescue Mini-Awards". GeoResJ 6: 108–114. doi:10.1016/j.grj.2015.02.012. 
  18. Ryan, H. (2014). "Occam’s Razor and File Format Endangerment Factors" (PDF). Proceedings of the 11th International Conference on Digital Preservation: 179–88. https://www.nla.gov.au/sites/default/files/ipres2014-proceedings-version_1.pdf. 
  19. 19.0 19.1 Yakel, E.; Faniel, I.; Krisberg, A. et al. (2013). "Trust in Digital Repositories". International Journal of Digital Curation 8 (1): 143–56. doi:10.2218/ijdc.v8i1.251. 
  20. Yoon, A. (2016). "Data reusers' trust development". JASIST 68 (4): 946–956. doi:10.1002/asi.23730. 
  21. Downs, R.R.; Chen, R.S. (2017). "Chapter 12: Curation of Scientific Data at Risk of Loss: Data Rescue and Dissemination". Curating research data - Volume one: Practical strategies for your digital repository. Association of College and Research Libraries. pp. 263–77. doi:10.7916/D8W09BMQ. 
  22. Gallaher, D.; Campbell, G.G.; Meier, W. et al. (2015). "The process of bringing dark data to light: The rescue of the early Nimbus satellite data". GeoResJ 6: 124–34. doi:10.1016/j.grj.2015.02.013. 
  23. Thompson, C.A.; Robertson, D.; Greenberg, J. (2014). "Where Have All the Scientific Data Gone? LIS Perspective on the Data-At-Risk Predicament". College & Research Libraries 75 (6): 842–861. doi:10.5860/crl.75.6.842. 
  24. 24.0 24.1 Griffin, R.E.; CODATA Task Group ‘Data At Risk’ (DAR-TG) (2015). "When are Old Data New Data?". GeoResJ 6: 92–97. doi:10.1016/j.grj.2015.02.004. 
  25. Graf, R.; Ryan, H.M.; Houzanme, T. et al. (2016). "A Decision Support System to Facilitate File Format Selection for Digital Preservation". Libellarium 9 (2): 267–74. doi:10.15291/libellarium.v9i2.274. 
  26. 26.0 26.1 Poli, P.; Dee, D.P.; Saunders, R. et al. (2017). "Recent Advances in Satellite Data Rescue". Bulletin of the American Meteorological Society 98 (7): 1471–1484. doi:10.1175/BAMS-D-15-00194.1. 
  27. Levitus, S. (2012). "The UNESCO-IOC-IODE "Global Oceanographic Data Archeology and Rescue" (GODAR) Project and "World Ocean Database" Project". Data Science Journal 11: 46–71. doi:10.2481/dsj.012-014. 
  28. Research Data Alliance (24 March 2017). "Guidelines to the Rescue of Data At Risk". https://www.rd-alliance.org/guidelines-rescue-data-risk. 
  29. Research Data Alliance (14 August 2019). "Data Rescue IG". https://rd-alliance.org/groups/data-rescue.html. 
  30. Faundeen, J. (2017). "Developing Criteria to Establish Trusted Digital Repositories". Data Science Journal 16: 22. doi:10.5334/dsj-2017-022. 
  31. Peng, G.; Privette, J.L.; Kearns, E.J. et al. (2015). "A Unified Framework for Measuring Stewardship Practices Applied to Digital Environmental Datasets". Data Science Journal 13: 231–53. doi:10.2481/dsj.14-049. 
  32. Peng, G. (2018). "The State of Assessing Data Stewardship Maturity – An Overview". Data Science Journal 17: 7. doi:10.5334/dsj-2018-007. 
  33. Peng, G.; Milan, A.; Ritchey, N.A. et al. (2019). "Practical Application of a Data Stewardship Maturity Matrix for the NOAA OneStop Project". Data Science Journal 18 (1): 41. doi:10.5334/dsj-2019-041. 

Notes

This presentation is faithful to the original, with only a few minor formatting changes. In some cases important information was missing from the references, and that information was added. The original article lists references in alphabetical order; however, this version lists them in order of appearance, by design.