Journal:A critical literature review of historic scientific analog data: Uses, successes, and challenges

Full article title: A critical literature review of historic scientific analog data: Uses, successes, and challenges
Journal: Data Science Journal
Author(s): Kelly, Julia A.; Farrell, Shannon L.; Hendrickson, Lois G.; Luby, James; Mastel, Kristen L.
Author affiliation(s): University of Minnesota
Primary contact: Email: jkelly at umn dot edu
Year published: 2022
Volume and issue: 21
Article #: 14
DOI: 10.5334/dsj-2022-014
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: https://datascience.codata.org/articles/10.5334/dsj-2022-014
Download: https://storage.googleapis.com/jnl-up-j-dsj-files/journals/1/articles/1444/submission/proof/1444-1-10637-1-10-20220728.pdf (PDF)

Abstract

For years, scientists in fields from climate change to biodiversity to hydrology have used older data to address contemporary issues. Since the 1960s, researchers, recognizing the value of this data, have expressed concern about its management and potential for loss. No widespread solutions have emerged to address the myriad issues around its storage, access, and findability. This paper summarizes observations and concerns of researchers in various disciplines who have articulated problems associated with analog data and highlights examples of projects that have used historical data. The authors also examined selected papers to discover how researchers located historical data and how they used it. While many researchers are not producing huge amounts of analog data today, there are still large volumes of it that are at risk. To address this concern, the authors recommend the development of best practices for managing historic data. This will take communication across disciplines and the involvement of researchers, departments, institutions, and associations in the process.

Keywords: analog data, historic data, data policies, risk of loss, data rescue, dark data

Introduction

Concerns about the management of data—including its preservation, findability, and reuse—are almost entirely focused on recently-generated data in electronic, machine-readable formats. While many of the principles of the management of electronic data such as proper description and good organization apply to data in any format, the discussions about applying those principles to older data in non-electronic formats have not received much attention.

In this paper we review publications in various scientific fields that discuss older data that is in analog or print format and the use or reuse of older data in general. By analog data we mean items in print format such as numeric data as well as field or lab notebooks, photographs, drawings, and maps. Analog data may also be called historic data, legacy data, heritage data, or dark data, although these and other phrases can include older data that is not necessarily in print format. Some authors also use the term "data rescue," which has also been used to describe recent efforts to duplicate and secure electronic data that may be at risk of loss (e.g., the Data Refuge community[1]).

Our interest in this topic began when a few senior faculty members approached the University library for assistance in organizing and possibly housing their analog data. [Farrell et al. 2019] A survey of life sciences researchers on campus revealed that many held analog data and considered it valuable but were unsure of how to preserve it. [Farrell et al. 2020] Nearly all were willing to share it. Given that most researchers now either collect data digitally or quickly transfer any analog data to digital form, this is a finite problem; however, because many of the stewards of analog data are nearing retirement, it is also a timely one. We undertook this literature review to learn how scientific researchers are dealing with the analog data in their possession and whether any large-scale efforts have been undertaken to address the issues.

Types of analog data

Much of the analog data that exists in offices, labs, homes, archives, and other locations is numeric in nature. It was probably collected before electronic spreadsheets were commonly available for both capturing and analyzing data. The format could be loose notebook paper, index cards, large data sheets, or bound or unbound notebooks. It could also take the form of a log, possibly combining numeric and descriptive data in chronological order.

The data may also be descriptive in nature and contained in field notebooks or diaries. The tags associated with museum and herbarium specimens are often mined for the data that they note, such as species, location, dates, and other parameters. Although the tags and the specimens are inextricably tied, when we discuss analog data we are not including the specimens themselves but just the information on the tags.

Drawings and photographs may accompany other forms of data or may stand on their own, hopefully with enough description to make them useful to current researchers. The same is true of maps, which may be printed or hand drawn.

History of concern about analog data

A number of authors have written about analog data over the last 50+ years, often noting its potential value and lamenting the lack of procedures, funding, and best practices to help support its ongoing use and preservation. Psychologists in the 1960s and 1970s noted not only the importance of new observations coming from the re-examination of older data but also the practice of comparing newly gathered data to historic data. [Johnson 1964; Craig & Reese 1973] Speaking about data that authors have not retained, Wolins (1962) suggests a role for professional associations: "If it were clearly set forth by the [American Psychological Association] ... the responsibility for retaining raw data ... this dilemma would not exist." For a time the U.S. government played a role through the American Documentation Institute at the Library of Congress, which accepted some raw data to be preserved. [Craig & Reese 1973] Recently, Buma made use of photographs in Glacier Bay to longitudinally track plant growth and establishment and noted that if it were easier to learn of the existence of older data and to obtain copies, its value would grow. [Buma 2018; Buma et al. 2019]

In most cases, authors limit their discussions to the situation in their own subspecialty, although a few have taken a broader view. A notable example is the final report of the Ecological Society of America (ESA) committee on the Future of Long-term Ecological Data. [Gross & Pake 1995] The lengthy report details the situation and offers numerous recommendations for the future. Although it does not exclusively focus on analog data, it states "[a]mong the least secure are data in the hands of an individual researcher who has made little or no provision for long-term curation." [Gross & Pake 1995]

Also in 1995, the National Research Council published both Finding the Forest in the Trees: The Challenge of Combining Diverse Environmental Data and Preserving Scientific Data on Our Physical Universe. [National Research Council (U.S.) 1995a; National Research Council (U.S.) 1995b] The first report highlights the variables, measures, and data management, and puts forward 18 recommendations. These call on professional societies, research institutions, funding agencies, and individual researchers to collaborate, carefully plan, focus on interoperability, create rich metadata, and make data more widely available. [National Research Council (U.S.) 1995a] The latter report notes many problems and few solutions, stating "[t]he most important deficiencies are in the documentation, access, and long-term preservation of data in usable form." [National Research Council (U.S.) 1995b] Again, analog data was not the focus of these works, but it was covered.

Easterday et al. note the "potential of historical dark data to contribute to the modern digital ecological data landscape." [Easterday et al. 2018] They stress the importance of metadata and the need to promote the data and the best practices around it. In his book Repurposing Legacy Data, Berman states that "data repurposing creates value where none was expected." [Berman 2015] The book includes case studies from a variety of disciplines and has chapters on identifying data that might lend itself to repurposing and understanding the organization of older data. [Berman 2015]

Griffin advocates for the value of "heritage data," noting that much of it is at risk and in order to secure it for future use, "certain priorities need to be re-ordered, new skills acquired and taught, resources redirected, and new networks constructed." [Griffin 2015] Griffin was active in the CODATA Data at Risk Task Group which, along with its successor, the Research Data Alliance’s Data Rescue Interest Group, worked to highlight the value of older data and promote projects that used or preserved it.[2] Patil and Siegel note that bringing more dark data to the forefront will require different incentives from all those involved: "journals, citation indexes, funding agencies, academic institutions and, not least, the researchers themselves." [Patil and Siegel 2009] Although they write from a health sciences perspective, this probably applies more broadly.

A number of authors have drawn attention to the use or potential use of analog data in their particular fields. In fisheries, Singer et al. [2020] surveyed fellow researchers to get a better idea of how and why they used fish collections in order to inform both researchers and those who manage the collections. The value and possible reuse of data collected at biological field stations has been noted since at least the 1980s. [Bowser 1986] Bowser emphasized the importance of data management and suggested that field station data might be deposited with libraries, historical societies, or federal agencies. [Bowser 1986] Easterday et al. [2018] make their observations about the use of data science principles by highlighting work from three California field stations, and Michener et al. [2009] wrote an article entitled "Biological Field Stations: Research Legacies and Sites for Serendipity."

Ecological researchers have long mined analog data and historical records in their work, according to Beans. [Beans 2018] While she focuses on journal entries, maps, and photos, she highlights common challenges such as locating material and working with someone else’s organizational scheme. She highlights Loren McClenachan [2009, 2012, 2017], a marine ecologist who utilizes historical data in her research and also published a policy-oriented article on the benefits of using older data to set baselines in marine studies. Over 20 years ago, Olson and McCord [1998, 2000] wrote two book chapters on data archiving in the ecological sciences. Although the emphasis is on digital data, they spell out recommendations on incentives, metadata, and components of an archive that apply to analog and digital materials.

Kwok [2017] reports on the use of older data in the fields of both ecology and climate science. In the area of climate science, Brönnimann et al. are mainly concerned with digital data but provide an overview of efforts to locate and digitize analog data, commenting that "the fraction of yet-to-be-digitized data is difficult to quantify," implying that it is large indeed. [Brönnimann et al. 2018]

Geological researchers sometimes have an added reason to want to discover and use older data: it may have been collected using methods that are now difficult or impossible to employ due to stricter regulations. Diviacco et al. [2015] write about a project where data was both analog and digital and had been obtained using dynamite. Vearncombe et al. [2016, 2017], using examples from the mining industry, note that "upcycling" of data can mean cost savings as well as new insights from reexamination of data.

A number of disciplines have employed citizen science projects to assist in the analog data efforts. These take the form of either mining older citizen science projects for their data or initiating new projects that provide person-hours to reformat or otherwise transform or collate analog data. Clavero et al. [2014, 2017] examine species lists to study trout decline, Hof and Bright [2016] look at previous counts of hedgehogs, and Snall et al. [2011] consider the use of presence data from bird monitoring. A recent citizen science project on the Zooniverse platform involves identifying data in papers written by students at the University of Michigan Field Station.[3]

While many authors bemoan the unfortunate state of older data in their subdisciplines, a few areas offer success stories. Researchers working in biodiversity, many of whom are connected with museums or herbaria that hold physical specimens and their metadata-rich identification tags, are an example. They have built networks and secured funding for several international biodiversity-related projects that address data tied to specimens as well as the objects themselves. Projects include Integrated Digitized Biocollections (iDigBio), Global Biodiversity Information Facility (GBIF), and Distributed System of Scientific Collections (DiSSCo). The progress in digitization and dissemination of biodiversity data over the last 20 years is summarized by Nelson and Ellis [2019].

Climate researchers have also made great strides in gathering disparate data in analog and digital format and making it accessible to the global community of scientists. The EU-based Copernicus Climate Change Service (C3S) and International Data Rescue Portal (I-DARE) serve as examples.

Some contemporary groups that rescue and reuse older analog data have very narrowly focused subject areas. The Living Data Project, sponsored by the Canadian Institute of Ecology and Evolution, funds new projects each year with topics such as "Species ranges, diversity and life history of Neotropical birds" and "Responses of freshwater zooplankton to road salt pollution: A global perspective." Another project, based at the USDA National Agricultural Library (Data Rescue Case Study: Long-Term Livestock Production Data), gathered older data from throughout the US, converted it to electronic formats, and deposited it in AgData Commons. [Patton et al. 2022]

Field and lab notebooks have been the focus of a number of digitization projects. They may be held in archives, museums, libraries, or research facilities, as well as by individuals. The Biodiversity Heritage Library, in conjunction with several other institutions including the Smithsonian, includes nearly 3,000 scanned field books.[4] On a smaller scale, Texas A&M Libraries has digitized the field notebooks and specimen catalogs of W. B. Davis (1930–1981) and they have been viewed over 1,000 times.[5] Thomer et al. [2012] propose a method for efficiently extracting species data from handwritten field notebooks.

Ways that older analog data is utilized

Researchers may use older data in a variety of ways. Some strive to repeat an earlier survey or experiment as closely as possible. [Lannoo et al. 1994; Gent & Morgan 2007; Hédl, Petřík & Boublík 2011; Riddell et al. 2021] Others reexamine older data or incorporate portions of it into their current work. [Trisurat et al. 2020; Azeria et al. 2006; Brodman, Cortwright & Resetar 2002; Fellers & Drost 1993] Authors may also have consulted earlier data as they developed their research plans. Mandates for the preservation of data that have emerged in the last 15 years have elevated the topic of data reuse, although most recent research has considered only digital data. [Curty et al. 2017; Khan, Thelwall & Kousha 2021; Yoon & Kim 2017]

The methods that researchers use to obtain older data often remain a mystery. Large data collections such as iDigBio provide background, training, examples, and other resources for potential data users, and authors are likely to mention or cite these collections. This is often not true for projects that use older data. In a preliminary investigation, we examined 66 scientific papers that used analog data, and only seven spelled out how the authors located it (see Figure 1). None of the authors of this set of papers mention going back to the original authors of the publications to obtain more detailed information, although it is hard to imagine that none of them took that step.


Figure 1. Description of sources of historic data for scientific researchers who had re-used it in publications. Scientific papers (N = 66) that illustrated evidence of use of this data were examined to determine the source of the data and how it was identified and located by the researchers.


References

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original organizes references in alphabetical order; this version organizes them by order of appearance, by design. Several inline URLs in the original were turned into formal citations for this version.