A robust, format-agnostic scientific data transfer framework

Full article title	A robust, format-agnostic scientific data transfer framework
Journal	Data Science Journal
Author(s)	Hester, James
Author affiliation(s)	Australian Nuclear Science and Technology Organisation
Primary contact	Email: jxh at ansto dot gov dot au
Year published	2016
Volume and issue	15
Page(s)	12
DOI	10.5334/dsj-2016-012
ISSN	1683-1470
Distribution license	Creative Commons Attribution 4.0 International
Website	http://datascience.codata.org/articles/10.5334/dsj-2016-012/
Download	http://datascience.codata.org/articles/10.5334/dsj-2016-012/galley/605/download/ (PDF)

Abstract

The olog approach of Spivak and Kent^[1] is applied to the practical development of data transfer frameworks, yielding simple rules for construction and assessment of data transfer standards. The simplicity, extensibility and modularity of such descriptions allow discipline experts unfamiliar with complex ontological constructs or toolsets to synthesize multiple pre-existing standards, potentially including a variety of file formats, into a single overarching ontology. These ontologies nevertheless capture all scientifically relevant prior knowledge, and when expressed in machine-readable form are sufficiently expressive to mediate translation between legacy and modern data formats. A format-independent programming interface informed by this ontology consists of six functions, of which only two handle data. Demonstration software implementing this interface is used to translate between two common diffraction image formats using such an ontology in place of an intermediate format.

Keywords: metadata, ontology, knowledge representation, data formats

Introduction

For most of scientific history, results and data were communicated using words and numbers on paper, with correct interpretation of this information reliant on the informal standards created by scholarly reference works, linguistic background, and educational traditions. Modern scientists increasingly rely on computers to perform such data transfer, and in this context the sender and receiver agree on the meaning of the data via a specification as interpreted by authors of the sending and receiving software. Recent calls to preserve raw data^[2]^[3] and a growing awareness of a need to manage the explosion in the variety and quantity of data produced by modern large-scale experimental facilities (big data) have led to an increase in the number and coverage of these data transfer standards. Overlap in the areas of knowledge covered by each standard is increasingly common, either because the newer standards aim to replace older ad hoc or de facto standards, or because of natural expansion into the territory of ontologically “neighboring” standards. One example of such overlap is found in single-crystal diffraction: the newer NeXus standard for raw data^[4] partly covers the same ontological space as the older imgCIF standard^[5], and both aim to replace the multiplicity of ad hoc standards for diffraction images.

Authors of scientific software faced with multiple standards generally write custom input or output modules for each standard. For example, the HKL Research, Inc. suite of diffraction image processing programs accepts over 300 different formats.^[6] In such software, broadly useful information on equivalences and transformations is crystallized in code that is specific to a programming language and software environment and is therefore difficult for other authors faced with the same problems to reuse, even if code is freely available. Such uniform processing and merging of disparate standards has been extensively studied by the knowledge representation community: it is one outcome of "ontological alignment" or "ontological mapping," which has been the subject of hundreds of publications over the last decade.^[7] Despite the availability of ontological mapping tools, Otero-Cerdeira, Rodríguez-Martínez, & Gómez-Rodríguez note that relatively few ontology matching systems are put to practical use (see their section 4.5). One barrier to adoption is likely to be the need for the discipline experts driving standards development to learn ontological concepts and terminology in order to evaluate and use ontological tools: the effort required to master these tools may not be judged to yield commensurate benefits in situations where communities have historically been able to transfer data reliably without such formal approaches. Introduction of ontological ideas into data transfer would therefore stand more chance of success if those ideas are simple to understand and implement, as well as offering tangible benefits over the status quo. Indeed one of the challenges noted by Otero-Cerdeira et al. is to "define good tools that are easy to use for non-experts."

Much of the research listed by Otero-Cerdeira et al. has understandably been predicated on reducing human involvement in the mapping process, although expert human intervention is still currently required. In contrast to the thousands of terms found in ontologies tackled by ontological mapping projects, data files in the experimental sciences usually contain information relating to a few dozen well-defined scientific concepts, and so manual handling of ontologies is feasible. The present paper therefore adopts the practical position that, if involvement of discipline experts is unavoidable, then the method of representing the ontology should be as accessible as possible to those experts. An easily-applied framework for scientist-driven formalization, development and assessment of data transfer standards is presented, aimed at minimizing the complexity of the task, while promoting interoperability and minimizing duplication of programmer and domain expert effort.

After describing the framework in the next section, we demonstrate the utility of these concepts by discussing schemes for standards development and later semiautomatic data file translation.

A conceptual framework for data file standards

The framework described here covers systems for automated transfer and manipulation of scientific data. In other words, following creation of the reading and writing software in consultation with the data standard, no further human intervention is necessary in order to automatically create, ingest, and perform calculations on data from standards-conformant data files. Note that simple transfer of information found in the data file to a human reader — for example, presentation of text or graphics — is of minor significance in this context, as such operations, while useful, do not require any interpretation of the data by the computer and are in essence identical to traditional paper-based transfer of information from writer to reader.

Terminology used in this paper is defined in Table 1. The process of scientific data transfer is described using these terms as follows: in consultation with the ontology, authors of file output software determine the required or possible list of datanames for their particular application, then correlate concepts handled by their code to these datanames, arranging for the appropriate values to be linked to the datanames within the output data format according to the specifications within the format adapter. A file in this format is then transferred or archived. At some point, software written in consultation with the same format adapter and ontology extracts datavalues from the file and processes them correctly.

Table 1. Definitions of terms used in this paper
Term	Description
Dataname	A name for a concept with which one or more values can be associated
Data item	A single item of information, consisting of a dataname and one or more associated data values
Ontology	A collection of datanames and associated meanings, including relationships. Once "ologs" have been defined, "ontology" usually refers to an ontology expressed using an olog.
Data format	The structures in which the data are encapsulated for transfer, for example XML or HDF5. Informal discussions often use the word "format" to encompass both the file format and the ontology used to interpret the data items found in it. To avoid confusion, the word "format" is here used to refer only to the file structure.
Data bundle	A collection of data items
Dataname list	The subset of datanames from an ontology that are included in a given data bundle
Format adapter	A description of how the values associated with datanames are encoded in a particular data format
Transfer specification	The combination of a format adapter with a dataname list
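
The relationships among the terms in Table 1 can be illustrated with a minimal Python sketch. It is not the interface developed in this paper. The datanames, the `JsonAdapter` class, and the `write_bundle` and `read_bundle` helpers are all hypothetical, and JSON merely stands in for a real data format such as HDF5 or CIF.

```python
import json
from dataclasses import dataclass

# Hypothetical ontology: datanames mapped to their agreed meanings.
ONTOLOGY = {
    "wavelength": "Incident radiation wavelength, in angstroms",
    "detector_distance": "Sample-to-detector distance, in millimetres",
    "exposure_time": "Exposure time per frame, in seconds",
}


class JsonAdapter:
    """Toy format adapter: values are stored in one flat JSON object,
    keyed by dataname. A real adapter would describe HDF5, CIF, etc."""

    def encode(self, bundle: dict) -> str:
        return json.dumps(bundle, sort_keys=True)

    def decode(self, text: str) -> dict:
        return json.loads(text)


@dataclass(frozen=True)
class TransferSpecification:
    """A format adapter combined with a dataname list (Table 1)."""
    dataname_list: tuple
    adapter: JsonAdapter


def write_bundle(spec: TransferSpecification, values: dict) -> str:
    """Writer side: link values to the listed datanames and encode them."""
    unknown = [n for n in spec.dataname_list if n not in ONTOLOGY]
    if unknown:
        raise ValueError(f"datanames not in ontology: {unknown}")
    missing = [n for n in spec.dataname_list if n not in values]
    if missing:
        raise ValueError(f"no value supplied for: {missing}")
    # The data bundle holds only the data items named in the dataname list.
    bundle = {name: values[name] for name in spec.dataname_list}
    return spec.adapter.encode(bundle)


def read_bundle(spec: TransferSpecification, text: str) -> dict:
    """Reader side: recover data values using the same specification."""
    decoded = spec.adapter.decode(text)
    return {name: decoded[name] for name in spec.dataname_list}


# Usage: writer and reader share only the ontology and the specification.
spec = TransferSpecification(("wavelength", "detector_distance"), JsonAdapter())
file_text = write_bundle(
    spec,
    {"wavelength": 1.5418, "detector_distance": 120.0, "exposure_time": 2.0},
)
recovered = read_bundle(spec, file_text)
```

In this sketch the writer and reader never exchange code. Agreement on the ontology and the transfer specification is assumed to be sufficient for the values to be recovered, and swapping in a different adapter would change the file structure without touching the datanames.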

Following Shvaiko & Euzenat^[8], the word "ontology" as used in this paper refers to a system of interrelated terms and their meanings, regardless of the way in which those meanings are represented or described. Under this definition, Table 1 is itself an ontology for use solely by the human reader in understanding the present paper. An ontology may be encoded using a language such as OWL^[9] to produce a human- and machine-readable document allowing some level of machine verification, deduction and manipulation.

This paper makes frequent reference to two established data transfer standards in the area of experimental science: the Crystallographic Information Framework (CIF)^[10] and the NeXus standard.^[11]

References

↑ Spivak, D.I.; Kent, R.E. (2012). "Ologs: A categorical framework for knowledge representation". PLoS One 7 (1): e24274. doi:10.1371/journal.pone.0024274. PMC 3269434. PMID 22303434. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3269434.
↑ Boulton, G. (2012). "Open your minds and share your results". Nature 486 (7404): 441. doi:10.1038/486441a. PMID 22739274.
↑ Kroon-Batenburg, L.M.; Helliwell, J.R. (2014). "Experiences with making diffraction image data available: What metadata do we need to archive?". Acta Crystallographica Section D Biological Crystallography 70 (Pt. 10): 2502–9. doi:10.1107/S1399004713029817. PMC 4187998. PMID 25286836. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4187998.
↑ "NXmx – Nexus: Manual 3.1 documentation". NeXusformat.org. NIAC. 2015. http://download.nexusformat.org/doc/html/classes/applications/NXmx.html.
↑ Bernstein, H.J. (2006). "Classification and use of image data". International Tables for Crystallography G (3.7): 199–205. doi:10.1107/97809553602060000739.
↑ "Detectors & Formats recognized by the HKL/HKL-2000/HKL-3000 Software". HKL Research, Inc. 2016. http://www.hkl-xray.com/detectors-formats-recognized-hklhkl-2000hkl-3000-software.
↑ Otero-Cerdeira, L.; Rodríguez-Martínez, F.J.; Gómez-Rodríguez, A. (2015). "Ontology matching: A literature review". Expert Systems with Applications 42 (2): 949–971. doi:10.1016/j.eswa.2014.08.032.
↑ Shvaiko, P.; Euzenat, J. (2013). "Ontology matching: State of the art and future challenges". IEEE Transactions on Knowledge and Data Engineering 25 (1): 158–176. doi:10.1109/TKDE.2011.253.
↑ Hitzler, P.; Krötzsch, M.; Parsia, B. et al. (11 December 2012). "OWL 2 Web Ontology Language Primer (Second Edition)". W3C Recommendations. W3C. https://www.w3.org/TR/2012/REC-owl2-primer-20121211/.
↑ Hall, S.R.; McMahon, B., ed. (2005). International Tables for Crystallography Volume G: Definition and exchange of crystallographic data. Springer Netherlands. doi:10.1107/97809553602060000107. ISBN 9781402042904.
↑ Könnecke, M.; Akeroyd, F.A.; Bernstein, H.J. et al. (2015). "The NeXus data format". Journal of Applied Crystallography 48 (Pt. 1): 301–305. doi:10.1107/S1600576714027575. PMC 4453170. PMID 26089752. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4453170.

Notes

This version is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version, by design, lists them in order of appearance.