Journal:Methods for specifying scientific data standards and modeling relationships with applications to neuroscience

From LIMSWiki
Revision as of 19:51, 20 February 2017 by Shawndouglas (talk | contribs) (Saving and adding more.)
Jump to navigationJump to search
Full article title Methods for specifying scientific data standards and modeling relationships with applications to neuroscience
Journal Frontiers in Neuroinformatics
Author(s) Rübel, Oliver; Dougherty, Max; Prabhat; Denes, Peter; Conant, David; Chang, Edward F.; Bouchard, Kristofer
Author affiliation(s) Lawrence Berkeley National Laboratory; University of California, San Francisco
Primary contact Email: oruebel at lbl dot gov -or- kebouchard at lbl dot gov
Editors Ghosh, Satrajit S.
Year published 2016
Volume and issue 10
Page(s) 48
DOI 10.3389/fninf.2016.00048
ISSN 1662-5196
Distribution license Creative Commons Attribution 4.0 International
Website http://journal.frontiersin.org/article/10.3389/fninf.2016.00048/full
Download http://journal.frontiersin.org/article/10.3389/fninf.2016.00048/pdf (PDF)

Abstract

Neuroscience continues to experience a tremendous growth in data; in terms of the volume and variety of data, the velocity at which data is acquired, and in turn the veracity of data. These challenges are a serious impediment to sharing of data, analyses, and tools within and across labs. Here, we introduce BRAINformat, a novel data standardization framework for the design and management of scientific data formats. The BRAINformat library defines application-independent design concepts and modules that together create a general framework for standardization of scientific data. We describe the formal specification of scientific data standards, which facilitates sharing and verification of data and formats. We introduce the concept of "managed objects," enabling semantic components of data formats to be specified as self-contained units, supporting modular and reusable design of data format components and file storage. We also introduce the novel concept of "relationship attributes" for modeling and use of semantic relationships between data objects. Based on these concepts we demonstrate the application of our framework to design and implement a standard format for electrophysiology data and show how data standardization and relationship-modeling facilitate data analysis and sharing. The format uses HDF5, enabling portable, scalable, and self-describing data storage and integration with modern high-performance computing for data-driven discovery. The BRAINformat library is open-source, easy-to-use, and provides detailed user and developer documentation and is freely available at https://bitbucket.org/oruebel/brainformat.

Introduction

Neuroscience research is facing an increasingly challenging data problem due to the growing complexity of experiments and the volume/variety of data being collected from many acquisition modalities. Neuroscientists are routinely collecting data in a broad range of formats that are often highly domain-specific, ad-hoc and/or designed for efficiency with respect to very specific tools and data types. Even for single experiments, scientists are interacting with often tens of different formats — one for each recording device and/or analysis — while many formats are not well-described or are only accessible via proprietary software. Navigating this quagmire of formats hinders efficient analysis, data sharing, and collaboration and can lead to errors and misinterpretations. File formats and standards that can represent neuroscience data and make the data easily accessible play a key role in enabling scientific discovery, development of reusable tools for analysis, and progress toward fostering collaboration in the neuroscience community.

The requirements toward a data format standard for neuroscience are highly complex and go far beyond the needs of traditional, modality-specific formats (e.g., image, audio, or video formats). A neuroscience data format needs to support the management and organization of large collections of data from many modalities and sources such as neurological recordings, external stimuli, recordings of external responses and events (e.g., motion-tracking, video, audio, etc.), derived analytic results, and many others. To enable data interpretation and analysis, the format needs to also support storage of complex metadata, such as descriptions of recording devices, experiments, or subjects, among others.

In addition, a usable and sustainable neuroscience data format needs to satisfy many technical requirements. For example, the format should be self-describing, easy-to-use, efficient, portable, scalable, verifiable, extensible, easy-to-share, and support self-contained and modular storage. Meeting all these complex needs is a daunting challenge. Arguably, the focus of a neuroscience data standard should be on addressing the application-centric needs of organizing scientific data and metadata, rather than on reinventing file storage methods. We here focus on the design of a framework for standardization of data formats while utilizing HDF5 as the underlying data model and storage format. Using HDF5 has the advantage that it already satisfies most of the basic, technical format requirements; HDF5 is self-describing, portable, extensible, widely supported by programming languages and analysis tools, and is optimized for storage and I/O of large-scale scientific data.

In this manuscript we introduce BRAINformat, a novel data format standardization framework and API for scientific data, developed at the Lawrence Berkeley National Labs in collaboration with neuroscientists at the University of California, Berkeley and the University of California, San Francisco. BRAINformat supports the formal specification and verification of scientific data formats and supports the organization of data in a modular, extensible, and reusable fashion via the concept of managed objects (Section 3.1). We introduce the novel concept or relationship attributes for modeling of direct relationships between data objects. Relationship attributes support the specification of structural and semantic links between data, enabling users and developers to formally document and utilize object-to-object relationships in a well-structured and programmatic fashion (Section 3.2). We demonstrate the use of chains of object-to-object relationships to model complex relationships between multi-dimensional arrays based on data registration via the concept of advanced index map relationships (Section 3.2.4). We demonstrate the application of our framework to design and implement a standard format for electrophysiology data and show how data standardization and relationship-modeling facilitate multi-modal data analysis and data sharing (Section 4).

Background and related work

The scientific community utilizes a broad range of data formats. Basic formats explicitly specify how data is laid out and formatted in binary or text data files (e.g., CSV, BOF, etc). While such basic formats are common, they generally suffer from a lack of portability, scalability and a rigorous specification. For text-based files, languages and formats, such as the Extensible Markup Language (XML)[1] or the JavaScript Object Notation (JSON)[2], have become popular means to standardize documents for data exchange. XML, JSON and other text-based standards (in combination with character-encoding schema, e.g., ASCII or Unicode) play a critical role in practice in the exchange of usually relatively small, structured documents but are impractical for storage and exchange of large scientific data arrays.

For storage of large scientific data, HDF5[3] and NetCDF[4] among others, have gained wide popularity. HDF5 is a data model, library, and file format for storing and managing large and complex data. HDF5 supports groups, datasets, and attributes as core data object primitives, which in combination provide the foundation for data organization and storage. HDF5 is portable, scalable, self-describing, and extensible and is widely supported across programming languages and systems, e.g., R, Matlab, Python, C, Fortran, VisIt, or ParaView. The HDF5 technology suite includes tools for managing, manipulating, viewing, and analyzing HDF5 files. HDF5 has been adopted as a base format across a broad range of application sciences, ranging from physics to bio-sciences and beyond.[5] Self-describing formats address the critical need for standardized storage and exchange of complex and large scientific data.

Self-describing formats like HDF5 provide general capabilities for organizing data, but they do not prescribe a data organization. The structure, layout, names, and descriptions of storage objects, hence, often still differ greatly between applications and experiments. This diversity makes the development of common and reusable tools challenging. VizSchema[6] and XDMF[7] among others, propose to bridge this gap between general-purpose, self-describing formats and the need for standardized tools via additional lightweight, low-level schema (often based on XML) to further standardize the description of the low-level data organization to facilitate data exchange and tool development.

Application-oriented formats then generally focus on specifying the organization of data in a semantically meaningful fashion, including but not limited to the specification of storage object names, locations, and descriptions. Many application formats build on existing self-describing formats, e.g., NeXus[8] (neutron, x-ray, and muon data), OpenMSI[9] (mass spectrometry imaging), CXIDB[10] (coherent x-ray imaging), or NetCDF[4] in combination with CF and COARDS metadata conventions for climate data, and many others. Application formats are commonly described by documents specifying the location and names of data items and often provide application-programmer interfaces (API) to facilitate reading and writing of format files. Some formats are further governed by formal, computer-readable, and verifiable specifications. For example, NeXus uses the NXDL[11] XML-based format and schema to define the nomenclature and arrangement of information in a NeXus data file. On the level of HDF5 groups, NeXus also uses the notion of "classes" to define the fields that a group should contain in a reusable and extensible fashion.

The critical need for data standards in neuroscience has been recognized by several efforts over the course of the last several years[12]; however, much work remains. Here, our goal is to contribute to this discussion by providing much-needed methods and tools for the effective design of sustainable neuroscience data standards and demonstration of the methods in practice toward the design and implementation of a usable and extensible format with an initial focus on electrocardiography data. The developers of the KlustaKwik suite[13] have proposed an HDF5-based data format for storage of spike sorting data. Orca (also called BORG)[14] is an HDF5-based format developed by the Allen Institute for Brain Science designed to store electrophysiology and optophysiology data. The NIX project[15] has developed a set of standardized methods and models for storing electrophysiology and other neuroscience data together with their metadata in one common file format based on HDF5. Rather than an application-specific format, NIX defines highly generic models for data as well as for metadata that can be linked to terminologies (defined via odML) to provide a domain-specific context for elements. The open metadata Markup Language odML[16] is a metadata markup language based on XML with the goal to define and establish an open and flexible format to transport neuroscience metadata. NeuroML[17] is also an XML-based format with a particular focus on defining and exchanging descriptions of neuronal cell and network models. The Neurodata Without Borders (NWB)[18] initiative is a recent project with the specific goal "[…] to produce a unified data format for cellular-based neurophysiology data based on representative use cases initially from four laboratories — the Buzsaki group at NYU, the Svoboda group at Janelia Farm, the Meister group at Caltech, and the Allen Institute for Brain Science in Seattle." Members of the NIX, KWIK, Orca, BRAINformat, and other development teams have been invited and contributed to the NWB effort. NWB has adopted concepts and methods from a range of these formats, including from the here-described BRAINformat.

References

  1. Bray, T.; Paoli, J.; Sperberg-McQueen, C.M. et al. (26 November 2008). "Extensible Markup Language (XML) 1.0 (Fifth Edition)". World Wide Web Consortium. https://www.w3.org/TR/2008/REC-xml-20081126/. 
  2. Crockford, D.. "Introducing JSON". http://json.org/. 
  3. "Welcome to the HDF5 Home Page". The HDF Group. https://support.hdfgroup.org/HDF5/. 
  4. 4.0 4.1 Rew, R.; Davis, G. (1990). "NetCDF: An interface for scientific data access". IEEE Computer Graphics and Applications 10 (4): 76–82. doi:10.1109/38.56302. 
  5. Habermann, T.; Collette, A.; Vincerna, S. et al. (29 September 2014). "The Hierarchical Data Format (HDF): A Foundation for Sustainable Data and Software". The HDF Group. http://dx.doi.org/10.6084/m9.figshare.1112485. 
  6. "VizSchema – Visualization interface for scientific data". IADIS International Conference Computer Graphics, Visualization, Computer Vision and Image Processing 2009 2009: 49–56. 2009. ISBN 9789728924843. http://www.iadisportal.org/digital-library/vizschema-%C2%96-visualization-interface-for-scientific-data. 
  7. "Enhancements to the eXtensible Data Model and Format (XDMF)". DoD High Performance Computing Modernization Program Users Group Conference, 2007 2007. 2007. doi:10.1109/HPCMP-UGC.2007.30. ISBN 9780769530885. 
  8. Klosowski, P.; Kiennecke, M.; Tischler, J.Z.; Osborn, R. (1997). "NeXus: A common format for the exchange of neutron and synchroton data". Physica B: Condensed Matter 241—243: 151–153. doi:10.1016/S0921-4526(97)00865-X. 
  9. Rübel, O.; Greiner, A.; Cholia, S. et al. (2013). "OpenMSI: A High-Performance Web-Based Platform for Mass Spectrometry Imaging". Analytical Chemistry 85 (21): 10354–10361. doi:10.1021/ac402540a. 
  10. Maia, F.R. (2012). "The Coherent X-ray Imaging Data Bank". Nature Methods 9 (9): 854-5. doi:10.1038/nmeth.2110. PMID 22936162. 
  11. "3.2 NXDL: The NeXus Definition Language". NeXus User Manual and Reference Documentation. NeXus International Advisory Committee. http://download.nexusformat.org/doc/html/nxdl.html. 
  12. Teeters, J.; Benda, J.; Davison, A. et al. (3 June 2016). "Requirements for storing electrophysiology data". Cornell University Library. https://arxiv.org/abs/1605.07673. 
  13. Kadir, S.N.; Goodman, D.F.M.; Harris, K.D.. "Kwik format". http://klusta.readthedocs.io/en/latest/kwik/. 
  14. Godfrey, K. (20 November 2014). "Orca file format for e-phys and o-phys neurodata" (PDF). Neurodata Without Borders Hackathon. http://crcns.org/files/data/nwb/h1/NWBh1_09_Keith_Godfrey.pdf. 
  15. Stoewer, A.; Kellner, C.J.; Grewe, J. (2016). "Nix: Home". https://github.com/G-Node/nix/wiki. 
  16. Grewe, J.; Wachtler, T.; Benda, J. (2011). "A bottom-up approach to data annotation in neurophysiology". Frontiers in Neuroinformatics 5: 16. doi:10.3389/fninf.2011.00016. 
  17. Gleeson, P.; Crook, S.; Cannon, R.C. (2010). "NeuroML: A language for describing data driven models of neurons and networks with a high degree of biological detail". PLoS Computational Biology 6 (6): e1000815. doi:10.1371/journal.pcbi.1000815. PMC PMC2887454. PMID 20585541. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887454. 
  18. Teeters, J.L.; Godfrey, K.; Young, R. et al. (2015). "Neurodata Without Borders: Creating a common data format for neurophysiology". Neuron 88 (4): 629–34. doi:10.1016/j.neuron.2015.10.025. PMID 26590340. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. References are in order of appearance rather than alphabetical order (as the original was). The original URL to klusta Kwik format was no longer valid, and an updated URL was used.