Journal:Building open access to research (OAR) data infrastructure at NIST

Full article title: Building open access to research (OAR) data infrastructure at NIST
Journal: Data Science Journal
Author(s): Greene, Gretchen; Plante, Raymond; Hanisch, Robert
Author affiliation(s): National Institute of Standards and Technology
Primary contact (email): gretchen dot greene at nist dot gov
Year published: 2019
Volume and issue: 18(1)
Page(s): 30
DOI: 10.5334/dsj-2019-030
ISSN: 1683-1470
Distribution license: Creative Commons Attribution 4.0 International
Website: https://datascience.codata.org/articles/10.5334/dsj-2019-030/
Download: https://datascience.codata.org/articles/10.5334/dsj-2019-030/galley/861/download/ (PDF)

Abstract

As a National Metrology Institute (NMI), the U.S. National Institute of Standards and Technology (NIST) employs scientists, engineers, and technology experts who conduct research across a full spectrum of physical science domains. NIST is a non-regulatory agency within the U.S. Department of Commerce with a mission to promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology in ways that enhance economic security and improve our quality of life. NIST research results in the production and distribution of standard reference materials, calibration services, and datasets. These are generated from a wide range of complex laboratory instrumentation, expert analyses, and calibration processes. In response to a government open data policy, and in collaboration with the broader research community, NIST has developed a federated Open Access to Research (OAR) scientific data infrastructure aligned with FAIR (findable, accessible, interoperable, reusable) data principles. Through the OAR initiatives, NIST's Material Measurement Laboratory Office of Data and Informatics (ODI) recently released a new scientific data discovery portal and public data repository. These science-oriented applications provide dissemination and public access for data from across the broad spectrum of NIST research disciplines, including chemistry, biology, materials science (such as crystallography and nanomaterials), physics, disaster resilience, cyberinfrastructure, communications, forensics, and others. NIST's public data consist of carefully curated Standard Reference Data, high-value legacy data, and new research data publications. The repository is thus evolving in both content and features as the nature of research progresses. Implementation of the OAR infrastructure is key to NIST's role in sharing high-integrity, reproducible research for measurement science in a rapidly changing world.

Keywords: data repository, FAIR, research metadata, metrology, data portal, government

Introduction

NIST research is predominantly characterized as “long tail” in terms of the data produced, i.e., small datasets that are highly varied in topic and content.[1] This is colloquially described as “a mile wide and an inch deep” and may be classified as big data in the context of variety and veracity. Newer laboratory instrumentation such as nuclear magnetic resonance spectrometers, electron microscopes, synchrotron beamlines, and high-performance computers ushers NIST into the realm of managing the velocity and volume of big data. Furthermore, new strategic initiatives in the areas of artificial intelligence (AI) require an infrastructure designed to support digital mining and transformation. Management and exchange of the underlying research domain-specific data with both internal and external communities are important considerations for the OAR architecture and implementation.

The overarching goal of OAR is to deliver a robust research data infrastructure to share the results of NIST research with the community at large. Our strategy for achieving this goal involves collaborative data science, the value of which has been demonstrated through usage statistics from astronomical archives’ data discovery and access patterns.[2] Organizations face many challenges in striving to balance rapid advancements in technology and data-driven research with internal operational costs and constraints. To meet these challenges, NIST assembled a diverse group of experts with key leaders and engaged stakeholders via cross-organizational advisors. This resulted in a joint effort to build an integrated system engineered to support data workflow processes, systems infrastructure, and public dissemination with secure, publicly accessible platforms for scientific collaboration.

At the outset of the OAR project, priority was placed on developing a system that would allow us to comply with government open data policy.[3] This resulted in a baseline Minimum Viable Product (MVP), delivering a NIST public data listing (PDL) which enforces adherence to a new government data standard semantic model, the Project Open Data (POD) schema. The NIST PDL continues to be routinely harvested by the Department of Commerce and made available through the U.S. data.gov web portal, which hosts records of all POD-compliant government public datasets. Following enactment of the OPEN Government Data Act[4], the OAR infrastructure will be updated further to maintain compliance.
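The idea of a listing that "enforces adherence" to the POD schema can be illustrated with a minimal sketch. It assumes the field names of the publicly documented POD v1.1 metadata schema; the helper names (`missing_pod_fields`, `is_listable`) and the placeholder values used in the usage note are hypothetical and do not describe NIST's actual implementation.

```python
# Hypothetical sketch: check a dataset record for the fields that the
# POD v1.1 schema is assumed here to mark as always required.
REQUIRED_POD_FIELDS = (
    "title", "description", "keyword", "modified", "publisher",
    "contactPoint", "identifier", "accessLevel",
)
# Assumed to be additionally required for U.S. federal agencies.
FEDERAL_POD_FIELDS = ("bureauCode", "programCode")

# Access levels assumed from the POD v1.1 documentation.
ACCESS_LEVELS = {"public", "restricted public", "non-public"}


def missing_pod_fields(record, federal=True):
    """Return the required field names that are absent or empty in record."""
    required = REQUIRED_POD_FIELDS + (FEDERAL_POD_FIELDS if federal else ())
    return [name for name in required if not record.get(name)]


def is_listable(record):
    """True if no required field is missing and accessLevel is recognized."""
    return (
        not missing_pod_fields(record)
        and record.get("accessLevel") in ACCESS_LEVELS
    )
```

A record with every field populated (for example with placeholder codes such as `"000:00"`) would pass `is_listable`, while dropping `keyword` would make `missing_pod_fields` report it. A production listing would more likely validate against the full JSON schema than against a hand-maintained field list.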

However, to achieve FAIR (findable, accessible, interoperable, reusable) capabilities[5], the OAR infrastructure supporting a science data portal and public data repository was designed to extend the limited MVP to include standard open formats, protocols, and demonstrated best practices in data management and publication, in order to harness the full potential for community reuse of NIST research data products. The data portal provides both discovery and data access (distribution) capabilities through a science-oriented web user interface and REST (Representational State Transfer)[6] application programming interfaces (APIs). The repository enables interoperability for scientific disciplines such as crystallography, biology, and chemistry, as shown in the organizational context (Figure 1), by supporting programmatic access to semantically rich data structures captured through the NIST data publication process. Key to the reuse of these data is the implementation of data citation for each of the records, along with the inclusion of provenance metadata and a link to usage policy.

Figure 1. NIST OAR organizational context. NIST laboratory sites are in Gaithersburg, Maryland and Boulder, Colorado in the U.S., in addition to partner remote site locations as listed in the figure.
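Programmatic discovery through a REST API, as described above for the data portal, might look roughly like the following sketch. Everything specific here is an assumption made for illustration: the base URL, the query parameter names, and the shape of the JSON response (`results`, `title`, `doi`) are hypothetical and are not taken from the NIST portal's documentation. No network request is made.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; this is not the actual NIST portal URL.
BASE_URL = "https://portal.example.org/api/records"


def build_search_url(phrase, page=1, size=10, base_url=BASE_URL):
    """Build a GET URL for a free-text record search (parameter names assumed)."""
    query = urlencode({"searchphrase": phrase, "page": page, "size": size})
    return f"{base_url}?{query}"


def summarize(response_text):
    """Extract (title, doi) pairs from an assumed JSON response body."""
    payload = json.loads(response_text)
    return [(rec.get("title"), rec.get("doi")) for rec in payload.get("results", [])]
```

A client would fetch the URL returned by `build_search_url` with any HTTP library and pass the response body to `summarize`; keeping the DOI alongside the title reflects the role of data citation in reuse.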

Architecture

The OAR architecture is in large part consistent with the consolidated Federal Enterprise Architecture (FEA)[7] reference models. FEA systems are fundamentally designed to identify common assets and shared technologies through a combination of enterprise-class and open-source solutions to ensure long-term sustainability. In the case of OAR, FEA implementation was achieved through process models, data and logical workflows, application design, and host infrastructure, architected to work together in addressing stakeholder requirements. Iterative improvements in the OAR design (e.g., data review and usability features) have demonstrated that this model facilitates sustainability. Using agile methods, changes can be made independently to different aspects of the FEA to streamline and modernize functionality. One lesson from OAR maintenance is the risk associated with commercial off-the-shelf (COTS) enterprise solutions, namely high license costs that must be budgeted for and rigidity in functionality. Open-source platforms, by contrast, are demonstrating benefit in the broader community context by keeping pace with evolving technologies, especially in the areas of standard data semantic, syntactic, and schematic practices.

Figure 2 illustrates the high-level OAR application workflow for data publication. NIST researchers upload data products (files and metadata), which are generated from their laboratory information management systems (LIMS), to the OAR infrastructure via the NIST Management of Institutional Data Assets (MIDAS) tool. MIDAS also manages the data review process, as well as reporting and accountability for determining compliance with policies. Persistent identifiers are automatically assigned through a direct service interface to DataCite. Following approval from the review and curation processes, data are automatically preserved through a publishing service to the Public Data Repository (PDR) in a standard BagIt format.[8] The public repository datasets may subsequently be discovered through the Scientific Data Portal (SDP) on the NIST website. NIST has implemented the government-recommended cloud strategy as part of the OAR infrastructure, such that the OAR preserved datasets are hosted in a NIST Amazon Web Services (AWS) public enclave using the Amazon Simple Storage Service (S3), and data are additionally copied to AWS Glacier storage as a long-term “safestore.” The AWS Elastic Compute Cloud (EC2) server platform is used to host the repository and science data portal applications.
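The BagIt preservation step mentioned above can be illustrated with a minimal sketch using only the Python standard library. It assumes the basic layout of the BagIt specification (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest); the function names are hypothetical, and the sketch omits the additional metadata and profile conventions a real repository such as the PDR would apply.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir, payload):
    """Write a minimal bag; payload maps relative file names to bytes."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name in sorted(payload):
        content = payload[name]
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Each manifest line pairs a checksum with a path relative to the bag.
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}\n")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Recompute payload checksums and compare them with the manifest."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

Because the manifest travels with the payload, any copy of the bag (for instance one restored from long-term storage) can be checked for integrity with `verify_bag` without consulting the originating system.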

References

  1. "Long Tail of Data: e-IRG Task Force Report" (PDF). e-IRG Secretariat. September 2016. http://e-irg.eu/documents/10920/238968/LongTailOfData2016.pdf. Retrieved 29 January 2019. 
  2. White, R.L.; Accomazzi, A.; Berriman, G.B. et al. (2009). "The High Impact of Astronomical Data Archives". Astro2010: The Astronomy and Astrophysics Decadal Survey: 64. https://ui.adsabs.harvard.edu/abs/2009astro2010P..64W/abstract. 
  3. Burwell, S.M.; VanRoekel, S.; Park, T.; Mancini, D.J. (9 May 2013). "Open Data Policy—Managing Information as an Asset" (PDF). M-13-13 Memorandum for the Heads of Executive Departments and Agencies. https://obamawhitehouse.archives.gov/sites/default/files/omb/memoranda/2013/m-13-13.pdf. Retrieved 20 April 2019. 
  4. "Title II - Open Government Data Act". HR 4174: Foundations for Evidence-Based Policymaking Act of 2018. 115th Congress. 2018. https://www.congress.gov/bill/115th-congress/house-bill/4174/text#toc-H8E449FBAEFA34E45A6F1F20EFB13ED95. Retrieved 29 January 2019. 
  5. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J. et al. (2016). "The FAIR Guiding Principles for scientific data management and stewardship". Scientific Data 3: 160018. doi:10.1038/sdata.2016.18. PMC 4792175. PMID 26978244. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4792175. 
  6. Booth, D.; Haas, H.; McCabe, F. et al. (11 February 2004). "3.1.3 Relationship to the World Wide Web and REST Architectures". Web Services Architecture. W3C. https://www.w3.org/TR/2004/NOTE-ws-arch-20040211/#relwwwrest. 
  7. Office of Management and Budget (29 January 2013). "Federal Enterprise Architecture Framework - Version 2" (PDF). https://obamawhitehouse.archives.gov/sites/default/files/omb/assets/egov_docs/fea_v2.pdf. Retrieved 29 January 2019. 
  8. "The BagIt Packaging Standard for Interoperability and Preservation". Astronomical Data Analysis Software & Systems Conference 2018. University of Maryland. 12 November 2018. http://adass2018.umd.edu/abstracts/I11.1.html. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original article lists references alphabetically, but this version—by design—lists them in order of appearance. The original used Wikipedia as a source for "REST," which is frowned upon; we replaced it with the first citation of the Wikipedia entry.