Full article title: A data quality strategy to enable FAIR, programmatic access across large, diverse data collections for high performance data analysis
Journal: Informatics
Author(s): Evans, Ben; Druken, Kelsey; Wang, Jingbo; Yang, Rui; Richards, Clare; Wyborn, Lesley
Author affiliation(s): Australian National University
Primary contact: Email: Jingbo dot Wang at anu dot edu dot au
Editors: Ge, Mouzhi; Dohnal, Vlastislav
Year published: 2017
Volume and issue: 4(4)
Page(s): 45
DOI: 10.3390/informatics4040045
ISSN: 2227-9709
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2227-9709/4/4/45/htm
Download: http://www.mdpi.com/2227-9709/4/4/45/pdf (PDF)

Abstract

To ensure seamless, programmatic access to data for high-performance computing (HPC) and analysis across multiple research domains, it is vital to have a methodology for standardization of both data and services. At the Australian National Computational Infrastructure (NCI) we have developed a data quality strategy (DQS) that currently provides processes for: (1) consistency of data structures needed for a high-performance data (HPD) platform; (2) quality control (QC) through compliance with recognized community standards; (3) benchmarking cases of operational performance tests; and (4) quality assurance (QA) of data through demonstrated functionality and performance across common platforms, tools, and services. By implementing the NCI DQS, we have seen progressive improvement in the quality and usefulness of the datasets across different subject domains, and demonstrated the ease by which modern programmatic methods can be used to access the data, either in situ or via web services, and for uses ranging from traditional analysis methods through to emerging machine learning techniques. To help increase data re-usability by broader communities, particularly in high-performance environments, the DQS is also used to identify the need for any extensions to the relevant international standards for interoperability and/or programmatic access.

Keywords: data quality, quality control, quality assurance, benchmarks, performance, data management policy, netCDF, high-performance computing, HPC, FAIR data

Introduction

The National Computational Infrastructure (NCI) manages one of Australia’s largest and most diverse repositories (10+ petabytes) of research data collections, spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences.[1] Within these domains, data can be of different types such as gridded, ungridded (i.e., line surveys, point clouds), and raster image types, as well as having diverse coordinate reference projections and resolutions. NCI has been following the Force11 FAIR data principles to make data findable, accessible, interoperable, and reusable.[2] These principles provide guidelines for a research data repository to enable data-intensive science, and they help researchers answer questions such as whether they can trust the scientific quality of the data and whether the data is usable by their software platforms and tools.

To ensure broader reuse of the data and enable transdisciplinary integration across multiple domains, as well as to enable programmatic access, a dataset must be usable and of value to a broad range of users from different communities.[3] A set of standards and "best practices" for ensuring the quality of scientific data products is therefore a critical component of the data management life cycle. We undertake both QC through compliance with recognized community standards (e.g., checking file headers to make sure they comply with community convention standards) and QA of data through demonstrated functionality and performance across common platforms, tools, and services (e.g., verifying that the data works with designated software and libraries).
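
As a concrete illustration of the header-level QC check described above, the following is a minimal sketch in Python using the netCDF4 library. It inspects a file's global and per-variable attributes of the kind a community convention such as CF expects; the file name and the specific attribute lists are illustrative assumptions, not part of the NCI DQS.

  # Minimal header-check sketch (illustrative only): report convention-related
  # attributes missing from a netCDF file's header. The file name and attribute
  # lists below are assumptions for illustration, not NCI requirements.
  from netCDF4 import Dataset

  REQUIRED_GLOBAL_ATTRS = ["Conventions", "title", "institution", "source"]
  REQUIRED_VARIABLE_ATTRS = ["units", "long_name"]

  def check_header(path):
      """Return a list of missing global and per-variable attributes."""
      problems = []
      with Dataset(path, mode="r") as ds:
          for attr in REQUIRED_GLOBAL_ATTRS:
              if attr not in ds.ncattrs():
                  problems.append("missing global attribute: %s" % attr)
          for name, var in ds.variables.items():
              for attr in REQUIRED_VARIABLE_ATTRS:
                  if attr not in var.ncattrs():
                      problems.append("variable '%s' missing attribute: %s" % (name, attr))
      return problems

  if __name__ == "__main__":
      for issue in check_header("example_dataset.nc"):  # hypothetical file name
          print(issue)

A production QC workflow would more likely rely on an established community compliance checker than on an ad hoc script; the sketch only illustrates the kind of header-level inspection involved.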

The Earth Science Information Partners (ESIP) Information Quality Cluster (IQC) was established to collect such standards and best practices, to assist data producers in implementing them, and to help users take advantage of them.[4] ESIP considers four different aspects of information quality in close relation to the stages of a data product's four-stage life cycle[4]: (1) define, develop, and validate; (2) produce, access, and deliver; (3) maintain, preserve, and disseminate; and (4) enable use, provide support, and service.

Science teams or data producers are responsible for managing data quality during the first two stages, while data publishers are responsible for the latter two. As NCI is both a digital repository, which manages the storage and distribution of reference data for a range of users, and the provider of high-end compute and data analysis platforms, its data quality processes focus on the latter two stages. A check on scientific correctness is considered part of the first two stages and is not included in the definition of "data quality" described in this paper.

NCI's data quality strategy (DQS)

NCI developed a DQS to establish a level of assurance, and hence confidence, for our user community and key stakeholders as an integral part of service provision.[5] It is also a step on the pathway to meet the technical requirements of a trusted digital repository, such as the CoreTrustSeal certification.[6] As meeting these requirements involves the systematic application of agreed policies and procedures, our DQS provides a suite of guidelines, recommendations, and processes for: (1) consistency of data structures suitable for the underlying high-performance data (HPD) platform; (2) QC through compliance with recognized community standards; (3) benchmarking performance using operational test cases; and (4) QA through demonstrated functionality and benchmarking across common platforms, tools, and services.
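
To give a sense of item (3), the sketch below (again in Python with the netCDF4 library, and a hypothetical file and variable name) times a simple subset read of the kind an operational test case might record and compare across platforms, tools, or services; it is an illustrative example, not the NCI benchmark suite itself.

  # Benchmark sketch (illustrative only): time a subset read from a netCDF file.
  # The file name, variable name, and subset size are illustrative assumptions.
  import time
  from netCDF4 import Dataset

  def time_subset_read(path, varname, n_steps=10):
      """Return (elapsed seconds, shape) for reading the first n_steps slices."""
      with Dataset(path, mode="r") as ds:
          var = ds.variables[varname]
          start = time.perf_counter()
          data = var[:n_steps]  # read a leading subset along the first dimension
          elapsed = time.perf_counter() - start
      return elapsed, data.shape

  if __name__ == "__main__":
      seconds, shape = time_subset_read("example_dataset.nc", "tas")  # hypothetical names
      print("Read subset of shape %s in %.3f s" % (shape, seconds))

Timings like this, captured both for in situ access and for access via web services, can be compared over time and across platforms, which is the intent of the benchmarking step.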

NCI’s DQS was developed iteratively: first, by reviewing other approaches to managing data QC and QA (e.g., Ramapriyan et al.[4]) in order to establish the DQS methodology, and second, by applying that methodology to selected use cases at NCI that capture existing and emerging requirements, particularly those relating to HPC.

References

  1. Wang, J.; Evans, B.J.K.; Bastrakova, I. et al. (2014). "Large-Scale Data Collection Metadata Management at the National Computation Infrastructure". Proceedings from the American Geophysical Union, Fall Meeting 2014: IN14B-07. 
  2. "The FAIR Data Principles". Force11. https://www.force11.org/group/fairgroup/fairprinciples. Retrieved 23 August 2017. 
  3. Evans, B.J.K.; Wyborn, L.A.; Druken, K.A. et al. (2016). "Extending the Common Framework for Earth Observation Data to other Disciplinary Data and Programmatic Access". Proceedings from the American Geophysical Union, Fall General Assembly 2016: IN22A-05. 
  4. Ramapriyan, H.; Peng, G.; Moroni, D.; Shie, C.-L. (2017). "Ensuring and Improving Information Quality for Earth Science Data and Products". D-Lib Magazine 23 (7/8). doi:10.1045/july2017-ramapriyan. 
  5. Atkin, B.; Brooks, A. "Chapter 8: Service Specifications, Service Level Agreements and Performance". Total Facilities Management. Wiley. ISBN 9781405127905. 
  6. "Data Repositories Requirements". CoreTrustSeal. https://www.coretrustseal.org/why-certification/requirements/. Retrieved 24 October 2017. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.