Sandbox begins below

Full article title A data quality strategy to enable FAIR, programmatic access across large, diverse data collections for high performance data analysis
Journal Informatics
Author(s) Evans, Ben; Druken, Kelsey; Wang, Jingbo; Yang, Rui; Richards, Clare; Wyborn, Lesley
Author affiliation(s) Australian National University
Primary contact Email: Jingbo dot Wang at anu dot edu dot au
Editors Ge, Mouzhi; Dohnal, Vlastislav
Year published 2017
Volume and issue 4(4)
Page(s) 45
DOI 10.3390/informatics4040045
ISSN 2227-9709
Distribution license Creative Commons Attribution 4.0 International
Website http://www.mdpi.com/2227-9709/4/4/45/htm
Download http://www.mdpi.com/2227-9709/4/4/45/pdf (PDF)

Abstract

To ensure seamless, programmatic access to data for high-performance computing (HPC) and analysis across multiple research domains, it is vital to have a methodology for standardization of both data and services. At the Australian National Computational Infrastructure (NCI) we have developed a data quality strategy (DQS) that currently provides processes for: (1) consistency of data structures needed for a high-performance data (HPD) platform; (2) quality control (QC) through compliance with recognized community standards; (3) benchmarking cases of operational performance tests; and (4) quality assurance (QA) of data through demonstrated functionality and performance across common platforms, tools, and services. By implementing the NCI DQS, we have seen progressive improvement in the quality and usefulness of the datasets across different subject domains, and demonstrated the ease by which modern programmatic methods can be used to access the data, either in situ or via web services, and for uses ranging from traditional analysis methods through to emerging machine learning techniques. To help increase data re-usability by broader communities, particularly in high-performance environments, the DQS is also used to identify the need for any extensions to the relevant international standards for interoperability and/or programmatic access.

Keywords: data quality, quality control, quality assurance, benchmarks, performance, data management policy, netCDF, high-performance computing, HPC, FAIR data
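
As an illustration of the programmatic access described in the abstract, the following minimal sketch (in Python, using xarray) reads the same gridded netCDF dataset either in situ on a local filesystem or remotely via an OPeNDAP web service endpoint. The file name, URL, variable name, and dimension name are hypothetical placeholders for illustration only, not NCI's actual services or catalogue entries.

import xarray as xr

# In situ access: open a gridded netCDF file directly on the filesystem
# (hypothetical file name).
ds_local = xr.open_dataset("example_gridded.nc")

# Web service access: the same call accepts an OPeNDAP endpoint
# (hypothetical URL), so the analysis code itself is unchanged.
ds_remote = xr.open_dataset("https://example.org/thredds/dodsC/example_gridded.nc")

# Either handle exposes the same labelled, self-describing interface.
print(ds_local.data_vars)                      # variables in the dataset
subset = ds_local["temperature"].isel(time=0)  # hypothetical variable/dimension
print(subset.shape)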

Introduction

The National Computational Infrastructure (NCI) manages one of Australia’s largest and most diverse repositories (10+ petabytes) of research data collections, spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, bioinformatics, and the social sciences.[1] Within these domains, data can be of different types such as gridded, ungridded (i.e., line surveys, point clouds), and raster image types, as well as having diverse coordinate reference projections and resolutions. NCI has been following the FORCE11 FAIR data principles to make data findable, accessible, interoperable, and reusable.[2] These principles provide guidelines for a research data repository to enable data-intensive science, and they help researchers address questions such as how to trust the scientific quality of the data and whether the data are usable with their software platforms and tools.

To ensure broader reuse of the data and enable transdisciplinary integration across multiple domains, as well as to enable programmatic access, a dataset must be usable and of value to a broad range of users from different communities.[3] Therefore, a set of standards and "best practices" for ensuring the quality of scientific data products is a critical component of the data management life cycle. We undertake both QC through compliance with recognized community standards (e.g., checking file headers to ensure they comply with the relevant community convention standards) and QA of data through demonstrated functionality and performance across common platforms, tools, and services (e.g., verifying that the data function correctly with designated software and libraries).
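
A minimal sketch of what such QC and QA checks can look like in practice is given below (in Python, using the netCDF4 library). The file name and the list of required global attributes are illustrative assumptions, not NCI's actual DQS configuration; production checks would typically rely on full community compliance checkers (e.g., a CF conventions checker) rather than a hand-rolled test like this.

from netCDF4 import Dataset

REQUIRED_GLOBAL_ATTRS = ["title", "Conventions"]  # assumed minimal set, for illustration

def qc_header_check(path):
    """QC: confirm the file header declares a CF convention and carries
    the required global attributes."""
    problems = []
    with Dataset(path, mode="r") as ds:
        attrs = set(ds.ncattrs())
        for attr in REQUIRED_GLOBAL_ATTRS:
            if attr not in attrs:
                problems.append("missing global attribute: " + attr)
        conventions = getattr(ds, "Conventions", "")
        if "CF" not in conventions:
            problems.append("Conventions attribute does not declare CF: " + repr(conventions))
    return problems

def qa_read_check(path):
    """QA: confirm the data actually function with the designated library
    by touching each variable's metadata and reading one value."""
    with Dataset(path, mode="r") as ds:
        for name, var in ds.variables.items():
            _ = (var.dimensions, var.dtype)
            if var.size:
                _ = var[(0,) * var.ndim]  # read a single element
    return True

if __name__ == "__main__":
    path = "example_dataset.nc"  # hypothetical file
    print("QC issues:", qc_header_check(path) or "none")
    print("QA readable:", qa_read_check(path))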

The Earth Science Information Partners (ESIP) Information Quality Cluster (IQC) was established to collect such standards and best practices, to assist data producers in implementing them, and to help users take advantage of them.[4] ESIP considers four different aspects of information quality, each closely related to a stage of the four-stage data product life cycle[4]: (1) define, develop, and validate; (2) produce, access, and deliver; (3) maintain, preserve, and disseminate; and (4) enable use, provide support, and service.

Science teams or data producers are responsible for managing data quality during the first two stages, while data publishers are responsible for the latter two. As NCI is both a digital repository, managing the storage and distribution of reference data for a range of users, and a provider of high-end compute and data analysis platforms, its data quality processes focus on the latter two stages. Checking scientific correctness is considered part of the first two stages and is not included in the definition of "data quality" described in this paper.

References

  1. Wang, J.; Evans, B.J.K.; Bastrakova, I. et al. (2014). "Large-Scale Data Collection Metadata Management at the National Computation Infrastructure". Proceedings from the American Geophysical Union, Fall Meeting 2014: IN14B-07. 
  2. "The FAIR Data Principles". Force11. https://www.force11.org/group/fairgroup/fairprinciples. Retrieved 23 August 2017. 
  3. Evans, B.J.K.; Wyborn, L.A.; Druken, K.A. et al. (2016). "Extending the Common Framework for Earth Observation Data to other Disciplinary Data and Programmatic Access". Proceedings from the American Geophysical Union, Fall General Assembly 2016: IN22A-05. 
  4. Ramapriyan, H.; Peng, G.; Moroni, D.; Shie, C.-L. (2017). "Ensuring and Improving Information Quality for Earth Science Data and Products". D-Lib Magazine 23 (7/8). doi:10.1045/july2017-ramapriyan. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.