Difference between revisions of "User:Shawndouglas/sandbox/sublevel1"

<div class="nonumtoc">__TOC__</div>
{{Saved book
{{ombox
|title=Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research
| type      = notice
|subtitle=2020 Edition
| style    = width: 960px;
|cover-image=Vial of blood to be tested.jpg
| text      = This is sublevel2 of my sandbox, where I play with features and test MediaWiki code. If you wish to leave a comment for me, please see [[User_talk:Shawndouglas|my discussion page]] instead.<p></p>
|cover-color=#00FFFF
| setting-papersize = A4
| setting-showtoc = 1
| setting-columns = 1
}}
}}


==Sandbox begins below==
==''Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research'', 2020 Edition==
{{Infobox journal article
'''Title''': ''Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research'', 2020 Edition
|name        =
|image        =
|alt          = <!-- Alternative text for images -->
|caption      =
|title_full  = A data quality strategy to enable FAIR, programmatic access across large,<br />diverse data collections for high performance data analysis
|journal      = ''Informatics''
|authors      = Evans, Ben; Druken, Kelsey; Wang, Jingbo; Yang, Rui; Richards, Clare; Wyborn, Lesley
|affiliations = Australian National University
|contact      = Email: Jingbo dot Wang at anu dot edu dot au
|editors      = Ge, Mouzhi; Dohnal, Vlastislav
|pub_year    = 2017
|vol_iss      = '''4'''(4)
|pages        = 45
|doi          = [https://doi.org/10.3390/informatics4040045 10.3390/informatics4040045]
|issn        = 2227-9709
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]
|website      = [http://www.mdpi.com/2227-9709/4/4/45/htm http://www.mdpi.com/2227-9709/4/4/45/htm]
|download    = [http://www.mdpi.com/2227-9709/4/4/45/pdf http://www.mdpi.com/2227-9709/4/4/45/pdf] (PDF)
}}
{{ombox
| type      = content
| style    = width: 500px;
| text      = This article should not be considered complete until this message box has been removed. This is a work in progress.
}}
==Abstract==
To ensure seamless, programmatic access to data for high-performance computing (HPC) and [[Data analysis|analysis]] across multiple research domains, it is vital to have a methodology for standardization of both data and services. At the Australian National Computational Infrastructure (NCI) we have developed a data quality strategy (DQS) that currently provides processes for: (1) consistency of data structures needed for a high-performance data (HPD) platform; (2) [[quality control]] (QC) through compliance with recognized community standards; (3) benchmarking cases of operational performance tests; and (4) [[quality assurance]] (QA) of data through demonstrated functionality and performance across common platforms, tools, and services. By implementing the NCI DQS, we have seen progressive improvement in the quality and usefulness of the datasets across different subject domains, and demonstrated the ease by which modern programmatic methods can be used to access the data, either ''in situ'' or via web services, and for uses ranging from traditional analysis methods through to emerging machine learning techniques. To help increase data re-usability by broader communities, particularly in high-performance environments, the DQS is also used to identify the need for any extensions to the relevant international standards for interoperability and/or programmatic access.
 
'''Keywords''': data quality, quality control, quality assurance, benchmarks, performance, data management policy, netCDF, high-performance computing, HPC, FAIR data
 
==Introduction==
The National Computational Infrastructure (NCI) manages one of Australia’s largest and most diverse repositories (10+ petabytes) of research data collections spanning datasets from climate, coasts, oceans, and geophysics through to astronomy, [[bioinformatics]], and the social sciences.<ref name="WangLarge14">{{cite journal |title=Large-Scale Data Collection Metadata Management at the National Computation Infrastructure |journal=Proceedings from the American Geophysical Union, Fall Meeting 2014 |author=Wang, J.; Evans, B.J.K.; Bastrakova, I. et al. |pages=IN14B-07 |year=2014}}</ref> Within these domains, data can be of different types such as gridded, ungridded (e.g., line surveys, point clouds), and raster image types, and can have diverse coordinate reference projections and resolutions. NCI has been following the Force 11 FAIR data principles to make data findable, accessible, interoperable, and reusable.<ref name="F11FAIR">{{cite web |url=https://www.force11.org/group/fairgroup/fairprinciples |title=The FAIR Data Principles |publisher=Force11 |accessdate=23 August 2017}}</ref> These principles provide guidelines for a research data repository to enable data-intensive science, and help researchers answer questions such as how to trust the scientific quality of data and how to determine whether the data is usable by their software platform and tools.
 
To ensure broader reuse of the data and enable transdisciplinary integration across multiple domains, as well as to enable programmatic access, a dataset must be usable and of value to a broad range of users from different communities.<ref name="EvansExtend16">{{cite journal |title=Extending the Common Framework for Earth Observation Data to other Disciplinary Data and Programmatic Access |journal=Proceedings from the American Geophysical Union, Fall General Assembly 2016 |author=Evans, B.J.K.; Wyborn, L.A.; Druken, K.A. et al. |pages=IN22A-05 |year=2016}}</ref> Therefore, a set of standards and "best practices" for ensuring the quality of scientific data products is a critical component in the life cycle of data management. We undertake both QC through compliance with recognized community standards (e.g., checking that file headers comply with the relevant community convention) and QA of data through demonstrated functionality and performance across common platforms, tools, and services (e.g., verifying that the data functions with designated software and libraries).
 
The Earth Science Information Partners (ESIP) Information Quality Cluster (IQC) has been established for collecting such standards and best practices and then assisting data producers in their implementation, and users in their taking advantage of them.<ref name="RamapriyanEnsuring17">{{cite journal |title=Ensuring and Improving Information Quality for Earth Science Data and Products |journal=D-Lib Magazine |author=Ramapriyan, H.; Peng, G.; Moroni, D.; Shie, C.-L. |volume=23 |issue=7/8 |year=2017 |doi=10.1045/july2017-ramapriyan}}</ref> ESIP considers four different aspects of [[information]] quality in close relation to different stages of data products in their four-stage life cycle<ref name="RamapriyanEnsuring17" />: (1) define, develop, and validate; (2) produce, access, and deliver; (3) maintain, preserve, and disseminate; and (4) enable use, provide support, and service.
 
Science teams or data producers are responsible for managing data quality during the first two stages, while data publishers are responsible for the latter two stages. As NCI is both a digital repository, which manages the storage and distribution of reference data for a range of users, as well as the provider of high-end compute and data analysis platforms, the data quality processes are focused on the latter two stages. A check on the scientific correctness is considered to be part of the first two stages and is not included in the definition of "data quality" that is described in this paper.
 
==NCI's data quality strategy (DQS)==
NCI developed a DQS to establish a level of assurance, and hence confidence, for our user community and key stakeholders as an integral part of service provision.<ref name="AtkinTotal05">{{cite book |chapter=Chapter 8: Service Specifications, Service Level Agreements and Performance |title=Total Facilities Management |author=Atkin, B.; Brooks, A. |publisher=Wiley |isbn=9781405127905}}</ref> It is also a step on the pathway to meet the technical requirements of a trusted digital repository, such as the CoreTrustSeal certification.<ref name="CTSData">{{cite web |url=https://www.coretrustseal.org/why-certification/requirements/ |title=Data Repositories Requirements |publisher=CoreTrustSeal |accessdate=24 October 2017}}</ref> As meeting these requirements involves the systematic application of agreed policies and procedures, our DQS provides a suite of guidelines, recommendations, and processes for: (1) consistency of data structures suitable for the underlying high-performance data (HPD) platform; (2) QC through compliance with recognized community standards; (3) benchmarking performance using operational test cases; and (4) QA through demonstrated functionality and benchmarking across common platforms, tools, and services.
 
NCI’s DQS was developed iteratively through firstly a review of other approaches for management of data QC and data QA (e.g., Ramapriyan ''et al.''<ref name="RamapriyanEnsuring17" /> and Stall<ref name="StallAGU16">{{cite web |url=https://www.scidatacon.org/2016/sessions/100/ |title=AGU's Data Management Maturity Model |work=Auditing of Trustworthy Data Repositories |author=Stall, S.; Downs, R.R.; Kempler, S.J. |publisher=SciDataCon 2016 |date=2016}}</ref>) to establish the DQS methodology and secondly applying this to selected use cases at NCI which captured existing and emerging requirements, particularly the use cases that relate to HPC.
 
Our approach is consistent with the American Geophysical Union (AGU) Data Management Maturity (DMM)<sup>SM</sup> model<ref name="StallAGU16" /><ref name="StallTheAmerican16">{{cite journal |title=The American Geophysical Union Data Management Maturity Program |journal=Proceedings from the eResearch Australasia Conference 2016 |author=Stall, S.; Hanson, B.; Wyborn, L. |pages=72 |year=2016 |url=https://eresearchau.files.wordpress.com/2016/03/eresau2016_paper_72.pdf}}</ref>, which was developed in partnership with the Capability Maturity Model Integration (CMMI) Institute and adapted from their DMM<sup>SM</sup><ref name="CMMIDataMan">{{cite web |url=https://cmmiinstitute.com/store/data-management-maturity-(dmm) |title=Data Management Maturity (DMM) |publisher=CMMI Institute LLC}}</ref> model for applications in the Earth and space sciences. The AGU DMM<sup>SM</sup> model aims to provide guidance on how to improve data quality and consistency and facilitate reuse in the data life cycle. It enables both producers of data and repositories that store data to ensure that datasets are "fit-for-purpose," repeatable, and trustworthy. The Data Quality Process Areas in the AGU DMM<sup>SM</sup> model define a collaborative approach for receiving, assessing, cleansing, and curating data to ensure "fitness" for intended use in the scientific community.
 
After several iterations, the NCI DQS was established as part of the formal data publishing process and is applied throughout the cycle from submission of data to the NCI repository through to its final publication. The approach is also being adopted by the data producers, who now engage with the process from the preparation stage, prior to ingestion onto the NCI data platform. Early consultation and feedback have greatly improved both the quality of the data and the timeliness of publication. To improve efficiency further, one of our major data suppliers is including our DQS requirements in their data generation processes to ensure data quality is considered earlier in data production.
 
The technical requirements and implementation of our DQS will be described as four major but related data components: structure, QC, benchmarking, and QA.
 
===Data structure===
NCI's research data collections are particularly focused on enabling programmatic access, required by: (1) NCI core services such as the NCI supercomputer and NCI cloud-based capabilities; (2) community virtual [[Laboratory|laboratories]] and virtual research environments; (3) those that require remote access through established scientific standards-based protocols that use data services; and, (4) increasingly, by international data federations. To enable these different types of programmatic access, datasets must be registered in the central NCI catalogue<ref name="NCIDataPortal">{{cite web |url=https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/home |title=NCI Data Portal |publisher=National Computational Infrastructure}}</ref>, which records their location for access both on the filesystems and via data services.
 
This requires the data to be well-organized and compliant with uniform, professionally managed standards and consistent community conventions wherever possible. For example, the climate community Coupled Model Intercomparison Project (CMIP) experiments use the Data Reference Syntax (DRS)<ref name="TaylorCMIP12">{{cite web |url=https://pcmdi.llnl.gov/mips/cmip5/docs/cmip5_data_reference_syntax.pdf |format=PDF |title=CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies |author=Taylor, K.E.; Balaji, V.; Hankin, S. et al. |publisher=Program for Climate Model Diagnosis & Intercomparison |date=13 June 2012}}</ref>, whilst the National Aeronautics and Space Administration (NASA) recommends a specific naming convention for Landsat satellite image products.<ref name="USGSLandsat">{{cite web |url=https://landsat.usgs.gov/what-are-naming-conventions-landsat-scene-identifiers |title=What are the naming conventions for Landsat scene identifiers? |publisher=U.S. Geological Survey |accessdate=23 August 2017}}</ref> The NCI data collection catalogue manages the details of each dataset through a uniform application of ISO 19115:2003<ref name="ISO19115">{{cite web |url=https://www.iso.org/standard/53798.html |title=ISO 19115-1:2014 Geographic information -- Metadata -- Part 1: Fundamentals |publisher=International Organization for Standardization |date=April 2014 |accessdate=25 May 2016}}</ref>, an international schema used for describing geographic information and services. Essentially, each catalogue entry points to the location of the data within the NCI data infrastructure. The catalogue entries also point to the service endpoints, such as a standard data download point, a data subsetting interface, and Open Geospatial Consortium (OGC) Web Map Service (WMS) and Web Coverage Service (WCS) endpoints. NCI can publish data through several different servers, and as such the specific endpoint for each of these service capabilities is listed.
 
NCI has developed a catalogue and directory policy, which provides guidelines for the organization of datasets within the concepts of data collections and data sub-collections and includes a comprehensive definition for each hierarchical layer. The definitions are:
 
* A ''data collection'' is the highest level in the hierarchy of data groupings at NCI. It comprises either an exclusive grouping of data subcollections, or a tiered structure with an exclusive grouping of lower-tiered data collections, where the lowest-tier data collection contains only data subcollections.
 
* A ''data subcollection'' is an exclusive grouping of datasets (i.e., each dataset belongs to only one subcollection) where the constituent datasets are tightly managed. Responsibility for the underlying management of its constituent datasets must rest within one organization. A data subcollection constitutes a strong connection between the component datasets, and is organized coherently around a single scientific element (e.g., model, instrument). A subcollection must have compatible licenses such that constituent datasets do not need different access arrangements.
 
* A ''dataset'' is a compilation of data that constitutes a programmable data unit that has been collected and organized using a self-contained process. For this purpose it must have a named data owner; a single license; one set of semantics, ontologies, and vocabularies; and a single data format and internal data convention. A dataset must include its version.
 
* A ''dataset granule'' is used for some scientific domains that require a finer level of granularity (e.g., in satellite Earth Observation datasets). A granule refers to the smallest aggregation of data that can be independently described, inventoried, and retrieved as defined by NASA.<ref name="NASAGlossary">{{cite web |url=https://earthdata.nasa.gov/user-resources/glossary#ed-glossary-g |title=Granule |work=EarthData Glossary |accessdate=23 August 2017}}</ref> Dataset granules have their own metadata and support values associated with the additional attributes defined by parent datasets.
 
In addition we use the term "data category" to identify common contents/themes across all levels of the hierarchy.
 
* A ''data category'' allows a broad spectrum of options to encode relationships between data. A data category can be anything that weakly relates datasets, with the primary way of discovering the groupings within the data by key terms (e.g., keywords, attributes, vocabularies, ontologies). Datasets are not exclusive to a single category.
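
The hierarchy defined above can be illustrated with a small data model. The following is a minimal sketch only, not NCI's catalogue implementation: the class names, the <tt>parent</tt> field, and the version and license strings are hypothetical, and "compatible licenses" is simplified here to "identical license strings." The CMIP5 names are taken from the Figure 2 caption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Dataset:
    name: str
    version: str                      # a dataset must include its version
    license: str                      # a dataset has a single license
    granules: List[str] = field(default_factory=list)
    categories: Set[str] = field(default_factory=set)  # categories are non-exclusive
    parent: Optional[str] = None      # name of the one owning subcollection


@dataclass
class Subcollection:
    name: str
    datasets: List[Dataset] = field(default_factory=list)

    def add(self, dataset: Dataset) -> None:
        # Exclusive grouping: a dataset may belong to only one subcollection.
        if dataset.parent is not None:
            raise ValueError(f"{dataset.name} already belongs to {dataset.parent}")
        # Simplification (assumption): treat "compatible" as "identical".
        if self.datasets and dataset.license != self.datasets[0].license:
            raise ValueError("licenses within a subcollection must be compatible")
        dataset.parent = self.name
        self.datasets.append(dataset)


@dataclass
class Collection:
    name: str
    subcollections: List[Subcollection] = field(default_factory=list)


# Example based on the CMIP5 structure; version and license are placeholders.
pi_control = Dataset("piControl", version="v1", license="placeholder-license",
                     granules=["pr"], categories={"climate", "precipitation"})
access_1_0 = Subcollection("ACCESS-1.0")
access_1_0.add(pi_control)
cmip5 = Collection("CMIP5", subcollections=[access_1_0])
```

Attempting to add <tt>pi_control</tt> to a second subcollection raises <tt>ValueError</tt>, which mirrors the exclusivity rule, whereas the same dataset can carry any number of category labels.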
 
====Organization of data within the data structure====
NCI has organized data collections according to this hierarchical structure both on the filesystem and within our catalogue system. Figure 1 shows how these datasets are organized. Figure 2 uses the CMIP5 data collection as an example of the hierarchical directory structure.
 
 
[[File:Fig1 Evans Informatics2017 4-4.png|700px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="700px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' Illustration of the different levels of metadata and community standards used for each</blockquote>
|-
|}
|}
 
 
[[File:Fig2 Evans Informatics2017 4-4.jpg|550px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="550px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 2.''' Example schematic of the National Computational Infrastructure (NCI)’s data organizational structure using the Coupled Model Intercomparison Project (CMIP) 5 collection. The CMIP5 collection housed at NCI includes three sub-collections from the Commonwealth Scientific and Industrial Research Organisation (CSIRO) and Australian Bureau of Meteorology (BOM): (1) the ACCESS-1.0 model, (2) ACCESS-1.3 model, and (3) Mk 3.6.0 model. Each sub-collection then contains a number of datasets, such as “piControl” (pre-industrial control experiment), which then contains numerous granules (e.g., precipitation, “pr”). A complete description of the range of CMIP5 contents can be found at: https://pcmdi.llnl.gov/mips/cmip5/experiment_design.html.</blockquote>
|-
|}
|}
 
===Data QC===
Data QC measures are intended to ensure that all datasets hosted at NCI adhere, wherever possible, to existing community standards for metadata and data. For Network Common Data Form (netCDF) (and Hierarchical Data Format v5 (HDF5)-based) file formats, these include the Climate and Forecast (CF) Convention<ref name="LLNLCFConv">{{cite web |url=http://cfconventions.org/ |title=CF Conventions and Metadata |publisher=Lawrence Livermore National Laboratory |accessdate=23 August 2017}}</ref> and the Attribute Convention for Data Discovery<ref name="ESIPAttri">{{cite web |url=http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_(ACDD) |title=Attribute Convention for Data Discovery 1-3 |publisher=Federation of Earth Science Information Partners |accessdate=23 August 2017}}</ref> (see Table 1).
 
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="3"|'''Table 1.''' The NCI Quality Control (QC) mandatory requirements. A full list of the Attribute Convention for Data Discovery (ACDD) metadata requirements used by NCI is provided in Appendix A.
|-
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Convention/Standard
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|NCI Requirements
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Further Information
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|CF
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Mandatory CF criteria, e.g., no “errors” result from any of the recommended compliance checkers
  | style="background-color:white; padding-left:10px; padding-right:10px;"|[http://cfconventions.org http://cfconventions.org]
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|ACDD (Modified version)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Required attributes are included within each file: 1. title, 2. summary, 3. source, 4. date_created
  | style="background-color:white; padding-left:10px; padding-right:10px;"|[http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3 http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery_1-3]
|-
|}
|}
 
====Climate and forecast (CF) convention====
NCI requires that all geospatial datasets meet the minimum mandatory CF convention metadata criteria at the time of publication, and, where scientifically applicable, we require they meet the relevant recommended CF criteria. These requirements are detailed in the latest CF convention document provided on their website.<ref name="LLNLCFConv" />
 
The CF convention is the primary community standard for netCDF data, which was originally developed by the climate community and is now being adapted for other domains, e.g., marine and geosciences. It defines metadata requirements for information on each variable contained within the file as well as spatial and temporal properties of the data, so that contents are fully “self-described.” That is, no additional companion files or external sources are required to describe how to read or utilize the data contents within the file. The metadata requirements also provide important guidelines on how to structure spatial data. This includes recommendations on the order of dimensions, the handling of gridded and non-gridded (time series, point, and trajectory) data, coordinate reference system descriptions, standardized units, and cell measures (i.e., information relating to the size, shape, or location of grid cells). CF requires that all metadata information be equally readable and understandable by humans and software, which has the benefit of allowing software tools to easily display and dynamically perform associated operations.
 
====Attribute Convention for Data Discovery (ACDD)====
The ACDD is another common standard for netCDF data that complements the CF convention requirements.<ref name="ESIPAttri" /> The ACDD primarily governs metadata information written at the file level (i.e., netCDF global attributes), while the CF convention pertains mainly to variable-level metadata and structure information. Therefore, when combined, these two standards help to fully describe both the higher-level metadata relevant to the entire file (e.g., dataset title, custodian, date created, etc.) and the lower-level information about each individual variable or dimension (e.g., name, units, bounds, fill values, etc.). ACDD also provides the ability to link to even higher levels such as the dataset parent and grandparent ISO 19115 metadata entries.
 
NCI has applied this convention, along with CF, as part of our data QC, as summarized in Table 1. As the ACDD has no “required” fields in its current specification, NCI has applied a modified version that requires all published datasets to include a minimum of four ACDD catalogue metadata fields at the time of publication. These are “title,” “summary,” “source,” and “date_created,” which have been ranked as “required” to aid NCI’s data services and data discovery. A complete list of ACDD metadata attributes and NCI requirements is available in Appendix A.
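
As an illustration of the modified ACDD requirement, the check for the four required global attributes can be sketched in a few lines. This is an assumption-laden example rather than NCI's actual QC tooling: the function name and the sample attribute values are hypothetical, and the file's global attributes are assumed to have already been read into a dictionary.

```python
# The four global attributes that NCI ranks as "required" (see Table 1).
REQUIRED_ACDD_ATTRIBUTES = ("title", "summary", "source", "date_created")


def missing_required_attributes(global_attrs):
    """Return the required attribute names that are absent or empty."""
    missing = []
    for name in REQUIRED_ACDD_ATTRIBUTES:
        value = global_attrs.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing


# Hypothetical global attributes from a file that is not yet compliant.
example_attrs = {
    "title": "Example precipitation dataset",
    "summary": "Illustrative values only",
    "source": "",
}
example_missing = missing_required_attributes(example_attrs)
print(example_missing)  # → ['source', 'date_created']
```

With the netCDF4 Python package, such a dictionary could be built from an open file using the <tt>ncattrs()</tt> and <tt>getncattr()</tt> methods of its <tt>Dataset</tt> object; that step is omitted here so the sketch runs without external libraries.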
 
===Benchmarking methodology===
Any reference datasets made available on NCI must be well organized and accessible in a form suitable for the known class of users. Datasets also need to be more broadly available to other users from different domains, with the expectation that the collection will continue to have long-term and enduring value not just to the research community but also to others (e.g., government, general public, industry). To ensure that these expectations are clearly understood across the range of use-cases and environments, NCI has adopted a benchmarking methodology as part of its DQS process. Benchmarks register the functionality and performance of the datasets, which helps to define expectations around data accessibility and provide an effective, defined measure of usability.
 
To substantiate this, NCI works with both the data producers and the users to establish benchmarks for specific areas, which are then included as part of the registry of data QA measures. These tests are then verified by both NCI and by wider community representatives to ensure that the benchmark is appropriate for the requested access. The benchmark methodology also provides a way to systematically consider how current users will be affected when considering any future developments or evolution in technology, standards, or reorganization of data. The benchmark cases then substantiate the original intention, and they can be reviewed against any subsequent changes. For example, benchmark cases that were previously specified to use data in a particular format may have been updated to use an alternative, more acceptable format that is better for use in high-performance environments or improves accessibility across multiple domains. The original benchmark cases can then be re-evaluated against both the functionality and performance required to assess how to make such a transformation. Further, if there are any upgrades or changes to the production services, the benchmark cases are used to perform prerelease tests on the data servers before implementing the changes into production.
 
The benchmarks consist of explicit current examples using tools, libraries, services, packages, software, and processes that are executed at NCI. These benchmarks explore the required access and identify supporting standards that are critical to the utility of the service, whether access be through the filesystem or by API protocols provided by NCI data services. Where benchmarks are shown to be beyond the capability of the current data service, the benchmark case will be recorded for future application.
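
The idea of a benchmark that registers both functionality and performance can be sketched as follows. This is a minimal illustration under stated assumptions, not NCI's benchmark registry: the class names, the <tt>max_seconds</tt> limit, and the stand-in actions are all hypothetical, and a real case would call the actual tool, library, or data service being tested.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    name: str
    action: Callable[[], object]   # the access pattern being exercised
    max_seconds: float             # agreed performance expectation


@dataclass
class BenchmarkResult:
    name: str
    functional: bool               # did the access complete without error?
    elapsed_seconds: float
    within_limit: bool             # functional and fast enough


def run_benchmark(case: BenchmarkCase) -> BenchmarkResult:
    start = time.perf_counter()
    try:
        case.action()
        functional = True
    except Exception:
        functional = False
    elapsed = time.perf_counter() - start
    within_limit = functional and elapsed <= case.max_seconds
    return BenchmarkResult(case.name, functional, elapsed, within_limit)


def _unreadable():
    # Stand-in for an access that fails, e.g., an unreadable file.
    raise OSError("file not readable")


ok_result = run_benchmark(
    BenchmarkCase("stand-in read", lambda: sum(range(1000)), max_seconds=10.0))
failed_result = run_benchmark(
    BenchmarkCase("stand-in failure", _unreadable, max_seconds=10.0))
```

Keeping the recorded results alongside each case is what would allow the same cases to be re-run, and compared, after a change in data format or an upgrade of the production services.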
 
Furthermore, the results of the testing of each benchmark are reviewed with the data producer in light of any issues raised. This may require action by the user to revise the access pattern and/or by the data producer to modify the data to ensure that the reliability of NCI’s production service is not compromised. Alternatively, NCI may be able to provide a temporary separate service to accommodate some aspects of the usage pattern. For example, the data might be released via a modified server that can address shortcomings of a specific benchmark case but would not be applicable generally. This may be a short-term measure until a better server solution is found, or it may address current local issues on either the data or client application side.
 
===Data QA===
To ensure that the data is usable across a range of use-cases and environments, the QA approach uses benchmarks for testing data located on the local filesystem, as well as remotely via the data service endpoints. The QA process is designed to verify that software and libraries used are functioning properly with the most commonly used tools in the community.
 
The following is a list of data services that are available under NCI’s Unidata Thematic Real-time Environmental Distributed Data Services (THREDDS):
 
* Open-source Project for a Network Data Access Protocol (OPeNDAP): a protocol enabling data access and subsetting through the web;
* NetCDF Subset Service (NCSS): web service for subsetting files that can be read by the netCDF-Java library;
* WMS: OGC web service for requesting raster images of data;
* WCS: OGC web service for requesting data in some output format;
* Godiva2 Data Viewer: tool for simple visualization of data; and
* HTTP File Download: for direct downloading of data.
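
To make the relationship between one dataset and its several service endpoints concrete, the following sketch builds endpoint URLs for the services listed above. It assumes the URL path prefixes commonly used by a default THREDDS Data Server configuration (<tt>dodsC</tt>, <tt>ncss</tt>, <tt>wms</tt>, <tt>wcs</tt>, <tt>fileServer</tt>); an actual deployment may differ, and the host name and dataset path shown are placeholders rather than real NCI addresses.

```python
from urllib.parse import urlencode

# Assumed default THREDDS path prefixes; a given server may be configured differently.
SERVICE_PREFIXES = {
    "opendap": "dodsC",
    "ncss": "ncss",
    "wms": "wms",
    "wcs": "wcs",
    "http": "fileServer",
}


def endpoint_url(base_url, service, dataset_path, params=None):
    """Build the URL of one service endpoint for one dataset."""
    prefix = SERVICE_PREFIXES[service]
    url = f"{base_url.rstrip('/')}/{prefix}/{dataset_path.lstrip('/')}"
    if params:
        url += "?" + urlencode(params)
    return url


BASE = "https://thredds.example.org/thredds"   # placeholder host
PATH = "example/collection/file.nc"            # placeholder dataset path

opendap_url = endpoint_url(BASE, "opendap", PATH)
wms_capabilities_url = endpoint_url(
    BASE, "wms", PATH,
    {"service": "WMS", "version": "1.3.0", "request": "GetCapabilities"},
)
```

The WMS URL produced here is the kind of <tt>GetCapabilities</tt> request that a browser or client would issue before constructing a <tt>GetMap</tt> request.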
 
The data is tested through each of the required services as part of the QA process, with the basic usability functionality tests applied to each service as shown in Table 2. Should an issue be discovered during these functionality tests, the issue is investigated further. This may lead to additional modifications of the data so as to pass the functionality or performance requirements, and in doing so requires further communication with the data producer to ensure that such changes are acceptable and can be corrected in any future data production process. More detailed functionality can also be recorded for scientific use around the data. Such tests tend to be specific for the data use-case but follow the same methodology as that described here.
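
The command-line checks listed in Table 2 lend themselves to simple scripting. The sketch below is one possible way to run them, not the procedure NCI uses: the function names are hypothetical, only the four single-command tools are included, and a zero exit status is used as a coarse proxy for "the file is readable" (Table 2 also expects the displayed header information to be inspected).

```python
import shutil
import subprocess


def qa_commands(path):
    """Map each tool to the command used for its basic readability test."""
    return {
        "netCDF C-Library": ["ncdump", "-h", path],
        "GDAL": ["gdalinfo", path],
        "NCO": ["ncks", "-m", path],
        "CDO": ["cdo", "sinfon", path],
    }


def run_available_checks(path):
    """Run each check whose tool is installed.

    Returns a dict of tool name -> True (exit status 0), False (non-zero
    exit status), or None (tool not found on this system).
    """
    results = {}
    for tool, command in qa_commands(path).items():
        if shutil.which(command[0]) is None:
            results[tool] = None
            continue
        completed = subprocess.run(command, capture_output=True, text=True)
        results[tool] = completed.returncode == 0
    return results
```

For example, <tt>run_available_checks("file.nc")</tt> returns one entry per tool, so missing tools are reported separately from failed reads.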


'''Author for citation''': Shawn E. Douglas and Alan Vaughan


'''License for content''': [https://creativecommons.org/licenses/by-sa/4.0/ Creative Commons Attribution-ShareAlike 4.0 International]

{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 2.''' Description of basic accessibility and functionality tests that are applied for commonly used tools as part of NCI’s QA tests
|-
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Test
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Measures of Success
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|netCDF C-Library
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using the <tt><nowiki>ncdump -h <file></nowiki></tt> function from command line, the file is readable and displays the file header information about the file dimensions, variables, and metadata.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|GDAL
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using the <tt><nowiki>gdalinfo <file></nowiki></tt> function from command line, the file is readable and displays the file header information about the file dimensions, variables, and metadata.<br />Using the <tt><nowiki>gdalinfo NETCDF:<file>:<subdataset></nowiki></tt> function from command line, the subdatasets are readable and corresponding metadata for each subdataset is displayed.<br />The <tt>Open</tt> and <tt>GetMetadata</tt> functions return non-empty values that correspond to the netCDF file contents.<br />The <tt>GetProjection</tt> function (of the appropriate file or subdataset) returns a non-empty result corresponding to the data coordinate reference system information.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|NCO (NetCDF Operators)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using the <tt><nowiki>ncks -m <file></nowiki></tt> function from command line, the file is readable and displays file metadata.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|CDO (Climate Data Operators)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using the <tt><nowiki>cdo sinfon <file></nowiki></tt> function from command line, the file is readable and displays information on the included variables, grids, and coordinates.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Ferret
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using <tt><nowiki>SET DATA "<file>"</nowiki></tt> followed by <tt>SHOW DATA</tt> displays information on file contents.<br />Using <tt><nowiki>SET DATA "<file>"</nowiki></tt> followed by <tt><nowiki>SHADE <variable></nowiki></tt> (or another plotting command) produces a plot of the requested data.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|THREDDS Data Server
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Dataset index catalog page loads without timeout and within reasonable time expectations (<10 s).
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|THREDDS Data Service Endpoints
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''HTTP Download''': File download commences when selecting the HTTPServer option from the THREDDS catalog page for the file.<br />'''OPeNDAP''': When selecting OPeNDAP from the THREDDS catalog page for the file, the OPeNDAP Dataset Access Form page loads without error. From the OPeNDAP Dataset Access Form page, a data subset is returned in ASCII format after selecting data and clicking the Get ASCII option at the top of the page.<br />'''Godiva2''': When selecting the Godiva2 viewer option from the THREDDS catalog page for the file, the viewer displays the file contents.<br />'''WMS''': When selecting the WMS option from the THREDDS catalog page for the file, the web browser displays the GetCapabilities information in XML format. After constructing a GetMap request, the web browser displays the corresponding map.<br />'''WCS''': When selecting the WCS option from the THREDDS catalog page for the file, the web browser displays the GetCapabilities information in XML format. After constructing a GetCoverage request, file download of coverage commences.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Panoply
  | style="background-color:white; padding-left:10px; padding-right:10px;"|From the File → Open menu, the file can be opened. File contents and metadata are displayed.<br />Using Create Plot for a selected variable, data is displayed correctly in a new plot window.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|QGIS
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Using the Add WMS/WMTS menu option, QGIS can request GetCapabilities and/or GetMap operations, and the layer is visible.<br />When the ncWMS GetCapabilities URL of the NCI THREDDS Server is added, it is accepted, the request displays the available layers to select from, and a selected layer displays according to user expectations.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|NASA Web WorldWind
  | style="background-color:white; padding-left:10px; padding-right:10px;"|When the ncWMS GetCapabilities URL of the NCI THREDDS Server is added, it is accepted, the request displays the available layers to select from, and a selected layer displays according to user expectations.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|PYTHON cdms2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|The file can be opened by the <tt>Open</tt> function.<br />File metadata is displayed using <tt>Attributes</tt> function.<br />File data contents are displayed when using <tt>Variables</tt> function.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|PYTHON netCDF4
  | style="background-color:white; padding-left:10px; padding-right:10px;"|The file can be opened by the <tt>Dataset</tt> function.<br />File metadata is displayed using the <tt>ncattrs</tt> method.<br />File data contents are displayed using the <tt>variables</tt> (and/or <tt>groups</tt>) attributes.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|PYTHON h5py
  | style="background-color:white; padding-left:10px; padding-right:10px;"|The netCDF file can be opened by the <tt>File</tt> function.<br />The variables and metadata are displayed by the <tt>keys</tt> and <tt>attrs</tt> objects, respectively.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|ParaView
  | style="background-color:white; padding-left:10px; padding-right:10px;"|From the File → Open menu, the file can be opened and displayed as a layer in the Pipeline Browser. Enabling layer visibility results in data displaying in the Layout window.
|-
|}
|}
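The command-line checks in the table above could be automated along the following lines. This is only a minimal sketch, not NCI's test harness: the helper names <tt>smoke_test</tt> and <tt>header_checks</tt> are hypothetical, and it assumes each tool exits with status 0 and writes its header information to standard output when the file is readable.

```python
import shutil
import subprocess


def smoke_test(command, timeout=60):
    """Run one command and classify the outcome as 'pass', 'fail', or 'skipped'.

    Assumption: success means exit status 0 plus some text on standard output.
    """
    if shutil.which(command[0]) is None:
        return "skipped"  # the tool is not installed on this machine
    completed = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    has_output = completed.stdout.strip() != ""
    return "pass" if completed.returncode == 0 and has_output else "fail"


def header_checks(path):
    """Commands taken from the table above, keyed by a short tool label."""
    return {
        "netCDF C-Library": ["ncdump", "-h", path],
        "GDAL": ["gdalinfo", path],
        "NCO": ["ncks", "-m", path],
        "CDO": ["cdo", "sinfon", path],
    }


def run_header_checks(path):
    """Run every header check against one file and collect the outcomes."""
    return {tool: smoke_test(cmd) for tool, cmd in header_checks(path).items()}
```

Calling <tt>run_header_checks("example.nc")</tt> (a placeholder file name) returns one outcome per tool, so a missing tool is reported as skipped rather than as a failure of the file.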


==Examples of tests and reports undertaken on NCI datasets prior to publication==
'''Publication date''': TBD
===Metadata QC checker reports===
To assess CF and ACDD compliance, NCI runs a QC checker prior to data publication and works with the data producer to rectify problems. The NCI checker is based on the U.S. Integrated Ocean Observing System (IOOS) Compliance Checker<ref name="IOOSCompliance">{{cite web |url=https://github.com/ioos/compliance-checker |title=ioos/compliance-checker |publisher=GitHub |accessdate=22 November 2017}}</ref> but has been modified to include additional checks relevant to NCI’s data services as well as the modified ACDD convention. Appendix B shows an example QC checker report (Figure A1) with metadata that is 100% compliant with NCI’s requirements. In practice, the process usually needs to be run several times: the datasets are checked, feedback is given, and the checker is re-run, with each run recorded against the timestamp of that version to keep a record of metadata update provenance. The reports are shared with the data producers, with comments and additional feedback provided in the “high/medium/low-priority suggestions” section at the end of the report, depending on the potential impact of non-compliance.


Due to the large number of data files that can be involved, NCI’s QC checker has been modified to enable parallelization so that multiple processes can be run simultaneously, thus increasing performance of the checking process. For instance, it takes less than a minute to check hundreds of files, and about 10 minutes to check tens of thousands of files. For the largest datasets, the QC checker can typically run on more than one million files at a time.
Insert description of book here.


The QC checker also helps to find corrupted or temporary files, which can be easily overlooked or not detected by the data producers, especially during a batch production process.
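The idea of checking many files concurrently and flagging temporary or corrupted ones can be illustrated with a short sketch. It is not NCI's QC checker: the function names, the list of temporary-file suffixes, and the use of a thread pool (rather than the multiple processes described above) are all assumptions made to keep the example self-contained. The only check performed is whether a file begins with a netCDF classic or HDF5 (netCDF-4) signature.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# netCDF-4 files are HDF5 containers; this is the HDF5 signature at offset 0.
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"
# Signatures of the classic, 64-bit offset, and CDF-5 netCDF formats.
CLASSIC_MAGICS = (b"CDF\x01", b"CDF\x02", b"CDF\x05")
# Hypothetical naming convention for leftover temporary files.
TEMP_SUFFIXES = (".tmp", ".part", "~")


def check_file(path):
    """Return (path, problem); problem is None when the file looks sane."""
    if os.path.basename(path).endswith(TEMP_SUFFIXES):
        return path, "temporary file"
    with open(path, "rb") as handle:
        header = handle.read(8)
    if not header:
        return path, "empty file"
    if header == HDF5_MAGIC or header[:4] in CLASSIC_MAGICS:
        return path, None
    return path, "unrecognized header"


def check_many(paths, workers=8):
    """Check files concurrently and return only the ones with problems."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(check_file, paths))
    return {path: problem for path, problem in results if problem is not None}
```

A thread pool suits this I/O-bound check; for CPU-heavy metadata validation, <tt>ProcessPoolExecutor</tt> from the same module could be swapped in to run separate processes.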
:[[User:Shawndouglas/sandbox/sublevel2|1. Introduction to medical diagnostics and research laboratories]]
::1.1 Medical diagnostics lab
:::1.1.1 Pathology
::::1.1.1.1 Anatomical vs. clinical pathology
::::1.1.1.2 Forensic pathology
:::1.1.2 Physician office lab
:::1.1.3 Integrative medicine lab
::1.2 Public health lab
::1.3 Toxicology lab
::1.4 Blood bank and transfusion lab
::1.5 Central and contract research lab
:::1.5.1 Medical and other research in academia
::1.6 Genetic diagnostics lab
:::1.6.1 Cytogenetics lab
::1.7 Medical cannabis testing lab


===Functionality test QA reports===
:[[User:Shawndouglas/sandbox/sublevel3|2. Choosing laboratory informatics software for your lab]]
Appendix B provides an example report (Figure A2) of the QA results from checking three data files when accessed directly on the filesystem and through their service endpoints via THREDDS. The functionality test shows that the variable structure within the data of two files (2 GB and 4 GB) is too large for the files to be loaded into several commonly used data viewers, such as ncview (v2.1.1) and Panoply (v4.5.1), and that similar issues occur when opening the files through the service endpoints. In this case, our advice for mitigation is to reduce the requested size of the image by using a lower resolution or to work ''in situ'' with this particular data file, as recorded in the comments of Figure A2, sections b and c.
::2.1 Evaluation and selection
:::2.1.1 Technology considerations
::::2.1.1.1 Laboratory informatics options
:::2.1.2 Features and functions
:::2.1.3 Cybersecurity considerations
:::2.1.4 Regulatory compliance considerations
:::2.1.5 Cost considerations
::2.2 Implementation
:::2.2.1 Internal and external integrations
::2.3 MSW, updates, and other contracted services
::2.4 How a user requirements specification fits into the entire process (LIMSpec)


===Benchmarking use cases===
:3. Additional resources for selecting and implementing informatics solutions
In the benchmark tests, several popular tools and APIs are run to measure the elapsed time for accessing data, whether it resides on the local filesystem or is accessed via data services. The test files in the example NCI functionality QA test report (Figure A2) are used in the benchmark tests, and their data structures are listed in Table 3. We access the 2D variable in each file, which is recorded at (lat, lon), chunked at (128,128), and deflated at level 2.
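Elapsed-time measurements of this kind can be sketched with a small timing helper. The helper name <tt>time_access</tt> and the choice of three repeats are assumptions for illustration; the source does not state how its timings were collected.

```python
import time


def time_access(read, repeats=3):
    """Call read() several times and return each wall-clock duration in seconds.

    `read` is any zero-argument callable that performs the data access being
    benchmarked, for example opening a file and reading one variable.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        read()
        timings.append(time.perf_counter() - start)
    return timings
```

Returning every timing rather than a single number keeps the first (possibly uncached) access visible next to the later ones, since repeated reads of the same file may be served from the filesystem cache.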
::[[User:Shawndouglas/sandbox/sublevel4|Part 1: Laboratory informatics vendors]]
::3.1 Laboratory informatics vendors
:::3.1.1 LIMS vendors
:::3.1.2 LIS vendors
:::3.1.3 ELN vendors
:::3.1.4 Middleware vendors
::[[User:Shawndouglas/sandbox/sublevel5|Part 2: Other vendors and service providers]]
::3.2 Medical diagnostics instrumentation vendors
::3.3 EHR vendors
::3.4 Laboratory business intelligence and workflow solution vendors
::3.5 Laboratory billing service providers
::[[User:Shawndouglas/sandbox/sublevel6|Part 3: Industry and community resources]]
::3.6 Trade organizations
::3.7 Conferences and trade shows
::3.8 User communities
::3.9 Books and journals
::3.10 Standards
::3.11 LIMSpec


{|
:[[User:Shawndouglas/sandbox/sublevel37|4. Taking the next step]]
| STYLE="vertical-align:top;"|
::[https://www.lablynxpress.com/index.php?title=4.1_Develop_a_specification_document_(LIMSpec)_tailored_to_your_lab%27s_needs 4.1 Develop a specification document (LIMSpec) tailored to your lab's needs]
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="80%"
::[https://www.lablynxpress.com/index.php?title=4.2_Issue_the_specification_as_a_request_for_information_(RFI) 4.2 Issue the specification as a request for information (RFI)]
|-
::[https://www.lablynxpress.com/index.php?title=4.3_Acquire_information_and_proposals_from_vendors 4.3 Acquire information and proposals from vendors]
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="5"|'''Table 3.''' Data structure of the sample files used in the benchmark tests
:::[https://www.lablynxpress.com/index.php?title=4.3.1_The_value_of_demonstrations 4.3.1 The value of demonstrations]
|-
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;" colspan="2"|Attributes
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|File 1
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|File 2
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|File 3
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" rowspan="2"|lon (double)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Size
  | style="background-color:white; padding-left:10px; padding-right:10px;"|5717
  | style="background-color:white; padding-left:10px; padding-right:10px;"|59501
  | style="background-color:white; padding-left:10px; padding-right:10px;"|40954
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Chunksize
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" rowspan="2"|lat (double)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Size
  | style="background-color:white; padding-left:10px; padding-right:10px;"|4182
  | style="background-color:white; padding-left:10px; padding-right:10px;"|41882
  | style="background-color:white; padding-left:10px; padding-right:10px;"|34761
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Chunksize
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
  | style="background-color:white; padding-left:10px; padding-right:10px;"|128
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" rowspan="3"|Variable(float)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Name
  | style="background-color:white; padding-left:10px; padding-right:10px;"|grav_ir_anomaly
  | style="background-color:white; padding-left:10px; padding-right:10px;"|mag_tmi_rtp_anomaly
  | style="background-color:white; padding-left:10px; padding-right:10px;"|rad_air_dose_rate
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Size
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(4182,5717)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(41882,59501)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(34761,40954)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Chunksize
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(128,128)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(128,128)
  | style="background-color:white; padding-left:10px; padding-right:10px;"|(128,128)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Deflate Level
  | style="background-color:white; padding-left:10px; padding-right:10px;"|
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Format
  | style="background-color:white; padding-left:10px; padding-right:10px;"|
  | style="background-color:white; padding-left:10px; padding-right:10px;"|netCDF-4 classic model
  | style="background-color:white; padding-left:10px; padding-right:10px;"|netCDF-4 classic model
  | style="background-color:white; padding-left:10px; padding-right:10px;"|netCDF-4 classic model
|-
|}
|}
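The chunk layout implied by Table 3 can be worked out with simple arithmetic. The sketch below uses only the shapes and chunk sizes from the table; the function names are illustrative, and the byte counts assume 4 bytes per value (a 32-bit float) before deflation.

```python
# Variable shapes (lat, lon) copied from Table 3.
SHAPES = {
    "File 1": (4182, 5717),
    "File 2": (41882, 59501),
    "File 3": (34761, 40954),
}
CHUNK = (128, 128)   # chunk shape from Table 3
BYTES_PER_VALUE = 4  # assumption: float variables are stored as 32-bit floats


def chunk_grid(shape, chunk=CHUNK):
    """Chunks along each dimension, counting partial chunks at the edges."""
    return tuple(-(-size // step) for size, step in zip(shape, chunk))


def chunk_count(shape, chunk=CHUNK):
    """Total number of chunks needed to cover the whole 2D variable."""
    rows, cols = chunk_grid(shape, chunk)
    return rows * cols


def uncompressed_bytes(shape):
    """Size of the variable before deflation."""
    return shape[0] * shape[1] * BYTES_PER_VALUE


if __name__ == "__main__":
    for name, shape in SHAPES.items():
        print(name, chunk_grid(shape), chunk_count(shape), uncompressed_bytes(shape))
```

Each full (128,128) chunk holds 128 × 128 × 4 = 65,536 bytes before compression, so reading a whole variable touches between roughly 1,500 chunks (File 1) and more than 150,000 chunks (File 2).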


==References==
:[[User:Shawndouglas/sandbox/sublevel38|5. Closing remarks]]
{{Reflist|colwidth=30em}}


==Notes==
: Appendix 1. Blank LIMSpec template for medical diagnostics and research labs
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Several URLs from the original were dead, and more current URLs were substituted.
::[[LII:LIMSpec/Introduction and methodology|A1.1 Introduction and methodology]]
::[[LII:LIMSpec/Primary Laboratory Workflow|A1.2 Primary Laboratory Workflow]]
::[[LII:LIMSpec/Maintaining Laboratory Workflow and Operations|A1.3 Maintaining Laboratory Workflow and Operations]]
::[[User:Shawndouglas/sandbox/sublevel39|A1.4 Specialty Laboratory Functions]]
::[[LII:LIMSpec/Technology and Performance Improvements|A1.5 Technology and Performance Improvements]]
::[[LII:LIMSpec/Security and Integrity of Systems and Operations|A1.6 Security and Integrity of Systems and Operations]]
::[[LII:LIMSpec/Putting LIMSpec to use|A1.7 Putting LIMSpec to Use]]


<!--Place all category tags here-->
: Appendix 2. Completed example of LIMSpec for medical diagnostics and research labs
[[Category:LIMSwiki journal articles (added in 2018)‎]]
::[[User:Shawndouglas/sandbox/sublevel40|A2.1 Primary Laboratory Workflow]]
[[Category:LIMSwiki journal articles (all)‎]]
::[[User:Shawndouglas/sandbox/sublevel41|A2.2 Maintaining Laboratory Workflow and Operations]]
[[Category:LIMSwiki journal articles on data quality]]
::[[User:Shawndouglas/sandbox/sublevel42|A2.3 Specialty Laboratory Functions]]
[[Category:LIMSwiki journal articles on informatics‎‎]]
::[[User:Shawndouglas/sandbox/sublevel43|A2.4 Technology and Performance Improvements]]
::[[User:Shawndouglas/sandbox/sublevel44|A2.5 Security and Integrity of Systems and Operations]]

Revision as of 21:01, 1 February 2020

Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research
2020 Edition


Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research, 2020 Edition

Title: Laboratory Informatics Buyer's Guide for Medical Diagnostics and Research, 2020 Edition

Author for citation: Shawn E. Douglas and Alan Vaughan

License for content: Creative Commons Attribution-ShareAlike 4.0 International

Publication date: TBD

Insert description of book here.

1. Introduction to medical diagnostics and research laboratories
1.1 Medical diagnostics lab
1.1.1 Pathology
1.1.1.1 Anatomical vs. clinical pathology
1.1.1.2 Forensic pathology
1.1.2 Physician office lab
1.1.3 Integrative medicine lab
1.2 Public health lab
1.3 Toxicology lab
1.4 Blood bank and transfusion lab
1.5 Central and contract research lab
1.5.1 Medical and other research in academia
1.6 Genetic diagnostics lab
1.6.1 Cytogenetics lab
1.7 Medical cannabis testing lab
2. Choosing laboratory informatics software for your lab
2.1 Evaluation and selection
2.1.1 Technology considerations
2.1.1.1 Laboratory informatics options
2.1.2 Features and functions
2.1.3 Cybersecurity considerations
2.1.4 Regulatory compliance considerations
2.1.5 Cost considerations
2.2 Implementation
2.2.1 Internal and external integrations
2.3 MSW, updates, and other contracted services
2.4 How a user requirements specification fits into the entire process (LIMSpec)
3. Additional resources for selecting and implementing informatics solutions
Part 1: Laboratory informatics vendors
3.1 Laboratory informatics vendors
3.1.1 LIMS vendors
3.1.2 LIS vendors
3.1.3 ELN vendors
3.1.4 Middleware vendors
Part 2: Other vendors and service providers
3.2 Medical diagnostics instrumentation vendors
3.3 EHR vendors
3.4 Laboratory business intelligence and workflow solution vendors
3.5 Laboratory billing service providers
Part 3: Industry and community resources
3.6 Trade organizations
3.7 Conferences and trade shows
3.8 User communities
3.9 Books and journals
3.10 Standards
3.11 LIMSpec
4. Taking the next step
4.1 Develop a specification document (LIMSpec) tailored to your lab's needs
4.2 Issue the specification as a request for information (RFI)
4.3 Acquire information and proposals from vendors
4.3.1 The value of demonstrations
5. Closing remarks
Appendix 1. Blank LIMSpec template for medical diagnostics and research labs
A1.1 Introduction and methodology
A1.2 Primary Laboratory Workflow
A1.3 Maintaining Laboratory Workflow and Operations
A1.4 Specialty Laboratory Functions
A1.5 Technology and Performance Improvements
A1.6 Security and Integrity of Systems and Operations
A1.7 Putting LIMSpec to Use
Appendix 2. Completed example of LIMSpec for medical diagnostics and research labs
A2.1 Primary Laboratory Workflow
A2.2 Maintaining Laboratory Workflow and Operations
A2.3 Specialty Laboratory Functions
A2.4 Technology and Performance Improvements
A2.5 Security and Integrity of Systems and Operations