Difference between revisions of "Journal:One tool to find them all: A case of data integration and querying in a distributed LIMS platform"

Revision as of 22:24, 20 January 2020

Full article title One tool to find them all: A case of data integration and querying in a distributed LIMS platform
Journal Database
Author(s) Grand, Alberto; Geda, Emanuele; Mignone, Andrea; Bertotti, Andrea; Fiori, Alessandro
Author affiliation(s) Candiolo Cancer Institute, University of Torino
Primary contact Email: alessandro dot fiori at ircc dot it
Year published 2019
Volume and issue 2019
Page(s) baz004
DOI 10.1093/database/baz004
ISSN 1758-0463
Distribution license Creative Commons Attribution 4.0 International
Website https://academic.oup.com/database/article/doi/10.1093/database/baz004/5304001
Download https://academic.oup.com/database/article-pdf/doi/10.1093/database/baz004/27643896/baz004.pdf (PDF)

Abstract

In recent years, laboratory information management systems (LIMS) have been growing from mere inventory systems into increasingly comprehensive software platforms, spanning functionalities as diverse as data search, annotation, and analysis. In 2011, our institution started a LIMS project named the Laboratory Assistant Suite with the purpose of assisting researchers throughout all of their laboratory activities, providing graphical tools to support decision-making tasks and building complex analyses on integrated data. The modular architecture of the system exploits multiple databases with different technologies. To provide an efficient and easy tool for retrieving information of interest, we developed the Multi-Dimensional Data Manager (MDDM). By means of intuitive interfaces, scientists can execute complex queries without any knowledge of query languages or database structures, and easily integrate heterogeneous data stored in multiple databases. Together with the other software modules making up the platform, the MDDM has helped improve the overall quality of the data, substantially reduced the time spent with manual data entry and retrieval, and ultimately broadened the spectrum of interconnections among the data, offering novel perspectives to biomedical analysts.

Introduction

The introduction of automation and high-throughput technologies in laboratory environments has raised diverse issues related to the amount and heterogeneity of the data produced, the adoption of robust procedures for sample tracking, and the management of computer-based workflows needed to process and analyze the raw data. Laboratory information management systems (LIMS) have gained increasing popularity because they can ensure good levels of quality control over laboratory activities and efficiently handle the large amounts of data produced.[1]

LIMS aim at assisting the researchers in their daily laboratory practice, improving the accessibility of instruments, and tracking biological samples and their related information.

In the past decade, several open-source and proprietary LIMS have been developed. Commercial solutions are typically large, complex, and feature-rich products designed to support large laboratories. Their license fees can be prohibitive, and extra features may come at additional cost.[2] To reduce these costs, the latest generation of commercial LIMS adopts web-oriented software technologies, particularly the software-as-a-service distribution model, which reduces the customer’s final expenditure on license fees, hardware, and maintenance. Examples of commercial solutions include STARLIMS[3], Exemplar LIMS[4], and LabVantage.[5]

Commercial LIMS tend to offer features based on common laboratory procedures and best practices, which may not fit highly specific settings well. For instance, LabVantage provides a large set of features, such as sample and batch management, quality control, advanced storage and logistics, and task scheduling. However, the life cycle of xenopatients (i.e., biological models for cancer research based on the transplantation of human tumors in mice) is not available in the standard software and must be implemented as a custom module by the software developer. Another issue that affects commercial LIMS is the management and standardization of genomic data. To the best of our knowledge, these systems do not exploit any knowledge base related to genomic data and do not provide any validation or analysis of the different genomic data stored in the system.

Open-source solutions, by contrast, tend to focus on specific sub-domains: Galaxy[6][7] addresses DNA sequencing and annotation, while SeqWare[8] tracks in vivo and in vitro experiments and allows for complex analysis workflows.

For this reason, many institutions have invested in the development of in-house solutions and/or have adapted open-source projects to their own requirements. In this way, the developed solutions can provide functionality that meets the specific needs of the researchers in their institution's laboratories. From an engineering perspective, developing in-house solutions may also permit the exploration and adoption of new technologies, in order to define better data models and improve system performance.

To address a substantial mismatch between the LIMS solutions on offer and the functional requirements dictated by research practice, in 2011 the Institute for Cancer Research at Candiolo (Italy) started to implement its own LIMS, named the Laboratory Assistant Suite (LAS) platform (9). The main purpose of the platform was to assist researchers in different laboratory and research activities, allowing management of different kinds of raw data (e.g., biological, molecular), tracking experimental data, supporting decision-making tasks, and integrating heterogeneous data for complex analyses. As development progressed, several new features and modules were included to (i) track clinical data, (ii) support the newest technologies exploited for molecular experiments, and (iii) standardize the description of genomic data by means of semantic web technologies. Thanks to these new features, scientists can gain better insight into tumor development by jointly studying the clinical evolution of the disease and the experimental results derived from in vivo and in vitro experimentation. The experimental pipelines exploited in the translational research context are the primary focus of the LAS, which targets the standardization of genomic data to allow a comparison of results coming from different technologies.

Still, unlike the other commercial and open-source platforms, the LAS attempts to cover a wide range of diverse laboratory procedures and, thanks to its versatile and general-purpose structure, can be extended to support new ones with limited effort.

Because the LAS supports a wide variety of highly specific experimental technologies, large amounts of heterogeneous and complex information are collected in separate databases. To enable users to extract and correlate information from the different databases exploited by the platform, a Multi-Dimensional Data Manager (MDDM) module was developed. The module takes care of merging data from the different LAS databases and provides a simple graphical user interface to extract the information of interest without any knowledge of a query language. A tool to visualize biological entities and their related information with a hierarchical tree structure is also available, while other powerful visualization tools are currently under development. To the best of our knowledge, no similar tools applied to biological data and distributed databases exist.
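
As a rough illustration of the kind of cross-database merging described above, the following Python sketch filters records in two separate stores and joins them on a shared sample identifier. It is not the MDDM implementation: the schema, the store contents, and every function name (build_sample_db, query_samples, merge_by_sample, mutated_samples) are hypothetical, and an in-memory SQLite table plus a list of dictionaries merely stand in for the platform's distinct databases.

```python
import sqlite3


def build_sample_db():
    """Hypothetical relational store of biological samples (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (sample_id TEXT PRIMARY KEY, tissue TEXT)")
    conn.executemany(
        "INSERT INTO samples VALUES (?, ?)",
        [("S1", "colon"), ("S2", "liver"), ("S3", "colon")],
    )
    return conn


# Hypothetical document-style store of molecular results, keyed by sample id.
MOLECULAR_RESULTS = [
    {"sample_id": "S1", "gene": "KRAS", "mutated": True},
    {"sample_id": "S3", "gene": "KRAS", "mutated": False},
]


def query_samples(conn, tissue):
    """Filter the relational store by tissue; return plain dictionaries."""
    rows = conn.execute(
        "SELECT sample_id, tissue FROM samples WHERE tissue = ? ORDER BY sample_id",
        (tissue,),
    )
    return [{"sample_id": sid, "tissue": t} for sid, t in rows]


def merge_by_sample(samples, results):
    """Join the two result sets on the shared sample identifier."""
    by_id = {}
    for record in results:
        by_id.setdefault(record["sample_id"], []).append(record)
    merged = []
    for sample in samples:
        for record in by_id.get(sample["sample_id"], []):
            merged.append({**sample, **record})
    return merged


def mutated_samples(conn, tissue, gene):
    """Samples of a given tissue that carry a mutation in the given gene."""
    samples = query_samples(conn, tissue)
    hits = [r for r in MOLECULAR_RESULTS if r["gene"] == gene and r["mutated"]]
    return merge_by_sample(samples, hits)
```

Each store is filtered with its own mechanism, and only the join step needs to know the shared key. A graphical query builder could, under the same assumption, compose such filter and merge steps on the user's behalf so that no query language is exposed.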

This paper presents the main characteristics of the LAS and its exploitation in the research laboratories of the Institute for Cancer Research at Candiolo by its researchers and research partners. Next, the main features of the MDDM are described. Afterwards, current and future research directions are presented.

LAS

Translational research aims at enhancing patient care and transferring scientific discoveries from the laboratory to a real clinical context. It is a kind of metaphorical scientific cycle from bench to bedside and back again through complex iterative processes, operating between laboratory (i.e., preclinical research) and clinic. To manage and integrate preclinical and clinical information, a robust but flexible data management platform is needed. In particular, different types of information (e.g., biological data, molecular data, procedure tracking data, and sample tracking data), some of which can be highly complex, should be independently managed by the platform but, at the same time, interconnected to permit integrated analyses.

The LAS platform is freely available upon request from the authors. The software is distributed by means of a Docker-based approach to allow interested organizations to configure it according to their constraints. Moreover, the use of Docker allows system administrators to run the software on different servers using the Docker Swarm configuration for balancing the workload, as well as the associated data resources. We usually recommend installing the LAS on at least two servers, one dedicated to the containers running the software and the other to the databases. The servers' characteristics depend on several aspects, such as the number of simultaneously logged-in users, the number of biological entities tracked, and the size of the raw data stored. As an initial setup, we suggest a server with at least 16 GB of RAM and 2 TB of storage space. Interested users may refer to the video tutorials (available at http://lasrep.ircc.it/) to explore the main system features and as a reference guide during usage. The documentation of the platform is provided with the software and can be downloaded from the documentation section of the LAS instance of the Institute for Cancer Research at Candiolo.

What follows is a description of the data architecture and the main functionalities included in the platform.

Data models

References

  1. Chen, Y.; Lin, Y.; Yuan, X. et al. "Chapter 9: LIMS and Clinical Data Management". In Shen, B.; Tang, H.; Jiang, X. Translational Biomedical Informatics: A Precision Medicine Perspective. Springer. pp. 225–240. doi:10.1007/978-981-10-1503-8_9. ISBN 9789811015038.
  2. Wood, S. (September 2007). "Comprehensive Laboratory Informatics: A Multilayer Approach" (PDF). American Laboratory. p. 3. Archived from the original on 25 August 2017. https://web.archive.org/web/20170825181932/https://www.it.uu.se/edu/course/homepage/lims/vt12/ComprehensiveLaboratoryInformatics.pdf.
  3. "Starlims". Abbott. 2018. https://www.informatics.abbott/us/en/offerings/lims. 
  4. "Sapio Sciences". Sapio Sciences, LLC. 2018. https://www.sapiosciences.com/. 
  5. "LabVantage". LabVantage Solutions, Inc. 2018. https://www.labvantage.com/. 
  6. Blankenberg, D.; Von Kuster, G.; Coraor, N. et al. (2010). "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology 19 (Unit 19.10.1–21). doi:10.1002/0471142727.mb1910s89. PMC 4264107. PMID 20069535. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264107.
  7. Goecks, J.; Nekrutenko, A.; Taylor, J.; The Galaxy Team (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC 2945788. PMID 20738864. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945788.
  8. O'Connor, B.D.; Merriman, B.; Nelson, S.F. (2010). "SeqWare Query Engine: storing and searching sequence data in the cloud". BMC Bioinformatics 11 (Suppl. 12): S2. doi:10.1186/1471-2105-11-S12-S2. http://www.biomedcentral.com/1471-2105/11/S12/S2. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. We also added PMCID and DOI when they were missing from the original reference.