Difference between revisions of "Journal:One tool to find them all: A case of data integration and querying in a distributed LIMS platform"

Full article title	One tool to find them all: A case of data integration and querying in a distributed LIMS platform
Journal	Database
Author(s)	Grand, Alberto; Geda, Emanuele; Mignone, Andrea; Bertotti, Andrea; Fiori, Alessandro
Author affiliation(s)	Candiolo Cancer Institute, University of Torino
Primary contact	Email: alessandro dot fiori at ircc dot it
Year published	2019
Volume and issue	2019
Page(s)	baz004
DOI	10.1093/database/baz004
ISSN	1758-0463
Distribution license	Creative Commons Attribution 4.0 International
Website	https://academic.oup.com/database/article/doi/10.1093/database/baz004/5304001
Download	https://academic.oup.com/database/article-pdf/doi/10.1093/database/baz004/27643896/baz004.pdf (PDF)

Revision as of 23:16, 20 January 2020

This article should be considered a work in progress and incomplete. Consider this article incomplete until this notice is removed.

Abstract

In recent years, laboratory information management systems (LIMS) have been growing from mere inventory systems into increasingly comprehensive software platforms, spanning functionalities as diverse as data search, annotation, and analysis. In 2011, our institution started a LIMS project named the Laboratory Assistant Suite with the purpose of assisting researchers throughout all of their laboratory activities, providing graphical tools to support decision-making tasks and building complex analyses on integrated data. The modular architecture of the system exploits multiple databases with different technologies. To provide an efficient and easy tool for retrieving information of interest, we developed the Multi-Dimensional Data Manager (MDDM). By means of intuitive interfaces, scientists can execute complex queries without any knowledge of query languages or database structures, and easily integrate heterogeneous data stored in multiple databases. Together with the other software modules making up the platform, the MDDM has helped improve the overall quality of the data, substantially reduced the time spent with manual data entry and retrieval, and ultimately broadened the spectrum of interconnections among the data, offering novel perspectives to biomedical analysts.

Introduction

The introduction of automation and high-throughput technologies in laboratory environments has raised diverse issues related to the amount and heterogeneity of the data produced, the adoption of robust procedures for sample tracking, and the management of computer-based workflows needed to process and analyze the raw data. Laboratory information management systems (LIMS) have gained increasing popularity because they can ensure good levels of quality control over laboratory activities and efficiently handle the large amounts of data produced.^[1]

LIMS aim at assisting the researchers in their daily laboratory practice, improving the accessibility of instruments, and tracking biological samples and their related information.

In the past decade, several open-source as well as proprietary LIMS have been developed. Commercial solutions are typically large, complex, and feature-rich products designed to easily support large laboratories. Their license fees can be prohibitive, and extra features may come at additional costs.^[2] To reduce these costs, the latest generation of commercial LIMS adopts web-oriented software technologies, particularly the software-as-a-service distribution model, which reduces the customer’s final expenditure on license fees, hardware, and maintenance. Examples of commercial solutions include STARLIMS^[3], Exemplar LIMS^[4], and LabVantage.^[5]

Commercial LIMS tend to offer features based on common laboratory procedures and best practices, which may not fit highly specific settings well. For instance, LabVantage provides a large set of features, such as sample and batch management, quality control, advanced storage and logistics, and task scheduling. However, the life cycle of xenopatients (i.e., biological models for cancer research based on the transplantation of human tumors in mice) is not available in the standard software and should be implemented as a custom module by the software developer. Another issue that affects commercial LIMSs is the management and standardization of genomic data. To the best of our knowledge, these systems do not exploit any knowledge base related to the genomic data and do not provide any validation and analysis of different genomic data stored in the system.

Open-source solutions such as Galaxy^[6]^[7], which addresses DNA sequencing and annotation, and SeqWare^[8], which tracks in vivo and in vitro experiments and allows for complex analysis workflows, focus instead on specific sub-domains.

For these reasons, many institutions have invested in the development of in-house solutions and/or have adapted open-source projects to their own requirements. In this way, the developed solutions can provide functionality that meets the specific needs of the researchers in their institution's laboratories. From an engineering perspective, developing in-house solutions may also permit the exploration and adoption of new technologies, in order to define better data models and improve system performance.

To address a substantial mismatch between the LIMS solutions on offer and the functional requirements dictated by research practice, the Institute for Cancer Research at Candiolo (Italy) started to implement its own LIMS, named the Laboratory Assistant Suite (LAS) platform, in 2011.^[9] The main purpose of the platform was to assist researchers in different laboratory and research activities, allowing management of different kinds of raw data (e.g., biological, molecular), tracking experimental data, supporting decision-making tasks, and integrating heterogeneous data for complex analyses. As development progressed, several new features and modules were included to (i) track clinical data, (ii) support the newest technologies exploited for molecular experiments, and (iii) standardize the description of genomic data by means of semantic web technologies. Thanks to these new features, scientists can gain better insight into tumor development by jointly studying the clinical evolution of the disease and the experimental results derived from in vivo and in vitro experimentation. The experimental pipelines exploited in the translational research context are the primary focus of the LAS, which targets the standardization of the genomic data to allow a comparison of results coming from different technologies.

Still, unlike other commercial and open-source platforms, the LAS attempts to cover a wide range of diverse laboratory procedures and, thanks to its versatile and general-purpose structure, can be extended to support new ones with limited effort.

Because of the wide variety of experimental technologies supported by the LAS and their high level of specificity, large amounts of heterogeneous and complex information are collected in separate databases. To enable users to extract and correlate information from the different databases exploited by the platform, a Multi-Dimensional Data Manager (MDDM) module was developed. The module takes care of merging data from the different LAS databases and provides a simple graphical user interface to extract the information of interest without any knowledge of a query language. A tool to visualize biological entities and their related information with a hierarchical tree structure is also available, while other powerful visualization tools are currently under development. To the best of our knowledge, no similar tools applied to biological data and distributed databases exist.

This paper presents the main characteristics of the LAS and its exploitation in the research laboratories of the Institute for Cancer Research at Candiolo by its researchers and research partners. Next, the main features of the MDDM are described. Afterwards, current and future research directions are presented.

LAS

Translational research aims at enhancing patient care and transferring scientific discoveries from the laboratory to a real clinical context. It is a kind of metaphorical scientific cycle from bench to bedside and back again through complex iterative processes, operating between laboratory (i.e., preclinical research) and clinic. To manage and integrate preclinical and clinical information, a robust but flexible data management platform is needed. In particular, different types of information (e.g., biological data, molecular data, procedure tracking data, and sample tracking data), some of which can be highly complex, should be independently managed by the platform but, at the same time, interconnected to permit integrated analyses.

The LAS platform is freely available upon request to the authors. The software is distributed by means of a Docker-based approach to allow interested organizations to configure it according to their constraints. Moreover, the use of Docker allows system administrators to run the software on different servers using the Docker Swarm configuration for balancing the workload, as well as the associated data resources. We usually recommend installing the LAS on at least two servers, one dedicated to the containers running the software and the other to the databases. The server characteristics depend on several aspects, such as the number of simultaneously logged-in users, the number of biological entities tracked, and the size of the raw data stored. As an initial setup, we suggest a server with at least 16 GB of RAM and 2 TB of storage space. Interested users may refer to the video tutorials (available at http://lasrep.ircc.it/) to explore the main system features and as a reference guide during use. The documentation of the platform is provided with the software and can be downloaded from the documentation section of the LAS instance of the Institute for Cancer Research at Candiolo.

What follows is a description of the data architecture and the main functionalities included in the platform.

Data models

The LAS platform has been developed using different database technologies to fit the needs of the application and to suitably handle the heterogeneous characteristics of the data.

The platform makes use of relational databases to track biological entities and their properties, as well as the information about the various experimental procedures. Since the platform includes different modules managing substantially different types of entities and/or specific laboratory procedures, different database instances are exploited. The core biological entities (i.e., Aliquot, Biomouse, and Cell Line) are identified by a unique and mnemonic key named GenealogyID that encodes relevant information regarding the history of the entity. This key is automatically generated by the LAS platform through formal rules and may be used to link the data across the databases.
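As a rough illustration of how a shared key can link records held in separate databases, the following Python sketch joins two independent SQLite databases on a common identifier. It is only a sketch under stated assumptions: the table names, column names, and key values (such as SAMPLE-0001) are hypothetical placeholders, and they do not reproduce the actual GenealogyID encoding rules or the LAS schemas.

```python
import sqlite3

# Two separate in-memory SQLite databases stand in for two LAS module databases.
# All table names, column names, and key values below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS storage")

conn.execute("CREATE TABLE main.aliquot (genealogy_id TEXT PRIMARY KEY, tissue TEXT)")
conn.execute("CREATE TABLE storage.slot (genealogy_id TEXT, container TEXT)")

conn.executemany(
    "INSERT INTO main.aliquot VALUES (?, ?)",
    [("SAMPLE-0001", "liver metastasis"), ("SAMPLE-0002", "primary tumor")],
)
conn.executemany(
    "INSERT INTO storage.slot VALUES (?, ?)",
    [("SAMPLE-0001", "freezer-A/rack-3"), ("SAMPLE-0002", "freezer-B/rack-1")],
)


def locate(genealogy_id):
    """Return (tissue, container) for one entity, or None if the key is unknown.

    The join condition is the shared key, which is the only thing the two
    databases need to agree on.
    """
    return conn.execute(
        "SELECT a.tissue, s.container "
        "FROM main.aliquot AS a "
        "JOIN storage.slot AS s ON s.genealogy_id = a.genealogy_id "
        "WHERE a.genealogy_id = ?",
        (genealogy_id,),
    ).fetchone()
```

Calling locate("SAMPLE-0001") returns the tissue type from one database together with the container from the other. In a distributed deployment the two sides would normally be queried separately and merged in application code; a single attached connection is used here only to keep the example self-contained.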

Parallel to the relational databases storing operating data, a graph database is exploited. It is used to represent the complex inherent hierarchy of biological entities and their relationships. Being able to easily and efficiently reconstruct the genealogical tree of each entity is indeed an essential feature of the platform, allowing the user to perform ad hoc queries and to isolate specific sub-trees of biological entities involved in the experimental pipeline. Moreover, the graph database has been exploited to store a knowledge base for the heterogeneous domains managed by the LAS modules. By using a graph representation, all these domains can be easily interconnected, while the knowledge base can be continuously updated and augmented with new layers of information and different levels of abstraction (e.g., proteomics, clinical, etc.). Finally, a social network of users and research groups using the LAS platform is also stored in the graph, to model data ownership, resolve data access conflicts, and manage data sharing and collaboration among different groups or users.
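The idea of isolating a sub-tree of related biological entities can be sketched in a few lines of Python. The example below is an assumption-laden illustration rather than the platform's implementation: a plain in-memory adjacency list replaces the graph database, and the entity names and derivation edges are invented for the example.

```python
from collections import defaultdict

# Hypothetical (parent, child) derivation edges between biological entities.
DERIVATIONS = [
    ("tumor-sample", "aliquot-1"),
    ("tumor-sample", "aliquot-2"),
    ("aliquot-1", "xenograft-1"),
    ("xenograft-1", "aliquot-3"),
    ("aliquot-2", "cell-line-1"),
]


def build_children(edges):
    """Index the edges as a parent -> list of children mapping."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    return children


def subtree(children, root):
    """Return root and all its descendants in depth-first preorder."""
    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        result.append(node)
        # Push in reverse so that children are visited in insertion order.
        stack.extend(reversed(children.get(node, [])))
    return result


children = build_children(DERIVATIONS)
print(subtree(children, "aliquot-1"))
```

A graph database performs this kind of traversal natively through its query language, which is one reason a graph model suits genealogical data better than repeated self-joins on a relational table.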

A document database, MongoDB, is also used to store files associated with biological entities and metadata generated by both the LAS Genomic Data Manager modules and the MDDM. The latter usage will be discussed in detail in the next section.

Functionalities

The LAS architecture includes a set of software modules, i.e., fully-fledged web applications, each addressing a different type of biological entity and its associated experimental procedures. Modules may interact with each other by means of web application programming interfaces (APIs), e.g., to exchange data and/or to carry out operations that span multiple entities or domains. (The modules currently included in the LAS platform are described in the following.)

Although the platform has been under development since 2011, we always took security issues into account during its design and development. As a result, our software is compliant with the constraints of the General Data Protection Regulation (GDPR), whose enforcement began on May 25, 2018. Indeed, the management of data produced by different users and/or groups requires that access to functionalities and information be restricted according to several criteria, such as group and/or project membership and user role. For these reasons, the platform manages users and their privileges following these concepts:

Working Group: A Working Group (WG) is a set of users in the LAS platform that work together toward a specific goal (e.g., project, research activity). The data produced by the users of the same group are private, unless they intentionally share data with other groups.

User Profiles: Each user belonging to a WG has a set of permissions to access the LAS functionalities they have been enabled to use. These functionalities are defined according to the role selected during the user registration process. The manager of their WG or the system administrator can assign new functionalities upon request.
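The two concepts above can be sketched as a pair of access checks in Python. This is a minimal illustration under assumptions: the class names, field names, and example functionality labels are hypothetical and are not taken from the LAS code base.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    working_group: str
    # Functionalities enabled for this user (hypothetical labels).
    functionalities: set = field(default_factory=set)


@dataclass
class Record:
    owner_group: str
    # Other working groups the owners intentionally shared this record with.
    shared_with: set = field(default_factory=set)


def can_use(user, functionality):
    """A user may only use functionalities enabled in their profile."""
    return functionality in user.functionalities


def can_read(user, record):
    """Data are private to the owning group unless explicitly shared."""
    return (
        user.working_group == record.owner_group
        or user.working_group in record.shared_with
    )


alice = User("alice", "wg-colorectal", {"biobank.collect"})
bob = User("bob", "wg-imaging", {"storage.manage"})
sample = Record(owner_group="wg-colorectal")
```

With these definitions, can_read(alice, sample) holds, while can_read(bob, sample) only becomes true after "wg-imaging" is added to sample.shared_with, mirroring the rule that sharing is an intentional act by the owning group.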

To collect data, the user is required to specify the informed consent signed by the patient for specific research activities (e.g., preclinical trials) involving personal samples and information. This document is defined by a committee to satisfy all the constraints included in the GDPR. Since the data are collected for research purposes, the patient can only revoke the usage of the biological samples, but not the information (e.g., experimental results) collected by the researchers. Only the researchers who are included in the research project can manage these samples and track the experimental processes according to their profile. The platform tracks all the procedures performed by each user in order to identify malicious usage of the software.

For each patient, the Clinical module tracks both contextual information (e.g., personal data and the medical center of the trial) and relevant clinical events through a case report form. All data are linked to the corresponding informed consent that grants data and specimen sampling.

The BioBanking Management module covers a wide range of activities, including management of biological samples and associated pathological information, as well as support for a number of laboratory-related procedures. For instance, the module can handle the collection of biological material from surgical intervention and the acquisition of aliquots from external laboratories. Aliquots stored in the system are characterized by features such as tumor type (e.g., colorectal), tissue type (e.g., liver metastasis), source hospital, or laboratory and pathological information. Measurement of aliquot physical characteristics such as volume, concentration, purity, and quality can be tracked by the module, as well as the derivation of new biological materials (e.g., DNA and cDNA) and the planning of molecular experiments.

The biological material used in our laboratories is stored by means of several types of containers (e.g., freezers, racks, plates, and tubes). Their mutual interactions (i.e., which types of containers can host other containers) can change according to characteristics such as the layout and the laboratory procedure. Additionally, the Storage Management module allows managing any kind of container by defining and applying different rules to them, and it tracks the relationships between the containers and the biological entities.

Different types of molecular analyses can be conducted on biological samples, to investigate various aspects of their genetic constituents that may have an impact on the development of oncogenic behavior. For instance, biologists may be interested in analyzing mutations for a target gene involved in tumor proliferation. In an effort to closely track the translational research pipeline from the collection of samples to their analysis, the LAS supports tracking of the most frequently used molecular profiling techniques in our institution (e.g., Sanger sequencing, real-time polymerase chain reaction (PCR), and Sequenom). Each molecular module queries the knowledge base of the Genomic Annotation Manager (GAM) to retrieve the description of its reagents, as well as a specification of all possible alterations (e.g., sequence alterations and gene copy number variations) known in the literature, to allow both the experiment definition and the evaluation of experimental results.

The GAM provides a higher-level, qualitative insight into the genomic features of biological samples. This information is shaped in the form of annotations, i.e., a set of semantic labels attached to a sample, pointing out some of its relevant features. To ensure semantic coherence and adopt a standardized nomenclature, all relevant concepts from the genomic and biological domains used for labeling samples have been drawn from a number of public, freely accessible databases and ontologies.^[10]^[11]^[12]^[13] This information has been structured into a knowledge base, modeled as a graph, and stored in a graph database.^[14] Concepts are interlinked with one another according to both general-purpose semantic relationships such as containment ("part of") or generalization ("is a"), and domain-specific relationships (e.g., indicating an underlying biochemical process, as in "is transcribed from"). New concepts and relationships, as well as new domains of interest, may be added or layered as needed to account for novel findings and broaden the spectrum of investigation. Within the GAM, every annotation is a semantic statement establishing a relationship, expressed by means of a predicate, between a biological sample (the subject of the statement) and a concept (the object of the statement), such as a genetic mutation. It is represented within the graph database as a node of type "annotation" with a pair of incoming and outgoing edges, one linking the biological sample to the annotation node by means of a has_annotation relationship, and the other linking the annotation node to the reference node in the knowledge base by means of a has_reference relationship. The annotation node is often linked to other nodes, such as the process that produced the annotation or the raw experimental data.
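The annotation pattern described above (sample, annotation node, reference concept) can be sketched in Python as follows. The relationship names has_annotation and has_reference come from the text; everything else, including the class, the node identifiers, and the in-memory edge list standing in for the graph database, is an assumption made for illustration.

```python
class AnnotationGraph:
    """Tiny in-memory stand-in for a graph database (illustrative only)."""

    def __init__(self):
        self.node_types = {}  # node id -> node type
        self.edges = []       # (source id, relationship, target id)
        self._annotation_count = 0

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def annotate(self, sample_id, concept_id):
        """Create an annotation node linking a sample to a knowledge-base concept."""
        self._annotation_count += 1
        annotation_id = f"annotation-{self._annotation_count}"
        self.add_node(annotation_id, "annotation")
        # One incoming edge from the sample, one outgoing edge to the concept.
        self.edges.append((sample_id, "has_annotation", annotation_id))
        self.edges.append((annotation_id, "has_reference", concept_id))
        return annotation_id

    def concepts_for(self, sample_id):
        """Follow has_annotation, then has_reference, from a sample."""
        annotations = {
            target for source, rel, target in self.edges
            if source == sample_id and rel == "has_annotation"
        }
        return sorted(
            target for source, rel, target in self.edges
            if source in annotations and rel == "has_reference"
        )


graph = AnnotationGraph()
graph.add_node("sample-1", "sample")       # hypothetical sample id
graph.add_node("mutation-x", "concept")    # hypothetical knowledge-base concept
first = graph.annotate("sample-1", "mutation-x")
```

Modeling the annotation as its own node, rather than as a direct edge from sample to concept, leaves room to attach further edges to it, such as the process that produced the annotation or the raw experimental data.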

References

↑ Chen, Y.; Lin, Y.; Yuan, X. et al. (2016). "Chapter 9: LIMS and Clinical Data Management". In Shen, B.; Tang, H.; Jiang, X. Translational Biomedical Informatics: A Precision Medicine Perspective. Springer. pp. 225–240. doi:10.1007/978-981-10-1503-8_9. ISBN 9789811015038.
↑ Wood, S. (September 2007). "Comprehensive Laboratory Informatics: A Multilayer Approach" (PDF). American Laboratory. pp. 3. Archived from the original on 25 August 2017. https://web.archive.org/web/20170825181932/https://www.it.uu.se/edu/course/homepage/lims/vt12/ComprehensiveLaboratoryInformatics.pdf.
↑ "Starlims". Abbott. 2018. https://www.informatics.abbott/us/en/offerings/lims.
↑ "Sapio Sciences". Sapio Sciences, LLC. 2018. https://www.sapiosciences.com/.
↑ "LabVantage". LabVantage Solutions, Inc. 2018. https://www.labvantage.com/.
↑ Blankenberg, D.; Von Kuster, G.; Coraor, N. et al. (2010). "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology 19 (Unit 19.10.1–21). doi:10.1002/0471142727.mb1910s89. PMC PMC4264107. PMID 20069535. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264107.
↑ Goecks, J.; Nekrutenko, A.; Taylor, J.; The Galaxy Team (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC PMC2945788. PMID 20738864. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945788.
↑ O'Connor, B.D.; Merriman, B.; Nelson, S.F. (2010). "SeqWare Query Engine: storing and searching sequence data in the cloud". BMC Bioinformatics 11 (Suppl. 12): S2. doi:10.1186/1471-2105-11-S12-S2. http://www.biomedcentral.com/1471-2105/11/S12/S2.
↑ Baralis, E.; Bertotti, A.; Fiori, A.; Grand, A. (2012). "LAS: A software platform to support oncological data management". Journal of Medical Systems 36 (Suppl. 1): S81–90. doi:10.1007/s10916-012-9891-6. PMID 23117791.
↑ Forbes, S.A.; Beare, D.; Gunasekaran, P. et al. (2015). "COSMIC: Exploring the world's knowledge of somatic mutations in human cancer". Nucleic Acids Research 43 (DB1): D805–11. doi:10.1093/nar/gku1075. PMC PMC4383913. PMID 25355519. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383913.
↑ Forbes, S.A.; Bindal, N.; Beare, D. et al. (2016). "Abstract 5285: COSMIC: Comprehensively exploring oncogenomics". Cancer Research 76 (14 Suppl.): 5285. doi:10.1158/1538-7445.AM2016-5285.
↑ Kitts, A.; Phan, L.; Ward, M. et al. (2013). "The Database of Short Genetic Variation (dbSNP)". The NCBI Handbook (2nd ed.). National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/books/NBK174586/.
↑ Eilbeck, K.; Lewis, S.E.; Mungall, C.J. et al. (2005). "The Sequence Ontology: A tool for the unification of genome annotations". Genome Biology 6 (5): R44. doi:10.1186/gb-2005-6-5-r44. PMC PMC1175956. PMID 15892872. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1175956.
↑ Vukotic, A.; Watt, N.; Abedrabbo, T. et al. (2014). Neo4j in Action. Manning Publications. ISBN 9781617290763.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation, spelling, and grammar. We also added PMCID and DOI when they were missing from the original reference.

@@ Line 41: / Line 41: @@
 For this reason, many institutions have invested in the development of in-house solutions and/or have adapted open-source projects to their own requirements. In this way, the developed solutions can provide functionality that meet the specific needs of the researchers in their institution laboratories. From an engineering perspective, developing in-house solutions may also permit the exploration and adoption of new technologies, in order to define better data models and improve system performance.
-To address a substantial mismatch between the LIMS solutions on offer and the functional requirements dictated by research practice, in 2011 the Institute for Cancer Research at Candiolo (Italy) started to implement its own LIMS, named the Laboratory Assistant Suite (LAS) platform (9). The main purpose of the platform was to assist researchers in different laboratory and research activities, allowing management of different kinds of raw data (e.g., biological, molecular), tracking experimental data, supporting decision-making tasks, and integrating heterogeneous data for complex analyses. As development progressed, several new features and modules were included to (i) track clinical data, (ii) include support to the newest technologies exploited for molecular experiments, and (iii) standardize the description of genomic data by means of semantic web technologies. Thanks to these new features, scientists can gain better insight into tumor development by jointly studying the clinical evolution of the disease and the experimental results derived from ''in vivo'' and ''in vitro'' experimentation. The experimental pipelines exploited in the translational research context are the primary focus of the LAS, which targets the standardization of the genomic data to allow a comparison of results coming from different technologies.
+To address a substantial mismatch between the LIMS solutions on offer and the functional requirements dictated by research practice, the Institute for Cancer Research at Candiolo (Italy) started to implement its own LIMS, named the Laboratory Assistant Suite (LAS) platform, in 2011.<ref name="BaralisLAS12">{{cite journal |title=LAS: A software platform to support oncological data management |journal=Journal of Medical System |author=Baralis, E.; Bertotti, A.; Fiori, A.; Grand, A. |volume=36 |issue=Suppl. 1 |pages=S81–90 |year=2012 |doi=10.1007/s10916-012-9891-6 |pmid=23117791}}</ref> The main purpose of the platform was to assist researchers in different laboratory and research activities, allowing management of different kinds of raw data (e.g., biological, molecular), tracking experimental data, supporting decision-making tasks, and integrating heterogeneous data for complex analyses. As development progressed, several new features and modules were included to (i) track clinical data, (ii) include support to the newest technologies exploited for molecular experiments, and (iii) standardize the description of genomic data by means of semantic web technologies. Thanks to these new features, scientists can gain better insight into tumor development by jointly studying the clinical evolution of the disease and the experimental results derived from ''in vivo'' and ''in vitro'' experimentation. The experimental pipelines exploited in the translational research context are the primary focus of the LAS, which targets the standardization of the genomic data to allow a comparison of results coming from different technologies.
 Still, unlike the other commercial and open-source platforms, the LAS makes an attempt at covering a wide range of diverse laboratory procedures and, thanks to its versatile and general-purpose structure, it can be extended to support new ones with limited effort.
@@ Line 57: / Line 57: @@
 ===Data models===
+The LAS platform has been developed using different database technologies to fit the needs of the application, and to handle in a suitable way the heterogeneous data characteristics.
+The platform makes use of relational databases to track biological entities and their properties, as well as the information about the various experimental procedures. Since the platform includes different modules managing substantially different types of entities and/or specific laboratory procedures, different database instances are exploited. The core biological entities (i.e., Aliquot, Biomouse, and Cell Line) are identified by a unique and mnemonic key named GenealogyID that encodes relevant information regarding the history of the entity. This key is automatically generated by the LAS platform through formal rules and may be used to link the data across the databases.
+Parallel to the relational databases storing operating data, a graph database is exploited. It is used to represent the complex inherent hierarchy of biological entities and their relationships. Being able to easily and efficiently reconstruct the genealogical tree of each entity is indeed an essential feature of the platform, allowing the user to perform ''ad hoc'' queries and to isolate specific sub-trees of biological entities involved in the experimental pipeline. Moreover, the graph database has been exploited to store a knowledge base for the heterogeneous domains managed by the LAS modules. By using a graph representation, all these domains can be easily interconnected, while the knowledge base can be continuously updated and augmented with new layers of information and different levels of abstraction (e.g., proteomics, clinical, etc.). Finally, a social network of users and research groups using the LAS platform is also stored in the graph, to model data ownership, resolve data access conflicts, and manage data sharing and collaboration among different groups or users.
+A document database, MongoDB, is also used to store files associated with biological entities and metadata generated by both the LAS Genomic Data Manager modules and the MDDM. The latter usage will be discussed in detail in the next section.
+===Functionalities===
+The LAS architecture includes a set of software modules, i.e., fully-fledged web applications, each addressing a different type of biological entity and its associated experimental procedures. Modules may interact with each other by means of web [[application programming interface]]s (APIs), e.g., to exchange data and/or to carry out operations that span multiple entities or domains. (The modules currently included in the LAS platform are described in the following.)
+Although the platform has been under development since 2011, security has been taken into account throughout its design and development. As a result, our software is compliant with the constraints of the [[General Data Protection Regulation]] (GDPR), enforced since May 25, 2018. Indeed, managing data produced by different users and/or groups requires that access to functionalities and information be restricted according to several criteria, such as group and/or project membership and user role. For these reasons, the platform manages users and their privileges according to the following concepts:
+* Working Group: A Working Group (WG) is a set of users in the LAS platform that work together toward a specific goal (e.g., project, research activity). The data produced by the users of the same group are private, unless they intentionally share data with other groups.
+* User Profiles: Each user belonging to a WG has a set of permissions to access the LAS functionalities they have been enabled to use. These functionalities are defined according to the role selected during the user registration process. The manager of their WG or the system administrator can assign new functionalities upon request.
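The two concepts above can be sketched as a pair of checks: one on the functionalities enabled in a user profile, and one on data ownership and sharing between Working Groups. The class names, field names, and functions below are illustrative assumptions, not the platform's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    working_group: str
    # Functionalities enabled for this profile (hypothetical labels).
    functionalities: set = field(default_factory=set)


@dataclass
class Record:
    owner_group: str
    # Working Groups the owners have intentionally shared the data with.
    shared_with: set = field(default_factory=set)


def can_use(user, functionality):
    """A functionality is available only if it is enabled in the profile."""
    return functionality in user.functionalities


def can_read(user, record):
    """Data are private to the owning group unless explicitly shared."""
    return (
        user.working_group == record.owner_group
        or user.working_group in record.shared_with
    )
```

Under this sketch, a record stays invisible to other groups until its owners add a group to <tt>shared_with</tt>, mirroring the "private unless intentionally shared" rule.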
+To collect data, the user is required to specify the informed consent signed by the patient for specific research activities (e.g., preclinical trials) involving personal samples and information. This document is defined by a committee to satisfy all the constraints included in the GDPR. Since the data are collected for research purposes, the patient can only revoke the usage of the biological samples, but not of the information (e.g., experimental results) collected by the researchers. Only the researchers who are included in the research project can manage these samples and track the experimental processes according to their profile. The platform tracks all the procedures performed by each user in order to identify malicious usage of the software.
+For each patient, the Clinical module tracks both contextual information (i.e., personal data, Medical Center of the Trial, etc.) and relevant clinical events through a case report form. All data are linked to the corresponding informed consent, which authorizes data and specimen sampling.
+The BioBanking Management module covers a wide range of activities, including management of biological samples and associated pathological information, as well as support for a number of laboratory-related procedures. For instance, the module can handle the collection of biological material from surgical intervention and the acquisition of aliquots from external laboratories. Aliquots stored in the system are characterized by features such as tumor type (e.g., colorectal), tissue type (e.g., liver metastasis), source [[hospital]], or laboratory and pathological information. Measurement of aliquot physical characteristics such as volume, concentration, purity, and quality can be tracked by the module, as well as the derivation of new biological materials (e.g., DNA and cDNA) and the planning of molecular experiments.
+The biological material used in our laboratories is stored by means of several types of containers (e.g., freezers, racks, plates, and tubes). Their mutual interactions (i.e., which types of containers can host other containers) can change according to characteristics such as the layout and the laboratory procedure. The Storage Management module allows users to manage any kind of container by defining and applying different rules to them, and it tracks the relationships between the containers and the biological entities.
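A rule set governing which container types can host others might be sketched as a simple mapping. The specific rules in <tt>HOSTING_RULES</tt> and the function names are hypothetical; in the platform such rules are user-defined and may vary with layout and laboratory procedure.

```python
# Hypothetical hosting rules: container type -> types it may host.
HOSTING_RULES = {
    "freezer": {"rack"},
    "rack": {"plate"},
    "plate": {"tube"},
    "tube": set(),
}


def can_host(parent_type, child_type, rules=HOSTING_RULES):
    """Check whether a container type is allowed to host another type."""
    return child_type in rules.get(parent_type, set())


def place(parent_type, child_type, rules=HOSTING_RULES):
    """Return the (parent, child) relationship, or raise if it breaks a rule."""
    if not can_host(parent_type, child_type, rules):
        raise ValueError(f"{parent_type} cannot host {child_type}")
    return (parent_type, child_type)
```

Passing a different <tt>rules</tt> mapping changes the allowed nesting without touching the checking logic, which is one way to keep rules configurable.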
+Different types of molecular analyses can be conducted on biological samples, to investigate various aspects of their genetic constituents that may have an impact on the development of oncogenic behavior. For instance, biologists may be interested in analyzing mutations for a target gene involved in tumor proliferation. In an effort to closely track the translational research pipeline from the collection of samples to their analysis, the LAS supports tracking of the molecular profiling techniques most frequently used in our institution (e.g., Sanger sequencing, real-time polymerase chain reaction [PCR], and Sequenom). Each molecular module queries the knowledge base of the Genomic Annotation Manager (GAM) to retrieve the description of its reagents, as well as a specification of all possible alterations (e.g., sequence alterations and gene copy number variations) known in the literature, to allow both the experiment definition and the evaluation of experimental results.
+The GAM provides a higher-level, qualitative insight into the genomic features of biological samples. This information is shaped in the form of annotations, i.e., a set of semantic labels attached to a sample, pointing out some of its relevant features. To ensure semantic coherence and adopt a standardized nomenclature, all relevant concepts from the genomic and biological domains used for labeling samples have been drawn from a number of public, freely accessible databases and ontologies.<ref name="ForbesCOSMIC15">{{cite journal |title=COSMIC: Exploring the world's knowledge of somatic mutations in human cancer |journal=Nucleic Acids Research |author=Forbes, S.A.; Beare, D.; Gunasekaran, P. et al. |volume=43 |issue=DB1 |pages=D805–11 |year=2015 |doi=10.1093/nar/gku1075 |pmid=25355519 |pmc=PMC4383913}}</ref><ref name="ForbesAbstract16">{{cite journal |title=Abstract 5285: COSMIC: Comprehensively exploring oncogenomics |journal=Cancer Research |author=Forbes, S.A.; Bindal, N.; Beare, D. et al. |volume=76 |issue=14 Suppl. |page=5285 |year=2016 |doi=10.1158/1538-7445.AM2016-5285}}</ref><ref name="KittsTheData13">{{cite book |chapter=The Database of Short Genetic Variation (dbSNP) |title=The NCBI Handbook |author=Kitts, A.; Phan, L.; Ward, M. et al. |publisher=National Center for Biotechnology Information |edition=2nd |year=2013 |url=https://www.ncbi.nlm.nih.gov/books/NBK174586/}}</ref><ref name="EilbeckTheSeq05">{{cite journal |title=The Sequence Ontology: A tool for the unification of genome annotations |journal=Genome Biology |author=Eilbeck, K.; Lewis, S.E.; Mungall, C.J. et al. |volume=6 |issue=5 |pages=R44 |year=2005 |doi=10.1186/gb-2005-6-5-r44 |pmid=15892872 |pmc=PMC1175956}}</ref> This information has been structured into a knowledge base, modeled as a graph, and stored in a graph database.<ref name="VukoticNeo4j14">{{cite book |title=Neo4j in Action |author=Vukotic, A.; Watt, N.; Abedrabbo, T. et al. 
|publisher=Manning Publications |year=2014 |isbn=9781617290763}}</ref> Concepts are interlinked with one another according to both general-purpose semantic relationships such as containment ("part of") or generalization ("is a"), and domain-specific relationships (e.g., indicating an underlying biochemical process, as in "is transcribed from"). New concepts and relationships, as well as new domains of interest, may be added or layered as needed to account for novel findings and broaden the spectrum of investigation. Within the GAM, every annotation is a semantic statement establishing a relationship, expressed by means of a predicate, between a biological sample (the subject of the statement) and a concept (the object of the statement), such as a genetic mutation. It is represented within the graph database as a node of type "annotation" with a pair of incoming and outgoing edges, one linking the biological sample to the annotation node by means of a <tt>has_annotation</tt> relationship, and the other linking the annotation node to the reference node in the knowledge base by means of a <tt>has_reference</tt> relationship. The annotation node is often linked to other nodes, such as the process that produced the annotation or the raw experimental data.
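The annotation structure described above (sample, annotation node, reference concept) can be sketched with plain edge triples. The relationship names <tt>has_annotation</tt> and <tt>has_reference</tt> come from the text; the node identifiers and helper functions are hypothetical, and a real deployment would store these edges in the graph database instead of a Python list.

```python
def annotate(sample, concept, annotation_id):
    """Build the two edges that attach a concept to a sample via an annotation node."""
    return [
        (sample, "has_annotation", annotation_id),
        (annotation_id, "has_reference", concept),
    ]


def concepts_for(sample, edges):
    """Follow has_annotation, then has_reference, to list a sample's concepts."""
    annotations = {
        dst for src, rel, dst in edges
        if src == sample and rel == "has_annotation"
    }
    return sorted(
        dst for src, rel, dst in edges
        if src in annotations and rel == "has_reference"
    )


# Hypothetical identifiers for illustration only.
edges = []
edges += annotate("sample-1", "mutation-X", "annotation-1")
edges += annotate("sample-1", "mutation-Y", "annotation-2")
edges += annotate("sample-2", "mutation-X", "annotation-3")
```

Keeping the annotation as its own node, rather than a direct sample-to-concept edge, leaves room to attach further edges to it, such as the process that produced the annotation or the raw experimental data.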
 ==References==
