Difference between revisions of "Journal:STATegra EMS: An experiment management system for complex next-generation omics experiments"

Full article title	STATegra EMS: An experiment management system for complex next-generation omics experiments
Journal	BMC Systems Biology
Author(s)	Hernández-de-Diego, R.; Boix-Chova, N.; Gómez-Cabrero, D.; Tegner, J.; Abugessaisa, I.; Conesa, A.
Author affiliation(s)	Centro de Investigación Príncipe Felipe and the Karolinska Institute
Primary contact	None given
Year published	2014
Volume and issue	8(Suppl 2)
Page(s)	S9
DOI	10.1186/1752-0509-8-S2-S9
ISSN	1752-0509
Distribution license	Creative Commons Attribution 2.0 Generic
Website	http://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-8-S2-S9
Download	http://bmcsystbiol.biomedcentral.com/track/pdf/10.1186/1752-0509-8-S2-S9 (PDF)

Revision as of 23:14, 14 March 2016

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

High-throughput sequencing assays are now routinely used to study different aspects of genome organization. As decreasing costs and widespread availability of sequencing enable more laboratories to use sequencing assays in their research projects, the number of samples and replicates in these experiments can quickly grow to several dozen and thus require standardized annotation, storage and management of preprocessing steps. As a part of the STATegra project, we have developed an Experiment Management System (EMS) for high-throughput omics data that supports different types of sequencing-based assays such as RNA-seq, ChIP-seq and Methyl-seq, as well as proteomics and metabolomics data. The STATegra EMS provides metadata annotation of experimental design, samples and processing pipelines, as well as storage of different types of data files, from raw data to ready-to-use measurements. The system has been developed to provide research laboratories with a freely available, integrated system that offers a simple and effective way to annotate experiments and track analysis procedures.

Background

The widespread availability of high-throughput sequencing techniques has significantly impacted genome research and reshaped the way we study genome function and structure. The rapidly decreasing costs of sequencing make these technologies affordable to small and medium-sized laboratories. Furthermore, the constant development of novel sequencing-based assays, named with the suffix -seq, expands the scope of cellular properties analyzable by high-throughput sequencing, with sequencing reads forming an underlying common data format. Today, virtually all nucleic acid omics methods traditionally based on microarrays have a -seq counterpart, and many more have been made available recently. As a consequence, it has become practical to run multiple sequencing-based experiments that measure different aspects of gene regulation and to combine these with non-sequencing omics technologies such as proteomics and metabolomics.^[1]^[2]^[3]^[4]^[5] For example, the ENCODE project combined ten major types of sequencing-based assays to unravel the complexity of genome architecture.^[6] Many records in the SRA archive integrate multiple -seq technologies measured on the same samples, and a PubMed search for NGS plus proteomics or metabolomics returns over a hundred entries. Last but not least, one of the advantages of sequencing-based experiments is that they are equally applicable to the study of well-annotated model organisms and of less-studied non-model organisms, since little or no a priori genome knowledge is required.

Sequencing-based assays also come with new challenges for data processing and storage. The storage requirements of a medium-sized sequencing experiment exceed the capacity of current regular workstations. At the same time, the analysis steps needed to go from raw to processed data are more complex and resource intensive. As the number of datasets grows, the need to properly store and track the data and their associated metadata becomes more pressing. For example, a medium-sized RNA-seq experiment of 4 to 20 samples with 20 million reads each can take up to 40 GB of raw data and generate multiple quality control and intermediate processing files that can add up to 200 to 500 GB. Laboratory Information Management Systems (LIMS) or Sample Management Systems (SMS) are bioinformatics tools that help experimentalists organize samples and experimental procedures in a controlled and annotated fashion. There are several commercial and free dedicated LIMS that have been developed specifically for genotyping labs, where thousands of samples are processed by automated pipelines and procedures are tightly standardized.^[7]^[8]^[9] One popular LIMS for genomics is BASE.^[10] This software includes a highly structured system for metadata annotation and a flexible architecture for defining experiments and incorporating analysis modules. However, BASE is currently restricted to the annotation of microarray experiments.
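The raw-data figure quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below is only an illustration: the value of roughly 100 bytes of uncompressed storage per read is an assumption made here, not a number given in the article, and real values depend on read length, header length and compression.

```python
def raw_data_gb(samples: int, reads_per_sample: int, bytes_per_read: int = 100) -> float:
    """Estimate raw sequencing data volume in GB (1 GB = 10**9 bytes).

    bytes_per_read is an ASSUMED average on-disk size of one read record
    (sequence, qualities and header); it is not taken from the article.
    """
    return samples * reads_per_sample * bytes_per_read / 1e9


# Lower and upper end of the "4 to 20 samples of 20 million reads" range.
low_estimate = raw_data_gb(4, 20_000_000)
high_estimate = raw_data_gb(20, 20_000_000)
print(f"Estimated raw data: {low_estimate:.0f} to {high_estimate:.0f} GB")
```

Under this assumption the upper end of the range matches the 40 GB mentioned in the text; a different per-read size scales the estimate proportionally.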

Several laboratory information systems have been developed and implemented specifically for sequencing facilities to manage the large volume of samples and data routinely handled by such services. Some of these have been made available to the scientific community or exported to other centers. Troshin et al.^[11] published an extension of the Protein Information Management System (PIMS) for the Leeds University DNA sequencing facility, designed to provide sample tracking to both users and operators. The system allows facility users to place orders and monitor the processing status of their samples, while a different interface gives operators full control over the progression of the sequencing pipeline, with automated connection to sequencing robots. The Leeds system supervises the whole procedure from sample submission to generation of fastq files but does not track the actual experimental characteristics of the sequenced samples or the post-processing of the raw data. Other solutions complement the tracking of sequencing samples with analysis modules that execute some steps of the raw data processing, such as quality control analysis or mapping to a reference genome. For example, the QUEST software^[12] uses an experiment-resolved configuration file to store experiment metadata and execute predefined processing pipelines. Another example is NG6,^[13] an integrated NGS storage and processing environment where workflows can be easily defined and adapted to different data input formats. NG6 can be used interactively to generate intermediate analysis statistics and downloadable end results. Similarly, Scholtalbers et al. recently published a LIMS for the Galaxy platform that keeps track of input sample quality and organizes flow cells.^[14] By working within the Galaxy system, associated fastq files are readily available for processing using the platform's analysis resources. Another interesting package is the MADMAX system, which handles multiple omics experiments by incorporating modules for microarrays, metabolomics and genome annotation.^[15] MADMAX uses an Oracle relational database to store sample and raw data, and links to common bioinformatics tools such as Blast or Bioconductor installed on a computer cluster to facilitate data analysis.

We describe the STATegra Experiment Management System (EMS), an information system for storage and annotation of complex NGS and omics experiments. In contrast to other solutions that focus on the management of thousands of samples for core sequencing facilities, the STATegra EMS has as its primary goal the annotation of experiments designed and run at individual research laboratories. The system contains modules for the definition of omics experiments, samples and analysis workflows, and it can flexibly incorporate data from different analytical platforms and sequencing services. The STATegra EMS currently supports mRNA-seq, ChIP-seq, DNase-seq, Methyl-seq, miRNA-seq, proteomics and metabolomics by default and can be easily adapted to support additional high-throughput experiments. The system uses free, open-source software technologies, such as Java Servlets, the Sencha Ext JS framework, the MySQL relational database system and the Apache Tomcat Servlet engine. The application can be downloaded from http://stategra.eu/stategraems.

Methods

STATegra EMS architecture

The STATegra EMS was designed as a multiuser web application and is divided into two components: the server-side application and the client-side web application (Figure 1).

Figure 1. Overview of the STATegra EMS architecture

The server side is responsible for keeping the data consistent and for controlling access to the stored information. It is built using Java Servlets and a MySQL relational database, and a single server instance is shared by all clients. Although primarily designed and tested on Linux servers, the server EMS code could easily be adapted to work on other architectures due to the cross-platform nature of Java. Additionally, the server code was implemented using the Data Access Object design pattern in conjunction with the Data Transfer Object pattern. This provides an abstraction layer for interaction with databases that acts as an intermediary between the server application (servlets) and the MySQL database, making it easier to extend the application code with new features or to change the database model in the future.
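The combination of the Data Access Object and Data Transfer Object patterns can be illustrated with a minimal sketch. This is not the STATegra EMS code, which is written in Java against MySQL; the example uses Python with an in-memory SQLite database as a stand-in, and the class names, table name and columns (`ExperimentDTO`, `ExperimentDAO`, `experiment`) are hypothetical.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentDTO:
    """Data Transfer Object: a plain carrier of values with no database logic."""
    experiment_id: str
    title: str
    owner: str


class ExperimentDAO:
    """Data Access Object: the only place that knows about SQL and the schema."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection
        # Hypothetical schema, created here so the sketch is self-contained.
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS experiment ("
            "experiment_id TEXT PRIMARY KEY, title TEXT NOT NULL, owner TEXT NOT NULL)"
        )

    def insert(self, dto: ExperimentDTO) -> None:
        self._conn.execute(
            "INSERT INTO experiment (experiment_id, title, owner) VALUES (?, ?, ?)",
            (dto.experiment_id, dto.title, dto.owner),
        )
        self._conn.commit()

    def find_by_id(self, experiment_id: str) -> Optional[ExperimentDTO]:
        row = self._conn.execute(
            "SELECT experiment_id, title, owner FROM experiment WHERE experiment_id = ?",
            (experiment_id,),
        ).fetchone()
        return ExperimentDTO(*row) if row else None


# Calling code (the role played by the servlets) only sees DTOs, never SQL.
conn = sqlite3.connect(":memory:")
dao = ExperimentDAO(conn)
dao.insert(ExperimentDTO("EXP001", "Time-course RNA-seq", "alice"))
found = dao.find_by_id("EXP001")
print(found)
```

Because the calling code depends only on the DAO methods and the DTO fields, a change in the database model would, in this arrangement, be confined to the DAO class.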

The STATegra EMS client side was developed as a user-friendly and intuitive web application using Ext JS, a cross-browser JavaScript framework that provides powerful tools for building interactive web applications. The client side is based on the Model-View-Controller architecture pattern, which makes it easier to organize, maintain and extend large client applications. Communication between the client and server sides is handled by AJAX using HTTP GET and POST requests, with JavaScript Object Notation (JSON) for data exchange.
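The JSON-based exchange between client and server can be sketched as a pair of handler functions. The field names (`sample_id`, `organism`, `success`, `reason`) and the in-memory store are assumptions made for illustration; the article does not describe the actual message format, and the real server side consists of Java servlets.

```python
import json

# Hypothetical in-memory store standing in for the server-side database.
_SAMPLES = {"S1": {"sample_id": "S1", "organism": "Mus musculus"}}


def handle_get(sample_id: str) -> str:
    """Answer a GET-style request: return the stored sample as a JSON string."""
    sample = _SAMPLES.get(sample_id)
    if sample is None:
        return json.dumps({"success": False, "reason": "sample not found"})
    return json.dumps({"success": True, "sample": sample})


def handle_post(request_body: str) -> str:
    """Answer a POST-style request: parse the JSON body and store the sample."""
    error = json.dumps({"success": False, "reason": "malformed request"})
    try:
        payload = json.loads(request_body)
    except json.JSONDecodeError:
        return error
    # Reject bodies that are not JSON objects or that lack a string identifier.
    if not isinstance(payload, dict) or not isinstance(payload.get("sample_id"), str):
        return error
    _SAMPLES[payload["sample_id"]] = payload
    return json.dumps({"success": True, "sample_id": payload["sample_id"]})


# The client would send this body in an AJAX POST and read the JSON reply.
reply = handle_post('{"sample_id": "S2", "organism": "Homo sapiens"}')
print(reply)
print(handle_get("S2"))
```

Keeping both directions in JSON means the client-side models and the server-side objects can share the same field names.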

User administration

The STATegra EMS is a system with user access control. Users must be registered in the application by the Administrator before they can start working. As a general rule, the user who creates a data element becomes the owner of that element and has exclusive rights to edit and delete it. However, any owner can grant access rights to other registered users of the system.
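The ownership rule described above can be expressed in a few lines. The sketch below is one possible reading of that rule, not the EMS implementation; the names `DataElement`, `REGISTERED_USERS`, `grant_access` and `can_edit` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical registry of users created by the Administrator.
REGISTERED_USERS: Set[str] = {"alice", "bob", "carol"}


@dataclass
class DataElement:
    """A stored item (experiment, sample, analysis) with an owner."""
    name: str
    owner: str
    editors: Set[str] = field(default_factory=set)

    def can_edit(self, user: str) -> bool:
        # The owner can always edit; others only after being granted access.
        return user == self.owner or user in self.editors

    def grant_access(self, granting_user: str, other_user: str) -> None:
        # Only the owner may share, and only with registered users.
        if granting_user != self.owner:
            raise PermissionError("only the owner can grant access")
        if other_user not in REGISTERED_USERS:
            raise ValueError("user is not registered")
        self.editors.add(other_user)


element = DataElement(name="RNA-seq batch 1", owner="alice")
print(element.can_edit("bob"))   # False: bob has not been granted access
element.grant_access("alice", "bob")
print(element.can_edit("bob"))   # True: alice shared the element with bob
```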

Data specification

The overall objective of the STATegra EMS is to serve as a logbook for high-throughput genomics projects performed at research labs by providing an easy-to-use tool for the annotation of experimental design, samples, measurements, and the analysis pipelines applied to the data. Experimental data and metadata are organized in the EMS around three major metadata modules (Figure 2): the Experiment module, which records experimental design information and associated samples; the Samples module, which collects all information on the biomaterial used; and the Analysis module, which contains analysis pipelines and results. Both the Samples and Analysis modules have been defined broadly to accommodate data from different types of omics experiments and still provide a common annotation framework. Commonly used standards for omics experimental data annotation were followed when defining data specifications to facilitate EMS interoperability. In particular, we leveraged MIAPE^[16] for proteomics analysis annotation, the metabolomics guidelines proposed by Sumner et al.^[17] and Goodacre et al.^[18], and MIAME^[19] and MINSEQE^[20] for sequencing experiments.
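The relationship between the three modules can be sketched as a small data model. All class and field names below are assumptions chosen for illustration; the article does not list the actual attributes, which follow the MIAPE, MIAME and MINSEQE guidelines in the real system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """Samples module: information on the biomaterial used."""
    sample_id: str
    organism: str
    biomaterial: str


@dataclass
class Analysis:
    """Analysis module: a pipeline applied to one or more samples."""
    analysis_id: str
    omics_type: str                      # e.g. "mRNA-seq" or "Proteomics"
    sample_ids: List[str]
    pipeline_steps: List[str] = field(default_factory=list)


@dataclass
class Experiment:
    """Experiment module: design information plus associated samples."""
    experiment_id: str
    title: str
    design: str
    samples: Dict[str, Sample] = field(default_factory=dict)
    analyses: Dict[str, Analysis] = field(default_factory=dict)

    def add_sample(self, sample: Sample) -> None:
        self.samples[sample.sample_id] = sample

    def add_analysis(self, analysis: Analysis) -> None:
        # An analysis may only refer to samples annotated in this experiment.
        unknown = [s for s in analysis.sample_ids if s not in self.samples]
        if unknown:
            raise ValueError(f"unknown samples: {unknown}")
        self.analyses[analysis.analysis_id] = analysis


experiment = Experiment("EXP001", "B-cell differentiation", "time course, 3 replicates")
experiment.add_sample(Sample("S1", "Mus musculus", "cell line"))
experiment.add_analysis(
    Analysis("AN1", "mRNA-seq", ["S1"], ["quality control", "mapping", "quantification"])
)
print(sorted(experiment.analyses))
```

The check in `add_analysis` reflects the logbook idea: every analysis result stays traceable to an annotated sample.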

References

[1] Song, C.X.; Szulwach, K.E.; Dai, Q. et al. (2013). "Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming". Cell 153 (3): 678–691. doi:10.1016/j.cell.2013.04.001. PMC 3657391. PMID 23602153. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3657391.
[2] Wei, G.; Abraham, B.J.; Yagi, R. et al. (2011). "Genome-wide analyses of transcription factor GATA3-mediated gene regulation in distinct T cell types". Immunity 35 (2): 299–311. doi:10.1016/j.immuni.2011.08.007. PMC 3169184. PMID 21867929. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3169184.
[3] Schmid, N.; Pessi, G.; Deng, Y. et al. (2012). "The AHL- and BDSF-dependent quorum sensing systems control specific and overlapping sets of genes in Burkholderia cenocepacia H111". PLoS One 7 (11): e49966. doi:10.1371/journal.pone.0049966. PMC 3502180. PMID 23185499. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3502180.
[4] Bordbar, A.; Mo, M.L.; Nakayasu, E.S. et al. (2012). "Model-driven multi-omic data analysis elucidates metabolic immunomodulators of macrophage activation". Molecular Systems Biology 8: 558. doi:10.1038/msb.2012.21. PMC 3397418. PMID 22735334. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397418.
[5] Baltz, A.G.; Munschauer, M.; Schwanhäusser, B. et al. (2012). "The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts". Molecular Cell 46 (5): 674–690. doi:10.1016/j.molcel.2012.05.021. PMID 22681889.
[6] ENCODE Project Consortium; Bernstein, B.E.; Birney, E. et al. (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57–74. doi:10.1038/nature11247. PMC 3439153. PMID 22955616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153.
[7] Van Rossum, T.; Tripp, B.; Daley, D. (2010). "SLIMS: A user-friendly sample operations and inventory management system for genotyping labs". Bioinformatics 26 (14): 1808–1810. doi:10.1093/bioinformatics/btq271. PMC 2894515. PMID 20513665. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894515.
[8] Binneck, E.; Silva, J.F.; Neumaier, N. et al. (2004). "VSQual: A visual system to assist DNA sequencing quality control". Genetics and Molecular Research 3 (4): 474–482. PMID 15688314.
[9] Haquin, S.; Oeuillet, E.; Pajon, A. et al. (2008). "Data management in structural genomics: an overview". Methods in Molecular Biology 426: 49–79. doi:10.1007/978-1-60327-058-8_4. PMID 18542857.
[10] Vallon-Christersson, J.; Nordborg, N.; Svensson, M.; Häkkinen, J. (2009). "BASE - 2nd generation software for microarray data management and analysis". BMC Bioinformatics 10: 330. doi:10.1186/1471-2105-10-330. PMC 2768720. PMID 19822003. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768720.
[11] Troshin, P.V.; Postis, V.L.; Ashworth, D. et al. (2011). "PIMS sequencing extension: a laboratory information management system for DNA sequencing facilities". BMC Research Notes 4: 48. doi:10.1186/1756-0500-4-48. PMC 3058032. PMID 21385349. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3058032.
[12] Camerlengo, T.; Ozer, H.G.; Onti-Srinivasan, R. et al. (2012). "From sequencer to supercomputer: an automatic pipeline for managing and processing next generation sequencing data". AMIA Joint Summits on Translational Science Proceedings 2012: 1–10. PMC 3392054. PMID 22779037. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3392054.
[13] Mariette, J.; Escudié, F.; Allias, N.; Salin, G.; Noirot, C.; Thomas, S.; Klopp, C. (2012). "NG6: Integrated next generation sequencing storage and processing environment". BMC Genomics 13: 462. doi:10.1186/1471-2164-13-462. PMC 3444930. PMID 22958229. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444930.
[14] Scholtalbers, J.; Rössler, J.; Sorn, P.; de Graaf, J.; Boisguérin, V.; Castle, J.; Sahin, U. (2013). "Galaxy LIMS for next-generation sequencing". Bioinformatics 29 (9): 1233–1234. doi:10.1093/bioinformatics/btt115. PMID 23479349.
[15] Lin, K.; Kools, H.; de Groot, P.J. et al. (2011). "MADMAX - Management and analysis database for multiple omics experiments". Journal of Integrative Bioinformatics 8 (2): 160. doi:10.2390/biecoll-jib-2011-160. PMID 21778530.
[16] Taylor, C.F.; Paton, N.W.; Lilley, K.S. et al. (2007). "The minimum information about a proteomics experiment (MIAPE)". Nature Biotechnology 25 (8): 887–893. doi:10.1038/nbt1329. PMID 17687369.
[17] Sumner, L.W.; Amberg, A.; Barrett, D. et al. (2007). "Proposed minimum reporting standards for chemical analysis Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI)". Metabolomics 3 (3): 211–221. doi:10.1007/s11306-007-0082-2. PMC 3772505. PMID 24039616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3772505.
[18] Goodacre, R.; Broadhurst, D.; Smilde, A.K. et al. (2007). "Proposed minimum reporting standards for data analysis in metabolomics". Metabolomics 3 (3): 231–241. doi:10.1007/s11306-007-0081-3.
[19] Brazma, A.; Hingamp, P.; Quackenbush, J. et al. (2001). "Minimum information about a microarray experiment (MIAME)-toward standards for microarray data". Nature Genetics 29 (4): 365–371. doi:10.1038/ng1201-365. PMID 11726920.
[20] Functional Genomics Data Society. "MINSEQE: Minimum Information about a high throughput Nucleotide SeQuencing Experiment". fged.org. http://www.fged.org/projects/minseqe/.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. References 17 and 18 were not attached to any text in the original, so author names were added to the text for better flow.

