Journal:STATegra EMS: An experiment management system for complex next-generation omics experiments

From LIMSWiki
Revision as of 21:18, 14 March 2016 by Shawndouglas (talk | contribs) (Added more)
Full article title: STATegra EMS: An experiment management system for complex next-generation omics experiments
Journal: BMC Systems Biology
Author(s): Hernández-de-Diego, R.; Boix-Chova, N.; Gómez-Cabrero, D.; Tegner, J.; Abugessaisa, I.; Conesa, A.
Author affiliation(s): Centro de Investigación Príncipe Felipe and the Karolinska Institute
Primary contact: None given
Year published: 2014
Volume and issue: 8(Suppl 2)
Page(s): S9
DOI: 10.1186/1752-0509-8-S2-S9
ISSN: 1752-0509
Distribution license: Creative Commons Attribution 2.0 Generic
Website: http://bmcsystbiol.biomedcentral.com/articles/10.1186/1752-0509-8-S2-S9
Download: http://bmcsystbiol.biomedcentral.com/track/pdf/10.1186/1752-0509-8-S2-S9 (PDF)

Abstract

High-throughput sequencing assays are now routinely used to study different aspects of genome organization. As decreasing costs and widespread availability of sequencing enable more laboratories to use sequencing assays in their research projects, the number of samples and replicates in these experiments can quickly grow to several dozen and thus require standardized annotation, storage and management of preprocessing steps. As part of the STATegra project, we have developed an Experiment Management System (EMS) for high-throughput omics data that supports different types of sequencing-based assays such as RNA-seq, ChIP-seq, Methyl-seq, etc., as well as proteomics and metabolomics data. The STATegra EMS provides metadata annotation of experimental design, samples and processing pipelines, as well as storage of different types of data files, from raw data to ready-to-use measurements. The system has been developed to provide research laboratories with a freely available, integrated system that offers a simple and effective way to annotate experiments and track analysis procedures.

Background

The widespread availability of high-throughput sequencing techniques has substantially impacted genome research and reshaped the way we study genome function and structure. The rapidly decreasing costs of sequencing make these technologies affordable to small and medium-sized laboratories. Furthermore, the constant development of novel sequencing-based assays, named with the suffix -seq, expands the scope of cellular properties analyzable by high-throughput sequencing, with sequencing reads forming an underlying common data format. Today, virtually all nucleic acid omics methods traditionally based on microarrays have a -seq counterpart, and many more have been made available recently. As a consequence, it has become practical to run multiple sequencing-based experiments that measure different aspects of gene regulation and to combine these with non-sequencing omics technologies such as proteomics and metabolomics.[1][2][3][4][5] For example, the ENCODE project combined ten major types of sequencing-based assays to unravel the complexity of genome architecture.[6] Many records in the Sequence Read Archive (SRA) integrate multiple -seq technologies measured on the same samples, and a PubMed search for NGS plus proteomics or metabolomics returns over a hundred entries. Last but not least, one of the advantages of sequencing-based experiments is that they are equally applicable to well-annotated model organisms and to less-studied non-model organisms, since little or no a priori genome knowledge is required.

However, sequencing-based assays come with new challenges for data processing and storage. The storage requirements of a medium-sized sequencing experiment exceed the capacity of typical current workstations. At the same time, the analysis steps needed to go from raw to processed data are more complex and resource intensive. As the number of datasets grows, the need to properly store and track the data and their associated metadata becomes more pressing. For example, a medium-sized RNA-seq experiment of 4 to 20 samples with 20 million reads each can take up to 40 GB of raw data and generate multiple quality control and intermediate processing files amounting to 200 to 500 GB. Laboratory Information and Management Systems (LIMS) or Sample Management Systems (SMS) are bioinformatics tools that help experimentalists organize samples and experimental procedures in a controlled and annotated fashion. Several commercial and free dedicated LIMS have been developed specifically for genotyping labs, where thousands of samples are processed by automated pipelines and procedures are tightly standardized.[7][8][9] One popular LIMS for genomics is BASE.[10] This software includes a highly structured system for metadata annotation and a flexible architecture for defining experiments and incorporating analysis modules. However, BASE is currently restricted to the annotation of microarray experiments.
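The raw-data figure quoted in the RNA-seq example can be checked with a short back-of-envelope calculation. The sketch below is illustrative only and is not part of the STATegra EMS. The value of 100 bytes per read is an assumption, chosen because it reproduces the article's upper bound of 40 GB for 20 samples; the real size per read depends on read length, file format and compression. The function name `raw_data_gb` is hypothetical.

```python
# Back-of-envelope estimate of raw data volume for an RNA-seq experiment.
# ASSUMPTION: about 100 bytes on disk per read. This value is not given in
# the article; it is picked so that 20 samples x 20 million reads = 40 GB.
BYTES_PER_READ = 100
BYTES_PER_GB = 10**9  # decimal gigabytes


def raw_data_gb(n_samples: int, reads_per_sample: int,
                bytes_per_read: int = BYTES_PER_READ) -> float:
    """Return the estimated raw data volume in GB."""
    return n_samples * reads_per_sample * bytes_per_read / BYTES_PER_GB


if __name__ == "__main__":
    # The article's range: 4 to 20 samples of 20 million reads each.
    for n_samples in (4, 20):
        size = raw_data_gb(n_samples, 20_000_000)
        print(f"{n_samples} samples: about {size:.0f} GB of raw data")
```

Under this assumption the raw data alone range from 8 GB to 40 GB, so the quoted 200 to 500 GB of quality control and intermediate files is several times larger than the raw input.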

References

  1. Song, C.X.; Szulwach, K.E.; Dai, Q. et al. (2013). "Genome-wide profiling of 5-formylcytosine reveals its roles in epigenetic priming". Cell 153 (3): 678–691. doi:10.1016/j.cell.2013.04.001. PMC3657391. PMID 23602153. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3657391.
  2. Wei, G.; Abraham, B.J.; Yagi, R. et al. (2011). "Genome-wide analyses of transcription factor GATA3-mediated gene regulation in distinct T cell types". Immunity 35 (2): 299–311. doi:10.1016/j.immuni.2011.08.007. PMC3169184. PMID 21867929. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3169184.
  3. Schmid, N.; Pessi, G.; Deng, Y. et al. (2012). "The AHL- and BDSF-dependent quorum sensing systems control specific and overlapping sets of genes in Burkholderia cenocepacia H111". PLoS One 7 (11): e49966. doi:10.1371/journal.pone.0049966. PMC3502180. PMID 23185499. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3502180.
  4. Bordbar, A.; Mo, M.L.; Nakayasu, E.S. et al. (2012). "Model-driven multi-omic data analysis elucidates metabolic immunomodulators of macrophage activation". Molecular Systems Biology 8: 558. doi:10.1038/msb.2012.21. PMC3397418. PMID 22735334. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3397418.
  5. Baltz, A.G.; Munschauer, M.; Schwanhäusser, B. et al. (2012). "The mRNA-bound proteome and its global occupancy profile on protein-coding transcripts". Molecular Cell 46 (5): 674–690. doi:10.1016/j.molcel.2012.05.021. PMID 22681889.
  6. ENCODE Project Consortium; Bernstein, B.E.; Birney, E. et al. (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57–74. doi:10.1038/nature11247. PMC3439153. PMID 22955616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153.
  7. Van Rossum, T.; Tripp, B.; Daley, D. (2010). "SLIMS: A user-friendly sample operations and inventory management system for genotyping labs". Bioinformatics 26 (14): 1808–1810. doi:10.1093/bioinformatics/btq271. PMC2894515. PMID 20513665. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894515.
  8. Binneck, E.; Silva, J.F.; Neumaier, N. et al. (2004). "VSQual: A visual system to assist DNA sequencing quality control". Genetics and Molecular Research 3 (4): 474–482. PMID 15688314.
  9. Haquin, S.; Oeuillet, E.; Pajon, A. et al. (2008). "Data management in structural genomics: an overview". Methods in Molecular Biology 426: 49–79. doi:10.1007/978-1-60327-058-8_4. PMID 18542857.
  10. Vallon-Christersson, J.; Nordborg, N.; Svensson, M.; Häkkinen, J. (2009). "BASE - 2nd generation software for microarray data management and analysis". BMC Bioinformatics 10: 330. doi:10.1186/1471-2105-10-330. PMC2768720. PMID 19822003. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768720.

Notes

This presentation is faithful to the original, with only a few minor changes to formatting. In some cases important information was missing from the references, and that information was added.