Journal:SaDA: From sampling to data analysis—An extensible open source infrastructure for rapid, robust and automated management and analysis of modern ecological high-throughput microarray data

Full article title	SaDA: From sampling to data analysis—An extensible open source infrastructure for rapid, robust and automated management and analysis of modern ecological high-throughput microarray data
Journal	International Journal of Environmental Research and Public Health
Author(s)	Singh, Kumar Saurabh; Thual, Dominique; Spurio, Roberto; Cannata, Nicola
Author affiliation(s)	University of Camerino; Next Generation Bioinformatics s.r.l
Primary contact	Tel.: +44-1582-763133 (ext. 5230); Fax: +44-1582-762595
Editors	Tchounwou, Paul B.
Year published	2015
Volume and issue	12 (6)
Page(s)	6352-6366
DOI	10.3390/ijerph120606352
ISSN	1660-4601
Distribution license	Creative Commons Attribution 4.0 International
Website	http://www.mdpi.com/1660-4601/12/6/6352/htm
Download	http://www.mdpi.com/1660-4601/12/6/6352/pdf (PDF)

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

One of the most crucial aspects of day-to-day laboratory information management is the collection, storage and retrieval of information about research subjects and environmental or biomedical samples. An efficient link between sample data and experimental results is essential for the successful outcome of a collaborative project. Currently available software solutions are largely limited to large-scale, expensive commercial Laboratory Information Management Systems (LIMS). Acquiring such a LIMS can indeed bring laboratory information management to a higher level, but most of the time this requires a significant investment of money, time and technical effort. There is a clear need for a lightweight open source system that can easily be managed on local servers and handled by individual researchers. Here we present a software system named SaDA for storing, retrieving and analyzing data originating from microorganism monitoring experiments. SaDA is fully integrated in the management of environmental samples, oligonucleotide sequences, microarray data and the subsequent downstream analysis procedures. It is simple and generic software, and can be extended and customized for various environmental and biomedical studies.

Keywords: software; data management; microarrays; ecological assessment; environmental studies; LIMS; open source system

Introduction

To gain an understanding of complex microbial environments, researchers have to assemble different types of information originating from heterogeneous sources. Every bit of information is valuable and should be made available whenever it is required. An exhaustive and clear protocol for data collection, processing and analysis is critical to achieving the project's end goals. The rapid increase in the discovery and application of molecular techniques constitutes an important step towards the detection and identification of microorganisms present in complex environmental samples. Identification and characterization of microorganisms is a key part of managing the environment and quality of life, tracing contaminants and troubleshooting problems such as pathogenesis or pollution indication.^[1] Identification of an unknown species can help to assess whether it poses a safety concern. Generally, the process of bio-monitoring plays a crucial role in the execution of large-scale projects which focus on the protection of people’s health, the safety of materials or environmental quality. Big ecology-related projects address very complicated issues.^[2] They require generation and collection of data from a wide spectrum of sources, collaboration of people from different disciplines and the application of highly complex analytical approaches.^[3] Moreover, information management is especially important when a large volume of complex data is collected. It becomes even more important when multiple organizations, at different geographical locations, are involved in the data collection procedures. It is essential that data are collected and managed using methods recognized by the scientific community. Existing software systems do not typically cover the full pipeline as far as ecological collaborative projects are concerned. They are limited by data types, provide limited extensibility or require additional data-handling skills which are beyond the scope of experimental researchers. We developed Sampling to Data Analysis (SaDA) keeping in mind the requirements of individual laboratories involved in big collaborative projects. The system is a complete portal for ecology-related high-throughput data management and analysis. Extra care was taken in designing the database schema, information retrieval and data presentation. Because the source code is freely available, SaDA is customizable and extensible to meet the needs of diverse research activities. The core of SaDA is based on model-driven development, where abstract models of software systems are created and systematically transformed into their implementation.^[4] The current version of SaDA supports the data management and analysis protocols involved in freshwater monitoring systems that use microarray technology as a state-of-the-art technology.
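The idea of transforming an abstract model into an implementation can be illustrated with a minimal sketch. It is not SaDA's actual code or schema: the entity names (`sample`, `hybridization`), their fields, and the choice of SQLite are assumptions made purely for illustration. The sketch declares a small model as plain data, generates table definitions from it, and then links a sample record to an experimental record through a foreign key.

```python
import sqlite3

# Hypothetical abstract model: table names, fields and SQL types are
# illustrative assumptions, not SaDA's real database schema.
MODEL = {
    "sample": {
        "id": "INTEGER PRIMARY KEY",
        "site": "TEXT NOT NULL",
        "collected_on": "TEXT NOT NULL",
    },
    "hybridization": {
        "id": "INTEGER PRIMARY KEY",
        "sample_id": "INTEGER NOT NULL REFERENCES sample(id)",
        "array_barcode": "TEXT NOT NULL",
    },
}


def model_to_ddl(model):
    """Transform the abstract model into CREATE TABLE statements."""
    statements = []
    for table, fields in model.items():
        columns = ", ".join(f"{name} {sql_type}" for name, sql_type in fields.items())
        statements.append(f"CREATE TABLE {table} ({columns})")
    return statements


def build_database(model):
    """Create an in-memory SQLite database from the generated statements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the sample link
    for statement in model_to_ddl(model):
        conn.execute(statement)
    return conn


conn = build_database(MODEL)
# Made-up example records: one sample and one array run on that sample.
conn.execute(
    "INSERT INTO sample (site, collected_on) VALUES (?, ?)",
    ("Lake A", "2014-05-12"),
)
conn.execute(
    "INSERT INTO hybridization (sample_id, array_barcode) VALUES (?, ?)",
    (1, "ARRAY-001"),
)
# Retrieve the experimental record together with its originating sample.
rows = conn.execute(
    "SELECT s.site, h.array_barcode "
    "FROM hybridization h JOIN sample s ON s.id = h.sample_id"
).fetchall()
```

Keeping the model as data means that adding a field or an entity only requires editing `MODEL`; the generated schema follows from it, which is the general appeal of a model-driven approach.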

State-of-art technology

To the best of our knowledge, no freely available software is focused on microarray data management and analysis in terms of its ecological application, as in environmental monitoring. The aim of SaDA is to provide a streamlined workflow from environmental sample collection to downstream analysis. Microarray technology is used here as a state-of-the-art technology to detect and quantify pathogens and other microorganisms in environmental freshwater samples. SaDA currently supports Agilent’s one-channel oligonucleotide hybridization microarray technology. In addition, the downstream data analysis covers detecting the presence or absence of signals, normalizing the data and estimating cell counts from the microarray signal intensities. SaDA is not a tool for gene expression analysis. Although existing software tools could meet some of the requirements of SaDA, none meet all of them in the form of a comprehensive, end-to-end platform available as an open source and extensible system. Some tools have experienced only limited use. State-of-the-art Laboratory Information Management Systems (LIMS) cover all aspects of microarray data management and analysis, using advanced automation and precise robotic control. This type of integrated solution ensures high-quality, reliable data output and enables multiple projects to be managed in parallel. LIMS are mainly commercial and expensive software products. Such commercial systems lack the clarity of open source systems, come at considerable cost, and require a high level of technical expertise to install and configure hardware, software and databases; they are not discussed in the present work.
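The three downstream analysis steps named above (presence/absence calling, normalization and cell-count estimation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure SaDA implements: the threshold rule (mean plus three standard deviations of the negative controls), normalization against a positive control, the linear calibration curve, and all function names and numeric values are hypothetical.

```python
from statistics import mean, stdev


def detection_threshold(negative_controls, k=3.0):
    """Assumed rule: background mean plus k standard deviations."""
    return mean(negative_controls) + k * stdev(negative_controls)


def normalize(signal, positive_control):
    """Scale a probe signal by the positive-control signal of the same array."""
    return signal / positive_control


def estimate_cells(normalized_signal, slope, intercept=0.0):
    """Invert an assumed linear calibration: normalized = slope * cells + intercept."""
    return max(0.0, (normalized_signal - intercept) / slope)


def analyse_array(probes, negative_controls, positive_control, calibration):
    """Call presence, normalize, and estimate cell counts for each probe."""
    threshold = detection_threshold(negative_controls)
    results = {}
    for probe, signal in probes.items():
        present = signal > threshold
        norm = normalize(signal, positive_control)
        cells = None
        # Only quantify probes that are detected and have a calibration curve.
        if present and probe in calibration:
            slope, intercept = calibration[probe]
            cells = estimate_cells(norm, slope, intercept)
        results[probe] = {"present": present, "normalized": norm, "cells": cells}
    return results


# Made-up intensities for two hypothetical probes on one array.
probes = {"probe_A": 5000.0, "probe_B": 120.0}
negative_controls = [100.0, 110.0, 90.0]
positive_control = 10000.0
calibration = {"probe_A": (0.001, 0.0)}  # (slope, intercept), hypothetical

results = analyse_array(probes, negative_controls, positive_control, calibration)
```

With these invented numbers the background threshold is 130, so `probe_A` is called present and quantified while `probe_B` is called absent and left without a cell-count estimate.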

MARS^[5] provides a comprehensive MIAME-supportive suite for storing, retrieving, and analyzing multi-color microarray data. iLAP^[6], based on a workflow-driven modular architecture, was developed specifically to create and manage experimental protocols and to analyze and share laboratory data. BASE^[7] has proved to be a comprehensive microarray data repository, providing researchers with an efficient data management and analysis tool. LabKey Server^[8] is a complete system with easy-to-use interfaces for specimen request submission across collaborators, allowing users to graphically define new data types to fit diverse datasets. It interacts dynamically with external data sources and supports developing custom interfaces using client libraries. GnomEx^[9] provides comprehensive automated protocols for sample submission, sample tracking, billing, data management and analysis workflows for both Next Generation Sequencing (NGS) and microarrays. LIMS Light System^[10] has been developed to meet the demands of high-throughput primary data storage and data upload/retrieval. This system provides efficient storage with a Hierarchical Storage Management System and an Oracle relational database. openBIS^[11] is a flexible framework that has been adapted for use in proteomics, high content screening and NGS projects. The SMITH^[12] and WASP^[13] systems have been developed to meet the demands of NGS experiments and clinical tests. They provide embedded pipelines for the analysis of ChIP-Seq, RNA-Seq, miRNA-Seq and Exome-Seq experiments. Galaxy LIMS^[15] is an extension that adds LIMS functionality to the popular Galaxy^[14] workflow engine. The system supports request submission, offers assistance during flow cell layout, and automatically launches Illumina’s CASAVA software^[16] to perform de-multiplexing and delivery of the user-specific NGS raw files. Because it is integrated with Galaxy, the data are automatically available to be processed by analysis pipelines stored in Galaxy. SLIMS^[17] is a sample management tool for genotyping laboratories.

References

1. Rüegg, J.; Gries, C.; Bond-Lamberty, B.; Bowen, G.J.; Felzer, B.S.; McIntyre, N.E.; Soranno, P.; Vanderbilt, K.L.; Weathers, K.C. (2014). "Completing the data life cycle: Using information management in macrosystems ecology research". Frontiers in Ecology and the Environment 12 (1): 24–30. doi:10.1890/120375. http://www.esajournals.org/doi/abs/10.1890/120375.
2. Heffernan, J.B.; Soranno, P.; Angilletta, M.J.; Buckley, L.B.; Gruner, D.S.; Keitt, T.H.; Kellner, J.R.; Kominoski, J.S.; Rocha, A.V.; Xiao, J. et al. (2014). "Macrosystems ecology: Understanding ecological patterns and processes at continental scales". Frontiers in Ecology and the Environment 12 (1): 5–14. doi:10.1890/130017. http://www.esajournals.org/doi/abs/10.1890/130017.
3. Goring, S.J.; Weathers, K.C.; Dodds, W.K.; Soranno, P.; Sweet, L.C.; Cheruvelil, K.S.; Kominoski, J.S.; Rüegg, J.; Thorn, A.M.; Utz, R.M. (2014). "Improving the culture of interdisciplinary collaboration in ecology by expanding measures of success". Frontiers in Ecology and the Environment 12 (1): 39–47. doi:10.1890/120370. http://www.esajournals.org/doi/abs/10.1890/120370.
4. Buschmann, F.; Meunier, R.; Rohnert, H.; Sommerlad, P.; Stal, M. (1996). Pattern-Oriented Software Architecture: A System of Patterns. 1. West Sussex, England: John Wiley & Sons Ltd. pp. 476. ISBN 9780471958697. http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471958697.html.
5. Maurer, M.; Molidor, R.; Sturn, A.; Hartler, J.; Hackl, H.; Stocker, G.; Prokesch, A.; Scheideler, M.; Trajanoski, Z. (2005). "MARS: Microarray analysis, retrieval, and storage system". BMC Bioinformatics 6: 101. doi:10.1186/1471-2105-6-101. PMC 1090551. PMID 15836795. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1090551.
6. Stocker, G.; Fischer, M.; Rieder, D.; Bindea, G.; Kainz, S.; Oberstolz, M.; McNally, J.G.; Trajanoski, Z. (2009). "iLAP: A workflow-driven software for experimental protocol development, data acquisition and analysis". BMC Bioinformatics 10: 390. doi:10.1186/1471-2105-10-390. PMC 2789074. PMID 19941647. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2789074.
7. Vallon-Christersson, J.; Nordborg, N.; Svensson, M.; Häkkinen, J. (2009). "BASE - 2nd generation software for microarray data management and analysis". BMC Bioinformatics 10: 330. doi:10.1186/1471-2105-10-330. PMC 2768720. PMID 19822003. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768720.
8. Nelson, E.K.; Piehler, B.; Eckels, J.; Rauch, A.; Bellew, M.; Hussey, P.; Ramsay, S.; Nathe, C.; Lum, K.; Krouse, K.; Stearns, D.; Connolly, B.; Skillman, T.; Igra, M. (2011). "LabKey Server: An open source platform for scientific data integration, analysis and collaboration". BMC Bioinformatics 12: 71. doi:10.1186/1471-2105-12-71. PMC 3062597. PMID 21385461. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3062597.
9. Nix, D.; di Sera, T.L.; Dalley, B.K.; Milash, B.; Cundick, R.M.; Quinn, K.S.; Courdy, S.J. (2010). "Next generation tools for genomic data generation, distribution, and visualization". BMC Bioinformatics 11: 455. doi:10.1186/1471-2105-11-455. PMC 2944281. PMID 20828407. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2944281.
10. Colmsee, C.; Flemming, S.; Klapperstück, M.; Lange, M.; Scholz, U. (2011). "A case study for efficient management of high throughput primary lab data". BMC Research Notes 4: 413. doi:10.1186/1756-0500-4-413. PMC 3217054. PMID 22005096. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3217054.
11. Bauch, A.; Adamczyk, I.; Buczek, P.; Elmer, F.-J.; Enimanev, K.; Glyzewski, P.; Kohler, M.; Pylak, T.; Quandt, A.; Ramakrishnan, C.; Beisel, C.; Malmström, L.; Aebersold, R.; Rinn, B. (2011). "openBIS: a flexible framework for managing and analyzing complex data in biology research". BMC Bioinformatics 12: 468. doi:10.1186/1471-2105-12-468. PMC 3275639. PMID 22151573. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3275639.
12. Venco, F.; Ceol, A.; Muller, H. (2013). "SMITH: A LIMS for handling next-generation sequencing workflows". EMBnet.journal 19 (B): 85–87. doi:10.14806/ej.19.B.739. http://journal.embnet.org/index.php/embnetjournal/article/view/739.
13. McLellan, A.S.; Dubin, R.; Jing, Q.; Broin, P.Ó.; Moskowitz, D.; Suzuki, M.; Calder, R.B.; Hargitai, J.; Golden, A.; Greally, J.M. (2012). "The Wasp System: An open source environment for managing and analyzing genomic data". Genomics 100 (6): 345–351. doi:10.1016/j.ygeno.2012.08.005. PMID 22944616.
14. Giardine, B.; Riemer, C.; Hardison, R.C.; Burhans, R.; Elnitski, L.; Shah, P.; Zhang, Y.; Blankenberg, D.; Albert, I.; Taylor, J.; Miller, W.; Kent, W.J.; Nekrutenko, A. (2005). "Galaxy: A platform for interactive large-scale genome analysis". Genome Research 15 (10): 1451–1455. doi:10.1101/gr.4086505. PMC 1240089. PMID 16169926. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1240089.
15. Scholtalbers, J.; Rössler, J.; Sorn, P.; de Graaf, J.; Boisguérin, V.; Castle, J.; Sahin, U. (2013). "Galaxy LIMS for next-generation sequencing". Bioinformatics 29 (9): 1233–1234. doi:10.1093/bioinformatics/btt115. PMID 23479349.
16. "Sequencing Documentation". CASAVA User Guide. Illumina, Inc. http://support.illumina.com/sequencing/documentation.html. Retrieved 02 June 2015.
17. Van Rossum, T.; Tripp, B.; Daley, D. (2010). "SLIMS—A user-friendly sample operations and inventory management system for genotyping labs". Bioinformatics 26 (14): 1808–1810. doi:10.1093/bioinformatics/btq271. PMC 2894515. PMID 20513665. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894515.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.
