Journal:Integrated systems for NGS data management and analysis: Open issues and available solutions

From LIMSWiki
Revision as of 23:05, 29 August 2016 by Shawndouglas (talk | contribs) (Added content. Saving and adding more.)
Full article title Integrated systems for NGS data management and analysis: Open issues and available solutions
Journal Frontiers in Genetics
Author(s) Bianchi, V.; Ceol, A.; Ogier, A.G.; de Pretis, S.; Galeota, E.; Kishore, K.; Bora, P.; Croci, O.; Campaner, S.; Amati, B.; Morelli, M.J.; Pelizzola, M.
Author affiliation(s) Fondazione Istituto Italiano di Tecnologia, European Institute of Oncology
Primary contact Email: hubert dot rehrauer at fgcz dot ethz dot ch
Editors Pellegrini, M.
Year published 2016
Volume and issue 7
Page(s) 75
DOI 10.3389/fgene.2016.00075
ISSN 1664-8021
Distribution license Creative Commons Attribution 4.0 International
Website http://journal.frontiersin.org/article/10.3389/fgene.2016.00075/full
Download http://journal.frontiersin.org/article/10.3389/fgene.2016.00075/pdf (PDF)

Abstract

Next-generation sequencing (NGS) technologies have deeply changed our understanding of cellular processes by delivering an astonishing amount of data at affordable prices; nowadays, many biology laboratories have already accumulated a large number of sequenced samples. However, managing and analyzing these data poses new challenges, which may easily be underestimated by research groups lacking IT and quantitative skills. In this perspective, we identify five issues that should be carefully addressed by research groups approaching NGS technologies. In particular, the five key issues concern: (1) adopting a laboratory information management system (LIMS) and safeguarding the resulting raw data structure in downstream analyses; (2) monitoring the flow of the data and standardizing input and output directories and file names, even when multiple analysis protocols are used on the same data; (3) ensuring complete traceability of the analysis performed; (4) enabling non-experienced users to run analyses through a graphical user interface (GUI) acting as a front-end for the pipelines; (5) relying on standard metadata to annotate the datasets, and when possible using controlled vocabularies, ideally derived from biomedical ontologies. Finally, we discuss the currently available tools in the light of these issues, and we introduce HTS-flow, a new workflow management system conceived to address the concerns we raised. HTS-flow is able to retrieve information from a LIMS database, manage data analyses through a simple GUI, output data in standard locations, and allow the complete traceability of datasets, accompanying metadata, and analysis scripts.

Keywords: high-throughput sequencing, workflow management system, genomics, epigenomics, laboratory information management system

Introduction

Next-generation sequencing (NGS) technologies have unveiled with unprecedented detail the genomic and epigenomic patterns associated with cellular processes, thereby revolutionizing our understanding of biology. In recent years, a large number of laboratories have adopted these technologies, thanks in part to the steady decrease of the associated costs. However, working with NGS data inevitably creates issues in the management, storage, and analysis of large and complex datasets, which are often largely underestimated: for example, a medium-sized lab (10–15 scientists) could easily generate over 500 NGS samples per year, corresponding to 1–2 terabytes of raw data.

Standard analysis of NGS data can be divided into two steps. First, the raw sequencing reads need to be assembled or aligned to a reference genome: this process often requires substantial computing time and infrastructure and produces large output files, but it typically does not involve much hands-on time and can be standardized for a given data type; we will call these first steps "primary analyses." Second, biologically relevant information needs to be extracted from the assembled/aligned reads: this part of the analysis is strongly data-type-dependent and outputs small files, but it may involve multiple attempts using different tools (whose parameters need to be tuned), resulting in much more hands-on time and potential branching of the analysis flow; we will refer to these steps as "secondary analyses." An overview of this process is given in Figure 1.



Figure 1. Workflow management systems for NGS: overview and issues discussed in the text. A typical analysis workflow for NGS is presented, associated with both the corresponding metadata and optional additional external data. The workflow is linked to the corresponding issues discussed in the text.

We identified five issues to be addressed by labs generating and analyzing a large amount of NGS data, which we discuss below.

Issue 1: Structuring the raw data

Research groups and sequencing facilities using high-throughput sequencers will quickly have to deal with large numbers of samples; therefore, they will likely need a laboratory information management system (LIMS) to manage the production of NGS data. A LIMS can handle the information typically associated and submitted with the sequencing request. In addition, it can follow the processing of the sequencers up to the generation and archiving of the final sequencing reads. A LIMS typically relies on a database to keep track of all the steps and includes a graphical user interface (GUI) to process user requests and to take care of data maintenance. Specifically, a LIMS deals with general tasks such as quality control, data tracking, and de-multiplexing of reads, and it commonly adopts a structured tree of directories for the distribution of the final data to the user (typically unaligned reads as FASTQ files).

A number of recently developed solutions are available in this field. SMITH[1] offers a complete automatic system for handling NGS data on high performance computing clusters (HPCs) and can run various downstream workflows through the Galaxy workflow management system (WMS)[2][3], limiting the user interaction only to administrative tasks. MendeLIMS[4] is mainly focused on the management of clinical genome sequencing projects. WASP provides a basic management system that hosts automated workflows[5] for ChIP-Seq, RNA-Seq, miRNA-Seq, and Exome-Seq. SLIMS is a sample management tool that allows the creation of metadata information for genome-wide association studies.[6] The Galaxy platform offers the module Galaxy LIMS[7] providing full access to Galaxy analysis tools in addition to basic LIMS functions.

Issue 2: Monitoring the analysis flow

A LIMS allows raw NGS data to be complemented with storage locations and metadata information, guaranteeing complete traceability and making inconsistencies easy to pinpoint. The subsequent analysis of those data, at both the primary and secondary levels, commonly neglects the structure provided by the LIMS, severely affecting the robustness and reproducibility of the analyses and ultimately complicating the retrieval and sharing of the final results. Indeed, files are often renamed and/or relocated to non-standard locations, which can ultimately break the association between the output of the analysis and the original LIMS entries. In some cases, the lack of this link could be particularly problematic and would obscure batch effects or issues affecting specific samples. For example, read length is a parameter that is usually stored in an NGS LIMS but is generally not carried along the downstream analysis. Let us suppose that, during the sequencing, a sample was generated with a different read length. This sample could become an outlier due to this specific difference; yet, if the connection to this piece of information stored in the LIMS is lost, the cause of this peculiar behavior could not be easily traced. Therefore, while the extraction of biological information from NGS data requires the creation of new and possibly complex data structures, it is essential to avoid losing the structured order given by the LIMS, thus allowing the traceability of the performed analyses and defining standards for reproducibility.
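The read-length example can be illustrated with a minimal Python sketch. It assumes that LIMS entries are available as simple dictionaries; the field names `sample_id` and `read_length` are hypothetical and do not correspond to the schema of any specific LIMS. The function reports the samples whose read length differs from the most common one in the batch (if two lengths are equally common, the one encountered first is taken as the reference).

```python
from collections import Counter


def samples_with_unusual_read_length(lims_records):
    """Return the IDs of samples whose read length differs from the batch majority.

    Each record is a dict with the hypothetical keys 'sample_id' and 'read_length'.
    """
    lengths = Counter(record["read_length"] for record in lims_records)
    if not lengths:
        return []
    # The most frequent read length is taken as the reference for the batch.
    reference_length, _ = lengths.most_common(1)[0]
    return [
        record["sample_id"]
        for record in lims_records
        if record["read_length"] != reference_length
    ]


batch = [
    {"sample_id": "S0001", "read_length": 50},
    {"sample_id": "S0002", "read_length": 50},
    {"sample_id": "S0003", "read_length": 100},
]
outliers = samples_with_unusual_read_length(batch)
```

A check of this kind is only possible as long as the analysis output can still be mapped back to the LIMS entry of each sample.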

Importantly, the secondary data analysis often involves testing several combinations of tools and parameters, further inflating the required disk space and generating multiple outputs. Without a careful organization of the results and without tracking their association with the raw data, primary analysis, and associated metadata, the final results could very rapidly become difficult for collaborators or colleagues to interpret. Given enough time, this would likely also be the case for the person who performed the secondary analyses. The ability to monitor the flow of NGS data from their generation to preliminary and higher-level analyses then becomes critical for the dissemination and integration of the results of a single experiment in a larger community.
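One possible way of standardizing output locations is to derive every path from the LIMS sample identifier and an analysis identifier, so that no file is ever named by hand. The following Python sketch illustrates the idea; the directory layout, the function names, and the identifiers are assumptions made for illustration and are not the convention of any specific tool.

```python
from pathlib import PurePosixPath


def analysis_output_dir(root, analysis_type, lims_sample_id, analysis_id):
    """Build a standardized output directory that keeps the LIMS sample ID in the path."""
    for part in (analysis_type, lims_sample_id, analysis_id):
        # Reject components that would silently change the directory layout.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath(root) / analysis_type / lims_sample_id / analysis_id


def output_file(root, analysis_type, lims_sample_id, analysis_id, suffix):
    """Name the output file after the sample and the analysis, never by hand."""
    directory = analysis_output_dir(root, analysis_type, lims_sample_id, analysis_id)
    return directory / f"{lims_sample_id}.{analysis_id}.{suffix}"


# Two secondary analyses of the same sample with different parameters
# end up in separate, predictable locations.
first = output_file("/data/ngs", "secondary", "S0001", "peaks_v1", "bed")
second = output_file("/data/ngs", "secondary", "S0001", "peaks_v2", "bed")
```

Because the sample identifier is part of both the directory and the file name, an output file can be traced back to its LIMS entry even if it is copied elsewhere.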

Issue 3: Automatizing and documenting tools

Typically, a scientist with a quantitative background (often a bioinformatician) is responsible for the primary and secondary analysis of NGS data. This task is accomplished through the execution of a series of existing or custom-made pipelines, which may need to be manually changed to account for the peculiarities of each given experiment. If this process is not properly managed, it could rely on copying and pasting lines of code, followed by manual renaming and moving of script files. The automation of these processes is a key point for a standardized (yet flexible) analytical workflow and greatly reduces the possibility of committing errors, especially for routine tasks that would otherwise have to be manually repeated several times.

Automation is facilitated by the definition of modular, interconnected functions, which can be used to tailor the application of pipelines to the specific user's needs. Modularity makes pipelines flexible, efficient, and easy to maintain and is critical when contributions from multiple people are expected; moreover, the high turnover of Ph.D. students and postdocs implies frequent transfers of knowledge, which may result in a loss of critical information. In this regard, the use of an established version control system (e.g., Git[8] and Subversion[9]) makes it easy to reproduce old results, to retrieve and correct errors in the code, and to increase the productivity of collaborative software development projects. The modularity of software is also instrumental in promoting the adoption of parallel computation: given the growing field of cloud computing and multicore processors, having efficient pipelines that can distribute data and tasks across different parallel computing nodes and/or processors is a clear advantage and greatly reduces the amount of waiting time for the user.
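As an illustration of how modular steps can be documented automatically, the following Python sketch wraps each pipeline step in a function that records the step name, the sample, the parameters, and a checksum of the script that was run. All names (`run_step`, `make_record`, the record fields) are hypothetical; this is a sketch of the general idea under these assumptions, not the mechanism used by any of the tools discussed here.

```python
import hashlib
import json


def script_digest(script_text):
    """Checksum of the exact script that was run, so it can be identified later."""
    return hashlib.sha256(script_text.encode("utf-8")).hexdigest()


def make_record(step_name, sample_id, parameters, script_text):
    """Describe one executed step in a form that can be stored next to its output."""
    return {
        "step": step_name,
        "sample_id": sample_id,
        "parameters": dict(sorted(parameters.items())),
        "script_sha256": script_digest(script_text),
    }


def run_step(step_func, step_name, sample_id, parameters, script_text, log):
    """Run a modular step and append a traceability record to the log."""
    result = step_func(sample_id, **parameters)
    log.append(make_record(step_name, sample_id, parameters, script_text))
    return result


def dump_log(log):
    """Serialize the log as JSON, e.g., for storage alongside the results."""
    return json.dumps(log, indent=2, sort_keys=True)


# Hypothetical step: a stand-in for a real alignment function.
def fake_align(sample_id, mismatches=2):
    return f"{sample_id}:aligned:{mismatches}"


log = []
output = run_step(fake_align, "align", "S0001", {"mismatches": 3}, "echo align", log)
```

Keeping the scripts themselves under version control complements such records: the checksum identifies which script was run, while the repository history explains how it changed over time.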

Issue 4: Ease of use

In a typical scientific department or research group working with high-throughput sequencing technologies, the number of bioinformaticians in charge of the analysis of NGS data is considerably smaller than the number of their wet-lab counterparts. This either causes an excessive number of requests to the bioinformaticians or encourages wet-lab scientists to embark on NGS data analysis themselves. Even in the presence of consolidated and thoroughly tested pipelines, running the analysis requires being able to use command-line interfaces and some familiarity with Linux/Unix operating systems: these skills are typically not taught in biology courses. Setting up a GUI that offers access to the pipelines would strongly increase the ease of use of the analysis framework and open it to users without specific training in computer science. GUI-based systems could offer the possibility of tuning the parameters of the various implemented tools, implement only default parameters, or implement automatic choices of the optimal parameters based on the input data and metadata.[10] In any case, the analysis should ultimately provide simple, standardized diagnostics and clear figures and tables summarizing standard outputs, and should ideally highlight failed quality checks and inconsistencies automatically. These last features are particularly important when inexperienced users can control the parameters of the available tools.

Equipped with these options, the analysis framework can easily be used by wet-lab scientists with minimal training (especially the part concerning secondary analyses). This would in turn free up a substantial amount of time for bioinformaticians, who could be reallocated from repetitive tasks to more challenging and rewarding projects. Based on our experience, we believe this model is particularly effective for medium to large labs.
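The automatic highlighting of failed quality checks mentioned above can be sketched in a few lines of Python. The metric names and the minimum thresholds below are purely illustrative assumptions; appropriate metrics and cut-offs depend on the data type and the tools in use.

```python
def flag_failed_checks(sample_metrics, minimum_thresholds):
    """Return {sample_id: [names of failed metrics]} for samples below a minimum.

    A metric that is missing for a sample is reported as failed.
    """
    failures = {}
    for sample_id, metrics in sample_metrics.items():
        failed = sorted(
            name
            for name, minimum in minimum_thresholds.items()
            if metrics.get(name, float("-inf")) < minimum
        )
        if failed:
            failures[sample_id] = failed
    return failures


# Hypothetical metrics and thresholds, for illustration only.
thresholds = {"aligned_fraction": 0.7, "total_reads": 1_000_000}
metrics = {
    "S0001": {"aligned_fraction": 0.92, "total_reads": 25_000_000},
    "S0002": {"aligned_fraction": 0.41, "total_reads": 18_000_000},
    "S0003": {"total_reads": 500_000},
}
report = flag_failed_checks(metrics, thresholds)
```

A GUI could display such a report next to the standard figures and tables, so that users see the problematic samples without having to inspect log files.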

Issue 5: Data reproducibility

The analysis of high-throughput biological data, including NGS data, often relies on annotation data available in public databases. For example, the alignment of the reads requires the availability of FASTA files for a reference genome; the determination of absolute gene expression in RNA-seq experiments depends on having GTF files containing the structure of the constituents of transcriptional units (such as exons and coding sequences). Importantly, reference genomes and other annotation data are periodically updated, and tracking their versions becomes essential to ensure compatibility with future analyses. At the same time, multiple annotation data have to be consistent with a particular reference genome build. In addition, some metadata can be retrieved from alternative providers or generated based on different criteria (see for example transcript annotations based on RefSeq or ENSEMBL). As a result, analyses based on alternative sources of metadata will likely provide different results, even if matched to the correct reference genome.

For these reasons, it is fundamental to track the adopted resources and maintain compatible and updated annotation data. In this regard, projects such as Bioconductor[11] encourage the adoption of standard annotation packages as a reference for the community of scientists working in this field, ranging from annotation databases (packages of the TxDb series) to complete genome assemblies (BSgenome packages). Relying on those metadata packages helps ensure the reproducibility and comparability of the results.
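Tracking the adopted resources can be as simple as storing, for each analysis, a small manifest of the annotation data used and verifying that all entries refer to the same genome build. The Python sketch below illustrates this; the `AnnotationResource` fields and the `check_consistent_build` function are hypothetical, and the version strings are placeholders rather than real release numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationResource:
    """One annotation resource used in an analysis (hypothetical manifest entry)."""
    name: str
    kind: str          # e.g., "genome" or "transcripts"
    genome_build: str  # e.g., "hg19"
    source: str        # e.g., "UCSC" or "Ensembl"
    version: str       # placeholder here; record the real version in practice


def check_consistent_build(resources):
    """Return the shared genome build, or raise if the resources disagree."""
    builds = {resource.genome_build for resource in resources}
    if len(builds) > 1:
        raise ValueError(f"inconsistent genome builds: {sorted(builds)}")
    return next(iter(builds)) if builds else None


manifest = [
    AnnotationResource("BSgenome.Hsapiens.UCSC.hg19", "genome", "hg19", "UCSC", "x.y.z"),
    AnnotationResource("TxDb.Hsapiens.UCSC.hg19.knownGene", "transcripts", "hg19", "UCSC", "x.y.z"),
]
build = check_consistent_build(manifest)
```

Storing such a manifest with the results makes it possible to state later exactly which genome build and annotation source a given output was based on.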

Similarly, the use of controlled vocabularies, ideally derived from biomedical ontologies, would prevent ambiguities and help properly organize the metadata. This can help create an unambiguous description of the type of treatment that a given cell or tissue type in a specific disease state was subjected to. The use of these resources is often encouraged, for example when publishing NGS data in large-scale repositories, and standards such as MIAME are available.[12] Nevertheless, these good practice recommendations are only sporadically applied and, as a result, querying databases of high-throughput biological data can be cumbersome. The same issue can affect the metadata contained in the LIMS and WMSs: therefore, it can be extremely useful to provide metadata specific for the primary and secondary analyses, containing a minimal description of the experimental design and the performed analysis (for example, describing the rationale for comparing specific conditions within an analysis of differential expression), thereby making the results more intelligible to other scientists. Notably, the availability of proper metadata for the samples and their analysis can greatly facilitate the export of the output files to public repositories.
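A controlled vocabulary can be enforced at the time metadata are entered. The following Python sketch checks sample metadata against a small set of allowed terms; the fields and terms shown are illustrative assumptions, whereas in practice they would be derived from an established biomedical ontology.

```python
# Hypothetical controlled vocabulary, for illustration only.
CONTROLLED_VOCABULARY = {
    "assay": {"RNA-Seq", "ChIP-Seq", "miRNA-Seq"},
    "organism": {"Homo sapiens", "Mus musculus"},
}


def validate_metadata(metadata, vocabulary=CONTROLLED_VOCABULARY):
    """Return a list of problems; an empty list means the metadata are valid."""
    problems = []
    for field, allowed_terms in sorted(vocabulary.items()):
        value = metadata.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed_terms:
            problems.append(f"{field}: {value!r} is not a controlled term")
    return problems


valid = validate_metadata({"assay": "RNA-Seq", "organism": "Homo sapiens"})
invalid = validate_metadata({"assay": "rnaseq"})
```

Rejecting free-text variants such as "rnaseq" at submission time keeps later queries across samples simple, since each concept is always spelled the same way.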

Workflow management systems

Workflow management systems try to cope with Issues 2–5 and are essential for efficiently managing the analysis of large NGS datasets.

Galaxy[2][3] is a popular data analysis framework that handles NGS data and allows designing articulated workflows. Its latest release provides a simplified framework for integrating and/or designing new analysis pipelines; however, despite also being intended for non-programmer users, its use for complex tasks is mainly restricted to skilled bioinformaticians. While Galaxy addressed Issue 1 with the development of Galaxy LIMS (which supports request submissions, de-multiplexing, and delivery of the sample files), the output of the pipelines is not standardized and depends on the user. The high level of flexibility in modifying the parameters considerably reduces the automation offered by this resource, limiting its usage to an audience with both a good knowledge of the tools applied and good IT skills.

Chipster[13] is another popular WMS with a user-friendly interface that can analyze several types of NGS data (such as RNA-, miRNA-, ChIP-, and whole-genome sequencing), and save and share automatic workflows with other users. Chipster is not designed to be integrated with a LIMS; the users have to import their data manually. Workflows can be easily set up in a few minutes and then repeated on several samples with little hands-on work. The lack of a LIMS-like system makes this tool suitable only for laboratories with a limited number of NGS datasets (each sample has to be loaded separately), and setting up the pipelines requires a good knowledge of the tools the user is going to use.

Two recent tools, Omics Pipe[14] and QuickNGS[15], were developed with the main goal of making NGS analyses available to a broader audience. However, Omics Pipe is strongly oriented toward IT specialists and bioinformaticians who need to analyze a large number of datasets and want to automate data analysis pipelines for multiple NGS technologies (RNA-, Exome-, miRNA-, ChIP-, and whole-genome sequencing). The Python modules at the core of Omics Pipe make it easily extendable and allow users with Unix command-line experience to execute the supported pipelines, which can be debugged and corrected thanks to the built-in version control. All these features combined make it very difficult for a biologist with average computational expertise to use Omics Pipe.

On the other hand, QuickNGS allows performing most of the common operations on NGS data, such as primary analyses (filtering and alignment of the sequence reads to the reference genome) and secondary analyses (differential gene expression and differential exon usage for RNA sequencing data) with very limited prior knowledge and hands-on time. The results of the workflows are easily accessible to users through the generation of standardized spreadsheet files, plots, and web reports. The adoption of annotation databases such as BioMart or genome sequence and annotations from Ensembl makes the results highly reproducible. The web interface is extremely simple and makes it possible to follow the operations performed on the samples, but it does not allow parameter adjustment. Finally, QuickNGS does not offer integration with a LIMS, and the pipelines require the samples to be associated with metadata information.

Despite the availability of various WMSs, a recent review highlighted that these solutions have a fundamental limitation: users have to switch among multiple GUIs to execute a complete NGS analysis.[16]

References

  1. Venco, F.; Vaskin, Y.; Ceol, A.; Muller, H. (2014). "SMITH: a LIMS for handling next-generation sequencing workflows". BMC Bioinformatics 15 (Suppl 14): S3. doi:10.1186/1471-2105-15-S14-S3. PMC PMC4255740. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4255740. 
  2. Blankenberg, D.; Taylor, J.; Schenck, I. et al. (2007). "A framework for collaborative analysis of ENCODE data: making large-scale analyses biologist-friendly". Genome Research 17 (6): 960–964. doi:10.1101/gr.5578007. PMC PMC1891355. PMID 17568012. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1891355. 
  3. Boekel, J.; Chilton, J.M.; Cooke, I.R. et al. (2015). "Multi-omic data analysis using Galaxy". Nature Biotechnology 33 (2): 137–139. doi:10.1038/nbt.3134. PMID 25658277. 
  4. Grimes, Susan M.; Ji, Hanlee P. (2014). "MendeLIMS: A web-based laboratory information management system for clinical genome sequencing". BMC Bioinformatics 15: 290. doi:10.1186/1471-2105-15-290. PMC PMC4155081. PMID 25159034. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4155081. 
  5. McLellan, A.S.; Dubin, R.; Jing, Q. et al. (2012). "The Wasp System: An open source environment for managing and analyzing genomic data". Genomics 100 (6): 345-51. doi:10.1016/j.ygeno.2012.08.005. PMID 22944616. 
  6. Van Rossum, T.; Tripp, B.; Daley, D. (2010). "SLIMS: A user-friendly sample operations and inventory management system for genotyping labs". Bioinformatics 26 (14): 1808-1810. doi:10.1093/bioinformatics/btq271. PMC PMC2894515. PMID 20513665. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2894515. 
  7. Scholtalbers, J.; Rössler, J.; Sorn, P. et al. (2013). "Galaxy LIMS for next-generation sequencing". Bioinformatics 29 (9): 1233-1234. doi:10.1093/bioinformatics/btt115. PMID 23479349. 
  8. "Git: Local Branching on the cheap". https://git-scm.com/. 
  9. "Apache Subversion". Apache Software Foundation. https://subversion.apache.org/. 
  10. Pelizzola, M.; Pavelka, N.; Foti, M.; Ricciardi-Castagnoli, P. (2006). "AMDA: An R package for the automated microarray data analysis". BMC Bioinformatics 7: 335. doi:10.1186/1471-2105-7-335. PMC PMC1534071. PMID 16824223. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1534071. 
  11. Gentleman, R.C.; Carey, V.J.; Bates, D.M. et al. (2004). "Bioconductor: Open software development for computational biology and bioinformatics". Genome Biology 5 (10): R80. doi:10.1186/gb-2004-5-10-r80. PMC PMC545600. PMID 15461798. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC545600. 
  12. Brazma, A.; Hingamp, P.; Quackenbush, J. et al. (2001). "Minimum information about a microarray experiment (MIAME)-toward standards for microarray data". Nature Genetics 29 (4): 365-71. doi:10.1038/ng1201-365. PMID 11726920. 
  13. Kallio, M.A.; Tuimala, J.T.; Hupponen, T. et al. (2011). "Chipster: User-friendly analysis software for microarray and other high-throughput data". BMC Genomics 12: 507. doi:10.1186/1471-2164-12-507. PMC PMC3215701. PMID 21999641. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3215701. 
  14. Fisch, K.M.; Meißner, T.; Gioia, L. et al. (2015). "Omics Pipe: A community-based framework for reproducible multi-omics data analysis". Bioinformatics 31 (11): 1724-1728. doi:10.1093/bioinformatics/btv061. PMC PMC4443682. PMID 25637560. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4443682. 
  15. Wagle, P.; Nikolić, M.; Frommolt, P. (2015). "QuickNGS elevates next-generation sequencing data analysis to a new level of automation". BMC Genomics 16: 487. doi:10.1186/s12864-015-1695-x. PMC PMC4486389. PMID 26126663. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4486389. 
  16. Poplawski, A.; Marini, F.; Hess, M. et al. (2016). "Systematically evaluating interfaces for RNA-seq analysis from a life scientist perspective". Briefings in Bioinformatics 17 (2): 213–23. doi:10.1093/bib/bbv036. PMID 26108229. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.