{{ombox
| type      = notice
| style    = width: 960px;
| text      = This is my primary sandbox page, where I play with features and test MediaWiki code. If you wish to leave a comment for me, please see [[User_talk:Shawndouglas|my discussion page]] instead.<p></p>
}}

==Sandbox begins below==
{{Infobox journal article
|name        =
|image        =
|alt          = <!-- Alternative text for images -->
|caption      =
|title_full  = systemPipeR: NGS workflow and report generation environment
|journal      = ''BMC Bioinformatics''
|authors      = Backman, Tyler W.H.; Girke, Thomas
|affiliations = University of California, Riverside
|contact      = Email: thomas dot girke at ucr dot edu
|editors      =
|pub_year    = 2016
|vol_iss      = '''17'''
|pages        = 388
|doi          = [https://doi.org/10.1186/s12859-016-1241-0 10.1186/s12859-016-1241-0]
|issn        = 1471-2105
|license      = [http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International]
|website      = [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1241-0 https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1241-0]
|download    = [https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1241-0 https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1241-0] (PDF)
}}
{{ombox
| type      = content
| style    = width: 500px;
| text      = This article should not be considered complete until this message box has been removed. This is a work in progress.
}}
==Abstract==
'''Background''': Next-generation sequencing (NGS) has revolutionized how research is carried out in many areas of biology and medicine. However, the analysis of NGS data remains a major obstacle to the efficient utilization of the technology, as it requires complex multi-step processing of big data, demanding considerable computational expertise from users. While substantial effort has been invested in the development of software dedicated to the individual analysis steps of NGS experiments, insufficient resources are currently available for integrating the individual software components within the widely used R/Bioconductor environment into automated [[Workflow|workflows]] capable of running the analysis of most types of NGS applications from start to finish in a time-efficient and reproducible manner.
'''Results''': To address this need, we have developed the R/Bioconductor package systemPipeR. It is an extensible environment for both building and running end-to-end analysis workflows with automated report generation for a wide range of NGS applications. Its unique features include a uniform workflow interface across different NGS applications, automated report generation, and support for running both R and command-line software on local computers and computer clusters. A flexible sample annotation infrastructure efficiently handles complex sample sets and experimental designs. To simplify the analysis of widely used NGS applications, the package provides pre-configured workflows and reporting templates for RNA-Seq, ChIP-Seq, VAR-Seq, and Ribo-Seq. Additional workflow templates will be provided in the future.
'''Conclusions''': systemPipeR accelerates the extraction of reproducible analysis results from NGS experiments. By combining the capabilities of many R/Bioconductor and command-line tools, it makes efficient use of existing software resources without limiting the user to a set of predefined methods or environments. systemPipeR is freely available for all common operating systems from Bioconductor (http://bioconductor.org/packages/devel/systemPipeR).
'''Keywords''': analysis workflow, next generation sequencing (NGS), Ribo-Seq, ChIP-Seq, RNA-Seq, VAR-Seq
==Background==
By allowing scientists to rapidly sequence and quantify DNA and RNA molecules, next-generation sequencing (NGS) technology has transformed biology into one of the most data-intensive research disciplines. In the past, experiments were performed on a gene-by-gene basis, while NGS has introduced an age where it has become routine to sequence entire transcriptomes, genomes, or epigenomes rather than their isolated parts of interest. It will soon be possible to conduct these experiments on large numbers of single-cell samples<ref name="KaliskySingle11">{{cite journal |title=Single-cell genomics |journal=Nature Methods |author=Kalisky, T.; Quake, S.R. |volume=8 |issue=4 |pages=311–4 |year=2011 |doi=10.1038/nmeth0411-311 |pmid=21451520}}</ref><ref name="TrapnellTheDynamics14">{{cite journal |title=The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells |journal=Nature Biotechnology |author=Trapnell, C.; Cacchiarelli, D.; Grimsby, J. et al. |volume=32 |issue=4 |pages=381–86 |year=2014 |doi=10.1038/nbt.2859 |pmid=24658644 |pmc=PMC4122333}}</ref> for a wide range of time points, treatments, and genetic backgrounds to study biological systems with greater resolution and precision. Sequencing the genetic material of each individual within entire populations of organisms of the same species or genus will enable the study of adaptation processes<ref name="Lindblad-TohAHigh11">{{cite journal |title=A high-resolution map of human evolutionary constraint using 29 mammals |journal=Nature |author=Lindblad-Toh, K.; Garber, M.; Zuk, O. et al. 
|volume=478 |issue=7370 |pages=476–82 |year=2011 |doi=10.1038/nature10530 |pmid=21993624 |pmc=PMC3207357}}</ref>, disease progression, and micro-evolution in real time.<ref name="Kato-MaedaUseOf13">{{cite journal |title=Use of whole genome sequencing to determine the microevolution of Mycobacterium tuberculosis during an outbreak |journal=PLoS One |author=Kato-Maeda, M.; Ho, C.; Passarelli, B. et al. |volume=8 |issue=3 |pages=e58235 |year=2013 |doi=10.1371/journal.pone.0058235 |pmid=23472164 |pmc=PMC3589338}}</ref> This technological shift empowers researchers to address questions at a [[Genomics|genome-wide]] scale, for example by profiling the mRNA, miRNA, and DNA methylation states of a large set of biological samples in parallel.<ref name="HoltTheNew08">{{cite journal |title=The new paradigm of flow cell sequencing |journal=Genome Research |author=Holt, R.A.; Jones, S.J. |volume=18 |issue=6 |pages=839-46 |year=2008 |doi=10.1101/gr.073262.107 |pmid=18519653}}</ref>
The success of NGS-driven research has led to a data explosion of increasing size and complexity, making it more time-consuming and challenging for researchers to extract knowledge from their experiments. Rapid processing of the results is essential to test, refine, and formulate new hypotheses for designing follow-up experiments. As a result, biologists now have to dedicate substantial time to [[data analysis]] tasks, effectively training themselves as [[Genome informatics|genome data scientists]] rather than focusing on experimentation as they did in the past.
In recent years, a considerable number of algorithms, statistical methods, and software tools has been developed to perform the individual analysis steps of different NGS applications. These include short read pre-processors, aligners, variant and peak callers, as well as statistical methods for the analysis of genomic regions that are differentially expressed<ref name="RobinsonEdgeR10">{{cite journal |title=edgeR: A Bioconductor package for differential expression analysis of digital gene expression data |journal=Bioinformatics |author=Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. |volume=26 |issue=1 |pages=139–40 |year=2010 |doi=10.1093/bioinformatics/btp616 |pmid=19910308 |pmc=PMC2796818}}</ref><ref name="LoveModerated14">{{cite journal |title=Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 |journal=Genome Biology |author=Love, M.I.; Huber, W.; Anders, S. |volume=15 |issue=12 |pages=550 |year=2014 |doi=10.1186/s13059-014-0550-8 |pmid=25516281 |pmc=PMC4302049}}</ref>, bound<ref name="KharchenkoDesign08">{{cite journal |title=Design and analysis of ChIP-seq experiments for DNA-binding proteins |journal=Nature Biotechnology |author=Kharchenko, P.V.; Tolstorukov, M.Y.; Park, P.J. |volume=26 |issue=12 |pages=1351–9 |year=2008 |doi=10.1038/nbt.1508 |pmid=19029915 |pmc=PMC2597701}}</ref>, or methylated.<ref name="AkalinMethyl12">{{cite journal |title=methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles |journal=Genome Biology |author=Akalin, A.; Kormaksson, M.; Li, S. et al. |volume=13 |issue=10 |pages=R87 |year=2012 |doi=10.1186/gb-2012-13-10-r87 |pmid=23034086 |pmc=PMC3491415}}</ref><ref name="HuberOrch15">{{cite journal |title=Orchestrating high-throughput genomic analysis with Bioconductor |journal=Nature Methods |author=Huber, W.; Carey, V.J.; Gentleman, R. et al. 
|volume=12 |issue=2 |pages=115–21 |year=2015 |doi=10.1038/nmeth.3252 |pmid=25633503 |pmc=PMC4509590}}</ref> Also essential are tools for processing short read alignments<ref name="LiTheSeq09">{{cite journal |title=The Sequence Alignment/Map format and SAMtools |journal=Bioinformatics |author=Li, H.; Handsaker, B.; Wysoker, A. et al. |volume=25 |issue=16 |pages=2078–9 |year=2009 |doi=10.1093/bioinformatics/btp352 |pmid=19505943 |pmc=PMC2723002}}</ref>, genomic intervals<ref name="LawrenceSoft13">{{cite journal |title=Software for computing and annotating genomic ranges |journal=PLoS Computational Biology |author=Lawrence, M.; Huber, W.; Pagès, H. et al. |volume=9 |issue=8 |page=e1003118 |year=2013 |doi=10.1371/journal.pcbi.1003118 |pmid=23950696 |pmc=PMC3738458}}</ref><ref name="QuinlanBED10">{{cite journal |title=BEDTools: a flexible suite of utilities for comparing genomic features |journal=Bioinformatics |author=Quinlan, A.R.; Hall, I.M. |volume=26 |issue=6 |pages=841-2 |year=2010 |doi=10.1093/bioinformatics/btq033 |pmid=20110278 |pmc=PMC2832824}}</ref>, and annotations.<ref name="DurinckBioMart">{{cite journal |title=BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis |journal=Bioinformatics |author=Durinck, S.; Moreau, Y.; Kasprzyk, A. |volume=21 |issue=16 |pages=3439-40 |year=2005 |doi=10.1093/bioinformatics/bti525 |pmid=16082012}}</ref> However, most data analysis routines of NGS applications are very complex, involving multiple software tools for their many processing steps. As a result, there is a great need for flexible software environments connecting the individual software components to automated workflows in order to perform complex genome-wide analyses in an efficient and reproducible manner. 
While many workflow management resources exist<ref name="GoecksGalaxy10">{{cite journal |title=Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences |journal=Genome Biology |author=Goecks, J.; Nekrutenko, A.; Taylor, J.; The Galaxy Team |year=2010 |volume=11 |issue=8 |pages=R86 |doi=10.1186/gb-2010-11-8-r86 |pmid=20738864 |pmc=PMC2945788}}</ref><ref name="KösterSnake12">{{cite journal |title=Snakemake: A scalable bioinformatics workflow engine |journal=Bioinformatics |author=Köster, J.; Rahmann, S. |volume=28 |issue=19 |pages=2520-2 |year=2012 |doi=10.1093/bioinformatics/bts480 |pmid=22908215}}</ref><ref name="WolstencroftTheTav13">{{cite journal |title=The Taverna workflow suite: Designing and executing workflows of Web Services on the desktop, web or in the cloud |journal=Nucleic Acids Research |author=Wolstencroft, K.; Haines, R.; Fellows, D. et al. |volume=41 |issue=W1 |pages=W557-W561 |year=2013 |doi=10.1093/nar/gkt328 |pmid=23640334 |pmc=PMC3692062}}</ref><ref name="GuimeraBcbio12">{{cite journal |title=bcbio-nextgen: Automated, distributed next-gen sequencing pipeline |journal=EMBnet.journal |author=Guimera, R.V. |volume=17 |issue=B |page=30 |year=2012 |doi=10.14806/ej.17.B.286}}</ref><ref name="WarrScient12">{{cite journal |title=Scientific workflow systems: Pipeline Pilot and KNIME |journal=Journal of Computer-aided Molecular Design |author=Warr, W.A. |volume=26 |issue=7 |page=801–4 |year=2012 |doi=10.1007/s10822-012-9577-7 |pmid=22644661 |pmc=PMC3414708}}</ref><ref name="GoodstadtRuffus10">{{cite journal |title=Ruffus: A lightweight Python library for computational pipelines |journal=Bioinformatics |author=Goodstadt, L. 
|volume=26 |issue=21 |pages=2778-9 |year=2010 |doi=10.1093/bioinformatics/btq524 |pmid=20847218}}</ref><ref name="StroppWorkflows12">{{cite journal |title=Workflows for microarray data processing in the Kepler environment |journal=BMC Bioinformatics |author=Stropp, T.; McPhillips, T.; Ludäscher, B.; Bieda, M. |volume=13 |pages=102 |year=2012 |doi=10.1186/1471-2105-13-102 |pmid=22594911 |pmc=PMC3431220}}</ref><ref name="McLellanTheWasp12">{{cite journal |title=The Wasp System: An open source environment for managing and analyzing genomic data |journal=Genomics |author=McLellan, A.S.; Dubin, R.; Jing, Q. et al. |volume=100 |issue=6 |page=345-51 |year=2012 |doi=10.1016/j.ygeno.2012.08.005 |pmid=22944616}}</ref><ref name="WolfingerVienna15">{{cite journal |title=ViennaNGS: A toolbox for building efficient next-generation sequencing analysis pipelines |journal=F1000Research |author=Wolfinger, M.T.; Fallmann, J.; Eggenhofer, F.; Amman, F. |volume=4 |page=50 |year=2015 |doi=10.12688/f1000research.6157.2 |pmid=26236465 |pmc=PMC4513691}}</ref><ref name="ReidLaunch14">{{cite journal |title=Launching genomics into the cloud: Deployment of Mercury, a next generation sequence analysis pipeline |journal=BMC Bioinformatics |author=Reid, J.G.; Carroll, A.; Veeraraghavan, N. et al. |volume=15 |page=30 |year=2014 |doi=10.1186/1471-2105-15-30 |pmid=24475911 |pmc=PMC3922167}}</ref> for a variety of data analysis programming languages (for details see below), few general-purpose NGS workflow solutions are currently available for the popular [[R (programming language)|R programming language]]. R and the affiliated Bioconductor environment provide a substantial number of widely used tools with a large user base in this area.<ref name="HuberOrch15" /> Thus, a workflow framework for federating NGS applications from within R will have many benefits for experimental and computational scientists who use R for NGS data analysis.
To address this need, we designed systemPipeR as a Bioconductor package for building and running workflows for most NGS applications, with support for integrating a wide array of command-line and R/Bioconductor software.
==Implementation==
===Environment===
systemPipeR has been implemented as an open-source Bioconductor package using the R programming language for statistical computing and graphics. R was chosen as the core development platform for systemPipeR for the following reasons. (i) R is currently one of the most popular statistical data analysis and programming environments in [[bioinformatics]]. (ii) Its external language bindings support the implementation of computationally time-consuming analysis steps in high-performance languages such as C/C++. (iii) It supports advanced parallel computation on multi-core machines and computer clusters. (iv) A well-developed infrastructure interfaces R with several other popular programming languages such as Python. (v) R provides advanced graphical and visualization utilities for scientific computing. (vi) It offers access to a vast landscape of statistical and machine learning tools. (vii) Its integration with the Bioconductor project promotes reusability of genomics software components, while also making efficient use of a large number of existing NGS packages that are well tested and widely used by the community. To support long-term reproducibility of analysis outcomes, systemPipeR is also distributed as a Docker image of Bioconductor’s sequencing division. Docker containers provide an efficient solution for packaging complex software together with all its system dependencies to ensure it will run the same in the future across different environments, including different operating systems and [[Cloud computing|cloud-based]] solutions.
===Workflow design===
systemPipeR workflows (Fig. 1) can be run from start to finish with a single command, or stepwise in interactive mode from the R console. New workflows are constructed, or existing ones modified, by connecting so-called SYSargs workflow control modules (R S4 class). Each SYSargs module contains the instructions needed for processing a set of input files with a specific command-line or R software, as well as the paths to the corresponding outputs generated by a specific NGS tool such as a read preprocessor (trimmed/filtered FASTQ files), aligner (SAM/BAM files), read counter, variant caller (VCF/BCF files), peak caller (BED/WIG files), or statistical function. Typically, the only input the user needs to provide for running workflows is a single tabular targets file containing the paths to the initial sample input files (e.g., FASTQ) along with sample labels and, if appropriate, biological replicate and contrast information for controlling differential abundance analyses (e.g., gene expression). Downstream derivatives of these targets files along with the corresponding SYSargs instances (see Fig. 1) are created automatically within each workflow.
[[File:Fig1 Backman BMCBio2016 17.gif|750px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="750px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' Workflow steps with input/output file operations are controlled by SYSargs objects. Each SYSargs instance is constructed from a targets and a param file. The only input required from the user is the initial targets file. Subsequent instances are created automatically. Any number of predefined or custom workflow steps is supported.</blockquote>
|-
|}
|}
The parameters required for running command-line software are provided by parameter (param) files, described below. For R-based workflow steps, param files are not required but can be useful for operations importing and/or exporting sample-level files. This modular design has several advantages. First, it provides a high level of flexibility for designing workflows, such as allowing the user to start workflows from the very beginning or anywhere in-between (e.g. FASTQ or BAM level). Second, it is straightforward to add custom workflow steps without requiring computational expert knowledge from users. Workflows can also have any number of steps including branch points. Lastly, it also minimizes errors as all input and output files are registered, and sample labels specified in the initial targets file will be consistently used throughout all workflow results, including plots, tables, and workflow reports.
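The relationship between a targets file, a param file, and the resulting per-sample commands can be illustrated with a minimal Python sketch. This is not systemPipeR's R implementation or its actual file format: the column names, the <code>&lt;FileName&gt;</code>/<code>&lt;SampleName&gt;</code> placeholder syntax, the software name <code>aligner</code>, and the helper functions are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical targets table: one row per sample (column names are assumptions).
TARGETS_TSV = (
    "FileName\tSampleName\tFactor\n"
    "data/s1.fastq\tM1A\tM1\n"
    "data/s2.fastq\tA1A\tA1\n"
)

# Hypothetical param table: flag/value pairs; <Column> placeholders are filled per sample.
PARAM_TSV = (
    "Flag\tValue\n"
    "-p\t4\n"
    "-U\t<FileName>\n"
    "-S\tresults/<SampleName>.sam\n"
)


def read_table(text):
    """Parse a tab-delimited table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_commands(software, targets, params):
    """Return {sample label: argument list}, one command per targets row."""
    commands = {}
    for row in targets:
        args = [software]
        for param in params:
            value = param["Value"]
            # Substitute every <Column> placeholder with this sample's cell value.
            for column, cell in row.items():
                value = value.replace(f"<{column}>", cell)
            args += [param["Flag"], value]
        commands[row["SampleName"]] = args
    return commands


commands = build_commands("aligner", read_table(TARGETS_TSV), read_table(PARAM_TSV))
for sample, args in commands.items():
    print(sample, " ".join(args))
```

Because every command is keyed by the sample label from the targets table, the same labels can be carried through to output paths, plots, and tables, which is the consistency benefit described above.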
===Command-line software support===
An important feature of systemPipeR is support for running command-line software directly from R on both single machines and computer clusters. This offers several advantages, such as seamless integration of most command-line software available in the NGS field with the extensive genome analysis resources provided by R/Bioconductor. The user interface for running command-line software has been generalized as a single function for ease of use, and only one additional command is needed to run the same tool in parallel mode on a computer cluster (see below). Examples of command-line software used by systemPipeR’s preconfigured workflow templates (see below) include the aligners BWA-MEM<ref name="LiAligning13">{{cite web |url=https://arxiv.org/abs/1303.3997 |title=Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM |author=Li, H. |publisher=Cornell University Library |work=arXiv.org |date=26 May 2013}}</ref>, Bowtie2<ref name="LangmeadFast12">{{cite journal |title=Fast gapped-read alignment with Bowtie 2 |journal=Nature Methods |author=Langmead, B.; Salzberg, S.L. |volume=9 |issue=4 |page=357-9 |year=2012 |doi=10.1038/nmeth.1923 |pmid=22388286 |pmc=PMC3322381}}</ref>, TopHat2<ref name="KimTopHat2_13">{{cite journal |title=TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions |journal=Genome Biology |author=Kim, D.; Pertea, G.; Trapnell, C. et al. |volume=14 |pages=R36 |year=2013 |doi=10.1186/gb-2013-14-4-r36 |pmid=23618408 |pmc=PMC4053844}}</ref>, HISAT2<ref name="KimHISAT15">{{cite journal |title=HISAT: A fast spliced aligner with low memory requirements |journal=Nature Methods |author=Kim, D.; Langmead, B.; Salzberg, S.L. 
|volume=12 |issue=4 |pages=357-60 |year=2015 |doi=10.1038/nmeth.3317 |pmid=25751142 |pmc=PMC4655817}}</ref>, as well as the peak/variant callers MACS<ref name="ZhangModel08">{{cite journal |title=Model-based analysis of ChIP-Seq (MACS) |journal=Genome Biology |author=Zhang, Y.; Liu, T.; Meyer, C.A. et al. |volume=9 |issue=9 |pages=R137 |year=2008 |doi=10.1186/gb-2008-9-9-r137 |pmid=18798982 |pmc=PMC2592715}}</ref>, GATK<ref name="McKennaTheGenome10">{{cite journal |title=The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data |journal=Genome Research |author=McKenna, A.; Hanna, M.; Banks, E. et al. |volume=20 |pages=1297-1303 |year=2010 |doi=10.1101/gr.107524.110}}</ref>, and BCFtools.<ref name="LiTheSeq09" /> Support for additional command-line NGS software can be added by simply providing the argument settings of a chosen software in a tabular param file. If appropriate, new param files can be permanently included in the package to share them with the community. Functionality for creating param files automatically will be provided in the future. This will allow users to create new param instances simply by providing an example of the command-line syntax of a chosen software tool.
Major advantages of running command-line software from within systemPipeR include: a uniform sample management infrastructure within and across workflows; integration of BatchJobs’<ref name="BischlBatch12">{{cite journal |title=BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments |journal=Journal of Statistical Software |author=Bischl, B.; Lang, M.; Mersmann, O. et al. |volume=64 |issue=11 |pages=1–25 |year=2012 |doi=10.18637/jss.v064.i11}}</ref> efficient error management infrastructure for job submissions on computer clusters; the simplicity of restarting failed processes; as well as seamless addition of new samples (e.g., FASTQ or BAM files). In case of a restart, the system will skip the analysis steps of already completed samples and only perform the analysis of the missing ones. If required, any workflow step can be rerun on demand for all or a subset of samples. When submitting command-line software to computer clusters, BatchJobs monitors the status of job submissions and alerts users of exceptions, while recording warning and error messages for each process in a log directory with a database-like structure that is accessible from within R or the command-line. This organization helps to diagnose and resolve errors.
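The restart behavior described above (skip samples that are already complete, process only the missing ones) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not how systemPipeR or BatchJobs implement it: completion is judged solely by whether the expected output file exists, and <code>run_missing</code> and <code>fake_aligner</code> are hypothetical names.

```python
import tempfile
from pathlib import Path


def run_missing(outputs, run_step):
    """Run run_step(sample, path) only for samples whose output file is absent.

    outputs maps sample labels to expected output paths.
    Returns the labels that were processed in this call.
    """
    processed = []
    for sample, path in outputs.items():
        if Path(path).exists():
            continue  # completed in an earlier run, so skip it
        run_step(sample, Path(path))
        processed.append(sample)
    return processed


def fake_aligner(sample, path):
    """Stand-in for a real tool: writes a placeholder output file."""
    path.write_text(f"alignments for {sample}\n")


with tempfile.TemporaryDirectory() as workdir:
    outputs = {s: Path(workdir) / f"{s}.sam" for s in ("M1A", "M1B", "A1A")}
    fake_aligner("M1A", outputs["M1A"])  # pretend M1A finished before a failure
    first_run = run_missing(outputs, fake_aligner)   # processes M1B and A1A only
    second_run = run_missing(outputs, fake_aligner)  # nothing left to do

print(first_run)
print(second_run)
```

An existence check alone would treat a partially written file as complete, so a production system needs additional status tracking, such as the per-process logs mentioned above.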
===Parallel evaluation===
The processing time for NGS experiments can be greatly reduced by making use of parallel evaluation across several CPU cores on single machines, or multiple nodes of computer clusters and cloud-based systems. systemPipeR simplifies these parallelization tasks and, by providing the option to run workflows in serial or parallel mode, creates no limitations for users who lack access to high-performance computing (HPC) resources. The parallelization functionalities available in systemPipeR are largely based on existing and well-maintained R packages, mainly BatchJobs and BiocParallel.<ref name="BischlBatch12" /> By making use of cluster template files, most schedulers and queuing systems are also supported (e.g., Torque, Sun Grid Engine, Slurm). If required, entire workflows can be executed in parallel mode by issuing a single command, while simultaneously generating a detailed analysis report (for details see below). If sufficient parallel computer resources are available, systemPipeR can complete the entire analysis workflow of several complex NGS experiments, each containing large numbers of FASTQ files, within hours rather than days or weeks, as can be the case for non-parallelized workflows.
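The idea of exposing the same per-sample step in either serial or parallel mode can be sketched with Python's standard library. This illustrates the general pattern only and does not represent BatchJobs, BiocParallel, or any cluster scheduler; <code>run_samples</code> and the toy <code>count_gc</code> task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(read):
    """Toy per-sample task: count G and C bases in one sequence."""
    return sum(base in "GC" for base in read)


def run_samples(task, samples, workers=1):
    """Apply task to every sample; serial if workers == 1, otherwise in a pool."""
    if workers == 1:
        return {name: task(data) for name, data in samples.items()}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, so labels stay matched.
        results = pool.map(task, samples.values())
        return dict(zip(samples.keys(), results))


samples = {"M1A": "ACGTGC", "M1B": "TTTT", "A1A": "GGCC"}
serial = run_samples(count_gc, samples)
parallel = run_samples(count_gc, samples, workers=3)
print(serial == parallel)
```

A thread pool is used here because external command-line tools run as separate processes; for CPU-bound work done in Python itself, a process pool would be the usual substitute.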
===Automated analysis reports===
systemPipeR generates automated analysis reports with knitr and R markdown.<ref name="XieDynamic13">{{cite book |title=Dynamic Documents with R and knitr |author=Xie, Y. |publisher=Chapman and Hall/CRC |edition=1st |pages=216 |year=2013 |isbn=9781482203530}}</ref> These modern reporting environments integrate R code with LaTeX or Markdown. During the evaluation of the R code, reports are dynamically generated in PDF or HTML format. A caching system allows selected workflow reporting steps to be re-executed without repeating unnecessary components. In this way, one can generate reports that resemble a research paper, where user-generated text is combined with analysis results. This includes support for citations, autogenerated bibliographies, code chunks with syntax highlighting, and inline evaluation of variables to update text content. Data components in a report such as tables and figures are updated automatically when rebuilding the document and/or rerunning workflows partially or entirely.
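Inline evaluation of variables, where report text is regenerated from current analysis results, can be illustrated with a short Python sketch. It uses the standard library's <code>string.Template</code> as a stand-in and is not how knitr or R markdown work internally; the report sentence and the <code>render</code> helper are hypothetical.

```python
from string import Template

# Report text with placeholders that are filled from analysis results.
REPORT = Template("Of $total reads, $aligned ($percent%) aligned to the reference.")


def render(stats):
    """Fill the report sentence from a dict with 'total' and 'aligned' counts."""
    percent = round(100 * stats["aligned"] / stats["total"], 1)
    return REPORT.substitute(total=stats["total"], aligned=stats["aligned"], percent=percent)


print(render({"total": 2000, "aligned": 1500}))
```

Rerunning <code>render</code> with updated counts changes the sentence without editing the text by hand, which is the property that keeps report prose in step with rerun workflows.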
==Results and discussion==
===Overview===
systemPipeR provides utilities for building and running NGS analysis workflows. To adapt to community standards, widely used R/Bioconductor packages are integrated where possible. This includes the Bioconductor packages ShortRead, Biostrings, and Rsamtools for processing sequence and alignment files<ref name="MorganShortRead09">{{cite journal |title=ShortRead: A bioconductor package for input, quality assessment and exploration of high-throughput sequence data |journal=Bioinformatics |author=Morgan, M.; Anders, S.; Lawrence, M. et al. |volume=25 |issue=19 |pages=2607-8 |year=2009 |doi=10.1093/bioinformatics/btp450 |pmid=19654119 |pmc=PMC2752612}}</ref>; GenomicRanges, GenomicAlignments, and GenomicFeatures for handling genomic range operations, read counting, and annotation data<ref name="LawrenceSoft13" />; edgeR and DESeq2 for differential abundance analysis<ref name="RobinsonEdgeR10" /><ref name="LoveModerated14" />; and VariantTools and VariantAnnotation for filtering and annotating genome variants.<ref name="ObenchainVariant14">{{cite journal |title=VariantAnnotation: a Bioconductor package for exploration and annotation of genetic variants |journal=Bioinformatics |author=Obenchain, V.; Lawrence, M.; Carey, V. et al. |volume=30 |issue=14 |pages=2076-8 |year=2014 |doi=10.1093/bioinformatics/btu168 |pmid=24681907 |pmc=PMC4080743}}</ref> If necessary, one can substitute most of these packages with alternative R or command-line tools.
Because many NGS applications share overlapping analysis needs (Fig. 2 a), certain workflow steps are conceptualized in systemPipeR by a single generic function, with support for application-specific parameter settings (Table 1). For instance, most NGS applications involve a short read alignment step (see Fig. 2 b), but with very distinct mapping requirements, such as splice junction awareness and variant tolerance for RNA-Seq and VAR-Seq, respectively. To simplify their execution for the user, the different aligners can be run with the same runCommandline function, where the software and its parameter settings are specified in the corresponding SYSargs instance (see above and Fig. 1).
[[File:Fig2 Backman BMCBio2016 17.gif|567px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="567px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 2.''' Workflow Steps and Graphical Features. Relevant workflow steps of several NGS applications ('''a''') are illustrated in the form of a simplified flowchart ('''b'''). Examples of systemPipeR’s functionalities are given under ('''c''') including: (1) eight different plots for summarizing the quality and diversity of short reads provided as FASTQ files; (2) strand-specific read count summaries for all feature types provided by a genome annotation; (3) summary plots of read depth coverage for any number of transcripts with nucleotide resolution upstream/downstream of their start and stop codons, as well as binned coverage for their coding regions; (4) enumeration of up- and down-regulated DEGs for user defined sample comparisons; (5) similarity clustering of sample profiles; (6) 2-5-way Venn diagrams for DEGs, peak and variant sets; (7) gene-wise clustering with a wide range of algorithms; and (8) support for plotting read pileups and variants in the context of genome annotations along with genome browser support.</blockquote>
|-
|}
|}
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|'''Table 1.''' Selected functions. The table lists a subset of over 50 methods and functions defined by systemPipeR. Usage instructions are provided in the corresponding help pages and vignettes of the package.
|-
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Function name
  ! style="background-color:#e2e2e2; padding-left:10px; padding-right:10px;"|Description
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>genWorkenvir</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Generates workflow templates provided by systemPipeRdata helper package
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>systemArgs</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Constructs SYSargs workflow control module (S4 object) from targets and param files
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>runCommandline</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Executes command-line software on samples and parameters specified in SYSargs
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>clusterRun</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Runs command-line software in parallel mode on a computer cluster
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>preprocessReads</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Filtering and/or trimming of short reads using predefined or custom parameters
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>seeFASTQ/seeFASTQplot</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Generates quality reports for any number of FASTQ files
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>alignStats</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Generates alignment statistics, such as total number of reads and alignment frequency
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>run_edgeR/run_DESeq2</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Runs edgeR or DESeq2 for any number of pairwise sample comparisons
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>filterDEGs</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Filters and plots DEG results based on user-defined parameters
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>overLapper/vennPlot</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Computation of Venn intersects for 2-20 or more samples and 2-5 way Venn diagrams
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>GOCluster_Report</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|GO term enrichment analysis for large numbers of gene sets
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>variantReport</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Generates a variant report containing genomic annotations and confidence statistics
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>predORF</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Prediction of short open reading frames in DNA sequences
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>featuretypeCounts</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Computes and plots read distribution for many feature types at once
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|<tt>featureCoverage</tt>
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Computes and plots read depth coverage from many transcripts
|-
|}
|}
===Workflow templates===
systemPipeR also provides end-to-end workflow templates for RNA-Seq, Ribo-Seq, ChIP-Seq, and VAR-Seq analysis. A detailed vignette (manual) is provided for each workflow, while an overview vignette introduces the general design concepts. Templates for additional NGS applications will be made available in the future. To test workflows quickly or design new ones from existing templates, users can run a single command (<tt>genWorkenvir</tt>) to generate workflow instances fully populated with the sample data and parameter files required for running a chosen workflow. The corresponding sample data are provided by the affiliated data package systemPipeRdata, also available from Bioconductor. To illustrate the utilities of systemPipeR’s workflow templates, a case study has been included as Additional file 1 that guides the reader through the most important steps of a sample workflow. A typical gene-level RNA-Seq analysis was chosen here because it is currently one of the most widely used applications in the NGS field.
===Add-on tools===
In addition to providing a framework for running NGS analysis workflows, systemPipeR includes many functions and methods that expand and enhance its workflows. The following gives selected examples of these utilities (also illustrated in Fig. 2 c and Table 1). A read pre-processor function (<tt>preprocessReads</tt>) addresses the often very sophisticated quality filtering and adaptor trimming needs of specialized NGS applications such as Ribo-Seq or smallRNA-Seq. The functions <tt>seeFastq</tt> and <tt>seeFastqPlot</tt> generate and plot detailed quality reports for FASTQ files (Fig. 2 c1). These reports are easy to generate and designed to facilitate the visual inspection of large numbers of FASTQ files in a single report. The <tt>featuretypeCounts</tt> function computes and plots the distribution of reads across all features available in a genome annotation rather than just a single one (Fig. 2 c2). The <tt>featureCoverage</tt> function generates from genome-level alignments read depth coverage summaries for all or a subset of transcripts with nucleotide resolution upstream/downstream of their start and stop codons, as well as binned coverage for their coding regions (Fig. 2 c3). Additional utilities include functions to automate the analysis of differentially expressed genes (DEGs) with edgeR or DESeq2 (Fig. 2 c4), to compute Venn intersects for large numbers of sample sets (e.g. 2-20 or as many as available memory allows) with plotting functionalities for 2-5 way Venn diagrams (Fig. 2 c6), and to run gene set enrichment analyses in batch mode on large numbers of gene sets. 
The modular design of the systemPipeR environment allows users to easily substitute any of these built-in tools with alternative R-based or command-line software, such as using FastQC<ref name="BBFastQC">{{cite web |url=http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |title=FastQC |publisher=Babraham Bioinformatics |accessdate=15 September 2015}}</ref>, FASTX-Toolkit<ref name="HannonFASTX">{{cite web |url=http://hannonlab.cshl.edu/fastx_toolkit/index.html |title=FASTX-Toolkit |publisher=Hannon Laboratory |accessdate=17 September 2015}}</ref>, or MultiQC<ref name="EwelsMultiQC16">{{cite journal |title=MultiQC: summarize analysis results for multiple tools and samples in a single report |journal=Bioinformatics |author=Ewels, P.; Magnusson, M.; Lundin, S.; Käller, M. |volume=32 |issue=19 |pages=3047–3048 |year=2016 |doi=10.1093/bioinformatics/btw354 |pmid=27312411 |pmc=PMC5039924}}</ref> for quality reporting, read trimming or result aggregation, respectively.
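The Venn intersect computation mentioned above can be illustrated with a short, language-neutral sketch. The Python code below is not the implementation of <tt>overLapper</tt>; the function name <tt>venn_intersects</tt> and the example gene identifiers are hypothetical. It assigns every item to the one Venn region defined by the exact combination of sets that contain it, so the cost grows with the number of items rather than with the number of possible set combinations.

```python
from collections import defaultdict


def venn_intersects(sets):
    """Group items by the exact combination of input sets that contain them.

    `sets` maps a label to an iterable of items. The result maps a sorted
    tuple of labels to the items found in exactly those sets and no others.
    """
    # Record, for every item, the labels of all sets it occurs in.
    membership = defaultdict(set)
    for label, items in sets.items():
        for item in set(items):
            membership[item].add(label)

    # Items sharing the same label combination form one Venn region.
    regions = defaultdict(set)
    for item, labels in membership.items():
        regions[tuple(sorted(labels))].add(item)
    return dict(regions)


# Hypothetical DEG sets from three sample comparisons.
deg_sets = {
    "A": {"g1", "g2", "g3"},
    "B": {"g2", "g3", "g4"},
    "C": {"g3", "g5"},
}
regions = venn_intersects(deg_sets)
for labels, genes in sorted(regions.items()):
    print("-".join(labels), sorted(genes))
```

Only non-empty regions are returned, which keeps the result small even when 20 or more sets are compared.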
===Performance and scalability===
systemPipeR has been optimized to run workflows in a time- and memory-efficient manner even on very large read sets from complex genomes (e.g., mammalian genomes). This is achieved by making heavy use of indexing, file streaming, and parallelization functionalities. For instance, users can limit the RAM requirements of several workflow steps by specifying the maximum number of reads or alignments to stream into memory at any time. This enables analysis of very large files with tens of GBs of storage space on systems with limited RAM resources, making it possible to run systemPipeR workflows even on laptops or smaller workstations, provided they have the required software installed and enough disk space available for storing large NGS input and result files. The processing time of non-parallelized analysis steps depends on the time performance of the specific software tool chosen for a workflow step. For instance, in the RNA-Seq workflow described under Additional file 1, the alignment step will run on a single sample (FASTQ file) with the native time performance of the chosen aligner Bowtie2/Tophat2. Using the much faster HISAT2 aligner instead would accelerate the alignment step proportionally to the time improvements provided by this aligner without the need for additional parallel computer resources.<ref name="KimHISAT15" />
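The idea of capping the number of reads held in memory can be sketched as follows. This is a minimal Python illustration of chunked file streaming, not systemPipeR code; the function name <tt>stream_fastq</tt> and the parameter <tt>max_reads</tt> are hypothetical, and the sketch assumes the common four-line FASTQ record layout.

```python
import io
from itertools import islice


def stream_fastq(handle, max_reads):
    """Yield lists of at most `max_reads` FASTQ records from an open handle.

    Each record is a (header, sequence, separator, quality) tuple. Only one
    chunk is held in memory at a time, so peak memory depends on `max_reads`
    rather than on the file size. Assumes four lines per record.
    """
    if max_reads < 1:
        raise ValueError("max_reads must be at least 1")
    while True:
        lines = list(islice(handle, 4 * max_reads))
        if not lines:
            return
        if len(lines) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        yield [
            tuple(line.rstrip("\n") for line in lines[i:i + 4])
            for i in range(0, len(lines), 4)
        ]


# Small in-memory stand-in for a FASTQ file with five reads.
demo = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
chunk_sizes = [len(chunk) for chunk in stream_fastq(io.StringIO(demo), max_reads=2)]
total_reads = sum(chunk_sizes)
print(chunk_sizes, total_reads)
```

With a real file, the same generator would be fed an open file handle, and each chunk would be filtered or counted before the next one is read.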
On a computer cluster, parallelized systemPipeR workflows scale nearly linearly in time with the number of sample files (i.e., FASTQ files) since every step can be parallelized at the sample level. In practice, this means that the analysis of 100 FASTQ files can be accelerated 10- or 100-fold by using 10 or 100 CPU cores, respectively, instead of a single CPU core. For example, the RNA-Seq workflow in Additional file 1 can process 100 FASTQ files, each with 30–40 million reads from a mammalian genome, in six to eight hours using 100 CPU cores (CPU Model: AMD 6376, 2.3 GHz) and a maximum RAM requirement of less than 10 GB per node. Since the alignment step with Bowtie2/Tophat2 accounts for most of the compute time of the entire workflow, the use of faster RNA-Seq aligners, such as Rsubread or HISAT2, can reduce the compute time to less than three hours. With comparable parallel computer resources available, systemPipeR can complete the end-to-end analysis of several complex NGS experiments, each containing 50–100 FASTQ files, in less than a day rather than the many days or weeks common in non-parallelized workflows.
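The scaling argument can be made concrete with a small calculation. The Python sketch below is an idealized model, not a benchmark: it assumes each sample occupies one CPU core, ignores scheduling and I/O overhead, and uses seven hours per sample as an assumed midpoint of the six to eight hours quoted above. The function name <tt>ideal_wall_time</tt> is hypothetical.

```python
import math


def ideal_wall_time(n_samples, hours_per_sample, n_cores):
    """Idealized wall-clock hours for sample-level parallelization.

    Samples are processed in waves of `n_cores` at a time, and each sample
    is assumed to take `hours_per_sample` on a single core.
    """
    waves = math.ceil(n_samples / n_cores)
    return waves * hours_per_sample


serial = ideal_wall_time(100, 7.0, 1)
for cores in (1, 10, 30, 100):
    hours = ideal_wall_time(100, 7.0, cores)
    print(f"{cores:>3} cores: {hours:6.1f} h, speedup {serial / hours:.0f}x")
```

Under these assumptions, 10 and 100 cores give 10-fold and 100-fold speedups, while 30 cores give only a 25-fold speedup because the last of four waves runs partly idle. This is one reason the scaling is nearly, rather than exactly, linear.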
===Need for an R-based NGS workflow environment===
Several related software tools with NGS workflow functionality are available. These include Galaxy<ref name="GoecksGalaxy10" /><ref name="AfganHarness11">{{cite journal |title=Harnessing cloud computing with Galaxy Cloud |journal=Nature Biotechnology |author=Afgan, E.; Baker, D.; Coraor, N. et al. |volume=29 |issue=11 |pages=972-4 |year=2011 |doi=10.1038/nbt.2028 |pmid=22068528 |pmc=PMC3868438}}</ref>, Snakemake<ref name="KösterSnake12" />, Taverna<ref name="WolstencroftTheTav13" />, BioBlend<ref name="SloggettBioBlend13">{{cite journal |title=BioBlend: automating pipeline analyses within Galaxy and CloudMan |journal=Bioinformatics |author=Sloggett, C.; Goonasekera, N.; Afgan, E. |volume=29 |issue=13 |pages=1685-6 |year=2013 |doi=10.1093/bioinformatics/btt199 |pmid=23630176 |pmc=PMC4288140}}</ref>, bcbio-nextgen<ref name="GuimeraBcbio12" />, Knime<ref name="WarrScient12" />, Ruffus<ref name="GoodstadtRuffus10" />, Kepler<ref name="StroppWorkflows12" />, Wasp<ref name="McLellanTheWasp12" />, ViennaNGS<ref name="WolfingerVienna15" />, Mercury<ref name="ReidLaunch14" />, RAP<ref name="D'AntonioRAP15">{{cite journal |title=RAP: RNA-Seq Analysis Pipeline, a new cloud-based NGS web application |journal=BMC Genomics |author=D'Antonio, M.; D'Onorio De Meo, P.; Pallocca, M. et al. |volume=16 |pages=S3 |year=2015 |doi=10.1186/1471-2164-16-S6-S3 |pmid=26046471 |pmc=PMC4461013}}</ref>, and LONI<ref name="TorriNext12">{{cite journal |title=Next generation sequence analysis and computational genomics using graphical pipeline workflows |journal=Genes |author=Torri, F.; Dinov, I.D.; Zamanyan, A. et al. |volume=3 |issue=3 |pages=545–75 |year=2012 |doi=10.3390/genes3030545 |pmid=23139896 |pmc=PMC3490498}}</ref> among others. 
Additionally, general purpose utilities for workflow management and design are provided by Rabix<ref name="SBGRabix">{{cite web |url=http://rabix.io/ |title=Rabix |publisher=Seven Bridges Genomics, Inc}}</ref> and WDL.<ref name="BI_WDL">{{cite web |url=https://github.com/broadinstitute/wdl |title=broadinstitute/wdl |author=Broad Institute |work=GitHub |accessdate=16 September 2015}}</ref>
These tools provide infrastructure for streamlining the analysis of NGS data in a variety of data analysis environments and computer languages. However, only limited resources are available for designing and running analysis workflows for a wide range of NGS applications directly from within R, as is possible with systemPipeR. One of the few exceptions is QuasR.<ref name="GaidatzisQuasR15">{{cite journal |title=QuasR: Quantification and annotation of short reads in R |journal=Bioinformatics |author=Gaidatzis, D.; Lerch, A.; Hahne, F. et al. |volume=31 |issue=7 |pages=1130–2 |year=2015 |doi=10.1093/bioinformatics/btu781 |pmid=25417205 |pmc=PMC4382904}}</ref> This Bioconductor package supports the initial analysis steps of several NGS applications, but it lacks an interface to integrate external command-line software and functionalities to build new workflows. Other existing R/Bioconductor resources for analyzing NGS data address the needs in this area only partially. For instance, many of them are limited to certain NGS applications; cover only a subset of the processing steps required for complete workflows; do not support command-line software; or lack workflow design functionalities for different NGS applications. systemPipeR has been designed to address these requirements. However, it is important to mention here that well-established community workflow environments like Galaxy provide several additional features not available in systemPipeR. A small selection of them includes: (i) a web interface to support non-expert users who are not familiar with data analysis programming environments like R; (ii) support for a wider range of data types outside of the NGS field; (iii) a well-established infrastructure and community for archiving and sharing workflow protocols; and (iv) support for additional reporting technologies such as IPython notebooks.
To take advantage of this powerful infrastructure, Galaxy compatible versions of systemPipeR’s NGS workflows will be released in the future. This will allow biologists to run them from an easy-to-use web interface, while also being able to access additional functionalities available in Galaxy’s large ecosystem of analysis tools.
==Conclusion==
The systemPipeR package unites R/Bioconductor resources with external command-line software to standardize and automate the analysis of a wide range of NGS applications. Its functionalities reduce the complexity and time required to translate NGS data into interpretable research results, while a built-in reporting feature improves reproducibility. The environment provides sufficient flexibility to choose the optimal software for each step in complex NGS workflows, customize workflows, and design new workflows. Pre-configured workflow templates are included for several NGS applications. Templates for additional NGS applications are under development and will be added to the package in the near future.
==Availability and requirements==
'''Project name''': systemPipeR workflow environment
'''Project home page''': https://bioconductor.org/packages/systemPipeR/
'''Archived version''': systemPipeR
'''Operating system(s)''': Platform-independent
'''Programming language''': R
'''Other requirements''': R version ≥3.2, Bioconductor version ≥3.2
'''License''': Artistic-2.0
'''Any restrictions to use by non-academics''': none
==Abbreviations==
'''BAM''': Binary version of sequence alignment map format
'''ChIP-Seq''': Chromatin immunoprecipitation sequencing
'''DEG''': Differentially expressed genes
'''FASTQ''': short read sequence file format
'''NGS''': Next generation sequencing
'''Ribo-Seq''': NGS profiling of mRNA populations bound to ribosomes
'''RNA-Seq''': NGS profiling of mRNA
'''SAM''': Sequence alignment map format
'''VAR-Seq''': NGS-based variant detection
==Declarations==
===Acknowledgements===
We acknowledge the Bioconductor core team and community for providing valuable input for developing systemPipeR.
===Funding===
This work was supported by grants from the National Science Foundation (ABI-0957099, MCB-1021969, IOS-1546879), the National Institutes of Health (U24AG051129, R01-AI36959), and the National Institute of Food and Agriculture (2011-68004-30154).
===Authors’ contributions===
TB and TG conceived the idea for systemPipeR. TG developed the methods, implemented the R package, and wrote the article. Both authors read and approved the final manuscript.
===Competing interests===
The authors declare that they have no competing interests.
===Additional files===
[https://static-content.springer.com/esm/art%3A10.1186%2Fs12859-016-1241-0/MediaObjects/12859_2016_1241_MOESM1_ESM.pdf Additional file 1]: RNA-Seq Workflow Example. Case study to illustrate the utilities of systemPipeR using an RNA-Seq workflow as example. (PDF 89 kb)
==References==
{{Reflist|colwidth=30em}}
==Notes==
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original URL to Rabix was dead, and it was replaced with a current one for this version.
<!--Place all category tags here-->
[[Category:LIMSwiki journal articles (added in 2018)]]
[[Category:LIMSwiki journal articles (all)]]
[[Category:LIMSwiki journal articles on bioinformatics]]
[[Category:LIMSwiki journal articles on software]]

Latest revision as of 17:47, 1 February 2022

Sandbox begins below