Difference between revisions of "User:Shawndouglas/Sandbox"

From LIMSWiki

Revision as of 20:10, 20 August 2018

Sandbox begins below

Full article title: systemPipeR: NGS workflow and report generation environment
Journal: BMC Bioinformatics
Author(s): Backman, Tyler W.H.; Girke, Thomas
Author affiliation(s): University of California, Riverside
Primary contact: Email: thomas dot girke at ucr dot edu
Year published: 2016
Volume and issue: 17
Page(s): 388
DOI: 10.1186/s12859-016-1241-0
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International
Website: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1241-0
Download: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1241-0 (PDF)

Abstract

Background: Next-generation sequencing (NGS) has revolutionized how research is carried out in many areas of biology and medicine. However, the analysis of NGS data remains a major obstacle to the efficient utilization of the technology, as it requires complex multi-step processing of big data, demanding considerable computational expertise from users. While substantial effort has been invested in the development of software dedicated to the individual analysis steps of NGS experiments, insufficient resources are currently available for integrating the individual software components within the widely used R/Bioconductor environment into automated workflows capable of running the analysis of most types of NGS applications from start to finish in a time-efficient and reproducible manner.

Results: To address this need, we have developed the R/Bioconductor package systemPipeR. It is an extensible environment for both building and running end-to-end analysis workflows with automated report generation for a wide range of NGS applications. Its unique features include a uniform workflow interface across different NGS applications, automated report generation, and support for running both R and command-line software on local computers and computer clusters. A flexible sample annotation infrastructure efficiently handles complex sample sets and experimental designs. To simplify the analysis of widely used NGS applications, the package provides pre-configured workflows and reporting templates for RNA-Seq, ChIP-Seq, VAR-Seq, and Ribo-Seq. Additional workflow templates will be provided in the future.

Conclusions: systemPipeR accelerates the extraction of reproducible analysis results from NGS experiments. By combining the capabilities of many R/Bioconductor and command-line tools, it makes efficient use of existing software resources without limiting the user to a set of predefined methods or environments. systemPipeR is freely available for all common operating systems from Bioconductor (http://bioconductor.org/packages/devel/systemPipeR).

Keywords: analysis workflow, next generation sequencing (NGS), Ribo-Seq, ChIP-Seq, RNA-Seq, VAR-Seq

Background

By allowing scientists to rapidly sequence and quantify DNA and RNA molecules, next-generation sequencing (NGS) technology has transformed biology into one of the most data-intensive research disciplines. In the past, experiments were performed on a gene-by-gene basis, whereas NGS has introduced an age in which it has become routine to sequence entire transcriptomes, genomes, or epigenomes rather than only their isolated parts of interest. It will soon be possible to conduct these experiments on large numbers of single-cell samples[1][2] for a wide range of time points, treatments, and genetic backgrounds to study biological systems with greater resolution and precision. Sequencing the genetic material of each individual within entire populations of organisms of the same species or genus will enable the study of adaptation processes[3], disease progression, and micro-evolution in real time.[4] This technological shift empowers researchers to address questions at a genome-wide scale, for example by profiling the mRNA, miRNA, and DNA methylation states of a large set of biological samples in parallel.[5]

The success of NGS-driven research has led to an explosion of data of increasing size and complexity, making it more time-consuming and challenging for researchers to extract knowledge from their experiments. Rapid processing of the results is essential to test, refine, and formulate new hypotheses for designing follow-up experiments. As a result, biologists now have to dedicate substantial time to data analysis tasks, effectively training themselves as genome data scientists rather than focusing on experimentation as they did in the past.

In recent years, a considerable number of algorithms, statistical methods, and software tools have been developed to perform the individual analysis steps of different NGS applications. These include short read pre-processors, aligners, variant and peak callers, as well as statistical methods for the analysis of genomic regions that are differentially expressed[6][7], bound[8], or methylated.[9][10]

Also essential are tools for processing short read alignments [11], genomic intervals [12, 13], and annotations [14]. However, most data analysis routines of NGS applications are very complex, involving multiple software tools for their many processing steps. As a result, there is a great need for flexible software environments that connect the individual software components into automated workflows in order to perform complex genome-wide analyses in an efficient and reproducible manner. While many workflow management resources exist [15, 16, 17, 18, 19, 20, 21, 22, 23, 24] for a variety of data analysis programming languages (for details see below), few adequate general-purpose NGS workflow solutions are currently available for the popular R programming language. R and the affiliated Bioconductor environment provide a substantial number of widely used tools with a large user base in this area [10]. Thus, a workflow framework for federating NGS applications from within R will have many benefits for experimental and computational scientists who use R for NGS data analysis.

References

  1. Kalisky, T.; Quake, S.R. (2011). "Single-cell genomics". Nature Methods 8 (4): 311–4. doi:10.1038/nmeth0411-311. PMID 21451520. 
  2. Trapnell, C.; Cacchiarelli, D.; Grimsby, J. et al. (2014). "The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells". Nature Biotechnology 32 (4): 381–86. doi:10.1038/nbt.2859. PMC PMC4122333. PMID 24658644. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4122333. 
  3. Lindblad-Toh, K.; Garber, M.; Zuk, O. et al. (2011). "A high-resolution map of human evolutionary constraint using 29 mammals". Nature 478 (7370): 476–82. doi:10.1038/nature10530. PMC PMC3207357. PMID 21993624. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3207357. 
  4. Kato-Maeda, M.; Ho, C.; Passarelli, B. et al. (2013). "Use of whole genome sequencing to determine the microevolution of Mycobacterium tuberculosis during an outbreak". PLoS One 8 (3): e58235. doi:10.1371/journal.pone.0058235. PMC PMC3589338. PMID 23472164. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3589338. 
  5. Holt, R.A.; Jones, S.J. (2008). "The new paradigm of flow cell sequencing". Genome Research 18 (6): 839–46. doi:10.1101/gr.073262.107. PMID 18519653. 
  6. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. (2010). "edgeR: A Bioconductor package for differential expression analysis of digital gene expression data". Bioinformatics 26 (1): 139–40. doi:10.1093/bioinformatics/btp616. PMC PMC2796818. PMID 19910308. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796818. 
  7. Love, M.I.; Huber, W.; Anders, S. (2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2". Genome Biology 15 (12): 550. doi:10.1186/s13059-014-0550-8. PMC PMC4302049. PMID 25516281. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4302049. 
  8. Kharchenko, P.V.; Tolstorukov, M.Y.; Park, P.J. (2008). "Design and analysis of ChIP-seq experiments for DNA-binding proteins". Nature Biotechnology 26 (12): 1351–9. doi:10.1038/nbt.1508. PMC PMC2597701. PMID 19029915. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2597701. 
  9. Akalin, A.; Kormaksson, M.; Li, S. et al. (2012). "methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles". Genome Biology 13 (10): R87. doi:10.1186/gb-2012-13-10-r87. PMC PMC3491415. PMID 23034086. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3491415. 
  10. Huber, W.; Carey, V.J.; Gentleman, R. et al. (2015). "Orchestrating high-throughput genomic analysis with Bioconductor". Nature Methods 12 (2): 115–21. doi:10.1038/nmeth.3252. PMC PMC4509590. PMID 25633503. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4509590. 

Notes

This presentation is faithful to the original, with only a few minor changes to formatting. Where important information was missing from the references, that information was added.