Journal:Use of application containers and workflows for genomic data analysis

From LIMSWiki
Revision as of 22:39, 16 January 2017 by Shawndouglas (talk | contribs) (Created stub. Saving and editing now.)
Full article title: Use of application containers and workflows for genomic data analysis
Journal: Journal of Pathology Informatics
Author(s): Schulz, Wade L.; Durant, Thomas; Siddon, Alexa J.; Torres, Richard
Author affiliation(s): Yale University School of Medicine, VA Connecticut Healthcare System
Year published: 2016
Volume and issue: 7
Page(s): 53
DOI: 10.4103/2153-3539.197197
ISSN: 2153-3539
Distribution license: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported


Background: The rapid acquisition of biological data and the development of computationally intensive analyses have led to a need for novel approaches to software deployment. In particular, the complexity of common analytic tools for genomics makes them difficult to deploy and decreases the reproducibility of computational experiments.

Methods: Recent application virtualization technologies, such as Docker, allow developers and bioinformaticians to isolate these applications and deploy secure, scalable platforms that have the potential to dramatically increase the efficiency of big data processing.

Results: While limitations exist, this study demonstrates a successful implementation of a pipeline with several discrete software applications for the analysis of next-generation sequencing (NGS) data.

Conclusions: With this approach, we significantly reduced the amount of time needed to perform clonal analysis from NGS data in acute myeloid leukemia.

Keywords: Big data, bioinformatics workflow, containerization, genomics


The amount of data available for research is growing at an exponential rate. The recent push for open data has also rapidly increased the availability of biomedical datasets for secondary analysis. Examples include the Yale Open Data Access project[1], a repository of clinical trial data, and The Cancer Genome Atlas (TCGA)[2], a project that makes genomic data accessible to researchers after initial findings are released. While these datasets promote ongoing research, the ability to efficiently store, move, and analyze such large repositories is often a bottleneck to analysis.[3]

In addition to the massive growth in volume and availability, novel analyses, including advanced statistical methods and machine learning, often require significant resources for efficient processing. One example of this in biomedical research is the analysis of next-generation sequencing (NGS) data. NGS is also known as massively parallel or high-throughput sequencing, as it simultaneously sequences many fragments of DNA, thereby producing enormous amounts of information. These datasets often require several preprocessing steps followed by detailed analysis. These analyses are resource intensive, and the reproducibility of computational experiments using these data is often limited by the complexity of system and software configuration.[4] Some application frameworks have improved the reproducibility of individual applications and analysis pipelines[5][6], but significant work remains to increase this reliability, particularly for experiments performed in resource-limited environments or on computational clusters.

The deployment of complex computational systems is not unique to bioinformatics. As such, there has been significant progress in building virtualization layers for operating systems and, more recently, software applications.[7][8] A current example is the Docker platform (Docker, San Francisco, CA, U.S.A.), which allows for the creation and configuration of software containers for deployment on a range of systems.[9][10] While the use of these technologies has limitations, it also has the potential to improve the usability of many software applications in computational biology. As such, several studies and initiatives have begun to focus on the use of Docker in bioinformatics and computer science research.[11][12][13] In this paper, we demonstrate the potential benefits of containerized applications and application workflows for computational genomics research.
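As a rough illustration of the container idea described above, and not the configuration used in this study, the following Python sketch assembles a `docker run` command that executes a single analysis tool inside a container. The image name, tool arguments, and directory paths are hypothetical placeholders; only the `docker run` options `--rm`, `-v`, and `-w` are taken from the Docker command line.

```python
import shlex


def docker_run_command(image, tool_args, host_dir, container_dir="/data"):
    """Build an argument list that runs one tool inside a Docker container."""
    return [
        "docker", "run",
        "--rm",                                # remove the container once the tool exits
        "-v", f"{host_dir}:{container_dir}",   # bind-mount the host data directory
        "-w", container_dir,                   # use the mounted directory as working directory
        image,
        *tool_args,
    ]


# Hypothetical image, arguments, and paths, for illustration only.
cmd = docker_run_command(
    image="example/aligner:1.0",
    tool_args=["align", "reference.fa", "sample.fastq"],
    host_dir="/srv/ngs/run01",
)

# Print the command as a single shell-quoted string; it is not executed here.
print(shlex.join(cmd))
```

Because the tool and its dependencies live in the image, the same command could in principle be issued on any host with Docker installed, which is the portability property the paragraph above refers to. Passing the list to `subprocess.run(cmd, check=True)` would execute it on such a host.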


  1. Krumholz, H.M.; Waldstreicher, J. (2016). "The Yale Open Data Access (YODA) Project--A mechanism for data sharing". New England Journal of Medicine 375 (5): 403–405. doi:10.1056/NEJMp1607342. PMID 27518657. 
  2. Collins, F.S.; Barker, A.D. (2007). "Mapping the cancer genome: Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies". Scientific American 296 (3): 50–7. PMID 17348159. 
  3. Fan, J.; Han, F.; Liu, H. (2014). "Challenges of big data analysis". National Science Review 1 (2): 293–314. doi:10.1093/nsr/nwt032. PMC PMC4236847. PMID 25419469. 
  4. Nekrutenko, A.; Taylor, J. (2012). "Next-generation sequencing data interpretation: Enhancing reproducibility and accessibility". Nature Reviews Genetics 13 (9): 667–72. doi:10.1038/nrg3305. PMID 22898652. 
  5. Blankenberg, D.; Von Kuster, G.; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. (2010). "Galaxy: a web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology 19 (Unit 19.10.1–21). doi:10.1002/0471142727.mb1910s89. PMC PMC4264107. PMID 20069535. 
  6. Hatakeyama, M.; Opitz, L.; Russo, G.; Qi, W.; Schlapbach, R.; Rehrauer, H. (2016). "SUSHI: An exquisite recipe for fully documented, reproducible and reusable NGS data analysis". BMC Bioinformatics 17: 228. doi:10.1186/s12859-016-1104-8. 


This presentation is faithful to the original, with only a few minor changes to formatting. In some cases important information was missing from the references, and that information was added.