Journal:Closha: Bioinformatics workflow system for the analysis of massive sequencing data

Full article title: Closha: Bioinformatics workflow system for the analysis of massive sequencing data
Journal: BMC Bioinformatics
Author(s): Ko, GunHwan; Kim, Pan-Gyu; Yoon, Jongcheol; Han, Gukhee; Park, Seong-Jin; Song, Wangho; Lee, Byungwook
Author affiliation(s): Korean BioInformation Center
Primary contact: Email: bulee at kribb dot re dot kr
Year published: 2018
Volume and issue: 19(Suppl 1)
Page(s): 43
DOI: 10.1186/s12859-018-2019-3
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International
Website: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2019-3
Download: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-018-2019-3 (PDF)

Abstract

Background: While next-generation sequencing (NGS) costs have fallen in recent years, the cost and complexity of computation remain substantial obstacles to the use of NGS in biomedical care and genomic research. The rapidly increasing amounts of data produced by new high-throughput methods have made data processing infeasible without automated pipelines. Integrating data and analytic resources into workflow systems addresses this problem by simplifying the task of data analysis.

Results: To address this challenge, we developed a cloud-based workflow management system, Closha, to provide fast and cost-effective analysis of massive genomic data. We implemented complex workflows that make optimal use of high-performance computing clusters. Closha allows users to create multi-step analyses using drag-and-drop functionality and to modify the parameters of pipeline tools. Users can also import Galaxy pipelines into Closha. Closha is a hybrid system that enables users to run both traditional analysis tools and MapReduce-based big data analysis programs simultaneously in a single pipeline. Thus, the execution of analytics algorithms can be parallelized, speeding up the whole process. We also developed a high-speed data transmission solution, KoDS, to transmit large amounts of data at a fast rate. KoDS achieves file transfer speeds of up to 10 times those of normal FTP and HTTP. The computer hardware for Closha comprises 660 CPU cores and 800 TB of disk storage, enabling 500 jobs to run at the same time.

Conclusions: Closha is a scalable, cost-effective, and publicly available web service for large-scale genomic data analysis. Closha supports the reliable and highly scalable execution of sequencing analysis workflows in a fully automated manner. Closha provides a user-friendly interface that helps genomic scientists derive accurate results from NGS platform data. The Closha cloud server is freely available at http://closha.kobic.re.kr/.

Background

With the emergence of next-generation sequencing (NGS) technology in 2005, the field of genomics has been caught in a data deluge. Modern sequencing platforms are capable of sequencing approximately 5,000 megabases per day.[1] DNA sequencing is becoming faster and less expensive at a pace far outstripping Moore’s law, which describes the rate at which computing becomes faster and less expensive. As a result of the increased efficiency and diminished cost of NGS, demand for clinical and agricultural applications is rapidly increasing.[2] In the bioinformatics community, acquiring massive sequencing data is always followed by large-scale computational analysis to process the data and obtain scientific insights. Therefore, investment in a sequencing instrument is normally accompanied by substantial investment in computer hardware, analysis pipelines, and bioinformatics experts to analyze the data.[3]

When genomic datasets were small, they could be analyzed on personal computers in a few hours or perhaps overnight.[4] However, this approach does not apply to large NGS datasets. Instead, researchers require high-performance computers and parallel algorithms to analyze their big genomic data in a timely manner.[5] While high-performance computing is essential for data analysis, only a small number of biomedical research labs are equipped to make effective and successful use of parallel computers.[6] Obstacles include the complexities inherent in managing large NGS datasets and assembling and configuring multi-step genome sequencing pipelines, as well as the difficulties inherent in adapting pipelines to process NGS data on parallel computers.[7]

The difficulties in creating these complicated computational pipelines, installing and maintaining software packages, and obtaining sufficient computational resources tend to overwhelm bench biologists and prevent them from attempting to analyze their own genomic data.[8] Despite the availability of a vast set of computational tools and methods for genomic data analysis[1], it is still challenging for a genomic researcher to organize these tools, integrate them into workable pipelines, find accessible computational platforms, configure the computing environment, and perform the actual analysis.

To address these challenges, the MapReduce[9] model has been widely adopted to handle large datasets with parallel processing tools[10]; Apache Hadoop is the most widely used open-source implementation of the MapReduce programming model for big data batch processing. Cloud-based bioinformatics workflow platforms have also been proposed for genomic researchers. Scientific workflow systems such as Galaxy[11] and Taverna[12] offer simple web-based workflow toolkits and scalable computing environments to meet this challenge.
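
In the MapReduce model, an analysis is expressed as a map step that emits key-value pairs from each input record and a reduce step that aggregates the values collected under each key; the framework splits the input and runs many map and reduce tasks in parallel across a cluster. As a minimal, purely illustrative sketch (the counting task, file names, and scripts below are assumptions made for this example, not code from Closha), the following pair of Hadoop Streaming scripts counts aligned reads per chromosome in SAM-formatted text:

  # mapper.py -- hypothetical Hadoop Streaming mapper, shown for illustration only
  # (not code from Closha). Emits "<chromosome>\t1" for each aligned record in
  # SAM-formatted input read from standard input.
  import sys

  for line in sys.stdin:
      if line.startswith("@"):                  # skip SAM header lines
          continue
      fields = line.rstrip("\n").split("\t")
      if len(fields) > 2 and fields[2] != "*":  # column 3 (RNAME) names the chromosome
          print(fields[2] + "\t1")

  # reducer.py -- hypothetical Hadoop Streaming reducer, shown for illustration only.
  # Hadoop sorts mapper output by key, so per-chromosome totals accumulate in one pass.
  import sys

  current_key, count = None, 0
  for line in sys.stdin:
      key, _, value = line.rstrip("\n").partition("\t")
      if key != current_key:
          if current_key is not None:
              print(current_key + "\t" + str(count))
          current_key, count = key, 0
      count += int(value)
  if current_key is not None:
      print(current_key + "\t" + str(count))

Run with the Hadoop Streaming JAR (roughly, hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input reads.sam -output read_counts), Hadoop splits the input, executes many mapper instances in parallel, and merges their sorted output for the reducers; MapReduce-based genomics tools rely on this same divide-and-aggregate pattern to parallelize individual steps of an analysis pipeline.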

References

  1. Souilmi, Y.; Lancaster, A.K.; Jung, J.Y. et al. (2015). "Scalable and cost-effective NGS genotyping in the cloud". BMC Medical Genomics 8: 64. doi:10.1186/s12920-015-0134-9. PMC4608296. PMID 26470712. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4608296.
  2. Afgan, E.; Baker, D.; van den Beek, M. et al. (2016). "The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update". Nucleic Acids Research 44 (W1): W3–W10. doi:10.1093/nar/gkw343. PMC4987906. PMID 27137889. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987906.
  3. de la Garza, L.; Veit, J.; Szolek, A. et al. (2016). "From the desktop to the grid: Scalable bioinformatics via workflow conversion". BMC Bioinformatics 17: 127. doi:10.1186/s12859-016-0978-9. PMC4788856. PMID 26968893. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4788856.
  4. Huang, Z.; Rustagi, N.; Veeraraghavan, N. et al. (2016). "A hybrid computational strategy to address WGS variant analysis in >5000 samples". BMC Bioinformatics 17 (1): 361. doi:10.1186/s12859-016-1211-6. PMC5018196. PMID 27612449. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018196.
  5. Goecks, J.; Eberhard, C.; Too, T. et al. (2013). "Web-based visual analysis for high-throughput genomics". BMC Genomics 14: 397. doi:10.1186/1471-2164-14-397. PMC3691752. PMID 23758618. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3691752.
  6. Langdon, W.B. (2015). "Performance of genetic programming optimised Bowtie2 on genome comparison and analytic testing (GCAT) benchmarks". BioData Mining 8 (1): 1. doi:10.1186/s13040-014-0034-0. PMC4304608. PMID 25621011. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4304608.
  7. Yazar, S.; Gooden, G.E.; Mackey, D.A. et al. (2014). "Benchmarking undedicated cloud computing providers for analysis of genomic datasets". PLoS One 9 (9): e108490. doi:10.1371/journal.pone.0108490. PMC4172764. PMID 25247298. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4172764.
  8. Abouelhoda, M.; Issa, S.A.; Ghanem, M. (2012). "Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support". BMC Bioinformatics 13: 77. doi:10.1186/1471-2105-13-77. PMC3583125. PMID 22559942. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583125.
  9. O'Driscoll, A.; Daugelaite, J.; Sleator, R.D. (2013). "'Big data', Hadoop and cloud computing in genomics". Journal of Biomedical Informatics 46 (5): 774–81. doi:10.1016/j.jbi.2013.07.001. PMID 23872175.
  10. Hiltemann, S.; Mei, H.; de Hollander, M. et al. (2014). "CGtag: complete genomics toolkit and annotation in a cloud-based Galaxy". GigaScience 3 (1): 1. doi:10.1186/2047-217X-3-1. PMC3905657. PMID 24460651. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3905657.
  11. Goecks, J.; Nekrutenko, A.; Taylor, J.; The Galaxy Team (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC2945788. PMID 20738864. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945788.
  12. Oinn, T.; Addis, M.; Ferris, J. et al. (2004). "Taverna: A tool for the composition and enactment of bioinformatics workflows". Bioinformatics 20 (17): 3045–54. doi:10.1093/bioinformatics/bth361. PMID 15201187.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.