Journal:From the desktop to the grid: Scalable bioinformatics via workflow conversion

From LIMSWiki
Revision as of 18:04, 13 June 2016 by Shawndouglas (talk | contribs) (Created stub. Saving and adding more.)
Full article title From the desktop to the grid: Scalable bioinformatics via workflow conversion
Journal BMC Bioinformatics
Author(s) de la Garza, Luis; Veit, Johannes; Szolek, Andras; Röttig, Marc; Aiche, Stephan; Gesing, Sandra; Reinert, Knut; Kohlbacher, Oliver
Author affiliation(s) University of Tübingen, Freie Universität Berlin, University of Notre Dame
Primary contact Email: delagarza [at] informatik [dot] uni-tuebingen [dot] de
Year published 2016
Volume and issue 17
Page(s) 127
DOI 10.1186/s12859-016-0978-9
ISSN 1471-2105
Distribution license Creative Commons Attribution 4.0 International
Website https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0978-9
Download https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-0978-9 (PDF)

Abstract

Background

Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well-defined tasks, each with well-defined inputs, parameters, and outputs, offers immediate benefits, such as identifying bottlenecks and pinpointing sections that could benefit from parallelization. Workflows rest upon the notion of splitting complex work into the joint effort of several manageable tasks.

There are several engines that give users the ability to design and execute workflows. Each engine was created to address certain problems of a specific community; therefore, each one has its own advantages and shortcomings. Furthermore, not all features of all workflow engines are royalty-free, an aspect that could potentially drive away members of the scientific community.

Results

We have developed a set of tools that enables the scientific community to benefit from workflow interoperability. We developed a platform-free structured representation of the parameters, inputs, and outputs of command-line tools in so-called Common Tool Descriptor documents. We have also overcome the shortcomings and combined the features of two royalty-free workflow engines with a substantial user community: the Konstanz Information Miner, an engine which we see as a formidable workflow editor, and the Grid and User Support Environment, a web-based framework able to interact with several high-performance computing resources. We have thus created a free and highly accessible way to design workflows on a desktop computer and execute them on high-performance computing resources.
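The Common Tool Descriptor (CTD) idea described above is a structured, platform-free record of a command-line tool's parameters, inputs, and outputs. As an illustration only, a simplified descriptor in that spirit might be parsed as follows; the element and attribute names here are hypothetical assumptions for the sketch, not the actual CTD schema:

```python
# Illustrative sketch: a simplified, hypothetical tool descriptor in the
# spirit of the Common Tool Descriptor (CTD) concept. The element and
# attribute names below are assumptions, not the real CTD schema.
import xml.etree.ElementTree as ET

DESCRIPTOR = """
<tool name="ExampleAligner" version="1.0">
  <inputs>
    <input name="reads" type="fastq"/>
  </inputs>
  <outputs>
    <output name="alignment" type="bam"/>
  </outputs>
  <parameters>
    <parameter name="threads" type="int" default="4"/>
  </parameters>
</tool>
"""

root = ET.fromstring(DESCRIPTOR)
# Collect the tool's declared parameters and their defaults.
params = {p.get("name"): p.get("default") for p in root.iter("parameter")}
print(root.get("name"), params)
```

Because such a representation is independent of any particular workflow engine, the same descriptor could, in principle, be translated into the native tool formats of different engines.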

Conclusions

Our work will not only reduce time spent on designing scientific workflows, but also make executing workflows on remote high-performance computing resources more accessible to technically inexperienced users. We strongly believe that our efforts not only decrease the turnaround time to obtain scientific results but also have a positive impact on reproducibility, thus elevating the quality of those results.

Keywords

Workflow, interoperability, KNIME, grid, cloud, Galaxy, gUSE

Background

The importance of reproducibility for the scientific community has been a topic of recent discussion in both high-impact scientific publications and popular news outlets.[1][2] The ability to independently replicate results, whether for verification purposes or to further advance research, is important to the scientific community. It is therefore crucial to structure an experiment in such a way that reproducibility can be easily achieved.

Workflows are structured, abstract recipes that help users construct a series of steps in an organized way. Each step is a specific, parametrised action that receives some input and produces some output. The collective execution of these steps accomplishes a domain-specific task.
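The notion above — steps as parametrised actions with inputs and outputs, executed in sequence — can be sketched minimally as follows. This is an illustrative model only; the names and structure are assumptions and do not correspond to any of the engines discussed in the article:

```python
# Minimal sketch of a workflow: each step is a parametrised action with an
# input and an output, and a workflow executes the steps in order, feeding
# each step's output into the next. Purely illustrative.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Step:
    name: str
    action: Callable[..., Any]                 # the parametrised action
    params: dict = field(default_factory=dict) # fixed parameters for this step

    def run(self, data: Any) -> Any:
        return self.action(data, **self.params)

def run_workflow(steps: list[Step], data: Any) -> Any:
    """Execute the steps in order; each output becomes the next input."""
    for step in steps:
        data = step.run(data)
    return data

# Toy two-step "analysis" over a list of raw values.
workflow = [
    Step("filter", lambda xs, threshold: [x for x in xs if x >= threshold],
         params={"threshold": 10}),
    Step("normalise", lambda xs: [x / max(xs) for x in xs]),
]

result = run_workflow(workflow, [3, 10, 20])  # -> [0.5, 1.0]
```

Decomposing work this way is what makes the benefits mentioned earlier possible: each step can be inspected, timed, or parallelised independently of the others.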

With the availability of biological big data, the need to represent workflows in computing languages has also increased.[3] Scientific tasks such as genome comparison, mass spectrometry analysis, and protein-protein interaction analysis, to name a few, access extensive datasets. Currently, a vast number of workflow engines exist[4][5][6][7][8], and each of these technologies has amassed a considerable user base. These engines support, in one way or another, the execution of workflows on distributed high-performance computing (HPC) resources (e.g., grids, clusters, and clouds), allowing results to be obtained more quickly. A wise selection of a workflow engine will shorten the time spent between workflow design and retrieval of results.

References

  1. "Unreliable research: Trouble at the lab". The Economist 409 (8858): 26–30. 19 October 2013. http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble. Retrieved 07 July 2015. 
  2. McNutt, M. (17 January 2014). "Reproducibility". Science 343 (6168): 229. doi:10.1126/science.1250475. PMID 24436391. 
  3. Greene, C.S.; Tan, J.; Ung, M.; Moore, J.H.; Cheng, C. (2014). "Big data bioinformatics". Journal of Cellular Physiology 229 (12): 1896–900. doi:10.1002/jcp.24662. PMID 24799088. 
  4. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. (2008). "Chapter 38: KNIME: The Konstanz Information Miner". In Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R.. Data Analysis, Machine Learning and Applications. Springer Berlin Heidelberg. doi:10.1007/978-3-540-78246-9_38. ISBN 9783540782391. 
  5. Kacsuk, P.; Farkas, Z.; Kozlovszky, M.; Hermann, G.; Balasko, A.; Karoczkai, K.; Marton, I. (2012). "WS-PGRADE/gUSE Generic DCI Gateway Framework for a Large Variety of User Communities". Journal of Grid Computing 10 (4): 601–630. doi:10.1007/s10723-012-9240-5. 
  6. Blankenberg, D.; Von Kuster, G.; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. (2010). "Galaxy: A web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology 19 (Unit 19.10.1–21). doi:10.1002/0471142727.mb1910s89. PMC 4264107. PMID 20069535. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264107. 
  7. Missier, P.; Soiland-Reyes, S.; Owen, S.; Tan, W.; Nenadic, A.; Dunlop, I.; Williams, A.; Oinn, T.; Goble, C. (2010). "Chapter 33: Taverna, Reloaded". In Gertz, M.; Ludäscher, B.. Scientific and Statistical Database Management. Springer Berlin Heidelberg. doi:10.1007/978-3-642-13818-8_33. ISBN 9783642138171. 
  8. Abouelhoda, M.; Issa, S.A.; Ghanem, M. (2012). "Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support". BMC Bioinformatics 13: 77. doi:10.1186/1471-2105-13-77. PMC 3583125. PMID 22559942. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583125. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original citation #1 was a mix of two different sources, and it has been corrected here to refer in full to the correct citation from The Economist.