Journal:From the desktop to the grid: Scalable bioinformatics via workflow conversion

Full article title: From the desktop to the grid: Scalable bioinformatics via workflow conversion
Journal: BMC Bioinformatics
Author(s): de la Garza, Luis; Veit, Johannes; Szolek, Andras; Röttig, Marc; Aiche, Stephan; Gesing, Sandra; Reinert, Knut; Kohlbacher, Oliver
Author affiliation(s): University of Tübingen, Freie Universität Berlin, University of Notre Dame
Primary contact: Email: delagarza [at] informatik [dot] uni-tuebingen [dot] de
Year published: 2016
Volume and issue: 17
Page(s): 127
DOI: 10.1186/s12859-016-0978-9
ISSN: 1471-2105
Distribution license: Creative Commons Attribution 4.0 International
Website: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-0978-9
Download: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-0978-9 (PDF)

Abstract

Background

Reproducibility is one of the tenets of the scientific method. Scientific experiments often comprise complex data flows, selection of adequate parameters, and analysis and visualization of intermediate and end results. Breaking down the complexity of such experiments into the joint collaboration of small, repeatable, well-defined tasks, each with well-defined inputs, parameters, and outputs, offers immediate benefits such as identifying bottlenecks and pinpointing sections that could benefit from parallelization. Workflows rest upon the notion of splitting complex work into the joint effort of several manageable tasks.

There are several engines that give users the ability to design and execute workflows. Each engine was created to address certain problems of a specific community; therefore, each one has its advantages and shortcomings. Furthermore, not all features of all workflow engines are royalty-free, an aspect that could potentially drive away members of the scientific community.

Results

We have developed a set of tools that enables the scientific community to benefit from workflow interoperability. We developed a platform-free, structured representation of the parameters, inputs, and outputs of command-line tools in so-called Common Tool Descriptor documents. We have also overcome the shortcomings and combined the features of two royalty-free workflow engines with a substantial user community: the Konstanz Information Miner, an engine which we see as a formidable workflow editor, and the Grid and User Support Environment, a web-based framework able to interact with several high-performance computing resources. We have thus created a free and highly accessible way to design workflows on a desktop computer and execute them on high-performance computing resources.

Conclusions

Our work will not only reduce time spent on designing scientific workflows, but also make executing workflows on remote high-performance computing resources more accessible to technically inexperienced users. We strongly believe that our efforts not only decrease the turnaround time to obtain scientific results but also have a positive impact on reproducibility, thus elevating the quality of obtained scientific results.

Keywords

Workflow, interoperability, KNIME, grid, cloud, Galaxy, gUSE

Background

The importance of reproducibility for the scientific community has lately been a topic of discussion in both high-impact scientific publications and popular news outlets.[1][2] Being able to independently replicate results, be it for verification purposes or to further advance research, is important for the scientific community. Therefore, it is crucial to structure an experiment in such a way that reproducibility can be easily achieved.

Workflows are structured, abstract recipes that help users construct a series of steps in an organized way. Each step is a specific, parametrised action that receives some input and produces some output. The collective execution of these steps is seen as a domain-specific task.

With the availability of biological big data, the need to represent workflows in computing languages has also increased.[3] Scientific tasks such as genome comparison, mass spectrometry analysis, and protein-protein interaction analysis, to name just a few, access extensive datasets. Currently, a vast number of workflow engines exist[4][5][6][7][8], and each of these technologies has amassed a considerable user base. These engines support, in one way or another, the execution of workflows on distributed high-performance computing (HPC) resources (e.g., grids, clusters, clouds), thus allowing results to be obtained more quickly. A wise selection of a workflow engine will shorten the time spent between workflow design and retrieval of results.

Workflow engines

Galaxy[6] is a free, web-based workflow system with several pre-installed tools for data-intensive biomedical research. Inclusion of arbitrary tools is reduced to the trivial task of creating ToolConfig[9] files, which are Extensible Markup Language (XML) documents. The Galaxy project also features a so-called toolshed[10], from which tools can be obtained and installed on Galaxy instances. At the time of writing, Galaxy’s toolshed featured 3,470 tools. However, we have found that Galaxy lacks extended support for popular workload managers and middlewares.

Taverna[7] offers an open-source and domain-independent suite of tools used to design and execute scientific workflows, helping users to automate and pipeline the processing of data coming from different web services. At the time of writing, Taverna features more than 3,500 services available on startup, and it also provides access to local and remote tools. Taverna allows users to track results and data flows with great granularity, since it implements the Open Provenance Model (OPM) standard.[11] A very attractive feature of Taverna is the ability to share workflows via the myExperiment research environment.[12]

The Konstanz Information Miner Analytics Platform (KNIME Analytics Platform)[4][13] is a royalty-free engine that allows users to build and execute workflows using a powerful and user-friendly interface. The KNIME Analytics Platform comes preloaded with several ready-to-use tasks (called KNIME nodes) that serve as the building blocks of a workflow. It is also possible to extend the KNIME Analytics Platform by either downloading community nodes or building custom nodes using a well-documented process.[14][15] Workflows executed on the KNIME Analytics Platform are limited to run on the same personal computer on which the platform has been installed, thus rendering it unsuitable for tasks with high-memory or high-performance requirements. KNIME is offered in two variants able to execute workflows on distributed HPC resources: KNIME Cluster Execution[16] and KNIME Server.[17] These two suites are, however, royalty-based, an aspect that might deter users in the scientific community.

The grid and cloud User Support Environment (gUSE) offers an open-source, free, web-based workflow platform able to tap into distributed HPC infrastructures.[5] gUSE entails a set of components and services that offer access to distributed computing infrastructures (DCIs). The Web Services Parallel Grid Runtime and Developer Environment Portal (WS-PGRADE) component acts as the graphical user interface. This web-based portal is a series of dynamically generated web pages, through which users can create, execute, and monitor workflows. WS-PGRADE communicates with internal gUSE services (e.g., Workflow Interpreter, Workflow Storage, Information Service) using the Web Services Description Language (WSDL).[18] Passing documents in the WSDL format between its components allows gUSE services to interact with other workflow systems. Figure 1 shows the three-tiered architecture of gUSE. This complex and granular architecture enables administrators to distribute the installation of gUSE across resources. A typical set-up is to install WS-PGRADE on a dedicated web server, while installing other services and components on more powerful computers.


Fig1 Garza BMCBioinformatics2016 17.gif

Figure 1. gUSE’s three-tiered architecture. The three tiers of gUSE’s architecture: WS-PGRADE acts as the user interface; the service layer handles, e.g., file and workflow storage; and the Job Submission and Data Management layer contains the DCI Bridge, which is responsible for accessing DCIs. Figure based on Gottdank's "gUSE in a Nutshell".[19]

In order to provide a common workflow submission Application Programming Interface (API), gUSE channels workflow-related requests (i.e., start, monitor, or cancel a job on a DCI) through the DCI Bridge component.[20] The DCI Bridge is fully compatible with the Job Submission Description Language (JSDL)[21], thus enabling other workflow management systems to interact with it in order to benefit from gUSE’s flexibility and DCI support. The DCI Bridge contains so-called DCI Submitters, each containing specific code to submit, monitor, and cancel jobs on each of the supported DCIs (e.g., UNICORE[22], LSF[23], Moab[24]). Figure 2 presents a schematic overview of the interaction between the DCI Bridge and other components.


Fig2 Garza BMCBioinformatics2016 17.gif

Figure 2. Schematic overview of gUSE’s DCI Bridge. Interaction of the DCI Bridge with gUSE services and other workflow management systems is done via JSDL requests. The DCI Bridge contains DCI Submitters, which contain specific code for each of the supported DCIs in gUSE. Figure based on MTA SZTAKI Laboratory of Parallel and Distributed Systems' DCI Bridge Administrator Manual.[20]
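To give a rough idea of what such a JSDL request looks like, the sketch below assembles a minimal, simplified JSDL-style job description with Python's standard library. It is illustrative only: the job name, executable path, and arguments are hypothetical, and an actual submission to the DCI Bridge would include additional resource, staging, and DCI-specific elements.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 namespaces (Job Submission Description Language, OGF GFD.56).
JSDL = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
POSIX = "http://schemas.ggf.org/jsdl/2005/11/jsdl-posix"

def element(parent, ns, tag, text=None):
    """Create a namespaced child element, optionally with text content."""
    node = ET.SubElement(parent, f"{{{ns}}}{tag}")
    if text is not None:
        node.text = text
    return node

job_definition = ET.Element(f"{{{JSDL}}}JobDefinition")
job_description = element(job_definition, JSDL, "JobDescription")

identification = element(job_description, JSDL, "JobIdentification")
element(identification, JSDL, "JobName", "protonate-molecule")  # hypothetical job name

application = element(job_description, JSDL, "Application")
posix_app = element(application, POSIX, "POSIXApplication")
element(posix_app, POSIX, "Executable", "/usr/local/bin/protonate")  # hypothetical tool
element(posix_app, POSIX, "Argument", "input.pdb")
element(posix_app, POSIX, "Output", "protonated.pdb")

print(ET.tostring(job_definition, encoding="unicode"))
```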

Designing a workflow in gUSE entails two steps: creation of an abstract graph and a node-by-node configuration of the concrete workflow. Creation of the abstract graph is achieved via a Java WebStart[25] application that is launched from WS-PGRADE but executed on the user’s computer. At this point, users are able to provide only application domain information. After saving the abstract graph, users direct their web browser back to WS-PGRADE, open the concrete workflow for editing, and configure each task comprising the workflow separately. The configuration entails details such as provision of the required command-line arguments to execute each task. Since gUSE offers interfaces to several middlewares, it is possible to execute tasks of the same workflow on different DCIs. The possible complexity of workflows that can be executed by gUSE is reflected in the several fields available on the task configuration windows presented to the user while configuring the concrete workflow (see Fig. 6). The two-step creation of workflows (i.e., creation of the abstract graph and creation/configuration of the concrete workflow, as depicted in Fig. 6), combined with the steep learning curve that gUSE poses to new users, might intimidate users without much technical experience.

Due to the diversity of workflow engines, a variety of workflow representations has arisen. This disparity of representations poses a challenge to scientists who desire to reuse workflows. Ideally, a scientist would design and test a workflow only once, and it would be possible to execute it on any workflow engine on a given DCI. Built on this principle, the Sharing Interoperable Workflow for Large-Scale Scientific Simulation on Available DCIs project[26] (SHIWA) allows users to run previously existing workflows from different platforms on the SHIWA Simulation Platform. However, privacy concerns might give scientists second thoughts about executing their workflows and uploading their sensitive data on the SHIWA Simulation Platform. Tavaxy[8], focusing on genome comparison and sequence analysis, was created to enable the design of workflows composed of Taverna and Galaxy sub-workflows, and other workflow nodes in a single environment.

There is a clear overlap between the functionalities of the mentioned workflow engines. All offer royalty-free workflow design and execution. However, based on feedback from users and experience in our research group, we believe that the KNIME Analytics Platform is an accessible workflow editor, although it lacks computing power. On the other hand, again based on experience and feedback, we see gUSE as a great back-end framework that taps into several DCIs, but we have found that its workflow editing capabilities pose a steep learning curve to its user base.

In this paper we present work that bridges the gap between the KNIME Analytics Platform and gUSE. Our work helps adopters significantly reduce the time spent designing, creating, and executing repeatable workflows on distributed HPC resources.

Workflow representation

Formal representation

Workflows can be represented using Petri nets.[27][28] Petri nets are directed bipartite graphs containing two kinds of nodes: "places" and "transitions."

A "place" represents a set of conditions that must be fulfilled for a transition to occur. Once a place has satisfied all its conditions, it is enabled. "Transitions" represent actions that affect the state of the system (i.e., copy a file, perform a computation, modify a file).

Places and transitions are connected by edges called "arcs." No arc connects two nodes of the same kind (i.e., this restriction is precisely what makes Petri nets bipartite graphs). Arcs represent transitions’ pre- and postconditions. In order for a transition to take place, all of its preceding places must be enabled (i.e., all of the conditions of preceding places must be satisfied). Conversely, when a transition has been completed, a postcondition is satisfied, influencing the state of the subsequent places to which this transition is connected.

Whenever a place’s condition is satisfied, a token is depicted inside the corresponding place. Figure 3 depicts a simple computer process in which a molecule contained in an input file will be modified by adding any missing hydrogen atoms (i.e., the molecule will be protonated).


Fig3 Garza BMCBioinformatics2016 17.gif

Figure 3. A workflow as represented by a Petri net. Petri net modelling a software pipeline to protonate a molecule found in a single input file. Places are shown as circles; transitions are depicted as squares. The place P0 expects and contains one token, represented by a black dot, and is thus enabled. It follows that P0 is the starting place and P4 represents the end of the process.[20]
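To make the firing semantics concrete, here is a minimal Python sketch (ours, not part of the original work) that models a linear pipeline like the one in Fig. 3 as a Petri net. The place names P0 to P4 follow the figure caption, while the transition names are assumed for illustration.

```python
class PetriNet:
    """Toy Petri net: places hold tokens, and a transition may fire
    only when every one of its input places holds at least one token."""

    def __init__(self):
        self.tokens = {}        # place -> number of tokens
        self.transitions = {}   # transition -> (input places, output places)

    def add_place(self, name, tokens=0):
        self.tokens[name] = tokens

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (inputs, outputs)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.tokens[place] > 0 for place in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.tokens[place] -= 1   # consume the precondition tokens
        for place in outputs:
            self.tokens[place] += 1   # satisfy the postconditions

# Hypothetical encoding of the protonation pipeline: P0 holds the initial
# token (the input file is available), P4 marks the end of the process.
net = PetriNet()
for place in ("P0", "P1", "P2", "P3", "P4"):
    net.add_place(place)
net.tokens["P0"] = 1
net.add_transition("read_input", ["P0"], ["P1"])
net.add_transition("split", ["P1"], ["P2"])
net.add_transition("protonate", ["P2"], ["P3"])
net.add_transition("write_output", ["P3"], ["P4"])

for transition in ("read_input", "split", "protonate", "write_output"):
    net.fire(transition)
print(net.tokens)   # P4 now holds the token: the process has completed
```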

High-level representation

There are several alternatives for representing workflows in a platform-independent way. Yet Another Workflow Language (YAWL)[29] was created after extensive analysis and review of existing workflow languages in order to identify common workflow patterns and develop a new language that could combine the strengths and overcome the handicaps of other languages. The Interoperable Workflow Intermediate Representation (IWIR)[30] is a standard adopted by several workflow engines and is the language describing workflows in the SHIWA Simulation Platform.[26] More recently, a group of individuals, vendors, and organizations joined efforts to create the Common Workflow Language (CWL)[31] in order to provide a specification that enables scientists to represent workflows for data-intensive scientific tasks (e.g., mass spectrometry analysis, sequence analysis).

For the sake of clarity and brevity, literature commonly depicts workflows as directed acyclic graphs, in which each vertex represents a place together with its pre- and postconditions (i.e., the preceding and following transitions).[32] Each of the vertices is labelled, has a unique identifier, and represents a task to be performed. Furthermore, each of the tasks in a workflow can receive inputs and can produce outputs. Outputs of a task can be channeled into another task as an input. An edge between two nodes represents the channeling of an output from one task into another. Edges determine the logical sequence to be followed (i.e., the origin task of an edge has to be completed before the destination task of the edge can be performed). A task will be executed once all of its inputs can be resolved. Workflow tasks are commonly referred to by workflow systems as "nodes" or "jobs," and in this manuscript we will use these terms interchangeably.
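A minimal sketch of this graph view, assuming a plain in-memory edge list rather than any particular engine's workflow format, is shown below. The task names are taken from the protonation example, and the execution order is derived by checking that all inputs of a task can be resolved before it runs (Kahn's topological sort).

```python
from collections import defaultdict, deque

# Each edge (source, target) channels an output of `source` into `target`.
edges = [("Input", "Split"), ("Split", "Protonate"), ("Protonate", "Output")]

def execution_order(edges):
    """Return one valid execution order: a task runs once all of its
    predecessors have completed (Kahn's algorithm)."""
    indegree = defaultdict(int)
    successors = defaultdict(list)
    tasks = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        tasks.update((src, dst))
    ready = deque(task for task in tasks if indegree[task] == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in successors[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(tasks):
        raise ValueError("workflow graph contains a cycle")
    return order

print(execution_order(edges))  # ['Input', 'Split', 'Protonate', 'Output']
```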

Workflow abstraction

Workflows contain three different dimensions or abstraction layers, namely, "case," "process" and "resource" dimensions.[27] Mapping these dimensions into concepts used on distributed execution workflow systems, we find that:

  • The "case" dimension refers to the execution of a workflow (i.e., a single run).
  • The "process" dimension, also referred to as the abstract layer[33][34], deals with the application domain (i.e., the purpose of the workflow), therefore, technical details such as architecture, platform, libraries, implementation, programming language, etc., are hidden in this dimension.
  • The "resource" dimension, often called the concrete layer[33][34], encompasses the hardware and software used to execute the desired process; questions such as "How many processors does a task require?", "Which software running on which device will perform a certain task?", "What are the provided parameters for a specific task?", etc. must be answered in this layer.

Given that the focus of our work is distributed workflows, we prefer the use of the "abstract"/"concrete" terminology throughout this document. Figure 4 depicts the abstract layer of the previously introduced protonation process. This workflow is now composed of four tasks (i.e., vertices) and three edges. The task labelled "Input" has no predecessor tasks, therefore this task is the first one to be executed. In comparison, the task labelled "Protonate" depends on the completion of "Split," which in turn depends on the completion of "Input."


Fig4 Garza BMCBioinformatics2016 17.gif

Figure 4. An abstract workflow. Each vertex corresponds to a task and each edge corresponds to the logical sequence to be followed. Only application domain information is present.

Figure 5, in contrast to Figure 4, shows a possible concrete representation of the presented sample workflow, in which each vertex has been annotated with information needed in order to actually execute the corresponding tasks. While this information varies across workflow execution systems and platforms, the abstract representation of a workflow is constrained only to the application domain and is thus independent of the underlying software and infrastructure. At the concrete layer, the edges not only determine the logical sequence to be followed, but also represent channeled output files. For instance, the "Protonate" task receives an input file from its predecessor, "Split," and generates an output file that will be channeled to the "Output" task.


Fig5 Garza BMCBioinformatics2016 17.gif

Figure 5. A concrete workflow. Similar to an abstract workflow, a concrete workflow contains implicit application domain information. However, vertices of concrete workflows are annotated with extra attributes needed to actually execute the given tasks and obtain the needed data.
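As a small, hypothetical illustration of the two layers, the sketch below annotates one abstract task of the example workflow with concrete execution details. The attribute names and values are ours and do not correspond to any particular engine's schema.

```python
# Abstract layer: application-domain information only.
abstract_task = {
    "id": "Protonate",
    "inputs": ["molecule"],
    "outputs": ["protonated_molecule"],
}

# Concrete layer: the same task annotated with the details needed to
# actually execute it (hypothetical values; the real attributes depend
# on the target workflow engine and DCI).
concrete_task = dict(
    abstract_task,
    executable="/opt/tools/protonate",   # which binary performs the task
    arguments=["--add-hydrogens"],       # provided parameters
    cpus=1,                              # how many processors are required
    memory_mb=2048,
    target_dci="local-cluster",          # where the task should run
)

print(concrete_task)
```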

Workflow conversion

As mentioned before, gUSE splits the creation of workflows into two steps: creation of the abstract graph and configuration of the concrete workflow. These steps match the abstract and concrete layers discussed in the previous section. Both Galaxy and the KNIME Analytics Platform, in turn, represent workflows without this separation. In order to create a workflow in either Galaxy or the KNIME Analytics Platform, users drag and drop visual elements representing nodes into a workspace. Nodes come pre-configured and commonly execute a single task; therefore, the user creates both the abstract and the concrete layer at the same time. Inputs and outputs of nodes are visually represented and can be connected to other nodes. Each node has a configuration dialog in which users can change the default parameters.

It is easy to see how functionally equivalent workflow constructs (e.g., conditional execution of a workflow section, parameter sweep, etc.) are represented differently across workflow engines. Furthermore, engines may offer features that are not available on other workflow systems (e.g., the KNIME Analytics Platform offers elements that are present in neither gUSE nor Galaxy, such as "flow variables"[35]). The proper identification and handling of these elements is important for the conversion, since they play their part in workflow execution. Figure 6 displays a schematic comparison of the implementation of our example workflow across three selected workflow engines: the KNIME Analytics Platform, Galaxy, and gUSE.


Fig6 Garza BMCBioinformatics2016 17.gif

Figure 6. Comparison of the same pipeline across the KNIME Analytics Platform, Galaxy, and gUSE. The KNIME Analytics Platform and Galaxy (sections A and B, respectively) offer intuitive workflow creation, and there is no perceived boundary between the abstract and the concrete layers. gUSE, however, (section C) splits the creation of workflows into two phases: creation of the abstract graph and the further configuration of each node in the concrete workflow.

Due to the variety of workflow systems (we have mentioned only a few instances), it is not surprising that there are several languages and file formats to represent workflows. A first step towards a successful conversion of workflows is to be able to represent the vertices of a workflow (i.e., the tasks, or nodes) in a consistent way across platforms. Once a platform-independent representation of the vertices has been achieved, tasks can be imported into several workflow engines with less effort. Conversion of edges is an endeavor that is specific to each of the workflow engines (i.e., a task can be seen as a standalone component, while edges are the result of the collaboration of two or more tasks). Over the course of the last few years, we have developed a series of tools enabling workflow interoperability across disparate workflow engines.

Implementation

Conversion of whole workflows can be split into two parts, namely, conversion of vertices and conversion of edges. Vertices represent tasks that take inputs and parameters and produce outputs. This information can be represented in a structured way.

Common Tool Descriptor (CTD) files are XML documents that contain information about the parameters, inputs, and outputs of a given tool. This information is presented in a structured and human-readable way, thus facilitating manual generation for arbitrary tools. Since CTDs are also properly formatted XML documents, it is a trivial matter to parse them in an automated way.[36] Generation of CTDs can be done either manually or by CTD-enabled programs. Software libraries and suites such as SeqAn[37], OpenMS[38], and BALL[39] are CTD-enabled; that is, they are able to generate CTDs for each of their tools, parse input CTDs, and execute tools accordingly. Executing a CTD-enabled tool across different platforms is a process transparent to end users. Figures 7 and 8 display how CTDs encapsulate the needed runtime information into a single file, and how CTDs interact with other languages and platforms, respectively.


Fig7 Garza BMCBioinformatics2016 17.gif

Figure 7. A CTD in action. The upper section shows all three parameters needed for the tool PDBCutter to be executed. The middle section shows a snippet of a CTD representing a CTD-enabled tool. The bottom section shows how to execute a CTD-enabled tool with the given sample CTD.

Fig8 Garza BMCBioinformatics2016 17.gif

Figure 8. Overview of how CTDs interact with programming languages and workflow systems. CTDs can be generated by CTD-enabled tools (e.g., BALL, OpenMS, SeqAn) or via CTDopts. Once a tool is CTD-enabled, it can be imported into the KNIME Analytics Platform or Galaxy. We have also developed converters that can import KNIME Analytics Platform and Galaxy workflows into gUSE to take advantage of HPC resources and DCIs.
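The authoritative CTD schema is maintained with the project[36]; purely to illustrate how such a structured descriptor can be handled programmatically, the sketch below parses a simplified, CTD-like document for the PDBCutter example of Fig. 7 using Python's standard library. The element and attribute names are abridged approximations, not the official schema.

```python
import xml.etree.ElementTree as ET

# Simplified, CTD-like document (abridged; not the official CTD schema).
ctd_xml = """
<tool name="PDBCutter" version="1.0">
  <description>Example tool with three parameters (cf. Fig. 7)</description>
  <PARAMETERS>
    <ITEM name="i"        type="input-file"  value="complex.pdb"/>
    <ITEM name="ligand"   type="output-file" value="ligand.pdb"/>
    <ITEM name="receptor" type="output-file" value="receptor.pdb"/>
  </PARAMETERS>
</tool>
"""

root = ET.fromstring(ctd_xml)
parameters = {
    item.get("name"): {"type": item.get("type"), "value": item.get("value")}
    for item in root.iter("ITEM")
}
print(root.get("name"), parameters)
```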

We have also worked towards making CTDs more accessible. Taking into account that refactoring tools to make them CTD-enabled might be time consuming, we have developed CTDopts in order to bind an already existing tool via Python wrappers. Naturally, this interaction is not done automatically without any further input, but it is an easier endeavor than performing a refactoring. CTDopts acts as a wrapper allowing users to execute arbitrary command-line tools via CTDs. These Python wrappers communicate directly with the tool, thus offering end users an interface to a CTD-enabled tool. Of course, manual creation of CTDs is always an option: since CTDs are XML documents, a text editor is all that is needed to generate them by hand.
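The actual CTDopts API is documented with the project; the following sketch only illustrates the general wrapping idea, turning a parameter mapping (as it could be read from a CTD) into a command line for the underlying tool. The function, flag convention, and tool path are hypothetical.

```python
import subprocess

def build_command(executable, parameters):
    """Turn a name -> value mapping into a command line.
    The '--name value' flag style is an assumption for illustration."""
    command = [executable]
    for name, value in parameters.items():
        command.extend([f"--{name}", str(value)])
    return command

# Hypothetical tool and parameter values, e.g. taken from a parsed CTD.
command = build_command("/opt/tools/PDBCutter",
                        {"i": "complex.pdb",
                         "ligand": "ligand.pdb",
                         "receptor": "receptor.pdb"})
print(" ".join(command))
# subprocess.run(command, check=True)   # would execute the wrapped tool
```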

Representing a simple vertex using CTDs brings users closer to workflow interoperability, but this exercise might not pay off on its own. In order to extend a workflow engine by adding new tools, one could delve into the inner workings of said workflow engine and import new tools. Since this could be a technical effort not accessible to users without the needed experience, we have also developed converters of tasks (i.e., vertices).

The KNIME Analytics Platform is an application based on the Eclipse Modelling Framework (EMF)[40], allowing the development of extensions. One of these extensions, and perhaps the most interesting for KNIME Analytics Platform users, is the ability to develop new KNIME nodes.[14] It is also possible to download so-called community nodes.[15]

We developed the Generic KNIME Nodes (GKN) extension to make use of KNIME Analytics Platform’s extensibility. GKN takes a set of CTDs as an input and generates the needed resources to implement KNIME nodes. In order to achieve this, its functionality is split into two main components: node generation and node execution. Once a node built via GKN has been generated (i.e., via the node generation component) and imported into the KNIME Analytics Platform, it interacts with other KNIME nodes via the node execution component. This interaction is transparent for the user.

Since Galaxy is one of the most popular workflow systems in the bioinformatics community, we felt that providing a suitable conversion would benefit the scientific community, so we developed CTD2Galaxy, a set of scripts to convert a CTD into a Galaxy ToolConfig XML file.[9] We also analysed Galaxy’s toolshed[10] and determined that it would be possible to automatically convert around 1,200 of these tool descriptions into CTD files. The rest of the tools in the toolshed contain elements that are not easily translated and are not supported in the CTD format (e.g., if-else constructs, for loops, etc.).
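The mapping in this direction is largely mechanical. The sketch below gives a rough, abridged impression of how a CTD-like parameter list could be turned into a Galaxy ToolConfig skeleton; the generated XML omits many elements a real ToolConfig file requires (see the ToolConfig syntax documentation[9]), and the CTD2Galaxy scripts themselves work differently in detail.

```python
import xml.etree.ElementTree as ET

def to_toolconfig(tool_name, parameters):
    """Build a minimal, abridged Galaxy ToolConfig skeleton from a
    name -> (direction, default) parameter mapping."""
    tool = ET.Element("tool", id=tool_name.lower(), name=tool_name)
    command_parts = [tool_name]
    inputs = ET.SubElement(tool, "inputs")
    outputs = ET.SubElement(tool, "outputs")
    for name, (direction, _default) in parameters.items():
        if direction == "output-file":
            ET.SubElement(outputs, "data", name=name, format="data")
        else:
            ET.SubElement(inputs, "param", name=name, type="data")
        command_parts.append(f"--{name} ${name}")
    ET.SubElement(tool, "command").text = " ".join(command_parts)
    return ET.tostring(tool, encoding="unicode")

print(to_toolconfig("PDBCutter",
                    {"i": ("input-file", "complex.pdb"),
                     "ligand": ("output-file", "ligand.pdb"),
                     "receptor": ("output-file", "receptor.pdb")}))
```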

So far, we have discussed conversion of vertices. Different workflow engines represent workflows using different formats. It follows that conversion of edges is an effort that heavily depends on the workflow engines involved. Analogous to the need for a platform-independent representation of vertices, a first step towards workflow interoperability, which is our end target, is the development of a platform-independent representation of workflows.

In spite of the apparent variety of workflow languages, we have focused our efforts on using the KNIME Analytics Platform as the source point of workflow conversion. To this end, we have also started work to translate any KNIME Analytics Platform workflow into either an IWIR representation or a gUSE workflow (KNIME2gUSE). Alternatively, we have also developed a set of scripts that convert Galaxy workflows into gUSE workflows (Galaxy2gUSE). Refer to Fig. 8 for a brief overview of how our work fits together with these workflow systems.

Results and discussion

We have implemented a label-free quantification (LFQ)[41][42] workflow in the KNIME Analytics Platform. LFQ is a widely used type of experiment in mass spectrometry-based proteomics aimed at quantifying and comparing the abundances of peptides and proteins across different samples. Unlike other quantification strategies employing various kinds of chemical labelling of the different samples, LFQ does not impose a limit on the number of samples. Experiments with tens or hundreds of samples are routinely performed in many labs and, considering the ever-increasing performance of modern mass spectrometers, the number of samples to be analyzed per experiment is very likely to keep growing. This, in turn, gives rise to major computational challenges when analyzing the resulting large and complex data sets consisting of up to several terabytes of raw data. Hence, data processing and analysis of label-free quantification experiments can greatly benefit from distributed HPC resources and shall therefore serve as an example use case.

Our example workflow is based on tools provided by OpenMS/TOPP.[38][43][44][45][46] In addition to label-free quantification, it performs a complete consensus peptide identification[47] using the search engines OMSSA[48] and X!Tandem.[49] In essence, so-called tandem mass spectra containing the masses of fragment ions resulting from collision-induced dissociation of selected peptides are compared to theoretical fragment spectra generated from a given FASTA database containing protein sequences. Afterwards, peptide hits are filtered so that the remaining set of identifications has a false discovery rate of less than one percent. The quantification part starts with two major steps of data reduction and signal detection: peak picking and feature finding. Subsequently, the results of the identification and quantification branches of the workflow are combined, and corresponding peptide signals are matched across all samples in a process called feature linking. Finally, a normalization step is performed, which is necessary in order to be able to actually compare the relative abundances of peptides across the different runs. It is important to note that each run is executed independently via parameter sweep. Furthermore, each run is represented in a different input file, as given by the Input Files element. The output of the complete workflow is channeled to the TextExporter tool, which in turn generates a single comma-separated values (CSV) file containing all identified peptides together with their abundances in all given samples.

Figure 9 depicts the implementation of our LFQ workflow. We used our KNIME2gUSE extension and successfully imported our workflow into gUSE, as Fig. 10 shows.


Fig9 Garza BMCBioinformatics2016 17.gif

Figure 9. Label-free Quantification pipeline implemented in the KNIME Analytics Platform. The section enclosed by the ZipLoopStart and ZipLoopEnd will be executed independently for each of the given input files (i.e., parameter sweep).

Fig10 Garza BMCBioinformatics2016 17.gif

Figure 10. Label-free quantification pipeline as imported from the KNIME Analytics Platform into gUSE using the KNIME2gUSE extension. Note how parameter sweep elements depicted in Fig. 9, such as ZipLoopStart and ZipLoopEnd, are not present in gUSE. This is due to the fact that gUSE implements parameter sweep by setting properties in the input and output ports of the corresponding nodes.

Conclusions

Throughout this document we have presented our work towards workflow interoperability. We are convinced that investing time and effort in workflow interoperability helps scientists from all fields to expedite retrieval of results, so we tested and analyzed several workflow engines.

Based on user feedback and our own usage experience, we noticed that the creation of workflows in the KNIME Analytics Platform is straightforward, rapid, and user-friendly. The amount of prior knowledge of the KNIME Analytics Platform or other workflow systems needed to put together and execute a workflow is minimal. However, we were not satisfied with the fact that execution of workflows on distributed HPC resources is royalty-based. Our search then brought us to gUSE.

gUSE is an open-source web-based framework that enables users to execute workflows on distributed HPC resources. It supports several major resource managers and middlewares via the use of so-called DCI Submitters, which can also be added to extend gUSE’s support. However, workflow creation in gUSE is not as straightforward as in the KNIME Analytics Platform.

It was apparent that there was a need for a solution that combined the features and overcame the drawbacks of these two frameworks: on one side, a free and easy-to-use workflow editor, and on the other, a free and powerful back-end system connecting to several distributed HPC resources.

We are confident that our work presented in this document, in particular KNIME2gUSE, not only provides scientists a way to design and test workflows on their desktop computers, but also enables them to use powerful resources to execute their workflows, thus producing scientific results in a timely manner. We see KNIME2gUSE as a potential adopter of CWL: KNIME2gUSE could be extended in order to generate a CWL representation of a KNIME Analytics Platform workflow.

Availability and requirements

  • Project name: Workflow Conversion
  • Project home page: http://workflowconversion.github.io/
  • Operating system(s): Platform-independent
  • Programming language: Java, Python
  • Other requirements: e.g. Python 2.7, Java 1.6 or higher
  • License: e.g. GNU General Public License (GPL)
  • Any restrictions to use by non-academics: none

Abbreviations

API: Application programming interface

BALL: Bioinformatics algorithms library

CSV: Comma separated values

CTD: Common tool descriptor

CWL: Common workflow language

DCI: Distributed computing infrastructure

EMF: Eclipse modelling framework

GKN: Generic KNIME nodes

gUSE: Grid and cloud user support environment

HPC: High-performance computing

IWIR: Interoperable workflow intermediate representation

JSDL: Job submission description language

KNIME: Konstanz information miner

LFQ: Label-free quantification

LSF: Platform load sharing facility

OPM: Open provenance model

OMSSA: Open mass spectrometry search algorithm

SHIWA: Sharing interoperable workflow for large-scale scientific simulation on available DCIs project

TOPP: The OpenMS proteomics pipeline

UNICORE: Uniform interface to computing resources

WSDL: Web services description language

WS-PGRADE: Web services parallel grid runtime and developer environment portal

XML: Extensible markup language

YAWL: Yet Another Workflow Language

Declaration

Acknowledgements

The authors would like to thank Bernd Wiswedel, Thorsten Meinl, Patrick Winter and Michael Berthold for their support, patience and help in developing the KNIME2gUSE extension. Supported by the German Network for Bioinformatics Infrastructure (Deutsches Netzwerk für Bioinformatik-Infrastruktur, de.NBI). We acknowledge support by Deutsche Forschungsgemeinschaft and Open Access Publishing Fund of University of Tübingen.

Open access

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Competing interests

We declare that we have no competing interests.

Authors’ contributions

LdlG wrote the manuscript, carried out development work for KNIME2gUSE and CTD2Galaxy, and contributed to GKN and CTDopts. JV created and tested the LFQ pipeline in the KNIME Analytics Platform and was involved in drafting the manuscript. SA, together with MR, developed the GKN extension. AS has been a major contributor and released the first version of CTDopts. LdlG, SA, MR, AS, OK and KR are responsible for the development and maintenance of CTD schemas. SG released the first version of Galaxy2gUSE. OK and KR conceived the study, participated in its design and coordination, and helped to draft the manuscript. All authors read and approved the final manuscript.

Authors’ information

We have no additional information about the authors.

References

  1. "Unreliable research: Trouble at the lab". The Economist 409 (8858): 26–30. 19 October 2013. http://www.economist.com/news/briefing/21588057-scientists-think-science-self-correcting-alarming-degree-it-not-trouble. Retrieved 07 July 2015. 
  2. McNutt, M. (17 January 2014). "Reproducibility". Science 343 (6168): 229. doi:10.1126/science.1250475. PMID 24436391. 
  3. Greene, C.S.; Tan, J.; Ung, M.; Moore, J.H.; Cheng, C. (2014). "Big data bioinformatics". Journal of Cellular Physiology 229 (12): 1896-900. doi:10.1002/jcp.24662. PMID 24799088. 
  4. 4.0 4.1 Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. (2008). "Chapter 38: KNIME: The Konstanz Information Miner". In Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R.. Data Analysis, Machine Learning and Applications. Springer Berlin Heidelberg. doi:10.1007/978-3-540-78246-9_38. ISBN 9783540782391. 
  5. 5.0 5.1 Kacsuk, P.; Farkas, Z.; Kozlovszky, M.; Hermann, G.; Balasko, A.; Karoczkai, K.; Marton, I. (2012). "WS-PGRADE/gUSE Generic DCI Gateway Framework for a Large Variety of User Communities". Journal of Grid Computing 10 (4): 601–630. doi:10.1007/s10723-012-9240-5. 
  6. 6.0 6.1 Blankenberg, D.; Von Kuster, G.; Coraor, N.; Ananda, G.; Lazarus, R.; Mangan, M.; Nekrutenko, A.; Taylor, J. (2010). "Galaxy: A web-based genome analysis tool for experimentalists". Current Protocols in Molecular Biology 19 (Unit 19.10.1–21). doi:10.1002/0471142727.mb1910s89. PMC PMC4264107. PMID 20069535. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4264107. 
  7. 7.0 7.1 Missier, P.; Soiland-Reyes, S.; Owen, S.; Tan, W.; Nenadic, A.; Dunlop, I.; Williams, A.; Oinn, T.; Goble, C. (2010). "Chapter 33: Taverna, Reloaded". In Gertz, M.; Ludäscher, B.. Data Analysis, Machine Learning and Applications. Springer Berlin Heidelberg. doi:10.1007/978-3-642-13818-8_33. ISBN 9783642138171. 
  8. 8.0 8.1 Abouelhoda, M.; Issa, S.A.; Ghanem, M. (2012). "Tavaxy: Integrating Taverna and Galaxy workflows with cloud computing support". BMC Bioinformatics 13: 77. doi:10.1186/1471-2105-13-77. PMC PMC3583125. PMID 22559942. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3583125. 
  9. 9.0 9.1 "Admin/Tools/ToolConfigSyntax". Galaxy Project. 19 June 2015. https://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax. Retrieved 28 July 2015. 
  10. 10.0 10.1 "Galaxy Tool Shed". Center for Comparative Genomics and Bioinformatics - Penn State. https://toolshed.g2.bx.psu.edu/. Retrieved 07 July 2015. 
  11. Moreau, L.; Clifford, B.; Freire, J. et al. (2011). "The Open Provenance Model core specification (v1.1)". Future Generation Computer Systems 27 (6): 743–756. doi:10.1016/j.future.2010.07.005. 
  12. Goble, C.; Bhagat, J.; Aleksejevs, S. et al. (May 2010). "myExperiment: a repository and social network for the sharing of bioinformatics workflows". Nucleic Acids Research 38 (Supplemental 2): W677–W682. doi:10.1093/nar/gkq429. 
  13. "KNIME: Open for Innovation". KNIME.org AG. http://www.knime.org/. Retrieved 29 June 2015. 
  14. 14.0 14.1 "New Node Wizard". Developer Guide. KNIME.org AG. https://tech.knime.org/new-node-wizard. Retrieved 06 July 2015. 
  15. 15.0 15.1 "Community Contributions". KNIME Community. KNIME.org AG. https://tech.knime.org/community. Retrieved 07 July 2015. 
  16. "KNIME Cluster Execution". Products. KNIME.org AG. https://www.knime.org/cluster-execution. Retrieved 06 July 2015. 
  17. "KNIME Server - The Heart of a Collaborative KNIME Setup". Products. KNIME.org AG. https://www.knime.org/knime-server. Retrieved 06 July 2015. 
  18. Christensen, E.; Curbera, F.; Meredith, G.; Weerawarana, S. (15 March 2001). "Web Services Description Language (WSDL) 1.1". World Wide Web Consortium. https://www.w3.org/TR/wsdl. 
  19. Gottdank, T. (2013). "gUSE in a Nutshell" (PDF). MTA SZTAKI Laboratory of Parallel and Distributed Systems. http://sourceforge.net/projects/guse/files/gUSE_in_a_Nutshell.pdf/download. 
  20. 20.0 20.1 20.2 "DCI Bridge Administrator Manual - Version 3.7.1" (PDF). MTA SZTAKI Laboratory of Parallel and Distributed Systems. 12 June 2015. http://sourceforge.net/projects/guse/files/3.7.1/Documentation/DCI_BRIDGE_MANUAL_v3.7.1.pdf/download. 
  21. Anjomshoaa, A.; Brisard, F.; Drescher, M. et al. (7 November 2005). "Job Submission Description Language (JSDL) Specification, Version 1.0" (PDF). Global Grid Forum. https://www.ogf.org/documents/GFD.56.pdf. 
  22. Romberg, M. (2002). "The UNICORE Grid Infrastructure". Scientific Programming 10 (2): 149–157. doi:10.1155/2002/483253. 
  23. "IBM Spectrum LSF". IBM Corporation. 2012. http://www-03.ibm.com/systems/spectrum-computing/products/lsf/index.html. 
  24. "HPC Products". Adaptive Computing. http://www.adaptivecomputing.com/products/hpc-products/. Retrieved 06 July 2015. 
  25. "Java SE Desktop Technologies - Java Web Start Technology". Oracle Corporation. http://www.oracle.com/technetwork/java/javase/javawebstart/index.html. Retrieved 03 July 2015. 
  26. 26.0 26.1 Terstyanszky, G.; Kukla, T.; Kiss, T.; Kacsuk, P.; Balasko, A.; Farkas, Z. (2014). "Enabling scientific workflow sharing through coarse-grained interoperability". Future Generation Computer Systems 37 (7): 46–59. doi:10.1016/j.future.2014.02.016. 
  27. 27.0 27.1 Van der Aalst, W.M.P. (1998). "The application of petri nets to workflow management". Journal of Circuits, Systems and Computers 8 (1): 21. doi:10.1142/S0218126698000043. 
  28. Peterson, J.L. (1981). Petri Net Theory and the Modeling of Systems. Prentice Hall. pp. 290. ISBN 9780136619833. 
  29. Van der Aalst, W.M.P.; ter Hofstede, A.H.M. (2005). "YAWL: Yet Another Workflow Language". Information Systems 30 (4): 245–275. doi:10.1016/j.is.2004.02.002. 
  30. Plankensteiner, K.; Montagnat, J.; Prodan, R. (2011). "IWIR: A language enabling portability across grid workflow systems". WORKS '11: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science: 97–106. doi:10.1145/2110497.2110509. 
  31. "Common Workflow Language". The CWL Project. http://www.commonwl.org/. Retrieved 03 July 2015. 
  32. Salimifard, K.; Wright, M. (2001). "Petri net-based modelling of workflow systems: An overview". European Journal of Operational Research 134 (3): 664–676. doi:10.1016/S0377-2217(00)00292-7. 
  33. 33.0 33.1 Deelman, E.; Blythe, J.; Gil, Y. et al. (2003). "Mapping abstract complex workflows onto grid environments". Journal of Grid Computing 1 (1): 25–39. doi:10.1023/A:1024000426962. 
  34. 34.0 34.1 Yu, J.; Buyya, R. (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record 34 (3): 44–49. doi:10.1145/1084805.1084814. 
  35. "Flow Variables". KNIME Wiki. KNIME.com AG. https://tech.knime.org/wiki/flow-variables. Retrieved 26 October 2015. 
  36. "Workflow Conversion: Conversion of workflows across workflow systems". GitHub. http://workflowconversion.github.io/. 
  37. Döring, A.; Weese, D.; Rausch, T.; Reinert, K. (2008). "SeqAn: An efficient, generic C++ library for sequence analysis". BMC Bioinformatics 9: 11. doi:10.1186/1471-2105-9-11. PMC PMC2246154. PMID 18184432. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2246154. 
  38. 38.0 38.1 Sturm, M.; Bertsch, A.; Gröpl, C. et al. (2008). "OpenMS – An open-source software framework for mass spectrometry". BMC Bioinformatics 9: 163. doi:10.1186/1471-2105-9-163. PMC PMC2311306. PMID 18366760. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2311306. 
  39. Hildebrandt, A.; Dehof, A.K.; Rurainski, A. (2010). "BALL - Biochemical algorithms library 1.3". BMC Bioinformatics 11: 531. doi:10.1186/1471-2105-11-531. PMC PMC2984589. PMID 20973958. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2984589. 
  40. Steinberg, D.; Budinksy, F.; Paternostro, M.; Merks, E. (2008). EMF: Eclipse Modeling Framework (2nd revised ed.). Addison-Wesley Professional. pp. 744. ISBN 9780321331885. 
  41. Bantscheff, M.; Schirle, M.; Sweetman, G.; Rick, J.; Kuster, B. (2007). "Quantitative mass spectrometry in proteomics: A critical review". Analytical and Bioanalytical Chemistry 389 (4): 1017-31. doi:10.1007/s00216-007-1486-6. PMID 17668192. 
  42. Weisser, H.; Nahnsen, S.; Grossmann, J. et al. (2013). "An automated pipeline for high-throughput label-free quantitative proteomics". Journal of Proteome Research 12 (4): 1628-44. doi:10.1021/pr300992u. PMID 23391308. 
  43. Kohlbacher, O.; Reinert, K.; Gröpl, C. et al. (2007). "TOPP — The OpenMS proteomics pipeline". Bioinformatics 23 (2): e191–e197. doi:10.1093/bioinformatics/btl299. PMID 17237091. 
  44. Junker, J.; Bielow, C.; Bertsch, A. et al. (2012). "TOPPAS: A graphical workflow editor for the analysis of high-throughput proteomics data". Journal of Proteome Research 11 (7): 3914-20. doi:10.1021/pr300187f. PMID 22583024. 
  45. Junker, J.; Bielow, C.; Bertsch, A. et al. (2012). "TOPPAS: A graphical workflow editor for the analysis of high-throughput proteomics data". Journal of Proteome Research 11 (7): 3914-20. doi:10.1021/pr300187f. PMID 22583024. 
  46. "OpenMS: An Open-source Framework for Mass Spectrometry and TOPP – The OpenMS Proteomics Pipeline". SourceForge. Archived from the original on 09 May 2015. https://web.archive.org/web/20150509070935/http://open-ms.sourceforge.net/. Retrieved 26 June 2015. 
  47. Nahnsen, S.; Bertsch, A.; Rahnenführer, J. et al. (2011). "Probabilistic consensus scoring improves tandem mass spectrometry peptide identification". Journal of Proteome Research 10 (8): 3332-43. doi:10.1021/pr2002879. PMID 21644507. 
  48. Geer, L.Y.; Markey, S.P.; Kowalak, J.A. et al. (2004). "Open mass spectrometry search algorithm". Journal of Proteome Research 3 (5): 958–64. doi:10.1021/pr0499491. PMID 15473683. 
  49. Craig, R.; Beavis, R.C. (2004). "TANDEM: Matching proteins with tandem mass spectra". Bioinformatics 20 (9): 1466-7. doi:10.1093/bioinformatics/bth092. PMID 14976030. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. The original citation #1 was a mix of two different sources, and it has been corrected here to refer in full to the correct citation from The Economist. The original citation #47 got bumped up to 19 due to mandatory wiki ordering, bumping other original citation numbers down one. The original citation #30 (now 31) had an erroneous URL, which was corrected. An additional reference was added concerning "Common Tool Descriptor files" at 36, bumping original citations afterwards down two. The original citation #43 for OpenMS had a dead URL; an archived version of the URL has been substituted. Figure 6 has been moved slightly to a more logical place in the article.