Journal:SUSHI: An exquisite recipe for fully documented, reproducible and reusable NGS data analysis

Full article title SUSHI: An exquisite recipe for fully documented, reproducible and reusable NGS data analysis
Journal BMC Bioinformatics
Author(s) Hatakeyama, Masaomi; Opitz, Lennart; Russo, Giancarlo; Qi, Weihong; Schlapbach, Ralph; Rehrauer, Hubert
Author affiliation(s) ETH Zürich and University of Zürich
Primary contact Email: hubert dot rehrauer at fgcz dot ethz dot ch
Year published 2016
Volume and issue 17
Page(s) 228
DOI 10.1186/s12859-016-1104-8
ISSN 1471-2105
Distribution license Creative Commons Attribution 4.0 International
Website http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1104-8
Download http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1104-8 (PDF)

Abstract

Background: Next generation sequencing (NGS) produces massive datasets consisting of billions of reads and up to thousands of samples. Subsequent bioinformatic analysis is typically done with the help of open-source tools, where each application performs a single step towards the final result. This situation leaves the bioinformaticians with the tasks of combining the tools, managing the data files and meta-information, documenting the analysis, and ensuring reproducibility.

Results: We present SUSHI, an agile data analysis framework that relieves bioinformaticians from the administrative challenges of their data analysis. SUSHI lets users build reproducible data analysis workflows from individual applications and manages the input data, the parameters, meta-information with user-driven semantics, and the job scripts. As distinguishing features, SUSHI provides an expert command line interface as well as a convenient web interface to run bioinformatics tools. SUSHI datasets are self-contained and self-documented on the file system. This makes them fully reproducible and ready to be shared. With the associated meta-information being formatted as plain text tables, the datasets can be readily further analyzed and interpreted outside SUSHI.

Conclusion: SUSHI provides an exquisite recipe for analysing NGS data. Following the SUSHI recipe makes data analysis straightforward and takes care of documentation and administration tasks, so users can fully dedicate their time to the analysis itself. SUSHI is suitable for use by bioinformaticians as well as life science researchers. It is targeted for, but by no means constrained to, NGS data analysis. Our SUSHI instance is in productive use and has served as the data analysis interface for more than 1,000 data analysis projects. SUSHI source code as well as a demo server are freely available.

Keywords: Data analysis framework, reproducible research, meta-level system design

Background

Today’s bioinformatics faces the practical challenge of analysing massive and diverse data in a well-documented and reproducible fashion. The situation is particularly challenging in the area of NGS research, where state-of-the-art algorithms are frequently available as standalone tools and where a complete data analysis consists of many individual data processing and analysis steps. The considerations associated with conducting such a data analysis in a research environment have been discussed by W. S. Noble[1], who also gives guidelines and an example strategy for organizing computational data analyses. According to Noble, a key principle is to record every operation such that reproducibility is ensured.

In this paper, we present SUSHI, which stands for "Support Users for SHell-script Integration," a new approach to bioinformatics analysis that is centered on reusability, reproducibility, and scalability. SUSHI produces analysis results as directories that are fully self-contained and hold all the information needed to reproduce them. Specifically, we document all parameters, input data, and commands executed, as well as the versions of the tools and the reference data used. Additionally, we store meta-information on the experimental data together with the result files, so that these can be interpreted and further analyzed by other tools independently of SUSHI. This holds even if the analysis directory is transferred to collaborators with a different computing environment. SUSHI is extendable, and we have put special emphasis on the simplicity of adding new software applications: a bioinformatician can define them within a single file and does not need special programming skills. SUSHI natively offers a command line interface as well as a web interface to run data analysis steps. Altogether, SUSHI lets bioinformaticians efficiently build analysis pipelines and ensures that analysis results are ready to be shared and reproducible.

Various types of data analysis frameworks have already been implemented. They can essentially be divided into web-based frameworks and scripting frameworks. Examples of web-based frameworks are Galaxy[2], Chipster[3], GeneProf[4], and GenePattern.[5] They let users run individual steps or entire pipelines on a remote compute system, with the framework keeping track of the executed analysis. Scripting frameworks like bpipe[6], Ruffus[7], nestly[8], NGSANE[9], Makeflow[10], and Snakemake[11] let users build bioinformatics pipelines in a command line fashion. Given the different types of user interaction, the former solutions are targeted more at experienced biologists or application-oriented bioinformaticians, while the latter address the needs of bioinformaticians who are more inclined to programming and the high-throughput analysis of many datasets. However, no system yet natively offers both interfaces. Additionally, none of the existing frameworks puts an emphasis on having a human-readable and portable file-based representation of the meta-information and associated data.

Implementation

SUSHI data sets and applications

Within SUSHI, the original measured data as well as any derived analysis result is modeled as a data set that is represented as a fully self-sufficient directory on the file system. For the original data this means that it must be accompanied by meta-information on how the data has been measured, including information on the biological samples used as well as on the measurement process. For analysis results this implies that the result files must be accompanied by meta-information that documents the input data, the analysis steps, and the analysis parameters. If these requirements are satisfied, we call this a "dataset." Figure 1 shows schematically how a dataset is generated as the result of running a SUSHI application. The meta-information associated with a dataset is represented in SUSHI as a DataSet object; on the file system, a tabular file called "dataset.tsv" represents the dataset. Examples of meta-information are characteristics such as sample name, species, and tissue, but also, for example, the genome build that has been used for read alignment. Each characteristic is represented as a separate column in the tabular dataset.tsv.
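Because dataset.tsv is a plain tab-separated table, it can be read with standard tools outside of SUSHI. The following is a minimal sketch (the helper name is ours, not part of the SUSHI API) that loads such a table into an array of row hashes, matching the Array-of-Hash representation SUSHI uses internally:

```ruby
require 'csv'

# Minimal sketch: parse a dataset.tsv (tab-separated, first row = column
# headers) into an array of row hashes. The function name is illustrative,
# not part of the SUSHI API.
def parse_dataset(tsv_text)
  CSV.parse(tsv_text, col_sep: "\t", headers: true).map(&:to_h)
end

rows = parse_dataset("Name\tSpecies\nMut1\tMus musculus\n")
# rows.first['Species'] => "Mus musculus"
```

Each hash then maps a column header to that sample's value, so downstream code can look up characteristics by name rather than by column position.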


[Figure 1 image: Fig1 Hatakeyama BMCBioinformatics2016 17.gif]

Figure 1. The use case of DataSet generation. By running a SUSHI application with an input DataSet and parameters, a new DataSet is generated. Initially (Step 1) only the meta-information, the parameter file, and the job scripts are generated. The actual data files and the log files are generated by executing the static job scripts (Step 2).

A SUSHI application requires as input both a set of parameters and a DataSet object. This means that applications do not take bare data files as direct input. Instead, SUSHI applications take as input the DataSet meta-information object, which holds, next to the data files, the meta-information necessary to process and interpret them. Based on its input, a SUSHI application first generates 1) the necessary job script(s), 2) a file representation of the parameters, and 3) the DataSet for the output data (Fig. 1, Step 1). The actual result data file(s) are generated by the job script(s) (Fig. 1, Step 2). The columns of the output DataSet again hold the meta-information, which now additionally includes those parameters of the executed analysis that are relevant for further analysis or interpretation. The set of characteristics added to the annotation columns is defined and generated by the SUSHI application. The SUSHI framework itself does not require any specific annotation columns; thus, the semantics of the DataSet columns are determined by the SUSHI applications (described in detail in the next section).
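The two-step scheme in Figure 1 can be sketched as follows; the function and file names here are our own illustration, not SUSHI code. Step 1 only writes static files into a fresh result directory, and step 2 simply consists of executing the generated job script later:

```ruby
require 'fileutils'

# Illustrative sketch of step 1: materialize the parameter file, the
# output dataset.tsv, and a job script in a fresh result directory.
# No analysis runs here; step 2 is executing job.sh at a later time.
def prepare_result_dir(result_dir, params, out_rows, command)
  FileUtils.mkdir_p(result_dir)
  File.write(File.join(result_dir, 'parameters.tsv'),
             params.map { |k, v| "#{k}\t#{v}" }.join("\n") + "\n")
  headers = out_rows.first.keys
  tsv = ([headers] + out_rows.map(&:values)).map { |r| r.join("\t") }.join("\n")
  File.write(File.join(result_dir, 'dataset.tsv'), tsv + "\n")
  File.write(File.join(result_dir, 'job.sh'), "#!/bin/bash\n#{command}\n")
end
```

Because everything step 2 needs is written to disk as plain files, the generated directory can be inspected, re-run, or shared independently of the framework that produced it.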

Every column of meta-information has a unique header that identifies the content and optional tags that characterize the information type in the column. Tags are represented as comma-separated strings within square brackets in the column headers. Currently supported tags are "File," "Link," "Factor," and "Characteristics." Depending on the tags, the SUSHI framework provides appropriate actions for the corresponding columns. The File tag is reserved for actual file paths, and SUSHI checks if the file actually exists. If a column has a Link tag, SUSHI will automatically add a hyperlink to the data. Finally, the Factor column data will be used to group samples according to experimental factors, which is typically required in a differential gene expression analysis.
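A header such as "Read1 [File]" can thus be split into a column name and a tag list; the following sketch (our own illustration, not SUSHI's actual parsing code) shows one way to do this:

```ruby
# Illustrative header parsing: split a column header such as
# "Read1 [File]" into its name and the comma-separated tags given
# inside the square brackets; headers without brackets get no tags.
def split_header(header)
  if header =~ /\A(.+?)\s*\[(.+)\]\z/
    [$1, $2.split(',').map(&:strip)]
  else
    [header, []]
  end
end

split_header('Read1 [File]')      # => ["Read1", ["File"]]
split_header('Species')           # => ["Species", []]
```

With the tags extracted, a framework can decide per column whether to verify a file path, render a hyperlink, or treat the values as an experimental factor.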

Example DataSet holding RNA-seq reads

We follow the convention that sequencing read files are represented as a DataSet with the following columns:

  • Name: the name of the sample measured
  • Read1 [File]: path to the file holding the reads; if reads are paired-end, this must be the first read
  • Adapter1: potentially contaminating adapter sequence at the 3′-end of read 1
  • Read2 [File]: path to the second read for paired-end data (only for paired-end data)
  • Adapter2: potentially contaminating adapter sequence at the 3′-end of read 2 (only for paired-end data)
  • Species: the species of the sample
  • StrandMode: specifies whether the library preparation protocol preserved strand information
  • Enrichment Kit: the kit employed to enrich the input material (e.g. poly-A selection kit)
  • Read Count: the number of reads in the file

Additionally, there can be columns that specify experimental factors. Table 1 shows an example (due to space constraints only a subset of the columns is shown).

Table 1. A sample DataSet. Example of a sequencing read DataSet where a subset of the meta-information is shown as annotation columns. The DataSet includes four samples with four categories of meta-information: Name, Read, Species, and Genotype. Each column header can have a tag; e.g., "[File]" means the column holds file locations, and "[Factor]" means the values represent an experimental factor. The DataSet object is implemented as an Array of Hash objects in the SUSHI system, and it can be imported from or exported to a tab-separated-value file.
Name | Read [File]                        | Species      | Genotype [Factor]
Mut1 | P1001/ventricles/mut1_R1.fastq.gz  | Mus musculus | Mutant
Mut2 | P1001/ventricles/mut2_R1.fastq.gz  | Mus musculus | Mutant
Wt1  | P1001/ventricles/wt1_R1.fastq.gz   | Mus musculus | Wildtype
Wt2  | P1001/ventricles/wt2_R1.fastq.gz   | Mus musculus | Wildtype
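As a row-level illustration of the read DataSet convention above, one paired-end sample could be represented as a Ruby hash; all values below are invented examples:

```ruby
# One row of a hypothetical paired-end read DataSet, following the
# column convention described above; every value here is an invented
# example (the adapter shown is the common Illumina adapter prefix).
sample = {
  'Name'              => 'Mut1',
  'Read1 [File]'      => 'P1001/ventricles/mut1_R1.fastq.gz',
  'Adapter1'          => 'AGATCGGAAGAGC',
  'Read2 [File]'      => 'P1001/ventricles/mut1_R2.fastq.gz',
  'Adapter2'          => 'AGATCGGAAGAGC',
  'Species'           => 'Mus musculus',
  'StrandMode'        => 'sense',
  'Enrichment Kit'    => 'poly-A selection',
  'Read Count'        => 25000000,
  'Genotype [Factor]' => 'Mutant'
}
```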

It is important to mention here that the SUSHI framework does not impose any constraints or semantics on the columns of the DataSet table. It is up to the users to identify which meta-information is relevant for their data. In particular, we do not require specific ontologies or controlled vocabularies; users are free to define their own meta-information. The content definition and interpretation is entirely delegated to the user and the SUSHI applications. Figure 2a shows how the above DataSet is visualized in the SUSHI DataSet view.


[Figure 2 image: Fig2 Hatakeyama BMCBioinformatics2016 17.gif]

Figure 2. The screenshots of a DataSet and parameter setting view. a. DataSet view shows basic information of the DataSet, sample information, and the compatible SUSHI applications at the bottom. The SUSHI application is shown as a button and categorized based on the @analysis_category defined in the SUSHI application Ruby code. b. After selecting a SUSHI application, the parameter setting view lets users modify the analysis parameters. According to the SUSHI application definition, GUI components are auto-generated and placed in the view.

Example SUSHI application performing a FastQC report

A common task is to generate a FastQC report[12] for each sample in a read data set. If the FastQC package is installed, one would run, e.g., "fastqc -o . -t 4 p1001/ventricles/mut1_R1.fastq.gz", which creates the FastQC report mut1_R1_fastqc.zip for the first sample in the above data set. With the SUSHI framework this can be turned into a SUSHI application defined by the following Ruby code:


[Code listing image: Fig2a Hatakeyama BMCBioinformatics2016 17.gif]


The example code shows the essential features of a SUSHI application (see also Additional file 1). The "@required_columns" variable tells the SUSHI framework which columns a DataSet must have so that "FastqcMinimal" is applicable. In Fig. 2a, all applications that are compatible with the example reads data set are shown at the bottom, including the "FastqcMinimal" application. The "@params['cores']" entry defines the number of cores to be used for multi-threading as a parameter with default value 4; this parameter is automatically turned into an input field in the web interface (see Fig. 2b). The code also defines, with the method "next_dataset", the columns and content of the resulting DataSet. Finally, the method "commands" defines the command to be executed.
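Since the listing is reproduced here only as an image, the following is a hedged reconstruction of what such a minimal application could look like, based solely on the features described in the text; the SushiApp base-class interface it would normally inherit from is an assumption and is therefore omitted:

```ruby
# Hedged reconstruction of a minimal SUSHI application, based only on
# the features described in the text; the surrounding base-class
# interface is assumed, so the class is shown standalone here.
class FastqcMinimal
  attr_accessor :dataset   # one input DataSet row (a Hash), set by the framework

  def initialize
    @name = 'FastqcMinimal'
    @analysis_category = 'QC'                # groups the app button in the web UI
    @required_columns = ['Name', 'Read1']    # input DataSet must provide these
    @params = { 'cores' => 4 }               # rendered as a web-form input field
  end

  # Columns and content of the output DataSet for the current sample.
  def next_dataset
    { 'Name' => @dataset['Name'],
      'Report [File]' => "#{File.basename(@dataset['Read1'], '.fastq.gz')}_fastqc.zip" }
  end

  # Shell command written into the generated job script.
  def commands
    "fastqc -o . -t #{@params['cores']} #{@dataset['Read1']}"
  end
end
```

On our assumptions, the framework would call "next_dataset" once per input sample during step 1 and write the string returned by "commands" into the static job script executed in step 2.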

References

  1. Noble, W.S. (2009). "A quick guide to organizing computational biology projects". PLoS Computational Biology 5 (7): e1000424. doi:10.1371/journal.pcbi.1000424. PMC PMC2709440. PMID 19649301. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709440. 
  2. Goecks, Jeremey; Nekrutenko, Anton; Taylor, James; The Galaxy Team (2010). "Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC PMC2945788. PMID 20738864. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2945788. 
  3. Kallio, M.A.; Tuimala, J.T.; Hupponen, T. et al. (2011). "Chipster: User-friendly analysis software for microarray and other high-throughput data". BMC Genomics 12: 507. doi:10.1186/1471-2164-12-507. PMID 21999641. 
  4. Halbritter, F.; Vaidya, H.J.; Tomlinson, S.R. (2011). "GeneProf: Analysis of high-throughput sequencing experiments". Nature Methods 9 (1): 7–8. doi:10.1038/nmeth.1809. PMID 22205509. 
  5. Reich, M.; Liefeld, T.; Gould, J. et al. (2006). "GenePattern 2.0". Nature Genetics 38 (5): 500–501. doi:10.1038/ng0506-500. PMID 16642009. 
  6. Sadedin, S.P.; Pope, B.; Oshlack, A. (2012). "Bpipe: a tool for running and managing bioinformatics pipelines". Bioinformatics 28 (11): 1525-6. doi:10.1093/bioinformatics/bts167. PMID 22500002. 
  7. Goodstadt, L. (2010). "Ruffus: A lightweight Python library for computational pipelines". Bioinformatics 26 (21): 2778-9. doi:10.1093/bioinformatics/btq524. PMID 20847218. 
  8. McCoy, C.O.; Gallagher, A.; Hoffman, N.G.; Matsen, F.A. (2013). "nestly: A framework for running software with nested parameter choices and aggregating results". Bioinformatics 29 (3): 387–8. doi:10.1093/bioinformatics/bts696. PMC PMC3562064. PMID 23220574. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3562064. 
  9. Buske, F.A.; French, H.J.; Smith, M.A. et al. (2014). "NGSANE: A lightweight production informatics framework for high-throughput data analysis". Bioinformatics 30 (10): 1471-1472. doi:10.1093/bioinformatics/btu036. PMC PMC4016703. PMID 24470576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016703. 
  10. Yu, L.; Moretti, C.; Thrasher, A. et al. (2010). "Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions". Cluster Computing 13 (3): 243-256. doi:10.1007/s10586-010-0134-7. 
  11. Köster, J.; Rahmann, S. (2012). "Snakemake: A scalable bioinformatics workflow engine". Bioinformatics 28 (19): 2520-2. doi:10.1093/bioinformatics/bts480. PMID 22908215. 
  12. Andrews, S. (2010). "FastQC: A quality control tool for high throughput sequence data". Babraham Bioinformatics. Babraham Institute. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.