Journal:SUSHI: An exquisite recipe for fully documented, reproducible and reusable NGS data analysis

Full article title	SUSHI: An exquisite recipe for fully documented, reproducible and reusable NGS data analysis
Journal	BMC Bioinformatics
Author(s)	Hatakeyama, Masaomi; Opitz, Lennart; Russo, Giancarlo; Qi, Weihong; Schlapbach, Ralph; Rehrauer, Hubert
Author affiliation(s)	ETH Zürich and University of Zürich
Primary contact	Email: hubert dot rehrauer at fgcz dot ethz dot ch
Year published	2016
Volume and issue	17
Page(s)	228
DOI	10.1186/s12859-016-1104-8
ISSN	1471-2105
Distribution license	Creative Commons Attribution 4.0 International
Website	http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1104-8
Download	http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1104-8 (PDF)

Abstract

Background: Next generation sequencing (NGS) produces massive datasets consisting of billions of reads and up to thousands of samples. Subsequent bioinformatic analysis is typically done with the help of open-source tools, where each application performs a single step towards the final result. This situation leaves the bioinformaticians with the tasks of combining the tools, managing the data files and meta-information, documenting the analysis, and ensuring reproducibility.

Results: We present SUSHI, an agile data analysis framework that relieves bioinformaticians from the administrative challenges of their data analysis. SUSHI lets users build reproducible data analysis workflows from individual applications and manages the input data, the parameters, meta-information with user-driven semantics, and the job scripts. As distinguishing features, SUSHI provides an expert command line interface as well as a convenient web interface to run bioinformatics tools. SUSHI datasets are self-contained and self-documented on the file system. This makes them fully reproducible and ready to be shared. With the associated meta-information being formatted as plain text tables, the datasets can be readily further analyzed and interpreted outside SUSHI.

Conclusion: SUSHI provides an exquisite recipe for analysing NGS data. By following the SUSHI recipe, users get a straightforward data analysis in which documentation and administration tasks are taken care of, so they can dedicate their time fully to the analysis itself. SUSHI is suitable for use by bioinformaticians as well as life science researchers. It is targeted at, but by no means constrained to, NGS data analysis. Our SUSHI instance is in productive use and has served as the data analysis interface for more than 1000 data analysis projects. The SUSHI source code and a demo server are freely available.

Keywords: Data analysis framework, reproducible research, meta-level system design

Background

Today’s bioinformatics faces the practical challenge of analysing massive and diverse data in a well-documented and reproducible fashion. The situation is particularly challenging in the area of NGS research, where state-of-the-art algorithms are frequently available as standalone tools and where a complete data analysis consists of many individual data processing and analysis steps. The considerations associated with conducting such a data analysis in a research environment have been discussed by W. S. Noble [...] Chipster^[1], GeneProf^[2] or GenePattern.^[3] They let users run individual steps or entire pipelines on a remote compute system, with the framework keeping track of the executed analysis. Scripting frameworks like bpipe^[4], Ruffus^[5], nestly^[6], NGSANE^[7], Makeflow^[8], and Snakemake^[9] let users build bioinformatics pipelines from the command line. Given the different types of user interaction, the former solutions are targeted more at experienced biologists or application-oriented bioinformaticians, while the latter address the needs of bioinformaticians who are more inclined to programming and high-throughput analysis of many datasets. However, no system as yet natively offers both interfaces. Additionally, none of the existing frameworks puts an emphasis on a human-readable and portable file-based representation of the meta-information and associated data.
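To illustrate why a plain-text, file-based representation of meta-information is portable, the following minimal Python sketch reads a small tab-separated table with nothing but the standard library and groups samples by one of its columns. The table contents, the column names (`Name`, `Species`, `ReadFile`), and the helper functions are hypothetical and chosen only for illustration; they are not taken from SUSHI and do not describe its actual file layout.

```python
import csv
import io

# Hypothetical meta-information table. Column names and values are
# illustrative assumptions, not SUSHI's actual schema.
DATASET_TSV = (
    "Name\tSpecies\tReadFile\n"
    "sample_A\tHomo sapiens\treads/sample_A.fastq.gz\n"
    "sample_B\tMus musculus\treads/sample_B.fastq.gz\n"
)


def read_dataset(text):
    """Parse a tab-separated table into a list of per-sample dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def samples_by_column(rows, column):
    """Group sample names by the value found in one meta-information column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row["Name"])
    return groups


rows = read_dataset(DATASET_TSV)
groups = samples_by_column(rows, "Species")
print(groups)
```

Because the table is ordinary text, the same file could equally be opened in a spreadsheet or loaded in R without any framework-specific tooling.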

References

1. Fisch, K.M.; Meißner, T.; Gioia, L. et al. (2015). "Omics Pipe: A community-based framework for reproducible multi-omics data analysis". Bioinformatics 31 (11): 1724–1728. doi:10.1093/bioinformatics/btv061. PMC4443682. PMID 25637560. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4443682.
2. Halbritter, F.; Vaidya, H.J.; Tomlinson, S.R. (2011). "GeneProf: Analysis of high-throughput sequencing experiments". Nature Methods 9 (1): 7–8. doi:10.1038/nmeth.1809. PMID 22205509.
3. Reich, M.; Liefeld, T.; Gould, J. et al. (2006). "GenePattern 2.0". Nature Genetics 38 (5): 500–501. doi:10.1038/ng0506-500. PMID 16642009.
4. Sadedin, S.P.; Pope, B.; Oshlack, A. (2012). "Bpipe: A tool for running and managing bioinformatics pipelines". Bioinformatics 28 (11): 1525–1526. doi:10.1093/bioinformatics/bts167. PMID 22500002.
5. Goodstadt, L. (2010). "Ruffus: A lightweight Python library for computational pipelines". Bioinformatics 26 (21): 2778–2779. doi:10.1093/bioinformatics/btq524. PMID 20847218.
6. McCoy, C.O.; Gallagher, A.; Hoffman, N.G.; Matsen, F.A. (2013). "nestly: A framework for running software with nested parameter choices and aggregating results". Bioinformatics 29 (3): 387–388. doi:10.1093/bioinformatics/bts696. PMC3562064. PMID 23220574. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3562064.
7. Buske, F.A.; French, H.J.; Smith, M.A. et al. (2014). "NGSANE: A lightweight production informatics framework for high-throughput data analysis". Bioinformatics 30 (10): 1471–1472. doi:10.1093/bioinformatics/btu036. PMC4016703. PMID 24470576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4016703.
8. Yu, L.; Moretti, C.; Thrasher, A. et al. (2010). "Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions". Cluster Computing 13 (3): 243–256. doi:10.1007/s10586-010-0134-7.
9. Köster, J.; Rahmann, S. (2012). "Snakemake: A scalable bioinformatics workflow engine". Bioinformatics 28 (19): 2520–2522. doi:10.1093/bioinformatics/bts480. PMID 22908215.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.