Journal:SeqWare Query Engine: Storing and searching sequence data in the cloud

Full article title	SeqWare Query Engine: Storing and searching sequence data in the cloud
Journal	BMC Bioinformatics
Author(s)	O’Connor, Brian D.; Merriman, Barry; Nelson, Stanley F.
Author affiliation(s)	University of North Carolina; University of California - Los Angeles
Primary contact	Email: snelson@ucla.edu
Year published	2010
Volume and issue	11(12)
Page(s)	S2
DOI	10.1186/1471-2105-11-S12-S2
ISSN	1471-2105
Distribution license	Creative Commons Attribution 2.0 Generic
Website	http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-S12-S2
Download	http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-11-S12-S2 (PDF)

Abstract

Background: Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, together with a variety of powerful tools designed to process petascale datasets, provides a compelling solution to these ever-increasing demands.

Results: In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (https://github.com/SeqWare).

Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.

Background

Recent advances in sequencing technologies have led to a greatly reduced cost and increased throughput.^[1] The dramatic reductions in both time and financial costs have shaped the experiments scientists are able to perform and have opened up the possibility of whole human genome resequencing becoming commonplace. Currently over a dozen human genomes have been completed, most using one of the short read, high-throughput technologies that are responsible for this growth in sequencing.^[2]^[3]^[4]^[5]^[6]^[7]^[8]

The data types produced by these projects are varied, but most report single nucleotide variants (SNVs), small insertions/deletions (indels, typically <10 bases), and structural variants (SVs), and some include additional information such as haplotype phasing and novel sequence assemblies. Paired tumor/normal samples can additionally be used to identify somatic mutation events by filtering for those variants present in the tumor but not the normal.
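The tumor/normal filtering described above amounts to a set difference over variant calls. The following is a minimal sketch of that idea only, not the SeqWare implementation: the `Variant` record, the `somatic_candidates` function, and the coordinates are hypothetical and chosen purely for illustration, and real pipelines would also account for coverage and call quality in the normal sample.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not the SeqWare data model)."""
    contig: str
    position: int
    ref: str
    alt: str


def somatic_candidates(tumor: List[Variant], normal: List[Variant]) -> List[Variant]:
    """Return variants called in the tumor that are absent from the matched normal."""
    normal_set = set(normal)
    seen = set()
    result = []
    for variant in tumor:
        # Keep tumor order; skip duplicates and anything shared with the normal.
        if variant not in normal_set and variant not in seen:
            seen.add(variant)
            result.append(variant)
    return result


# Illustrative, made-up calls: one variant shared by both samples, one tumor-only.
tumor = [
    Variant("chr1", 1000, "A", "G"),
    Variant("chr7", 2000, "C", "T"),
]
normal = [Variant("chr1", 1000, "A", "G")]

candidates = somatic_candidates(tumor, normal)
print(candidates)
```

Matching on exact contig, position, and alleles is the simplest possible criterion; a variant missing from the normal only because of low coverage there would be wrongly kept by this sketch.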

References

1. Snyder, M.; Du, J.; Gerstein, M. (2010). "Personal genome sequencing: Current approaches and challenges". Genes & Development 24 (5): 423–431. doi:10.1101/gad.1864110. PMC2827837. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2827837.
2. Lander, E.; Linton, L.; Birren, B. et al. (2001). "Initial sequencing and analysis of the human genome". Nature 409 (6822): 860–921. doi:10.1038/35057062. PMID 11237011.
3. Levy, S.; Sutton, G.; Ng, P. et al. (2007). "The diploid genome sequence of an individual human". PLOS Biology 5 (10): e254. doi:10.1371/journal.pbio.0050254. PMC1964779. PMID 17803354. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1964779.
4. Wheeler, D.A.; Srinivasan, M.; Egholm, M. et al. (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature 452 (7189): 872–876. doi:10.1038/nature06884. PMID 18421352.
5. Pushkarev, D.; Neff, N.; Quake, S. (2009). "Single-molecule sequencing of an individual human genome". Nature Biotechnology 27 (9): 847–50. doi:10.1038/nbt.1561. PMC4117198. PMID 19668243. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4117198.
6. Wang, J.; Wang, W.; Li, R. et al. (2008). "The diploid genome sequence of an Asian individual". Nature 456 (7218): 60–5. doi:10.1038/nature07484. PMC2716080. PMID 18987735. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2716080.
7. Bentley, D.; Balasubramanian, S.; Swerdlow, H. et al. (2008). "Accurate whole human genome sequencing using reversible terminator chemistry". Nature 456 (7218): 53–9. doi:10.1038/nature07517. PMC2581791. PMID 18987734. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791.
8. McKernan, K.; Peckham, H.; Costa, G. et al. (2009). "Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding". Genome Research 19 (9): 1527–41. doi:10.1101/gr.091868.109. PMC2752135. PMID 19546169. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752135.

Notes

This presentation is faithful to the original, with only a few minor changes to formatting. In some cases important information was missing from the references, and that information was added. Additionally, URLs to the project page have been updated from the old, deprecated SourceForge location to the current GitHub site.