Difference between revisions of "Journal:SeqWare Query Engine: Storing and searching sequence data in the cloud"

From LIMSWiki


To date most biological database projects have focused on the storage of heavily annotated model organism reference sequences. For example, efforts such as the UCSC genome databases<ref name="RheadTheUCSC10">{{cite journal |title=The UCSC genome browser database: update 2010 |journal=Nucleic Acids Research |author=Rhead, B.; Karolchik, D.; Kuhn, R. et al. |volume=38 |issue=Suppl 1 |pages=D613-D619 |year=2010 |doi=10.1093/nar/gkp939 |pmid=19906737 |pmc=PMC2808870}}</ref>, the Generic Model Organism Database’s Chado schema<ref name="MungallAChado07">{{cite journal |title=A Chado case study: an ontology-based modular schema for representing genome-associated biological information |journal=Bioinformatics |author=Mungall, C.; Emmert, D.; The FlyBase Consortium |volume=23 |issue=13 |pages=i337-i346 |year=2007 |doi=10.1093/bioinformatics/btm189 |pmid=17646315}}</ref>, and the Ensembl database<ref name="HubbardEnsembl06">{{cite journal |title=Ensembl 2007 |journal=Nucleic Acids Research |author=Hubbard, T.; Aken, B.; Beal, K. |volume=35 |issue=Suppl 1 |pages=D610-D617 |year=2006 |doi=10.1093/nar/gkl996 |pmid=17148474 |pmc=PMC1761443}}</ref> all solve the problem of storing reference genome annotations in a complete and comprehensive way. The focus for these databases is the proper representation of biological data types and genome annotations, but not storing many thousands of genomes worth of variants relative to a given reference. While many biological database schemas currently in wide use could support tens or even hundreds of genomes worth of variant calls, ultimately these systems are limited by the resources of a single database instance. Since they focus on relatively modest amounts of annotation storage, loading hundreds of genomes worth of multi-terabyte sequencing coverage information, for example, would likely overwhelm these traditional database approaches. 
Yet the appeal of databasing next generation sequence data is clear since it would simplify tool development and allow for useful queries across samples and projects.
In this work we introduce the SeqWare Query Engine, a scalable database system intended to represent the full range of data types common to whole genome and other experimental designs for next generation sequence data. HBase was chosen as the underlying backend because of its robust querying abilities using the Hadoop MapReduce environment and its auto-sharding of data across a commodity cluster based on the Hadoop HDFS distributed filesystem (http://hadoop.apache.org). We also present a web service that wraps the use of MapReduce to allow for sophisticated queries of the database through a simple web interface. The web service can be used interactively or programmatically and makes it possible to easily integrate with genome browsers, such as the UCSC Browser<ref name="KentTheHuman02">{{cite journal |title=The human genome browser at UCSC |journal=Genome Research |author=Kent, W.; Sugnet, C.; Furey, T. et al. |volume=12 |issue=6 |pages=996-1006 |year=2002 |doi=10.1101/gr.229102 |pmid=12045153 |pmc=PMC186604}}</ref>, GBrowse<ref name="SteinTheGeneric02">{{cite journal |title=The generic genome browser: A building block for a model organism system database |journal=Genome Research |author=Stein, L.; Mungall, C.; Shu, S. et al. |volume=12 |issue=10 |pages=1599-610 |year=2002 |doi=10.1101/gr.403602 |pmid=12368253 |pmc=PMC187535}}</ref>, or IGV (http://www.broadinstitute.org/igv), and with data analysis tools, such as the UCSC table browser<ref name="KarolchikTheUSCS04">{{cite journal |title=The UCSC Table Browser data retrieval tool |journal=Nucleic Acids Research |author=Karolchik, D.; Hinrichs, A.; Furey, T. et al. |volume=32 |issue=Suppl 1 |pages=D493-D496 |year=2004 |doi=10.1093/nar/gkh103 |pmid=14681465 |pmc=PMC308837}}</ref>, GALAXY<ref name="GiardineGalaxy05">{{cite journal |title=Galaxy: A platform for interactive large-scale genome analysis |journal=Genome Research |author=Giardine, B.; Riemer, C.; Hardison, R.C.; Burhans, R.; Elnitski, L.; Shah, P.; Zhang, Y.; Blankenberg, D.; Albert, I.; Taylor, J.; Miller, W.; Kent, W.J.; Nekrutenko, A. |volume=15 |issue=10 |pages=1451–1455 |year=2005 |doi=10.1101/gr.4086505 |pmid=16169926 |pmc=PMC1240089}}</ref>, and others. The backend and web service can be used together to create databases containing varying levels of annotations, from raw variant calls and coverage to highly annotated and filtered SNV predictions. This flexibility allows the SeqWare Query Engine to scale from raw data analysis and algorithm tuning through highly annotated data dissemination and hosting. The design decision to move away from traditional relational databases in favor of the NoSQL-style of limited, but highly scalable, databases allowed us to support tens of genomes now and thousands of genomes in the future, limited only by the underlying cloud resources.
==Methods==
===Design approach===
The HBase backend for SeqWare Query Engine is based on the increasingly popular design paradigm that focuses on scalability at the expense of full ACID compliance, relational database schemas, and the Structured Query Language (SQL, as reflected in the name “NoSQL”). The result is that, while scalable to thousands of compute nodes, the overall operations permitted on the database are limited. Each record consists of a key and a value, and the value consists of one or more “column families” that are fixed at table creation time. Each column family can have many “labels” which can be added at any time, and each of these labels can have one or more “timestamps” (versions). For the query engine database, the genomic start position of each feature was used as the key while four column families served to represent the core data types: variants (SNVs, indels, SVs, and translocations), coverage, features (any location-based annotations on the genome), and coding consequences which link back to the variant entries they report on. The coverage object stores individual base coverages in a hash and covers a user-defined range of bases to minimize storage requirements for this data type. New column families can be added to the database to support new data types beyond those described here. Additional column family labels are created as new genomes are loaded into the database, and timestamps are used to distinguish variants in the same genome at identical locations. Figure 1 shows an example row with two genomes’ data loaded. This design was chosen because it meant identical variants in different genomes are stored within the same row, making comparisons between genomes extremely fast using MapReduce or similar simple, uniform operators (Figure 1a). The diagram also shows how secondary indexes are handled in the HBase backend (Figure 1b). 
Tags are a convenient mechanism to associate arbitrary key-value pairs with any variant object and support lookup for the object using the key (tag). When variants or other data types are written to the database, the persistence code identifies tags and adds them to a second table where the key is the tag plus variant ID and the value is the reference genomic table and location. This enables variants with certain tags to be identified without having to walk the entire contents of the main table.
[[File:Fig1 OConnor BMCInformatics2010 11-12.jpg|600px]]
{{clear}}
{|
| STYLE="vertical-align:top;"|
{| border="0" cellpadding="5" cellspacing="0" width="600px"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1. SeqWare Query Engine schema.''' The HBase database is a generic key-value, column oriented database that pairs well with the inherent sparse matrix nature of variant annotations. (a) The primary table stores multiple genomes worth of generic features, variants, coverages, and variant consequences using genomic location within a particular reference genome as the key. Each genome is represented by a particular column family label (such as “variant:genome7”). For locations with more than one called variant the HBase timestamp is used to distinguish each. (b) Secondary indexing is accomplished using a secondary table per genome indexed. The key is the tag being indexed plus the ID of the object of interest, the value is the row key for the original table. This makes lookup by secondary indexes, “tags” for example, possible without having to iterate over all contents of the primary table.</blockquote>
|-
|}
|}
===Datasets===
Fourteen human genome datasets were chosen for loading into a common SeqWare Query Engine backend (see Table 1 for the complete list). Most datasets included just SNV, small indel, and a limited number of SV predictions. The U87MG human cancer cell line genome was used to test the load of large-scale, raw variant analysis data types. For this genome, SNVs, small indels, large deletions, translocation events, and base-by-base coverage were all loaded. For the SNVs and small indels, any variant observed once or more in the underlying short read data was loaded, which resulted in large numbers of spurious variants (i.e., sequencing errors) being loaded into the database. This was done purposefully for two reasons: for this study, to facilitate stress testing the HBase backend, and for the U87MG sequencing project, to facilitate analysis algorithm development by giving practical access to the greatest potential universe of candidate variants. In particular, the fast querying abilities of the SeqWare Query Engine enabled rapid heuristic tuning of the variant calling pipeline parameters through many cycles of filtering and subsequent assessment.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="100%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 1. Datasets'''
|-
  ! style="padding-left:10px; padding-right:10px;"|Dataset
  ! style="padding-left:10px; padding-right:10px;"|Technology
  ! style="padding-left:10px; padding-right:10px;"|SNVs & Indels
  ! style="padding-left:10px; padding-right:10px;"|SV
  ! style="padding-left:10px; padding-right:10px;"|Translocations
  ! style="padding-left:10px; padding-right:10px;"|Reference
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|European-Venter
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Sanger
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Levy et al. 2007<ref name="LevyTheDip07" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|European-Watson
  | style="background-color:white; padding-left:10px; padding-right:10px;"|454
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Wheeler et al. 2008<ref name="WheelerTheCom08" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|European-Quake
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Helicos
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Pushkarev et al. 2009<ref name="PushkarevSingle09" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Asian
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Wang et al. 2008<ref name="WangTheDip08" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Yoruban 18507
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Bentley et al. 2008<ref name="BentleyAcc08" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Yoruban 18507
  | style="background-color:white; padding-left:10px; padding-right:10px;"|SOLiD
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|McKernan et al. 2009<ref name="McKernanSeq09" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Korean
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Ahn et al. 2009<ref name="AhnTheFirst09" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Korean-AKI
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Kim et al. 2009<ref name="KimAHighly09" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|3 human genomes
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Complete Genomics
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Drmanac et al. 2009<ref name="DrmanacHuman10" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|AML T/N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Ley et al. 2008<ref name="LeyDNA08" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|AML genome
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Mardis et al. 2009<ref name="MardisRec09" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Melanoma
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Illumina
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Pleasance et al. 2010<ref name="PleasanceASmall10" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Lung cancer
  | style="background-color:white; padding-left:10px; padding-right:10px;"|SOLiD
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|N
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Pleasance et al. 2010<ref name="PleasanceAComp10" />
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|U87MG
  | style="background-color:white; padding-left:10px; padding-right:10px;"|SOLiD
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Y
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Clark et al. 2010<ref name="ClarkU87MG10" />
|-
|}
|}


==References==

Revision as of 19:23, 28 December 2015

Full article title SeqWare Query Engine: Storing and searching sequence data in the cloud
Journal BMC Bioinformatics
Author(s) O’Connor, Brian D.; Merriman, Barry; Nelson, Stanley F.
Author affiliation(s) University of North Carolina; University of California - Los Angeles
Primary contact Email: snelson@ucla.edu
Year published 2010
Volume and issue 11(12)
Page(s) S2
DOI 10.1186/1471-2105-11-S12-S2
ISSN 1471-2105
Distribution license Creative Commons Attribution 2.0 Generic
Website http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-S12-S2
Download http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-11-S12-S2 (PDF)

Abstract

Background: Since the introduction of next-generation DNA sequencers the rapid increase in sequencer throughput, and associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever increasing demands.

Results: In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (https://github.com/SeqWare).

Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.

Background

Recent advances in sequencing technologies have led to a greatly reduced cost and increased throughput.[1] The dramatic reductions in both time and financial costs have shaped the experiments scientists are able to perform and have opened up the possibility of whole human genome resequencing becoming commonplace. Currently over a dozen human genomes have been completed, most using one of the short read, high-throughput technologies that are responsible for this growth in sequencing.[2][3][4][5][6][7][8][9][10][11][12][13][14][15][16] The datatypes produced by these projects are varied, but most report single nucleotide variants (SNVs), small insertions/deletions (indels, typically <10 bases), structural variants (SVs), and may include additional information such as haplotype phasing and novel sequence assemblies. Paired tumor/normal samples can additionally be used to identify somatic mutation events by filtering for those variants present in the tumor but not the normal.
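The tumor/normal filtering mentioned above amounts to a set difference over variant calls. The following is a minimal sketch under stated assumptions: the `Variant` record, its fields, and the function name `somatic_candidates` are hypothetical illustrations and are not taken from the SeqWare code base, and real pipelines would also account for coverage and call quality in the normal sample.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record used only for this illustration."""
    contig: str
    position: int
    ref: str
    alt: str


def somatic_candidates(tumor, normal):
    """Return variants called in the tumor but absent from the matched normal."""
    return sorted(set(tumor) - set(normal))


tumor_calls = [
    Variant("chr1", 1000, "A", "G"),  # also present in the normal: germline
    Variant("chr2", 5000, "C", "T"),  # tumor only: somatic candidate
]
normal_calls = [Variant("chr1", 1000, "A", "G")]

somatic = somatic_candidates(tumor_calls, normal_calls)
```

Here `somatic` holds only the chr2 call, since the chr1 call appears in both samples.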

Full genome sequencing, while increasingly common, is just one of many experimental designs that are currently used with this generation of sequencing platforms. Targeted resequencing, whole-exome sequencing, RNA sequencing (RNA-Seq), Chromatin Immunoprecipitation sequencing (ChIP-Seq), and bisulfite sequencing for methylation detection are examples of other important analysis types that require large scale databasing capabilities. Efforts such as the 1000 Genomes project (http://www.1000genomes.org), the Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov), and the International Cancer Genome Consortium (http://www.icgc.org) are each generating a wide variety of such data across hundreds to thousands of samples. The diversity and number of sequencing datasets already produced, in production, or being planned present huge infrastructure challenges for the research community.

Primary data, if available, are typically huge, difficult to transfer over public networks, and cumbersome to analyze without significant local computational infrastructure. These include large compute clusters, extensive data storage facilities, dedicated system administrators, and bioinformaticians adept at low-level programming. Highly annotated datasets, such as finished variant calls, are more commonly available, particularly for human datasets. These present a more compact representation of the most salient information, but are typically only available as flat text files in a variety of quasi-standard file formats that require reformatting and processing. This effort is substantial, particularly as the number of datasets grows, and, as a result, is typically undertaken by a small number of researchers that have a personal stake in the data rather than being more widely and easily accessible. In many cases, essential source information has been eliminated for the sake of data reduction, making recalculation impossible. These challenges, in terms of file sizes, diverse formats, limited data retention, and computational requirements, can make writing generic analysis tools complex and difficult. Efforts such as the Variant Call Format (VCF) from the 1000 Genomes Project provide a standard to exchange variant data. But to facilitate the integration of multiple experimental types and increase tool reuse, a common mechanism to both store and query variant calls and other key information from sequencing experiments is highly desirable. Properly databasing this information enables both a common underlying data structure and a search interface to support powerful data mining of sequence-derived information.

To date most biological database projects have focused on the storage of heavily annotated model organism reference sequences. For example, efforts such as the UCSC genome databases[17], the Generic Model Organism Database’s Chado schema[18], and the Ensembl database[19] all solve the problem of storing reference genome annotations in a complete and comprehensive way. The focus for these databases is the proper representation of biological data types and genome annotations, but not storing many thousands of genomes worth of variants relative to a given reference. While many biological database schemas currently in wide use could support tens or even hundreds of genomes worth of variant calls, ultimately these systems are limited by the resources of a single database instance. Since they focus on relatively modest amounts of annotation storage, loading hundreds of genomes worth of multi-terabyte sequencing coverage information, for example, would likely overwhelm these traditional database approaches. Yet the appeal of databasing next generation sequence data is clear since it would simplify tool development and allow for useful queries across samples and projects.

In this work we introduce the SeqWare Query Engine, a scalable database system intended to represent the full range of data types common to whole genome and other experimental designs for next generation sequence data. HBase was chosen as the underlying backend because of its robust querying abilities using the Hadoop MapReduce environment and its auto-sharding of data across a commodity cluster based on the Hadoop HDFS distributed filesystem (http://hadoop.apache.org). We also present a web service that wraps the use of MapReduce to allow for sophisticated queries of the database through a simple web interface. The web service can be used interactively or programmatically and makes it possible to easily integrate with genome browsers, such as the UCSC Browser[20], GBrowse[21], or IGV (http://www.broadinstitute.org/igv), and with data analysis tools, such as the UCSC table browser[22], GALAXY[23], and others. The backend and web service can be used together to create databases containing varying levels of annotations, from raw variant calls and coverage to highly annotated and filtered SNV predictions. This flexibility allows the SeqWare Query Engine to scale from raw data analysis and algorithm tuning through highly annotated data dissemination and hosting. The design decision to move away from traditional relational databases in favor of the NoSQL-style of limited, but highly scalable, databases allowed us to support tens of genomes now and thousands of genomes in the future, limited only by the underlying cloud resources.

Methods

Design approach

The HBase backend for SeqWare Query Engine is based on the increasingly popular design paradigm that focuses on scalability at the expense of full ACID compliance, relational database schemas, and the Structured Query Language (SQL, as reflected in the name “NoSQL”). The result is that, while scalable to thousands of compute nodes, the overall operations permitted on the database are limited. Each record consists of a key and a value, and the value consists of one or more “column families” that are fixed at table creation time. Each column family can have many “labels” which can be added at any time, and each of these labels can have one or more “timestamps” (versions). For the query engine database, the genomic start position of each feature was used as the key while four column families served to represent the core data types: variants (SNVs, indels, SVs, and translocations), coverage, features (any location-based annotations on the genome), and coding consequences which link back to the variant entries they report on. The coverage object stores individual base coverages in a hash and covers a user-defined range of bases to minimize storage requirements for this data type. New column families can be added to the database to support new data types beyond those described here. Additional column family labels are created as new genomes are loaded into the database, and timestamps are used to distinguish variants in the same genome at identical locations. Figure 1 shows an example row with two genomes’ data loaded. This design was chosen because it meant identical variants in different genomes are stored within the same row, making comparisons between genomes extremely fast using MapReduce or similar simple, uniform operators (Figure 1a). The diagram also shows how secondary indexes are handled in the HBase backend (Figure 1b). 
Tags are a convenient mechanism to associate arbitrary key-value pairs with any variant object and support lookup for the object using the key (tag). When variants or other data types are written to the database, the persistence code identifies tags and adds them to a second table where the key is the tag plus variant ID and the value is the reference genomic table and location. This enables variants with certain tags to be identified without having to walk the entire contents of the main table.
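The row layout described in this section (row key, column family, per-genome label, timestamped versions) can be pictured with a small in-memory model. This is only a sketch under stated assumptions, not the HBase API or the SeqWare implementation: the class `GenomeTable`, the zero-padded `contig:position` key format, and the string encoding of variant calls are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# The four column families named in the text; fixed when the table is created.
FAMILIES = ("variant", "coverage", "feature", "consequence")


class GenomeTable:
    """Toy model of one table: row key -> family -> label -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: {f: defaultdict(dict) for f in FAMILIES})

    @staticmethod
    def row_key(contig, start):
        # Assumed key format: zero-padding keeps lexicographic and numeric order aligned.
        return f"{contig}:{start:012d}"

    def put(self, contig, start, family, genome, value, timestamp):
        if family not in FAMILIES:
            raise KeyError(f"unknown column family: {family}")
        # The genome name acts as the label; the timestamp separates multiple
        # entries for the same genome at the same location.
        self.rows[self.row_key(contig, start)][family][genome][timestamp] = value

    def shared_variants(self, genome_a, genome_b):
        """Compare two genomes row by row, touching only one row at a time."""
        shared = []
        for key in sorted(self.rows):
            labels = self.rows[key]["variant"]
            calls_a = set(labels.get(genome_a, {}).values())
            calls_b = set(labels.get(genome_b, {}).values())
            for call in sorted(calls_a & calls_b):
                shared.append((key, call))
        return shared


table = GenomeTable()
table.put("chr1", 1000, "variant", "genome1", "A->G", timestamp=1)
table.put("chr1", 1000, "variant", "genome2", "A->G", timestamp=1)
table.put("chr1", 1000, "variant", "genome2", "A->T", timestamp=2)
table.put("chr2", 5000, "variant", "genome1", "C->T", timestamp=1)

shared = table.shared_variants("genome1", "genome2")
```

Because both genomes' calls at a location live in the same row, the comparison never has to join across rows, which is the property that makes a per-row operator such as a MapReduce mapper a natural fit.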

Fig1 OConnor BMCInformatics2010 11-12.jpg

Figure 1. SeqWare Query Engine schema. The HBase database is a generic key-value, column oriented database that pairs well with the inherent sparse matrix nature of variant annotations. (a) The primary table stores multiple genomes worth of generic features, variants, coverages, and variant consequences using genomic location within a particular reference genome as the key. Each genome is represented by a particular column family label (such as “variant:genome7”). For locations with more than one called variant the HBase timestamp is used to distinguish each. (b) Secondary indexing is accomplished using a secondary table per genome indexed. The key is the tag being indexed plus the ID of the object of interest, the value is the row key for the original table. This makes lookup by secondary indexes, “tags” for example, possible without having to iterate over all contents of the primary table.
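The secondary index in Figure 1b can be sketched as a second sorted table keyed by tag plus object ID, so that a lookup by tag becomes a prefix scan instead of a walk over the primary table. The sketch below is an illustration under assumptions: the class `TagIndex`, the NUL separator between tag and ID, and the example table and tag names are hypothetical, and Python's `bisect` module stands in for HBase's sorted key storage.

```python
import bisect


class TagIndex:
    """Toy secondary index: key = tag + separator + object ID, value = primary location."""

    SEP = "\x00"  # assumed separator; it sorts before any printable character

    def __init__(self):
        self._keys = []    # index keys, kept in sorted order
        self._values = {}  # index key -> (primary table name, primary row key)

    def add(self, tag, variant_id, table, row_key):
        key = f"{tag}{self.SEP}{variant_id}"
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = (table, row_key)

    def lookup(self, tag):
        """Prefix scan: visit only the index entries that start with this tag."""
        prefix = f"{tag}{self.SEP}"
        start = bisect.bisect_left(self._keys, prefix)
        hits = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            hits.append(self._values[key])
        return hits


index = TagIndex()
index.add("nonsynonymous", "v1", "genomeTable", "chr1:000000001000")
index.add("nonsynonymous", "v7", "genomeTable", "chr2:000000005000")
index.add("dbSNP", "v2", "genomeTable", "chr1:000000002000")

hits = index.lookup("nonsynonymous")
```

Each hit names the primary table and row key, so the full variant record can then be fetched with a single keyed read.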

Datasets

Fourteen human genome datasets were chosen for loading into a common SeqWare Query Engine backend (see Table 1 for the complete list). Most datasets included just SNV, small indel, and a limited number of SV predictions. The U87MG human cancer cell line genome was used to test the load of large-scale, raw variant analysis data types. For this genome, SNVs, small indels, large deletions, translocation events, and base-by-base coverage were all loaded. For the SNVs and small indels, any variant observed once or more in the underlying short read data was loaded, which resulted in large numbers of spurious variants (i.e., sequencing errors) being loaded into the database. This was done purposefully for two reasons: for this study, to facilitate stress testing the HBase backend, and for the U87MG sequencing project, to facilitate analysis algorithm development by giving practical access to the greatest potential universe of candidate variants. In particular, the fast querying abilities of the SeqWare Query Engine enabled rapid heuristic tuning of the variant calling pipeline parameters through many cycles of filtering and subsequent assessment.

Table 1. Datasets
Dataset         | Technology        | SNVs & Indels | SV | Translocations | Reference
European-Venter | Sanger            | Y             | Y  | N              | Levy et al. 2007[3]
European-Watson | 454               | Y             | Y  | N              | Wheeler et al. 2008[4]
European-Quake  | Helicos           | Y             | Y  | N              | Pushkarev et al. 2009[5]
Asian           | Illumina          | Y             | Y  | N              | Wang et al. 2008[6]
Yoruban 18507   | Illumina          | Y             | Y  | N              | Bentley et al. 2008[7]
Yoruban 18507   | SOLiD             | Y             | Y  | N              | McKernan et al. 2009[8]
Korean          | Illumina          | Y             | Y  | N              | Ahn et al. 2009[9]
Korean-AKI      | Illumina          | Y             | Y  | N              | Kim et al. 2009[10]
3 human genomes | Complete Genomics | Y             | Y  | N              | Drmanac et al. 2009[11]
AML T/N         | Illumina          | Y             | Y  | N              | Ley et al. 2008[12]
AML genome      | Illumina          | Y             | Y  | N              | Mardis et al. 2009[13]
Melanoma        | Illumina          | Y             | Y  | N              | Pleasance et al. 2010[14]
Lung cancer     | SOLiD             | Y             | Y  | N              | Pleasance et al. 2010[15]
U87MG           | SOLiD             | Y             | Y  | Y              | Clark et al. 2010[16]

References

  1. Snyder, M.; Du, J.; Gerstein, M. (2010). "Personal genome sequencing: Current approaches and challenges". Genes & Development 24 (5): 423–431. doi:10.1101/gad.1864110. PMC PMC2827837. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2827837. 
  2. Lander, E.; Linton, L.; Birren, B. et al. (2001). "Initial sequencing and analysis of the human genome". Nature 409 (6822): 860-921. doi:10.1038/35057062. PMID 11237011. 
  3. Levy, S.; Sutton, G.; Ng, P. et al. (2007). "The diploid genome sequence of an individual human". PLOS Biology 5 (10): e254. doi:10.1371/journal.pbio.0050254. PMC PMC1964779. PMID 17803354. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1964779. 
  4. Wheeler, D.A.; Srinivasan, M.; Egholm, M. et al. (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature 452 (7189): 872–876. doi:10.1038/nature06884. PMID 18421352. 
  5. Pushkarev, D.; Neff, N.; Quake, S. (2009). "Single-molecule sequencing of an individual human genome". Nature Biotechnology 27 (9): 847-50. doi:10.1038/nbt.1561. PMC PMC4117198. PMID 19668243. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4117198. 
  6. Wang, J.; Wang, W.; Li, R. et al. (2008). "The diploid genome sequence of an Asian individual". Nature 456 (7218): 60-5. doi:10.1038/nature07484. PMC PMC2716080. PMID 18987735. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2716080. 
  7. Bentley, D.; Balasubramanian, S.; Swerdlow, H. et al. (2008). "Accurate whole human genome sequencing using reversible terminator chemistry". Nature 456 (7218): 53-9. doi:10.1038/nature07517. PMC PMC2581791. PMID 18987734. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791. 
  8. McKernan, K.; Peckham, H.; Costa, G. et al. (2009). "Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding". Genome Research 19 (9): 1527-41. doi:10.1101/gr.091868.109. PMC PMC2752135. PMID 19546169. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752135. 
  9. Ahn, S.; Kim, T.; Lee, S. et al. (2009). "The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group". Genome Research 19 (9): 1622-9. doi:10.1101/gr.092197.109. PMC PMC2752128. PMID 19470904. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752128. 
  10. Kim, J.; Ju, Y.; Park, H. et al. (2009). "A highly annotated whole-genome sequence of a Korean individual". Nature 460 (7258): 1011-5. doi:10.1038/nature08211. PMC PMC2860965. PMID 19587683. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2860965. 
  11. Drmanac, R.; Sparks, A.; Callow, M. et al. (2010). "Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays". Science 327 (5961): 78-81. doi:10.1126/science.1181498. PMID 19892942. 
  12. Ley, T.; Mardis, E.; Ding, L. et al. (2008). "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome". Nature 456 (7218): 66-72. doi:10.1038/nature07485. PMC PMC2603574. PMID 18987736. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603574. 
  13. Mardis, E.; Ding, L.; Dooling, D. et al. (2009). "Recurring mutations found by sequencing an acute myeloid leukemia genome". New England Journal of Medicine 361 (11): 1058-66. doi:10.1056/NEJMoa0903840. PMC PMC3201812. PMID 19657110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3201812. 
  14. Pleasance, E.; Stephens, P.; O’Meara, S. et al. (2010). "A small-cell lung cancer genome with complex signatures of tobacco exposure". Nature 463 (7278): 184-90. doi:10.1038/nature08629. PMC PMC2880489. PMID 20016488. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880489. 
  15. Pleasance, E.; Cheetham, R.; Stephens, P. et al. (2010). "A comprehensive catalogue of somatic mutations from a human cancer genome". Nature 463 (7278): 191-6. doi:10.1038/nature08658. PMC PMC3145108. PMID 20016485. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145108. 
  16. Clark, M.; Homer, N.; O’Connor, B. et al. (2010). "U87MG decoded: The genomic sequence of a cytogenetically aberrant human cancer cell line". PLOS Genetics 6 (1): e1000832. doi:10.1371/journal.pgen.1000832. PMC PMC2813426. PMID 20126413. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813426. 
  17. Rhead, B.; Karolchik, D.; Kuhn, R. et al. (2010). "The UCSC genome browser database: update 2010". Nucleic Acids Research 38 (Suppl 1): D613-D619. doi:10.1093/nar/gkp939. PMC PMC2808870. PMID 19906737. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808870. 
  18. Mungall, C.; Emmert, D.; The FlyBase Consortium (2007). "A Chado case study: an ontology-based modular schema for representing genome-associated biological information". Bioinformatics 23 (13): i337-i346. doi:10.1093/bioinformatics/btm189. PMID 17646315. 
  19. Hubbard, T.; Aken, B.; Beal, K. (2006). "Ensembl 2007". Nucleic Acids Research 35 (Suppl 1): D610-D617. doi:10.1093/nar/gkl996. PMC PMC1761443. PMID 17148474. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1761443. 
  20. Kent, W.; Sugnet, C.; Furey, T. et al. (2002). "The human genome browser at UCSC". Genome Research 12 (6): 996-1006. doi:10.1101/gr.229102. PMC PMC186604. PMID 12045153. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC186604. 
  21. Stein, L.; Mungall, C.; Shu, S. et al. (2002). "The generic genome browser: A building block for a model organism system database". Genome Research 12 (10): 1599-610. doi:10.1101/gr.403602. PMC PMC187535. PMID 12368253. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC187535. 
  22. Karolchik, D.; Hinrichs, A.; Furey, T. et al. (2004). "The UCSC Table Browser data retrieval tool". Nucleic Acids Research 32 (Suppl 1): D493-D496. doi:10.1093/nar/gkh103. PMC PMC308837. PMID 14681465. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308837. 
  23. Giardine, B.; Riemer, C.; Hardison, R.C.; Burhans, R.; Elnitski, L.; Shah, P.; Zhang, Y.; Blankenberg, D.; Albert, I.; Taylor, J.; Miller, W.; Kent, W.J.; Nekrutenko, A. (2005). "Galaxy: A platform for interactive large-scale genome analysis". Genome Research 15 (10): 1451–1455. doi:10.1101/gr.4086505. PMC PMC1240089. PMID 16169926. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1240089. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Additionally, URLs to the project page have been updated from the old, deprecated SourceForge location to the current GitHub site.