Difference between revisions of "Journal:SeqWare Query Engine: Storing and searching sequence data in the cloud"

Full article title	SeqWare Query Engine: Storing and searching sequence data in the cloud
Journal	BMC Bioinformatics
Author(s)	O’Connor, Brian D.; Merriman, Barry; Nelson, Stanley F.
Author affiliation(s)	University of North Carolina; University of California - Los Angeles
Primary contact	Email: snelson@ucla.edu
Year published	2010
Volume and issue	11(12)
Page(s)	S2
DOI	10.1186/1471-2105-11-S12-S2
ISSN	1471-2105
Distribution license	Creative Commons Attribution 2.0 Generic
Website	http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-S12-S2
Download	http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-11-S12-S2 (PDF)

Revision as of 17:42, 28 December 2015


Abstract

Background: Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands.

Results: In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations including coverage and functional consequences. As a proof of concept, we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (https://github.com/SeqWare).

Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.

Background

Recent advances in sequencing technologies have led to a greatly reduced cost and increased throughput.^[1] The dramatic reductions in both time and financial costs have shaped the experiments scientists are able to perform and have opened up the possibility of whole human genome resequencing becoming commonplace. Currently over a dozen human genomes have been completed, most using one of the short read, high-throughput technologies that are responsible for this growth in sequencing.^[2]^[3]^[4]^[5]^[6]^[7]^[8]^[9]^[10]^[11]^[12]^[13]^[14]^[15]^[16] The datatypes produced by these projects are varied, but most report single nucleotide variants (SNVs), small insertions/deletions (indels, typically <10 bases), structural variants (SVs), and may include additional information such as haplotype phasing and novel sequence assemblies. Paired tumor/normal samples can additionally be used to identify somatic mutation events by filtering for those variants present in the tumor but not the normal.
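The tumor/normal filtering described above can be sketched as a set difference. This is a minimal illustration only, not the SeqWare Query Engine implementation: the Variant record and the somatic_candidates function are hypothetical names, and a real analysis would also need to consider coverage and call quality in the normal sample before treating a variant as somatic.

```python
from typing import Iterable, List, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record: position plus alleles."""
    chrom: str
    pos: int   # 1-based position on the reference
    ref: str   # reference allele
    alt: str   # alternate allele


def somatic_candidates(tumor: Iterable[Variant],
                       normal: Iterable[Variant]) -> List[Variant]:
    """Return variants seen in the tumor but not in the matched normal."""
    normal_set = set(normal)
    # Deduplicate tumor calls, drop anything also called in the normal,
    # and sort by (chrom, pos, ref, alt) for stable output.
    return sorted(v for v in set(tumor) if v not in normal_set)


tumor = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 500, "C", "T")]
normal = [Variant("chr1", 1000, "A", "G")]
candidates = somatic_candidates(tumor, normal)
print(candidates)
```

Here the chr1 variant is shared by both samples and is filtered out, leaving only the chr2 variant as a somatic candidate.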

Full genome sequencing, while increasingly common, is just one of many experimental designs that are currently used with this generation of sequencing platforms. Targeted resequencing, whole-exome sequencing, RNA sequencing (RNA-Seq), Chromatin Immunoprecipitation sequencing (ChIP-Seq), and bisulfite sequencing for methylation detection are examples of other important analysis types that require large scale databasing capabilities. Efforts such as the 1000 Genomes project (http://www.1000genomes.org), the Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov), and the International Cancer Genome Consortium (http://www.icgc.org) are each generating a wide variety of such data across hundreds to thousands of samples. The diversity and number of sequencing datasets already produced, in production, or being planned present huge infrastructure challenges for the research community.

Primary data, if available, are typically huge, difficult to transfer over public networks, and cumbersome to analyze without significant local computational infrastructure. These include large compute clusters, extensive data storage facilities, dedicated system administrators, and bioinformaticians adept at low-level programming. Highly annotated datasets, such as finished variant calls, are more commonly available, particularly for human datasets. These present a more compact representation of the most salient information, but are typically only available as flat text files in a variety of quasi-standard file formats that require reformatting and processing. This effort is substantial, particularly as the number of datasets grows, and, as a result, it is typically undertaken by a small number of researchers who have a personal stake in the data, rather than the data being more widely and easily accessible. In many cases, essential source information has been eliminated for the sake of data reduction, making recalculation impossible. These challenges, in terms of file sizes, diverse formats, limited data retention, and computational requirements, can make writing generic analysis tools complex and difficult. Efforts such as the Variant Call Format (VCF) from the 1000 Genomes Project provide a standard to exchange variant data. But to facilitate the integration of multiple experimental types and increase tool reuse, a common mechanism to both store and query variant calls and other key information from sequencing experiments is highly desirable. Properly databasing this information enables both a common underlying data structure and a search interface to support powerful data mining of sequence-derived information.
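To make the idea of a common store-and-query mechanism concrete, the sketch below parses the eight fixed tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) into records and runs a simple region query over them. It is an assumption-laden toy rather than the interface of any real tool: parse_vcf_line and query_region are hypothetical helper names, the sample lines are invented for illustration, and an in-memory list stands in for a real database.

```python
from typing import Dict, List

# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse one tab-separated VCF data line into a dict of fixed fields."""
    fields = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(fields[1])  # positions are 1-based integers
    return record


def query_region(records: List[Dict[str, object]],
                 chrom: str, start: int, end: int) -> List[Dict[str, object]]:
    """Return records on `chrom` with start <= POS <= end (inclusive)."""
    return [r for r in records
            if r["CHROM"] == chrom and start <= r["POS"] <= end]


# Invented example lines, not taken from any real dataset.
lines = [
    "chr1\t1000\t.\tA\tG\t50\tPASS\tDP=30",
    "chr1\t5000\t.\tC\tT\t20\tPASS\tDP=12",
    "chr2\t1000\t.\tG\tA\t99\tPASS\tDP=40",
]
records = [parse_vcf_line(line) for line in lines]
hits = query_region(records, "chr1", 900, 2000)
print(hits)
```

A linear scan like this is only workable for small inputs; the point of databasing the records is that the same region query can be answered from an index instead of rereading flat files.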

To date, most biological database projects have focused on the storage of heavily annotated model organism reference sequences. For example, efforts such as the UCSC genome databases^[17], the Generic Model Organism Database’s Chado schema^[18], and the Ensembl database^[19] all solve the problem of storing reference genome annotations in a complete and comprehensive way. The focus for these databases is the proper representation of biological data types and genome annotations, but not storing many thousands of genomes' worth of variants relative to a given reference. While many biological database schemas currently in wide use could support tens or even hundreds of genomes' worth of variant calls, ultimately these systems are limited by the resources of a single database instance. Since they focus on relatively modest amounts of annotation storage, loading hundreds of genomes' worth of multi-terabyte sequencing coverage information, for example, would likely overwhelm these traditional database approaches. Yet the appeal of databasing next-generation sequence data is clear, since it would simplify tool development and allow for useful queries across samples and projects.

References

↑ Snyder, M.; Du, J.; Gerstein, M. (2010). "Personal genome sequencing: Current approaches and challenges". Genes & Development 24 (5): 423–431. doi:10.1101/gad.1864110. PMC PMC2827837. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2827837.
↑ Lander, E.; Linton, L.; Birren, B. et al. (2001). "Initial sequencing and analysis of the human genome". Nature 409 (6822): 860-921. doi:10.1038/35057062. PMID 11237011.
↑ Levy, S.; Sutton, G.; Ng, P. et al. (2007). "The diploid genome sequence of an individual human". PLOS Biology 5 (10): e254. doi:10.1371/journal.pbio.0050254. PMC PMC1964779. PMID 17803354. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1964779.
↑ Wheeler, D.A.; Srinivasan, M.; Egholm, M. et al. (2008). "The complete genome of an individual by massively parallel DNA sequencing". Nature 452 (7189): 872–876. doi:10.1038/nature06884. PMID 18421352.
↑ Pushkarev, D.; Neff, N.; Quake, S. (2009). "Single-molecule sequencing of an individual human genome". Nature Biotechnology 27 (9): 847-50. doi:10.1038/nbt.1561. PMC PMC4117198. PMID 19668243. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4117198.
↑ Wang, J.; Wang, W.; Li, R. et al. (2008). "The diploid genome sequence of an Asian individual". Nature 456 (7218): 60-5. doi:10.1038/nature07484. PMC PMC2716080. PMID 18987735. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2716080.
↑ Bentley, D.; Balasubramanian, S.; Swerdlow, H. et al. (2008). "Accurate whole human genome sequencing using reversible terminator chemistry". Nature 456 (7218): 53-9. doi:10.1038/nature07517. PMC PMC2581791. PMID 18987734. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2581791.
↑ McKernan, K.; Peckham, H.; Costa, G. et al. (2009). "Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding". Genome Research 19 (9): 1527-41. doi:10.1101/gr.091868.109. PMC PMC2752135. PMID 19546169. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752135.
↑ Ahn, S.; Kim, T.; Lee, S. et al. (2009). "The first Korean genome sequence and analysis: Full genome sequencing for a socio-ethnic group". Genome Research 19 (9): 1622-9. doi:10.1101/gr.092197.109. PMC PMC2752128. PMID 19470904. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752128.
↑ Kim, J.; Ju, Y.; Park, H. et al. (2009). "A highly annotated whole-genome sequence of a Korean individual". Nature 460 (7258): 1011-5. doi:10.1038/nature08211. PMC PMC2860965. PMID 19587683. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2860965.
↑ Drmanac, R.; Sparks, A.; Callow, M. et al. (2010). "Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays". Science 327 (5961): 78-81. doi:10.1126/science.1181498. PMID 19892942.
↑ Ley, T.; Mardis, E.; Ding, L. et al. (2008). "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome". Nature 456 (7218): 66-72. doi:10.1038/nature07485. PMC PMC2603574. PMID 18987736. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2603574.
↑ Mardis, E.; Ding, L.; Dooling, D. et al. (2009). "Recurring mutations found by sequencing an acute myeloid leukemia genome". New England Journal of Medicine 361 (11): 1058-66. doi:10.1056/NEJMoa0903840. PMC PMC3201812. PMID 19657110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3201812.
↑ Pleasance, E.; Stephens, P.; O’Meara, S. et al. (2010). "A small-cell lung cancer genome with complex signatures of tobacco exposure". Nature 463 (7278): 184-90. doi:10.1038/nature08629. PMC PMC2880489. PMID 20016488. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880489.
↑ Pleasance, E.; Cheetham, R.; Stephens, P. et al. (2010). "A comprehensive catalogue of somatic mutations from a human cancer genome". Nature 463 (7278): 191-6. doi:10.1038/nature08658. PMC PMC3145108. PMID 20016485. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3145108.
↑ Clark, M.; Homer, N.; O’Connor, B. et al. (2010). "U87MG decoded: The genomic sequence of a cytogenetically aberrant human cancer cell line". PLOS Genetics 6 (1): e1000832. doi:10.1371/journal.pgen.1000832. PMC PMC2813426. PMID 20126413. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2813426.
↑ Rhead, B.; Karolchik, D.; Kuhn, R. et al. (2010). "The UCSC genome browser database: update 2010". Nucleic Acids Research 38 (Suppl 1): D613-D619. doi:10.1093/nar/gkp939. PMC PMC2808870. PMID 19906737. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808870.
↑ Mungall, C.; Emmert, D.; The FlyBase Consortium (2007). "A Chado case study: an ontology-based modular schema for representing genome-associated biological information". Bioinformatics 23 (13): i337-i346. doi:10.1093/bioinformatics/btm189. PMID 17646315.
↑ Hubbard, T.; Aken, B.; Beal, K. (2006). "Ensembl 2007". Nucleic Acids Research 35 (Suppl 1): D610-D617. doi:10.1093/nar/gkl996. PMC PMC1761443. PMID 17148474. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1761443.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Additionally, URLs to the project page have been updated from the old, deprecated SourceForge location to the current GitHub site.


