Difference between revisions of "Journal:Bioinformatics workflow for clinical whole genome sequencing at Partners HealthCare Personalized Medicine"



Practical factors such as sequencing costs, data processing and maintenance, and data analysis complexities are important when a [[laboratory]] is considering a new NGS program. These issues are amplified in whole genome sequencing (WGS) due to the volume of the data and have long been barriers to entry for [[Clinical laboratory|clinical laboratories]] looking to adopt WGS. Despite the ability of WGS to interrogate the entirety of the genome, clinical interpretation still often focuses on only 3% of the genome (i.e., exome data, pharmacogenomics risk variants, and single nucleotide variants associated with complex disease risk).<ref name="BieseckerTheClinSeq09">{{cite journal |title=The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine |journal=Genome Research |author=Biesecker, L.G.; Mullikin, J.C.; Facio, F.M. et al. |volume=19 |issue=9 |pages=1665-74 |year=2009 |doi=10.1101/gr.092841.109 |pmid=19602640 |pmc=PMC2752125}}</ref><ref name="DeweyClinical14">{{cite journal |title=Clinical interpretation and implications of whole-genome sequencing |journal=JAMA |author=Dewey, F.E.; Grove, M.E.; Pan, C. et al. |volume=311 |issue=10 |pages=1035-45 |year=2014 |doi=10.1001/jama.2014.1717 |pmid=24618965 |pmc=PMC4119063}}</ref><ref name="JiangDetect13">{{cite journal |title=Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing |journal=American Journal of Human Genetics |author=Jiang, Y.H.; Yuen, R.K.; Jin, X. et al. |volume=93 |issue=2 |pages=249-63 |year=2013 |doi=10.1016/j.ajhg.2013.06.012 |pmid=23849776 |pmc=PMC3738824}}</ref><ref name="LupskiWhole10">{{cite journal |title=Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy |journal=New England Journal of Medicine |author=Lupski, J.R.; Reid, J.G.; Gonzaga-Jauregui, C. et al. 
|volume=362 |issue=13 |pages=1181-91 |year=2010 |doi=10.1056/NEJMoa0908094 |pmid=20220177 |pmc=PMC4036802}}</ref> Therefore, WGS services may be overlooked for clinical applications as they trend towards increased costs and longer turnaround times due to a heavier computational load, increased number of variants for analysis, and larger data archives. However, the steadily decreasing cost of sequencing and storage now allows laboratories to consider genome sequencing. WGS, and more specifically PCR-free WGS, also decreases the need to re-sequence each time the coding sequences of targeted regions change, novel genes are discovered, or a new reference genome is released. The balance between cost, turnaround time, accuracy, and completeness has to be addressed when launching a WGS program. Here, we describe the workflow we adopted and the challenges we met supporting the bioinformatics of WGS in a clinical setting.
==Results==
===Bioinformatics validation===
Our pipeline performs robustly: different entry points and different runs of the same sample returned exactly the same variants. Using HapMap sample NA12878 and previously Sanger-confirmed regions, we identified thresholds of Quality by Depth (QD) ≥ 4 and Fisher Strand Bias (FS) ≤ 30 as providing the optimal balance of sensitivity and specificity. This sample has been well characterized by our laboratory and also by other groups.<ref name="MacArthurASys12">{{cite journal |title=A systematic survey of loss-of-function variants in human protein-coding genes |journal=Science |author=MacArthur, D.G.; Balasubramanian, S.; Frankish, A. et al. |volume=335 |issue=6070 |pages=823-8 |year=2012 |doi=10.1126/science.1215040 |pmid=22344438 |pmc=PMC3299548}}</ref><ref name="ZookInteg14">{{cite journal |title=Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls |journal=Nature Biotechnology |author=Zook, J.M.; Chapman, B.; Wang, J. et al. |volume=32 |issue=3 |pages=246–51 |year=2014 |doi=10.1038/nbt.2835 |pmid=24531798}}</ref> Of 425 total confirmed variants, all 425 were detected by genome sequencing for a sensitivity of 100% (95% CI: 99.1%–100% for SNVs and 79.6%–100% for indels; Table 1b). In addition, four likely reference sequence errors, positions where only the alternative allele has ever been identified, were correctly genotyped by genome sequencing as homozygous for the alternative allele; three of these had incorrect genotype calls in previous orthogonal assays. Beyond these true positive variants, calls were also made for 21 false positive (FP) variants, including 20 substitutions and one indel (Table 1a). After filtering variants with the QD and FS thresholds above, only one FP remained.
{|
| STYLE="vertical-align:top;"|
{| class="wikitable" border="1" cellpadding="5" cellspacing="0" width="70%"
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|'''Table 1.''' Sensitivity and Specificity of WGS. Utilizing HapMap sample NA12878, we previously identified and confirmed via Sanger sequencing 425 variants in 195 genes across ~700 kb of sequence. (a) Analysis of raw variants identified 21 false positive (FP) variants within this region of NA12878. After application of laboratory-defined thresholds (FS ≤ 30 and QD ≥ 4), this limited the number of FP to only one with no increase in false negatives (FN). (b) All 425 Sanger confirmed variants were identified in the WGS data above our quality thresholds. (c) Concordance with 1000 Genomes data for variants across the genome was high, particularly for SNVs. Concordance remained high when matching on predicted genotype.
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|(a) Specificity
|-
  ! style="padding-left:10px; padding-right:10px;" colspan="2"|Variant type
  ! style="padding-left:10px; padding-right:10px;" colspan="2"|FP (before Thresholds)
  ! style="padding-left:10px; padding-right:10px;" colspan="2"|FP (after Thresholds)
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|SNVs
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|20
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|1
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|Indels
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|1
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|0
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|(b) Sensitivity
|-
  ! style="padding-left:10px; padding-right:10px;" colspan="2"|Variant type
  ! style="padding-left:10px; padding-right:10px;"|#
  ! style="padding-left:10px; padding-right:10px;"|FN
  ! style="padding-left:10px; padding-right:10px;"|Sensitivity
  ! style="padding-left:10px; padding-right:10px;"|95% CI
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|SNVs
  | style="background-color:white; padding-left:10px; padding-right:10px;"|410
  | style="background-color:white; padding-left:10px; padding-right:10px;"|0
  | style="background-color:white; padding-left:10px; padding-right:10px;"|100%
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.1%–100%
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="2"|Indels
  | style="background-color:white; padding-left:10px; padding-right:10px;"|15
  | style="background-color:white; padding-left:10px; padding-right:10px;"|0
  | style="background-color:white; padding-left:10px; padding-right:10px;"|100%
  | style="background-color:white; padding-left:10px; padding-right:10px;"|79.6%–100%
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;" colspan="6"|(c) Concordance with 1000 Genomes data
|-
  ! style="padding-left:10px; padding-right:10px;"|Variant type
  ! style="padding-left:10px; padding-right:10px;"|1K Genomes Variants
  ! style="padding-left:10px; padding-right:10px;"|Present in NGS Calls
  ! style="padding-left:10px; padding-right:10px;"|% Present in NGS Calls
  ! style="padding-left:10px; padding-right:10px;"|Present in NGS Calls with Matched Genotypes
  ! style="padding-left:10px; padding-right:10px;"|% Present in NGS Calls with Matched Genotypes
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|SNVs
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2,762,933
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2,735,592
  | style="background-color:white; padding-left:10px; padding-right:10px;"|99.01%
  | style="background-color:white; padding-left:10px; padding-right:10px;"|2,730,826
  | style="background-color:white; padding-left:10px; padding-right:10px;"|98.84%
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|Indels
  | style="background-color:white; padding-left:10px; padding-right:10px;"|327,474
  | style="background-color:white; padding-left:10px; padding-right:10px;"|299,300
  | style="background-color:white; padding-left:10px; padding-right:10px;"|91.39%
  | style="background-color:white; padding-left:10px; padding-right:10px;"|285,401
  | style="background-color:white; padding-left:10px; padding-right:10px;"|87.15%
|-
  | style="background-color:white; padding-left:10px; padding-right:10px;"|'''Total'''
  | style="background-color:white; padding-left:10px; padding-right:10px;"|3,090,407
  | style="background-color:white; padding-left:10px; padding-right:10px;"|3,034,892
  | style="background-color:white; padding-left:10px; padding-right:10px;"|98.20%
  | style="background-color:white; padding-left:10px; padding-right:10px;"|3,016,227
  | style="background-color:white; padding-left:10px; padding-right:10px;"|97.60%
|-
|}
|}


==References==

Revision as of 20:07, 21 November 2016

Full article title: Bioinformatics workflow for clinical whole genome sequencing at Partners HealthCare Personalized Medicine
Journal: Journal of Personalized Medicine
Author(s): Tsai, E.A.; Shakbatyan, R.; Evan, J.; Rossetti, P.; Graham, C.; Sharma, H.; Lin, C.-F.; Lebo, M.S.
Author affiliation(s): Partners HealthCare, Brigham and Women’s Hospital, Harvard Medical School
Primary contact: Tel.: +1-617-768-8292
Editors: Weiss, S.T.; Liggett, S.B.
Year published: 2016
Volume and issue: 6(1)
Page(s): 12
DOI: 10.3390/jpm6010012
ISSN: 2075-4426
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2075-4426/6/1/12/htm
Download: http://www.mdpi.com/2075-4426/6/1/12/pdf (PDF)

Abstract

Effective implementation of precision medicine will be enhanced by a thorough understanding of each patient’s genetic composition to better treat his or her presenting symptoms or mitigate the onset of disease. This ideally includes the sequence information of a complete genome for each individual. At Partners HealthCare Personalized Medicine, we have developed a clinical process for whole genome sequencing (WGS) with application in both healthy individuals and those with disease. In this manuscript, we will describe our bioinformatics strategy to efficiently process and deliver genomic data to geneticists for clinical interpretation. We describe the handling of data from FASTQ to the final variant list for clinical review for the final report. We will also discuss our methodology for validating this workflow and the cost implications of running WGS.

Keywords: clinical sequencing, WGS, NGS, next generation sequencing, bioinformatics, validation, precision medicine

Introduction

Precision medicine is an increasingly important focus of medical research.[1] To personalize clinical care, greater attention has been drawn towards examining the patient genome at higher resolution. Next generation sequencing (NGS) provided a cost-effective method for targeted sequencing of known disease genes at base pair resolution.[2] Moreover, the advent of exome sequencing enabled rapid discovery of genes causing Mendelian disorders. While gene panels and exome sequencing have proved fast and cost-effective for delivering genomic results back to the patient, these technologies are limited by our current knowledge of the exome, which changes over time. Additionally, the use of targeted capture may introduce biases to the data, including PCR duplicates, depth of coverage disparities, and failures at difficult-to-amplify target regions.[3]

Practical factors such as sequencing costs, data processing and maintenance, and data analysis complexities are important when a laboratory is considering a new NGS program. These issues are amplified in whole genome sequencing (WGS) due to the volume of the data and have long been barriers to entry for clinical laboratories looking to adopt WGS. Despite the ability of WGS to interrogate the entirety of the genome, clinical interpretation still often focuses on only 3% of the genome (i.e., exome data, pharmacogenomics risk variants, and single nucleotide variants associated with complex disease risk).[4][5][6][7] Therefore, WGS services may be overlooked for clinical applications as they trend towards increased costs and longer turnaround times due to a heavier computational load, increased number of variants for analysis, and larger data archives. However, the steadily decreasing cost of sequencing and storage now allows laboratories to consider genome sequencing. WGS, and more specifically PCR-free WGS, also decreases the need to re-sequence each time the coding sequences of targeted regions change, novel genes are discovered, or a new reference genome is released. The balance between cost, turnaround time, accuracy, and completeness has to be addressed when launching a WGS program. Here, we describe the workflow we adopted and the challenges we met supporting the bioinformatics of WGS in a clinical setting.

Results

Bioinformatics validation

Our pipeline performs robustly: different entry points and different runs of the same sample returned exactly the same variants. Using HapMap sample NA12878 and previously Sanger-confirmed regions, we identified thresholds of Quality by Depth (QD) ≥ 4 and Fisher Strand Bias (FS) ≤ 30 as providing the optimal balance of sensitivity and specificity. This sample has been well characterized by our laboratory and also by other groups.[8][9] Of 425 total confirmed variants, all 425 were detected by genome sequencing for a sensitivity of 100% (95% CI: 99.1%–100% for SNVs and 79.6%–100% for indels; Table 1b). In addition, four likely reference sequence errors, positions where only the alternative allele has ever been identified, were correctly genotyped by genome sequencing as homozygous for the alternative allele; three of these had incorrect genotype calls in previous orthogonal assays. Beyond these true positive variants, calls were also made for 21 false positive (FP) variants, including 20 substitutions and one indel (Table 1a). After filtering variants with the QD and FS thresholds above, only one FP remained.
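
As a rough illustration of the thresholding step described above, the following sketch keeps only variants with QD ≥ 4 and FS ≤ 30. It is not the authors' pipeline code: the record layout (a dictionary with an `info` mapping), the function names, and the choice to reject variants with a missing annotation are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Thresholds reported in the text.
QD_MIN = 4.0   # Quality by Depth must be at least this value
FS_MAX = 30.0  # Fisher Strand Bias must be at most this value


def passes_thresholds(info: Dict[str, Optional[float]],
                      qd_min: float = QD_MIN,
                      fs_max: float = FS_MAX) -> bool:
    """Return True when a variant's annotations satisfy both thresholds."""
    qd = info.get("QD")
    fs = info.get("FS")
    # Assumption: a variant lacking either annotation is rejected.
    if qd is None or fs is None:
        return False
    return qd >= qd_min and fs <= fs_max


def filter_variants(variants: List[dict]) -> List[dict]:
    """Keep only the variants that pass the QD and FS thresholds."""
    return [v for v in variants if passes_thresholds(v["info"])]


# Hypothetical records, invented purely to exercise the filter.
example_variants = [
    {"id": "var1", "info": {"QD": 12.5, "FS": 1.2}},   # passes both
    {"id": "var2", "info": {"QD": 3.1, "FS": 0.0}},    # fails QD
    {"id": "var3", "info": {"QD": 20.0, "FS": 45.7}},  # fails FS
    {"id": "var4", "info": {"QD": 4.0, "FS": 30.0}},   # boundary values pass
    {"id": "var5", "info": {"FS": 2.0}},               # missing QD
]

kept_ids = [v["id"] for v in filter_variants(example_variants)]
```

Both comparisons are inclusive, matching the "≥ 4" and "≤ 30" wording, so a variant sitting exactly on a threshold is retained.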

Table 1. Sensitivity and Specificity of WGS. Utilizing HapMap sample NA12878, we previously identified and confirmed via Sanger sequencing 425 variants in 195 genes across ~700 kb of sequence. (a) Analysis of raw variants identified 21 false positive (FP) variants within this region of NA12878. After application of laboratory-defined thresholds (FS ≤ 30 and QD ≥ 4), this limited the number of FP to only one with no increase in false negatives (FN). (b) All 425 Sanger confirmed variants were identified in the WGS data above our quality thresholds. (c) Concordance with 1000 Genomes data for variants across the genome was high, particularly for SNVs. Concordance remained high when matching on predicted genotype.
(a) Specificity
Variant type | FP (before Thresholds) | FP (after Thresholds)
SNVs | 20 | 1
Indels | 1 | 0
(b) Sensitivity
Variant type | # | FN | Sensitivity | 95% CI
SNVs | 410 | 0 | 100% | 99.1%–100%
Indels | 15 | 0 | 100% | 79.6%–100%
(c) Concordance with 1000 Genomes data
Variant type | 1K Genomes Variants | Present in NGS Calls | % Present in NGS Calls | Present in NGS Calls with Matched Genotypes | % Present in NGS Calls with Matched Genotypes
SNVs | 2,762,933 | 2,735,592 | 99.01% | 2,730,826 | 98.84%
Indels | 327,474 | 299,300 | 91.39% | 285,401 | 87.15%
Total | 3,090,407 | 3,034,892 | 98.20% | 3,016,227 | 97.60%
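
The text does not state how the 95% confidence intervals in Table 1 were computed. As an illustration only, the sketch below uses the Wilson score interval, which happens to give lower bounds matching the reported 99.1% (410 of 410 SNVs) and 79.6% (15 of 15 indels). The choice of method and the function name are assumptions for this example, not a description of the authors' analysis.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes: int, total: int,
                    z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion, as (lower, upper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p_hat = successes / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Counts taken from the sensitivity part of Table 1.
snv_lower, snv_upper = wilson_interval(410, 410)
indel_lower, indel_upper = wilson_interval(15, 15)

print(f"SNVs:   {snv_lower:.1%} to {snv_upper:.0%}")
print(f"Indels: {indel_lower:.1%} to {indel_upper:.0%}")
```

When every variant is detected, the Wilson lower bound reduces to n / (n + z²), which makes clear why the indel interval is much wider: 15 confirmed indels constrain the true sensitivity far less than 410 confirmed SNVs do.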


References

  1. Collins, F.S.; Varmus, H. (2015). "A new initiative on precision medicine". New England Journal of Medicine 372 (9): 793–5. doi:10.1056/NEJMp1500523. PMID 25635347. 
  2. Alfares, A.A.; Kelly, M.A.; McDermott, G. et al. (2015). "Results of clinical genetic testing of 2,912 probands with hypertrophic cardiomyopathy: Expanded panels offer limited additional sensitivity". Genetics in Medicine 17 (11): 880–8. doi:10.1038/gim.2014.205. PMID 25611685. 
  3. Harismendy, O.; Ng, P.C.; Strausberg, R.L. et al. (2009). "Evaluation of next generation sequencing platforms for population targeted sequencing studies". Genome Biology 10 (3): R32. doi:10.1186/gb-2009-10-3-r32. PMC PMC2691003. PMID 19327155. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2691003. 
  4. Biesecker, L.G.; Mullikin, J.C.; Facio, F.M. et al. (2009). "The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine". Genome Research 19 (9): 1665–74. doi:10.1101/gr.092841.109. PMC PMC2752125. PMID 19602640. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752125. 
  5. Dewey, F.E.; Grove, M.E.; Pan, C. et al. (2014). "Clinical interpretation and implications of whole-genome sequencing". JAMA 311 (10): 1035–45. doi:10.1001/jama.2014.1717. PMC PMC4119063. PMID 24618965. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4119063. 
  6. Jiang, Y.H.; Yuen, R.K.; Jin, X. et al. (2013). "Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing". American Journal of Human Genetics 93 (2): 249–63. doi:10.1016/j.ajhg.2013.06.012. PMC PMC3738824. PMID 23849776. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3738824. 
  7. Lupski, J.R.; Reid, J.G.; Gonzaga-Jauregui, C. et al. (2010). "Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy". New England Journal of Medicine 362 (13): 1181–91. doi:10.1056/NEJMoa0908094. PMC PMC4036802. PMID 20220177. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4036802. 
  8. MacArthur, D.G.; Balasubramanian, S.; Frankish, A. et al. (2012). "A systematic survey of loss-of-function variants in human protein-coding genes". Science 335 (6070): 823–8. doi:10.1126/science.1215040. PMC PMC3299548. PMID 22344438. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299548. 
  9. Zook, J.M.; Chapman, B.; Wang, J. et al. (2014). "Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls". Nature Biotechnology 32 (3): 246–51. doi:10.1038/nbt.2835. PMID 24531798. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.