Journal:Bioinformatics workflow for clinical whole genome sequencing at Partners HealthCare Personalized Medicine

From LIMSWiki
Full article title: Bioinformatics workflow for clinical whole genome sequencing at Partners HealthCare Personalized Medicine
Journal: Journal of Personalized Medicine
Author(s): Tsai, E.A.; Shakbatyan, R.; Evan, J.; Rossetti, P.; Graham, C.; Sharma, H.; Lin, C.-F.; Lebo, M.S.
Author affiliation(s): Partners HealthCare, Brigham and Women’s Hospital, Harvard Medical School
Primary contact: Tel.: +1-617-768-8292
Editors: Weiss, S.T.; Liggett, S.B.
Year published: 2016
Volume and issue: 6(1)
Page(s): 12
DOI: 10.3390/jpm6010012
ISSN: 2075-4426
Distribution license: Creative Commons Attribution 4.0 International
Website: http://www.mdpi.com/2075-4426/6/1/12/htm
Download: http://www.mdpi.com/2075-4426/6/1/12/pdf (PDF)

Abstract

Effective implementation of precision medicine will be enhanced by a thorough understanding of each patient’s genetic composition to better treat his or her presenting symptoms or mitigate the onset of disease. This ideally includes the sequence information of a complete genome for each individual. At Partners HealthCare Personalized Medicine, we have developed a clinical process for whole genome sequencing (WGS) with application in both healthy individuals and those with disease. In this manuscript, we will describe our bioinformatics strategy to efficiently process and deliver genomic data to geneticists for clinical interpretation. We describe the handling of data from FASTQ to the final variant list for clinical review for the final report. We will also discuss our methodology for validating this workflow and the cost implications of running WGS.

Keywords: clinical sequencing, WGS, NGS, next generation sequencing, bioinformatics, validation, precision medicine

Introduction

Precision medicine is an increasing focus of medical research.[1] To achieve the resolution necessary to personalize clinical care, greater attention has been drawn to higher-resolution analysis of the patient genome. Next generation sequencing (NGS) provided a cost-effective method for targeted sequencing of known disease genes at base pair resolution.[2] Moreover, the advent of exome sequencing enabled rapid discovery of genes causing Mendelian disorders. While gene panels and exome sequencing have proved fast and cost-effective for delivering genomic results back to the patient, these technologies are limited by our current knowledge of the exome, which changes over time. Additionally, the use of targeted capture may introduce biases to the data, including PCR duplicates, depth of coverage disparities, and failures at difficult-to-amplify target regions.[3]

Sequencing costs, data processing and maintenance, and data analysis complexity are important practical considerations for a laboratory weighing a new NGS program. These issues are amplified in whole genome sequencing (WGS) due to the volume of the data and have long been barriers to entry for clinical laboratories looking to adopt WGS. Despite the ability of WGS to interrogate the entirety of the genome, clinical interpretation still often focuses on only 3% of the genome (i.e., exome data, pharmacogenomics risk variants, and single nucleotide variants associated with complex disease risk).[4][5][6][7] Therefore, WGS services may be overlooked for clinical applications as they trend towards increased costs and longer turnaround times due to a heavier computational load, an increased number of variants for analysis, and larger data archives. However, the steadily decreasing cost of sequencing and storage now allows laboratories to consider genome sequencing. WGS, and more specifically PCR-free WGS, also decreases the need to re-sequence each time the coding sequences of targeted regions change, novel genes are discovered, or a new reference genome is released. The balance between cost, turnaround time, accuracy, and completeness has to be addressed when launching a WGS program. Here, we describe the workflow we adopted and the challenges we met supporting the bioinformatics of WGS in a clinical setting.

Results

Bioinformatics validation

Our pipeline performs robustly: different entry points and different runs of the same sample returned exactly the same variants. Using HapMap sample NA12878 and previously Sanger-confirmed regions, we identified thresholds of Quality by Depth (QD) ≥ 4 and Fisher Strand Bias (FS) ≤ 30 as providing the optimal balance of sensitivity and specificity. This sample has been well characterized by our laboratory and also by other groups.[8][9] Of 425 total confirmed variants, all 425 were detected by genome sequencing for a sensitivity of 100% (95% CI: 99.1%–100% for SNVs and 79.6%–100% for indels; Table 1b). In addition, four likely reference sequence errors, positions where only the alternative allele has ever been identified, were correctly genotyped by genome sequencing as homozygous for the alternative allele; three of these had incorrect genotype calls with previous orthogonal assays. In addition to these true positive variants, calls were also made for 21 false positive (FP) variants, including 20 substitutions and one indel (Table 1a). After filtering variants on the QD and FS thresholds above, only one FP remained.
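As a minimal sketch of the threshold filter described above, the Python below keeps only calls with QD ≥ 4 and FS ≤ 30. Only the two threshold values come from the text; the `VariantCall` record, the helper names, and the example calls are hypothetical illustrations and not the authors' pipeline code.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Thresholds reported in the text.
MIN_QD = 4.0   # Quality by Depth must be at least this value
MAX_FS = 30.0  # Fisher Strand Bias must be at most this value


@dataclass(frozen=True)
class VariantCall:
    """Hypothetical, simplified record for one variant call."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qd: float
    fs: float


def passes_thresholds(call: VariantCall,
                      min_qd: float = MIN_QD,
                      max_fs: float = MAX_FS) -> bool:
    """Return True when the call meets both quality thresholds."""
    return call.qd >= min_qd and call.fs <= max_fs


def filter_calls(calls: Iterable[VariantCall]) -> List[VariantCall]:
    """Keep only calls that satisfy QD >= 4 and FS <= 30."""
    return [call for call in calls if passes_thresholds(call)]


# Made-up example calls, for illustration only.
EXAMPLE_CALLS = [
    VariantCall("chr1", 1000, "A", "G", qd=18.2, fs=1.4),   # kept
    VariantCall("chr1", 2000, "C", "T", qd=2.1, fs=0.0),    # fails QD
    VariantCall("chr2", 3000, "G", "GA", qd=12.0, fs=45.7), # fails FS
    VariantCall("chr2", 4000, "T", "C", qd=4.0, fs=30.0),   # kept (boundary)
]

if __name__ == "__main__":
    for call in filter_calls(EXAMPLE_CALLS):
        print(call.chrom, call.pos, call.ref, call.alt)
```

Both comparisons are inclusive, matching the "≥" and "≤" in the stated thresholds, so a call sitting exactly on a boundary is retained.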

Table 1. Sensitivity and Specificity of WGS. Utilizing HapMap sample NA12878, we previously identified and confirmed via Sanger sequencing 425 variants in 195 genes across ~700 kb of sequence. (a) Analysis of raw variants identified 21 false positive (FP) variants within this region of NA12878. Applying the laboratory-defined thresholds (FS ≤ 30 and QD ≥ 4) reduced the number of FPs to one with no increase in false negatives (FN). (b) All 425 Sanger-confirmed variants were identified in the WGS data above our quality thresholds. (c) Concordance with 1000 Genomes data for variants across the genome was high, particularly for SNVs. Concordance remained high when matching on predicted genotype.
(a) Specificity
Variant type | FP (before thresholds) | FP (after thresholds)
SNVs | 20 | 1
Indels | 1 | 0

(b) Sensitivity
Variant type | # Variants | FN | Sensitivity | 95% CI
SNVs | 410 | 0 | 100% | 99.1%–100%
Indels | 15 | 0 | 100% | 79.6%–100%

(c) Concordance with 1000 Genomes data
Variant type | 1K Genomes variants | Present in NGS calls | % Present in NGS calls | Present in NGS calls with matched genotypes | % Present in NGS calls with matched genotypes
SNVs | 2,762,933 | 2,735,592 | 99.01% | 2,730,826 | 98.84%
Indels | 327,474 | 299,300 | 91.39% | 285,401 | 87.15%
Total | 3,090,407 | 3,034,892 | 98.20% | 3,016,227 | 97.60%
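
The source does not state how the 95% confidence intervals in Table 1b were computed. One method that reproduces the reported lower bounds is the Wilson score interval, so the sketch below assumes that method; the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int,
                    z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% interval.
    Assumption: the article does not name its CI method.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p_hat = successes / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z2 / (4 * total * total)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    for label, n in (("SNVs", 410), ("Indels", 15)):
        low, high = wilson_interval(n, n)  # all n variants detected
        print(f"{label}: {100 * low:.1f}%-{100 * high:.1f}%")
```

When every variant is detected, the lower bound simplifies to n / (n + z²), which gives roughly 99.1% for 410 SNVs and 79.6% for 15 indels. The much wider indel interval reflects the small number of confirmed indels rather than any observed miss.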

A genome-wide comparison of the variants detected through this pipeline with those detected at a similar coverage through the 1000 Genomes Project for NA12878 gave similar results (Table 1c). Out of 2.7 million SNVs and 285,000 indels detected through our pipeline, we found that 98.8% of these SNVs were also called in the 1000 Genomes dataset. In addition, 97.6% of the total variants called were concordant for genotype. Annotation and variant filtration were also performed on NA12878. More than 50 variants were randomly selected for manual inspection to ensure that they were properly annotated and filtered.
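
To illustrate the two concordance measures in Table 1c (presence of a variant, and presence with a matching genotype), the following sketch compares two small call sets keyed by chromosome, position, reference allele, and alternate allele. The data structures, function names, genotype normalization, and toy calls are assumptions made for illustration; the article does not describe its comparison code.

```python
from typing import Dict, Tuple

# (chromosome, position, reference allele, alternate allele)
VariantKey = Tuple[str, int, str, str]


def normalize_genotype(genotype: str) -> Tuple[str, ...]:
    """Compare phased and unphased calls alike, e.g. '1|0' equals '0/1'."""
    return tuple(sorted(genotype.replace("|", "/").split("/")))


def concordance(reference: Dict[VariantKey, str],
                pipeline: Dict[VariantKey, str]) -> Dict[str, float]:
    """Fraction of reference variants found in the pipeline calls,
    with and without requiring the same genotype."""
    if not reference:
        raise ValueError("reference call set is empty")
    present = [key for key in reference if key in pipeline]
    matched = [
        key for key in present
        if normalize_genotype(reference[key]) == normalize_genotype(pipeline[key])
    ]
    total = len(reference)
    return {
        "total": total,
        "present": len(present),
        "pct_present": 100.0 * len(present) / total,
        "matched": len(matched),
        "pct_matched": 100.0 * len(matched) / total,
    }


# Toy call sets, for illustration only.
REFERENCE_CALLS = {
    ("chr1", 100, "A", "G"): "0/1",
    ("chr1", 200, "C", "T"): "1/1",
    ("chr2", 300, "G", "GA"): "0/1",
    ("chr3", 400, "T", "C"): "0/1",
}
PIPELINE_CALLS = {
    ("chr1", 100, "A", "G"): "1|0",   # present, genotype matches
    ("chr1", 200, "C", "T"): "0/1",   # present, genotype differs
    ("chr2", 300, "G", "GA"): "0/1",  # present, genotype matches
    ("chrX", 500, "A", "T"): "0/1",   # pipeline-only call, not counted
}

if __name__ == "__main__":
    print(concordance(REFERENCE_CALLS, PIPELINE_CALLS))
```

Both percentages use the reference call set as the denominator, mirroring the layout of Table 1c, where counts are expressed relative to the 1000 Genomes variants.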

Known regions of poor coverage

All of the genomes delivered to us from Illumina’s CLIA laboratory have at least 30X coverage, with an average coverage of 43X. However, coverage varies across the coding sequences of the genome, affecting both clinically relevant regions and regions of unknown clinical relevance. A gene-level list of the percentage of callable bases is provided in Supplementary Table S1. Of the 1381 genes with at least five variants asserted as Pathogenic or Likely pathogenic by clinical laboratories in ClinVar, 94 had <90% coverage across their coding region. The 20 genes with the poorest coverage metrics included many with high prevalence and clinical relevance (Table 2). In our clinical process, for analyses driven by indication-specific gene lists, we report the coverage of the analyzed genes and highlight those with coverage issues. This helps identify potential false negative findings, including those due to regions that are difficult to sequence with NGS technologies.

Table 2. Top 20 Poorly Covered Genes with Clinical Relevance. Clinical relevance is defined as having at least five Pathogenic or Likely pathogenic variants in ClinVar reported in the gene by submitting laboratories or working groups.
Gene | # Clinically significant variants | % Callable | Disease | Disease prevalence
STRC | 8 | 20 | Sensorineural hearing loss | Common
ADAMTSL2 | 5 | 32 | Geleophysic dysplasia | Rare
CYP21A2 | 13 | 44 | Congenital adrenal hyperplasia | Common
ARX | 19 | 45 | X-linked infantile spasm syndrome | Rare
MECP2 | 250 | 53 | Rett syndrome | Common
GJB1 | 16 | 53 | Charcot-Marie-Tooth disease | Common
ABCD1 | 33 | 57 | X-linked adrenoleukodystrophy | Moderate
EMD | 11 | 57 | Emery-Dreifuss muscular dystrophy | Moderate
G6PD | 16 | 58 | Glucose-6-phosphate dehydrogenase deficiency | Common
GATA1 | 12 | 60 | Dyserythropoietic anemia and thrombocytopenia | Rare
AVPR2 | 15 | 62 | Nephrogenic diabetes insipidus | Rare
EDA | 37 | 63 | Hypohidrotic ectodermal dysplasia | Moderate
SLC16A2 | 11 | 63 | Allan-Herndon-Dudley syndrome | Rare
FLNA | 42 | 64 | Otopalatodigital syndrome | Rare
EBP | 24 | 64 | X-linked chondrodysplasia punctata | Rare
RPGR | 17 | 64 | Retinitis pigmentosa | Common
TAZ | 17 | 64 | Barth syndrome | Rare
IDS | 16 | 64 | Hunter syndrome | Moderate
FGD1 | 8 | 64 | Aarskog-Scott syndrome | Rare
GPR143 | 6 | 65 | Ocular albinism | Moderate
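
The coverage reporting step described above can be sketched as a simple flagging routine: given the percentage of callable coding bases per gene, list the genes that fall below 90%. Treating "% Callable" as the coverage metric behind the <90% cutoff is an assumption, and the function name and the well-covered placeholder gene `GENE_A` are hypothetical; the three real gene values are taken from Table 2.

```python
from typing import Dict, List, Tuple

# Cutoff used in the text for flagging genes with coverage issues.
COVERAGE_THRESHOLD = 90.0


def flag_poorly_covered(percent_callable: Dict[str, float],
                        threshold: float = COVERAGE_THRESHOLD
                        ) -> List[Tuple[str, float]]:
    """Return (gene, % callable) pairs below the threshold, worst first."""
    flagged = [
        (gene, pct) for gene, pct in percent_callable.items()
        if pct < threshold
    ]
    # Sort by callable percentage, then gene name for a stable order.
    return sorted(flagged, key=lambda item: (item[1], item[0]))


# Hypothetical gene list for one indication; GENE_A is a placeholder.
GENE_LIST_COVERAGE = {
    "MECP2": 53.0,
    "GENE_A": 99.5,
    "STRC": 20.0,
    "CYP21A2": 44.0,
}

if __name__ == "__main__":
    for gene, pct in flag_poorly_covered(GENE_LIST_COVERAGE):
        print(f"{gene}: {pct:.0f}% callable (possible false negatives)")
```

Listing the worst-covered genes first puts the regions most likely to hide a false negative at the top of the coverage summary.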

References

  1. Collins, F.S.; Varmus, H. (2015). "A new initiative on precision medicine". New England Journal of Medicine 372 (9): 793–5. doi:10.1056/NEJMp1500523. PMID 25635347.
  2. Alfares, A.A.; Kelly, M.A.; McDermott, G. et al. (2015). "Results of clinical genetic testing of 2,912 probands with hypertrophic cardiomyopathy: Expanded panels offer limited additional sensitivity". Genetics in Medicine 17 (11): 880–8. doi:10.1038/gim.2014.205. PMID 25611685.
  3. Harismendy, O.; Ng, P.C.; Strausberg, R.L. et al. (2009). "Evaluation of next generation sequencing platforms for population targeted sequencing studies". Genome Biology 10 (3): R32. doi:10.1186/gb-2009-10-3-r32. PMCID PMC2691003. PMID 19327155. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2691003.
  4. Biesecker, L.G.; Mullikin, J.C.; Facio, F.M. et al. (2009). "The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine". Genome Research 19 (9): 1665–74. doi:10.1101/gr.092841.109. PMCID PMC2752125. PMID 19602640. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2752125.
  5. Dewey, F.E.; Grove, M.E.; Pan, C. et al. (2014). "Clinical interpretation and implications of whole-genome sequencing". JAMA 311 (10): 1035–45. doi:10.1001/jama.2014.1717. PMCID PMC4119063. PMID 24618965. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4119063.
  6. Jiang, Y.H.; Yuen, R.K.; Jin, X. et al. (2013). "Detection of clinically relevant genetic variants in autism spectrum disorder by whole-genome sequencing". American Journal of Human Genetics 93 (2): 249–63. doi:10.1016/j.ajhg.2013.06.012. PMCID PMC3738824. PMID 23849776. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3738824.
  7. Lupski, J.R.; Reid, J.G.; Gonzaga-Jauregui, C. et al. (2010). "Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy". New England Journal of Medicine 362 (13): 1181–91. doi:10.1056/NEJMoa0908094. PMCID PMC4036802. PMID 20220177. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4036802.
  8. MacArthur, D.G.; Balasubramanian, S.; Frankish, A. et al. (2012). "A systematic survey of loss-of-function variants in human protein-coding genes". Science 335 (6070): 823–8. doi:10.1126/science.1215040. PMCID PMC3299548. PMID 22344438. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3299548.
  9. Zook, J.M.; Chapman, B.; Wang, J. et al. (2014). "Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls". Nature Biotechnology 32 (3): 246–51. doi:10.1038/nbt.2835. PMID 24531798.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.