Journal:DGW: An exploratory data analysis tool for clustering and visualisation of epigenomic marks

From LIMSWiki
Revision as of 22:56, 30 January 2017 by Shawndouglas (talk | contribs) (Testing)
Jump to navigationJump to search
Full article title DGW: An exploratory data analysis tool for clustering and visualisation of epigenomic marks
Journal BMC Bioinformatics
Author(s) Lukauskas, Saulius; Visintainer, Roberto; Sanguinetti, Guido; Schweikert, Gabriele B.
Author affiliation(s) Imperial College London, Fondazione Bruno Kessler, University of Edinburgh
Primary contact Email: saulius dot lukauskas13 at imperial dot ac dot uk
Year published 2016
Volume and issue 17(Suppl 16)
Page(s) 447
DOI 10.1186/s12859-016-1306-0
ISSN 1471-2105
Distribution license Creative Commons Attribution 4.0 International
Website http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1306-0
Download http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1306-0 (PDF)

Abstract

Background: Functional genomic and epigenomic research relies fundamentally on sequencing-based methods like ChIP-seq for the detection of DNA-protein interactions. These techniques return large, high-dimensional data sets with visually complex structures, such as multi-modal peaks extended over large genomic regions. Current tools for visualisation and data exploration represent and leverage these complex features only to a limited extent.

Results: We present DGW (Dynamic Gene Warping), an open-source software package for simultaneous alignment and clustering of multiple epigenomic marks. DGW uses dynamic time warping to adaptively rescale and align genomic distances which allows to group regions of interest with similar shapes, thereby capturing the structure of epigenomic marks. We demonstrate the effectiveness of the approach in a simulation study and on a real epigenomic data set from the ENCODE project.

Conclusions: Our results show that DGW automatically recognises and aligns important genomic features such as transcription start sites and splicing sites from histone marks. DGW is available as an open-source Python package.

Keywords: Clustering, ChIP-seq, epigenetics, dynamic time warping

Background

Sequencing-based technologies such as ChIP-Seq and DNAse-Seq [e.g., reviewed in Furey 2012[1]] have revolutionized our understanding of chromatin structure and function, yielding deep insights in the importance of epigenomic marks in the basic processes of life. The emergent picture is that gene expression is controlled by a complex interplay of protein binding and epigenomic modifications. While histone marks (and other epigenomic marks) can be measured in a high-throughput way, exploratory data analysis techniques for these data types are still being developed. Epigenomic marks exhibit characteristics that distinguish them fundamentally from e.g., mRNA gene expression measurements: they are spatially extended across regions as wide as several kilobases within which they often present interesting local structures, such as the presence of multiple peaks and troughs[2], and intriguing asymmetries[3] (see Fig. 1).


Fig1 Lukauskas BMCBioinformatics2016 17-Supp16.gif

Figure 1. The epigenomic marks H3K4me3 (left) and H3K9ac (right) accumulate around transcription start sites often showing a bimodal peak with a valley over the TSS. Shown are two biological replicates for each mark and the input signal. Y axis corresponds to read counts. Annotated genes and the enriched regions called by MACS2 are shown in grey below each profile.

The shape of epigenomic marks across replicate data sets appears to be highly conserved, and has recently been exploited for statistical testing.[4] While the biological reasons for such conservation are not entirely clear, recent studies have suggested that both architectural and regulatory aspects may be at play. Bieberstein and colleagues showed intriguing patterns of accumulation of the histone marks H3K4me3 and H3K9ac at splice sites[5], hinting at an architectural origin of the shape of the marks. More recently, Benveniste et al. showed that histone marks can be very well predicted genome-wide by the binding patterns of transcription factors (TFs).[6] The shape of the peak may therefore be a readout of additional chromatin-related events and genomic regions which are similarly marked may therefore hint at common regulatory or architectural features. Excellent visualisation tools (e.g., UCSC genome browser) enable researchers to appreciate such features for individual enrichment peaks. However, while automatically grouping such marks based on shape similarity may be a valuable tool for hypothesis generation, it has remained a non-trivial task.

Current approaches to clustering regions based on chromatin signatures can be broadly split into two camps: global approaches, such as the celebrated HMM-based reconstruction of the "colours of the chromatin,"[7] try to find a segmentation of large genomic regions based on histone signatures. These approaches usually rely on the "presence vs. absence" characterization of histone marks at genomic loci, such that the clustering is primarily based on combinatorial patterns of multiple histone marks, as opposed to spatial patterns emerging within individual peaks.

Another interesting segmentation approach was recently introduced by Knijnenburg and colleagues.[8] Here, signal enrichment is considered across a wide range of scales spanning several orders of magnitude. While this constitutes a significant improvement compared to earlier approaches, signal patterns within segments are again not taken into account. On the other hand, local approaches attempt to cluster short genomic regions at particular loci based on the quantitative binding or modification pattern measured at the loci (e.g., via ChIP-Seq). Examples of these approaches include the ENCODE Cluster Aggregation Tool (CAGT)[3], or the clustering of genes based on PolII binding profiles.[9]

Local approaches have to address two challenging problems: aligning the peaks to a reference, and standardising the peaks so that they can be represented as vectors of equal dimensions. To align regions, both the method by Taslim and colleagues as well as the CAGT tool rely on anchor points (e.g., transcription start sites (TSS)[9] or transcription factor binding sites from auxiliary ChIP-Seq experiments[3]). The regions are then standardised either by rescaling to a fixed gene length[9] or by applying windows of fixed length either side of the anchor points[3] irrespective of the true extent of the local enrichment. These strategies may be plausible for certain applications. However, the shape and extent of histone marks for instance, appear to be determined by many factors[5], such that a uniform rescaling may be inappropriate. In particular, if one made the assumption that epigenomic marks are directly or indirectly influenced by the underlying DNA sequence, it becomes clear that more flexibility in the comparison and alignment of these marks is needed. For example, ortholog genes may share similar sequence features but their sequence length may vary. Sequence comparisons therefore in general do not require the considered sequences to be of equal length; they allow for insertions, deletions, shifts. Similar local variations should therefore be allowed when comparing epigenomic marks.

In this work, we address the problem of aligning and clustering epigenomic data in a completely unsupervised way: as input data we use ChIP-Seq enrichment measurements within peak regions identified by a peak finder such as MACS.[10] The alignment and the standardisation problems are solved simultaneously without the use of additional information, such as transcription start sites or gene annotation. We introduce a local rescaling which allows to match epigenomic marks based on maximum similarity between shapes. Our method/software, Dynamic Genome Warping (DGW), is based on the classical dynamic time warping algorithm[11][12], which enabled computer scientists to construct robust speech recognisers undeterred by the variability in pitch and speed of enunciation. In DGW we have implemented multidimensional alignment and clustering, such that multiple epigenomic tracks can be analysed simultaneously. This feature can also be used to control for local sequencing bias as DNA inputs or IGG controls can easily be added to the analysis. We first test DGW in a simulation study. Subsequently, we demonstrate that DGW can align genomic landmarks such as TSSs and first splicing sites (FSSs) on real epigenomic data from the ENCODE project[13], thus effectively and automatically solving both the alignment and the standardization problems. DGW is freely available as a stand-alone, platform-independent and fully documented Python package.

Methods

We will first motivate and illustrate our method on a particular data set of histone modifications from the ENCODE project[13], measuring tri-methylation of histone 3 at lysine 4 (H3K4me3) and acetylation of histone 3 at lysine 9 (H3K9ac) in human leukaemia cell line K562. The reason for choosing these two specific marks is that they are known to be characteristically enriched in the flanking regions of TSSs[2] and they were recently shown to accumulate at FSSs[5], hence providing direct evidence of the biological relevance of both the alignment and standardisation problems.

Aligned fragments (BAM files) of both epigenomic marks were processed with the MACS2 peak caller[10] to identify regions which showed enrichment relative to a input control sample; we then merged the two sets by considering every region called for either mark. We stress that the method is independent of the specific marks chosen, or the choice of peak caller, and is readily extendable to other types of genomic and epigenomic data.

Enriched regions normally have very different lengths; nevertheless, visual inspection of peaks can reveal similarities between the shape of the peaks. These similarities are often visualised through a global averaging (aggregation) of the marks[2]; nevertheless, there are strong arguments that global averaging may also mask more subtle patterns. A useful motivating example is given in Fig. 1, which shows four regions which are enriched in the H3K4me3 as well as H3K9ac marks. They all overlap with genes and exhibit broadly similar shapes: a bimodal peak with a trough over the TSS. However, the total lengths of the enriched regions vary, and so does the extent of the two individual sub-peaks, which could be governed by the underlying gene structure. Therefore, the position of the TSS relative to the start of the enriched regions varies.

Dynamic genome warping

To automatically quantify the similarities between peaks such as the ones shown in Fig. 1, we use the classic dynamic time warping (DTW) algorithm.[11][14] A modern review of the basic concepts of dynamic time warping can be found e.g. in Müller 2007.[12] It was originally introduced in the speech recognition community to robustly recognize speech independently of speech speed. There, the problem was to match waveforms of similar shape but potentially different duration. Likewise, our aim is to be able to associate peaks which exhibit similar local structure (shape) regardless of their spatial extension.

Specifically, let a=(a1,…,aN) and b=(b1,…,bM) be two sequences with values

References

  1. Furey, T.S. (2012). "Chip-seq and beyond: New and improved methodologies to detect and characterize protein-DNA interactions". Nature Reviews Genetics 13 (12): 840-52. doi:10.1038/nrg3306. PMC PMC3591838. PMID 23090257. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591838. 
  2. 2.0 2.1 2.2 Barski, A.; Cuddapah, S.; Cui, K. et al. (2007). "High-resolution profiling of histone methylations in the human genome". Cell 129 (4): 823-37. doi:10.1016/j.cell.2007.05.009. PMID 17512414. 
  3. 3.0 3.1 3.2 3.3 Kundaje, A.; Kyriazopoulou-Panagiotopoulou, S.; Libbrecht, M. et al. (2012). "Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements". Genome Research 22 (9): 1735-47. doi:10.1101/gr.136366.111. PMC PMC3431490. PMID 22955985. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431490. 
  4. Schweikert, G.;, Cseke, B.; Clouaire, T. et al. (2013). "MMDiff: Quantitative testing for shape changes in ChIP-Seq data sets". BMC Genomics 14: 826. doi:10.1186/1471-2164-14-826. PMC PMC4008153. PMID 24267901. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4008153. 
  5. 5.0 5.1 5.2 Bieberstein, N.I.; Carrillo Oesterreich, F.; Straube, K. et al. (2012). "First exon length controls active chromatin signatures and transcription". Cell Reports 2 (1): 62–8. doi:10.1016/j.celrep.2012.05.019. PMID 22840397. 
  6. Benveniste, D.; Sonntag, H.J.; Sanguinetti, G. et al. (2014). "Transcription factor binding predicts histone modifications in human cell lines". Proceedings of the National Academy of Sciences of the United States of America 111 (37): 13367-72. doi:10.1073/pnas.1412081111. PMC PMC4169916. PMID 25187560. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4169916. 
  7. Filion, G.J.; van Bemmel, J.G.; Braunschweig, U. et al. (2010). Systematic protein location mapping reveals five principal chromatin types in Drosophila cells. 143. pp. 212–24. doi:10.1016/j.cell.2010.09.009. PMC PMC3119929. PMID 20888037. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3119929. 
  8. Knijnenburg, T.A.; Ramsey, S.A.; Berman B.P. et al. (2014). Multiscale representation of genomic signals. 11. pp. 689-94. doi:10.1038/nmeth.2924. PMC PMC4040162. PMID 24727652. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4040162. 
  9. 9.0 9.1 9.2 Taslim, C.; Wu, J.; Yan, P. et al. (2009). Comparative study on ChIP-seq data: normalization and binding pattern characterization. 25. pp. 2334-40. doi:10.1093/bioinformatics/btp384. PMC PMC2800347. PMID 19561022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2800347. 
  10. 10.0 10.1 Zhang, Y.; Liu, T.; Meyer, C.A. et al. (2008). "Model-based analysis of ChIP-Seq (MACS)". Genome Biology 9 (9): R137. doi:10.1186/gb-2008-9-9-r137. PMC PMC2592715. PMID 18798982. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2592715. 
  11. 11.0 11.1 Sakoe, H.; Chiba, S. (1978). "Dynamic programming algorithm optimization for spoken word recognition". IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1): 62–8. doi:10.1109/TASSP.1978.1163055. 
  12. 12.0 12.1 Müller, M. (2007). Information Retrieval for Music and Motion. Springer-Verlag Berlin Heidelberg. pp. 318. doi:10.1007/978-3-540-74048-3. ISBN 9783540740476. 
  13. 13.0 13.1 ENCODE Project Consortium (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57-74. doi:10.1038/nature11247. PMC PMC3439153. PMID 22955616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153. 
  14. Giorgino, T. (2009). "Computing and Visualizing Dynamic Time Warping Alignments in R: The dtw Package". Journal of Statistical Software 31 (7). doi:10.18637/jss.v031.i07. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. In some cases, the authors directly referenced a citation number; the author and year of the citation was inserted along with the citation for completeness.