Difference between revisions of "Journal:DGW: An exploratory data analysis tool for clustering and visualisation of epigenomic marks"

Full article title	DGW: An exploratory data analysis tool for clustering and visualisation of epigenomic marks
Journal	BMC Bioinformatics
Author(s)	Lukauskas, Saulius; Visintainer, Roberto; Sanguinetti, Guido; Schweikert, Gabriele B.
Author affiliation(s)	Imperial College London, Fondazione Bruno Kessler, University of Edinburgh
Primary contact	Email: saulius dot lukauskas13 at imperial dot ac dot uk
Year published	2016
Volume and issue	17(Suppl 16)
Page(s)	447
DOI	10.1186/s12859-016-1306-0
ISSN	1471-2105
Distribution license	Creative Commons Attribution 4.0 International
Website	http://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1306-0
Download	http://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1306-0 (PDF)

Revision as of 21:18, 30 January 2017

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

Background: Functional genomic and epigenomic research relies fundamentally on sequencing-based methods like ChIP-seq for the detection of DNA-protein interactions. These techniques return large, high-dimensional data sets with visually complex structures, such as multi-modal peaks extended over large genomic regions. Current tools for visualisation and data exploration represent and leverage these complex features only to a limited extent.

Results: We present DGW (Dynamic Genome Warping), an open-source software package for simultaneous alignment and clustering of multiple epigenomic marks. DGW uses dynamic time warping to adaptively rescale and align genomic distances, which allows regions of interest with similar shapes to be grouped, thereby capturing the structure of epigenomic marks. We demonstrate the effectiveness of the approach in a simulation study and on a real epigenomic data set from the ENCODE project.

Conclusions: Our results show that DGW automatically recognises and aligns important genomic features such as transcription start sites and splicing sites from histone marks. DGW is available as an open-source Python package.

Keywords: Clustering, ChIP-seq, epigenetics, dynamic time warping

Background

Sequencing-based technologies such as ChIP-Seq and DNase-Seq [e.g., reviewed in Furey 2012^[1]] have revolutionized our understanding of chromatin structure and function, yielding deep insights into the importance of epigenomic marks in the basic processes of life. The emergent picture is that gene expression is controlled by a complex interplay of protein binding and epigenomic modifications. While histone marks (and other epigenomic marks) can be measured in a high-throughput way, exploratory data analysis techniques for these data types are still being developed. Epigenomic marks exhibit characteristics that distinguish them fundamentally from, e.g., mRNA gene expression measurements: they are spatially extended across regions as wide as several kilobases, within which they often present interesting local structures, such as multiple peaks and troughs^[2] and intriguing asymmetries^[3] (see Fig. 1).

Figure 1. The epigenomic marks H3K4me3 (left) and H3K9ac (right) accumulate around transcription start sites (TSSs), often showing a bimodal peak with a valley over the TSS. Shown are two biological replicates for each mark and the input signal. The y-axis corresponds to read counts. Annotated genes and the enriched regions called by MACS2 are shown in grey below each profile.

The shape of epigenomic marks across replicate data sets appears to be highly conserved, and has recently been exploited for statistical testing.^[4] While the biological reasons for such conservation are not entirely clear, recent studies have suggested that both architectural and regulatory aspects may be at play. Bieberstein and colleagues showed intriguing patterns of accumulation of the histone marks H3K4me3 and H3K9ac at splice sites^[5], hinting at an architectural origin of the shape of the marks. More recently, Benveniste et al. showed that histone marks can be very well predicted genome-wide by the binding patterns of transcription factors (TFs).^[6] The shape of the peak may therefore be a readout of additional chromatin-related events, and genomic regions that are similarly marked may hint at common regulatory or architectural features. Excellent visualisation tools (e.g., the UCSC genome browser) enable researchers to appreciate such features for individual enrichment peaks. However, while automatically grouping such marks based on shape similarity may be a valuable tool for hypothesis generation, it has remained a non-trivial task.

Current approaches to clustering regions based on chromatin signatures can be broadly split into two camps: global and local. Global approaches, such as the celebrated HMM-based reconstruction of the "colours of the chromatin,"^[7] try to find a segmentation of large genomic regions based on histone signatures. These approaches usually rely on a "presence vs. absence" characterization of histone marks at genomic loci, such that the clustering is primarily based on combinatorial patterns of multiple histone marks, as opposed to spatial patterns emerging within individual peaks.

Another interesting segmentation approach was recently introduced by Knijnenburg and colleagues.^[8] Here, signal enrichment is considered across a wide range of scales spanning several orders of magnitude. While this constitutes a significant improvement compared to earlier approaches, signal patterns within segments are again not taken into account. On the other hand, local approaches attempt to cluster short genomic regions at particular loci based on the quantitative binding or modification pattern measured at the loci (e.g., via ChIP-Seq). Examples of these approaches include the ENCODE Cluster Aggregation Tool (CAGT)^[3], or the clustering of genes based on PolII binding profiles.^[9]

Local approaches have to address two challenging problems: aligning the peaks to a reference, and standardising the peaks so that they can be represented as vectors of equal dimensions. To align regions, both the method by Taslim and colleagues and the CAGT tool rely on anchor points (e.g., transcription start sites (TSSs)^[9] or transcription factor binding sites from auxiliary ChIP-Seq experiments^[3]). The regions are then standardised either by rescaling to a fixed gene length^[9] or by applying windows of fixed length on either side of the anchor points^[3], irrespective of the true extent of the local enrichment. These strategies may be plausible for certain applications. However, the shape and extent of histone marks, for instance, appear to be determined by many factors^[5], such that a uniform rescaling may be inappropriate. In particular, if one assumes that epigenomic marks are directly or indirectly influenced by the underlying DNA sequence, it becomes clear that more flexibility in the comparison and alignment of these marks is needed. For example, orthologous genes may share similar sequence features but vary in sequence length. Sequence comparisons therefore do not, in general, require the considered sequences to be of equal length; they allow for insertions, deletions, and shifts. Similar local variations should therefore be allowed when comparing epigenomic marks.
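
The two standardisation strategies described above can be illustrated with a minimal Python sketch. It is not taken from CAGT or from the method of Taslim and colleagues. The function names rescale_to_length and window_around_anchor, the use of linear interpolation, and the zero padding at signal borders are all assumptions made for illustration.

```python
def rescale_to_length(profile, length):
    """Uniformly rescale a profile to `length` points by linear interpolation.

    Assumption: linear interpolation stands in for whatever rescaling a
    real tool applies.
    """
    if len(profile) < 2 or length < 2:
        raise ValueError("need at least two input points and two output points")
    step = (len(profile) - 1) / (length - 1)
    rescaled = []
    for k in range(length):
        pos = k * step
        # Clamp so that left + 1 is always a valid index.
        left = min(int(pos), len(profile) - 2)
        frac = pos - left
        rescaled.append(profile[left] * (1 - frac) + profile[left + 1] * frac)
    return rescaled


def window_around_anchor(signal, anchor, half_width):
    """Cut a fixed-length window centred on an anchor position (e.g., a TSS).

    Assumption: positions outside the signal are padded with zeros.
    """
    return [
        signal[i] if 0 <= i < len(signal) else 0.0
        for i in range(anchor - half_width, anchor + half_width + 1)
    ]


# Two peaks of different extent become vectors of equal dimension,
# but every position is stretched by the same factor.
short_peak = [0, 2, 4]
long_peak = [0, 1, 2, 3, 4, 3, 2, 1, 0]
print(rescale_to_length(short_peak, 5))
print(rescale_to_length(long_peak, 5))
print(window_around_anchor(long_peak, anchor=4, half_width=2))
```

Both functions return vectors of a fixed dimension, but neither adapts to local shape: the rescaling stretches every position by the same factor, and the window ignores how far the enrichment actually extends.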

In this work, we address the problem of aligning and clustering epigenomic data in a completely unsupervised way: as input data we use ChIP-Seq enrichment measurements within peak regions identified by a peak finder such as MACS.^[10] The alignment and the standardisation problems are solved simultaneously without the use of additional information, such as transcription start sites or gene annotation. We introduce a local rescaling which allows epigenomic marks to be matched based on maximum similarity between shapes. Our method/software, Dynamic Genome Warping (DGW), is based on the classical dynamic time warping algorithm^[11]^[12], which enabled computer scientists to construct robust speech recognisers undeterred by the variability in pitch and speed of enunciation. In DGW we have implemented multidimensional alignment and clustering, such that multiple epigenomic tracks can be analysed simultaneously. This feature can also be used to control for local sequencing bias, as DNA inputs or IgG controls can easily be added to the analysis. We first test DGW in a simulation study. Subsequently, we demonstrate that DGW can align genomic landmarks such as TSSs and first splicing sites (FSSs) on real epigenomic data from the ENCODE project^[13], thus effectively and automatically solving both the alignment and the standardisation problems. DGW is freely available as a stand-alone, platform-independent and fully documented Python package.
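
To make the warping idea concrete, the following is a minimal sketch of classical dynamic time warping in plain Python. It is not DGW's implementation and does not use the DGW package's API. The function names local_cost and dtw, the Euclidean local cost across tracks, and the unconstrained step pattern are assumptions chosen for illustration. Each position is a tuple holding one value per track, so several marks can be aligned together.

```python
import math


def local_cost(x, y):
    """Euclidean distance between two positions (one value per track).

    Assumption: Euclidean distance is used here purely for illustration.
    """
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def dtw(a, b):
    """Return (distance, warping path) for two non-empty sequences of tuples."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_previous = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = local_cost(a[i - 1], b[j - 1]) + best_previous
    # Backtrack from the end to recover which positions were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# A narrow peak and a locally stretched copy of it (single track).
narrow = [(0,), (1,), (3,), (1,), (0,)]
wide = [(0,), (1,), (1,), (3,), (3,), (1,), (0,)]
distance, path = dtw(narrow, wide)
print(distance)
print(path)
```

In this toy example the two profiles differ only by a local stretch, so the warping absorbs the difference in length without any prior rescaling or anchor point. The resulting pairwise distances could then be passed to a standard clustering procedure.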

References

1. Furey, T.S. (2012). "ChIP-seq and beyond: New and improved methodologies to detect and characterize protein-DNA interactions". Nature Reviews Genetics 13 (12): 840–52. doi:10.1038/nrg3306. PMC PMC3591838. PMID 23090257. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3591838.
2. Barski, A.; Cuddapah, S.; Cui, K. et al. (2007). "High-resolution profiling of histone methylations in the human genome". Cell 129 (4): 823–37. doi:10.1016/j.cell.2007.05.009. PMID 17512414.
3. Kundaje, A.; Kyriazopoulou-Panagiotopoulou, S.; Libbrecht, M. et al. (2012). "Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements". Genome Research 22 (9): 1735–47. doi:10.1101/gr.136366.111. PMC PMC3431490. PMID 22955985. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431490.
4. Schweikert, G.; Cseke, B.; Clouaire, T. et al. (2013). "MMDiff: Quantitative testing for shape changes in ChIP-Seq data sets". BMC Genomics 14: 826. doi:10.1186/1471-2164-14-826. PMC PMC4008153. PMID 24267901. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4008153.
5. Bieberstein, N.I.; Carrillo Oesterreich, F.; Straube, K. et al. (2012). "First exon length controls active chromatin signatures and transcription". Cell Reports 2 (1): 62–8. doi:10.1016/j.celrep.2012.05.019. PMID 22840397.
6. Benveniste, D.; Sonntag, H.J.; Sanguinetti, G. et al. (2014). "Transcription factor binding predicts histone modifications in human cell lines". Proceedings of the National Academy of Sciences of the United States of America 111 (37): 13367–72. doi:10.1073/pnas.1412081111. PMC PMC4169916. PMID 25187560. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4169916.
7. Filion, G.J.; van Bemmel, J.G.; Braunschweig, U. et al. (2010). "Systematic protein location mapping reveals five principal chromatin types in Drosophila cells". Cell 143 (2): 212–24. doi:10.1016/j.cell.2010.09.009. PMC PMC3119929. PMID 20888037. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3119929.
8. Knijnenburg, T.A.; Ramsey, S.A.; Berman, B.P. et al. (2014). "Multiscale representation of genomic signals". Nature Methods 11 (6): 689–94. doi:10.1038/nmeth.2924. PMC PMC4040162. PMID 24727652. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4040162.
9. Taslim, C.; Wu, J.; Yan, P. et al. (2009). "Comparative study on ChIP-seq data: normalization and binding pattern characterization". Bioinformatics 25 (18): 2334–40. doi:10.1093/bioinformatics/btp384. PMC PMC2800347. PMID 19561022. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2800347.
10. Zhang, Y.; Liu, T.; Meyer, C.A. et al. (2008). "Model-based analysis of ChIP-Seq (MACS)". Genome Biology 9 (9): R137. doi:10.1186/gb-2008-9-9-r137. PMC PMC2592715. PMID 18798982. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2592715.
11. Sakoe, H.; Chiba, S. (1978). "Dynamic programming algorithm optimization for spoken word recognition". IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1): 43–9. doi:10.1109/TASSP.1978.1163055.
12. Müller, M. (2007). Information Retrieval for Music and Motion. Springer-Verlag Berlin Heidelberg. pp. 318. doi:10.1007/978-3-540-74048-3. ISBN 9783540740476.
13. ENCODE Project Consortium (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57–74. doi:10.1038/nature11247. PMC PMC3439153. PMID 22955616. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3439153.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.


@@ Line 27: / Line 27: @@
 '''Background''': Functional [[Genomics|genomic]] and epigenomic research relies fundamentally on [[sequencing]]-based methods like ChIP-seq for the detection of DNA-protein interactions. These techniques return large, high-dimensional data sets with visually complex structures, such as multi-modal peaks extended over large genomic regions. Current tools for visualisation and data exploration represent and leverage these complex features only to a limited extent.
-'''Results''': We present DGW, an open-source software package for simultaneous alignment and clustering of multiple epigenomic marks. DGW uses dynamic time warping to adaptively rescale and align genomic distances which allows to group regions of interest with similar shapes, thereby capturing the structure of epigenomic marks. We demonstrate the effectiveness of the approach in a simulation study and on a real epigenomic data set from the ENCODE project.
+'''Results''': We present DGW (Dynamic Gene Warping), an open-source software package for simultaneous alignment and clustering of multiple epigenomic marks. DGW uses dynamic time warping to adaptively rescale and align genomic distances which allows to group regions of interest with similar shapes, thereby capturing the structure of epigenomic marks. We demonstrate the effectiveness of the approach in a simulation study and on a real epigenomic data set from the ENCODE project.
 '''Conclusions''': Our results show that DGW automatically recognises and aligns important genomic features such as transcription start sites and splicing sites from histone marks. DGW is available as an open-source Python package.
@@ Line 36: / Line 36: @@
 Sequencing-based technologies such as ChIP-Seq and DNAse-Seq [e.g., reviewed in Furey 2012<ref name="FureyChip12">{{cite journal |title=Chip-seq and beyond: New and improved methodologies to detect and characterize protein-DNA interactions |journal=Nature Reviews Genetics |author=Furey, T.S. |volume=13 |issue=12 |pages=840-52 |year=2012 |doi=10.1038/nrg3306 |pmid=23090257 |pmc=PMC3591838}}</ref>] have revolutionized our understanding of chromatin structure and function, yielding deep insights in the importance of epigenomic marks in the basic processes of life. The emergent picture is that gene expression is controlled by a complex interplay of protein binding and epigenomic modifications. While histone marks (and other epigenomic marks) can be measured in a high-throughput way, exploratory data analysis techniques for these data types are still being developed. Epigenomic marks exhibit characteristics that distinguish them fundamentally from e.g., mRNA gene expression measurements: they are spatially extended across regions as wide as several kilobases within which they often present interesting local structures, such as the presence of multiple peaks and troughs<ref name="BarskiHigh07">{{cite journal |title=High-resolution profiling of histone methylations in the human genome |journal=Cell |author=Barski, A.; Cuddapah, S.; Cui, K. et al. |volume=129 |issue=4 |pages=823-37 |year=2007 |doi=10.1016/j.cell.2007.05.009 |pmid=17512414}}</ref>, and intriguing asymmetries<ref name="KundajeUbiq12">{{cite journal |title=Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements |journal=Genome Research |author=Kundaje, A.; Kyriazopoulou-Panagiotopoulou, S.; Libbrecht, M. et al. |volume=22 |issue=9 |pages=1735-47 |year=2012 |doi=10.1101/gr.136366.111 |pmid=22955985 |pmc=PMC3431490}}</ref> (see Fig. 1).
-The shape of epigenomic marks across replicate data sets appears to be highly conserved, and has recently been exploited for statistical testing [4]. While the biological reasons for such conservation are not entirely clear, recent studies have suggested that both architectural and regulatory aspects may be at play. Bieberstein and colleagues showed intriguing patterns of accumulation of the histone marks H3K4me3 and H3K9ac at splice sites [5], hinting at an architectural origin of the shape of the marks. More recently, Benveniste et al showed that histone marks can be very well predicted genome-wide by the binding patterns of transcription factors (TFs) [6]. The shape of the peak may therefore be a readout of additional chromatin-related events and genomic regions which are similarly marked may therefore hint at common regulatory or architectural features. Excellent visualisation tools (e.g. UCSC genome browser) enable researchers to appreciate such features for individual enrichment peaks. However, while automatically grouping such marks based on shape similarity may be a valuable tool for hypothesis generation, it has remained a non-trivial task.
+[[File:Fig1 Lukauskas BMCBioinformatics2016 17-Supp16.gif|779px]]
+{{clear}}
+{|
+ | STYLE="vertical-align:top;"|
+{| border="0" cellpadding="5" cellspacing="0" width="779px"
+ |-
+  | style="background-color:white; padding-left:10px; padding-right:10px;"| <blockquote>'''Figure 1.''' The epigenomic marks H3K4me3 (left) and H3K9ac (right) accumulate around transcription start sites often showing a bimodal peak with a valley over the TSS. Shown are two biological replicates for each mark and the input signal. Y axis corresponds to read counts. Annotated genes and the enriched regions called by MACS2 are shown in grey below each profile.</blockquote>
+ |-
+|}
+|}
+The shape of epigenomic marks across replicate data sets appears to be highly conserved, and has recently been exploited for statistical testing.<ref name="SchweikertMMDiff13">{{cite journal |title=MMDiff: Quantitative testing for shape changes in ChIP-Seq data sets |journal=BMC Genomics |author=Schweikert, G.;, Cseke, B.; Clouaire, T. et al. |volume=14 |pages=826 |year=2013 |doi=10.1186/1471-2164-14-826 |pmid=24267901 |pmc=PMC4008153}}</ref> While the biological reasons for such conservation are not entirely clear, recent studies have suggested that both architectural and regulatory aspects may be at play. Bieberstein and colleagues showed intriguing patterns of accumulation of the histone marks H3K4me3 and H3K9ac at splice sites<ref name="BiebersteinFirst12">{{cite journal |title=First exon length controls active chromatin signatures and transcription |journal=Cell Reports |author=Bieberstein, N.I.; Carrillo Oesterreich, F.; Straube, K. et al. |volume=2 |issue=1 |pages=62–8 |year=2012 |doi=10.1016/j.celrep.2012.05.019 |pmid=22840397}}</ref>, hinting at an architectural origin of the shape of the marks. More recently, Benveniste ''et al.'' showed that histone marks can be very well predicted genome-wide by the binding patterns of transcription factors (TFs).<ref name="BenvenisteTransc14">{{cite journal |title=Transcription factor binding predicts histone modifications in human cell lines |journal=Proceedings of the National Academy of Sciences of the United States of America |author=Benveniste, D.; Sonntag, H.J.; Sanguinetti, G. et al. |volume=111 |issue=37 |pages=13367-72 |year=2014 |doi=10.1073/pnas.1412081111 |pmid=25187560 |pmc=PMC4169916}}</ref> The shape of the peak may therefore be a readout of additional chromatin-related events and genomic regions which are similarly marked may therefore hint at common regulatory or architectural features. 
Excellent visualisation tools (e.g., UCSC genome browser) enable researchers to appreciate such features for individual enrichment peaks. However, while automatically grouping such marks based on shape similarity may be a valuable tool for hypothesis generation, it has remained a non-trivial task.
+Current approaches to clustering regions based on chromatin signatures can be broadly split into two camps: global approaches, such as the celebrated HMM-based reconstruction of the "colours of the chromatin,"<ref name="FilionSyst10">{{cite journal |title=Systematic protein location mapping reveals five principal chromatin types in Drosophila cells |author=Filion, G.J.; van Bemmel, J.G.; Braunschweig, U. et al. |volume=143 |issue=2 |pages=212–24 |year=2010 |doi=10.1016/j.cell.2010.09.009 |pmid=20888037 |pmc=PMC3119929}}</ref> try to find a segmentation of large genomic regions based on histone signatures. These approaches usually rely on the "presence vs. absence" characterization of histone marks at genomic loci, such that the clustering is primarily based on combinatorial patterns of multiple histone marks, as opposed to spatial patterns emerging within individual peaks.
+Another interesting segmentation approach was recently introduced by Knijnenburg and colleagues.<ref name="KnijnenburgMulti14">{{cite journal |title=Multiscale representation of genomic signals |author=Knijnenburg, T.A.; Ramsey, S.A.; Berman B.P. et al. |volume=11 |issue=6 |pages=689-94 |year=2014 |doi=10.1038/nmeth.2924 |pmid=24727652 |pmc=PMC4040162}}</ref> Here, signal enrichment is considered across a wide range of scales spanning several orders of magnitude. While this constitutes a significant improvement compared to earlier approaches, signal patterns within segments are again not taken into account. On the other hand, local approaches attempt to cluster short genomic regions at particular loci based on the quantitative binding or modification pattern measured at the loci (e.g., via ChIP-Seq). Examples of these approaches include the ENCODE Cluster Aggregation Tool (CAGT)<ref name="KundajeUbiq12" />, or the clustering of genes based on PolII binding profiles.<ref name="TaslimComp09">{{cite journal |title=Comparative study on ChIP-seq data: normalization and binding pattern characterization |author=Taslim, C.; Wu, J.; Yan, P. et al. |volume=25 |issue=18 |pages=2334-40 |year=2009 |doi=10.1093/bioinformatics/btp384 |pmid=19561022 |pmc=PMC2800347}}</ref>
+Local approaches have to address two challenging problems: aligning the peaks to a reference, and standardising the peaks so that they can be represented as vectors of equal dimensions. To align regions, both the method by Taslim and colleagues as well as the CAGT tool rely on anchor points (e.g., transcription start sites (TSS)<ref name="TaslimComp09" /> or transcription factor binding sites from auxiliary ChIP-Seq experiments<ref name="KundajeUbiq12" />). The regions are then standardised either by rescaling to a fixed gene length<ref name="TaslimComp09" /> or by applying windows of fixed length either side of the anchor points<ref name="KundajeUbiq12" /> irrespective of the true extent of the local enrichment. These strategies may be plausible for certain applications. However, the shape and extent of histone marks for instance, appear to be determined by many factors<ref name="BiebersteinFirst12" />, such that a uniform rescaling may be inappropriate. In particular, if one made the assumption that epigenomic marks are directly or indirectly influenced by the underlying DNA sequence, it becomes clear that more flexibility in the comparison and alignment of these marks is needed. For example, ortholog genes may share similar sequence features but their sequence length may vary. Sequence comparisons therefore in general do not require the considered sequences to be of equal length; they allow for insertions, deletions, shifts. Similar local variations should therefore be allowed when comparing epigenomic marks.
+In this work, we address the problem of aligning and clustering epigenomic data in a completely unsupervised way: as input data we use ChIP-Seq enrichment measurements within peak regions identified by a peak finder such as MACS.<ref name="ZhangModel08">{{cite journal |title=Model-based analysis of ChIP-Seq (MACS) |journal=Genome Biology |author=Zhang, Y.; Liu, T.; Meyer, C.A. et al. |volume=9 |issue=9 |pages=R137 |year=2008 |doi=10.1186/gb-2008-9-9-r137 |pmid=18798982 |pmc=PMC2592715}}</ref> The alignment and the standardisation problems are solved simultaneously without the use of additional information, such as transcription start sites or gene annotation. We introduce a local rescaling which allows to match epigenomic marks based on maximum similarity between shapes. Our method/software, Dynamic Genome Warping (DGW), is based on the classical dynamic time warping algorithm<ref name="SakoeDynamic78">{{cite journal |title=Dynamic programming algorithm optimization for spoken word recognition |journal=IEEE Transactions on Acoustics, Speech, and Signal Processing |author=Sakoe, H.; Chiba, S. |volume=26 |issue=1 |pages=62–8 |year=1978 |doi=10.1109/TASSP.1978.1163055}}</ref><ref name="MüllerInfo07">{{cite book |title=Information Retrieval for Music and Motion |author=Müller, M. |publisher=Springer-Verlag Berlin Heidelberg |year=2007 |pages=318 |doi=10.1007/978-3-540-74048-3 |isbn=9783540740476}}</ref>, which enabled computer scientists to construct robust speech recognisers undeterred by the variability in pitch and speed of enunciation. In DGW we have implemented multidimensional alignment and clustering, such that multiple epigenomic tracks can be analysed simultaneously. This feature can also be used to control for local sequencing bias as DNA inputs or IGG controls can easily be added to the analysis. We first test DGW in a simulation study. 
Subsequently, we demonstrate that DGW can align genomic landmarks such as TSSs and first splicing sites (FSSs) on real epigenomic data from the ENCODE project<ref name="EPCAnInt12">{{cite journal |title=An integrated encyclopedia of DNA elements in the human genome |journal=Nature |author=ENCODE Project Consortium |volume=489 |issue=7414 |pages=57-74 |year=2012 |doi=10.1038/nature11247 |pmid=22955616 |pmc=PMC3439153}}</ref>, thus effectively and automatically solving both the alignment and the standardization problems. DGW is freely available as a stand-alone, platform-independent and fully documented Python package.
 ==References==
