Journal:Intervene: A tool for intersection and visualization of multiple gene or genomic region sets
|Full article title||Intervene: A tool for intersection and visualization of multiple gene or genomic region sets|
|Author(s)||Khan, Aziz; Mathelier, Anthony|
|Author affiliation(s)||Centre for Molecular Medicine Norway, Norwegian Radium Hospital|
|Primary contact||Email: aziz dot khan at ncmm dot uio dot no and anthony dot mathelier at ncmm dot uio dot no|
|Volume and issue||18|
|Distribution license||Creative Commons Attribution 4.0 International|
Background: A common task for scientists is comparing lists of genes or genomic regions derived from high-throughput sequencing experiments. While several tools exist to intersect and visualize sets of genes, similar tools dedicated to the visualization of genomic region sets are currently limited.
Results: To address this gap, we have developed the Intervene tool, which provides an easy and automated interface for the effective intersection and visualization of genomic region or list sets, thus facilitating their analysis and interpretation. Intervene contains three modules: venn to generate Venn diagrams of up to six sets, upset to generate UpSet plots of multiple sets, and pairwise to compute and visualize intersections of multiple sets as clustered heat maps. Intervene, and its interactive web ShinyApp companion, generate publication-quality figures for the interpretation of genomic region and list sets.
Conclusions: Intervene and its web application companion provide an easy command line and an interactive web interface to compute intersections of multiple genomic and list sets. They have the capacity to plot intersections using easy-to-interpret visual approaches. Intervene is developed and designed to meet the needs of both computer scientists and biologists. The source code is freely available at https://bitbucket.org/CBGR/intervene, with the web application available at https://asntech.shinyapps.io/intervene.
Keywords: visualization, Venn diagrams, UpSet plots, heat maps, genome analysis
Effective visualization of transcriptomic, genomic, and epigenomic data generated by next-generation sequencing-based high-throughput assays has become an area of great interest. Most of the data sets generated by such assays are lists of genes or variants, and genomic region sets. The genomic region sets represent genomic locations for specific features, such as transcription factor-DNA interactions, transcription start sites, histone modifications, and DNase hypersensitivity sites. A common task in the interpretation of these features is to find similarities, differences, and enrichments between such sets, which come from different samples, experimental conditions, or cell and tissue types.
Classically, the intersection or overlap between different sets, such as gene lists, is represented by Venn diagrams or Edwards-Venn diagrams. If the number of sets exceeds four, such diagrams become complex and difficult to interpret. The key challenge is that there are 2^n combinations to visually represent when considering n sets. An alternative approach, the UpSet plot, was introduced to depict the intersection of more than three sets. The advantage of UpSet plots is their capacity to rank the intersections and optionally hide combinations without intersection, which is not possible using a Venn diagram. However, with a large number of sets, UpSet plots become an ineffective way of illustrating set intersections. To visualize a large number of sets, one can represent pairwise intersections using a clustered heat map, as suggested by Lex and Gehlenborg.
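The combinatorial growth mentioned above can be made concrete with a short Python sketch. It is only an illustration, not code from Intervene: the function name `exclusive_intersections` and the toy gene lists are assumptions introduced here. For n sets it enumerates the 2^n - 1 non-empty combinations (the empty combination is left out) and collects the elements that belong to exactly the sets in each combination.

```python
from itertools import combinations


def exclusive_intersections(sets):
    """Map each non-empty combination of set names to the elements that
    occur in exactly those sets and in no others."""
    names = sorted(sets)
    result = {}
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            inside = set.intersection(*(sets[name] for name in combo))
            others = [sets[name] for name in names if name not in combo]
            outside = set().union(*others)
            result[combo] = inside - outside
    return result


# Toy gene lists (hypothetical, for illustration only).
gene_sets = {
    "A": {"TP53", "MYC", "GATA3"},
    "B": {"MYC", "GATA3", "ESR1"},
    "C": {"GATA3", "FOXA1"},
}

regions = exclusive_intersections(gene_sets)
print(len(regions))  # number of non-empty combinations for three sets
for combo, members in regions.items():
    print("&".join(combo), sorted(members))
```

With three sets there are seven regions to draw; with six sets there are already 63, which is why Venn diagrams stop being readable well before the number of sets becomes large.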
There are several web applications and R packages available to compute and visualize intersections of up to six list sets by using Venn diagrams. Although tools exist to perform genomic region set intersections, there is a limited number of tools available to visualize them. To our knowledge, no tool exists to generate UpSet plots for genomic region sets. Consequently, there is a great need for integrative tools to compute and visualize intersections of multiple sets of both genomic regions and gene/list sets.
To address this need, we developed Intervene, an easy-to-use command line tool to compute and visualize intersections of genomic regions with Venn diagrams, UpSet plots, or clustered heat maps. Moreover, we provide an interactive web application companion to upload list sets or the output of Intervene to further customize plots.
Intervene comes as a command line tool, along with an interactive Shiny web application to customize the visual representation of intersections. The command line tool is implemented in Python (version 2.7) and R (version 3.3.2). The build also works with Python versions 3.4, 3.5, and 3.6. The accompanying web interface is developed using Shiny (version 1.0.0), a web application framework for R. Intervene uses pybedtools to perform genomic region set intersections and Seaborn (https://seaborn.pydata.org/), Matplotlib, UpSetR, and Corrplot to generate figures. The web application uses the R package Vennerable for different types of Venn diagrams, UpSetR for UpSet plots, and heatmap.2 and Corrplot for pairwise intersection clustered heat maps. The UpSet module of the web ShinyApp was derived from the UpSetR ShinyApp, which was extended by adding more options and features to customize the UpSet plots.
Intervene can be installed by using pip install intervene or from the source code available on Bitbucket at https://bitbucket.org/CBGR/intervene. The tool has been tested on Linux and Mac systems. The Shiny web application is hosted with shinyapps.io by RStudio, and is compatible with all modern web browsers. Detailed documentation, including installation instructions and how to use the tool, is provided in Additional file 1 and is available at http://intervene.readthedocs.io.
An integrated tool for effective visualization of multiple set intersections
As visualization of sets and their intersections is becoming more and more challenging due to the increasing number of generated data sets, there is a strong need to have an integrated tool to compute and visualize intersections effectively. To address this challenge, we have developed Intervene, which is composed of three different modules, accessible through the subcommands venn, upset, and pairwise. Intervene accepts two types of input files: genomic regions in BED, GFF, or VCF format and gene/name lists in plain text format. A detailed sketch of Intervene’s command line interface and web application utility with types of inputs is provided in Fig. 1.
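The difference between the two input types matters for how an intersection is computed: names in a list match by equality, whereas genomic regions match when they overlap on the same chromosome. The following Python sketch illustrates the idea under stated assumptions. Intervene itself delegates this work to pybedtools; the helper names `parse_bed`, `overlaps`, and `count_overlapping` and the toy coordinates are hypothetical, and the quadratic comparison is chosen for readability rather than speed.

```python
def parse_bed(text):
    """Parse BED-formatted text into (chrom, start, end) tuples.
    BED coordinates are 0-based and half-open."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions.append((chrom, int(start), int(end)))
    return regions


def overlaps(a, b):
    """True if two regions share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(query, subject):
    """Number of query regions overlapping at least one subject region."""
    return sum(any(overlaps(q, s) for s in subject) for q in query)


# Toy peak sets (hypothetical coordinates).
peaks_a = parse_bed("chr1\t100\t200\nchr1\t500\t600\nchr2\t100\t200\n")
peaks_b = parse_bed("chr1\t150\t250\nchr2\t300\t400\n")

print(count_overlapping(peaks_a, peaks_b))
```

Because the coordinates are half-open, two regions that merely touch (for example 0-10 and 10-20) are not counted as overlapping in this sketch.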
Intervene provides flexibility to the user to choose figure colors, label text, size, resolution, and type to make them publication-standard quality. To read the help about any module, the user can type intervene <subcommand> --help on the command line. Furthermore, Intervene produces results as text files, which can be easily imported to the web application for interactive visualization and customization of plots (see “An interactive web application” section).
Venn diagrams module
Venn diagrams are the classical approach to show intersections between sets. There are several web-based applications and R packages available to visualize intersections of up to six list sets in classical Venn, Euler, or Edwards’ diagrams. However, a very limited number of tools are available to visualize genomic region intersections using classical Venn diagrams.
Intervene provides up to six-way classical Venn diagrams for gene lists or genomic region sets. The associated web interface can also be used to compute the intersection of multiple gene sets and visualize it using different flavors of weighted and unweighted Venn and Euler diagrams. These different types include: classical Venn diagrams (up to five sets), Chow-Ruskey (up to five sets), Edwards’ diagrams (up to five sets), and Battle (up to nine sets).
As an example, one might be interested in calculating the number of overlapping ChIP-seq (chromatin immunoprecipitation followed by sequencing) peaks between different types of histone modification marks (H3K27ac, H3K4me3, and H3K27me3) in human embryonic stem cells (hESC) (Fig. 2a, which can be generated with the command intervene venn --test).
UpSet plots module
When the number of sets exceeds four, Venn diagrams become difficult to read and interpret. An alternative and more effective approach is to use UpSet plots to visualize the intersections. An R package with a ShinyApp (https://gehlenborglab.shinyapps.io/upsetr/) and an interactive web-based tool are available at http://vcg.github.io/upset to visualize multiple list sets. However, to our knowledge, there is no tool available to draw the UpSet plots for genomic region set intersections. Intervene’s upset subcommand can be used to visualize the intersection of multiple genomic region sets using UpSet plots.
As an example, we show the intersections of ChIP-seq peaks for histone modifications (H3K27ac, H3K4me3, H3K27me3, and H3K4me2) in hESC using an UpSet plot, where intersections were ranked by frequency (Fig. 2b, which can be generated with the command intervene upset --test). This plot is easier to understand than the four-way Venn diagram (Additional file 1).
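The ranking by frequency that an UpSet plot displays can be sketched in a few lines of Python. This is an illustration only, not the UpSetR or Intervene implementation: the function name `ranked_intersections` and the peak identifiers are hypothetical, and the sketch assumes that overlapping peaks from different marks have already been reduced to shared identifiers. Each element is tagged with the combination of sets it belongs to, the combinations are counted, and the counts are sorted in descending order. Combinations with no members never appear, which mirrors the option to hide empty intersections.

```python
from collections import Counter


def ranked_intersections(sets):
    """Count elements per exact set-membership combination and rank the
    combinations by descending size."""
    names = sorted(sets)
    universe = set().union(*sets.values())
    counts = Counter(
        tuple(name for name in names if element in sets[name])
        for element in universe
    )
    # Largest intersections first; ties are ordered by combination name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy peak identifiers per histone mark (hypothetical).
marks = {
    "H3K27ac": {"p1", "p2", "p3", "p4"},
    "H3K4me3": {"p2", "p3", "p4", "p5"},
    "H3K27me3": {"p6", "p7"},
    "H3K4me2": {"p3", "p4", "p5"},
}

for combo, size in ranked_intersections(marks):
    print(size, " & ".join(combo))
```

Only five of the fifteen possible non-empty combinations are populated in this toy example, so the ranked list stays short even though four sets are compared.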
Pairwise intersection heat maps module
With an increasing number of data sets, visualizing all possible intersections becomes infeasible using Venn diagrams or UpSet plots. One possibility is to compute pairwise intersections and plot the associated metrics as a clustered heat map. Intervene’s pairwise module provides several metrics to assess intersections, including number of overlaps, fraction of overlap, Jaccard statistics, Fisher’s exact test, and distribution of relative distances. Moreover, the user can choose from different styles of heat maps and clustering approaches.
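To make one of these metrics concrete, the sketch below computes a Jaccard index between two genomic region sets at the base-pair level, that is, the number of bases covered by both sets divided by the number of bases covered by either. It is a plain-Python illustration under that assumed definition, not the routine Intervene calls through pybedtools; the helper names and the coordinates are hypothetical.

```python
from collections import defaultdict


def merge(regions):
    """Merge overlapping (chrom, start, end) regions per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(regions):
        merged = by_chrom[chrom]
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return by_chrom


def covered_bp(merged):
    """Total number of bases covered by a merged region set."""
    return sum(end - start for ivs in merged.values() for start, end in ivs)


def intersection_bp(a, b):
    """Bases shared by two merged region sets (two-pointer sweep)."""
    total = 0
    for chrom in a.keys() & b.keys():
        xs, ys = a[chrom], b[chrom]
        i = j = 0
        while i < len(xs) and j < len(ys):
            lo = max(xs[i][0], ys[j][0])
            hi = min(xs[i][1], ys[j][1])
            total += max(0, hi - lo)
            # Advance whichever interval ends first.
            if xs[i][1] < ys[j][1]:
                i += 1
            else:
                j += 1
    return total


def jaccard(a, b):
    """Base-pair Jaccard index: 0 means no overlap, 1 means identical."""
    a, b = merge(a), merge(b)
    inter = intersection_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - inter
    return inter / union if union else 0.0


# Toy region sets (hypothetical coordinates).
set_a = [("chr1", 100, 200), ("chr1", 150, 300)]
set_b = [("chr1", 250, 350)]
print(jaccard(set_a, set_b))
```

Merging each set first prevents bases covered by several overlapping regions of the same set from being counted more than once.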
As an example, we obtained the genomic regions of super enhancers in 24 mouse cell types and tissues from dbSUPER and computed the pairwise intersections in terms of Jaccard statistics (Fig. 2c). The triangular heat map shows the pairwise Jaccard index, which is between 0 and 1, where 0 means no overlap and 1 means full overlap. The bar plot shows the number of regions in each cell type or tissue. This plot can be generated using the command intervene pairwise --test.
An interactive web application
Intervene comes with a web application companion to further explore and filter the results in an interactive way. Indeed, intersections between large data sets can be computed locally using Intervene’s command line interface, then the output files can be uploaded to the ShinyApp for further exploration and customization of the figures (Fig. 1).
The ShinyApp web interface takes four types of inputs: (i) a text/csv file where each column represents a set, (ii) a binary representation of intersections, (iii) a pairwise matrix of intersections, and (iv) a matrix of overlap counts. The web application provides several easy and intuitive customization options for responsive adjustments of the figures (Figs. 1 and 3). Users can change colors, fonts and plot sizes, change labels, and select and deselect specific sets. These customized and publication-ready figures can be downloaded in PDF, SVG, TIFF, and PNG formats. The pairwise module also provides three types of correlation coefficients and hierarchical clustering with eight clustering methods and four distance measurement methods. It further provides interactive features to explore data values; this is done by hovering the mouse cursor over each heat map cell, or by using a searchable and sortable data table. The data table can be downloaded as a CSV file and interactive heat maps can be downloaded as HTML. The Shiny-based web application is freely available at https://asntech.shinyapps.io/intervene.
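The relationship between the first two input types can be illustrated with a small Python sketch that converts a column-per-set CSV into a binary membership table, with one row per element and one 0/1 entry per set. The exact file layout the ShinyApp expects is not specified here, so the header row, the blank-cell padding, and the function name `columns_to_binary` are assumptions made for illustration.

```python
import csv
import io


def columns_to_binary(csv_text):
    """Convert a CSV in which each column lists the members of one set
    into a binary membership table (element -> list of 0/1 flags)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    sets = {name: set() for name in header}
    for row in body:
        for name, cell in zip(header, row):
            cell = cell.strip()
            if cell:  # shorter columns are padded with empty cells
                sets[name].add(cell)
    elements = sorted(set().union(*sets.values()))
    matrix = {e: [int(e in sets[name]) for name in header] for e in elements}
    return header, matrix


# Toy input (hypothetical): three gene lists of different lengths.
example = "SetA,SetB,SetC\nTP53,MYC,GATA3\nMYC,GATA3,\nGATA3,,\n"

header, matrix = columns_to_binary(example)
print(header)
for element, flags in matrix.items():
    print(element, flags)
```

Each distinct row pattern in the binary table corresponds to one intersection, so counting identical rows gives the intersection sizes.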
Case study: Highlighting co-binding factors in the MCF-7 cell line
Transcription factors (TFs) are key proteins regulating transcription through their cooperative binding to the DNA. To highlight Intervene’s capabilities, we used the command-line tool and its ShinyApp companion to predict and visualize cooperative interactions between TFs at cis-regulatory regions in the MCF-7 breast cancer cell line. Specifically, we considered (i) TF binding regions derived from uniformly processed TF ChIP-seq experiments compiled in the ReMap database and (ii) promoter and enhancer regions predicted by chromHMM from histone modifications and regulatory factors ChIP-seq. The pairwise module of Intervene was used to compute the fraction of overlap between all pairs of ChIP-seq data sets and regulatory regions. The output matrix was provided to the ShinyApp to compute Spearman correlations of the computed values and to generate the corresponding clustering heat map (default parameters; Fig. 4).
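The correlation step of this workflow can be sketched as follows. In the case study the Spearman correlations are computed by the ShinyApp in R; the pure-Python version below is only an illustration of the statistic, and the fraction-of-overlap values are invented for the example rather than taken from the MCF-7 analysis. Spearman correlation is computed as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation between two equally long profiles."""
    return pearson(ranks(x), ranks(y))


# Hypothetical fraction-of-overlap profiles against four other data sets.
overlap = {
    "ESR1": [0.80, 0.60, 0.10, 0.05],
    "FOXA1": [0.70, 0.65, 0.20, 0.10],
    "CTCF": [0.10, 0.15, 0.70, 0.90],
}

print(spearman(overlap["ESR1"], overlap["FOXA1"]))
print(spearman(overlap["ESR1"], overlap["CTCF"]))
```

Data sets with similar overlap profiles obtain a correlation close to 1 and end up next to each other once the correlation matrix is hierarchically clustered.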
The largest cluster (green cluster) was composed of the three key cooperative TFs involved in oestrogen-positive breast cancers: ESR1, FOXA1, and GATA3. They were clustered with enhancer regions where they have been shown to interact. The cluster highlights potential TF cooperators: ARNT, AHR, GREB1, and TLE3. Promoter regions were found in the second largest cluster (red cluster), along with CTCF, STAG1, and RAD21, which are known to orchestrate chromatin architecture in human cells. The last cluster was principally composed of TFAP2C data sets. Taken together, Intervene visually highlighted the cooperation of different sets of TFs at MCF-7 promoters and enhancers, in agreement with the literature.
A comparative analysis of different tools to compute and visualize intersections as Venn diagrams, UpSet plots, and pairwise heat maps is provided in Table 1. Most of the tools available currently can only draw Venn diagrams for up-to six list sets. Intervene provides Venn diagrams, UpSet plots, and pairwise heat maps for both list sets and genomic region sets. To the best of our knowledge, it is the only tool available to draw UpSet plots for the intersections of genomic region sets. Intervene is the first of its kind to allow for the computation and visualization of intersections between multiple genomic region and list sets with three different approaches.
In the near future, Intervene will be integrated into the Galaxy Tool Shed so that it can be installed on any Galaxy instance with one click. We plan to develop a dedicated web application allowing users to upload genomic region sets for intersections and visualization.
We described Intervene as an integrated tool that provides an easy and automated interface for the intersection and effective visualization of genomic region and list sets. To our knowledge, Intervene is the first tool to provide three types of visualization approaches for multiple sets of genes or genomic intervals. The three modules are designed to handle situations in which the number of sets is large. Intervene and its web application companion are developed and designed to fit the needs of a wide range of scientists.
Availability and requirements
Project name: Intervene
Project home page: https://bitbucket.org/CBGR/intervene
Project documentation page: http://intervene.readthedocs.io
Project Shiny App page: https://asntech.shinyapps.io/intervene/
Operating system(s): The ShinyApp is platform independent, and the command line interface is available for Linux and Mac OS X
Programming language: Python, R
Other requirements: Web browser for the ShinyApp
License: GNU GPL
Any restrictions to use by non-academics: GNU GPL
ChIP-seq: Chromatin immunoprecipitation followed by sequencing
ENCODE: The Encyclopedia of DNA Elements
hESCs: Human embryonic stem cells
TFs: Transcription factors
We thank the developers of the tools we have used to build Intervene and Intervene ShinyApp for sharing their code in open-source software. We thank Marius Gheorghe and Dimitris Polychronopoulos for their useful suggestions and testing the tool, and Annabel Darby for providing suggestions on the manuscript text.
This work has been supported by the Norwegian Research Council, Helse Sør-Øst, and the University of Oslo through the Centre for Molecular Medicine Norway (NCMM), which is part of the Nordic European Molecular Biology Laboratory Partnership for Molecular Medicine.
Availability of data and materials
The source code of Intervene and test data are freely available at https://bitbucket.org/CBGR/intervene, and detailed documentation can be found at http://intervene.readthedocs.io. An interactive Shiny App is available at https://asntech.shinyapps.io/intervene.
AK conceived the project. AK and AM designed the tool. AM supervised the project. AK implemented both Intervene and the Shiny web application. AK wrote the manuscript draft and AM revised it. All authors read and approved the manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional File 1: Intervene Documentation - Release v0.5.8: A PDF version of the detailed documentation, including installation instructions and how to use the command line interface and web application (PDF 1429 kb)
- Venn, J. (1880). "On the diagrammatic and mechanical representation of propositions and reasonings". Philosophical Magazine and Journal of Science 10 (59): 1–18. doi:10.1080/14786448008626877.
- Edwards, A.W.F. (2004). Cogwheels of the Mind: The Story of Venn Diagrams. Johns Hopkins University Press. pp. 128. ISBN 9780801874345.
- Lex, A.; Gehlenborg, N.; Strobelt, H. et al. (2014). "UpSet: Visualization of Intersecting Sets". IEEE Transactions on Visualization and Computer Graphics 20 (12): 1983-92. doi:10.1109/TVCG.2014.2346248.
- Lex, A.; Gehlenborg, N. (2014). "Points of view: Sets and intersections". Nature Methods 11: 779. doi:10.1038/nmeth.3033.
- Zhu, L.J.; Gazin, C.; Lawson, N.D. et al. (2010). "ChIPpeakAnno: A Bioconductor package to annotate ChIP-seq and ChIP-chip data". BMC Bioinformatics 11: 237. doi:10.1186/1471-2105-11-237. PMC PMC3098059. PMID 20459804. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3098059.
- Dale, R.K.; Pedersen, B.S.; Quinlan, A.R. (2011). "Pybedtools: A flexible Python library for manipulating genomic datasets and annotations". Bioinformatics 27 (24): 3423–4. doi:10.1093/bioinformatics/btr539. PMC PMC3232365. PMID 21949271. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3232365.
- Hunter, J.D. (2007). "Matplotlib: A 2D Graphics Environment". Computing in Science & Engineering 9 (3). doi:10.1109/MCSE.2007.55.
- Conway, J.R.; Lex, A.; Gehlenborg, N. (25 March 2017). "UpSetR: An R Package For The Visualization Of Intersecting Sets And Their Properties". bioRxiv. doi:10.1101/120600.
- Wei, T.; Simko, V. (21 April 2016). "corrplot: Visualization of a Correlation Matrix". https://cran.r-project.org/package=corrplot.
- Swinton, J. (23 September 2009). "Venn diagrams in R with Vennerable package". https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/Vennerable/inst/doc/Venn.pdf?revision=58&root=vennerable.
- Hulsen, T.; de Vlieg, J.; Alkema, W. (2008). "BioVenn - A web application for the comparison and visualization of biological lists using area-proportional Venn diagrams". BMC Genomics 9: 488. doi:10.1186/1471-2164-9-488. PMC PMC2584113. PMID 18925949. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC2584113.
- Lam, F.; Lalansingh, C.M.; Babaran, H.E. (2016). "VennDiagramWeb: A web application for the generation of highly customizable Venn and Euler diagrams". BMC Bioinformatics 17 (1): 401. doi:10.1186/s12859-016-1281-5. PMC PMC5048655. PMID 27716034. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5048655.
- Bardou, P.; Mariette, J.; Escudié, F. et al. (2014). "jvenn: An interactive Venn diagram viewer". BMC Bioinformatics 15: 293. doi:10.1186/1471-2105-15-293. PMC PMC4261873. PMID 25176396. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4261873.
- Lin, G.; Chai, J.; Yuan, S. et al. (2016). "VennPainter: A Tool for the Comparison and Identification of Candidate Genes Based on Venn Diagrams". PLoS One 11 (4): e0154315. doi:10.1371/journal.pone.0154315. PMC PMC4847855. PMID 27120465. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4847855.
- Martin, B.; Chadwick, W.; Yi, T. et al. (2012). "VENNTURE--A novel Venn diagram investigational tool for multiple pharmacological dataset analysis". PLoS One 7 (5): e36911. doi:10.1371/journal.pone.0036911. PMC PMC3351456. PMID 22606307. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3351456.
- Heberle, H.; Meirelles, G.V.; da Silva, F.R. et al. (2015). "InteractiVenn: A web-based tool for the analysis of sets through Venn diagrams". BMC Bioinformatics 16: 169. doi:10.1186/s12859-015-0611-3. PMC PMC4455604. PMID 25994840. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4455604.
- Dunham, I.; Kundaje, A.; Aldred, S.F. et al. (2012). "An integrated encyclopedia of DNA elements in the human genome". Nature 489 (7414): 57–74. doi:10.1038/nature11247. PMC PMC3439153. PMID 22955616. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3439153.
- Khan, A.; Zhang, X. (2016). "dbSUPER: a database of super-enhancers in mouse and human genome". Nucleic Acids Research 44 (D1): D164-71. doi:10.1093/nar/gkv1002. PMC PMC4702767. PMID 26438538. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4702767.
- Chronis, C.; Fiziev, P.; Papp, B. et al. (2017). "Cooperative Binding of Transcription Factors Orchestrates Reprogramming". Cell 168 (3): 442-459.e20. doi:10.1016/j.cell.2016.12.016. PMC PMC5302508. PMID 28111071. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC5302508.
- Spitz, F.; Furlong, E.E. (2012). "Transcription factors: From enhancer binding to developmental control". Nature Reviews Genetics 13 (9): 613-26. doi:10.1038/nrg3207. PMID 22868264.
- Griffon, A.; Barbier, Q.; Dalino, J. et al. (2015). "Integrative analysis of public ChIP-seq experiments reveals a complex multi-cell regulatory landscape". Nucleic Acids Research 43 (4): e27. doi:10.1093/nar/gku1280.
- Ernst, J.; Kellis, M. (2012). "ChromHMM: automating chromatin-state discovery and characterization". Nature Methods 9 (3): 215–6. doi:10.1038/nmeth.1906. PMC PMC3577932. PMID 22373907. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3577932.
- Taberlay, P.C.; Statham, A.L.; Kelly, T.K. et al. (2014). "Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer". Genome Research 24 (9): 1421-32. doi:10.1101/gr.163485.113. PMC PMC4158760. PMID 24916973. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC4158760.
- Theodorou, V.; Stark, R.; Menon, S.; Carroll, J.S. (2013). "GATA3 acts upstream of FOXA1 in mediating ESR1 binding by shaping enhancer accessibility". Genome Research 23 (1): 12-22. doi:10.1101/gr.139469.112. PMC PMC3530671. PMID 23172872. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3530671.
- Zuin, J.; Dixon, J.R.; van der Reijden, M.I. (2014). "Cohesin and CTCF differentially affect chromatin architecture and gene expression in human cells". Proceedings of the National Academy of Sciences of the United States of America 111 (3): 996-1001. doi:10.1073/pnas.1317788111. PMC PMC3903193. PMID 24335803. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=PMC3903193.
This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.