Journal:Practical approaches for mining frequent patterns in molecular datasets

Full article title Practical approaches for mining frequent patterns in molecular datasets
Journal Bioinformatics and Biology Insights
Author(s) Naulaerts, S.; Moens, S.; Engelen, K.; Vanden Berghe, W.; Goethals, B.; Laukens, K.; Meysman, P.
Author affiliation(s) University of Antwerp, Antwerp University Hospital, Fondazione Edmund Mach
Primary contact Email: pieter dot meysman at uantwerpen dot be
Editors Dandekar, T.
Year published 2016
Volume and issue 10
Page(s) 37–47
DOI 10.4137/BBI.S38419
ISSN 1177-9322
Distribution license Creative Commons Attribution 3.0 Unported
Website http://www.la-press.com/ (HTML)
Download http://www.la-press.com/ (PDF)

Abstract

Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still keep promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations to common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should base their choice of software on their research question, their programming proficiency, and the added value of extra features.

Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis

Introduction

In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.[1][2][3][4] A popular group of pattern mining techniques comprises itemset mining and its derivative, association rule mining. These methods are typically known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying the patterns in customers’ shopping behavior.[5] In this context, the shopping cart is formally known as a transaction, while the individual products are the items. The discovery of sets of correlated items (i.e., itemsets) is the goal of this data mining approach, which can be highly relevant in the context of life sciences. For example, one can investigate which genes are often coexpressed in tissue samples or which mutations often occur together in tumors of a given cancer type.
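The terminology above can be illustrated with a minimal sketch. It assumes a toy dataset in which each transaction is a tissue sample and each item is a gene observed as expressed in that sample. The gene names and the `support` helper are hypothetical and are not taken from the article or from any of the tools it discusses.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy data: one transaction per tissue sample,
# one item per gene observed as expressed in that sample.
transactions: List[FrozenSet[str]] = [
    frozenset({"geneA", "geneB", "geneC"}),
    frozenset({"geneA", "geneB"}),
    frozenset({"geneA", "geneC"}),
    frozenset({"geneB", "geneC"}),
    frozenset({"geneA", "geneB", "geneC"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Return the fraction of transactions that contain every item in `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


if __name__ == "__main__":
    # An itemset is called "frequent" when its support reaches a user-chosen threshold.
    for candidate in ({"geneA"}, {"geneA", "geneB"}, {"geneA", "geneB", "geneC"}):
        print(sorted(candidate), support(candidate, transactions))
```

In this toy dataset, the pair {geneA, geneB} occurs in three of the five samples, so it would count as frequent under any support threshold of 0.6 or lower.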

Frequent itemset mining has proven especially useful in reducing complex datasets to their most important and interesting aspects. Frequent patterns can be converted into rules with a discriminatory value that can, in turn, be used to build transparent classifiers. For example, if a gene C is always upregulated when genes A and B are downregulated, the frequent itemset {A|Down, B|Down, C|Up} can be rewritten as the rule {A|Down, B|Down} → {C|Up}, where the left-hand side (antecedent) of the rule leads to the right-hand side (consequent). Rules of this type can be used to distinguish between tumor types, gene clusters, and various other biological contrasts. Because the rules immediately explain why a particular label was given, this approach has an advantage over machine learning methods such as neural networks, which act as a black box. The strengths of frequent itemset mining have consequently been demonstrated in a broad range of bioinformatics applications, ranging from gene expression data[6][7][8], annotation mining[9][10], and combinations thereof[11][12] to interaction networks.[13] A comprehensive overview of the broad range of implementations and bioinformatics applications of frequent itemset mining techniques was recently published.[14]
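How strongly such a rule holds is commonly quantified by its confidence: the fraction of transactions containing the antecedent that also contain the consequent. The sketch below assumes a hypothetical set of four discretized expression samples, and the `confidence` helper is an illustrative name rather than a function from any of the tools discussed in this article.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical discretized expression data: one transaction per sample,
# one item per gene state ("gene|direction").
samples: List[FrozenSet[str]] = [
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "B|Up", "C|Down"}),
    frozenset({"A|Up", "B|Down", "C|Up"}),
]


def support_count(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> int:
    """Count the transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    lhs = frozenset(antecedent)
    rhs = frozenset(consequent)
    lhs_count = support_count(lhs, transactions)
    if lhs_count == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return support_count(lhs | rhs, transactions) / lhs_count


if __name__ == "__main__":
    print(confidence({"A|Down", "B|Down"}, {"C|Up"}, samples))
    print(confidence({"C|Up"}, {"A|Down", "B|Down"}, samples))
```

The two calls show that a rule and its reverse generally differ. In this toy data, C is upregulated in every sample where A and B are both down, but A and B are both down in only two of the three samples where C is up.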

Despite their demonstrated suitability to address various bioinformatics problems, frequent itemset mining techniques have not been generally adopted in day-to-day omics data analysis workflows, and they are gaining popularity only slowly. This can be partially attributed to a number of shortcomings in the existing implementations. First of all, most are command line tools that often need to be compiled from the source code, and clear documentation regarding their installation is often lacking. This lack of user-friendliness poses a serious entry barrier that daunts many life scientists. Second, the output of the implementations is often presented in a format that is not readily interpretable by domain experts. The results of the mining process are typically flat text files containing long pattern lists, and these lists are often highly redundant. This is caused, in part, by the fact that if a set is frequent, every subset that it contains will also be frequent. This is also known as the apriori principle. For many pattern mining applications, there is often a so-called pattern explosion with results that list millions of patterns. Due to the verbose nature of these lists, user-friendly tools to process, query, and visualize this output are indispensable.

Convenient prioritization, filtering, cleaning, and interpretation of pattern result lists require certain functionalities that are rarely covered by existing implementations. Third, iteratively optimizing the pattern list and browsing through the output of these algorithms are often hard, as the algorithms create static output that needs to be processed and converted to a compatible format before the next step in the iterative mining process can start. This can make result prioritization, an inherent part of many pattern discovery projects, a very cumbersome process.[14]
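The apriori principle and the resulting pattern explosion described above can be made concrete with a small level-wise search. This is a simplified, unoptimized sketch in the spirit of Apriori, not the implementation used by any of the tools compared in this article. The function name `frequent_itemsets` and the toy inputs are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Iterable, List


def frequent_itemsets(transactions: Iterable[Iterable[Hashable]],
                      min_count: int) -> Dict[FrozenSet, int]:
    """Return every itemset contained in at least `min_count` transactions.

    Level-wise search: a candidate of size k is only counted if all of its
    (k-1)-subsets were frequent, which is the apriori principle used as a filter.
    """
    data: List[FrozenSet] = [frozenset(t) for t in transactions]
    all_items = {item for t in data for item in t}

    # Level 1: frequent single items.
    current: Dict[FrozenSet, int] = {}
    for item in all_items:
        count = sum(1 for t in data if item in t)
        if count >= min_count:
            current[frozenset([item])] = count

    result: Dict[FrozenSet, int] = {}
    k = 1
    while current:
        result.update(current)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: keep only the candidates that are frequent themselves.
        current = {}
        for c in candidates:
            count = sum(1 for t in data if c <= t)
            if count >= min_count:
                current[c] = count
    return result


if __name__ == "__main__":
    toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
    print(len(frequent_itemsets(toy, min_count=2)))
    # A single transaction with 10 items already yields 2**10 - 1 itemsets.
    print(len(frequent_itemsets([set(range(10))], min_count=1)))
```

Even a single transaction with 10 items produces 1,023 frequent itemsets at a minimum count of one, since every non-empty subset is reported. This is why real omics datasets with thousands of items can yield millions of patterns unless thresholds or condensed representations are used.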

To address some of these limitations, software frameworks have been developed for interactive visual pattern mining, such as the MIME tool.[15] Such toolboxes offer intuitive access to interest-level measures, mining algorithms, and post-processing algorithms to assist in identifying interesting patterns. By enabling interactive mining, they allow users to combine their subjective notion of interest and their background knowledge with a wide variety of objective measures to easily and quickly mine the most important and interesting patterns. In this article, we demonstrate the opportunities of frequent itemset mining in real-world bioinformatics scenarios and describe the application of three commonly used methods, namely, Apriori[5], arules[16], and MIME.[15] This comparison is based on three representative bioinformatics use cases, i.e., domain co-occurrence within proteins, interactions between domains in interacting proteins, and the response of the pathogen Mycobacterium tuberculosis to several drug treatments. For this purpose, we utilize data from UniProt[2], IntAct[4], and COLOMBOS.[1] The data files and step-by-step tutorials on how to install and run the three presented tools on the three use cases are available in Supplementary Files 1–5. The goal of this study is to explore how interesting and biologically relevant patterns can be effectively generated with different tools and to provide the community with some guidance on how frequent itemset mining tools can be used in complex life science scenarios.

References

  1. Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649-D653. doi:10.1093/nar/gkt1086. PMC PMC3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.
  2. UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43-7. doi:10.1093/nar/gks1068. PMC PMC3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.
  3. Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267-72. doi:10.1093/nar/gkt1127. PMC PMC3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074. 
  4. Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358-63. doi:10.1093/nar/gkt1115. PMC PMC3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.
  5. Agrawal, R.; Imielinski, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.
  6. Pan, F.; Cong, G.; Tung, A.K.H. et al. (2003). "Carpenter: Finding closed patterns in long biological datasets". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 637–642. doi:10.1145/956750.956832. 
  7. Cong, G.; Tan, K.-L.; Tung, A.K.H.; Xu, X. (2005). "Mining top-K covering rule groups for gene expression data". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 670-681. doi:10.1145/1066157.1066234. 
  8. Gouda, K.; Zaki, M.J. (2005). "GenMax: An efficient algorithm for mining maximal frequent itemsets". Data Mining and Knowledge Discovery 11 (3): 223-242. doi:10.1007/s10618-005-0002-x. 
  9. Artamonova, I.I.; Frishman, G.; Gelfand, M.S; Frishman, D. (2005). "Mining sequence annotation databanks for association patterns". Bioinformatics 21 (Suppl 3): iii49-iii57. doi:10.1093/bioinformatics/bti1206. PMID 16306393. 
  10. Manda, P.; Ozkan, S.; Wang, H. et al. (2012). "Cross-ontology multi-level association rule mining in the gene ontology". PLoS One 7 (10): e47411. doi:10.1371/journal.pone.0047411. PMC PMC3470562. PMID 23071802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470562. 
  11. Martinez, R.; Pasquier, N.; Pasquier, C. (2009). "Mining Association Rule Bases from Integrated Genomic Data and Annotations". In Masulli, F.; Tagliaferri, R.; Verkhivker, G.M.. Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer Berlin Heidelberg. pp. 78–90. doi:10.1007/978-3-642-02504-4_7. ISBN 9783642025044. 
  12. Tseng, V.S.; Yu, H.-H.; Yang, S.-C. (2009). "Efficient mining of multilevel gene association rules from microarray and gene ontology". Information Systems Frontiers 11 (4): 433-447. doi:10.1007/s10796-009-9156-1. 
  13. Karpinets, T.V.; Park, B.H.; Uberbacher, E.C. (2012). "Analyzing large biological datasets with association networks". Nucleic Acids Research 40 (17): e131. doi:10.1093/nar/gks403. PMC PMC3458522. PMID 22638576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458522. 
  14. Naulaerts, S.; Meysman, P.; Bittremieux, W. et al. (2015). "A primer to frequent itemset mining for bioinformatics". Briefings in Bioinformatics 16 (2): 216–31. doi:10.1093/bib/bbt074. PMC PMC4364064. PMID 24162173. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4364064.
  15. Goethals, B.; Moens, S.; Vreeken, J. (2011). "MIME: A framework for interactive visual pattern mining". Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 757–760. doi:10.1145/2020408.2020529.
  16. Hahsler, M.; Grün, B.; Hornik, K. (2005). "arules – A computational environment for mining association rules and frequent item sets". Journal of Statistical Software 14 (15): 1–25. doi:10.18637/jss.v014.i15. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.