Journal:Practical approaches for mining frequent patterns in molecular datasets

Full article title	Practical approaches for mining frequent patterns in molecular datasets
Journal	Bioinformatics and Biology Insights
Author(s)	Naulaerts, S.; Moens, S.; Engelen, K.; Vanden Berghe, W.; Goethals, B.; Laukens, K.; Meysman, P.
Author affiliation(s)	University of Antwerp, Antwerp University Hospital, Fondazione Edmund Mach
Primary contact	Email: pieter dot meysman at uantwerpen dot be
Editors	Dandekar, T.
Year published	2016
Volume and issue	10
Page(s)	37–47
DOI	10.4137/BBI.S38419
ISSN	1177-9322
Distribution license	Creative Commons Attribution 3.0 Unported
Website	http://www.la-press.com/ (HTML)
Download	http://www.la-press.com/ (PDF)


Abstract

Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still slow down promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations on common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should decide their software choice based on their research question, programming proficiency, and added value of extra features.

Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis

Introduction

In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.^[1]^[2]^[3]^[4] A popular group of pattern mining techniques is itemset mining and its derivative, association rule mining. These methods are typically known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying the patterns in customers’ shopping behavior.^[5] In this context, the shopping cart is formally known as a transaction, while the individual products are the items. The discovery of sets of correlated items (i.e. itemsets) is the goal of this data mining approach, which can be highly relevant in the context of life sciences. For example, one can investigate which genes are often coexpressed in tissue samples or which mutations often occur together in cancer tumors of a given type.

Frequent itemset mining has proven especially useful in capturing and summarizing the most important and interesting aspects of complex datasets. Frequent patterns can be converted into rules with a discriminatory value that can, in turn, be used to build transparent classifications. For example, if a gene C is always upregulated when genes A and B are downregulated, the frequent itemset {A|Down, B|Down, C|Up} can be rewritten as the rule {A|Down, B|Down} ⇒ {C|Up}, where the left-hand side (antecedent) of the rule leads to the right-hand side (consequent) of the rule. Rules of this type can be used to distinguish between tumor types, gene clusters, and various other biological contrasts. The advantage of this approach is that the rules immediately explain why a particular label was given, in contrast to machine learning methods such as neural networks that act as a black box. The strengths of frequent itemset mining have consequently been demonstrated in a broad range of bioinformatics applications, ranging from gene expression data^[6]^[7]^[8], annotation mining^[9]^[10], and combinations thereof^[11]^[12] to interaction networks.^[13] A comprehensive overview of the broad range of implementations and bioinformatics applications of frequent itemset mining techniques was recently published.^[14]
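The step from itemset to rule can be made concrete with a minimal Python sketch. It uses the standard confidence measure (the support of the whole itemset divided by the support of the antecedent), which the text above does not name explicitly; the function names and the toy samples are hypothetical and serve only as illustration.

```python
def support(itemset, transactions):
    """Absolute number of transactions that contain every item of the itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions matching the antecedent that also match the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    return both / ante if ante else 0.0


# Hypothetical toy data: one transaction per tissue sample.
samples = [
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "C|Up"}),
    frozenset({"B|Down"}),
]

# Rule {A|Down, B|Down} => {C|Up}
rule_conf = confidence({"A|Down", "B|Down"}, {"C|Up"}, samples)
print(rule_conf)
```

In this toy data, C is upregulated in every sample where both A and B are downregulated, so the rule has a confidence of 1.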

Despite their demonstrated suitability to address various bioinformatics problems, frequent itemset mining techniques have not been generally adopted in day-to-day omics data analysis workflows, and their popularity is only slowly gaining traction. This can be partially attributed to a number of shortcomings in the existing implementations. First of all, most are command line tools that often need to be compiled from the source code, and clear documentation regarding their installation is often lacking. This lack of user-friendliness poses a serious entrance barrier that daunts many life scientists. Second, the output of the implementations is often presented in a format that is not readily interpretable by domain experts. The results of the mining process are typically flat text files containing long pattern lists. These lists are often very lengthy and highly redundant. This is caused, in part, by the fact that if a set is frequent, any of the smaller subsets that it contains will also be frequent. This is also known as the Apriori principle. For many pattern mining applications, there is often a so-called pattern explosion with results that list millions of patterns. Due to the verbose nature of these lists, user-friendly tools to process, query, and visualize this output are indispensable.

Convenient prioritization, filtering, cleaning, and interpretation of pattern result lists require certain functionalities that are rarely covered by existing implementations. Third, iterative optimization of the pattern list and browsing through the output of these algorithms is often hard, as they create static output that needs to be processed and converted to a compatible format before the next step in the iterative mining process can start. This can make result prioritization, an inherent part of many pattern discovery projects, a very cumbersome process.^[14]

To address some of these limitations, software frameworks have been developed for interactive visual pattern mining, such as the MIME tool.^[15] Such toolboxes offer intuitive access to interest-level measures, mining algorithms, and post-processing algorithms to assist in identifying interesting patterns. By enabling interactive mining, they allow users to combine their subjective interest-level measure and background knowledge with a wide variety of objective measures to easily and quickly mine the most important and interesting patterns. In this article, we demonstrate the opportunities of frequent itemset mining in real-world bioinformatics scenarios and describe the application of three commonly used methods, namely, Apriori^[5], arules^[16], and MIME.^[15] This comparison is based on three representative bioinformatics use cases, i.e. domain co-occurrence within proteins, interactions between domains in interacting proteins, and the response of the pathogen Mycobacterium tuberculosis to several drug treatments. For this purpose, we utilize data from UniProt^[2], IntAct^[4], and COLOMBOS.^[1] The data files and step-by-step tutorials on how to install and run the three presented tools on the three use cases are available in Supplementary Files 1–5. The goal of this study is to explore how interesting and biologically relevant patterns can be effectively generated with different tools and provide the community with some guidance on how frequent itemset mining tools can be used in complex life science scenarios.

Materials and methods

Frequent itemset mining

For datasets with large numbers of objects, features, or observations, it is not tractable to check all possible combinations and find correlated annotation terms or biological entities. Frequent itemset mining comprises a set of tools that are able to find co-occurring terms, known as items, in big data. These items can be any entity, ranging from genes and RNAs to proteins or drugs, and thus allow, for example, the identification of co-expressed genes or proteins. The mining algorithms typically start from transactional databases, as shown in Figure 1.

Figure 1. Toy example of frequent itemset mining. The input of a frequent itemset mining approach is a transaction database (shown to the left). The output of the approach is a list of patterns and their support (shown to the right).

A transaction is simply a set of items, and a transactional database layout contains one transaction per row, in which each transaction refers to an observation, such as the collection of domains associated with a protein (Fig. 1) or the aggregation of supermarket articles found in a shopping basket that was checked out by a single customer. These algorithms use heuristics to reduce the search space. For example, Apriori-based methods use the Apriori principle, which states that if an itemset is not frequent, all supersets of these items will also not be frequent. A pattern is frequent when its items appear more than expected, which means that its support, the absolute number of times that a pattern occurs, needs to exceed a user-defined minimal threshold. It is worth noting that frequency and support are often used interchangeably, with frequency thresholds stating the relative minimum (often 10%) threshold that needs to be exceeded. A hot topic in this field is the inability of frequent itemset mining to identify co-occurrence of continuous values, as most methods require the user to conduct well-chosen discretization steps that define the items evaluated by the algorithm. We have described our discretization steps where needed in the case studies described subsequently.
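The level-wise search and the pruning based on the Apriori principle can be sketched in a few lines of Python. This is a minimal teaching sketch, not Borgelt's implementation or any of the tools discussed here; the function name, the toy transactions, and the choice to treat the threshold as inclusive (support greater than or equal to the minimum) are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        candidates = set()
        # Join step: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step (Apriori principle): a candidate can only be frequent
            # if every one of its (k-1)-subsets is frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
                candidates.add(union)
        # Count the support of the surviving candidates only.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database: one set of InterPro domains per protein.
proteins = [
    {"IPR000719", "IPR011009", "IPR017441"},
    {"IPR000719", "IPR011009"},
    {"IPR011009"},
    {"IPR007087", "IPR015880"},
]
patterns = apriori(proteins, min_support=2)
for itemset, sup in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

Because infrequent items are discarded at the first level, no candidate containing them is ever generated or counted, which is where the reduction of the search space comes from.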

Datasets

In order to compare the benefits of each frequent itemset method, several datasets were created from public resources, more specifically using InterPro, COLOMBOS, and IntAct. For each of the case studies, a transaction dataset was constructed based on the data downloaded from the database and saved in a simple space- or tab-delimited file. All of the transaction data files have been made available in Supplementary Files 1–3. A tutorial on how to practically execute the mining process has also been included (Supplementary File 4), as well as a small Python script that compares the mining results for Borgelt’s Apriori implementation (Supplementary File 5).

Protein domain analysis

For the first use case, we downloaded all known proteins of the human reference proteome on February 19, 2014 from UniProt^[2] and retained only those that contained at least one InterPro domain.^[17] In total, this resulted in a set consisting of 20,636 proteins. In the second use case, we mapped this information on top of the IntAct protein–protein network^[4], which was downloaded on the same day.

COLOMBOS

All the microarray information was obtained from COLOMBOS (a collection of microarrays for bacterial organisms)^[1], which contains numerous renormalized gene expression experiments extracted from the Gene Expression Omnibus^[18] and ArrayExpress.^[19] Using the advanced search option, we created a dataset for M. tuberculosis composed of several experiments in which the bacteria responded to antibiotics added to their growth medium. Next, the information in this dataset was discretized based on the fold change. Traditionally, log2-fold changes ranging from 1 to 1.5 are used to define the differential regulatory state of a gene. Here, we used a log2-fold change of 1.2, which results in a dataset that includes 25% of the protein-coding genes in M. tuberculosis that were differentially expressed in at least one condition contrast. All log2-fold changes greater than 1.2 were considered upregulated, while those smaller than −1.2 were labeled as being downregulated. Fold changes between these two values were excluded from the dataset. For each gene, the log-fold change was simplified to a discretized state (up or down) and appended as a suffix to the contrast information. For example, the gene Rv0823c has been identified to be downregulated (fold change −1.43) in study GSE1642, in which the treatment is an addition of 1 µM valinomycin. The combination of all this information into one label results in Rv0823c|Down, which is a discrete item that can be used in the mining process.
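The discretization described above can be sketched as follows. This is a minimal illustration, assuming one transaction per condition contrast and a mapping from gene identifier to log2-fold change; the function names and the genes other than Rv0823c are hypothetical, and this is not the script used to build the supplementary files.

```python
def discretize(gene, log2_fold_change, threshold=1.2):
    """Turn one measurement into an item such as 'Rv0823c|Down', or None if excluded."""
    if log2_fold_change > threshold:
        return f"{gene}|Up"
    if log2_fold_change < -threshold:
        return f"{gene}|Down"
    # Values between -threshold and +threshold are left out of the dataset.
    return None


def contrast_to_transaction(measurements, threshold=1.2):
    """Convert {gene: log2 fold change} for one contrast into a set of items."""
    items = (discretize(gene, value, threshold) for gene, value in measurements.items())
    return {item for item in items if item is not None}


# Rv0823c and its fold change come from the text; the other two genes are made up.
contrast = {"Rv0823c": -1.43, "geneX": 0.40, "geneY": 2.00}
transaction = contrast_to_transaction(contrast)
print(sorted(transaction))
```

Changing the threshold argument shows directly how the choice of cutoff determines which genes end up as items and, therefore, which patterns can be found at all.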

Tools

Apriori

Apriori is one of the oldest, simplest, and most popular frequent itemset mining algorithms.^[20] It is available in various implementations that vary from command line tools to parts of data analysis software suites. In this article, we use Christian Borgelt's Apriori implementation, which is available at http://www.borgelt.net/apriori.html.^[21] We compiled the C source code and ran the implementation using the syntax ./apriori [options] infile [outfile] from the terminal on a Macintosh running OS X 10.9 Mavericks.

One of the major advantages of Borgelt’s Apriori version, other than its improved pattern identification efficiency, is its support for distinctly different types of itemsets, including maximal, closed, and open itemsets. By default, Apriori generates all possible itemsets (open), which are typically far too many to analyze. The information contained by these patterns is largely redundant, which means that the resulting pattern list can be efficiently reduced with a minimal loss of information content. For example, only itemsets that have no frequent superset can be retained (maximal itemsets). This results in a major reduction in patterns that cannot be justified in every scenario. A more balanced general approach that still effectively reduces the output of the mining process consists of mining only closed patterns. These are formally defined as itemsets that have no immediate superset with the same support value. Although all these three pattern classes have been used to approach various life science problems, closed itemsets have been the most prominent in analyses that deal with the typical genome- or proteome-wide scale datasets.^[14]
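The relation between the three pattern classes can be illustrated by filtering a complete list of frequent itemsets after mining. The sketch below is a naive post-processing step with quadratic cost, assuming the patterns are available as a dictionary from itemset to support; dedicated miners derive closed and maximal itemsets directly rather than by filtering, and the function names are hypothetical.

```python
def closed_itemsets(patterns):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        s: sup
        for s, sup in patterns.items()
        if not any(s < other and sup == other_sup for other, other_sup in patterns.items())
    }


def maximal_itemsets(patterns):
    """Keep itemsets that have no frequent proper superset at all."""
    return {
        s: sup
        for s, sup in patterns.items()
        if not any(s < other for other in patterns)
    }


# Hypothetical list of all frequent itemsets with their support.
all_patterns = {
    frozenset({"A"}): 3,
    frozenset({"B"}): 2,
    frozenset({"A", "B"}): 2,
}
closed = closed_itemsets(all_patterns)
maximal = maximal_itemsets(all_patterns)
print(len(all_patterns), len(closed), len(maximal))
```

Here {B} is dropped from the closed set because {A, B} has the same support, so no information is lost. The maximal set keeps only {A, B}, and the support of {A} can no longer be recovered from it, which is the loss of information referred to above.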

arules

In addition to command line tools, several efforts have been made to bring frequent itemset mining to other platforms. One such project is the R package arules, which also contains the Apriori implementation by Borgelt in addition to other mining algorithms.^[16]

arules provides an entire toolbox for the representation, manipulation, and analysis of frequent itemsets and association rules in R. It contains several scoring metrics and allows the calculation of various properties, such as dissimilarity. arules is also compatible with arulesViz^[22], which allows rapid visualization of the mining results. arules is available for download through the R interface as a CRAN distribution.

MIME

The third tool covered in this study is MIME.^[15] MIME provides a dynamic and interactive graphical environment to interact with the dataset and retrieve patterns. It hosts several popular pattern mining algorithms, such as Apriori^[5], Eclat^[23], tiling^[24], top-K mining^[25], the OPUS Miner^[26], and Carpenter^[6], as well as various classification algorithms that are based on association rule mining.^[27]^[28]^[29] Similar to arules, MIME has various scoring measures (so-called interestingness metrics) and, furthermore, supports iterative mining. The software is written in Java and depends on two Java libraries, namely, QTJambi (http://qt-jambi.org) and WEKA^[30], and is available from http://adrem.ua.ac.be/mime.

Results and discussion

By means of practical use cases, we will go through increasingly complex life sciences scenarios that at each step require additional preprocessing steps to translate the problem into a format recognizable by the tools.

Case study 1: Analyzing domain co-occurrence within proteins

The simplest use case for frequent itemset mining is the analysis of domain co-occurrence within proteins. The transaction dataset contains all human proteins documented in InterPro with at least one protein domain. Herein, each protein is considered as a transaction, which equals a single line in the input file containing the domains (items). In total, 20,636 proteins were present in the dataset. All three tools were able to find the most frequent individual domains by setting the maximal number of items in an itemset to one. MIME delivers this information without requiring an extra step. This step allows us to identify which domains occur the most in the dataset, which is important for the calculation of several quality measures, such as the lift value. It also gives the researcher a quick initial idea of which domains are most likely to appear in frequent itemsets. If these domains associate in a nonspecific manner with another domain and their co-occurrence can be expected by random combination (low lift value), the pattern is not interesting for a biologist who tries to identify domains that are functionally correlated.
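Counting the individual domains requires nothing more than one pass over the transaction file. The sketch below assumes the whitespace-delimited layout described above (one protein per line, domains separated by spaces or tabs); the function name and the three example lines are hypothetical and stand in for the supplementary data file.

```python
import io
from collections import Counter


def item_frequencies(lines):
    """Return (item, support, frequency in %) tuples, most frequent item first."""
    counts = Counter()
    n_transactions = 0
    for line in lines:
        items = set(line.split())  # a set, so a repeated domain counts once per protein
        if not items:
            continue  # skip empty lines
        n_transactions += 1
        counts.update(items)
    return [
        (item, count, 100.0 * count / n_transactions)
        for item, count in counts.most_common()
    ]


# Hypothetical stand-in for the transaction file; in practice pass an open file handle.
toy_file = io.StringIO(
    "IPR027417 IPR001806\n"
    "IPR027417\n"
    "IPR000719\tIPR011009\n"
)
ranking = item_frequencies(toy_file)
for item, sup, freq in ranking:
    print(f"{item}\t{sup}\t{freq:.2f}%")
```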

The most frequent domain was identified to be IPR027417, a P-loop containing nucleoside triphosphate hydrolase, which occurred 1329 times (6.44% of all proteins). Other than this domain, only IPR013783 (4.44%), IPR011009 (4.31%), IPR000719 (3.96%), and IPR015943 (3.19%) occur in more than 3% of the human proteins. IPR000719 and IPR011009 (protein kinase-like domain) occurred together in the majority of proteins they occurred in, as represented by the corresponding support value (3.94%). This pattern was the most frequent of all associated InterPro terms, with the highest support and a high lift value, which suggests a nonrandom association. However, this can easily be attributed to the fact that the former is a protein kinase domain and the latter is a protein kinase-like domain, with IPR000719 being a subset of IPR011009 and kinases being a family containing a significant number of proteins. Interestingly, one would then expect IPR000719 (819 proteins) to always co-occur with IPR011009, but this was apparently not the case for five proteins, namely, C9JE15 (aarF domain-containing protein kinase 2), D6RHX9 (calcium/calmodulin-dependent protein kinase type II subunit alpha), E9PPN3 (N-terminal kinase-like protein), F8W0N2 (serine/threonine-protein kinase receptor R3), and H0YAH6 (epithelial discoidin domain-containing receptor 1). In the following updates of UniProt, these five proteins had the IPR011009 domain added, which illustrates that even basic frequent itemset mining can be applied to detect inconsistencies in annotations.
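As a worked example of the lift value, the relative supports reported above can be plugged into the usual definition, i.e. the observed joint frequency divided by the frequency expected if the two domains were independent. The sketch uses the rounded percentages from the text, so the result is approximate; the function name is hypothetical.

```python
def lift(freq_xy, freq_x, freq_y):
    """Observed co-occurrence frequency relative to the expectation under independence."""
    return freq_xy / (freq_x * freq_y)


# Relative supports from the text: IPR000719 (3.96%), IPR011009 (4.31%), both (3.94%).
kinase_lift = lift(0.0394, 0.0396, 0.0431)
print(round(kinase_lift, 1))
```

A lift of 1 would indicate that the domains co-occur as often as expected by chance. The value of roughly 23 obtained here reflects the strongly nonrandom association between the protein kinase domain and the protein kinase-like domain.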

The combination of low cutoff percentages and the complexity of biological data makes setting a minimal threshold a rather daunting task. One approach is iteratively lowering the threshold and keeping it above the value where the number of patterns explodes. This parameter fine-tuning may be time consuming, and the arbitrary threshold can be hard to justify biologically. In this case, it can be useful to simply search for the 100 most frequent patterns without much knowledge of their individual protein abundance. In many settings, the manual evaluation of the pattern results will limit itself to those patterns that have the highest support value, as these will be most frequent in the data set and often the most relevant or interesting. For such a purpose, top-K mining would be much more straightforward as the only parameter it requires is setting K to 100. Of the three implementations addressed, this is only possible with MIME. Top-K mining with the other solutions requires knowledge of Java or can be cumbersome.^[31]^[32] Figure 2 shows the output of the three itemset mining implementations. For Apriori and arules, we iteratively optimized a lower threshold in order to display these results. An overview of the most frequent itemsets obtained with MIME is listed in Table 1.
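The iterative lowering of the threshold until K patterns are found can be sketched as below. This emulates a top-K result by re-running a miner at decreasing thresholds; it is not the top-K algorithm implemented in MIME, and the naive miner, the function names, and the toy data are hypothetical. Dedicated top-K miners avoid the repeated passes.

```python
import heapq
from itertools import combinations


def mine(transactions, min_support, max_size=3):
    """Naive miner for illustration: count every itemset of 2..max_size items."""
    counts = {}
    for t in transactions:
        for size in range(2, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}


def top_k_by_lowering(transactions, k, start_support):
    """Lower the support threshold step by step until at least k patterns are found."""
    threshold = start_support
    patterns = mine(transactions, threshold)
    while len(patterns) < k and threshold > 1:
        threshold -= 1
        patterns = mine(transactions, threshold)
    # Keep the k patterns with the highest support.
    top = heapq.nlargest(k, patterns.items(), key=lambda kv: kv[1])
    return top, threshold


# Hypothetical toy transactions.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}, {"c", "d"}]
top, used_threshold = top_k_by_lowering(toy, k=1, start_support=4)
print(top, used_threshold)
```

The user only supplies K; the threshold that was eventually needed is returned alongside the patterns so that it can be reported.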

Figure 2. Input (left) and output (right) of frequent itemset mining in popular frequent itemset mining implementations. The upper part shows the look and feel of the terminal-based Apriori-Borgelt implementation. The output of this tool shows each pattern in turn on each line with the support value as a frequency between brackets. In the middle, a similar figure is shown for the arules package. The arules output is an R data object with the items that make up a pattern in one column and the support in the second column. The bottom of the figure features the input and output of MIME. Note here that the red dots in the upper whitespace indicate the items, which are described by their name and their individual support (between brackets). These items can be dragged and dropped in the larger white area below to modify the mining output.

Table 1. The 30 most frequent InterPro intraprotein patterns
Itemset	Frequency (%)	Support
IPR000719 IPR011009	3.94	814
IPR007087 IPR015880	2.76	569
IPR007110 IPR013783	2.72	562
IPR017986 IPR015943	2.71	559
IPR001680 IPR015943	2.55	526
IPR001680 IPR017986	2.49	514
IPR001680 IPR017986 IPR015943	2.47	510
IPR013087 IPR007087	2.36	487
IPR013087 IPR015880	2.30	474
IPR013087 IPR007087 IPR015880	2.28	471
IPR017441 IPR011009	2.21	457
IPR017441 IPR000719 IPR011009	2.19	452
IPR000504 IPR012677	2.18	450
IPR011989 IPR016024	1.78	368
IPR003599 IPR013783	1.76	363
IPR008271 IPR000719	1.73	357
IPR008271 IPR000719 IPR011009	1.73	356
IPR001849 IPR011993	1.72	355
IPR002110 IPR020683	1.62	334
IPR003599 IPR007110 IPR013783	1.50	310
IPR002048 IPR011992	1.39	286
IPR003961 IPR013783	1.33	275
IPR019775 IPR001680	1.24	255
IPR019775 IPR001680 IPR017986	1.23	254
IPR019775 IPR001680 IPR015943	1.23	254
IPR019775 IPR001680 IPR017986 IPR015943	1.23	253
IPR001841 IPR013083	1.22	251
IPR013032 IPR000742	1.18	243
IPR001806 IPR027417	1.12	232
IPR003598 IPR013783	1.11	229
IPR003598 IPR007110	1.10	227
IPR000008 IPR008973	1.05	217
IPR013098 IPR013783	1.04	214
IPR013098 IPR007110 IPR013783	1.02	211

References

1. Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649-D653. doi:10.1093/nar/gkt1086. PMC PMC3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.
2. UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43-7. doi:10.1093/nar/gks1068. PMC PMC3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.
3. Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267-72. doi:10.1093/nar/gkt1127. PMC PMC3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074.
4. Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358-63. doi:10.1093/nar/gkt1115. PMC PMC3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.
5. Agrawal, R.; Imielinski, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.
6. Pan, F.; Cong, G.; Tung, A.K.H. et al. (2003). "Carpenter: Finding closed patterns in long biological datasets". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 637–642. doi:10.1145/956750.956832.
7. Cong, G.; Tan, K.-L.; Tung, A.K.H.; Xu, X. (2005). "Mining top-K covering rule groups for gene expression data". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 670-681. doi:10.1145/1066157.1066234.
8. Gouda, K.; Zaki, M.J. (2005). "GenMax: An efficient algorithm for mining maximal frequent itemsets". Data Mining and Knowledge Discovery 11 (3): 223-242. doi:10.1007/s10618-005-0002-x.
9. Artamonova, I.I.; Frishman, G.; Gelfand, M.S; Frishman, D. (2005). "Mining sequence annotation databanks for association patterns". Bioinformatics 21 (Suppl 3): iii49-iii57. doi:10.1093/bioinformatics/bti1206. PMID 16306393.
10. Manda, P.; Ozkan, S.; Wang, H. et al. (2012). "Cross-ontology multi-level association rule mining in the gene ontology". PLoS One 7 (10): e47411. doi:10.1371/journal.pone.0047411. PMC PMC3470562. PMID 23071802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470562.
11. Martinez, R.; Pasquier, N.; Pasquier, C. (2009). "Mining Association Rule Bases from Integrated Genomic Data and Annotations". In Masulli, F.; Tagliaferri, R.; Verkhivker, G.M.. Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer Berlin Heidelberg. pp. 78–90. doi:10.1007/978-3-642-02504-4_7. ISBN 9783642025044.
12. Tseng, V.S.; Yu, H.-H.; Yang, S.-C. (2009). "Efficient mining of multilevel gene association rules from microarray and gene ontology". Information Systems Frontiers 11 (4): 433-447. doi:10.1007/s10796-009-9156-1.
13. Karpinets, T.V.; Park, B.H.; Uberbacher, E.C. (2012). "Analyzing large biological datasets with association networks". Nucleic Acids Research 40 (17): e131. doi:10.1093/nar/gks403. PMC PMC3458522. PMID 22638576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458522.
14. Naulaerts, S.; Meysman, P.; Bittremieux, W. et al. (2015). "A primer to frequent itemset mining for bioinformatics". Briefings in Bioinformatics 16 (2): 216–31. doi:10.1093/bib/bbt074. PMC PMC4364064. PMID 24162173. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4364064.
15. Goethals, B.; Moens, S.; Vreeken, J. (2011). "MIME: A framework for interactive visual pattern mining". Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 757–760. doi:10.1145/2020408.2020529.
16. Hahsler, M.; Grün, B.; Hornik, K. (2005). "arules – A computational environment for mining association rules and frequent item sets". Journal of Statistical Software 14 (15): 1–25. doi:10.18637/jss.v014.i15.
17. Hunter, S.; Jones, P.; Mitchell, A. et al. (2012). "InterPro in 2011: New developments in the family and domain prediction database". Nucleic Acids Research 40 (D1): D306-12. doi:10.1093/nar/gkr948. PMC PMC3245097. PMID 22096229. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245097.
18. Barrett, T.; Wilhite, S.E.; Ledoux, P. et al. (2013). "NCBI GEO: Archive for functional genomics data sets - Update". Nucleic Acids Research 41 (D1): D991-5. doi:10.1093/nar/gks1193. PMC PMC3531084. PMID 23193258. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531084.
19. Rustici, G.; Kolesnikov, N.; Brandizi, M. et al. (2013). "ArrayExpress update - Trends in database growth and links to data analysis tools". Nucleic Acids Research 41 (D1): D987-90. doi:10.1093/nar/gks1174. PMC PMC3531147. PMID 23193272. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531147.
20. Agrawal, R.; Srikant, R. (1994). "Fast algorithms for mining association rules in large databases". Proceedings of the 20th International Conference on Very Large Data Bases: 487–499. http://dl.acm.org/citation.cfm?id=672836.
21. Borgelt, C. (2003). "Efficient implementations of Apriori and Eclat". Proceedings for the 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations 2003: 90.
22. Hahsler, M.; Chelluboina, S.; Hornik, K.; Buchta, C. (2011). "The arules R-Package ecosystem: Analyzing interesting patterns from large transaction data sets". The Journal of Machine Learning Research 12 (2): 2021-2025. http://dl.acm.org/citation.cfm?id=2021064.
23. Zaki, M.J.; Parthasarathy, S.; Ogihara, M.; Wei, L. (1997). "New algorithms for fast discovery of association rules". 3rd International Conference on Knowledge Discovery and Data Mining: 283–286. http://dl.acm.org/citation.cfm?id=2021064.
24. Geerts, F.; Goethals, B.; Mielikäinen, T. (2004). "Tiling Databases". In Suzuki, E.; Arikawa, S.. Discovery Science. Springer Berlin Heidelberg. pp. 278-289. doi:10.1007/978-3-540-30214-8_22. ISBN 9783540302148.
25. Han, J.; Wang, J.; Lu, Y.; Tzvetkov, P. (2002). "Mining top-k frequent closed patterns without minimum support". Proceedings of the 2002 IEEE International Conference on Data Mining: 211–218. http://dl.acm.org/citation.cfm?id=844747.
26. Webb, G.I. (1995). "OPUS: An efficient admissible algorithm for unordered search". Journal of Artificial Intelligence Research 3 (1): 431–465. http://dl.acm.org/citation.cfm?id=1622635.
27. Liu, B.; Hsu, W.; Ma, Y. (1998). "Integrating classification and association rule mining". Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining: 80–86.
28. Li, W.; Han, J.; Pei, J. (2001). "CMAR: Accurate and efficient classification based on multiple class-association rules". Proceedings of the 2001 IEEE International Conference on Data Mining: 369-376. http://dl.acm.org/citation.cfm?id=657866.
29. Yin, X.; Han, J. (2003). "CPAR: Classification based on predictive association rules". Proceedings of the SIAM International Conference on Data Mining: 331-335.
30. Hall, M.; Frank, E.; Holmes, G. et al. (2009). "The WEKA data mining software: An update". ACM SIGKDD Explorations Newsletter 11 (1): 10-18. doi:10.1145/1656274.1656278.
31. Liu, Y.C.; Cheng, C.P.; Tseng, V.S. (2013). "Mining differential top-k co-expression patterns from time course comparative gene expression datasets". BMC Bioinformatics 14: 230. doi:10.1186/1471-2105-14-230. PMC PMC3751367. PMID 23870110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751367.
32. Fournier-Viger, P.; Gomariz, A.; Gueniche, T. et al. (2014). "SPMF: A Java open-source pattern mining library". The Journal of Machine Learning Research 15 (1): 3389-3393. http://dl.acm.org/citation.cfm?id=2750353.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.

[MeysmanCOLO14-1] 1.0 ^1.1 ^1.2 Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649-D653. doi:10.1093/nar/gkt1086. PMC PMC3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.

[UniProtUpdate13-2] 2.0 ^2.1 ^2.2 UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43-7. doi:10.1093/nar/gks1068. PMC PMC3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.

[MaiettaFireDB14-3] Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267-72. doi:10.1093/nar/gkt1127. PMC PMC3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074.

[OrchardTheMint14-4] 4.0 ^4.1 ^4.2 Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358-63. doi:10.1093/nar/gkt1115. PMC PMC3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.

[AgrawalMining93-5] 5.0 ^5.1 ^5.2 Agrawal, R.; Imielinksi, T.; Swami, A. (1993). "Mining association rules between sets of items in large database". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.

[PanCarp03-6] 6.0 ^6.1 Pan, F.; Cong, G.; Tung, A.K.H. et al. (2003). "Carpenter: Finding closed patterns in long biological datasets". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 637–642. doi:10.1145/956750.956832.

[CongMining05-7] Cong, G.; Tan, K.-L.; Tung, A.K.H.; Xu, X. (2005). "Mining top-K covering rule groups for gene expression data". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 670-681. doi:10.1145/1066157.1066234.

[GoudaGenMax05-8] Gouda, K.; Zaki, M.J. (2005). "GenMax: An efficient algorithm for mining maximal frequent itemsets". Data Mining and Knowledge Discovery 11 (3): 223–242. doi:10.1007/s10618-005-0002-x.

[ArtamonovaMining05-9] Artamonova, I.I.; Frishman, G.; Gelfand, M.S.; Frishman, D. (2005). "Mining sequence annotation databanks for association patterns". Bioinformatics 21 (Suppl 3): iii49–iii57. doi:10.1093/bioinformatics/bti1206. PMID 16306393.

[MandaCross12-10] Manda, P.; Ozkan, S.; Wang, H. et al. (2012). "Cross-ontology multi-level association rule mining in the gene ontology". PLoS One 7 (10): e47411. doi:10.1371/journal.pone.0047411. PMC PMC3470562. PMID 23071802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470562.

[MartinezMining09-11] Martinez, R.; Pasquier, N.; Pasquier, C. (2009). "Mining Association Rule Bases from Integrated Genomic Data and Annotations". In Masulli, F.; Tagliaferri, R.; Verkhivker, G.M. (eds.). Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer Berlin Heidelberg. pp. 78–90. doi:10.1007/978-3-642-02504-4_7. ISBN 9783642025044.

[TsengEfficient09-12] Tseng, V.S.; Yu, H.-H.; Yang, S.-C. (2009). "Efficient mining of multilevel gene association rules from microarray and gene ontology". Information Systems Frontiers 11 (4): 433–447. doi:10.1007/s10796-009-9156-1.

[KarpinetsAnal12-13] Karpinets, T.V.; Park, B.H.; Uberbacher, E.C. (2012). "Analyzing large biological datasets with association networks". Nucleic Acids Research 40 (17): e131. doi:10.1093/nar/gks403. PMC PMC3458522. PMID 22638576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458522.

[NaulaertsAPrimer15-14] Naulaerts, S.; Meysman, P.; Bittremieux, W. et al. (2015). "A primer to frequent itemset mining for bioinformatics". Briefings in Bioinformatics 16 (2): 216–231. doi:10.1093/bib/bbt074. PMC PMC4364064. PMID 24162173. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4364064.

[GoethalsMIME11-15] Goethals, B.; Moens, S.; Vreeken, J. (2011). "MIME: A framework for interactive visual pattern mining". Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 757–760. doi:10.1145/2020408.2020529.

[HahslerArules05-16] Hahsler, M.; Grün, B.; Hornik, K. (2005). "arules – A computational environment for mining association rules and frequent item sets". Journal of Statistical Software 14 (15): 1–25. doi:10.18637/jss.v014.i15.

[HunterInter12-17] Hunter, S.; Jones, P.; Mitchell, A. et al. (2012). "InterPro in 2011: New developments in the family and domain prediction database". Nucleic Acids Research 40 (D1): D306–D312. doi:10.1093/nar/gkr948. PMC PMC3245097. PMID 22096229. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3245097.

[BarrettNCBI13-18] Barrett, T.; Wilhite, S.E.; Ledoux, P. et al. (2013). "NCBI GEO: Archive for functional genomics data sets - Update". Nucleic Acids Research 41 (D1): D991–D995. doi:10.1093/nar/gks1193. PMC PMC3531084. PMID 23193258. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531084.

[RusticiArray13-19] Rustici, G.; Kolesnikov, N.; Brandizi, M. et al. (2013). "ArrayExpress update - Trends in database growth and links to data analysis tools". Nucleic Acids Research 41 (D1): D987–D990. doi:10.1093/nar/gks1174. PMC PMC3531147. PMID 23193272. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531147.

[AgrawalFast94-20] Agrawal, R.; Srikant, R. (1994). "Fast algorithms for mining association rules in large databases". Proceedings of the 20th International Conference on Very Large Data Bases: 487–499. http://dl.acm.org/citation.cfm?id=672836.

[BorgeltEffic03-21] Borgelt, C. (2003). "Efficient implementations of Apriori and Eclat". Proceedings of the 1st IEEE ICDM Workshop on Frequent Item Set Mining Implementations 2003: 90.

[HahslerThearules11-22] Hahsler, M.; Chelluboina, S.; Hornik, K.; Buchta, C. (2011). "The arules R-Package ecosystem: Analyzing interesting patterns from large transaction data sets". The Journal of Machine Learning Research 12 (2): 2021–2025. http://dl.acm.org/citation.cfm?id=2021064.

[ZakiNew97-23] Zaki, M.J.; Parthasarathy, S.; Ogihara, M.; Wei, L. (1997). "New algorithms for fast discovery of association rules". 3rd International Conference on Knowledge Discovery and Data Mining: 283–286.

[GeertsTiling04-24] Geerts, F.; Goethals, B.; Mielikäinen, T. (2004). "Tiling Databases". In Suzuki, E.; Arikawa, S. (eds.). Discovery Science. Springer Berlin Heidelberg. pp. 278–289. doi:10.1007/978-3-540-30214-8_22. ISBN 9783540302148.

[HanMining02-25] Han, J.; Wang, J.; Lu, Y.; Tzvetkov, P. (2002). "Mining top-k frequent closed patterns without minimum support". Proceedings of the 2002 IEEE International Conference on Data Mining: 211–218. http://dl.acm.org/citation.cfm?id=844747.

[WebbOPUS95-26] Webb, G.I. (1995). "OPUS: An efficient admissible algorithm for unordered search". Journal of Artificial Intelligence Research 3 (1): 431–465. http://dl.acm.org/citation.cfm?id=1622635.

[LiuInt98-27] Liu, B.; Hsu, W.; Ma, Y. (1998). "Integrating classification and association rule mining". Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining: 80–86.

[LiCMAR01-28] Li, W.; Han, J.; Pei, J. (2001). "CMAR: Accurate and efficient classification based on multiple class-association rules". Proceedings of the 2001 IEEE International Conference on Data Mining: 369–376. http://dl.acm.org/citation.cfm?id=657866.

[YinCPAR03-29] Yin, X.; Han, J. (2003). "CPAR: Classification based on predictive association rules". Proceedings of the SIAM International Conference on Data Mining: 331–335.

[HallWEKA09-30] Hall, M.; Frank, E.; Holmes, G. et al. (2009). "The WEKA data mining software: An update". ACM SIGKDD Explorations Newsletter 11 (1): 10–18. doi:10.1145/1656274.1656278.

[LiuMining13-31] Liu, Y.C.; Cheng, C.P.; Tseng, V.S. (2013). "Mining differential top-k co-expression patterns from time course comparative gene expression datasets". BMC Bioinformatics 14: 230. doi:10.1186/1471-2105-14-230. PMC PMC3751367. PMID 23870110. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3751367.

[Fournier-VigerSOMF14-32] Fournier-Viger, P.; Gomariz, A.; Gueniche, T. et al. (2014). "SPMF: A Java open-source pattern mining library". The Journal of Machine Learning Research 15 (1): 3389–3393. http://dl.acm.org/citation.cfm?id=2750353.
