Difference between revisions of "Journal:Practical approaches for mining frequent patterns in molecular datasets"

Full article title	Practical approaches for mining frequent patterns in molecular datasets
Journal	Bioinformatics and Biology Insights
Author(s)	Naulaerts, S.; Moens, S.; Engelen, K.; Vanden Berghe, W.; Goethals, B.; Laukens, K.; Meysman, P.
Author affiliation(s)	University of Antwerp, Antwerp University Hospital, Fondazione Edmund Mach
Primary contact	Email: pieter dot meysman at uantwerpen dot be
Editors	Dandekar, T.
Year published	2016
Volume and issue	10
Page(s)	37–47
DOI	10.4137/BBI.S38419
ISSN	1177-9322
Distribution license	Creative Commons Attribution 3.0 Unported
Website	http://www.la-press.com/ (HTML)
Download	http://www.la-press.com/ (PDF)

Revision as of 23:26, 23 August 2016

This article should not be considered complete until this message box has been removed. This is a work in progress.

Abstract

Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still keep promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations to common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should base their choice of software on their research question, their programming proficiency, and the added value of extra features.

Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis

Introduction

In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.^[1]^[2]^[3]^[4] A popular group of pattern mining techniques is itemset mining and its derivative, association rule mining. These methods are best known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying patterns in customers’ shopping behavior.^[5] In this context, the shopping cart is formally known as a transaction, while the individual products are the items. The goal of this data mining approach is to discover sets of correlated items (i.e., itemsets), which can be highly relevant in the life sciences. For example, one can investigate which genes are often coexpressed in tissue samples or which mutations often occur together in tumors of a given type.
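To make the transaction and itemset vocabulary concrete, the following is a minimal sketch in Python rather than any of the software discussed in this article. The gene names, the five toy samples, the function name frequent_itemsets, and the threshold min_support are all assumptions made for illustration. Each sample plays the role of a shopping cart (a transaction) and each expressed gene is an item. The sketch enumerates every subset exhaustively, which is only workable for tiny inputs; dedicated algorithms prune this search.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: one transaction per tissue sample,
# one item per gene observed as expressed in that sample.
transactions = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneC"},
    {"geneB", "geneC"},
    {"geneA", "geneB", "geneC"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions.

    Brute-force enumeration of every subset of every transaction;
    exponential in transaction size, so suitable for illustration only.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


result = frequent_itemsets(transactions, min_support=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a threshold of three samples, every single gene and every gene pair in this toy data is frequent, while the full triple occurs in only two samples and is dropped.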

Frequent itemset mining has proven especially useful for capturing the characteristics of complex datasets and summarizing them to their most important and interesting aspects. Frequent patterns can be converted into rules with discriminatory value that can, in turn, be used to build transparent classifications. For example, if a gene C is always upregulated when genes A and B are downregulated, the frequent itemset {A|Down, B|Down, C|Up} can be rewritten as the rule {A|Down, B|Down} ⇒ {C|Up}, where the left-hand side (antecedent) of the rule leads to the right-hand side (consequent). Rules of this type can be used to distinguish between tumor types, gene clusters, and various other biological contrasts. Because the rules immediately explain why a particular label was given, this approach has an advantage over machine learning methods such as neural networks, which act as a black box. The strengths of frequent itemset mining have consequently been demonstrated in a broad range of bioinformatics applications, ranging from gene expression data^[6]^[7]^[8], annotation mining^[9]^[10], and combinations thereof^[11]^[12] to interaction networks.^[13] A comprehensive overview of the broad range of implementations and bioinformatics applications of frequent itemset mining techniques was recently published.^[14]
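How strongly an antecedent leads to a consequent is commonly quantified by the confidence of the rule: the fraction of transactions containing the antecedent that also contain the consequent. The text above does not name this measure, so its use here is an assumption, as are the four toy samples and the helper names support and confidence. The sketch evaluates the example rule with antecedent {A|Down, B|Down} and consequent {C|Up}.

```python
def support(itemset, transactions):
    """Count the transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent.

    Returns None when the antecedent never occurs, since the ratio is undefined.
    """
    antecedent_count = support(antecedent, transactions)
    if antecedent_count == 0:
        return None
    return support(antecedent | consequent, transactions) / antecedent_count


# Hypothetical discretized expression data: one transaction per sample.
samples = [
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "B|Down", "C|Up"}),
    frozenset({"A|Down", "B|Up", "C|Down"}),
    frozenset({"A|Up", "B|Down", "C|Up"}),
]

antecedent = frozenset({"A|Down", "B|Down"})
consequent = frozenset({"C|Up"})

rule_conf = confidence(antecedent, consequent, samples)
reverse_conf = confidence(consequent, antecedent, samples)
print(rule_conf, reverse_conf)
```

In this toy data C is upregulated in every sample where A and B are both downregulated, so the rule has confidence 1.0, whereas the reversed rule holds in only two of the three samples with C upregulated. The asymmetry is what distinguishes a rule from the underlying itemset.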

References

1. Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649–D653. doi:10.1093/nar/gkt1086. PMC PMC3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.
2. UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43–7. doi:10.1093/nar/gks1068. PMC PMC3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.
3. Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267–72. doi:10.1093/nar/gkt1127. PMC PMC3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074.
4. Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358–63. doi:10.1093/nar/gkt1115. PMC PMC3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.
5. Agrawal, R.; Imielinski, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.
6. Pan, F.; Cong, G.; Tung, A.K.H. et al. (2003). "Carpenter: Finding closed patterns in long biological datasets". Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 637–642. doi:10.1145/956750.956832.
7. Cong, G.; Tan, K.-L.; Tung, A.K.H.; Xu, X. (2005). "Mining top-K covering rule groups for gene expression data". Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data: 670–681. doi:10.1145/1066157.1066234.
8. Gouda, K.; Zaki, M.J. (2005). "GenMax: An efficient algorithm for mining maximal frequent itemsets". Data Mining and Knowledge Discovery 11 (3): 223–242. doi:10.1007/s10618-005-0002-x.
9. Artamonova, I.I.; Frishman, G.; Gelfand, M.S.; Frishman, D. (2005). "Mining sequence annotation databanks for association patterns". Bioinformatics 21 (Suppl 3): iii49–iii57. doi:10.1093/bioinformatics/bti1206. PMID 16306393.
10. Manda, P.; Ozkan, S.; Wang, H. et al. (2012). "Cross-ontology multi-level association rule mining in the gene ontology". PLoS One 7 (10): e47411. doi:10.1371/journal.pone.0047411. PMC PMC3470562. PMID 23071802. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470562.
11. Martinez, R.; Pasquier, N.; Pasquier, C. (2009). "Mining Association Rule Bases from Integrated Genomic Data and Annotations". In Masulli, F.; Tagliaferri, R.; Verkhivker, G.M. Computational Intelligence Methods for Bioinformatics and Biostatistics. Springer Berlin Heidelberg. pp. 78–90. doi:10.1007/978-3-642-02504-4_7. ISBN 9783642025044.
12. Tseng, V.S.; Yu, H.-H.; Yang, S.-C. (2009). "Efficient mining of multilevel gene association rules from microarray and gene ontology". Information Systems Frontiers 11 (4): 433–447. doi:10.1007/s10796-009-9156-1.
13. Karpinets, T.V.; Park, B.H.; Uberbacher, E.C. (2012). "Analyzing large biological datasets with association networks". Nucleic Acids Research 40 (17): e131. doi:10.1093/nar/gks403. PMC PMC3458522. PMID 22638576. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458522.
14. Naulaerts, S.; Meysman, P.; Bittremieux, W. et al. (2015). "A primer to frequent itemset mining for bioinformatics". Briefings in Bioinformatics 16 (2): 216–31. doi:10.1093/bib/bbt074. PMC PMC4364064. PMID 24162173. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4364064.

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.


