Journal:Practical approaches for mining frequent patterns in molecular datasets

From LIMSWiki
Revision as of 22:56, 23 August 2016 by Shawndouglas (talk | contribs) (Created stub. Saving and adding more.)
Full article title: Practical approaches for mining frequent patterns in molecular datasets
Journal: Bioinformatics and Biology Insights
Author(s): Naulaerts, S.; Moens, S.; Engelen, K.; Vanden Berghe, W.; Goethals, B.; Laukens, K.; Meysman, P.
Author affiliation(s): University of Antwerp, Antwerp University Hospital, Fondazione Edmund Mach
Primary contact: Email: pieter dot meysman at uantwerpen dot be
Editors: Dandekar, T.
Year published: 2016
Volume and issue: 10
Page(s): 37–47
DOI: 10.4137/BBI.S38419
ISSN: 1177-9322
Distribution license: Creative Commons Attribution 3.0 Unported
Website: http://www.la-press.com/ (HTML)
Download: http://www.la-press.com/ (PDF)

Abstract

Pattern detection is an inherent task in the analysis and interpretation of complex and continuously accumulating biological data. Numerous itemset mining algorithms have been developed in the last decade to efficiently detect specific pattern classes in data. Although many of these have proven their value for addressing bioinformatics problems, several factors still keep promising algorithms from gaining popularity in the life science community. Many of these issues stem from the low user-friendliness of these tools and the complexity of their output, which is often large, static, and consequently hard to interpret. Here, we apply three software implementations to common bioinformatics problems and illustrate some of the advantages and disadvantages of each, as well as inherent pitfalls of biological data mining. Frequent itemset mining exists in many different flavors, and users should base their choice of software on their research question, their programming proficiency, and the added value of extra features.

Keywords: frequent itemset mining, protein domain structure, protein–protein interaction, gene expression, Mycobacterium tuberculosis

Introduction

In the last decade, various information-rich resources have become available to study organisms on a systems-wide scale. The rapid accumulation of complex biological data in extensive compendia demands powerful and specialized pattern mining techniques.[1][2][3][4] A popular group of pattern mining techniques is itemset mining and its derivative, association rule mining. These methods are typically known for their ability to detect frequently co-occurring products in lists of customer supermarket baskets, effectively identifying the patterns in customers’ shopping behavior.[5] In this context, the shopping cart is formally known as a transaction, while the individual products are the items. The discovery of sets of correlated items (i.e., itemsets) is the goal of this data mining approach, which can be highly relevant in the context of the life sciences. For example, one can investigate which genes are often coexpressed in tissue samples or which mutations often occur together in tumors of a given type.
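To make the transaction and item terminology concrete, the following minimal Python sketch counts how often each small set of items occurs together across a list of transactions and keeps the sets that reach a minimum support. It is an illustration only: the gene names, the toy samples, and the function name frequent_itemsets are hypothetical, and the brute-force enumeration used here is not one of the algorithms or software tools evaluated in the article.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Naive enumeration of all subsets up to max_size items; fine for toy data,
    far too slow for real compendia.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # sort so each itemset has one canonical tuple
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy data: each transaction is a tissue sample,
# each item is a gene observed as expressed in that sample.
samples = [
    {"geneA", "geneB", "geneC"},
    {"geneA", "geneB"},
    {"geneA", "geneC"},
    {"geneB", "geneC", "geneD"},
    {"geneA", "geneB", "geneC"},
]

result = frequent_itemsets(samples, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, support)
```

With a minimum support of three samples, the three gene pairs among geneA, geneB, and geneC are reported as frequent, while the triple containing all three genes (present in only two samples) and everything involving geneD are filtered out.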

References

  1. Meysman, P.; Sonego, P.; Bianco, L. et al. (2014). "COLOMBOS v2.0: An ever expanding collection of bacterial expression compendia". Nucleic Acids Research 42 (D1): D649–53. doi:10.1093/nar/gkt1086. PMC 3965013. PMID 24214998. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965013.
  2. UniProt Consortium (2013). "Update on activities at the Universal Protein Resource (UniProt) in 2013". Nucleic Acids Research 41 (D1): D43–7. doi:10.1093/nar/gks1068. PMC 3531094. PMID 23161681. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531094.
  3. Maietta, P.; Lopez, G.; Carro, A. et al. (2014). "FireDB: A compendium of biological and pharmacologically relevant ligands". Nucleic Acids Research 42 (D1): D267–72. doi:10.1093/nar/gkt1127. PMC 3965074. PMID 24243844. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965074.
  4. Orchard, S.; Ammari, M.; Aranda, B. et al. (2014). "The MIntAct project: IntAct as a common curation platform for 11 molecular interaction databases". Nucleic Acids Research 42 (D1): D358–63. doi:10.1093/nar/gkt1115. PMC 3965093. PMID 24234451. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965093.
  5. Agrawal, R.; Imielinski, T.; Swami, A. (1993). "Mining association rules between sets of items in large databases". Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data 22 (2): 207–16.

Notes

This version is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added.