Difference between revisions of "Journal:Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)"

Full article title	Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)
Journal	Journal of Biomedical Informatics
Author(s)	Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
Author affiliation(s)	Stanford University
Primary contact	Email: olivier dot gevaert at stanford dot edu
Year published	2017
Volume and issue	72(8)
Page(s)	132–139
DOI	10.1016/j.jbi.2017.06.017
ISSN	1532-0464
Distribution license	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Website	http://datascience.codata.org/articles/10.5334/dsj-2017-001/
Download	http://datascience.codata.org/articles/10.5334/dsj-2017-001/galley/620/download/ (PDF)

Revision as of 22:08, 22 August 2017

Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best-performing algorithm, outperforming Apriori, Predictive Apriori, and Decision Table.

All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the number of unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.

Graphical abstract

Keywords

Data mining; Prediction; Metadata; GEO; CEDAR

1. Introduction

Biomedical data is increasingly being viewed as a valuable commodity that can be mined for new insights beyond that for which it was created. Large community-focused databases such as the Gene Expression Omnibus (GEO)^[1] or the database of Genotypes and Phenotypes (dbGaP)^[2] offer a wealth of omics data that have been used in developing diagnostic, prognostic, and therapeutic models.^[3]^[4] One crucial and limiting factor in the reuse of data lies in having access to accurate descriptions of the data, known as metadata. Community standards to describe an experiment (e.g., Minimum Information About a Microarray Experiment; MIAME^[5]) are being widely promoted to highlight essential metadata, but creating good metadata can be challenging.^[6]^[7]

Indeed, metadata is often of low quality, and many entries are absent, erroneous, or inconsistent. The largest database of gene expression studies, the GEO microarray database, contains 50,000 studies and over 1.3 million samples, and is still growing.^[1] Yet the description of these samples suffers from a lack of consistency and completeness. For example, a preliminary analysis revealed that there are 32 different ways to specify age in GEO (e.g., age, Age, Age years, age year). These metadata are nonetheless essential for researchers to find and reuse datasets of interest. When metadata are incomplete or inaccurate, researchers will miss relevant hits while being forced to sift through irrelevant results, resulting in lower productivity and potentially weaker scientific analyses. These issues are often attributed to a lack of appropriate supporting infrastructure.^[8]

Metadata authoring applications such as ISA-Tools^[9] or RightField^[10] can be used to codify guidelines that specify multiple metadata elements and require users to use a set of controlled terms, such as terms from specified ontologies contained in the NCBO BioPortal.^[11] Yet even with such tools, authoring good metadata is tedious and error-prone, and could benefit from more automation. The development of more effective platforms for metadata authoring and discovery is one of the goals of the Center for Expanded Data Annotation and Retrieval (CEDAR).^[7]^[8]

In this study, we examine the utility of supervised machine learning to predict metadata from existing metadata. Such predictions will help metadata submitters during the submission process and could guide template authors during the process of metadata definition. This capability will not only significantly ease the template definition task but will also make the resulting templates more comprehensive and reflective of the actual data. In CEDAR, we also take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators assemble templates and fill in their values.^[7]

Learning value sets from data will help ensure that template authors do not miss important value sets that appear frequently in the data. Thus, data submitters will be able to find the terms they need, hence improving the quality of the metadata.

As the project progresses, we learn from the increasing amounts of structured metadata and learn value sets conditional on the experiment-level metadata. Incorporating this structural knowledge into the learning technology will allow us to infer common metadata patterns and their value sets in the context of technology platform, organism, molecule, label, or sample type. Our key goal is to facilitate as much of the metadata collection process as possible by suggesting possible value sets for the fields based on available data. This process will limit the value options, reduce the burden of entering metadata terms, and significantly shorten the time investigators need to enter metadata.
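The idea of suggesting value sets conditional on experiment-level metadata can be illustrated with a simple conditional frequency table. This is only a minimal sketch under stated assumptions, not the CEDAR implementation: the toy records, the field names, and the helper functions `build_value_sets` and `suggest` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "Cy3"},
    {"organism": "Mus musculus", "molecule": "genomic DNA", "label": "Cy5"},
]


def build_value_sets(records, context_fields, target_field):
    """Count the target values observed for each combination of context values."""
    table = defaultdict(Counter)
    for rec in records:
        key = tuple(rec[field] for field in context_fields)
        table[key][rec[target_field]] += 1
    return table


def suggest(table, context, top_k=3):
    """Return up to top_k target values for a context, most frequent first."""
    counts = table.get(tuple(context), Counter())
    return [value for value, _ in counts.most_common(top_k)]


table = build_value_sets(records, ["organism", "molecule"], "label")
suggestions = suggest(table, ["Homo sapiens", "total RNA"])
print(suggestions)  # ['biotin', 'Cy3']
```

A context that never occurs in the data yields an empty suggestion list, which mirrors the point that suggestions can only narrow the options to values already seen.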

We found that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.

2. Background

Supervised learning uses classification algorithms to learn from data and make predictions. The goal of supervised learning is to build a model of the distribution of class labels from instances.^[12] The classifier can then assign class labels to instances in which the values of the predictor features are known, but the value of the class label is unknown. Numerous supervised classification techniques have been developed, including decision trees, artificial neural networks, and statistical techniques such as Bayesian networks.^[12] Machine learning has been widely applied across domains, including the biomedical domain^[13], for tasks such as protein function prediction^[14], clinical outcome prediction^[15], and survival analysis.^[16]
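To make the evaluation terms used in this article concrete, the following sketch shows a majority vote baseline together with accuracy, precision, recall, and F-measure for a single class. It is a minimal illustration under assumptions: the label values are hypothetical, and the one-vs-rest treatment of a single positive class is a choice made here, not necessarily the averaging scheme used in the study.

```python
from collections import Counter


def majority_vote_predict(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """One-vs-rest precision, recall, and F-measure for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical sample-type labels, for illustration only.
train = ["RNA", "RNA", "RNA", "SRA", "genomic"]
test_true = ["RNA", "SRA", "RNA", "genomic"]

baseline_pred = majority_vote_predict(train, len(test_true))
acc = accuracy(test_true, baseline_pred)
p, r, f = precision_recall_f1(test_true, baseline_pred, "RNA")
print(acc, p, r, round(f, 3))
```

Because the baseline always predicts the majority class, its recall for that class is perfect while its precision equals the class frequency in the test set, which is why it serves as a floor that any learned model should exceed.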

As mentioned earlier, this study is specifically about metadata elements and the associations between them. Machine learning is therefore useful for mining the data, learning from it, and finding these associations. In our study, we wanted to find the correlations between metadata elements and their values. Association rules are the main data mining technique for finding such correlations. Sharma et al. compared association rule mining algorithms (e.g., AIS, FP-Growth, and Apriori).^[17] According to their comparison, each algorithm has advantages and disadvantages. For example, AIS requires multiple scans of the database, can only generate rules with a single item on the right-hand side, and generates too many candidate itemsets. FP-Growth also has disadvantages: the resulting FP-Tree is not unique for the same logical database, and it cannot be used in interactive mining systems. Apriori scans the complete database multiple times, but it is easy to implement. The Predictive Apriori algorithm overcomes this disadvantage of Apriori by searching for the best n rules instead of all rules. The PART algorithm uses partial decision trees to generate the decision list shown in its output; only this final list is used to make classifications, which yields better performance.
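The repeated database scans attributed to Apriori come from its level-wise search: frequent itemsets of size k are used to build candidates of size k + 1, and each level needs one pass over all records to count support. The sketch below is a minimal pure-Python illustration of that idea, not the implementation used in the study; the `field=value` item encoding and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every distinct item in the database.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-item
        # subsets were frequent at the previous level.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical records encoded as sets of "field=value" items.
transactions = [
    {"organism=Homo sapiens", "molecule=total RNA", "label=biotin"},
    {"organism=Homo sapiens", "molecule=total RNA", "label=biotin"},
    {"organism=Homo sapiens", "molecule=total RNA", "label=Cy3"},
    {"organism=Mus musculus", "molecule=genomic DNA", "label=Cy5"},
]

frequent = apriori(transactions, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support count of 2, this toy database needs three counting passes (for itemsets of size one, two, and three), which illustrates why the number of scans grows with the size of the largest frequent itemset.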

In a previously published manuscript^[18], we proposed a framework to predict structured metadata terms from unstructured metadata to improve the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation (LDA) model to reduce the dimensionality of the unstructured data. Our results on the GEO database showed that structured metadata can be predicted more accurately with TF-IDF than with LDA, and that both TF-IDF and LDA outperform the majority vote baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Considering that metadata in GEO and other resources is both structured and unstructured, we decided to find the correlations among structured metadata. Whereas in previous work we predicted structured metadata from free text, in this study we examine the correlations between selected structured metadata elements. Structured metadata elements have the potential to be predicted from one another and suggested to a metadata template author or metadata submitter during the submission process.
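For readers unfamiliar with TF-IDF, the sketch below computes one common variant (relative term frequency multiplied by the natural log of inverse document frequency) over a few free-text descriptions. The exact weighting and tokenization used in the earlier work are not stated here, so this formula, the whitespace tokenizer, and the toy descriptions are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical free-text sample descriptions.
docs = [
    "total RNA from human liver",
    "total RNA from mouse liver",
    "genomic DNA from human blood",
]

vectors = tfidf(docs)
print(vectors[1])
```

A term that appears in every description, such as "from", receives a weight of zero, while a term unique to one description, such as "mouse", receives the highest weight in that description. This is the property that makes TF-IDF features useful for separating classes.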

Several studies have addressed GEO metadata prediction. For instance, Buckberry et al.^[19] presented a method for predicting the sex of samples in gene expression microarray datasets. They note that the metadata associated with many publicly available expression microarray datasets often lacks sample sex information, which limits the reuse of these data in new analyses or larger meta-analyses where the effect of sex is to be considered. Their package, massiR, provides a method for researchers to predict the sex of samples in microarray datasets. “This package implements unsupervised clustering methods to classify samples into male and female groups, providing an efficient way to identify or confirm the sex of samples in mammalian microarray datasets”.^[19] Clearly, that study concerns only one particular field in GEO data and is specialized to predicting the sex of the samples.

In this study, we propose methods to predict structured metadata. These methods are applicable to any structured metadata in the biomedical field. We use association rule mining (ARM) algorithms due to their interpretability and good performance.^[20] ARM is a method for discovering relations between variables in large databases.^[21] ARM was defined by Agrawal in the early 1990s in relation to so-called market basket analysis using Apriori.^[20] Since then, multiple studies have used this technique successfully to model data.^[22] For example, ARM has been used for infection detection^[23], to detect common risk factors in pediatric diseases^[24], to understand the interaction between proteins^[25], to discover frequent patterns in gene data^[22], and to understand what drugs are co-prescribed with antacids.^[26] To the best of our knowledge, ARM has not yet been applied to predicting experimental metadata.
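An association rule of the form "antecedent implies consequent" is usually judged by its support (how often both sides occur together) and its confidence (how often the consequent holds when the antecedent does). The sketch below computes both for a rule over metadata records. It is a minimal illustration: the function `rule_metrics` and the four toy records are hypothetical and are not drawn from the GEO data analyzed in the study.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    `antecedent` and `consequent` are dicts mapping field names to values.
    """
    def matches(record, condition):
        return all(record.get(field) == value for field, value in condition.items())

    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(
        1 for r in records if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical records with two structured metadata elements.
records = [
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL570", "organism": "Homo sapiens"},
    {"platform": "GPL1261", "organism": "Mus musculus"},
]

support, confidence = rule_metrics(
    records,
    antecedent={"platform": "GPL570"},
    consequent={"organism": "Homo sapiens"},
)
print(support, confidence)  # 0.75 1.0
```

A rule with high confidence and adequate support is the kind of "well supported" rule that could be used to suggest an organism once a submitter has chosen a platform.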

References

1. Barrett, T.; Wilhite, S.E.; Ledoux, P. et al. (2013). "NCBI GEO: Archive for functional genomics data sets - Update". Nucleic Acids Research 41 (DB1): D991-5. doi:10.1093/nar/gks1193. PMC PMC3531084. PMID 23193258. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531084.
2. Tryka, K.A.; Hao, L.; Sturcke, A. et al. (2014). "NCBI's Database of Genotypes and Phenotypes: dbGaP". Nucleic Acids Research 42 (DB1): D975-9. doi:10.1093/nar/gkt1211. PMC PMC3965052. PMID 24297256. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965052.
3. Fan, C.; Prat, A.; Parker, J.S. et al. (2011). "Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures". BMC Medical Genomics 4: 3. doi:10.1186/1755-8794-4-3. PMC PMC3025826. PMID 21214954. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025826.
4. Sutherland, A.; Thomas, M.; Brandon, R.A. et al. (2011). "Development and validation of a novel molecular biomarker diagnostic test for the early detection of sepsis". Critical Care 15 (3): R149. doi:10.1186/cc10274. PMC PMC3219023. PMID 21682927. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3219023.
5. Brazma, A.; Hingamp, P.; Quackenbush, J. et al. (2001). "Minimum information about a microarray experiment (MIAME)-toward standards for microarray data". Nature Genetics 29 (4): 365–71. doi:10.1038/ng1201-365. PMID 11726920.
6. Field, D.; Sansone, S.; Delong, E.F. et al. (2010). "Meeting Report: BioSharing at ISMB 2010". Standards in Genomic Sciences 3 (3): 254-8. doi:10.4056/sigs/1403501. PMC PMC3035313. PMID 21304729. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035313.
7. Musen, M.A.; Bean, C.A.; Cheung, K.H. et al. (2015). "The center for expanded data annotation and retrieval". JAMIA 22 (6): 1148–52. doi:10.1093/jamia/ocv048. PMC PMC5009916. PMID 26112029. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009916.
8. Panahiazar, M.; Dumontier, M.; Gevaert, O. (2015). "Context aware recommendation engine for metadata submission". First International Workshop on Capturing Scientific Knowledge: 3–7. https://www.isi.edu/ikcap/sciknow2015/papers/Panahiazar.pdf.
9. Rocca-Serra, P.; Brandizi, M.; Maguire, E. et al. (2010). "ISA software suite: Supporting standards-compliant experimental annotation and enabling curation at the community level". Bioinformatics 26 (18): 2354-6. doi:10.1093/bioinformatics/btq415. PMC PMC2935443. PMID 20679334. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935443.
10. Wolstencroft, K.; Owen, S.; Horridge, M. et al. (2011). "RightField: Embedding ontology annotation in spreadsheets". Bioinformatics 27 (14): 2021–2022. doi:10.1093/bioinformatics/btr312.
11. Noy, N.F.; Shah, N.H.; Whetzel, P.L. et al. (2009). "BioPortal: Ontologies and integrated data resources at the click of a mouse". Nucleic Acids Research 37 (WS1): W170-3. doi:10.1093/nar/gkp440. PMC PMC2703982. PMID 19483092. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2703982.
12. Kotsiantis, S.B. (2007). "Supervised machine learning: A review of classification techniques". Informatica 31 (3): 249–268. http://www.informatica.si/index.php/informatica/article/view/148.
13. Bellazzi, R.; Zupan, B. (2008). "Predictive data mining in clinical medicine: Current issues and guidelines". International Journal of Medical Informatics 77 (2): 81–97. doi:10.1016/j.ijmedinf.2006.11.006. PMID 17188928.
14. Xiong, W.; Liu, H.; Guan, J.; Zhou, S. (2013). "Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks". BMC Bioinformatics 14 (Suppl 12): S4. doi:10.1186/1471-2105-14-S12-S4. PMC PMC3848795. PMID 24267980. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848795.
15. Daemen, A.; Gevaert, O.; De Moor, B. (2007). "Integration of clinical and microarray data with kernel methods". Conference Proceedings of the IEEE Engineering in Medicine and Biology Society 2007: 5411–5. doi:10.1109/IEMBS.2007.4353566. PMID 18003232.
16. Panahiazar, M.; Taslimitehrani, V.; Pereira, N.; Pathak, J. (2015). "Using EHRs and Machine Learning for Heart Failure Survival Analysis". Studies in Health Technology and Informatics 216: 40–4.
17. Kumbhare, T.A.; Chobe, S.V. (2014). "An overview of association rule mining algorithms". International Journal of Computer Science and Information Technologies 5 (1): 927–930. http://ijcsit.com/docs/Volume%205/vol5issue01/ijcsit20140501201.pdf.
18. Posch, L.; Panahiazar, M.; Dumontier, M.; Gevaert, O. (2016). "Predicting structured metadata from unstructured metadata". Database 2016: baw080. doi:10.1093/database/baw080. PMC PMC4892825. PMID 128637268. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4892825.
19. Buckberry, S.; Bent, S.J.; Bianco-Miotto, T.; Roberts, C.T. (2014). "massiR: a method for predicting the sex of samples in gene expression microarray datasets". Bioinformatics 30 (14): 2084-5. doi:10.1093/bioinformatics/btu161. PMC PMC4080740. PMID 24659105. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080740.
20. Agrawal, R.; Srikant, R. (1994). "Fast Algorithms for Mining Association Rules in Large Databases". Proceedings of the 20th International Conference on Very Large Data Bases: 487–99. ISBN 1558601538.
21. Piatetsky-Shapiro, G. (1991). "Chapter 13: Discovery, analysis, and presentation of strong rules". In Piatetsky-Shapiro, G.; Frawley, W.. Knowledge Discovery in Databases. MIT Press. pp. 229–248. ISBN 9780262660709.
22. Ordonez, C. (2006). "Comparing association rules and decision trees for disease prediction". Proceedings of the International Workshop on Healthcare Information and Knowledge Management: 17–24. doi:10.1145/1183568.1183573. ISBN 1595935282.
23. Brossette, S.E.; Sprague, A.P.; Hardin, J.M. et al. (1998). "Association rules and data mining in hospital infection control and public health surveillance". JAMIA 5 (4): 373-81. PMC PMC61314. PMID 9670134. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC61314.
24. Downs, S.M.; Wallace, M.Y. (2000). "Mining association rules from a pediatric primary care decision support system". Proceedings AMIA Symposium: 200–4. PMC PMC2243862. PMID 11079873. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2243862.
25. Oyama, T.; Kitano, K.; Satou, K.; Ito, T. (2002). "Extraction of knowledge on protein-protein interaction by association rule discovery". Bioinformatics 18 (5): 705-14. PMID 12050067.
26. Chen, T.J.; Chou, L.F.; Hwang, S.J. (2003). "Application of a data-mining technique to analyze coprescription patterns for antacids in Taiwan". Clinical Therapeutics 25 (9): 2453-63. PMID 14604744.

Notes

Per the original license, this presentation is faithful to the original, with only a few minor changes to presentation/format.

@@ Line 50: / Line 50: @@
 ==2. Background==
-Supervised learning uses classification algorithms to learn from data and make predictions. The goal of supervised learning is to build a model of the distribution of class labels from instances.<ref name="KotsiantisSupervised07">{{cite journal |title=Supervised machine learning: A review of classification techniques |journal=Informatica |author=Kotsiantis, S.B. |volume=31 |issue=3 |pages=249–268 |year=2007 |url=http://www.informatica.si/index.php/informatica/article/view/148}}</ref> The classifier can then assign class labels to instances in which the values of the predictor features are known, but the value of the class label is unknown. Numerous supervised classification techniques have been developed including decision trees, artificial neural networks, and statistical techniques such as bayesian networks.<ref name="KotsiantisSupervised07" /> Machine learning has been widely applied across domains including the biomedical domain<ref name="BellazziPredict08">{{cite journal |title=Predictive data mining in clinical medicine: Current issues and guidelines |journal=International Journal of Medical Informatics |author=Bellazzi, R.; Zupan, B. |volume=77 |issue=2 |pages=81–97 |year=2008 |doi=10.1016/j.ijmedinf.2006.11.006 |pmid=17188928}}</ref>, such as protein function prediction<ref name="XiongProtein13">{{cite journal |title=Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks |journal=BMC Bioinformatics |author=Xiong, W.; Liu, H.; Guan, J.; Zhou, S. |volume=14 |issue=Suppl 12 |pages=S4 |year=2013 |doi=10.1186/1471-2105-14-S12-S4 |pmid=24267980 |pmc=PMC3848795}}</ref>, clinical outcome prediction<ref name="DaemenInteg07">{{cite journal |title=Integration of clinical and microarray data with kernel methods |journal=Conference Proceedings of the IEEE Engineering in Medicine and Biology Society |author=Daemen, A.; Gevaert, O.; De Moor, B. 
|volume=2007 |pages=5411–5 |year=2007 |doi=10.1109/IEMBS.2007.4353566 |pmid=18003232}}</ref> and survival analysis [16].<ref name="PanahiazarUsing15">{{cite journal |title=Using EHRs and Machine Learning for Heart Failure Survival Analysis |journal=Studies in Health Technology and Informatics |author=Panahiazar, M.; Taslimitehrani, V.; Pereira, N.; Pathak, J. |volume=2015 |pages=5411–5 |year=2007 |doi=10.1109/IEMBS.2007.4353566 |pmid=18003232}}</ref>
+Supervised learning uses classification algorithms to learn from data and make predictions. The goal of supervised learning is to build a model of the distribution of class labels from instances.<ref name="KotsiantisSupervised07">{{cite journal |title=Supervised machine learning: A review of classification techniques |journal=Informatica |author=Kotsiantis, S.B. |volume=31 |issue=3 |pages=249–268 |year=2007 |url=http://www.informatica.si/index.php/informatica/article/view/148}}</ref> The classifier can then assign class labels to instances in which the values of the predictor features are known, but the value of the class label is unknown. Numerous supervised classification techniques have been developed including decision trees, artificial neural networks, and statistical techniques such as bayesian networks.<ref name="KotsiantisSupervised07" /> Machine learning has been widely applied across domains including the biomedical domain<ref name="BellazziPredict08">{{cite journal |title=Predictive data mining in clinical medicine: Current issues and guidelines |journal=International Journal of Medical Informatics |author=Bellazzi, R.; Zupan, B. |volume=77 |issue=2 |pages=81–97 |year=2008 |doi=10.1016/j.ijmedinf.2006.11.006 |pmid=17188928}}</ref>, such as protein function prediction<ref name="XiongProtein13">{{cite journal |title=Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks |journal=BMC Bioinformatics |author=Xiong, W.; Liu, H.; Guan, J.; Zhou, S. |volume=14 |issue=Suppl 12 |pages=S4 |year=2013 |doi=10.1186/1471-2105-14-S12-S4 |pmid=24267980 |pmc=PMC3848795}}</ref>, clinical outcome prediction<ref name="DaemenInteg07">{{cite journal |title=Integration of clinical and microarray data with kernel methods |journal=Conference Proceedings of the IEEE Engineering in Medicine and Biology Society |author=Daemen, A.; Gevaert, O.; De Moor, B. 
|volume=2007 |pages=5411–5 |year=2007 |doi=10.1109/IEMBS.2007.4353566 |pmid=18003232}}</ref> and survival analysis.<ref name="PanahiazarUsing15">{{cite journal |title=Using EHRs and Machine Learning for Heart Failure Survival Analysis |journal=Studies in Health Technology and Informatics |author=Panahiazar, M.; Taslimitehrani, V.; Pereira, N.; Pathak, J. |volume=2015 |pages=5411–5 |year=2007 |doi=10.1109/IEMBS.2007.4353566 |pmid=18003232}}</ref>
+As mentioned earlier, this study is specifically about metadata elements and the associations between them. Machine learning is therefore useful for mining the data, learning from it, and finding these associations. In our study, we wanted to find the correlations between metadata elements and their values. Association rules are the main data mining technique for finding such correlations. Sharma et al. compared association rule mining algorithms (e.g., AIS, FP-Growth, and Apriori).<ref name="KumbhareAnOver14">{{cite journal |title=An overview of association rule mining algorithms |journal=International Journal of Computer Science and Information Technologies |author=Kumbhare, T.A.; Chobe, S.V. |volume=5 |issue=1 |pages=927–930 |year=2014 |url=http://ijcsit.com/docs/Volume%205/vol5issue01/ijcsit20140501201.pdf}}</ref> According to their comparison, each algorithm has advantages and disadvantages. For example, AIS requires multiple scans of the database, can only generate rules with a single item on the right-hand side, and generates too many candidate itemsets. FP-Growth also has disadvantages: the resulting FP-Tree is not unique for the same logical database, and it cannot be used in interactive mining systems. Apriori scans the complete database multiple times, but it is easy to implement. The Predictive Apriori algorithm overcomes this disadvantage of Apriori by searching for the best n rules instead of all rules. The PART algorithm uses partial decision trees to generate the decision list shown in its output; only this final list is used to make classifications, which yields better performance.
+In a previously published manuscript<ref name="PoschPredict16">{{cite journal |title=Predicting structured metadata from unstructured metadata |journal=Database |author=Posch, L.; Panahiazar, M.; Dumontier, M.; Gevaert, O. |volume=2016 |pages=baw080 |year=2016 |doi=10.1093/database/baw080 |pmid=128637268 |pmc=PMC4892825}}</ref>, we proposed a framework to predict structured metadata terms from unstructured metadata in order to improve the quality and quantity of metadata, using the Gene Expression Omnibus (GEO) microarray database. Our framework consists of classifiers trained using term frequency-inverse document frequency (TF-IDF) features and a second approach based on topics modeled using a Latent Dirichlet Allocation (LDA) model to reduce the dimensionality of the unstructured data. Our results on the GEO database showed that structured metadata can be predicted more accurately with TF-IDF than with LDA, and that both TF-IDF and LDA outperform the majority vote baseline. Overall, this is a promising approach for metadata prediction that is likely to be applicable to other datasets and has implications for researchers interested in biomedical metadata curation and metadata prediction. Considering that metadata in GEO and other resources is both structured and unstructured, we decided to examine the correlations among structured metadata elements. In this study, we found correlations between selected structured metadata elements, whereas in our previous work we predicted structured metadata from free text. Structured metadata elements have the potential to be predicted from one another and suggested to the metadata template author or metadata submitter during the submission process.
+Several studies have addressed GEO metadata prediction. For instance, Buckberry et al.<ref name="BuckberryMassiR14">{{cite journal |title=massiR: a method for predicting the sex of samples in gene expression microarray datasets |journal=Bioinformatics |author=Buckberry, S.; Bent, S.J.; Bianco-Miotto, T.; Roberts, C.T. |volume=30 |issue=14 |pages=2084-5 |year=2014 |doi=10.1093/bioinformatics/btu161 |pmid=24659105 |pmc=PMC4080740}}</ref> presented a method for predicting the sex of samples in gene expression microarray datasets. They note that the metadata associated with many publicly available expression microarray datasets often lacks sample sex information, which limits the reuse of these data in new analyses or larger meta-analyses where the effect of sex is to be considered. Their package, massiR, provides a method for researchers to predict the sex of samples in microarray datasets. “This package implements unsupervised clustering methods to classify samples into male and female groups, providing an efficient way to identify or confirm the sex of samples in mammalian microarray datasets”.<ref name="BuckberryMassiR14" /> That study, however, addresses only one particular field in GEO data and is specialized to predicting the sex of samples.
+In this study, we propose a method to predict structured metadata. This method is applicable to any structured metadata in the biomedical field. We use association rule mining (ARM) algorithms due to their interpretability and good performance.<ref name="AgrawalFast94">{{cite journal |title=Fast Algorithms for Mining Association Rules in Large Databases |journal=Proceedings of the 20th International Conference on Very Large Data Bases |author=Agrawal, R.; Srikant, R. |pages=487–99 |year=1994 |isbn=1558601538}}</ref> ARM is a method for discovering relations between variables in large databases.<ref name="Piatestsky-ShapiroKnowledge91">{{cite book |chapter=Chapter 13: Discovery, analysis, and presentation of strong rules |title=Knowledge Discovery in Databases |author=Piatetsky-Shapiro, G. |editor=Piatetsky-Shapiro, G.; Frawley, W. |pages=229–248 |publisher=MIT Press |year=1991 |isbn=9780262660709}}</ref> ARM was defined by Agrawal in the early 1990s in relation to so-called market basket analysis using Apriori.<ref name="AgrawalFast94" /> Since then, multiple studies have used this technique successfully to model data.<ref name="OrdonezComparing06">{{cite journal |title=Comparing association rules and decision trees for disease prediction |journal=Proceedings of the International Workshop on Healthcare Information and Knowledge Management |author=Ordonez, C. |pages=17–24 |year=2006 |isbn=1595935282 |doi=10.1145/1183568.1183573}}</ref> For example, ARM has been used for infection detection<ref name="BrossetteAssociation98">{{cite journal |title=Association rules and data mining in hospital infection control and public health surveillance |journal=JAMIA |author=Brossette, S.E.; Sprague, A.P.; Hardin, J.M. et al. |volume=5 |issue=4 |pages=373-81 |year=1998 |pmid=9670134 |pmc=PMC61314}}</ref>, to detect common risk factors in pediatric diseases<ref name="DownsMining00">{{cite journal |title=Mining association rules from a pediatric primary care decision support system |journal=Proceedings AMIA Symposium |author=Downs, S.M.; Wallace, M.Y. |pages=200–4 |year=2000 |pmid=11079873 |pmc=PMC2243862}}</ref>, to understand the interaction between proteins<ref name="OyamaExtraction02">{{cite journal |title=Extraction of knowledge on protein-protein interaction by association rule discovery |journal=Bioinformatics |author=Oyama, T.; Kitano, K.; Satou, K.; Ito, T. |volume=18 |issue=5 |pages=705-14 |year=2002 |pmid=12050067}}</ref>, to discover frequent patterns in gene data<ref name="OrdonezComparing06" />, and to understand which drugs are co-prescribed with antacids.<ref name="ChenApplication03">{{cite journal |title=Application of a data-mining technique to analyze coprescription patterns for antacids in Taiwan |journal=Clinical Therapeutics |author=Chen, T.J.; Chou, L.F.; Hwang, S.J. |volume=25 |issue=9 |pages=2453-63 |year=2003 |pmid=14604744}}</ref> To the best of our knowledge, ARM has not yet been applied to predicting experimental metadata.
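For readers unfamiliar with ARM, the two standard measures used to score a rule can be stated compactly. The notation below is the conventional one from the ARM literature and is an assumption of this note, not taken from the article: D is the set of records, and X and Y are disjoint sets of element=value items.

```latex
\[
\operatorname{supp}(X \Rightarrow Y)
  = \frac{\lvert \{\, t \in D : X \cup Y \subseteq t \,\} \rvert}{\lvert D \rvert},
\qquad
\operatorname{conf}(X \Rightarrow Y)
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} .
\]
```

As a purely illustrative calculation, if 3 of 4 records contain X and 2 of those also contain Y, then the support of the rule is 2/4 = 0.5 and its confidence is 0.5/0.75 ≈ 0.67.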
 ==References==
