Difference between revisions of "Journal:Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)"

Full article title	Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)
Journal	Journal of Biomedical Informatics
Author(s)	Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
Author affiliation(s)	Stanford University
Primary contact	Email: olivier dot gevaert at stanford dot edu
Year published	2017
Volume and issue	72(8)
Page(s)	132–139
DOI	10.1016/j.jbi.2017.06.017
ISSN	1532-0464
Distribution license	Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Website	http://datascience.codata.org/articles/10.5334/dsj-2017-001/
Download	http://datascience.codata.org/articles/10.5334/dsj-2017-001/galley/620/download/ (PDF)

Revision as of 21:12, 22 August 2017

Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table.

All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the number of unique values of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.

Graphical abstract

Keywords

Data mining; Prediction; Metadata; GEO; CEDAR

1. Introduction

Biomedical data is increasingly being viewed as a valuable commodity that can be mined for new insights beyond that for which it was created. Large community-focused databases such as the Gene Expression Omnibus (GEO)^[1] or the database of Genotypes and Phenotypes (dbGaP)^[2] offer a wealth of omics data that have been used in developing diagnostic, prognostic, and therapeutic models.^[3]^[4] One crucial and limiting factor in the reuse of data lies in having access to accurate descriptions of the data, known as metadata. Community standards to describe an experiment (e.g., Minimum Information About a Microarray Experiment; MIAME^[5]) are being widely promoted to highlight essential metadata, but creating good metadata can be challenging.^[6]^[7]

Indeed, metadata is often of low quality, and many entries are absent, erroneous, or inconsistent. The largest database of gene expression studies, the GEO microarray database, contains 50,000 studies and over 1.3 million samples, and is still growing.^[1] Yet the description of these samples suffers from a lack of consistency and completeness. For example, a preliminary analysis revealed that there are 32 different ways to specify the age in GEO (e.g., age, Age, Age years, age year). Yet these metadata are essential for researchers to find and reuse datasets of interest. When metadata are incomplete or inaccurate, researchers will miss relevant hits while being forced to sift through irrelevant results, resulting in lower productivity and potentially weaker scientific analyses. These issues are often attributed to a lack of appropriate supporting infrastructure.^[8]
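To make the field-name inconsistency concrete, the following is a minimal Python sketch of how spelling variants of one field could be grouped under a canonical key. It is not the procedure used in the study; the function names `normalize_key` and `group_variants` and the list of unit words are assumptions made for illustration, and only the four example variants come from the text above.

```python
import re
from collections import defaultdict

# Hypothetical list of unit words that sometimes trail a field name.
UNIT_WORDS = {"year", "years", "yr", "yrs"}

def normalize_key(raw_key: str) -> str:
    """Map a free-text field name to a canonical key (illustrative rules only)."""
    key = raw_key.strip().lower()
    # Replace punctuation and repeated whitespace with single spaces.
    key = re.sub(r"[^a-z0-9]+", " ", key).strip()
    tokens = [t for t in key.split() if t not in UNIT_WORDS]
    # Fall back to the cleaned key if every token was a unit word.
    return " ".join(tokens) or key

def group_variants(raw_keys):
    """Group raw field names by their canonical key."""
    groups = defaultdict(set)
    for raw in raw_keys:
        groups[normalize_key(raw)].add(raw)
    return dict(groups)

# The four variants quoted in the text.
variants = ["age", "Age", "Age years", "age year"]
groups = group_variants(variants)
print(groups)
```

Under these assumed rules all four variants collapse to the single key `age`. Real GEO field names would need a richer rule set or a controlled vocabulary.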

Metadata authoring applications such as ISA-Tools^[9] or RightField^[10] can be used to codify guidelines that specify multiple metadata elements and require users to use a set of controlled terms, such as terms from specified ontologies contained in the NCBO BioPortal.^[11] Yet even with such tools, authoring good metadata is tedious and error-prone, and could benefit from more automation. The development of more effective platforms for metadata authoring and discovery is one of the goals of the Center for Expanded Data Annotation and Retrieval (CEDAR).^[7]^[8]

In this study, we examine the utility of supervised machine learning to predict metadata from existing metadata. Such predictions will help metadata submitters during the submission process, and they could also guide template authors during the process of metadata definition. This capability will not only ease the template definition task but also make the resulting templates more comprehensive and reflective of the actual data. In CEDAR we also take advantage of emerging community-based standard templates for describing different kinds of biomedical datasets, and we investigate the use of computational techniques to help investigators assemble templates and fill in their values.^[7]

Learning value sets from data will help ensure that template authors do not miss important value sets that appear frequently in the data. Thus, data submitters will be able to find the terms they need, hence improving the quality of the metadata.

As the project progresses, we learn from the increasing amounts of structured metadata, and we learn value sets conditional on the experiment-level metadata. This incorporation of structural knowledge into the learning technology will allow us to infer common metadata patterns and their value sets in the context of technology platform, organism, molecule, label, or sample type. Our key goal is to facilitate as much of the metadata collection process as possible by suggesting possible value sets for the fields based on available data. This process will limit the value options, reduce the burden of entering metadata terms, and significantly shorten the time that investigators need to enter metadata.
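The idea of suggesting values conditional on already-known fields can be sketched with simple co-occurrence counts. The Python example below is an illustrative assumption, not the CEDAR implementation or the rule mining approach evaluated in this study: the class `ValueSuggester`, its methods, and the toy records are hypothetical.

```python
from collections import Counter, defaultdict

class ValueSuggester:
    """Rank candidate values for a field by co-occurrence with known fields."""

    def __init__(self):
        # (context field, context value, target field) -> Counter of target values
        self._counts = defaultdict(Counter)

    def fit(self, records):
        for rec in records:
            for ctx_field, ctx_value in rec.items():
                for tgt_field, tgt_value in rec.items():
                    if tgt_field != ctx_field:
                        self._counts[(ctx_field, ctx_value, tgt_field)][tgt_value] += 1
        return self

    def suggest(self, known, target_field, top_k=3):
        """Sum the votes from every known field and return the top_k values."""
        votes = Counter()
        for ctx_field, ctx_value in known.items():
            votes.update(self._counts.get((ctx_field, ctx_value, target_field), {}))
        return [value for value, _ in votes.most_common(top_k)]

# Toy records, made up for illustration (not taken from GEO).
records = [
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "total RNA", "label": "biotin"},
    {"organism": "Homo sapiens", "molecule": "genomic DNA", "label": "Cy5"},
    {"organism": "Mus musculus", "molecule": "total RNA", "label": "Cy3"},
]

suggester = ValueSuggester().fit(records)
suggestions = suggester.suggest(
    {"organism": "Homo sapiens", "molecule": "total RNA"}, "label"
)
print(suggestions)
```

In a submission form, a ranked list like this could be offered as the first options for the empty field, which narrows the choices without forbidding other values.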

We found that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.

2. Background

Supervised learning uses classification algorithms to learn from data and make predictions. The goal of supervised learning is to build a model of the distribution of class labels from instances.^[12] The classifier can then assign class labels to instances in which the values of the predictor features are known but the value of the class label is unknown. Numerous supervised classification techniques have been developed, including decision trees, artificial neural networks, and statistical techniques such as Bayesian networks.^[12] Machine learning has been widely applied across domains, including the biomedical domain^[13], in tasks such as protein function prediction^[14], clinical outcome prediction^[15], and survival analysis.^[16]
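As a minimal sketch of this setup, the Python example below trains a simple lookup classifier on toy metadata records, compares it with a majority vote baseline, and computes accuracy, precision, recall, and F-measure. The lookup classifier is a deliberately simple stand-in chosen for illustration; it is not PART, Apriori, Predictive Apriori, or the Decision Table implementation evaluated in this study. All function names and records are hypothetical.

```python
from collections import Counter, defaultdict

def train_lookup(rows, features, target):
    """Learn the most frequent class per feature combination, plus the majority class."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        key = tuple(row[f] for f in features)
        by_key[key][row[target]] += 1
        overall[row[target]] += 1
    majority = overall.most_common(1)[0][0]
    table = {key: counts.most_common(1)[0][0] for key, counts in by_key.items()}
    return table, majority

def predict(model, row, features):
    """Predict from the table; fall back to the majority class for unseen keys."""
    table, majority = model
    return table.get(tuple(row[f] for f in features), majority)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """One-vs-rest precision, recall, and F-measure for a single class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def rec(organism, molecule, label):
    return {"organism": organism, "molecule": molecule, "label": label}

# Toy training and test records, made up for illustration (not taken from GEO).
train = [
    rec("Homo sapiens", "total RNA", "biotin"),
    rec("Homo sapiens", "total RNA", "biotin"),
    rec("Homo sapiens", "total RNA", "biotin"),
    rec("Homo sapiens", "genomic DNA", "Cy5"),
    rec("Mus musculus", "total RNA", "Cy3"),
    rec("Mus musculus", "total RNA", "Cy3"),
]
test = [
    rec("Homo sapiens", "total RNA", "biotin"),
    rec("Mus musculus", "total RNA", "Cy3"),
    rec("Homo sapiens", "genomic DNA", "Cy5"),
    rec("Mus musculus", "genomic DNA", "Cy5"),  # combination unseen in training
]

features = ["organism", "molecule"]
model = train_lookup(train, features, "label")
y_true = [row["label"] for row in test]
y_pred = [predict(model, row, features) for row in test]
y_base = [model[1]] * len(test)  # majority vote baseline

model_accuracy = accuracy(y_true, y_pred)
baseline_accuracy = accuracy(y_true, y_base)
precision, recall, f1 = precision_recall_f1(y_true, y_pred, "biotin")
print(model_accuracy, baseline_accuracy, precision, recall, f1)
```

The majority vote baseline ignores the predictor features entirely, which is why it serves as a floor for comparison: any classifier that uses the features should do at least as well on data where the features carry information about the class.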

References

1. Barrett, T.; Wilhite, S.E.; Ledoux, P. et al. (2013). "NCBI GEO: Archive for functional genomics data sets - Update". Nucleic Acids Research 41 (DB1): D991-5. doi:10.1093/nar/gks1193. PMC PMC3531084. PMID 23193258. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531084.
2. Tryka, K.A.; Hao, L.; Sturcke, A. et al. (2014). "NCBI's Database of Genotypes and Phenotypes: dbGaP". Nucleic Acids Research 42 (DB1): D975-9. doi:10.1093/nar/gkt1211. PMC PMC3965052. PMID 24297256. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965052.
3. Fan, C.; Prat, A.; Parker, J.S. et al. (2011). "Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures". BMC Medical Genomics 4: 3. doi:10.1186/1755-8794-4-3. PMC PMC3025826. PMID 21214954. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025826.
4. Sutherland, A.; Thomas, M.; Brandon, R.A. et al. (2011). "Development and validation of a novel molecular biomarker diagnostic test for the early detection of sepsis". Critical Care 15 (3): R149. doi:10.1186/cc10274. PMC PMC3219023. PMID 21682927. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3219023.
5. Brazma, A.; Hingamp, P.; Quackenbush, J. et al. (2001). "Minimum information about a microarray experiment (MIAME)-toward standards for microarray data". Nature Genetics 29 (4): 365–71. doi:10.1038/ng1201-365. PMID 11726920.
6. Field, D.; Sansone, S.; Delong, E.F. et al. (2010). "Meeting Report: BioSharing at ISMB 2010". Standards in Genomic Sciences 3 (3): 254-8. doi:10.4056/sigs/1403501. PMC PMC3035313. PMID 21304729. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035313.
7. Musen, M.A.; Bean, C.A.; Cheung, K.H. et al. (2015). "The center for expanded data annotation and retrieval". JAMIA 22 (6): 1148–52. doi:10.1093/jamia/ocv048. PMC PMC5009916. PMID 26112029. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009916.
8. Panahiazar, M.; Dumontier, M.; Gevaert, O. (2015). "Context aware recommendation engine for metadata submission". First International Workshop on Capturing Scientific Knowledge: 3–7. https://www.isi.edu/ikcap/sciknow2015/papers/Panahiazar.pdf.
9. Rocca-Serra, P.; Brandizi, M.; Maguire, E. et al. (2010). "ISA software suite: Supporting standards-compliant experimental annotation and enabling curation at the community level". Bioinformatics 26 (18): 2354-6. doi:10.1093/bioinformatics/btq415. PMC PMC2935443. PMID 20679334. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2935443.
10. Wolstencroft, K.; Owen, S.; Horridge, M. et al. (2011). "RightField: Embedding ontology annotation in spreadsheets". Bioinformatics 27 (14): 2021–2022. doi:10.1093/bioinformatics/btr312.
11. Noy, N.F.; Shah, N.H.; Whetzel, P.L. et al. (2009). "BioPortal: Ontologies and integrated data resources at the click of a mouse". Nucleic Acids Research 37 (WS1): W170-3. doi:10.1093/nar/gkp440. PMC PMC2703982. PMID 19483092. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2703982.
12. Kotsiantis, S.B. (2007). "Supervised machine learning: A review of classification techniques". Informatica 31 (3): 249–268. http://www.informatica.si/index.php/informatica/article/view/148.
13. Bellazzi, R.; Zupan, B. (2008). "Predictive data mining in clinical medicine: Current issues and guidelines". International Journal of Medical Informatics 77 (2): 81–97. doi:10.1016/j.ijmedinf.2006.11.006. PMID 17188928.
14. Xiong, W.; Liu, H.; Guan, J.; Zhou, S. (2013). "Protein function prediction by collective classification with explicit and implicit edges in protein-protein interaction networks". BMC Bioinformatics 14 (Suppl 12): S4. doi:10.1186/1471-2105-14-S12-S4. PMC PMC3848795. PMID 24267980. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848795.
15. Daemen, A.; Gevaert, O.; De Moor, B. (2007). "Integration of clinical and microarray data with kernel methods". Conference Proceedings of the IEEE Engineering in Medicine and Biology Society 2007: 5411–5. doi:10.1109/IEMBS.2007.4353566. PMID 18003232.
16. Panahiazar, M.; Taslimitehrani, V.; Pereira, N.; Pathak, J. (2015). "Using EHRs and Machine Learning for Heart Failure Survival Analysis". Studies in Health Technology and Informatics.

Notes

Per the original license, this presentation is faithful to the original, with only a few minor changes to presentation/format.
