Journal:Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)

From LIMSWiki
Revision as of 20:03, 22 August 2017 by Shawndouglas (talk | contribs) (Created stub. Saving and adding more.)
Full article title Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)
Journal Journal of Biomedical Informatics
Author(s) Panahiazar, Maryam; Dumontier, Michel; Gevaert, Olivier
Author affiliation(s) Stanford University
Primary contact Email: olivier dot gevaert at stanford dot edu
Year published 2017
Volume and issue 72(8)
Page(s) 132–139
DOI 10.1016/j.jbi.2017.06.017
ISSN 1532-0464
Distribution license Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Website http://datascience.codata.org/articles/10.5334/dsj-2017-001/
Download http://datascience.codata.org/articles/10.5334/dsj-2017-001/galley/620/download/ (PDF)

Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table.
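To make the idea of learning associations among metadata elements concrete, the following is a minimal sketch of support/confidence rule mining over single field-value pairs. It is not the authors' implementation and does not reproduce PART, Apriori, Predictive Apriori, or Decision Table; the function name `mine_rules`, the thresholds, and the four toy records are assumptions invented for illustration and are not taken from GEO.

```python
from collections import Counter
from itertools import permutations

# Toy records (hypothetical; invented for illustration, not GEO data).
records = [
    {"sample_type": "RNA", "molecule": "total RNA", "organism": "Homo sapiens", "label": "biotin"},
    {"sample_type": "RNA", "molecule": "total RNA", "organism": "Homo sapiens", "label": "biotin"},
    {"sample_type": "RNA", "molecule": "total RNA", "organism": "Mus musculus", "label": "Cy3"},
    {"sample_type": "genomic", "molecule": "genomic DNA", "organism": "Homo sapiens", "label": "Cy5"},
]


def mine_rules(records, min_support=0.5, min_confidence=0.9):
    """Return (antecedent, consequent, support, confidence) tuples.

    Each side of a rule is a single (field, value) item. Real rule miners
    also consider multi-item antecedents; this sketch does not.
    """
    n = len(records)
    item_counts = Counter()  # how many records contain each item
    pair_counts = Counter()  # how many records contain each ordered item pair
    for rec in records:
        items = sorted(rec.items())
        item_counts.update(items)
        pair_counts.update(permutations(items, 2))

    rules = []
    for (lhs, rhs), both in pair_counts.items():
        support = both / n                    # fraction of records with lhs and rhs
        confidence = both / item_counts[lhs]  # estimate of P(rhs | lhs)
        if support >= min_support and confidence >= min_confidence:
            rules.append((lhs, rhs, support, confidence))
    return rules


rules = mine_rules(records)
for lhs, rhs, support, confidence in sorted(rules):
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

A rule such as `("label", "biotin") -> ("organism", "Homo sapiens")` could then be used to suggest a value for a missing field when its antecedent is already filled in.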

All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
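As an illustration of the evaluation described above, here is a minimal sketch of a majority vote baseline scored with accuracy, precision, recall, and F-measure. The helper names and the toy labels are assumptions made for this example. The abstract does not state how per-class scores were averaged, so the sketch reports them for one class at a time.

```python
from collections import Counter


def majority_vote(train_labels):
    """Return the most frequent label; the baseline predicts it for every record."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F-measure for one class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


# Hypothetical molecule labels, invented for illustration (not GEO data).
y_true = ["total RNA", "total RNA", "total RNA", "genomic DNA"]
baseline_label = majority_vote(y_true)
y_pred = [baseline_label] * len(y_true)

print("accuracy:", accuracy(y_true, y_pred))
for cls in sorted(set(y_true)):
    print(cls, class_metrics(y_true, y_pred, cls))
```

The baseline never predicts a minority class, so its recall for that class is zero. The number of such classes grows with the number of unique values an element can take.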

Graphical abstract

Fig1 Panahiazar JofBiomedInformatics2017 72-8.jpg


Keywords

Data mining; Prediction; Metadata; GEO; CEDAR

1. Introduction

Biomedical data is increasingly being viewed as a valuable commodity that can be mined for new insights beyond that for which it was created. Large community-focused databases such as the Gene Expression Omnibus (GEO)[1] or the database of Genotypes and Phenotypes (dbGaP)[2] offer a wealth of omics data that have been used in developing diagnostic, prognostic, and therapeutic models.[3][4] One crucial and limiting factor in the reuse of data lies in having access to accurate descriptions of the data, known as metadata. Community standards to describe an experiment (e.g., Minimum Information About a Microarray Experiment; MIAME[5]) are being widely promoted to highlight essential metadata, but creating good metadata can be challenging.[6][7]


References

  1. Barrett, T.; Wilhite, S.E.; Ledoux, P. et al. (2013). "NCBI GEO: Archive for functional genomics data sets - Update". Nucleic Acids Research 41 (DB1): D991–5. doi:10.1093/nar/gks1193. PMC3531084. PMID 23193258. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531084.
  2. Tryka, K.A.; Hao, L.; Sturcke, A. et al. (2014). "NCBI's Database of Genotypes and Phenotypes: dbGaP". Nucleic Acids Research 42 (DB1): D975–9. doi:10.1093/nar/gkt1211. PMC3965052. PMID 24297256. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3965052.
  3. Fan, C.; Prat, A.; Parker, J.S. et al. (2011). "Building prognostic models for breast cancer patients using clinical variables and hundreds of gene expression signatures". BMC Medical Genomics 4: 3. doi:10.1186/1755-8794-4-3. PMC3025826. PMID 21214954. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025826.
  4. Sutherland, A.; Thomas, M.; Brandon, R.A. et al. (2011). "Development and validation of a novel molecular biomarker diagnostic test for the early detection of sepsis". Critical Care 15 (3): R149. doi:10.1186/cc10274. PMC3219023. PMID 21682927. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3219023.
  5. Brazma, A.; Hingamp, P.; Quackenbush, J. et al. (2001). "Minimum information about a microarray experiment (MIAME)-toward standards for microarray data". Nature Genetics 29 (4): 365–71. doi:10.1038/ng1201-365. PMID 11726920.
  6. Field, D.; Sansone, S.; Delong, E.F. et al. (2010). "Meeting Report: BioSharing at ISMB 2010". Standards in Genomic Sciences 3 (3): 254–8. doi:10.4056/sigs/1403501. PMC3035313. PMID 21304729. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3035313.
  7. Musen, M.A.; Bean, C.A.; Cheung, K.H. et al. (2015). "The center for expanded data annotation and retrieval". JAMIA 22 (6): 1148–52. doi:10.1093/jamia/ocv048. PMC5009916. PMID 26112029. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009916.

Notes

Per the original license, this presentation is faithful to the original, with only a few minor changes to presentation/format.