Difference between revisions of "Journal:Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces"

Revision as of 21:29, 15 August 2017

Full article title: Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces
Journal: BioData Mining
Author(s): Tsur, Elishai Ezra
Author affiliation(s): Jerusalem College of Technology
Primary contact: Email: elishai85 at gmail dot com
Year published: 2017
Volume and issue: 10
Page(s): 11
DOI: 10.1186/s13040-017-0130-z
ISSN: 1756-0381
Distribution license: Creative Commons Attribution 4.0 International
Website: https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0130-z
Download: https://biodatamining.biomedcentral.com/track/pdf/10.1186/s13040-017-0130-z (PDF)

Databases are imperative for research in bioinformatics and computational biology. Current challenges in database design include data heterogeneity and context-dependent interconnections between data entities. These challenges drove the development of unified data interfaces and specialized databases. The curation of specialized databases is an ever-growing challenge due to the introduction of new data sources and the emergence of new relational connections between established datasets. Here, an open-source framework for the curation of specialized databases is proposed. The framework supports user-designed models of data encapsulation, object persistence and structured interfaces to local and external data sources such as MalaCards, BioModels and the National Center for Biotechnology Information (NCBI) databases. The proposed framework was implemented using Java as the development environment, EclipseLink as the data persistence agent and Apache Derby as the database manager. Syntactic analysis was based on the J3D, jsoup, Apache Commons and w3c.dom open libraries. Finally, the construction of a specialized database for aneurysm-associated vascular diseases is demonstrated. This database contains three-dimensional geometries of aneurysms, patients' clinical information, articles, biological models, related diseases and our recently published model of aneurysms' risk of rupture. The framework is available at: http://nbel-lab.com.

Keywords: specialized databases, object-relational databases, EclipseLink, Apache Derby, object-oriented programming

Background

In the last few decades, the intersection of computer science and biology has evolved to the point at which answers to fundamental biological questions have emerged.[1] Some of the most important interactions between biology and computer science stem from the data-intensive nature of modern biology.[2] It is now evident that fields such as computational biology and bioinformatics are fueled by increasing computational resources and by the development of software encapsulation and abstraction layers.[3] An important cornerstone of the computer science/biology interface is object-centered reductionism, in which relations between discrete biological entities such as DNA, protein and RNA are investigated.[1] Data regarding biological entities are stored in databases, which have become central to research in computational biology and bioinformatics.

Biological database designers currently face two main challenges: data heterogeneity and the emergence of new relational connections between data entities. Today, biological data are not limited to sequential information, which is typically stored in primary databases such as NCBI's Nucleotide and Protein data sets. Biological data also encompass graphs[4], statistical models[5], geometric information[6], vector fields[7], patterns[8], images[9], computational models[10] and others. A recent important advance regarding data heterogeneity came from Allan and colleagues, who developed OMERO, an open-source software platform that uses a server-based middleware application to provide a unified interface for images, matrices and tables.[9] However, while OMERO provides a unified interface for file types, it is currently limited to microscopy images. Another important effort is the development of Semantic Web languages (SWLs), which promote web-based standardization of data formats by utilizing Extensible Markup Language (XML) and Resource Description Framework (RDF). SWLs have been implemented by many biological portals such as MGED Ontology, which provides terms for annotating microarray experiments; BioPAX, which provides an exchange format for biological pathway data; and Gene Ontology (GO), which describes biological processes, molecular functions and cellular components of gene products.[11]

The management of relational connections between biological data entities is a great challenge due to the variety of contexts in which data can be related. The vast spectrum of possible relations between biological entities has driven the curation of specialized databases. These include organism-centered datasets such as FlyBase (Drosophila)[12], WormBase (Nematode)[13], AceDB (C. elegans)[14], and TAIR (Arabidopsis)[15]; biological pathway databases such as MetaCyc and BioCyc[16]; and disease databases such as NCBI's OMIM database. Today, specialized databases are often curated to serve consortia and single laboratories, each defining its own data relations architecture and its own data format. The need for specialized database curation keeps growing because rapidly advancing biological research constantly introduces new data sources: new experimental techniques produce data of greater variety and volume, requiring database structures to change accordingly. Moreover, most specialized databases contain both new data and data derived from established datasets. This hybrid of the new and the old presents a major challenge to specialized database designers, who must query, acquire and parse data from existing databases and integrate them into their own database architecture.
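
As a rough illustration of the parse-and-integrate task described above, the following Java sketch reads records from an XML export of an established dataset and merges them into a local store without overwriting locally added data. It is a minimal sketch under stated assumptions, not the framework's actual code: the class SpecializedDbSketch, the DiseaseEntity type, the in-memory map that stands in for a persistent database, and the <diseases>/<disease id="..."> XML layout are all hypothetical. Only the standard javax.xml.parsers and org.w3c.dom APIs bundled with Java are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: combine records derived from an external dataset with new local data. */
final class SpecializedDbSketch {

    /** An entity holding derived fields (id, name) plus locally curated notes. */
    static final class DiseaseEntity {
        final String externalId;
        final String name;
        final List<String> localNotes = new ArrayList<>();

        DiseaseEntity(String externalId, String name) {
            this.externalId = externalId;
            this.name = name;
        }
    }

    /** Parse step: read entities from an XML string in the assumed (hypothetical) layout. */
    static List<DiseaseEntity> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("disease");
            List<DiseaseEntity> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                result.add(new DiseaseEntity(
                        element.getAttribute("id"), element.getTextContent().trim()));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse external record", e);
        }
    }

    /** Integrate step: add unseen entities; existing ones keep their local notes. */
    static void integrate(Map<String, DiseaseEntity> store, List<DiseaseEntity> derived) {
        for (DiseaseEntity entity : derived) {
            store.putIfAbsent(entity.externalId, entity);
        }
    }

    public static void main(String[] args) {
        String xml = "<diseases><disease id=\"D1\">Aneurysm</disease></diseases>";
        Map<String, DiseaseEntity> store = new LinkedHashMap<>();
        integrate(store, parse(xml));
        store.get("D1").localNotes.add("new lab measurement");
        integrate(store, parse(xml)); // a repeated import leaves local notes untouched
        System.out.println(store.get("D1").name + " " + store.get("D1").localNotes);
    }
}
```

Keying the store on the external identifier is one simple way to keep derived and local data attached to the same entity across repeated imports; a real specialized database would replace the in-memory map with a persistence layer.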

References

  1. Kitano, H. (2002). "Computational systems biology". Nature 420 (6912): 206–10. doi:10.1038/nature01254. PMID 12432404. 
  2. Stein, L.D. (2003). "Integrating biological databases". Nature Reviews Genetics 4 (5): 337–45. doi:10.1038/nrg1065. PMID 12728276. 
  3. Cannata, N.; Merelli, E.; Altman, R.B. (2005). "Time to organize the bioinformatics resourceome". PLOS Computational Biology 1 (7): e76. doi:10.1371/journal.pcbi.0010076. PMC1323464. PMID 16738704. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1323464. 
  4. Sharan, R.; Ideker, T. (2006). "Modeling cellular machinery through biological network comparison". Nature Biotechnology 24 (4): 427–33. doi:10.1038/nbt1196. PMID 16601728. 
  5. Wilkinson, D.J. (2009). "Stochastic modelling for quantitative description of heterogeneous biological systems". Nature Reviews Genetics 10 (2): 122–33. doi:10.1038/nrg2509. PMID 19139763. 
  6. Delp, S.L.; Ku, J.P.; Pande, V.S. et al. (2012). "Simbios: An NIH national center for physics-based simulation of biological structures". JAMIA 19 (2): 186–89. doi:10.1136/amiajnl-2011-000488. PMC3277621. PMID 22081222. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3277621. 
  7. Ezra, E.; Keinan, E.; Mandel, Y. et al. (2013). "Non-dimensional analysis of retinal microaneurysms: Critical threshold for treatment". Integrative Biology 5 (3): 474–80. doi:10.1039/c3ib20259c. PMC3781337. PMID 23371018. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3781337. 
  8. Naumova, O.Y.; Lee, M.; Koposov, R. et al. (2012). "Differential patterns of whole-genome DNA methylation in institutionalized children and children raised by their biological parents". Development and Psychopathology 24 (1): 143–55. doi:10.1017/S0954579411000605. PMC3470853. PMID 22123582. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3470853. 
  9. Allan, C.; Burel, J.M.; Moore, J. et al. (2012). "OMERO: Flexible, model-driven data management for experimental biology". Nature Methods 9 (3): 245–53. doi:10.1038/nmeth.1896. PMC3437820. PMID 22373911. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3437820. 
  10. Chelliah, V.; Laibe, C.; Le Novère, N. et al. (2013). "BioModels Database: A repository of mathematical models of biological processes". Methods in Molecular Biology 1021: 189–99. doi:10.1007/978-1-62703-450-0_10. PMID 23715986. 
  11. Pasquier, C. (2008). "Biological data integration using Semantic Web technologies". Biochimie 90 (4): 584–94. doi:10.1016/j.biochi.2008.02.007. PMID 18294970. 
  12. FlyBase Consortium (2003). "The FlyBase database of the Drosophila genome projects and community literature". Nucleic Acids Research 31 (1): 172–5. PMC165541. PMID 12519974. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC165541. 
  13. Stein, L.; Sternberg, P.; Durbin, R. et al. (2001). "WormBase: Network access to the genome and biology of Caenorhabditis elegans". Nucleic Acids Research 29 (1): 82–6. PMC29781. PMID 11125056. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC29781. 
  14. Martinelli, S.D.; Brown, C.G.; Durbin, R. (1997). "Gene expression and development databases for C. elegans". Seminars in Cell & Developmental Biology 8 (5): 459–67. doi:10.1006/scdb.1997.0171. 
  15. Poole, R.L. (2007). "The TAIR database". Methods in Molecular Biology 406: 179–212. PMID 18287693. 
  16. Caspi, R.; Foerster, H.; Fulcher, C.A. et al. (2008). "The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases". Nucleic Acids Research 36 (DB1): D623–31. doi:10.1093/nar/gkm900. PMC2238876. PMID 17965431. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2238876. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. In some cases important information was missing from the references, and that information was added. Grammar and word use were updated to make the text easier to read.