Genome informatics

The cost of genome sequencing has drastically decreased thanks to the Human Genome Project and associated pushes to further genome informatics.

Genome informatics is a field of computational molecular biology and branch of informatics that uses computers, software, and computational solution techniques to make observations, resolve problems, and manage data related to the genomic function of DNA sequences, comparison of gene structures, determination of the tertiary structure of all proteins, and other molecular biological activities.^[1]

History

A collaboration between the U.S. Department of Energy and the National Institutes of Health brought the Human Genome Project formally into existence on October 1, 1990. The project sought to identify all human genes and determine the related DNA sequences while also improving computing tools for storage and analysis. Only two months later, on December 3–4, 1990, the first annual Genome Informatics Workshop (GIW) was hosted in Tokyo, Japan.^[2] (The name of the event changed with the twelfth meeting in 2001 to the International Conference on Genome Informatics.^[3]) While not the first major discussion about applying informatics to genomic research and data management, the Human Genome Project was arguably one of the biggest catalysts for the initial advancement of genome informatics.^[4] In the early 1990s researchers were faced with many challenges, including the question "Can genome informatics keep up with the technology?" Charles Cantor of the Center for Advanced Biotechnology thought that technology development itself would not hinder the emerging field of genome informatics, but he saw the interface between humans and computers as problematic, particularly for the Human Genome Project.^[5] Interest in informatics tools went beyond researching the human genome, however. In June 1994, the Mouse Genome Informatics Group released version 1.0 of the Mouse Genome Database, which included "easy-to-use query options and tools for display, analysis, and reporting" of genomic data.^[6]

As genomic and proteomic informatics tools and technologies continued to advance from 1995 to 2005, the costs associated with DNA sequencing decreased fifty-fold; advances in technology were expected to improve analysis, design, and system integration and reduce the cost even further.^[7] Those cost benefits continued to be realized into 2015, with the primary challenges shifting to "organizing this data, maintaining it in a way that is accessible and easy to use for researchers around the world, 24 hours a day."^[8]

Technology had made genomics and proteomics analysis so accessible that the term "big data" began to be used in relation to it and other types of data management in the 2010s.^[8]^[9] In January 2015, IBM was reportedly helping molecular profiling company Caris Life Sciences make sense of its genomics data. The company was generating "more data per patient through its genomic sequencing than any other lab in the United States — with more than half a terabyte of information being generated on a daily basis for individual patient samples."^[10]

Future genome informatics concerns will likely include taking genomic data analysis to phenotyping to patient care and considering the ethics of genomic data collection, storage, and analysis.^[11]

Application

Genome informatics can help tackle problems and tasks such as the following^[1]:

analyzing DNA sequences
recognizing genes and proteins and predicting their structures
predicting the biochemical function of new genes or fragments
extracting information from "families of homologous sequences and their structures"
detecting and classifying near and distant family relations of genes
molecular profiling
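
As an illustration of the first task above, analyzing DNA sequences, the following is a minimal Python sketch of three elementary computations: GC content, reverse complement, and a naive percent identity between two sequences. The function names are hypothetical and chosen for this example only. The sketch assumes sequences that contain only the bases A, C, G, and T, and the identity measure assumes the two sequences have already been aligned to equal length. Real genome informatics tools rely on far more sophisticated alignment and statistical methods.

```python
from collections import Counter

# Translation table mapping each base to its complement (assumes only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    counts = Counter(seq)
    return (counts["G"] + counts["C"]) / len(seq)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def percent_identity(a: str, b: str) -> float:
    """Return the share of matching positions in two equal-length sequences.

    Assumption: the sequences are already aligned; no gaps are introduced here.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be the same length (already aligned)")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a.upper(), b.upper()))
    return matches / len(a)
```

For example, `gc_content("ATGC")` returns 0.5 and `reverse_complement("ATGC")` returns `"GCAT"`. A crude identity score such as this one only hints at how near and distant family relations between genes might be scored.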

Informatics

The informatics side of genomics has largely focused on analytical tools and methodologies. DNA-microarray and sequencing technology helped Human Genome Project researchers analyze and understand thousands of genes and their expression. By 2000, artificial neural networks were being theorized as a possible informatics tool to aid with data analysis and the problem of "high dimensionality" of the output data; by 2014 artificial neural networks were being proposed for cancer genomic research.^[1]^[12]

Aside from creating better algorithms, sequencing tools, and analysis tools, the informatics side of genomics research also involves the development and implementation of public and private genomics databases, which often include data display, analysis, and reporting tools to apply to the contained data. These databases can range in size from small, single-purpose data repositories to multi-terabyte, multi-server installations accessed by tens of thousands of people a month.^[8]

External links

Conferences

Cold Spring Harbor Laboratory Conference on Genome Informatics (U.S.)
Genome Informatics Conference (U.K.)
International Conference on Genome Informatics (Japan)

Databases

VEuPathDB
Mouse Genome Informatics
A list of global genomics databases and analysis tools is hosted by the Health Sciences Library System, University of Pittsburgh.