Journal:Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations

Full article title	Incorporating domain knowledge in chemical and biomedical named entity recognition with word representations
Journal	Journal of Cheminformatics
Author(s)	Munkhdalai, Tsendsuren; Li, Meijing; Batsuren, Khuyagbaatar; Park, Hyeon Ah; Choi, Nak Hyeon; Ryu, Keun Ho
Author affiliation(s)	Chungbuk National University
Primary contact	Email: khryu@dblab.chungbuk.ac.kr
Year published	2015
Volume and issue	7 (Suppl 1)
Page(s)	S9
DOI	10.1186/1758-2946-7-S1-S9
ISSN	1758-2946
Distribution license	Creative Commons Attribution 4.0 International
Website	http://www.jcheminf.com/content/7/S1/S9
Download	http://www.jcheminf.com/content/pdf/1758-2946-7-S1-S9.pdf (PDF)


Abstract

Background: Chemical and biomedical Named Entity Recognition (NER) is an essential prerequisite for effective text mining of biochemical text data. Exploiting unlabeled text data to improve system performance has been an active and challenging research topic in text mining due to the recent growth in the amount of biomedical literature.

Methods: We present a semi-supervised learning method that efficiently exploits unlabeled data in order to incorporate domain knowledge into a named entity recognition model and to improve system performance. The proposed method includes Natural Language Processing (NLP) tasks for text preprocessing, learning word representation features from a large amount of text data for feature extraction, and conditional random fields for token classification. Other than the free text in the domain, the proposed method does not rely on any lexicon or dictionary, which keeps the system applicable to other NER tasks in bio-text data.

Results: We extended BANNER, a biomedical NER system, with the proposed method. This yields an integrated system that can be applied to chemical and drug NER or biomedical NER. We call our branch of the BANNER system BANNER-CHEMDNER, which is scalable over millions of documents, processing about 530 documents per minute, is configurable via XML, and can be plugged into other systems by using the BANNER Unstructured Information Management Architecture (UIMA) interface.

BANNER-CHEMDNER achieved an 85.68% and an 86.47% F-measure on the testing sets of the CHEMDNER Chemical Entity Mention (CEM) and Chemical Document Indexing (CDI) subtasks, respectively, and achieved an 87.04% F-measure on the official testing set of the BioCreative II gene mention task, showing remarkable performance in both chemical and biomedical NER. The BANNER-CHEMDNER system is available at: https://bitbucket.org/tsendeemts/banner-chemdner.

Keywords: Feature Representation Learning; Semi-Supervised Learning; Named Entity Recognition; Conditional Random Fields

Background

As biomedical literature on servers grows exponentially in the form of semi-structured documents, biomedical text mining has been intensively investigated to find information in a more accurate and efficient manner. One essential task in developing such an information extraction system is the Named Entity Recognition (NER) process, which identifies the boundaries between ordinary words and biomedical terminology in a particular text and assigns each term to a specific category based on domain knowledge.

NER in the newswire domain performs at a level indistinguishable from human performance, with accuracy above 90%. However, performance has not been the same in the biomedical and chemical domain. It has been hampered by problems such as the number of new terms being created on a regular basis, the lack of standardization of technical terms between authors, and the fact that technical terms, such as gene names, often occur with other terminologies [1].

Proposed solutions include rule-based, dictionary-based, and Machine Learning (ML) approaches. In the dictionary-based approach, a prepared terminology list is matched against a given text to locate the chunks that contain the listed terms [2,3]. However, medical and chemical text can contain new terminology that has yet to be included in the dictionary.
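The dictionary-based idea can be illustrated with a minimal Python sketch. This is not the method of the cited systems; the function name `dictionary_match`, the three-term lexicon, and the sample sentence are all assumptions made for illustration. The sketch also shows the limitation noted above: a term missing from the list ("naproxen" here) is never found.

```python
import re

def dictionary_match(text, lexicon):
    """Return sorted (start, end, matched_text) spans for lexicon terms in text."""
    spans = []
    # Try longer terms first so they win over shorter terms nested inside them.
    for term in sorted(lexicon, key=len, reverse=True):
        # Require non-word characters (or string edges) around the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Skip matches overlapping a span already claimed by a longer term.
            if all(m.end() <= s or m.start() >= e for s, e, _ in spans):
                spans.append((m.start(), m.end(), m.group(0)))
    return sorted(spans)

# Hypothetical lexicon and sentence, for illustration only.
lexicon = {"aspirin", "acetylsalicylic acid", "ibuprofen"}
text = "Aspirin (acetylsalicylic acid) was compared with naproxen."
matches = dictionary_match(text, lexicon)
```

Sorting the terms by length before matching is one simple way to prefer the longest match; real dictionary taggers typically use more efficient structures such as tries or automata.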

The rule-based approach defines particular rules by observing the general features of the entities in a text [4]. In order to identify any named entity in text data, a rule-generation process has to examine a huge amount of text to collect accurate rules. In addition, the rules are usually collected by domain experts, which requires a lot of effort.
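A minimal Python sketch of the rule-based idea follows. The two rules (a suffix pattern and a crude formula-like pattern), the names `RULES` and `apply_rules`, and the sample sentence are hypothetical and chosen only for illustration; they are not taken from the cited work. Rules this simple would also fire on ordinary words such as "control" or acronyms such as "DNA", which hints at why collecting accurate rules takes so much expert effort.

```python
import re

# Hypothetical hand-written rules; a real system would need many more,
# each curated and checked by domain experts.
RULES = [
    # Tokens ending in a few common chemical or enzyme suffixes.
    ("suffix", re.compile(r"\b\w+(?:ase|ol|ine|ide)\b", re.IGNORECASE)),
    # Two or more element-like symbols with optional counts, e.g. NaCl, H2O.
    ("formula", re.compile(r"\b(?:[A-Z][a-z]?\d*){2,}\b")),
]

def apply_rules(text):
    """Return sorted (start, matched_text, rule_name) for every rule hit."""
    hits = []
    for name, pattern in RULES:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group(0), name))
    return sorted(hits)

# Illustrative sentence, not from the article.
hits = apply_rules("Samples were treated with ethanol and NaCl.")
```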

Since the machine learning approach was adopted, significant progress in biomedical and chemical NER has been achieved with methods like the Markov Model [5], the Support Vector Machine (SVM) [6-8], the Maximum Entropy Markov Model [9,10], and Conditional Random Fields (CRF) [2,11-13]. However, most of these studies rely on supervised machine learning, so system performance is limited by the training set, which is usually built by a domain expert. Studies have shown that the word, word n-gram, character n-gram, and traditional orthographic features form the basis for NER but represent domain background poorly.
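To make the feature types named above concrete, here is a minimal Python sketch of per-token feature extraction of the kind typically fed to a sequence classifier such as a CRF. The function names, feature keys, and sample tokens are assumptions for illustration and are not BANNER's actual feature set. Note that every feature is computed from the surface form of the token and its neighbors, which is why such features carry little domain background.

```python
def char_ngrams(word, n):
    """Return all contiguous character n-grams of word."""
    return [word[k:k + n] for k in range(len(word) - n + 1)]

def token_features(tokens, i, n=3):
    """Build a feature dict for tokens[i] from surface information only."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<BOS>"
    next_word = tokens[i + 1] if i < len(tokens) - 1 else "<EOS>"
    return {
        # Word and word n-gram (bigram) features.
        "word": word.lower(),
        "prev_bigram": prev_word.lower() + " " + word.lower(),
        "next_bigram": word.lower() + " " + next_word.lower(),
        # Character n-gram features.
        "char_ngrams": char_ngrams(word.lower(), n),
        # Traditional orthographic features.
        "is_capitalized": word[:1].isupper(),
        "all_caps": word.isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
    }

# Illustrative tokens, not from the article.
features = token_features(["The", "BRCA1", "gene"], 1)
```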

References

1. Dai, H.J.; Chang, Y.C.; Tsai, R.T.H.; Hsu, W.L. (2010). "New Challenges for Biological Text-Mining in the Next Decade". Journal of Computer Science and Technology 25 (1): 169-179. doi:10.1007/s11390-010-9313-5.
2. Rocktäschel, T.; Weidlich, M.; Leser, U. (2012). "ChemSpot: A hybrid system for chemical named entity recognition". Bioinformatics 28 (12): 1633-1640. doi:10.1093/bioinformatics/bts183.
3. Hettne, K.M.; Stienstra, R.H.; Vlietstra, R.J. et al. (2009). "A dictionary to identify small molecules and drugs in free text". Bioinformatics 25 (22): 2983-2991. doi:10.1093/bioinformatics/btp535.

Notes

This version is faithful to the original, with only a few minor changes to presentation. In several cases citation information was missing and was added to make the reference more useful.