Journal:Research on information retrieval model based on ontology

From LIMSWiki

Revision as of 23:02, 12 February 2019

Full article title: Research on information retrieval model based on ontology
Journal: EURASIP Journal on Wireless Communications and Networking
Author(s): Yu, Binbin
Author affiliation(s): Jilin University, Beihua University
Primary contact: Email: yubinbin80 at sina dot com
Year published: 2019
Volume and issue: 2019
Page(s): 30
DOI: 10.1186/s13638-019-1354-z
ISSN: 1687-1499
Distribution license: Creative Commons Attribution 4.0 International
Website: https://jwcn-eurasipjournals.springeropen.com/articles/10.1186/s13638-019-1354-z
Download: https://jwcn-eurasipjournals.springeropen.com/track/pdf/10.1186/s13638-019-1354-z (PDF)

Abstract

An information retrieval system not only occupies an important position in the network information platform, but also plays an important role in information acquisition, query processing, and wireless sensor networks. It serves as a document retrieval tool that helps researchers extract documents from data sets. Classic keyword-based information retrieval models neglect semantic information and therefore cannot represent the user’s needs. How to efficiently acquire the personalized information that users need is therefore of concern. Ontology-based systems lack an expert list to obtain accurate index term frequency. In this paper, a domain ontology model with document processing and document retrieval is proposed, and the feasibility and superiority of the domain ontology model are demonstrated experimentally.

Keywords: ontology, information retrieval, genetic algorithm, sensor networks

Introduction

Information retrieval is the process of extracting relevant documents from large data sets. With the growing accumulation of data and the rising demand for high-quality retrieval results, traditional information retrieval techniques are unable to deliver search results of sufficient quality. As a newly emerged knowledge organization system, ontology is vitally important in promoting the function of information retrieval in knowledge management.

Existing information retrieval models, such as the vector space model (VSM)[1], model text according to certain rules and are used in pattern recognition and other fields. For example, a VSM splits, filters, and classifies text that appears very abstract and then, using certain rules, calculates statistics such as word frequency.

Probability models[2] mainly rely on probabilistic operations and Bayes' rule to match data information, and the weight values of their feature words are all multivalued. The probabilistic model uses index words to represent the user’s interest, that is, the personalized query request submitted by the user. Meanwhile, there is no vocabulary set with standard semantic features and document labels. Traditional weighting strategies lack semantic information about the document and so do not describe it representatively. On the basis of semantic annotation results, weighted item frequency[3] and the semantic relations of a domain ontology are used to express the semantics of the document.[4]

The VSM and probability model can simplify text processing into a vector space or probability set. They use the "term frequency" property to describe the number of occurrences of query words in the paper. However, because a document is divided into sections, a word carries a different weight in summarizing the paper depending on the section in which it appears, so counting word occurrences alone is not sufficient. Meanwhile, there is no vocabulary set with standard semantic features and document labels.

Introducing ontology into the information retrieval system makes it possible to query users’ semantic information based on ontology and to better satisfy users’ personalized retrieval needs.[5] Without a vocabulary set with semantic description, a logical view of the user’s information demand is insufficient to express the semantics of the user’s requirement. In such an information retrieval model, even if we choose an appropriate sort function R (R is the reciprocal of the distance between points), the logical view cannot represent the requirements of the document and the user, and the retrieval results will be unconvincing to the user.

In order to improve the accuracy and efficiency of user retrieval, we build a model based on information retrieval and a domain ontology knowledge base. The ontology-based information retrieval system provides semantic retrieval, while the keyword-based information retrieval system calculates a better factor set in document processing, with better recall and precision results.

In order to accomplish this, a genetic algorithm was designed and implemented. A genetic algorithm is a search method modeled on the evolutionary rules of the biological world. It mainly includes coding mechanisms and control parameters. The genetic algorithm provides a heuristic method that simulates population evolution: it searches the solution space through repeated selection, crossover, and mutation to find an optimal factor set among combinations of factors. The option-weighted factor, tuned on a training set using the genetic algorithm, is applied to a practical retrieval system.[6]
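
The selection, crossover, and mutation loop described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the binary encoding of the factor set, tournament selection, elitism, the parameter values, and the toy `toy_fitness` function (agreement with a hypothetical `TARGET` factor set) are all assumptions. In a real system, fitness would presumably be measured from retrieval quality on a training set.

```python
import random

def evolve(fitness, n_genes, pop_size=20, generations=40,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve a binary chromosome (1 = factor selected) that maximizes fitness.

    Returns the best chromosome and the best fitness seen per generation.
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: always keep the current best
        while len(next_gen) < pop_size:
            # Selection: binary tournament for each parent.
            parent_a = max(rng.sample(population, 2), key=fitness)
            parent_b = max(rng.sample(population, 2), key=fitness)
            child = parent_a[:]
            # Crossover: one-point recombination of the two parents.
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_genes - 1)
                child = parent_a[:cut] + parent_b[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical "ideal" factor set, used only to give the toy a fitness signal.
TARGET = [1, 0, 1, 1, 0, 0]

def toy_fitness(chromosome):
    """Number of positions that agree with the hypothetical target."""
    return sum(1 for g, t in zip(chromosome, TARGET) if g == t)

best, history = evolve(toy_fitness, n_genes=len(TARGET))
print(best, history[-1])
```

Because the best individual is copied unchanged into each new generation, the best fitness in `history` never decreases; that is a property of this sketch's elitism, not something claimed by the source.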

Domain ontology was applied as the base of semantic representation to effectively represent user requirements and document semantics. Domain ontology involves a detailed description of the domain conceptualization, which expresses the abstract objects, relations, and classes in one vocabulary set.[7]

The design and implementation of the information retrieval system comprised two parts: document processing and document retrieval. In this information retrieval model, an ontology server is added to tag and index the retrieval sources based on ontology; the query conversion module performs semantic processing of users’ needs and expands the initial query with its synonyms, hypernyms, and senses. The retrieval agent module uses the converted queries to retrieve from the information source.
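
The query expansion step can be pictured with a toy lookup table. The sketch below is only an illustration under stated assumptions: `TOY_ONTOLOGY`, its entries, and the `expand_query` helper are hypothetical, and word senses are left out for brevity; the source does not specify how its ontology server stores these relations.

```python
# Hypothetical miniature ontology: each term maps to related terms.
TOY_ONTOLOGY = {
    "sensor": {"synonyms": ["detector"], "hypernyms": ["device"]},
    "retrieval": {"synonyms": ["search"], "hypernyms": ["process"]},
}

def expand_query(terms, ontology):
    """Return the query terms plus their synonyms and hypernyms.

    Order is preserved and duplicates are dropped, so the original
    terms always come before the terms added for them.
    """
    expanded = []
    for term in terms:
        entry = ontology.get(term, {})
        candidates = [term,
                      *entry.get("synonyms", []),
                      *entry.get("hypernyms", [])]
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["sensor", "routing"], TOY_ONTOLOGY)
print(expanded)  # ['sensor', 'detector', 'device', 'routing']
```

A term that is missing from the ontology, such as "routing" here, is passed through unchanged.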

We have already provided an overview of an ontology-based information retrieval system. The next part introduces the relevant work and methods of this study. The third part discusses the design of an information retrieval model based on domain ontology. The fourth part details the experimental study and analyses of the results. The final part summarizes the full text and identifies related issues that need further study.

Methods

Given the problem of managing a large volume of data in a network, it remains vital for users to acquire information accurately and efficiently. So far, retrieval methods have been developed using various mathematical models. The classical information retrieval models include the Boolean model[8], probability model[9], vector model[10], binary independence retrieval model, and BM25 model. These models are summarized below.

Suppose ki is an index term, dj is a document, and wi,j ≥ 0 is the weight of the pair (ki, dj), which expresses the significance of ki to the semantic content of dj. Let t be the number of index terms, and let K = {k1, …, kt} be the index term set. If an index term does not appear in the document, then wi,j = 0. The document dj is thus represented by an index term vector:

<math>{\overset{\rightarrow}{d}}_{j} = \left( w_{1j},w_{2j},w_{3j},\ldots,w_{tj} \right)</math>
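
As a small illustration of this representation, the sketch below builds the weight vector for one document. It assumes that the weight of a term is simply its raw count in the document; the text only requires that weights be non-negative and zero for absent terms. The index term set `K`, the sample document `d_j`, and the helper `document_vector` are hypothetical.

```python
from collections import Counter

def document_vector(tokens, index_terms):
    """Return the weights (w_1j, ..., w_tj) for one document.

    Assumption: w_ij is the raw count of index term k_i in the document,
    so a term that does not appear gets weight 0.
    """
    counts = Counter(tokens)
    return [counts[term] for term in index_terms]

# Hypothetical index term set K and a toy document d_j.
K = ["ontology", "retrieval", "sensor", "query"]
d_j = "ontology based retrieval uses ontology to expand the query".split()

vector = document_vector(d_j, K)
print(vector)  # [2, 1, 0, 1]
```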

The Boolean model is a classical information retrieval (IR) model based on set theory and Boolean algebra. Boolean retrieval can be effective if a query requires unambiguous selection.[11] However, it can only indicate whether a document is relevant or not. The Boolean model cannot describe the situation in which query words partially match a paper. The similarity of document dj and query q is binary, either 0 or 1. This binary value is limiting, and Boolean queries are hard to construct.
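
The all-or-nothing behavior is easy to see in a short sketch. The code below assumes a restricted query form, a conjunction of required terms with optional excluded terms, rather than a full Boolean expression parser; the function `boolean_sim` and the sample documents are hypothetical.

```python
def boolean_sim(doc_terms, required=(), excluded=()):
    """Return 1 if the document satisfies the query, otherwise 0.

    Assumption: the query is an AND over `required` terms combined with
    a NOT over `excluded` terms.
    """
    doc = set(doc_terms)
    has_all = all(term in doc for term in required)
    has_none = not any(term in doc for term in excluded)
    return 1 if has_all and has_none else 0

docs = {
    "d1": ["ontology", "retrieval", "semantic"],
    "d2": ["ontology", "sensor"],
    "d3": ["retrieval", "probability"],
}
query = {"required": ["ontology", "retrieval"], "excluded": []}

scores = {name: boolean_sim(terms, **query) for name, terms in docs.items()}
print(scores)  # {'d1': 1, 'd2': 0, 'd3': 0}
```

Documents d2 and d3 each contain one of the two query terms, yet both score 0, exactly like a document that contains neither term.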

The VSM, proposed earlier by Salton, is based on vector space model theory and linear algebra operations on vectors; it abstracts the query conditions and the text into vectors in a multidimensional vector space. The multi-keyword matching here can express the meaning of the text more fully.[1] Compared with the Boolean model, the VSM ranks relevant documents by comparing the angle, as a measure of similarity, between the vector of each document and the original query vector in the spatial representation.
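
A common way to compare the angle between two vectors is the cosine of that angle. The sketch below ranks a few documents against a query on that basis; the weight vectors are hypothetical, and the choice of cosine similarity (rather than another angle-based measure) is an assumption, since the text does not name a specific formula.

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_d * norm_q)

# Hypothetical weight vectors over four index terms.
documents = {
    "d1": [2, 1, 0, 1],
    "d2": [0, 1, 3, 0],
    "d3": [1, 0, 0, 0],
}
query = [1, 1, 0, 0]

ranking = sorted(documents,
                 key=lambda name: cosine_similarity(documents[name], query),
                 reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```

Unlike the Boolean model, every document receives a graded score, so partially matching documents can still be ranked.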

The probabilistic model[2] mainly relies on probabilistic operations and Bayes' rule to match data information. The model, usually based on a group of parameterized probability distributions, considers the internal relations between keywords and documents and retrieves texts according to probabilistic dependency. It requires strong independence assumptions for tractability.

The binary independence retrieval model[12] evolved from the probabilistic model and offers better performance. Assume that document D and query q are each described by a binary vector (x1, x2, … xn): if index term ki ∈ D, then xi = 1; otherwise, xi = 0. The correlation function of the query and the document is shown below.

<math>{Sim}\left( D,q \right) = \sum\log\frac{p_{i}\left( 1 - q_{i} \right)}{q_{i}\left( 1 - p_{i} \right)}</math>

Here, pi = ri/r and qi = (fi − ri)/(f − r), where f is the number of documents in the training document set, r is the number of documents in the training set that are relevant to the user query, fi is the number of training documents that contain index term ki, and ri is the number of documents containing ki among the r relevant documents.
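
To make the roles of f, r, fi, and ri concrete, the sketch below computes the per-term weight log(pi(1 − qi) / (qi(1 − pi))) and sums it for a document. Two points are assumptions not stated in the source: the sum runs over query terms that occur in the document (xi = 1), and 0.5 is added to each count so that pi and qi never reach 0 or 1, where the logarithm would be undefined. The function names and the counts in `stats` are hypothetical.

```python
import math

def bir_term_weight(f, r, f_i, r_i, smooth=0.5):
    """Weight log(p_i (1 - q_i) / (q_i (1 - p_i))) for one index term.

    Assumption: `smooth` is added to the counts (it is not in the source
    formula) to keep p_i and q_i strictly between 0 and 1.
    """
    p_i = (r_i + smooth) / (r + 2 * smooth)
    q_i = (f_i - r_i + smooth) / (f - r + 2 * smooth)
    return math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

def bir_similarity(doc_terms, query_terms, stats, f, r):
    """Sum the term weights over query terms that occur in the document.

    `stats` maps each term k_i to its counts (f_i, r_i).
    """
    doc = set(doc_terms)
    return sum(bir_term_weight(f, r, *stats[term])
               for term in query_terms if term in doc)

# Hypothetical training statistics: 100 documents, 10 of them relevant.
f, r = 100, 10
stats = {
    "ontology": (20, 8),  # mostly found in relevant documents
    "sensor": (50, 5),    # equally common in relevant and other documents
}

score = bir_similarity(["ontology", "sensor"], ["ontology", "sensor"],
                       stats, f, r)
print(score)
```

A term concentrated in relevant documents ("ontology") contributes a positive weight, while a term that is equally common in both groups ("sensor") contributes essentially nothing.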


References

  1. Tang, M.; Bian, Y.; Tao, F. (2010). "The Research of Document Retrieval System Based on the Semantic Vector Space Model". Journal of Intelligence 5 (29): 167–77. http://en.cnki.com.cn/Article_en/CJFDTOTAL-QBZZ201005036.htm. 
  2. Ma, C.; Liang, W.; Zheng, M. et al. (2016). "A Connectivity-Aware Approximation Algorithm for Relay Node Placement in Wireless Sensor Networks". IEEE Sensors Journal 16 (2): 515–528. doi:10.1109/JSEN.2015.2456931. 
  3. Yang, X.Q.; Yang, D.; Yuan, M. (2014). "Scientific Literature Retrieval Model Based on Weighted Term Frequency". Proceedings of the 2014 Tenth International Conference on Intelligent Information Hiding and Multimedia Signal Processing: 427–430. doi:10.1109/IIH-MSP.2014.113. 
  4. Xu, M.; Yang, Q.; Kwak, K.S. (2016). "Distributed Topology Control With Lifetime Extension Based on Non-Cooperative Game for Wireless Sensor Networks". IEEE Sensors Journal 16 (9): 3332–3342. doi:10.1109/JSEN.2016.2527056. 
  5. Yang, Y.; Du, J.P.; Ping, Y. (2015). "Ontology-based intelligent information retrieval system". Journal of Software 26 (7): 1675–87. https://mathscinet.ams.org/mathscinet-getitem?mr=3408856. 
  6. Lu, T.; Liang, M. (2014). "Improvement of Text Feature Extraction with Genetic Algorithm". New Technology of Library and Information Service 30 (4): 48–57. doi:10.11925/infotech.1003-3513.2014.04.08. 
  7. Vallet, D.; Fernández, M.; Castells, P. (2005). "An Ontology-Based Information Retrieval Model". Proceedings from ESWC 2005, The Semantic Web: Research and Applications: 455–70. doi:10.1007/11431053_31. 
  8. Manning, C.D.; Raghavan, P.; Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. doi:10.1017/CBO9780511809071. ISBN 9780511809071. 
  9. Jones, K.S.; Walker, S.; Robertson, S.E. (2000). "A probabilistic model of information retrieval: Development and comparative experiments: Part 1". Information Processing & Management: 779–808. doi:10.1016/S0306-4573(00)00015-7. 
  10. Wong, S.K.M.; Ziarko, W.; Wong, P.C.N. (1985). "Generalized vector spaces model in information retrieval". Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval: 18–25. doi:10.1145/253495.253506. 
  11. Baeza-Yates, R.; Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison Wesley. pp. 544. ISBN 9780201398298. 
  12. Premalatha, R.; Srinivasan, S. (2014). "Text processing in information retrieval system using vector space model". Proceedings from the 2014 International Conference on Information Communication and Embedded Systems: 1–6. doi:10.1109/ICICES.2014.7033837. 

Notes

This presentation is faithful to the original, with only a few minor changes to presentation. Grammar and punctuation were edited to American English, and in some cases additional context was added to the text when necessary. In some cases important information was missing from the references, and that information was added.