SemIndex: Semantic Aware Inverted Index

LAU Alumni Association Logo-Final-May20110








SemIndex: Semantic Aware Inverted Index


Download prototype: SemIndex (console mode) or SemIndex+ (graphical, with alternative algorithms)

Joe TEKLI, Christian KALLAS & Marc AL ASSAD
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, LEBANON
(joe.tekli, marc.alassad)

Richard CHBEIR
UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
[email protected]

LE2I Laboratory UMR-CNRS
University of Bourgogne
21000 Dijon, France
(yi.luo, kokou)

Carlos Raymundo Ibanez
Computer Science Department
Universidad Peruana de Ciencias Aplicadas
Lima, PERU
[email protected]

Caetano Traina Jr. & Agma J.M. Traina
ICMC, Computer Science and Statistics Department
University of Sao Paulo
Sao Carlos, BRAZIL
(agma, caetano)


I. Introduction

Processing keyword-based queries is a fundamental problem in the domain of Information Retrieval (IR), where several studies have been done to develop effective keyword-based search techniques [5, 6, 7]. A standard containment keyword-based query, which retrieves textual identities containing a set of keywords, is generally supported by a full-text index. The inverted index is considered as one of the most useful full-text indexing techniques for very large textual collections [32], supported by many RDBMSs . It is also increasingly used on semi-structured [6] and unstructured data [5] to support keyword-based queries. Yet, the standard inverted index, which only supports exact term matching, cannot deal with data semantics.

Various approaches combining different types of data and semantic knowledge have been proposed to enhance query processing. In this study, we develop a new approach called SemIndex integrating domain knowledge into an inverted index to support semantic-aware querying. Major benefits of our work over existing methods include:

  • Pre-processing the index: Existing works use semantic knowledge to pre-process queries, such as query rewriting/relaxation and query suggestion [1, 4, 9], or to post-process the query results, such as using semantic result clustering [13, 15, 16]. Our work proposes a better alternative considering the semantic gap and enclosing semantic knowledge directly into an inverted index, so that the semantic-aware processing tasks can be done prior to query processing: at the most basic index level, where more opportunities can be explored toward both speed-ups and semantic-based filtering.
  • User involvement: Most existing works introduce some predefined parameters (heuristics) or user feedback to rewrite queries such that users are only involved in the query refinement (expansion, filtering, etc.) process after providing the first round of results [2, 3, 10]. In our work, we aim at allowing end-users to write, using the same framework, classical queries as well as semantically enriched queries, according to their needs. The users can get themselves more involved in the whole process: during the initial query writing, and then performing query rewriting.
  • Providing more relevant results: Most existing works focus on understanding the meaning of data/queries through semantic disambiguation [1, 8, 13], which: i) is usually a complex process requiring substantial processing time [12], and ii) depends on the query/data context which is not always sufficiently available, and thus does not guarantee correct results [11, 14]. The goal of our work is, with the help of semantic knowledge, to find more semantically relevant results than what a traditional inverted index or semantic-aware approaches can provide, while doing it more efficiently than existing disambiguation techniques .

Our semantic-aware inverted index, SemIndex, maps two data resources into a single data structure, a textual data collection (represented as a traditional inverted index), and a semantic knowledge base (represented as a traditional semantic network).
An extended query model with different levels of semantic awareness is defined, so that both semantic-aware queries and standard containment queries are processed within the same framework. Fig. 1 depicts the overall framework of our approach and its main components.
Briefly, the Indexer manages SemIndex generation and maintenance, while the Query Processor processes and answers the semantic-aware (or standard) queries issued by the user.

II. System Architecture

We have implemented our SemIndex framework using Java. We also have used MySQL 5.6 as an RDBMS, and WordNet 3.0 as a knowledge base. In addition to the two basic SemIndex components consisting of: i) the indexer, and ii) the query processor, our implementation also includes: iii) a lemmatizer used to transform index terms into their lemmas, as well as iv) extensible weight computation components which are called upon within the indexer and/or query processor to compute edge/node weights as needed (weights are computed initially during indexing, and are then updated dynamically during querying).
The RDBMS initially holds the input textual data collection and the knowledge base in the form of native RDBs . Java is used to send SQL queries to the RDBMS in the following order required to build SemIndex.


Fig. 1. Simplified activity diagram describing our SemIndex framework

In an initial battery of experiments, we evaluated the quality of our indexing approach by assessing four main criteria: i) index building time, ii) index size and characteristics, ii) query processing time, and iii) the number of returned results.
We used the IMBD movies table as an input textual collection, including the attributes movie_id and (title, plot) concatenated in one column (cf. Table 1) with a total size of around 75 MBytes including more then 7 million rows. WordNet 3.0 had a total size of around 26 Mbytes, including more than 117k synsets (senses). The early prototype system and experimental results can be downloaded from the following links:

In a subsequent study,

we have extended the above framework toward SemIndex+, allowing to search, select, and rank unstructured, structured (relational) and partly structured (NoSQL) textual data. At the indexer level, we added: i) an extension of SemIndex ‘s logical design to handle varying multi-attribute datasets (using attribute sensitive indexers), ii) a dedicated algorithm to handle terms with missing semantic connections (which we designate as missing terms ), and iii) a mathematical model for weighting SemIndex+ entries (i.e., the graph’s nodes and edges). At the query processing level, we developed: v) four alternative query processing algorithms (including a parallelized version of the core algorithm), coupled with vi) a dedicated relevance scoring measure, required in the query evaluation process in order to retrieve and rank relevant query answers.


Fig. 2. Simplified activity diagram describing our SemIndex framework

In addition, we conducted an extensive experimental study comparing SemIndex+ ‘s effectiveness and efficiency with various generic approaches (including inverted index search , query relaxation , query disambiguation , and query refinement ). The new prototype system and experimental results can be downloaded from the following links:


  1. Burton-Jones A.; Storey V.C.; Sugumaran V. and Purao S., A Heuristic-Based Methodology for Semantic Augmentation of User Queries on the Web. In Proceedings ot the International Conference on Conceptual Modeling (ER’03), 2003. pp. 47696489
  2. Carpineto C.; Romano G. and Giannini V., Improving Retrieval Feedback with Multiple Term-Ranking Function Combination. ACM Transactions on Information Systems (TOIS), 2002. 20(3):259-290
  3. Chandramouli K., Kliegr T. , Nemrava J., and S. V., Query Refinement and user Relevance Feedback for contextualized image retrieval. 5th International Conference on Visual Information Engineering (VIE), 2008. pp. 453 – 458
  4. Cimiano P.; Handschuh S. and Staab S., Towards the Self-Annotating Web. In Proceedings of the International World Wide Web Conference (WWW’04), 2004. pp. 462-471
  5. Das S., e.a., Making unstructured data sparql using semantic indexing in oracle database. In Proceedings of 29th IEEE ICDE Conf., 2012. pp. 1405961416
  6. Florescu D., Kossmann D., and Manolescu I., Integrating Keyword Search into Xml Query Processing, . Computer Networks, 2000. 33(1-6):11996135
  7. Frakes W.B. and R.A. Baeza-Yates, Information retrieval: Data structures and algorithms. Prentice-Hall, 1992
  8. Li Y., Yang H., and Jagadish H.V., Term Disambiguation in Natural Language Query for XML. In Proceedings of the International Conference on Flexible Query Answering Systems (FQAS), 2006. LNAI 4027, pp. 13396146
  9. Martinenghi D. and Torlone R., Taxonomy-based relaxation of query answering in relational databases. VLDB Journal, 2014. 23(5):747-769
  10. Mishra C. and Koudas N., Interactive Query Refinement International Conference on Extending Database Technology (EDBT’09), 2009. pp. 862-873
  11. de Lima E.F. and Pedersen J.O., Phrase Recognition and Expansion for Short, Precision biased Queries based on a Query Log. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999. pp. 145-152, Berkeley, CA
  12. Navigli R., Word Sense Disambiguation: a Survey. ACM Computing Surveys, 2009. 41(2):19669
  13. Navigli R. and Crisafulli G., Inducing Word Senses to Improve Web Search Result Clustering. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, 2010. pp. 11696126, MIT, USA
  14. Voorhees E.M., Query Expansion Using Lexical-Semantic Relations. In Proceedings of the 17th International ACM/SIGIR Conference on Research and Development in Information Retrieval, 1994. pp. 61-69
  15. Sinh Hoa Nguyen, Wojciech Swieboda, and G. Jaskiewicz, Semantic Evaluation of Search Result Clustering Methods. Intelligent Tools for Building a Scientific Information Platform, Studies in Computational Intelligence Volume 467, 2013. 467(393-414)
  16. Wen H., Huang G.S., and L. Z., Clustering web search results using semantic information International Conference on Machine Learning and Cybernetics, 2009. 3(1504 – 1509)