XSDF: XML Semantic Disambiguation Framework
Nathalie Charbel
Computer Science Department
University of Bourgogne
21000 Dijon, France
[email protected]
www.u-bourgogne.fr
Joe TEKLI
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, LEBANON
[email protected]
www.lau.edu.lb
Richard CHBEIR
UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
[email protected]
www.univ-pau.fr
Gilbert TEKLI
UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
[email protected]
www.univ-pau.fr
I. Introduction
XML semantic-aware processing has become one of the central issues in Web data management, document clustering and information retrieval. Although XML data is semi-structured, it remains prone to lexical ambiguity, and thus requires dedicated semantic analysis and sense disambiguation processes that assign a well-defined semantic meaning to XML elements and attributes. Most existing approaches in this context i) use syntactic information in processing XML data while disregarding the semantics involved, ii) only partially consider the structural relations/context of XML nodes, and iii) completely ignore the problem of identifying ambiguous XML nodes.
XSDF is an XML Semantic Disambiguation Framework that takes as input an XML document and a general-purpose semantic network, and produces as output a semantically augmented XML tree made of unambiguous semantic concepts. It consists of four main modules for: i) linguistic pre-processing of XML node labels and values, ii) selecting ambiguous XML nodes as targets for disambiguation, iii) representing target nodes as special sphere neighborhood vectors considering all XML structural relations, and iv) running the sphere neighborhood vectors through a hybrid disambiguation process, allowing the user to fine-tune disambiguation parameters according to her needs.
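As a rough illustration of module iii), the sketch below builds a context vector for a target node from the nodes lying within a given number of tree edges around it. It is not the XSDF implementation: the class and method names (`SphereContextSketch`, `contextOf`, `sphereContext`) are hypothetical, the reading of a "sphere" as all element nodes reachable within `radius` parent/child edges is an assumption, and so is the linear weight `1 - d / (radius + 1)`. Only element nodes are considered here; attributes and data values are left out for brevity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class SphereContextSketch {

    /** Parses the XML text and builds the context of the first element named targetTag. */
    static Map<String, Double> contextOf(String xml, String targetTag, int radius) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Node target = doc.getElementsByTagName(targetTag).item(0);
            if (target == null) {
                throw new IllegalArgumentException("No element named " + targetTag);
            }
            return sphereContext(target, radius);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse XML input", e);
        }
    }

    /**
     * Breadth-first walk over parent and child links, up to `radius` edges away
     * from the target. Returns label -> weight; labels that occur several times
     * accumulate their weights. The weighting scheme is an assumption.
     */
    static Map<String, Double> sphereContext(Node target, int radius) {
        if (radius < 0) {
            throw new IllegalArgumentException("radius must be >= 0");
        }
        Map<Node, Integer> distance = new IdentityHashMap<>();
        Deque<Node> queue = new ArrayDeque<>();
        Map<String, Double> vector = new LinkedHashMap<>();
        distance.put(target, 0);
        queue.add(target);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            int d = distance.get(current);
            if (d > 0) { // the target itself is not part of its own context
                double weight = 1.0 - (double) d / (radius + 1);
                vector.merge(current.getNodeName(), weight, Double::sum);
            }
            if (d >= radius) {
                continue; // do not expand beyond the sphere boundary
            }
            List<Node> neighbors = new ArrayList<>();
            neighbors.add(current.getParentNode());
            for (Node c = current.getFirstChild(); c != null; c = c.getNextSibling()) {
                neighbors.add(c);
            }
            for (Node n : neighbors) {
                boolean isElement = n != null && n.getNodeType() == Node.ELEMENT_NODE;
                if (isElement && !distance.containsKey(n)) {
                    distance.put(n, d + 1);
                    queue.add(n);
                }
            }
        }
        return vector;
    }
}
```

For a sample document such as `<films><picture><cast><star>Kelly</star></cast><plot/></picture></films>`, calling `SphereContextSketch.contextOf(xml, "cast", 1)` yields the labels `picture` and `star`, while a radius of 2 additionally reaches `films` and the sibling `plot` with a lower weight.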
In comparison with existing DB and IR-related systems involving XML semantic analysis, our prototype is tied neither to a specific application nor to a specific context (it does not extend or propose a new XML querying language as in [1, 2], nor does it focus on one single application such as document clustering [3] or structural pattern matching [4]). In fact, it implements low-level algorithms and a semantic evaluation method that could be exploited in various application scenarios, enabling the user to test and evaluate their efficiency in each application domain.
II. System Architecture
Fig. 1. Overall XSDF system architecture.
We have developed a prototype system, titled XSDF (XML Semantic Disambiguation Framework), to test, evaluate and validate our approach, including implementations of its most recent alternatives in the literature. The XSDF prototype, implemented in Java (NetBeans 7.0) with WordNet 2.1, is made of five independent and interacting components (cf. Fig. 1):
- XML processor component: starts by verifying the integrity of XML documents, parsing XML tag names and data values in order to produce the corresponding XDT representations. It then passes XML node labels to the linguistic pre-processing component, before computing their lexical and structural properties (i.e., label polysemy, depth, density, and ambiguity degree).
- Linguistic pre-processing component: applies linguistic pre-processing transformations to XML node labels (i.e., tokenization, stop word removal, and stemming).
- XML Context Builder component: builds context vectors following our relational information model for each XML target node selected for disambiguation (based on a user-chosen ambiguity degree threshold value). Context vectors can be built following our sphere neighborhood model, as well as some of its most basic alternatives, including parent node context [5, 6] and sub-tree context [7]. It is extensible to others.
- WordNet Processor component: handles access to the WordNet database, allowing the system to: i) retrieve concepts (synsets) along with their definitions (glosses), synonyms, and relations, ii) compute semantic similarity measures (i.e., edge-based, node-based, and gloss-based), and iii) build context vectors for WordNet concepts (following our context-based disambiguation approach).
- XML Disambiguation component: consists of several autonomous algorithms, including ours: XSDConcept, XSDContext, and XSDCombined, as well as two of their most prominent alternatives which we refer to as: RPD (Root Path Disambiguation) [8], and VSD (Versatile Structure Disambiguation) [9]. It is extensible to others.
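To make the interplay between label pre-processing and gloss-based sense selection more concrete, here is a minimal sketch under strong simplifying assumptions. It is not XSDConcept, XSDContext or XSDCombined, and it does not access WordNet: the class `LabelDisambiguationSketch`, its methods, the tiny stop list and the suffix stripper are all hypothetical stand-ins, and candidate senses are passed in as plain gloss strings. The scoring is a simple gloss-overlap count (in the spirit of Lesk-style measures) between each candidate gloss and the labels found in the target node's context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class LabelDisambiguationSketch {

    // Tiny illustrative stop list (assumption); a real system would use a fuller one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    /** Tokenizes on camelCase boundaries and non-letters, lowercases, drops stop words, stems. */
    static List<String> preprocess(String label) {
        String spaced = label.replaceAll("([a-z])([A-Z])", "$1 $2");
        List<String> tokens = new ArrayList<>();
        for (String raw : spaced.split("[^A-Za-z]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            tokens.add(stem(token));
        }
        return tokens;
    }

    /** Toy suffix stripper standing in for a real stemmer such as Porter's. */
    static String stem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    /**
     * Returns the sense id whose gloss shares the most distinct stems with the
     * context labels. Ties go to the sense that comes first in the map's
     * iteration order, so an ordered map is advisable. Returns null if the
     * map is empty.
     */
    static String bestSense(Map<String, String> glossesBySense, List<String> contextLabels) {
        Set<String> context = new HashSet<>();
        for (String label : contextLabels) {
            context.addAll(preprocess(label));
        }
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, String> sense : glossesBySense.entrySet()) {
            Set<String> shared = new HashSet<>(preprocess(sense.getValue()));
            shared.retainAll(context); // keep only stems that also occur in the context
            if (shared.size() > bestOverlap) {
                bestOverlap = shared.size();
                best = sense.getKey();
            }
        }
        return best;
    }
}
```

For instance, `preprocess("ListOfStars")` gives the tokens `list` and `star`. With two made-up glosses for the label "star" (one about a celestial body, one about "an actor who plays a principal role in a film") and context labels such as `Picture`, `cast`, `FilmDirector` and `actors`, `bestSense` picks the actor sense because its gloss shares the stems `actor` and `film` with the context.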
A prototype snapshot is shown in Fig. 2.
Fig. 2. Snapshot of XSDF prototype system
We are currently undertaking experiments on XML-based multimedia documents, such as SVG, SMIL, X3D and MPEG-7 [14], as well as XML-based SOAP processing [15, 16].
Hereunder, we provide links to documents and web pages related to our study:
- XML Semantic Disambiguation Framework – Technical Report
- XS3: XML Semantic and Structure Similarity
- XML Grammar Matching and Comparison
- GML Data Search and Retrieval
- SVG-to-RDF Image Semantization
References
[1] Schenkel R., Theobald A. and Weikum G., Semantic Similarity Search on Semistructured Data with the XXL Search Engine. IR Journal, 8, pp. 521-545, 2005.
[2] Schlieder T., Similarity Search in XML Data Using Cost-based Query Transformations. In Proc. of ACM SIGMOD WebDB, pp. 19-24, 2001.
[3] Dalamagas T., Cheng T., Winkel K. and Sellis T., A Methodology for Clustering XML Documents by Structure. IS Journal, 31(3), pp. 187-228, 2006.
[4] Sanz I. et al., ArHeX: An Approximate Retrieval System for Highly Heterogeneous XML Document Collections. In Proc. of EDBT, pp. 192-206, 2006.
[5] Taha K. and Elmasri R., CXLEngine: A Comprehensive XML Loosely Structured Search Engine. In Proceedings of the EDBT Workshop on Database Technologies for Handling XML Information on the Web (DataX'08), pp. 37-42, Nantes, France, 2008.
[6] Taha K. and Elmasri R., XCDSearch: An XML Context-Driven Search Engine. IEEE Transactions on Knowledge and Data Engineering, 22(12), pp. 1781-1796, 2010.
[7] Theobald M., Schenkel R. and Weikum G., Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data. In Proceedings of the ACM SIGMOD International Workshop on Databases (WebDB), pp. 1-6, San Diego, California, 2003.
[8] Tagarelli A., Longo M. and Greco S., Word Sense Disambiguation for XML Structure Feature Generation. In Proceedings of the European Semantic Web Conference, LNCS 5554, pp. 143-157, 2009.
[9] Mandreoli F., Martoglia R. and Ronchetti E., Versatile Structural Disambiguation for Semantic-Aware Applications. In Proceedings of the ACM International Conference on Information and Knowledge Management, pp. 209-216, 2005.