XS3: A Prototype for XML Document and Grammar Similarity Evaluation
Extended Toward Approximate GML Data Search & Retrieval
Joe TEKLI
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, LEBANON
[email protected]
www.lau.edu.lb
Richard CHBEIR
UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
[email protected]
www.univ-pau.fr
I. Introduction
An XML document consists of a hierarchically structured self-describing piece of information, made of atomic and complex elements (i.e., containing sub-elements) as well as atomic attributes, thus incorporating structure and semantically rich data in one entity. This simple and self-describing nature of XML promotes a number of emerging techniques ranging from intelligent web search and retrieval (e.g., GML retrieval) to automatic information extraction and clustering/classification requiring XML similarity evaluation.
XS3 is a system for XML Structural and Semantic Similarity assessment. It allows the comparison of heterogeneous XML documents and grammars (originating form different data sources) based on their structural and/or semantic features, a topic which has been investigated in both DB and IR related research.
In comparison with existing DB and IR-related systems involving XML similarity assessment, our prototype is not tied to a specific application nor to a specific context (it does not extend or propose a new XML querying language as in [5, 6], nor does it focus on one single application such as document clustering [2] or structural pattern matching [4]). In fact, it implements low-level algorithms and similarity evaluation methods that could be exploited in various application scenarios, enabling the user to test and evaluate their efficiency in each application domain, and choose the one that is adapted to her needs.
II. System Architecture
The parser component starts by verifying the integrity of XML documents and grammars (DTDs and recently XSDs), subsequently transforming them into ordered labeled trees to be treated by the similarity evaluation component.
The similarity evaluation component consists of several autonomous algorithms (mostly based on the concept of tree edit distance), among which [1, 2, 3, 7, 8] dedicated to XML document/document, [9, 10] for document/grammar comparison, [11, 12] for grammar/grammar matching. It is extensible to other XML-based comparison approaches (a combined structural and semantic similarity measure has been recently implemented in XS3 [13], integrating edit distance and the traditional IR vector space model).
The Synthetic XML document/grammar generator produces sets of XML documents and grammars, based on specific user input requirements (e.g., a variability parameter for document generation, a controlled vocabulary for generating synthetic grammars, operator disposition, etc.).
Furthermore, a taxonomic analyzer component was introduced to compute semantic similarity values between words (expressions) in a given reference knowledge base (e.g., WordNet), to be subsequently exploited in evaluating XML label similarity [11, 12].
Built upon the main system components of XS3 are different modules and facilities for assessing XML similarity. These range over One to One comparisons (comparing one XML document X1 to another document X2, one XML grammar D1 to another D2, or one document X1 to a grammar D1 and vice-versa), One to Many comparisons (comparing one XML document X1/grammar D1 to a set of XML documents/grammars and vice-versa, ranking the documents/grammars according to their similarity to X1/D1), and Many to Many comparison module (comparing sets of XML documents/grammars, consequently enabling XML document/grammar clustering and classification).
![]() |
Fig. 1. Overall XS3 architecture. |
![]() |
Fig. 2. Snapshots of the XS3 system |
![]() |
Fig. 3. Snapshot of the newly introduced GML data search interface. |
We are currently undertaking experiments on XML-based multimedia documents, such as SVG, SMIL, X3D and MPEG-7 [14], as well as XML-based SOAP processing [15, 16].
Hereunder, we provide links to web pages and technical reports related to our study, detailing certain components and algorithms, as well as experimental results.
- XML Document Comparison: Appendix
- XML Grammar Matching and Comparison
- Approximate XML Document Grammar Validation – Technical Report
- GML Data Search and Retrieval
References
-
- Chawathe S., Comparing Hierarchical Data in External Memory. In Proc. of the VLDB Conference, pp. 90-101, 1999
- Dalamagas, T., Cheng, T., Winkel, K., and Sellis, T. 2006. A methodology for clustering XML documents by structure. IS Journal, 31, 3, pp. 187-228, 2006
- Nierman A. and Jagadish H. V., Evaluating structural similarity in XML documents. In Proc. of ACM SIGMOD WebDB, pp. 61-66, 2002
- Sanz I. et al., ArHeX : An Approximate Retrieval System for Highly Heterogeneous XML Document Collections, In Proc. of EDBT, 192-206, 2006
- Schenkel R., Theobald A. and Weikum G., Semantic Similarity Search on Semistructured Data with the XXL Search Engine, IR Journal 8, 521-545, 2005
- Schlieder T., Similarity Search in XML Data Using Cost-based Query Transformations. In Proc. of ACM SIGMOD WebDB, pp. 19-24, 2001
- Tekli J., Chbeir R. and Yetongnon K., A Fine-grained XML Structural Comparison Approach. In Proc. of the ER Conf., pp. 582-598, Auckland, New Zealand, 2007
- Tekli J. et al., Efficient XML Structural Similarity Detection using Sub-tree Commonalities. In Proc. of SBBD and SIGMOD DiSC, (Best paper), Brazil, 2007
- Tekli J., Chbeir R. and Yetongnon K., Structural Similarity Evaluation between XML Documents and DTDs. In Proc. of the WISE Conf., pp. 186-211, Nancy, France, 2007
- Tekli J., Chbeir R., Traina A.J., Traina C. Jr, Fileto R., Approximate XML Structure Validation using Document-Grammar Tree Similarity, Submitted to ACM Tranasctions on the Web (ACM TWEB), 2013
- Tekli J., Chbeir R. and Yetongnon K., XML Grammar Matching and Comparison: A User-based Framework. In Proc. of the ER Conf.,, Brazil, LNCS 5829, 2009, pp. 294-314
- Tekli J. and Chbeir R., An XML Grammar Matching Framework, Elsevier Information Sciences, 210:1-40, 2012
- Tekli J., Chbeir R. and Yetongnon K.,A novel XML document structure comparison framework based-on sub-tree commonalities and label semantics, Elsevier Journal of Web Semantics (JWS),11:14-40, 2012
- Tekli J., Chbeir R. and Yetongnon K., XS3: A System for Similarity Evaluation in Multimedia-based Heterogeneous XML Repositories, ACM International Conference on Multimedia, Vancouver, Canada, 2008
- Tekli J., Damiani E., and Chbeir R., Differential SOAP Multicasting, In Proc. of IEEE International Conference on Web Services (ICWS), pp. 1-8, 2011
- Tekli J., Damiani E., and Chbeir R., Using XML-Based Multicasting to Improve Web Service Scalability. International Journal on Web Services Research (IJWSR), 9(1): 1-29, 2012