UPPA
ACMSigappFr
LAU Alumni Association Logo-Final-May20110

XS3: A Prototype for XML Document and Grammar Similarity Evaluation

 

Extended Toward Approximate GML Data Search & Retrieval

 

Download prototype and short demo video


Joe TEKLI
SOE, Dept. of Electrical & Computer Eng.
Lebanese American University
36 Byblos, LEBANON
joe.tekli@lau.edu.lb
www.lau.edu.lb


Richard CHBEIR
UPPA Laboratory, IUT of Bayonne
University of Pau and Adour Countries
64600 Anglet, FRANCE
richard.chbeir@univ-pau.fr
www.univ-pau.fr

 

I. Introduction

An XML document consists of a hierarchically structured self-describing piece of information, made of atomic and complex elements (i.e., containing sub-elements) as well as atomic attributes, thus incorporating structure and semantically rich data in one entity. This simple and self-describing nature of XML promotes a number of emerging techniques ranging from intelligent web search and retrieval (e.g., GML retrieval) to automatic information extraction and clustering/classification requiring XML similarity evaluation.

XS3 is a system for XML Structural and Semantic Similarity assessment. It allows the comparison of heterogeneous XML documents and grammars (originating form different data sources) based on their structural and/or semantic features, a topic which has been investigated in both DB and IR related research.

In comparison with existing DB and IR-related systems involving XML similarity assessment, our prototype is not tied to a specific application nor to a specific context (it does not extend or propose a new XML querying language as in [5, 6], nor does it focus on one single application such as document clustering [2] or structural pattern matching [4]). In fact, it implements low-level algorithms and similarity evaluation methods that could be exploited in various application scenarios, enabling the user to test and evaluate their efficiency in each application domain, and choose the one that is adapted to her needs.

II. System Architecture

The XS3 prototype, implemented using C#.Net, is made of four independent and interactive components, as well as various comparison modules and facilities (Figure 1).
The parser component starts by verifying the integrity of XML documents and grammars (DTDs and recently XSDs), subsequently transforming them into ordered labeled trees to be treated by the similarity evaluation component.
The similarity evaluation component consists of several autonomous algorithms (mostly based on the concept of tree edit distance), among which [1, 2, 3, 7, 8] dedicated to XML document/document, [9, 10] for document/grammar comparison, [11, 12] for grammar/grammar matching. It is extensible to other XML-based comparison approaches (a combined structural and semantic similarity measure has been recently implemented in XS3 [13], integrating edit distance and the traditional IR vector space model).
The Synthetic XML document/grammar generator produces sets of XML documents and grammars, based on specific user input requirements (e.g., a variability parameter for document generation, a controlled vocabulary for generating synthetic grammars, operator disposition, etc.).
Furthermore, a taxonomic analyzer component was introduced to compute semantic similarity values between words (expressions) in a given reference knowledge base (e.g., WordNet), to be subsequently exploited in evaluating XML label similarity [11, 12].
Built upon the main system components of XS3 are different modules and facilities for assessing XML similarity. These range over One to One comparisons (comparing one XML document X1 to another document X2, one XML grammar D1 to another D2, or one document X1 to a grammar D1 and vice-versa), One to Many comparisons (comparing one XML document X1/grammar D1 to a set of XML documents/grammars and vice-versa, ranking the documents/grammars according to their similarity to X1/D1), and Many to Many comparison module (comparing sets of XML documents/grammars, consequently enabling XML document/grammar clustering and classification).

 

Fig. 1. Overall XS3 architecture.

 

With its various components and functionalities, XS3 can be exploited in XML ranked search-by-document and search-by-grammar applications, as well as classic data warehousing and version control ones. In addition, its clustering and classification facilities integrate information retrieval concepts and metrics in an original manner (i.e., specially devised XML-related precision and recall metrics coupled with hierarchical classification/clustering algorithms) to be utilized as means for comparing the efficiency and accuracy of different XML similarity approaches in various application scenarios. XS3 is downloadable for research purposes (cf. Figure 2). A public web service version of our experimental system will be soon available.

 

snapshots
Fig. 2. Snapshots of the XS3 system

 

GML Snapshots
Fig. 3. Snapshot of the newly introduced GML data search interface.

 

We are currently undertaking experiments on XML-based multimedia documents, such as SVG, SMIL, X3D and MPEG-7 [14], as well as XML-based SOAP processing [15, 16].

Hereunder, we provide links to web pages and technical reports related to our study, detailing certain components and algorithms, as well as experimental results.

References

    1. Chawathe S., Comparing Hierarchical Data in External Memory. In Proc. of the VLDB Conference, pp. 90-101, 1999
    2. Dalamagas, T., Cheng, T., Winkel, K., and Sellis, T. 2006. A methodology for clustering XML documents by structure. IS Journal, 31, 3, pp. 187-228, 2006
    3. Nierman A. and Jagadish H. V., Evaluating structural similarity in XML documents. In Proc. of ACM SIGMOD WebDB, pp. 61-66, 2002
    4. Sanz I. et al., ArHeX : An Approximate Retrieval System for Highly Heterogeneous XML Document Collections, In Proc. of EDBT, 192-206, 2006
    5. Schenkel R., Theobald A. and Weikum G., Semantic Similarity Search on Semistructured Data with the XXL Search Engine, IR Journal 8, 521-545, 2005
    6. Schlieder T., Similarity Search in XML Data Using Cost-based Query Transformations. In Proc. of ACM SIGMOD WebDB, pp. 19-24, 2001
    7. Tekli J., Chbeir R. and Yetongnon K., A Fine-grained XML Structural Comparison Approach. In Proc. of the ER Conf., pp. 582-598, Auckland, New Zealand, 2007
    8. Tekli J. et al., Efficient XML Structural Similarity Detection using Sub-tree Commonalities. In Proc. of SBBD and SIGMOD DiSC, (Best paper), Brazil, 2007
    9. Tekli J., Chbeir R. and Yetongnon K., Structural Similarity Evaluation between XML Documents and DTDs. In Proc. of the WISE Conf., pp. 186-211, Nancy, France, 2007
    10. Tekli J., Chbeir R., Traina A.J., Traina C. Jr, Fileto R., Approximate XML Structure Validation using Document-Grammar Tree Similarity, Submitted to ACM Tranasctions on the Web (ACM TWEB), 2013
    11. Tekli J., Chbeir R. and Yetongnon K., XML Grammar Matching and Comparison: A User-based Framework. In Proc. of the ER Conf.,, Brazil, LNCS 5829, 2009, pp. 294-314
    12. Tekli J. and Chbeir R., An XML Grammar Matching Framework, Elsevier Information Sciences, 210:1-40, 2012
    13. Tekli J., Chbeir R. and Yetongnon K.,A novel XML document structure comparison framework based-on sub-tree commonalities and label semantics, Elsevier Journal of Web Semantics (JWS),11:14-40, 2012
    14. Tekli J., Chbeir R. and Yetongnon K., XS3: A System for Similarity Evaluation in Multimedia-based Heterogeneous XML Repositories, ACM International Conference on Multimedia, Vancouver, Canada, 2008
    15. Tekli J., Damiani E., and Chbeir R., Differential SOAP Multicasting, In Proc. of IEEE International Conference on Web Services (ICWS), pp. 1-8, 2011
    16. Tekli J., Damiani E., and Chbeir R., Using XML-Based Multicasting to Improve Web Service Scalability. International Journal on Web Services Research (IJWSR), 9(1): 1-29, 2012