Efficient Identification of Similar XML Fragments Based on Tree Edit Distance

Efficient Identification of Similar XML Fragments Based on Tree Edit Distance

Author: 
Wang, Hongzhi
Place: 
Hershey, PA
Publisher: 
IGI Global
Date published: 
2011
Record type: 
Responsibility: 
Li, Jianzhong, jt. author
Li, Fei, jt. author
Editor: 
Tagarelli, Andrea
Source: 
XML Data Mining
Subject: 
Abstract: 

Similarity detection between large XML fragment sets is broadly used in many applications such as data integration and XML de-duplication. Extensive methods are used to find similar XML fragments, such as the pq-gram state-of-the-art method which allows for relatively high join quality and efficiency. In this chapter, we propose pq-hash as an improvement to pq-grams. As the base of pq-hash, a randomized data structure, pq-array, is developed. With pq-array, large trees are represented as small fixed sized arrays. To efficiently perform similarity join on XML fragment sets, in this chapter we propose a cluster-based partition strategy as well as a sort-merge & hash join strategy to avoid nested loop join. Both our theoretical analysis and experimental results confirm that, while retaining high join quality, pq-hash gains much higher efficiency than pq-grams, and our strategies for approximate join are effective.

Series: 
Advances in Data Mining and Database Management

CITATION: Wang, Hongzhi. Efficient Identification of Similar XML Fragments Based on Tree Edit Distance edited by Tagarelli, Andrea . Hershey, PA : IGI Global , 2011. XML Data Mining - Available at: https://library.au.int/frefficient-identification-similar-xml-fragments-based-tree-edit-distance