Efficient Identification of Similar XML Fragments Based on Tree Edit Distance
Efficient Identification of Similar XML Fragments Based on Tree Edit Distance
Similarity detection between large XML fragment sets is broadly used in many applications such as data integration and XML de-duplication. Extensive methods are used to find similar XML fragments, such as the pq-gram state-of-the-art method which allows for relatively high join quality and efficiency. In this chapter, we propose pq-hash as an improvement to pq-grams. As the base of pq-hash, a randomized data structure, pq-array, is developed. With pq-array, large trees are represented as small fixed sized arrays. To efficiently perform similarity join on XML fragment sets, in this chapter we propose a cluster-based partition strategy as well as a sort-merge & hash join strategy to avoid nested loop join. Both our theoretical analysis and experimental results confirm that, while retaining high join quality, pq-hash gains much higher efficiency than pq-grams, and our strategies for approximate join are effective.
CITATION: Wang, Hongzhi. Efficient Identification of Similar XML Fragments Based on Tree Edit Distance edited by Tagarelli, Andrea . Hershey, PA : IGI Global , 2011. XML Data Mining - Available at: https://library.au.int/frefficient-identification-similar-xml-fragments-based-tree-edit-distance