Efficient Identification of Similar XML Fragments Based on Tree Edit Distance

Author:

Wang, Hongzhi

Place:

Hershey, PA

Publisher:

IGI Global

Date published:

2011

Record type:

Book/Monograph

Responsibility:

Li, Jianzhong, jt. author

Li, Fei, jt. author

Editor:

Tagarelli, Andrea

Source:

XML Data Mining

Subject:

DATA MINING

Abstract:

Similarity detection between large XML fragment sets is broadly used in many applications such as data integration and XML de-duplication. Extensive methods are used to find similar XML fragments, such as the pq-gram state-of-the-art method which allows for relatively high join quality and efficiency. In this chapter, we propose pq-hash as an improvement to pq-grams. As the base of pq-hash, a randomized data structure, pq-array, is developed. With pq-array, large trees are represented as small fixed sized arrays. To efficiently perform similarity join on XML fragment sets, in this chapter we propose a cluster-based partition strategy as well as a sort-merge & hash join strategy to avoid nested loop join. Both our theoretical analysis and experimental results confirm that, while retaining high join quality, pq-hash gains much higher efficiency than pq-grams, and our strategies for approximate join are effective.

Series:

Advances in Data Mining and Database Management

Catagory URL:

http://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/978-1-61350-3…

DOI:

http://dx.doi.org/10.4018/978-1-61350-356-0.ch004

CITATION: Wang, Hongzhi. Efficient Identification of Similar XML Fragments Based on Tree Edit Distance edited by Tagarelli, Andrea . Hershey, PA : IGI Global , 2011. XML Data Mining - Available at: https://library.au.int/frefficient-identification-similar-xml-fragments-based-tree-edit-distance