The paper "Cluster-based Sorted Neighborhood for Efficient Duplicate Detection" by Ahmad Samiei and Felix Naumann had been accepted at the DINA workshop at the ICDM-2016 conference.
Duplicate detection intends to find multiple and syntactically different representations of the same real-world entities in a dataset. The naive way of duplicate detection entails a quadratic number of pair-wise record comparisons to identify the duplicates. This number of comparisons might take hours even for an average sized dataset. As today's databases grow very fast, different candidate-selection methods, such as sorted neighborhood, blocking, canopy clustering and their variations, address this problem by shrinking the comparison space. The volume and velocity of data-change require ever faster and more flexible methods of duplicate detection. In particular, they need dynamic indices that can be updated efficiently as new data arrives.
We present a novel approach, which combines the idea of cluster-based methods with the well-known sorted neighborhood method. It carefully filters out irrelevant candidate pairs, which are less likely to yield duplicates, by pre-clustering records based not only on their proximity after sorting, but also on their similarity in selected attributes. An empirical evaluation on synthetic and real-world datasets shows that our approach improves the overall runtime over existing approaches while maintaining comparable result quality. Meanwhile, it uses dynamic indices, that in turns make it useful for deduplicating streaming data.