We are happy to announce that our paper "Transforming Pairwise Duplicates to Entity Clusters for High Quality Duplicate Detection" has been accepted by the ACM Journal of Data and Information Quality (JDIQ).
Authors: Uwe Draisbach, Peter Christen, Felix Naumann
Abstract: Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result.
We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters.
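To make the inconsistency mentioned in the abstract concrete, here is a minimal sketch in Python. It is not one of the algorithms from the paper. It only shows how the transitive closure of pairwise duplicate decisions can produce a cluster in which some record pairs were never classified as duplicates. The function names, record identifiers, and input pairs are assumptions made up for this illustration.

```python
from itertools import combinations


def transitive_clusters(records, duplicate_pairs):
    """Group records by the transitive closure of the pairwise duplicates.

    Uses a simple union-find; illustrative only, not the paper's method.
    """
    parent = {r: r for r in records}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())


def is_consistent(cluster, duplicate_pairs):
    """Return True if every pair of records in the cluster was classified as a duplicate."""
    pairs = {frozenset(p) for p in duplicate_pairs}
    return all(frozenset(p) in pairs for p in combinations(cluster, 2))


# Hypothetical input: r1~r2 and r2~r3 were classified as duplicates, r1~r3 was not.
records = ["r1", "r2", "r3", "r4"]
duplicate_pairs = [("r1", "r2"), ("r2", "r3")]

clusters = transitive_clusters(records, duplicate_pairs)
for cluster in clusters:
    print(sorted(cluster), "consistent:", is_consistent(cluster, duplicate_pairs))
```

In this toy input, r1, r2, and r3 end up in one transitive cluster even though r1 and r3 were never classified as duplicates, so the cluster is inconsistent in the sense described above. A subsequent clustering step, such as those the paper compares, would have to decide how to repair or split it.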