Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Publications (sorted in inverse chronological order)

Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations

Vogel, Tobias; Naumann, Felix. In: Proceedings of the 10th International Workshop on Quality in Databases (QDB), in conjunction with VLDB 2012.

Duplicate detection is the process of identifying multiple, differing representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious, and the resulting keys are often poorly chosen. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains that lack a gold standard. Blocking keys are created based on unigrams, which we extend with length-hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
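To illustrate the general idea of blocking with unigram-based keys, the following is a minimal sketch and not the method from the paper. It assumes a key is built from single characters at fixed positions of chosen attributes; the names `blocking_key`, `candidate_pairs`, and the `(attribute, position)` specification format are hypothetical, and neither the automatic key selection nor the length-hints are covered.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, unigram_spec):
    """Concatenate single characters (unigrams) taken from record attributes.

    unigram_spec is a list of (attribute, position) pairs. This format is an
    assumption made for illustration, not the paper's definition.
    """
    parts = []
    for attribute, position in unigram_spec:
        value = record.get(attribute, "").lower()
        # Use a placeholder when the value is too short for the position.
        parts.append(value[position] if position < len(value) else "_")
    return "".join(parts)


def candidate_pairs(records, unigram_spec):
    """Partition records by blocking key and pair up records within a block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[blocking_key(record, unigram_spec)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up example records; only pairs sharing a key are compared later.
records = {
    1: {"first": "Tobias", "last": "Vogel"},
    2: {"first": "Tobi", "last": "Vogel"},
    3: {"first": "Felix", "last": "Naumann"},
    4: {"first": "Felix", "last": "Nauman"},
}
spec = [("first", 0), ("last", 0)]
pairs = candidate_pairs(records, spec)
```

With this key, the four example records fall into two blocks, so only two of the six possible record pairs remain as candidates for the more expensive similarity comparison.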