Vogel, Tobias; Naumann, Felix
Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB
Duplicate detection is the process of identifying multiple but different representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious and often yields poorly chosen keys. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to reuse those blocking keys for datasets from similar domains that lack a gold standard. Blocking keys are created based on unigrams, which we extend with length hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
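To illustrate the general idea behind the abstract, the following is a minimal sketch (not the authors' implementation) of blocking for duplicate detection: records are grouped by a unigram-based key, optionally extended with a coarse length hint, so that only records within the same block are compared. The record fields, the choice of character position, and the length bucketing are hypothetical illustrations, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; field names and values are hypothetical.
records = [
    {"id": 1, "name": "Jon Smith"},
    {"id": 2, "name": "John Smith"},
    {"id": 3, "name": "Mary Jones"},
]

def blocking_key(record):
    # Unigram key: a single character at a fixed position (here the first
    # character of the name), extended with a coarse length hint so that
    # records of very different lengths fall into different blocks.
    name = record["name"].lower()
    unigram = name[0] if name else ""
    length_hint = len(name) // 8  # hypothetical bucket size
    return f"{unigram}{length_hint}"

# Partition the records into blocks by their blocking key.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

# Only pairs within the same block are candidate duplicates; all
# cross-block comparisons are skipped.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # → [(1, 2)]
```

With three records, a full pairwise comparison would need three pairs; blocking reduces this to the single pair (1, 2), since only "Jon Smith" and "John Smith" share a blocking key. Choosing which character position and which length bucketing to use is exactly the key-selection problem the paper addresses.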