Prof. Dr. Felix Naumann

Tobias Vogel













  • Data Quality Services for Duplicate Detection

Research Areas

  • Duplicate Detection
  • Data Cleaning


Running Projects

Finished Projects

  • PoSR (Potsdam Service Repository)
  • iDuDe (Duplicate Detection for iOS)


  • WS 2009/2010: Master's Seminar "Emerging Web Services Technologies"
  • WS 2009/2010: Workshop "Duplikaterkennung" (Duplicate Detection)
  • SS 2010: Master's Seminar "Similarity Search Algorithms"


  • Local Arrangements Chair for ICIQ 2009


Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations

Vogel, Tobias; Naumann, Felix. In: Proceedings of the 10th International Workshop on Quality in Databases (QDB), in conjunction with VLDB 2012.

Duplicate detection is the process of identifying multiple but different representations of the same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious and often yields poorly chosen keys. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains that lack a gold standard. Blocking keys are created based on unigrams, which we extend with length hints for further improvement. Blocking key creation is accompanied by several comprehensive experiments on large artificial and real-world datasets.
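
To illustrate the general idea of blocking described in the abstract, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not specify how unigrams are combined or how length hints are encoded, so this sketch assumes that a key is built from single characters at chosen positions of chosen attributes, and that a length hint is simply the length of one attribute value appended to the key. The names blocking_key, candidate_pairs, key_def, and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record, positions, length_hint_attr=None):
    """Build a key from single characters (unigrams) of attribute values.

    positions is a list of (attribute, index) pairs; this key definition
    format is an assumption made for illustration only.
    """
    parts = []
    for attr, idx in positions:
        value = record.get(attr, "").lower()
        # Use a placeholder when the value is too short for this position.
        parts.append(value[idx] if idx < len(value) else "_")
    if length_hint_attr is not None:
        # Assumed form of a "length hint": append the attribute's length.
        parts.append(str(len(record.get(length_hint_attr, ""))))
    return "".join(parts)


def candidate_pairs(records, positions, length_hint_attr=None):
    """Group records by blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for rid, rec in enumerate(records):
        blocks[blocking_key(rec, positions, length_hint_attr)].append(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records sharing a key are compared with each other.
        pairs.update(combinations(ids, 2))
    return pairs


# Made-up sample data: records 0 and 1 are likely duplicates.
records = [
    {"first": "Anna", "last": "Schmidt"},
    {"first": "Anna", "last": "Schmitt"},
    {"first": "Peter", "last": "Meyer"},
    {"first": "Paul", "last": "Mueller"},
]

# Hypothetical key: first letter of first name + first letter of last name.
key_def = [("first", 0), ("last", 0)]

plain = candidate_pairs(records, key_def)
hinted = candidate_pairs(records, key_def, length_hint_attr="last")
print(len(plain), len(hinted))
```

Without blocking, four records require six pairwise comparisons. In this toy data, the plain key leaves two candidate pairs, and adding the assumed length hint on the last name leaves only the pair of likely duplicates, because "Meyer" and "Mueller" differ in length.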

Master's Theses

  • Duplicate Detection Across Structured And Unstructured Data (http://www.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/arbeiten/Thema_Masterarbeit.pdf) - David Sonnabend
  • Duplicate Detection with CrowdSourcing (e.g. Amazon's Mechanical Turk) - David Wenzel