Hasso-Plattner-Institut
  
Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Tobias Vogel

 

 

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: ++49 331 5509 292
Fax: ++49 331 5509 287
Room: E-2.02.2
E-Mail: T. Vogel

FOAF Description


Research

  • Data Quality Services for Duplicate Detection

Research Areas

  • Duplicate Detection
  • Data Cleaning

Projects

Running Projects

Finished Projects

  • PoSR (Potsdam Service Repository)
  • iDuDe (Duplicate Detection for iOS)

Teaching

  • WS 2009/2010: Master's Seminar "Emerging Web Services Technologies"
  • WS 2009/2010: Workshop "Duplikaterkennung"
  • SS 2010: Master's Seminar: "Similarity Search Algorithms"

Activities

  • Local Arrangements Chair for ICIQ 2009

Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations

Tobias Vogel, Felix Naumann
In Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB, 2012

Abstract:

Duplicate detection is the process of identifying multiple but different representations of same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious and the keys are often poorly chosen. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains lacking a gold standard. Blocking keys are created based on unigrams, which we extend with length-hints for further improvement. Blocking key creation is accompanied with several comprehensive experiments on large artificial and real-world datasets.

BibTeX file

@inproceedings{UnigramblockingVogelNaumann2012,
author = { Tobias Vogel, Felix Naumann },
title = { Automatic Blocking Key Selection for Duplicate Detection based on Unigram Combinations },
year = { 2012 },
month = { 0 },
abstract = { Duplicate detection is the process of identifying multiple but different representations of same real-world objects, which typically involves a large number of comparisons. Partitioning is a well-known technique to avoid many unnecessary comparisons. However, partitioning keys are usually handcrafted, which is tedious and the keys are often poorly chosen. We propose a technique to find suitable blocking keys automatically for a dataset equipped with a gold standard. We then show how to re-use those blocking keys for datasets from similar domains lacking a gold standard. Blocking keys are created based on unigrams, which we extend with length-hints for further improvement. Blocking key creation is accompanied with several comprehensive experiments on large artificial and real-world datasets. },
booktitle = { Proceedings of the 10th International Workshop on Quality in Databases (QDB) in conjunction with VLDB },
priority = { 0 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Fri, 17 Apr 2015 11:21:24 +0200

Master's Theses

  • <a href="http://www.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/arbeiten/Thema_Masterarbeit.pdf">Duplicate Detection Across Structured And Unstructured Data</a> - David Sonnabend <br>
  • Duplicate Detection with CrowdSourcing (e.g. Amazon's Mechanical Turk) - David Wenzel