Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Tobias Vogel

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: +49 331 5509 292
Fax: +49 331 5509 287
Room: E-2.02.2
E-Mail: T. Vogel

FOAF Description


Research

  • Data Quality Services for Duplicate Detection

Research Areas

  • Duplicate Detection
  • Data Cleaning

Projects

Running Projects

Finished Projects

  • PoSR (Potsdam Service Repository)
  • iDuDe (Duplicate Detection for iOS)

Teaching

  • WS 2009/2010: Master's Seminar "Emerging Web Services Technologies"
  • WS 2009/2010: Workshop "Duplikaterkennung" (Duplicate Detection)
  • SS 2010: Master's Seminar "Similarity Search Algorithms"

Activities

  • Local Arrangements Chair for ICIQ 2009

Publications

Semi-Supervised Consensus Clustering: Reducing Human Effort

Vogel, Tobias; Naumann, Felix. In Proceedings of the International Workshop on Data Integration and Applications, 2014.

Machine-based clustering yields fuzzy results. For example, when detecting duplicates in a dataset, different tools might end up with different clusterings. Eventually, a decision needs to be made that defines which records are in the same cluster, i.e., are duplicates. Such a definitive result is called a consensus clustering. It can be created by evaluating the clustering attempts against each other and having human experts resolve only the disagreements. Yet, there can be different consensus clusterings, depending on the choice of disagreements presented to the human expert. In particular, they may require a different number of manual inspections. We present a set of strategies to select the smallest set of manual inspections to arrive at a consensus clustering and evaluate their efficiency on a set of real-world and synthetic datasets.
SemiSupervisedConsensusClustering.pdf
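
As a rough illustration of the idea in the abstract above, the following minimal Python sketch finds the record pairs on which several clusterings disagree, i.e., the candidates a human expert might be asked to inspect. The input format (one dict per tool, mapping record id to cluster label), the function name `disagreeing_pairs`, and the sample data are assumptions made for this example. It does not reproduce the selection strategies from the paper.

```python
from itertools import combinations


def disagreeing_pairs(clusterings):
    """Return record pairs that the clusterings do not judge unanimously.

    `clusterings` is a list of dicts mapping record id -> cluster label
    (a hypothetical input format chosen for this sketch).
    """
    records = sorted(clusterings[0])
    pairs = []
    for a, b in combinations(records, 2):
        # One vote per clustering: True if it puts a and b in the same cluster.
        votes = [c[a] == c[b] for c in clusterings]
        # A disagreement exists when some, but not all, clusterings vote True.
        if any(votes) and not all(votes):
            pairs.append((a, b))
    return pairs


# Two hypothetical duplicate-detection results over the same four records.
tool_1 = {"r1": 0, "r2": 0, "r3": 1, "r4": 1}
tool_2 = {"r1": 0, "r2": 0, "r3": 0, "r4": 1}

print(disagreeing_pairs([tool_1, tool_2]))
```

In this toy example both tools agree that r1 and r2 are duplicates, so that pair needs no inspection. Only the pairs involving r3 remain open.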

Master's Theses

  • Duplicate Detection Across Structured And Unstructured Data (http://www.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/arbeiten/Thema_Masterarbeit.pdf) - David Sonnabend
  • Duplicate Detection with CrowdSourcing (e.g. Amazon's Mechanical Turk) - David Wenzel