Hasso-Plattner-Institut (25 Years of HPI)

Duplicate Detection (Winter Semester 2010/2011)

Lecturer: Prof. Dr. Felix Naumann (Information Systems)

General Information

  • Weekly hours per semester: 2
  • ECTS: 3
  • Graded: Yes
  • Enrollment period: 1.10.2010 - 31.3.2011
  • Course type: SP
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 10

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA
  • IT-Systems Engineering A
  • IT-Systems Engineering B
  • IT-Systems Engineering C
  • IT-Systems Engineering D
IT-Systems Engineering BA


Duplicate detection is the task of finding multiple representations of the same real-world entity within a dataset. The task is difficult because representations may differ slightly, so a similarity measure must be defined to compare pairs of records. A further difficulty is the high volume that datasets may have, which makes a pairwise comparison of all records infeasible.
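
The two difficulties named above can be illustrated with a minimal sketch. It assumes records are plain strings and uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the function names, the sample records, and the threshold of 0.8 are hypothetical choices for illustration and are not taken from the seminar material.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two record strings.

    Assumption: a generic string matcher stands in for a
    domain-specific similarity measure.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records, threshold=0.8):
    """Naively compare every pair of records.

    Returns the pairs whose similarity is at or above the
    (hypothetical) threshold.
    """
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if similarity(a, b) >= threshold
    ]


def pair_count(n: int) -> int:
    """Number of pairwise comparisons for n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Made-up sample records; the first two differ by a single letter.
records = [
    "John Smith, Berlin",
    "Jon Smith, Berlin",
    "Mary Jones, Potsdam",
]
duplicates = find_duplicates(records)
```

With these three records, only the two "Smith" entries are reported as a candidate duplicate pair. The quadratic growth is visible in `pair_count`: three records need 3 comparisons, while one million records would already need roughly 5 × 10^11, which is why comparing all pairs does not scale.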

In this seminar, we will discuss several papers covering different aspects of duplicate detection.


No requirements.


  • Felix Naumann and Melanie Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management #3, 2010.
  • Peter Christen and Karl Goiser. Quality and Complexity Measures for Data Linkage and Deduplication. Quality Measures in Data Mining, Volume 43, 2007.

Learning and Teaching Formats

Master's seminar for up to 10 students (no teams)


  • Active participation in all seminar sessions
  • At least 1 consultation each for presentation and written summary
  • Presentation at the end of the semester
  • Written summary and discussion of the paper (up to 8 pages), using the LaTeX template


  • 19.10.2010: Seminar introduction and presentation of the topics
  • 25.10.2010: Registration deadline (Email to Uwe Draisbach)

Please check the seminar page regularly for all other seminar dates.