Hasso-Plattner-Institut

Incremental Duplicate Detection (Summer Semester 2016)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), John Koumarelas (Information Systems)
Course website: https://hpi.de/naumann/teaching/teaching/ss-16/incremental-duplicate-detection.html

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline:
  • Course format: Seminar
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA
  • IT-Systems Engineering A
  • IT-Systems Engineering B
  • IT-Systems Engineering C
  • IT-Systems Engineering D
IT-Systems Engineering BA


Duplicates in datasets are multiple, different representations of the same real-world object. Detecting them is usually complex. Huge datasets and the online nature of modern systems additionally demand incremental detection on newly arriving data. In this seminar, we want to explore existing techniques for incremental duplicate detection, re-implement them, extend them, and evaluate them.

A naive approach that compares each new record with all existing records requires O(n) comparisons per insertion, which is not feasible in real-time systems. Participating students will therefore be provided with relevant literature that proposes systems with advanced indexing techniques for handling this problem.
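
To make the contrast with the naive approach concrete, the following is a minimal sketch of index-based incremental detection. It is not taken from the course literature: the class name `IncrementalDeduplicator`, the prefix-based blocking key, and the similarity threshold of 0.85 are all illustrative assumptions. Each new record is compared only with earlier records that share its blocking key, rather than with all n existing records.

```python
from collections import defaultdict
from difflib import SequenceMatcher


class IncrementalDeduplicator:
    """Toy incremental duplicate detector backed by a blocking index.

    All names and parameters here are hypothetical and chosen for illustration.
    """

    def __init__(self, threshold=0.85, prefix_len=3):
        self.threshold = threshold      # assumed similarity cut-off
        self.prefix_len = prefix_len    # assumed blocking key length
        self.blocks = defaultdict(list) # blocking key -> records seen so far

    def _key(self, record):
        # Assumed blocking key: the lowercased prefix of the record.
        return record.strip().lower()[: self.prefix_len]

    def add(self, record):
        """Return earlier records judged duplicates of `record`, then index it."""
        candidates = self.blocks[self._key(record)]
        matches = [
            c for c in candidates
            if SequenceMatcher(None, c.lower(), record.lower()).ratio()
            >= self.threshold
        ]
        # Index the new record so later arrivals can be compared against it.
        candidates.append(record)
        return matches


dedup = IncrementalDeduplicator()
print(dedup.add("Felix Naumann"))    # no earlier records in this block
print(dedup.add("Felix Nauman"))     # compared only within the "fel" block
print(dedup.add("John Koumarelas"))  # different block, no comparisons made
```

A prefix key is a deliberately simple choice: it misses duplicates whose first characters differ, which is one reason more advanced indexing techniques exist.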


Desired: prior attendance of the Information Integration or the Data Profiling and Data Cleansing course (students who have taken one of them are given priority if more than 12 students want to participate)

Learning and Teaching Formats

  • Introductory session
  • Individual meetings with advisors
  • Plenary meetings
  • Team-based software project (teams of 2)


  • Active participation
  • Short intermediate presentation (10 min per team)
  • Long final presentation (30 min per team)
  • Report (6 pages)
  • Implementation (efficiency, effectiveness, and extensions)


Please find the maintained schedule on the course page.