Incremental Duplicate Detection (Summer Semester 2016)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), John Koumarelas (Information Systems)
Course website: https://hpi.de/naumann/teaching/teaching/ss-16/incremental-duplicate-detection.html

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment deadline:
Course type: Seminar
Enrollment type: Compulsory elective module
Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

IT-Systems Engineering A
IT-Systems Engineering B
IT-Systems Engineering C
IT-Systems Engineering D

IT-Systems Engineering BA

Description

Duplicates in a dataset are multiple, different representations of the same real-world object. Detecting them is usually complex. Huge datasets and the online nature of modern systems additionally demand incremental detection on newly incoming data. In this seminar, we want to explore existing techniques for incremental duplicate detection, re-implement them, extend them, and evaluate them.

A naive approach that compares each new record with all existing records requires O(n) comparisons per incoming record, where n is the number of records already stored. In real-time systems this is not feasible. Participating students will therefore be provided with relevant literature that proposes systems with advanced indexing techniques for handling this problem.
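
To illustrate why an index helps, here is a minimal Python sketch of one common indexing idea, blocking: each incoming record is compared only with stored records that share a blocking key, rather than with all stored records. This is not taken from the seminar literature. The class name `IncrementalDetector`, the three-character blocking key, the `difflib`-based similarity measure, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record: str) -> str:
    """Hypothetical blocking key: first three characters, lowercased, whitespace removed."""
    return "".join(record.lower().split())[:3]


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Illustrative similarity test; the threshold is an arbitrary assumption."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


class IncrementalDetector:
    """Sketch of incremental duplicate detection with a blocking index."""

    def __init__(self) -> None:
        self.records: list[str] = []          # all records seen so far
        self.index = defaultdict(list)        # blocking key -> record ids
        self.comparisons = 0                  # pairwise comparisons performed

    def add(self, record: str) -> list[int]:
        """Insert a record and return the ids of stored records it duplicates."""
        key = blocking_key(record)
        duplicates = []
        # Only candidates in the same block are compared, not all n records.
        for rid in self.index[key]:
            self.comparisons += 1
            if similar(record, self.records[rid]):
                duplicates.append(rid)
        new_id = len(self.records)
        self.records.append(record)
        self.index[key].append(new_id)
        return duplicates


detector = IncrementalDetector()
detector.add("Felix Naumann")
detector.add("John Koumarelas")
matches = detector.add("Felix Nauman")  # compared only against the "fel" block
```

In this run the third record is compared with one stored record instead of two. The trade-off is recall: a duplicate whose blocking key differs (for example, a typo in the first three characters) is never compared and therefore missed, which is one reason more advanced indexing techniques exist.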

Prerequisites

Desired: prior attendance of the Information Integration course or the Data Profiling and Data Cleansing course (students who have taken one of them are given higher priority if more than 12 students want to participate)

Learning and Teaching Formats

Introductory session
Individual meetings with advisors
Plenary meetings
Team-based software project (teams of 2)

Assessment

Active participation
Short intermediate presentation (10 min per team)
Long final presentation (30 min per team)
Report (6 pages)
Implementation (efficiency, effectiveness, and extensions)

Schedule

Please find the maintained schedule on the course page.
