Hasso-Plattner-Institut

Distributed Duplicate Detection (Winter Semester 2016/2017)

Lecturer: Prof. Dr. Felix Naumann (Information Systems)
Course Website: https://hpi.de/en/naumann/teaching/teaching/ws-1617/distributed-duplicate-detection.html

General Information

  • Weekly Hours: 4
  • Credits: 6
  • Graded: yes
  • Enrolment Deadline: 28.10.2016
  • Teaching Form: Project seminar
  • Enrolment Type: Compulsory Elective Module
  • Maximum number of participants: 12

Programs & Modules

IT-Systems Engineering BA
  • Business Process & Enterprise Technologies
  • Operating Systems & Information Systems Technology

Description

Duplicates in datasets are multiple, different representations of the same real-world object. Detecting them is usually complex. Huge datasets and the online nature of modern systems demand even more processing power, with more nodes participating to handle the incoming data. In this seminar, we want to explore existing techniques for distributed duplicate detection, and to re-implement, extend, and evaluate them. The teams will select a popular framework, such as Apache Spark or Apache Flink, to ease the implementation.

A naive approach could be to replicate all m records to n nodes and then distribute the pairs to be compared uniformly across the nodes, so that every node compares about m(m-1)/(2n) of the m(m-1)/2 record pairs.
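
The naive approach can be illustrated with a minimal single-machine sketch. It assumes records are identified by the indices 0 to m-1 and that pairs are dealt out round-robin; the function name `assign_pairs` is hypothetical, and a real implementation would run on a framework such as Apache Spark or Apache Flink rather than in plain Python. Since m records yield m(m-1)/2 unordered pairs, each node ends up with roughly m(m-1)/(2n) of them.

```python
from itertools import combinations


def assign_pairs(num_records, num_nodes):
    """Deal all unordered record pairs round-robin to the nodes.

    Records are represented by their indices 0..num_records-1
    (an assumption made for this sketch only).
    """
    buckets = [[] for _ in range(num_nodes)]
    for k, pair in enumerate(combinations(range(num_records), 2)):
        buckets[k % num_nodes].append(pair)
    return buckets


# 6 records produce 6 * 5 / 2 = 15 pairs, spread over 4 nodes.
buckets = assign_pairs(6, 4)
sizes = [len(b) for b in buckets]
print(sizes)
```

Round-robin assignment keeps the bucket sizes within one pair of each other, but every node still needs access to all records, which is the main cost of this approach.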

Participating students will be provided with relevant literature, or may propose their own, on algorithms that tackle this problem with different approaches. We are looking for at most 12 students, who will form groups; each group will be assigned one of these papers, with the task of implementing and extending it. The algorithms are evaluated on efficiency (i.e., runtime) and on effectiveness, measured by precision, recall, and f-measure. The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.
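
The effectiveness metrics can be computed as in the following sketch. It assumes that both the detected duplicates and the gold standard are given as collections of record-id pairs, and it uses the balanced F1 form of the f-measure; the function name `evaluate` and the sample ids are hypothetical and not part of the course material.

```python
def evaluate(found_pairs, true_pairs):
    """Return (precision, recall, f_measure) for detected duplicate pairs.

    Pairs are treated as unordered, so (1, 2) and (2, 1) count as the same pair.
    """
    found = {frozenset(p) for p in found_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(found & truth)

    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: 3 detected pairs, 4 true duplicate pairs, 2 in common.
found = [(1, 2), (2, 3), (4, 5)]
truth = [(2, 1), (4, 5), (6, 7), (8, 9)]
precision, recall, f_measure = evaluate(found, truth)
print(precision, recall, f_measure)
```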

Requirements

Desired: prior attendance of the Information Integration or the Data Profiling and Data Cleansing course (these students are given higher priority if more than 12 students want to participate)

Learning

  • Introductory session
  • Individual meetings with advisors
  • Plenary meetings
  • Team-based software project

Examination

  • Active participation
  • Short intermediate presentation (15 min per team)
  • Long final presentation (20 min per team)
  • Report (6 pages)
  • Implementation (efficiency, effectiveness, and extensions)

Dates

Please find the maintained schedule on the course page.
