Distributed Duplicate Detection (Winter Semester 2016/2017)
Lecturer: Prof. Dr. Felix Naumann
- Weekly Hours: 4
- Credits: 6
- Enrolment Deadline: 28.10.2016
- Teaching Form: Project seminar
- Enrolment Type: Compulsory Elective Module
- Maximum number of participants: 12
Programs & Modules
- Business Process & Enterprise Technologies
- Operating Systems & Information Systems Technology
Duplicates in datasets are multiple, different representations of the same real-world object. Detecting them is usually complex. Huge datasets and the online nature of modern systems demand even more processing power, with more nodes participating to handle the incoming data. In this seminar, we want to explore existing techniques for distributed duplicate detection and to re-implement, extend, and evaluate them. The teams will use popular frameworks such as Apache Spark or Apache Flink to ease the implementation.
A naive approach could be to replicate all m records to n nodes and then split the pairs that should be compared uniformly across the nodes. With m records there are m(m-1)/2 candidate pairs, so every node has about m(m-1)/(2n) pairs to compare.
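The naive approach can be illustrated with a minimal single-process sketch. It assumes that every node holds all records, that records are addressed by index, and that pairs are dealt out round-robin. The function name `pairs_for_node` and the round-robin rule are illustrative choices, not part of the course material. Because the full pair list contains m(m-1)/2 entries, each node receives roughly m(m-1)/(2n) of them.

```python
from itertools import combinations


def pairs_for_node(num_records, num_nodes, node_id):
    """Return the record-index pairs that node `node_id` has to compare.

    Every node enumerates the same global list of unordered pairs (i, j)
    with i < j and keeps every `num_nodes`-th pair (round-robin split).
    """
    return [
        pair
        for k, pair in enumerate(combinations(range(num_records), 2))
        if k % num_nodes == node_id
    ]


# Hypothetical example: m = 6 records replicated to n = 3 nodes.
m, n = 6, 3
assignment = [pairs_for_node(m, n, node) for node in range(n)]
for node, pairs in enumerate(assignment):
    print(f"node {node}: {len(pairs)} pairs -> {pairs}")
```

The split balances the number of comparisons, but each node still has to store all m records, which is why the approach is only a baseline.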
Participating students will be provided with relevant literature on algorithms that approach this problem in different ways; they may also propose their own papers. We are looking for at most 12 students, who will form groups. Each group will be assigned one of these papers, with the task of implementing and extending it. The algorithms are evaluated on their efficiency (i.e., speed) and on their effectiveness, measured by precision, recall, and F-measure. The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.
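As a rough illustration of the effectiveness metrics named above, the following sketch computes precision, recall, and F-measure over sets of record pairs. It assumes that a gold standard of true duplicate pairs is available and that the balanced F1 variant is meant; the function name `evaluate` and the sample pairs are hypothetical.

```python
def evaluate(found_pairs, true_pairs):
    """Compute precision, recall and F1 for detected duplicate pairs.

    Pairs are treated as unordered, so (1, 2) and (2, 1) count as the same.
    """
    found = {frozenset(p) for p in found_pairs}
    truth = {frozenset(p) for p in true_pairs}
    true_positives = len(found & truth)

    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 3 reported pairs, 4 true duplicate pairs, 2 in common.
reported = [(1, 2), (3, 4), (5, 6)]
gold = [(2, 1), (3, 4), (7, 8), (9, 10)]
precision, recall, f1 = evaluate(reported, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```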
Desired prerequisites: the Information Integration course or the Data Profiling and Data Cleansing course (students who have taken one of them are given priority if more than 12 students want to participate)
- Introductory session
- Individual meetings with advisors
- Plenary meetings
- Team-based software project
- Active participation
- Short intermediate presentation (15min per team)
- Long final presentation (20min per team)
- Report (6 pages)
- Implementation (efficiency, effectiveness, and extensions)
Please find the maintained schedule on the course page.