Hasso-Plattner-Institut20 Jahre HPI
Hasso-Plattner-Institut20 Jahre HPI

Large Scale Duplicate Detection (Sommersemester 2013)

Dozent: Prof. Dr. Felix Naumann (Information Systems)

Allgemeine Information

  • Semesterwochenstunden: 4
  • ECTS: 6
  • Benotet: Ja
  • Einschreibefrist: 10.2.2013 - 30.4.2013
  • Lehrform: Seminar
  • Belegungsart: Wahlpflichtmodul
  • Maximale Teilnehmerzahl: 8

Studiengänge & Module

IT-Systems Engineering BA
  • Business Process & Enterprise Technologies
  • Operating Systems & Information Systems Technology


In this seminar, we want to find duplicate entries in the CD database freedb.org using the Map/Reduce platform Stratosphere.

We will use a dataset of ~2 million CDs with ~15 million tracks. Naive duplicate detection would result in roughly 4 trillion comparisons, which is computionally infeasible.

Hence, the goal of the seminar is to find more sophisticated algorithms that avoid unnecessary comparisons. In particular, all algorithms should run in a parallel fashion on our cluster to speed up the duplicate detection process by adding more computers to the cluster.

Please refer to the course Web site of our chair for additional informations and slides.


There are no special prerequisites for this course. All students are expected to independently study Map/Reduce and duplicate detection from literature. Thus, prior knowledge in either topic will help to focus faster on the actual challenge.

Students that already participated in prior Map/Reduce courses during their Master study may still apply but we prefer other students in a potential selection process.


[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.

[2] Mauricio A. Hernández and Salvatore J. Stolfo. The Merge/Purge Problem for Large Databases. SIGMOD Record 24, 1995.

[3] Lars Kolb, Andreas Thor, and Erhard Rahm. Multi-pass Sorted Neighborhood Blocking with MapReduce. Computer Science - Research and Development 27, 2012.

[4] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the International Conference on Management of data (SIGMOD), 2010. 

[5] Ahmed Metwally and Christos Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. Proceedings of the Very Large Database Endowment (PVLDB) 5, 2012.

[6] http://www.stratosphere.eu/

Lern- und Lehrformen

We are looking for 4 groups of 2 students each, which compete for the best results in a similar fashion to the Data Cleansing Workshops. However, the evaluation metric is how efficient(=fast) the algorithms are, given a minimum effectiveness(=quality, how many duplicates are correctly identified). The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.

Unless we have students that prefer having the seminar in English, it will be held in German.


15% Paper presentation (15+5 min)
15% Intermediate presentation (15+5 min)
20% Final presentation (30+10 min)
25% Final report (6-8 pages)
10% Participation in seminars and consultations
15% Algorithms & Implementation


The seminar starts on Monday 8th. In case of a selection process, students will be notified by the end of the week.

NEU: ab 29.04. neuer Termin montags, 11:00-12:30, Raum A-1.1