Large Scale Duplicate Detection

In this seminar, we want to find duplicate entries in the CD database freedb.org using the Map/Reduce platform Stratosphere.

We will use a dataset of ~2 million CDs with ~15 million tracks. Naive duplicate detection would result in roughly 4 trillion comparisons, which is computionally infeasible.

Hence, the goal of the seminar is to find more sophisticated algorithms that avoid unnecessary comparisons. In particular, all algorithms should run in a parallel fashion on our cluster to speed up the duplicate detection process by adding more computers to the cluster.

We are looking for 4 groups of 2 students each, which compete for the best results in a similar fashion to the Data Cleansing Workshops. However, the evaluation metric is how efficient(=fast) the algorithms are, given a minimum effectiveness(=quality, how many duplicates are correctly identified). The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.

Please also check the official course Web site in the lecture directory of the HPI.

Time schedule

To participate, please join us at the first meeting on April 08 2013 in ?. We will only use one timeslot except for the final presentations.

April 08: Topic introduction
April 13: Submission of topic wishlist
April 14: Notification
May 06: Paper presentations and first ideas (15+5 min)
June ??: Intermediate presentation (15+5 min)
July ??: Final presentation (30+10 min)
August ??: Final report (6-8 pages)
A few mandatory (before presentations/submission of report) and some optional consultations once per week

Slides

Date	Topic		Slides
April 08, 2013	Introduction		PDF
April 15, 2013	Map/Reduce		PPTX PDF
May, 2013	Paper Presentation	Multi-pass Sorted Neighborhood Blocking with MapReduce	PDF
		CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks	PDF
		Efﬁcient parallel set-similarity joins using MapReduce	PDF
		A fast approach for parallel deduplication on multicore Processors	PDF
May 27, 2013	Efficient Java		PDF

Grading Process

6 LP
15% Paper presentation (15+5 min)
15% Intermediate presentation (15+5 min)
20% Final presentation (30+10 min)
25% Final report (6-8 pages)
10% Participation in seminars and consultations
15% Algorithms & Implementation

Literature

Overview

[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.

[2] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 1998.

[3] Stratosphere Project Homepage http://www.stratosphere.eu/

Papers for Presentation

[4] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010.

[5] Ahmed Metwally and Christos Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. Proceedings of the Very Large Database Endowment (PVLDB) 5, 2012.

[6] Lars Kolb, Andreas Thor, and Erhard Rahm. Multi-pass Sorted Neighborhood Blocking with MapReduce. Computer Science - Research and Development 27, 2012.

[7] Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya G. Parameswaran, Jeffrey D. Ullman. Fuzzy Joins Using MapReduce. In Proceedings of the International Conference on Data Engineering, 2012.

[8] Dal Bianco, Guilherme, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.

[9] Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, Philip Bohannon. CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks. In Proceedings of the International Conference on Information and Knowledge Management, 2012.