Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Large Scale Duplicate Detection

In this seminar, we want to find duplicate entries in the CD database freedb.org using the Map/Reduce platform Stratosphere.

We will use a dataset of ~2 million CDs with ~15 million tracks. Naive duplicate detection would result in roughly 4 trillion comparisons, which is computionally infeasible.

Hence, the goal of the seminar is to find more sophisticated algorithms that avoid unnecessary comparisons. In particular, all algorithms should run in a parallel fashion on our cluster to speed up the duplicate detection process by adding more computers to the cluster.

We are looking for 4 groups of 2 students each, which compete for the best results in a similar fashion to the Data Cleansing Workshops. However, the evaluation metric is how efficient(=fast) the algorithms are, given a minimum effectiveness(=quality, how many duplicates are correctly identified). The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.

Please also check the official course Web site in the lecture directory of the HPI.

Time schedule

To participate, please join us at the first meeting on April 08 2013 in ?. We will only use one timeslot except for the final presentations.

  • April 08: Topic introduction
  • April 13: Submission of topic wishlist
  • April 14: Notification
  • May 06: Paper presentations and first ideas (15+5 min)
  • June ??: Intermediate presentation (15+5 min)
  • July ??: Final presentation (30+10 min)
  • August ??: Final report (6-8 pages)
  • A few mandatory (before presentations/submission of report) and some optional consultations once per week

Slides

DateTopicSlides
April 08, 2013IntroductionPDF
April 15, 2013Map/ReducePPTXPDF
May, 2013


Paper Presentation


Multi-pass Sorted Neighborhood Blocking with MapReducePDF
CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication TasksPDF
Efficient parallel set-similarity joins using MapReducePDF

A fast approach for parallel deduplication on multicore Processors

PDF

May 27, 2013Efficient JavaPDF

Grading Process

  • 6 LP
  • 15% Paper presentation (15+5 min)
  • 15% Intermediate presentation (15+5 min)
  • 20% Final presentation (30+10 min)
  • 25% Final report (6-8 pages)
  • 10% Participation in seminars and consultations
  • 15% Algorithms & Implementation

Literature

Overview

[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.

[2] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 1998.

[3] Stratosphere Project Homepage http://www.stratosphere.eu/

Papers for Presentation

[4] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010. 

[5] Ahmed Metwally and Christos Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. Proceedings of the Very Large Database Endowment (PVLDB) 5, 2012.

[6] Lars Kolb, Andreas Thor, and Erhard Rahm. Multi-pass Sorted Neighborhood Blocking with MapReduce. Computer Science - Research and Development 27, 2012.

[7] Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya G. Parameswaran, Jeffrey D. Ullman. Fuzzy Joins Using MapReduce. In Proceedings of the International Conference on Data Engineering, 2012.

[8] Dal Bianco, Guilherme, Renata Galante, and Carlos A. Heuser. A fast approach for parallel deduplication on multicore processors. In Proceedings of the ACM Symposium on Applied Computing, 2011.

[9] Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, Philip Bohannon. CBLOCK: An Automatic Blocking Mechanism for Large-Scale De-duplication Tasks. In Proceedings of the International Conference on Information and Knowledge Management, 2012.