Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Distributed Duplicate Detection

Duplicates in datasets are multiple, different representations of same real-world object. Their detection is usually complex. Huge datasets and the online nature of current modern systems even demand for more processing power with more nodes participating to handle the incoming data. In this seminar, we want to explore existing techniques for distributed duplicate detection, re-implement, extend and evaluate them. Popular frameworks like Apache Spark or Apache Flink, will be selected to be used by the teams to ease the implementation.

A naive approach could be to replicate all m records to n nodes, and then split the pairs that should be compared uniformly to all the nodes, so that every node has m/n pairs to compare.

The students who participate, will be provided with relevant literature or they can propose their own, for algorithms that handle this problem, with different approaches. We are looking for maximum 12 students that will form groups, where each group will be assigned one of the latter papers, with the task of implementing and extending it. The evaluation metrics include how efficient(=fast) the algorithms are, and how effective against the metrics of precision, recall and f-measure. The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.

Objectives

Explore existing techniques for distributed duplicate detection, re-implement, extend and evaluate them.

Tasks

  • Bring your own dataset (including a ground truth) in TSV (tab-separated), which will be shared with the rest of the teams. Alternatively, you can use the datasets provided by the advisors.
  • Read relevant literature to the assigned paper.
  • Evaluate your implementation, regarding the correctness based on the ground truths provided from all the datasets.
  • Active participation in the meetings.
  • Implement the algorithm mentioned in the paper and the possible extensions, using Java JDK 8.

Deliverables

  • In case you provided a dataset: the dataset (including a ground truth) in TSV (tab-separated), its accompanying similarity measure and a short explanation on them.
  • An intermediate presentation demonstrating your first insights on your assigned topic and your progress.
  • A final presentation demonstrating your algorithm and solutions
  • A report (6-8 pages) on your solutions. The report should document, discuss and explain the algorithm your extensions, implementation and finally evaluation.
  • Well documented source code, implemented in java, following the given interface, uploaded in the suggested repository.

Time schedule

Schedule of Distributed Duplicate Detection
DateTopicMaterial
To participate please join us at the first meeting on October 17, 2016 in D-E 9/10 at 11:00 - 12:30.
October 17Topic introductionpdf pptx
October 21Submission of application and topic wishlist by 4pm
October 21Notification of participation and topic
October 24Team building / assignment
October 31- Holidays -
November 7Individual consultation *1
November 14Group meetings *2
November 21Individual consultation
November 28Group meetings
December 5Individual consultation
December 12Intermediate presentation (15+5 min)
December 19Group meetings
December 26- Christmas -
January 2Individual consultation
January 9Group meetings
January 16Individual consultation
January 23Group meetings
January 30Individual consultation
February 6Final presentation (20+10 min)
- Exam period -
March 15Final report (6-8 pages)
|*1: at the office of advisors, occasional
|*2: at the classroom, mandatory

Grading Process

  • 6 LP
  • 10% Intermediate presentation (15+5 min)
  • 20% Final presentation (20+10 min)
  • 25% Final report (6-8 pages)
  • 10% Participation in the meetings
  • 35% Algorithms & Implementation

Literature

  1. Lars Kolb, Andreas Thor, and Erhard Rahm. 2012. Dedoop: efficient deduplication with Hadoop. Proc. VLDB Endow. 5, 12 (August 2012), 1878-1881. DOI=http://dx.doi.org/10.14778/2367502.2367527
  2. Xu Chu, Ihab F. Ilyas, and Paraschos Koutris. 2016. Distributed data deduplication. Proc. VLDB Endow. 9, 11 (July 2016), 864-875. DOI: http://dx.doi.org/10.14778/2983200.2983203
  3. Akash Das Sarma, Yeye He, and Surajit Chaudhuri. 2014. ClusterJoin: a similarity joins framework using map-reduce. Proc. VLDB Endow. 7, 12 (August 2014), 1059-1070. DOI=http://dx.doi.org/10.14778/2732977.2732981
  4. Anja Gruenheid, Xin Luna Dong, and Divesh Srivastava. 2014. Incremental record linkage. Proc. VLDB Endow. 7, 9 (May 2014), 697-708. DOI=http://dx.doi.org/10.14778/2732939.2732943
  5. Li, Shouheng, Liang, Huizhi (Elly), Ramadan, Banda. 2013, Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution, 146.
  6. O. Benjelloun et al., "D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution," 27th International Conference on Distributed Computing Systems (ICDCS '07), Toronto, ON, 2007, pp. 37-37. DOI: 10.1109/ICDCS.2007.96
  7. P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," in IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537-1555, Sept. 2012. doi: 10.1109/TKDE.2011.127 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5887335&isnumber=6247403

Relevant courses

Data Profiling and Data Cleansing (Lecture, Master)