Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Incremental Duplicate Detection

Duplicates in datasets are multiple, different representations of same real-world object. Their detection is usually complex. Huge datasets and the online nature of current modern systems even demand for an incremental detection on new incoming data. In this seminar, we want to explore existing techniques for incremental duplicate detection, re-implement them, extend them, and evaluate them.

A naive approach of comparing a new record with all existent records would mean O(n) complexity, which in real-time systems is not feasible. Therefore the students who participate, will be provided with relevant literature that propose systems with advanced indexing techniques for handling this problem.

We are looking for 6 groups of 2 students each, where each group will be assigned a particular paper to implement and extend. The evaluation metric is how efficient(=fast) the algorithms are. The final grade does not depend on the rank of the teams but on the ideas, the implementation, the evaluation, and the presentation.

Objectives

Explore existing techniques for incremental duplicate detection, re-implement them, extend them, and evaluate them.

Tasks

  • Bring your own dataset (including a ground truth) in CSV (semicolon separated) and your similarity measure, which will be shared with the rest of the teams.
  • Read relevant literature to the assigned paper.
  • Evaluate your implementation, regarding the correctness based on the ground truths provided from all the datasets.
  • Active participation in the meetings.
  • Implement the algorithm mentioned in the paper and the possible extensions, using Java JDK 8.

Deliverables

  • input dataset (including a ground truth) in CSV (semicolon separated), its accompanying similarity measure and a short explanation on them.
  • an intermediate presentation demonstrating your first insights on your assigned topic and your progress.
  • a final presentation demonstrating your algorithm and solutions
  • a report (6-8 pages) on your solutions. The report should document, discuss and explain the algorithm your extensions, implementation and finally evaluation.
  • well documented source code, implemented in java, following the given interface, uploaded in the suggested repository.

Time schedule

Schedule of Incremental Duplicate Detection
DateTopicMaterial
To participate please join us at the first meeting on April 11, 2016 in A.1-2.
April 11Topic introductionslides
April 15Submission of application and topic wishlist by 4pm
April 15Notification of participation and topic
April 18team building / assignment
April 25individual consultation *1
May 2group meetings *2
May 9individual consultation
May 16- National Holiday -
May 23group meetings
May 30Intermediate presentation (15+5 min)
June 6individual consultation
June 13group meetings
June 20individual consultation
June 27group meetings
July 4individual consultation
July 11group meetings
July 25Final presentation (20+10 min)
- Exam period -
August 15Final report (6-8 pages)
|*1: at the office of advisors, occasional
|*2: at the classroom, mandatory

Grading Process

  • 6 LP
  • 10% Intermediate presentation (15+5 min)
  • 20% Final presentation (20+10 min)
  • 25% Final report (6-8 pages)
  • 10% Participation in the meetings
  • 35% Algorithms & Implementation

Literature

Papers for Presentation

[1] Banda Ramadan, Peter Christen, Huizhi Liang, and Ross W. Gayler. 2015. Dynamic Sorted Neighborhood Indexing for Real-Time Entity Resolution. J. Data and Information Quality 6, 4, Article 15 (October 2015), 29 pages. DOI=http://dx.doi.org/10.1145/2816821

[2] Li, Shouheng, Liang, Huizhi (Elly), Ramadan, Banda. (2013). Two Stage Similarity-aware Indexing for Large-scale Real-time Entity Resolution, 146.

[3] Indrajit Bhattacharya, Lise Getoor, and Louis Licamele. 2006. Query-time entity resolution. InProceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 529-534. DOI=http://dx.doi.org/10.1145/1150402.1150463

[4] Anja Gruenheid, Xin Luna Dong, and Divesh Srivastava. 2014. Incremental record linkage. Proc. VLDB Endow.7, 9 (May 2014), 697-708. DOI=http://dx.doi.org/10.14778/2732939.2732943

[5] Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal 18, 1 (January 2009), 255-276. DOI=http://dx.doi.org/10.1007/s00778-008-0098-x

[6] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data (SIGMOD '03). ACM, New York, NY, USA, 313-324. DOI=http://dx.doi.org/10.1145/872757.872796

[7] Parag Singla and Pedro Domingos. 2006. Entity Resolution with Markov Logic. In Proceedings of the Sixth International Conference on Data Mining (ICDM '06). IEEE Computer Society, Washington, DC, USA, 572-582. DOI=http://dx.doi.org/10.1109/ICDM.2006.65

Further reading

[8] P. Christen, "A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication," in IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 9, pp. 1537-1555, Sept. 2012. doi: 10.1109/TKDE.2011.127 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5887335&isnumber=6247403

Relevant courses

Data Profiling and Data Cleansing (Lecture, Master)