Duplicate Detection
Description
Duplicate detection is the task of finding multiple representations of the same real-world entity within a dataset. The task is difficult because representations of the same entity may differ slightly, so a similarity measure must be defined to compare pairs of records. A further difficulty is the high volume of many datasets, which makes a pairwise comparison of all records infeasible.
In this seminar, we discuss several papers covering different aspects of duplicate detection.
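The two difficulties named above can be illustrated with a minimal sketch. It assumes a generic string similarity (Python's `difflib` ratio) and a very simple blocking key (the first letter of the record); the toy records, the threshold of 0.85, and all function names are hypothetical and not taken from any of the seminar papers.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records, for illustration only.
records = [
    "John Smith, Berlin",
    "Jon Smith, Berlin",
    "Mary Jones, Potsdam",
    "Marie Jones, Potsdam",
    "Peter Miller, Hamburg",
]

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib's ratio is an illustrative choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def all_pairs(items):
    """Naive candidate generation: n * (n - 1) / 2 index pairs."""
    return list(combinations(range(len(items)), 2))

def blocked_pairs(items, key=lambda rec: rec[0].lower()):
    """Candidate pairs restricted to records sharing a blocking key.

    The default key (first letter) is an assumption made for this sketch.
    """
    blocks = defaultdict(list)
    for index, rec in enumerate(items):
        blocks[key(rec)].append(index)
    return [pair for ids in blocks.values() for pair in combinations(ids, 2)]

def find_duplicates(items, pairs, threshold=0.85):
    """Keep the candidate pairs whose similarity reaches the threshold."""
    return [(i, j) for i, j in pairs if similarity(items[i], items[j]) >= threshold]

naive = all_pairs(records)
blocked = blocked_pairs(records)
duplicates = find_duplicates(records, blocked)

print(len(naive), "comparisons without blocking")
print(len(blocked), "comparisons with blocking")
print("duplicate candidates:", duplicates)
```

Blocking reduces the number of comparisons, but it can miss duplicates whose blocking keys differ, for example when the typo is in the first letter. Trade-offs of this kind are a recurring theme in the papers listed below.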
Topics
- Entity resolution with iterative blocking
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina: Entity resolution with iterative blocking. SIGMOD Conference 2009: 219-232
- Swoosh: a generic approach to entity resolution
Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.
- Evaluating Entity Resolution Results
Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.
- Entity Resolution with Evolving Rules
Whang, Steven Euijong and Garcia-Molina, Hector (2010) Entity Resolution with Evolving Rules. In: PVLDB, September 13-17, 2010, Singapore.
- Privacy Preserving Schema and Data Matching
Monica Scannapieco, Elisa Bertino, Ilya Figotin, Ahmed Elmagarmid: Privacy Preserving Schema and Data Matching, SIGMOD Conference 2007.
- Anonymizing Classification Data for Privacy Preservation
Benjamin C.M. Fung, Ke Wang, Philip S. Yu: Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
- Efficient Duplicate Record Detection Based on Similarity Estimation
Mohan Li, Hongzhi Wang, Jianzhong Li and Hong Gao: Efficient Duplicate Record Detection Based on Similarity Estimation, Web-Age Information Management, Lecture Notes in Computer Science, 2010, Volume 6184/2010, 595-607.
- A strategy for allowing meaningful and comparable scores in approximate matching
Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura: A strategy for allowing meaningful and comparable scores in approximate matching, CIKM 2007, December 2009, Pages 673-689
- Creating probabilistic databases from duplicated data
Oktie Hassanzadeh, Renée J. Miller: Creating probabilistic databases from duplicated data, The VLDB Journal (The International Journal on Very Large Data Bases), v.18 n.5, p.1141-1166, October 2009
- Similarity-aware indexing for real-time entity resolution
Peter Christen, Ross Gayler and David Hawking: Similarity-aware indexing for real-time entity resolution, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, November 2009. Extended version available as Technical Report.
Type of Lecture
- Master's seminar for up to 10 students (no teams)
- 3 graded credit points (Leistungspunkte)
Requirements
The duplicate detection seminar has no prerequisites.
Registration
The topics will be presented at the first session on Tuesday, 19.10.2010.
To register, please send an email to Uwe Draisbach by Monday, 25.10.2010. The email should list, in order of preference, the three topics you would most like to work on.
Grading process
- Active participation in all seminar sessions
- At least one consultation each for the presentation and the written summary
- Presentation at the end of the semester
- Written summary and discussion of the paper (6-8 pages, not counting cover sheet, table of contents, and references), prepared with the LaTeX template
Dates
The seminar is scheduled every Tuesday at 15:15. Please check this website regularly for any updates.
Consultations take place during the seminar timeslots. Please register in advance. The mandatory consultation must take place at least 2 weeks before your presentation.
The final paper is due on 28.02.2011.
Literature
- Felix Naumann & Melanie Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management #3, 2010.
- Peter Christen and Karl Goiser. Quality and Complexity Measures for Data Linkage and Deduplication. Quality Measures in Data Mining, Volume 43, 2007.