Duplicate Detection

Description

Duplicate detection is the task of finding multiple representations of the same real-world entity within a dataset. The task is difficult because representations of the same entity may differ slightly, so a similarity measure must be defined to compare pairs of records. A second difficulty is the high volume of many datasets, which makes a pairwise comparison of all records infeasible.
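To make the two difficulties concrete, here is a minimal sketch of the naive approach: compare every pair of records with a string similarity measure and report pairs above a threshold. The function names, the sample records, and the threshold of 0.85 are illustrative assumptions and are not taken from any of the seminar papers. The similarity measure is the generic one from Python's standard difflib module, used here only as a stand-in.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two record strings.

    Assumption: a generic character-based ratio stands in for a
    domain-specific similarity measure.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(records: list[str], threshold: float = 0.85) -> list[tuple[int, int]]:
    """Compare all pairs of records and return the index pairs whose
    similarity reaches the (hypothetical) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]


def pair_count(n: int) -> int:
    """Number of comparisons the naive approach needs for n records."""
    return n * (n - 1) // 2


records = [
    "John Smith, Berlin",
    "Jon Smith, Berlin",
    "Maria Garcia, Potsdam",
]
print(find_duplicates(records))
print(pair_count(1_000_000))
```

With three records the naive approach needs only three comparisons, but the count grows quadratically: one million records already require roughly 5 * 10^11 comparisons, which is why methods such as blocking and indexing try to avoid comparing all pairs.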

In this seminar, we will discuss several papers covering different aspects of duplicate detection.

Topics

Entity resolution with iterative blocking
Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina: Entity resolution with iterative blocking. SIGMOD Conference 2009: 219-232
Swoosh: a generic approach to entity resolution
Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.
Evaluating Entity Resolution Results
Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.
Entity Resolution with Evolving Rules
Whang, Steven Euijong and Garcia-Molina, Hector (2010) Entity Resolution with Evolving Rules. In: PVLDB, September 13-17, 2010, Singapore.
Privacy Preserving Schema and Data Matching
Monica Scannapieco, Elisa Bertino, Ilya Figotin, Ahmed Elmagarmid: Privacy Preserving Schema and Data Matching, SIGMOD Conference 2007.
Anonymizing Classification Data for Privacy Preservation
Benjamin C.M. Fung, Ke Wang, Philip S. Yu: Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
Efficient Duplicate Record Detection Based on Similarity Estimation
Mohan Li, Hongzhi Wang, Jianzhong Li and Hong Gao: Efficient Duplicate Record Detection Based on Similarity Estimation, Web-Age Information Management, Lecture Notes in Computer Science, 2010, Volume 6184/2010, 595-607.
A strategy for allowing meaningful and comparable scores in approximate matching
Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura: A strategy for allowing meaningful and comparable scores in approximate matching, CIKM 2007, December 2009, Pages 673-689
Creating probabilistic databases from duplicated data
Oktie Hassanzadeh, Renée J. Miller: Creating probabilistic databases from duplicated data, The VLDB Journal: The International Journal on Very Large Data Bases, v.18 n.5, p.1141-1166, October 2009
Similarity-aware indexing for real-time entity resolution
Peter Christen, Ross Gayler and David Hawking: Similarity-aware indexing for real-time entity resolution, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, November 2009. Extended version available as Technical Report.

Type of Lecture

Master's seminar for up to 10 students (no teams)
3 graded credit points

Requirements

The duplicate detection seminar has no prerequisites.

Registration

The topics will be presented at the first session on Tuesday, 19.10.2010.

To register, please send an email to Uwe Draisbach by Monday, 25.10.2010. The email should list the three topics you would most like to work on, ranked by preference.

Grading process

Active participation in all seminar sessions
At least one consultation each for the presentation and the written summary
Presentation at the end of the semester
Written summary and discussion of the paper (6-8 pages, not counting cover sheet, table of contents, and references), using the LaTeX template

Dates

The seminar is scheduled every Tuesday at 15:15. Please check this website regularly for any updates.

Consultations take place during the seminar time slots. Please register in advance. The mandatory consultations must take place at least two weeks before your presentation.

The final paper is due on 28.02.2011.

Date	Room	Topic	Presenter	Slides
19.10.2010	HS 3	Seminar introduction and presentation of the topics	Naumann, Draisbach, Vogel, Lange, Heise
25.10.2010		Registration deadline (Monday!)
26.10.2010		n/a
02.11.2010		n/a
09.11.2010	SNB-E.9	Scientific Working, Reading, and Presenting	Felix Naumann
16.11.2010		n/a
23.11.2010	H-2.58	Training-based Approaches to Object Matching	Hanna Köpcke (University of Leipzig)
30.11.2010		Consultations
07.12.2010		Consultations
14.12.2010		Consultations
21.12.2010		Christmas break
28.12.2010		Christmas break
04.01.2011	H-2.58	Swoosh: a generic approach to entity resolution; Entity Resolution with Evolving Rules	Johannes Dyck; Eyk Kny
14.01.2011 16:00-17:30 (moved from 11.01.2011)	H-2.58	Creating probabilistic databases from duplicated data; Entity resolution with iterative blocking	Dustin Beyer; Florian Thomas
18.01.2011	H-2.58	Efficient Duplicate Record Detection Based on Similarity Estimation; Similarity-aware indexing for real-time entity resolution	Mathias Grauer; Sven Viehmeier
25.01.2011	H-2.58	A strategy for allowing meaningful and comparable scores in approximate matching; Evaluating Entity Resolution Results	Toni Grütze; Cindy Fähnrich
01.02.2011	H-2.58	Privacy Preserving Schema and Data Matching; Anonymizing Classification Data for Privacy Preservation	Egidijus Gircys; Benedikt Forchhammer
08.02.2011		Consultations
28.02.2011		Final paper submission

Literature

Felix Naumann & Melanie Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management #3, 2010.
Peter Christen and Karl Goiser. Quality and Complexity Measures for Data Linkage and Deduplication. Quality Measures in Data Mining, Volume 43, 2007.