Hasso-Plattner-Institut
  
Prof. Dr. Felix Naumann
  
 

Description

Duplicate detection is about finding multiple representations of the same real-world entity within a dataset. This task is difficult because the representations might differ slightly, so a similarity measure must be defined to compare pairs of records. Another difficulty is the high volume that datasets might have, which makes a pairwise comparison of all records infeasible.
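
To make the two difficulties concrete, here is a minimal sketch in Python. It is not taken from any of the seminar papers: the Jaccard similarity over word tokens, the threshold of 0.5, the sample records, and the function names `jaccard`, `find_duplicates`, and `pair_count` are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty records are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def find_duplicates(records: list[str], threshold: float = 0.5) -> list[tuple[int, int]]:
    """Naively compare every pair of records and return the index pairs
    whose similarity reaches the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]


def pair_count(n: int) -> int:
    """Number of comparisons a naive pairwise approach needs for n records."""
    return n * (n - 1) // 2


# Hypothetical sample records; the first two describe the same person.
records = [
    "John Smith Berlin",
    "J. Smith Berlin",
    "Mary Jones Potsdam",
]
duplicates = find_duplicates(records)
print(duplicates)
print(pair_count(1_000_000))
```

The first two records share two of four distinct tokens, so they reach the assumed threshold even though they are not identical. The pair count grows quadratically with the number of records, which is why the naive loop in `find_duplicates` does not scale to large datasets.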

In this seminar, we will discuss several papers covering different aspects of duplicate detection.

Topics

  • Entity resolution with iterative blocking
    Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina: Entity resolution with iterative blocking. SIGMOD Conference 2009: 219-232
  • Swoosh: a generic approach to entity resolution
    Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.
  • Evaluating Entity Resolution Results
    Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.
  • Entity Resolution with Evolving Rules
    Whang, Steven Euijong and Garcia-Molina, Hector (2010) Entity Resolution with Evolving Rules. In: PVLDB, September 13-17, 2010, Singapore.
  • Privacy Preserving Schema and Data Matching
    Monica Scannapieco, Elisa Bertino, Ilya Figotin, Ahmed Elmagarmid: Privacy Preserving Schema and Data Matching, SIGMOD Conference 2007.
  • Anonymizing Classification Data for Privacy Preservation
    Benjamin C.M. Fung, Ke Wang, Philip S. Yu: Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
  • Efficient Duplicate Record Detection Based on Similarity Estimation
    Mohan Li, Hongzhi Wang, Jianzhong Li and Hong Gao: Efficient Duplicate Record Detection Based on Similarity Estimation, Web-Age Information Management, Lecture Notes in Computer Science, 2010, Volume 6184/2010, 595-607.
  • A strategy for allowing meaningful and comparable scores in approximate matching
    Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura: A strategy for allowing meaningful and comparable scores in approximate matching, CIKM 2007, December 2009, Pages 673-689
  • Creating probabilistic databases from duplicated data
    Oktie Hassanzadeh, Renée J. Miller: Creating probabilistic databases from duplicated data, The VLDB Journal, v.18 n.5, p.1141-1166, October 2009
  • Similarity-aware indexing for real-time entity resolution
    Peter Christen, Ross Gayler and David Hawking: Similarity-aware indexing for real-time entity resolution, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, November 2009. Extended version available as Technical Report.

Type of Lecture

  • Master's seminar for up to 10 students (no teams)
  • 3 credit points (graded)

Requirements

The duplicate detection seminar has no prerequisites.

Registration

The topics will be presented at the first session on Tuesday, 19.10.2010.

To register, please send an email to Uwe Draisbach by Monday, 25.10.2010. The email should list your top three topics, ranked by preference.

Grading process

  • Active participation in all seminar sessions
  • At least one consultation each for the presentation and the written summary
  • Presentation at the end of the semester
  • Written summary and discussion of the paper (6-8 pages, not counting cover sheet, table of contents, and references) using the LaTeX template

Dates

The seminar is scheduled every Tuesday at 15:15. Please check this website regularly for any updates.

Consultations take place during the seminar time slots. Please register in advance. Mandatory consultations are due 2 weeks before the presentation.

The final paper submission deadline is 28.02.2011.

| Date | Room | Topic | Presenter | Slides |
|---|---|---|---|---|
| 19.10.2010 | HS 3 | Seminar introduction and presentation of the topics | Naumann, Draisbach, Vogel, Lange, Heise | |
| 25.10.2010 | | Registration deadline (Monday!) | | |
| 26.10.2010 | | n/a | | |
| 02.11.2010 | | n/a | | |
| 09.11.2010 | SNB-E.9 | Scientific working, reading, and presenting | Felix Naumann | |
| 16.11.2010 | | n/a | | |
| 23.11.2010 | H-2.58 | Training-based approaches to object matching | Hanna Köpcke (Uni. Leipzig) | |
| 30.11.2010 | | Consultations | | |
| 07.12.2010 | | Consultations | | |
| 14.12.2010 | | Consultations | | |
| 21.12.2010 | | Christmas break | | |
| 28.12.2010 | | Christmas break | | |
| 04.01.2011 | H-2.58 | Swoosh: a generic approach to entity resolution; Entity Resolution with Evolving Rules | Johannes Dyck; Eyk Kny | |
| 11.01.2011; 14.01.2011 16:00-17:30 | H-2.58 | Creating probabilistic databases from duplicated data; Entity resolution with iterative blocking | Dustin Beyer; Florian Thomas | |
| 18.01.2011 | H-2.58 | Efficient Duplicate Record Detection Based on Similarity Estimation; Similarity-aware indexing for real-time entity resolution | Mathias Grauer; Sven Viehmeier | |
| 25.01.2011 | H-2.58 | A strategy for allowing meaningful and comparable scores in approximate matching; Evaluating Entity Resolution Results | Toni Grütze; Cindy Fähnrich | |
| 01.02.2011 | H-2.58 | Privacy Preserving Schema and Data Matching; Anonymizing Classification Data for Privacy Preservation | Egidijus Gircys; Benedikt Forchhammer | |
| 08.02.2011 | | Consultations | | |
| 28.02.2011 | | Final paper submission | | |

Literature