Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Description

Duplicate detection is the task of finding multiple representations of the same real-world entity within a dataset. The task is difficult because representations may differ slightly, so a similarity measure must be defined to compare pairs of records. A further difficulty is the high volume of many datasets, which makes a pairwise comparison of all records infeasible.

In this seminar, we discuss several papers covering different aspects of duplicate detection.

Topics

  • Entity resolution with iterative blocking
    Steven Euijong Whang, David Menestrina, Georgia Koutrika, Martin Theobald, Hector Garcia-Molina: Entity resolution with iterative blocking. SIGMOD Conference 2009: 219-232
  • Swoosh: a generic approach to entity resolution
    Benjelloun, Omar and Garcia-Molina, Hector and Menestrina, David and Su, Qi and Whang, Steven Euijong and Widom, Jennifer (2008) Swoosh: a generic approach to entity resolution. The VLDB Journal.
  • Evaluating Entity Resolution Results
    Menestrina, David and Whang, Steven Euijong and Garcia-Molina, Hector (2010) Evaluating Entity Resolution Results. In: PVLDB, September 13-17, 2010, Singapore.
  • Entity Resolution with Evolving Rules
    Whang, Steven Euijong and Garcia-Molina, Hector (2010) Entity Resolution with Evolving Rules. In: PVLDB, September 13-17, 2010, Singapore.
  • Privacy Preserving Schema and Data Matching
    Monica Scannapieco, Elisa Bertino, Ilya Figotin, Ahmed Elmagarmid: Privacy Preserving Schema and Data Matching, SIGMOD Conference 2007.
  • Anonymizing Classification Data for Privacy Preservation
    Benjamin C.M. Fung, Ke Wang, Philip S. Yu: Anonymizing Classification Data for Privacy Preservation, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 5, pp. 711-725, May 2007.
  • Efficient Duplicate Record Detection Based on Similarity Estimation
    Mohan Li, Hongzhi Wang, Jianzhong Li and Hong Gao: Efficient Duplicate Record Detection Based on Similarity Estimation, Web-Age Information Management, Lecture Notes in Computer Science, 2010, Volume 6184/2010, 595-607.
  • A strategy for allowing meaningful and comparable scores in approximate matching
    Carina F. Dorneles, Marcos Freitas Nunes, Carlos A. Heuser, Viviane P. Moreira, Altigran S. da Silva, Edleno S. de Moura: A strategy for allowing meaningful and comparable scores in approximate matching, CIKM 2007, December 2009, Pages 673-689
  • Creating probabilistic databases from duplicated data
    Oktie Hassanzadeh, Renée J. Miller: Creating probabilistic databases from duplicated data, The VLDB Journal: The International Journal on Very Large Data Bases, v.18 n.5, p.1141-1166, October 2009
  • Similarity-aware indexing for real-time entity resolution
    Peter Christen, Ross Gayler and David Hawking: Similarity-aware indexing for real-time entity resolution, Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), Hong Kong, November 2009. Extended version available as Technical Report.

Type of Lecture

  • Master's seminar for up to 10 students (no teams)
  • 3 graded credit points (benotete Leistungspunkte)

Requirements

The duplicate detection seminar has no prerequisites.

Registration

The topics will be presented at the first meeting on Tuesday, 19.10.2010.

To register, please send an email to Uwe Draisbach by Monday, 25.10.2010. The email should list the three topics you would most like to work on, ranked by preference.

Grading process (Leistungserfassung)

  • Active participation in all seminar sessions
  • At least 1 consultation each for presentation and written summary
  • Presentation at the end of the semester
  • Written summary and discussion of the paper (6-8 pages, not counting cover sheet, table of contents, and references), using the LaTeX template

Dates

The seminar is scheduled every Tuesday at 15:15. Please check this website regularly for any updates.

Consultations take place during the seminar timeslots. Please register in advance. The mandatory consultations are due 2 weeks before your presentation.

The final paper is due on 28.02.2011.

Date|Room|Topic|Presenter|Slides
19.10.2010|HS 3|Seminar introduction and presentation of the topics|Naumann, Draisbach, Vogel, Lange, Heise|
25.10.2010||Registration deadline (Monday!)||
26.10.2010||n/a||
02.11.2010||n/a||
09.11.2010|SNB-E.9|Scientific working, reading, and presenting|Felix Naumann|
16.11.2010||n/a||
23.11.2010|H-2.58|Training-based approaches to object matching|Hanna Köpcke (Uni. Leipzig)|
30.11.2010||Consultations||
07.12.2010||Consultations||
14.12.2010||Consultations||
21.12.2010||Christmas break||
28.12.2010||Christmas break||
04.01.2011|H-2.58|Swoosh: a generic approach to entity resolution; Entity Resolution with Evolving Rules|Johannes Dyck; Eyk Kny|
11.01.2011 14.01.2011 16:00-17:30|H-2.58|Creating probabilistic databases from duplicated data; Entity resolution with iterative blocking|Dustin Beyer; Florian Thomas|
18.01.2011|H-2.58|Efficient Duplicate Record Detection Based on Similarity Estimation; Similarity-aware indexing for real-time entity resolution|Mathias Grauer; Sven Viehmeier|
25.01.2011|H-2.58|A strategy for allowing meaningful and comparable scores in approximate matching; Evaluating Entity Resolution Results|Toni Grütze; Cindy Fähnrich|
01.02.2011|H-2.58|Privacy Preserving Schema and Data Matching; Anonymizing Classification Data for Privacy Preservation|Egidijus Gircys; Benedikt Forchhammer|
08.02.2011||Consultations||
28.02.2011||Final paper submission||

Literature