Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Incremental Duplicate Detection

Databases play an important role in IT-based companies nowadays, and many industries and organizations rely on the accuracy of the data in their databases to perform their operations. Unfortunately, the data are not always clean. For instance, real-world entities may have several different representations, caused by erroneous data entry, data evolution, data integration, and similar processes. This in turn introduces errors, so-called duplicates, into the databases. Deduplication aims to identify the different representations of real-world entities in a database. The focus of this work is incremental deduplication, a more recent topic within deduplication. Deduplication is a time-intensive process, and the sheer amount of data added to an already deduplicated database makes it unreliable and unusable again, thereby imposing extra costs on industry. Incremental record deduplication addresses this problem and keeps databases with many transactions clean and up to date.
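To illustrate the idea of incremental deduplication at a conceptual level, the following minimal Python sketch compares each newly arriving record only against representatives of existing duplicate clusters, rather than re-running deduplication over the entire database. This is not the approach developed in this work; the similarity measure, threshold, and record format are purely illustrative assumptions.

```python
# Conceptual sketch of incremental deduplication (illustrative assumptions,
# not the method of this work): new records are matched against cluster
# representatives of the already deduplicated database.
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """Assumed string-similarity check between two record representations."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def insert_record(clusters: list[list[str]], record: str) -> None:
    """Assign an incoming record to an existing duplicate cluster,
    or open a new singleton cluster if no representative matches."""
    for cluster in clusters:
        representative = cluster[0]   # first member stands in for the cluster
        if similar(record, representative):
            cluster.append(record)    # record duplicates this entity
            return
    clusters.append([record])         # record describes a new entity


# Already deduplicated database: one cluster per real-world entity.
clusters = [["Felix Naumann"], ["Hasso Plattner Institute"]]

# New transactions arrive and are deduplicated incrementally.
for new_record in ["F. Naumann", "Hasso-Plattner-Institut", "Jane Doe"]:
    insert_record(clusters, new_record)

print(clusters)
```

In this sketch, each new record triggers only as many comparisons as there are clusters, which hints at why incremental processing avoids repeating the full, time-intensive pairwise deduplication after every insertion.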