Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Arvid Heise

Former PhD student

Email: Arvid Heise

Research Activities

  • Cloud Computing
  • Parallel and Declarative Data Cleansing
  • MapReduce with Hadoop

Publications

Progressive Duplicate Detection

Thorsten Papenbrock, Arvid Heise, Felix Naumann
IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 27(5):1316-1329, 2015

DOI: http://doi.ieeecomputersociety.org/10.1109/TKDE.2014.2359666

Abstract:

Duplicate detection is the process of identifying multiple representations of the same real-world entity. Today, duplicate detection methods need to process ever-larger datasets in ever-shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work.
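
To illustrate the progressive idea described above, the sketch below sorts records by a key and then compares neighbors at increasing rank distance, so the most promising pairs are checked and reported first and the run can be stopped at any time with a useful partial result. This is only a minimal sketch in the spirit of a progressive sorted-neighborhood approach; the function names, the SequenceMatcher-based similarity, the threshold, and the toy data are illustrative assumptions, not the algorithms evaluated in the paper.

from difflib import SequenceMatcher

def similarity(a, b):
    # Illustrative string similarity; the paper uses domain-specific measures.
    return SequenceMatcher(None, a, b).ratio()

def progressive_pairs(records, key, max_window, threshold):
    # Sort once, then compare neighbors at rank distance 1, 2, ... so the
    # closest (most duplicate-prone) pairs are checked and reported first.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for dist in range(1, max_window):
        for pos in range(len(order) - dist):
            i, j = order[pos], order[pos + dist]
            if similarity(records[i], records[j]) >= threshold:
                yield i, j  # emitted as soon as found; safe to stop early

people = ["Jon Smith", "John Smith", "Jane Doe", "J. Smith", "Jane Do"]
for i, j in progressive_pairs(people, key=str.lower, max_window=4, threshold=0.8):
    print(people[i], "<->", people[j])

Because the generator yields duplicates as they are found, interrupting it after any amount of time keeps all pairs reported so far, which is the "gain within the time available" that the abstract refers to.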

Keywords:

duplicate detection, data cleansing, hpi

BibTeX file

@article{progressive_dude2015,
author = { Thorsten Papenbrock, Arvid Heise, Felix Naumann },
title = { Progressive Duplicate Detection },
journal = { IEEE Transactions on Knowledge and Data Engineering (TKDE) },
year = { 2015 },
volume = { 27 },
number = { 5 },
pages = { 1316--1329 },
abstract = { Duplicate detection is the process of identifying multiple representations of the same real-world entity. Today, duplicate detection methods need to process ever-larger datasets in ever-shorter time: maintaining the quality of a dataset becomes increasingly difficult. We present two novel, progressive duplicate detection algorithms that significantly increase the efficiency of finding duplicates when the execution time is limited: they maximize the gain of the overall process within the time available by reporting most results much earlier than traditional approaches. Comprehensive experiments show that our progressive algorithms can double the efficiency over time of traditional duplicate detection and significantly improve upon related work. },
keywords = { duplicate detection, data cleansing, hpi },
publisher = { IEEE Computer Society },
issn = { 1041-4347 }
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
