Tobias Vogel

Hasso-Plattner-Institut
für Softwaresystemtechnik
Prof.-Dr.-Helmert-Straße 2-3
D-14482 Potsdam, Germany

Phone: +49 331 5509 292
Fax: +49 331 5509 287
Room: E-2.02.2
E-Mail: T. Vogel

FOAF Description


Research

  • Data Quality Services for Duplicate Detection

Research Areas

  • Duplicate Detection
  • Data Cleaning

Projects

Running Projects

Finished Projects

  • PoSR (Potsdam Service Repository)
  • iDuDe (Duplicate Detection for iOS)

Teaching

  • WS 2009/2010: Master's Seminar "Emerging Web Services Technologies"
  • WS 2009/2010: Workshop "Duplikaterkennung" (Duplicate Detection)
  • SS 2010: Master's Seminar "Similarity Search Algorithms"

Activities

  • Local Arrangements Chair for ICIQ 2009

Instance-based "one-to-some" Assignment of Similarity Measures to Attributes

Tobias Vogel, Felix Naumann
In Proceedings of the 19th International Conference on Cooperative Information Systems (CoopIS), 2011

Abstract:

Data quality is a key factor for economic success. It is usually defined as a set of properties of data, such as completeness, accessibility, relevance, and conciseness. The latter includes the absence of multiple representations of the same real-world object. To avoid such duplicates, there is a wide range of commercial products and customized self-coded software. These programs can be quite expensive, both in acquisition and in maintenance. In particular, small and medium-sized companies cannot afford these tools. Moreover, it is difficult to set up and tune all necessary parameters in these programs. Recently, web-based applications for duplicate detection have emerged. However, they are not easy to integrate into the local IT landscape and require much manual configuration effort. With DAQS (Data Quality as a Service) we present a novel approach to support duplicate detection. The approach features (1) minimal required user interaction and (2) self-configuration for the provided input data. To this end, each data cleansing task is classified to find out which metadata is available. Next, similarity measures are automatically assigned to the provided records' attributes and a duplicate detection process is carried out. In this paper we introduce a novel matching approach, called one-to-some or 1:k assignment, to assign similarity measures to attributes. We performed an extensive evaluation on a large training corpus and ten test datasets of address data and achieved promising results.
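A minimal sketch of such a "one-to-some" (1:k) assignment, assuming the suitability of each similarity measure for each attribute has already been scored (e.g., by the instance-based classification mentioned above). This is not the DAQS implementation: it simply reduces the 1:k problem to an ordinary 1:1 assignment by replicating every measure k times and solving that with SciPy's Hungarian-method solver; all names and numbers are illustrative.

    # Illustrative sketch, not the authors' code: assign one similarity measure
    # to every attribute while using each measure for at most k attributes.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def one_to_some_assignment(suitability, k):
        """suitability: (n_attributes x n_measures) array of scores in [0, 1].
        Returns a list mapping each attribute index to a measure index."""
        n_attr, n_meas = suitability.shape
        if n_meas * k < n_attr:
            raise ValueError("k is too small to cover all attributes")
        # Replicate each measure column k times, so the 1:1 solver may pick
        # a measure up to k times.
        expanded = np.repeat(suitability, k, axis=1)   # n_attr x (n_meas * k)
        # linear_sum_assignment minimizes cost, so negate to maximize suitability.
        rows, cols = linear_sum_assignment(-expanded)
        assignment = [None] * n_attr
        for attr, col in zip(rows, cols):
            assignment[attr] = col // k                # map replica back to its measure
        return assignment

    # Hypothetical example: 4 address attributes, 3 similarity measures, k = 2.
    scores = np.array([
        [0.9, 0.2, 0.1],   # e.g., "name"   fits measure 0 best
        [0.8, 0.3, 0.2],   # e.g., "street" also prefers measure 0
        [0.1, 0.7, 0.4],   # e.g., "zip"
        [0.2, 0.3, 0.9],   # e.g., "phone"
    ])
    print(one_to_some_assignment(scores, k=2))         # prints [0, 0, 1, 2]

With k = 1 this degenerates to the classical one-to-one assignment; how the suitability scores are derived and how k is chosen in DAQS are described in the paper itself.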

BibTeX file

@inproceedings{Instancebasedonetosomeassignmentofsimilaritymeasurestoattributescoopis2011vogelnaumann,
  author    = {Tobias Vogel and Felix Naumann},
  title     = {Instance-based "one-to-some" Assignment of Similarity Measures to Attributes},
  booktitle = {Proceedings of the 19th International Conference on Cooperative Information Systems (CoopIS)},
  year      = {2011}
}

Copyright Notice

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

last change: Fri, 17 Apr 2015 11:16:24 +0200

Master's Theses

  • Duplicate Detection Across Structured And Unstructured Data - David Sonnabend (http://www.hpi.uni-potsdam.de/fileadmin/user_upload/fachgebiete/naumann/arbeiten/Thema_Masterarbeit.pdf)
  • Duplicate Detection with CrowdSourcing (e.g. Amazon's Mechanical Turk) - David Wenzel