Data Preparation for Duplicate Detection

Content

Authors
Abstract
Source code
Datasets

Authors

Ioannis Koumarelas, Lan Jiang, Felix Naumann

Abstract

Data errors represent a major issue in most application workflows. Before any important task can take place, a certain level of data quality has to be guaranteed by eliminating the different errors that may appear in the data. Typically, most of these errors are fixed with data preparation methods, such as whitespace removal. However, the particular error of duplicate records, where multiple records refer to the same entity, is usually eliminated independently with specialized techniques. Our work is the first to bring these two areas together by systematically applying data preparation operations prior to performing duplicate detection.

Our workflow can be summarized as follows. The user provides as input a sample of the gold standard, the actual dataset, and some minor metadata to exclude or include domain-specific data preparations, such as address normalization. The preparation selection then operates in two consecutive phases. First, to vastly reduce the search space, ineffective data preparations are discarded based on whether they improve or worsen pair similarities. Second, using the remaining data preparations, an iterative leave-one-out classification process removes preparations one by one and identifies the redundant ones based on the achieved area under the precision-recall curve (AUC-PR). Using this workflow, we manage to improve the results of duplicate detection by up to 19% in AUC-PR.
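The two selection phases described above can be illustrated with a minimal Python sketch. It is not the code from the repositories: the function names, the toy preparations, and the labeled pairs are all hypothetical, a plain string similarity ranking stands in for the classifier, and average precision is used as a simple estimate of AUC-PR.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for the real pair similarity."""
    return SequenceMatcher(None, a, b).ratio()


def average_precision(scores, labels):
    """Average precision over a ranking, used here as an AUC-PR estimate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, is_dup) in enumerate(ranked, start=1):
        if is_dup:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def apply_all(functions, value):
    """Apply the given preparations to a value, in order."""
    for function in functions:
        value = function(value)
    return value


def phase_one(preps, pairs):
    """Keep preparations that pull duplicates together more than non-duplicates."""
    kept = []
    for name, prep in preps.items():
        gain = 0.0
        for a, b, is_dup in pairs:
            delta = similarity(prep(a), prep(b)) - similarity(a, b)
            gain += delta if is_dup else -delta
        if gain > 0:
            kept.append(name)
    return kept


def score(names, preps, pairs):
    """Rank all pairs by similarity after preparation and score the ranking."""
    functions = [preps[name] for name in names]
    scores = [similarity(apply_all(functions, a), apply_all(functions, b))
              for a, b, _ in pairs]
    return average_precision(scores, [is_dup for _, _, is_dup in pairs])


def phase_two(names, preps, pairs):
    """Leave-one-out: drop a preparation whenever the score does not suffer."""
    names = list(names)
    changed = True
    while changed:
        changed = False
        base = score(names, preps, pairs)
        for name in names:
            rest = [other for other in names if other != name]
            if score(rest, preps, pairs) >= base:
                names, changed = rest, True
                break
    return names


# Hypothetical candidate preparations and a hypothetical gold-standard sample.
PREPS = {
    "lowercase": str.lower,
    "strip": str.strip,
    "identity": lambda s: s,
    "first_char": lambda s: s[:1],
}
PAIRS = [
    ("Hasso Plattner", "hasso plattner", True),
    ("  Berlin", "Berlin", True),
    ("Anna Schmidt", "Anna Schmitt", False),
]

if __name__ == "__main__":
    survivors = phase_one(PREPS, PAIRS)
    print(phase_two(survivors, PREPS, PAIRS))
```

In this sketch the first phase is a cheap filter that only compares pair similarities before and after a single preparation, while the second phase re-scores the whole ranking each time one preparation is left out, which is why it is run only on the preparations that survive the first phase.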

Source code

The repositories accompanying "Data Preparation for Duplicate Detection" are available at: https://gitlab.com/data.prep.dedup

data.prep.dedup.java: The main project, written in Java, which implements the computationally expensive parts of the algorithm: https://gitlab.com/data.prep.dedup/data.prep.dedup.java
data.prep.dedup.python: A utility project, written in Python, which orchestrates the execution and assists the experimental evaluation: https://gitlab.com/data.prep.dedup/data.prep.dedup.python

Datasets

The following datasets have been used for the experimental evaluation. Dataset information, along with the available records, duplicates, and non-duplicates, is listed below:

CDDB: records | duplicates | non-duplicates
Census: records | duplicates | non-duplicates
Cora: records | duplicates | non-duplicates
Movies: due to licensing issues we cannot publish the generated data.
Hotels: provided by our industry partner, Concur: www.concur.com. Unfortunately, due to privacy issues it cannot be disclosed.
Restaurants: records | duplicates | non-duplicates