MDedup - Duplicate Detection with Matching Dependencies

Content

Authors
Abstract
Source code
Datasets
Training data and evaluation results

Authors

Ioannis Koumarelas, Thorsten Papenbrock, Felix Naumann

Abstract

Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are very effective, but they are also very hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific if not dataset-specific. For many duplicate detection algorithms that are based on machine learning, it is also difficult to explain why certain duplicates have been discovered and others have not.

For these reasons, we propose a novel, rule-based and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a machine learning model to select MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting technique. Our experiments show that this approach reaches up to 80% F-measure and 99% on our evaluation datasets, which are very good numbers considering that the system is configuration-free.

Source code

The repositories that support the execution of MDedup are available at: https://gitlab.com/mdedup

mdedup: The main project, which implements the core parts of MDedup. Written in Java: https://gitlab.com/mdedup/mdedup
mdedup_utils: A utility project, which assists the experimental evaluation. Written in Python: https://gitlab.com/mdedup/mdedup_utils

Datasets

The following datasets have been used for the experimental evaluation of MDedup. Dataset information, along with the available records, duplicates, and non-duplicates, is listed below:

Amazon-Walmart: records | duplicates | non-duplicates
CDDB: records | duplicates | non-duplicates
Census: records | duplicates | non-duplicates
Cora: records | duplicates | non-duplicates
DBLP-Scholar: records | duplicates | non-duplicates
Hotels: provided by our industry partner, Concur: www.concur.com. Unfortunately, the dataset cannot be disclosed due to privacy issues. However, we do provide the discovered matching dependencies (MDs) and matching dependency combinations (MDCs).
NCVoters: records | duplicates | non-duplicates
Restaurants: records | duplicates | non-duplicates

Training data and evaluation results

The following matching dependency combinations (MDCs) were produced by the training pipeline, as discussed in the paper. They are used to support the discovery of MDCs in a new dataset (see MDC Prediction in the mdedup project).

Matching dependencies (MDs): discovered by MDDis.
Matching dependency combinations (MDCs):
- Training pipeline: "Selection", "Expansion", "Exploration" (where features are generated for MDCs of Selection and Exploration). The MDCs of the exploration phase are used to train the regression model.
- Application pipeline: We also provide the MDCs of the application pipeline, marked with "Prediction", as they were scored in our experimental evaluation. The MDC with the highest "prediction_score" is selected for the first-cut duplicate detection, for which precision, recall, and F-measure scores are also provided.
Evaluation results:
- MDC Selection: This corresponds to the MDC with the highest F-measure (F1) for phase = "selection".
- MDC Prediction: This corresponds to the MDC with the highest prediction_score for phase = "prediction".
- Boosting: Both cases of MDC Selection and MDC Prediction, with boosting applied, using Support Vector Machines (SVMs).
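
As a rough illustration of the two selection rules above, the following Python sketch picks the best MDC per phase from a small in-memory table. The field names "phase" and "prediction_score" come from the description above; the "mdc" and "f1" field names, the sample rows, and the helper best_mdc are hypothetical and do not reflect the actual layout of the published result files.

```python
from typing import Dict, List, Optional


def best_mdc(rows: List[Dict], phase: str, score_key: str) -> Optional[Dict]:
    """Return the row of the given phase with the highest value for score_key.

    Hypothetical helper: rows of other phases, and rows that lack a value
    for score_key, are ignored. Returns None if no row qualifies.
    """
    candidates = [
        row for row in rows
        if row.get("phase") == phase and row.get(score_key) is not None
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[score_key])


# Illustrative values only; these are not results from the evaluation.
rows = [
    {"mdc": "mdc_1", "phase": "selection", "f1": 0.62},
    {"mdc": "mdc_2", "phase": "selection", "f1": 0.74},
    {"mdc": "mdc_3", "phase": "prediction", "prediction_score": 0.41},
    {"mdc": "mdc_4", "phase": "prediction", "prediction_score": 0.58},
]

# MDC Selection: highest F-measure among rows with phase = "selection".
selected = best_mdc(rows, phase="selection", score_key="f1")

# MDC Prediction: highest prediction_score among rows with phase = "prediction".
predicted = best_mdc(rows, phase="prediction", score_key="prediction_score")

print(selected["mdc"], predicted["mdc"])
```

The same helper could be applied to rows read from the provided MDC files once their real column names are substituted for the assumed ones.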