Duplicate detection is an integral part of data cleaning and serves to identify multiple representations of the same real-world entities in (relational) datasets. Existing duplicate detection approaches are very effective, but they are also very hard to parameterize or require a lot of pre-labeled training data. Both parameterization and pre-labeling are at least domain-specific, if not dataset-specific. For many duplicate detection algorithms that are based on machine learning, it is also difficult to explain why certain duplicates have been discovered and others have not.
For these reasons, we propose a novel, rule-based, and fully automatic duplicate detection approach that is based on matching dependencies (MDs). Our system uses automatically discovered MDs, various dataset features, and known gold standards to train a machine learning model to select MDs as duplicate detection rules. Once trained, the model can select useful MDs for duplicate detection on any new dataset. To increase the generally low recall of MD-based data cleaning approaches, we propose an additional boosting technique. Our experiments show that this approach reaches up to 80% F-measure and 99% on our evaluation datasets, which are very good numbers considering that the system is configuration-free.
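To illustrate what using an MD as a duplicate detection rule can look like, the following minimal Python sketch reads an MD in a simplified way: two records are reported as duplicates if they are sufficiently similar on every attribute the rule names. The attribute names, thresholds, sample records, and the `difflib`-based similarity measure are assumptions made for this illustration only; the sketch does not reproduce our MD discovery, the learned MD selection, or the boosting technique.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice of measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def md_matches(r1: dict, r2: dict, md: dict) -> bool:
    """Simplified MD check: every listed attribute must reach its similarity threshold."""
    return all(similarity(r1[attr], r2[attr]) >= thr for attr, thr in md.items())


def find_duplicates(records: list, mds: list) -> set:
    """Return index pairs (i, j), i < j, that satisfy at least one MD rule."""
    pairs = set()
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if any(md_matches(r1, r2, md) for md in mds):
            pairs.add((i, j))
    return pairs


# Hypothetical records and a single hypothetical rule:
# "similar name (>= 0.85) and identical city (1.0) => same entity".
records = [
    {"name": "John Smith", "city": "Berlin"},
    {"name": "Jon Smith", "city": "Berlin"},
    {"name": "Mary Jones", "city": "Potsdam"},
]
mds = [{"name": 0.85, "city": 1.0}]

print(find_duplicates(records, mds))
```

Because each reported pair can be traced back to the specific rule that fired, a rule-based formulation of this kind is easier to explain than an opaque classifier. The quadratic pairwise comparison is kept only for brevity.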