Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project Overview

The Duplicate Detection (DuDe) toolkit supports the search for tuples that represent the same real-world object in a variety of data sources.

If you have any questions, please don't hesitate to contact Felix Naumann or Uwe Draisbach. To reference the project, please cite this paper:

Uwe Draisbach and Felix Naumann: DuDe: The Duplicate Detection Toolkit. In: Proceedings of the International Workshop on Quality in Databases (QDB), Singapore, 2010.


Project Details

Duplicate detection (also known as entity matching, record linkage, etc.) has been a research topic for several decades and was first defined by Newcombe et al. The challenge is to identify pairs of records that represent the same real-world entity in an effective and efficient manner. Researchers all over the world have developed a variety of methods to measure the similarity of records or to reduce the number of required comparisons. However, it is still difficult to compare results, as there are usually differences in the evaluated datasets, the implementation of the algorithms, or simply the hardware on which the code is executed.

To face this challenge, we are developing the comprehensive duplicate detection toolkit "DuDe". It consists of several components with clear interfaces that can easily be filled with individual implementations. This makes it easy for researchers to develop just a small piece of code and compare it with other algorithms, instead of building a whole duplicate detection process. In addition, DuDe already provides several algorithms, similarity measures, and datasets with gold standards, and is thus a step towards a duplicate detection benchmark.

The goal of DuDe is to provide a toolkit for duplicate detection which:

  • is easy to use,
  • is easy to extend,
  • supports a large variety of data sources, including nested data,
  • allows almost all algorithms to be implemented using the toolkit, and
  • provides several basic similarity measures.

DuDe Architecture

DuDe Components

Data Sources
The data source component is used to extract data from any data source that is supported by the toolkit and to convert the data into the internal JSON format. Currently, we are able to extract records from relational databases (Oracle, DB2, MySQL, and PostgreSQL), CSV files, XML documents, JSON files, and BibTeX bibliographies. For each data source, a record identifier consisting of one or more attributes can be defined; additionally, a global ID is assigned to each data source and saved within the extracted records. This allows records from different sources to be compared without requiring identifiers that are unique across sources.
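
As a rough illustration (not DuDe's actual API; all class, method, and field names below, such as _recordId and _sourceId, are made up for this sketch), the following Java snippet converts one CSV record into a JSON-like map, builds a record identifier from one or more attributes, and attaches a global id for the data source:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: converts one CSV line into a JSON-like record map,
    // attaches a composite record identifier and the global id of its data source.
    public class CsvExtractionSketch {

        static Map<String, Object> toRecord(String sourceId, String[] header,
                                            String csvLine, String... idAttributes) {
            Map<String, Object> record = new LinkedHashMap<>();
            String[] values = csvLine.split(",", -1);
            for (int i = 0; i < header.length; i++) {
                record.put(header[i], i < values.length ? values[i] : null);
            }
            // Record identifier built from one or more attributes
            StringBuilder recordId = new StringBuilder();
            for (String attr : idAttributes) {
                recordId.append(record.get(attr)).append('|');
            }
            record.put("_recordId", recordId.toString());
            record.put("_sourceId", sourceId);   // global id of the data source
            return record;
        }

        public static void main(String[] args) {
            String[] header = {"id", "name", "city"};
            Map<String, Object> r =
                toRecord("restaurants", header, "42,Fodor's Grill,New York", "id");
            System.out.println(r);
        }
    }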

Preprocessor
The preprocessor is used to gather statistics while extracting the data, e.g., counting the number of records or (distinct) values. After the extraction phase, each preprocessor instance is accessible within the algorithm and within comparators that need preprocessing information.
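
A minimal sketch of the idea, again with illustrative names rather than DuDe's real classes: a preprocessor that is fed every extracted record and counts the records as well as the distinct values per attribute:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Illustrative preprocessor that gathers statistics during extraction.
    public class StatisticsPreprocessorSketch {
        private long recordCount = 0;
        private final Map<String, Set<Object>> distinctValues = new HashMap<>();

        // Called once for every extracted record
        public void process(Map<String, Object> record) {
            recordCount++;
            record.forEach((attribute, value) ->
                distinctValues.computeIfAbsent(attribute, a -> new HashSet<>()).add(value));
        }

        public long getRecordCount() { return recordCount; }

        public int getDistinctValueCount(String attribute) {
            return distinctValues.getOrDefault(attribute, Set.of()).size();
        }
    }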

Algorithms
Algorithms are responsible for selecting pairs of records from the sources that should be classified as duplicate or non-duplicate. We support all algorithms that follow a pair-wise comparison pattern. Most algorithms require some kind of preprocessing, such as sorting for the Sorted Neighborhood Method or partitioning for the Blocking Method. Therefore, each algorithm can execute a preprocessing step before returning record pairs. In the case of sorting, DuDe allows the definition of a sorting key and the selection of a sorter.
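
As an illustration of the Sorted Neighborhood idea mentioned above (a simplified sketch, not DuDe's implementation), records are sorted by a user-defined sorting key and all pairs within a sliding window become candidate pairs:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Illustrative Sorted Neighborhood sketch: sort by a sorting key attribute,
    // then emit the index pairs of all records that fall into the same window.
    public class SortedNeighborhoodSketch {

        static List<int[]> generatePairs(List<Map<String, Object>> records,
                                         String keyAttribute, int windowSize) {
            records.sort(Comparator.comparing(
                (Map<String, Object> r) -> String.valueOf(r.get(keyAttribute))));
            List<int[]> pairs = new ArrayList<>();
            for (int i = 0; i < records.size(); i++) {
                for (int j = i + 1; j < Math.min(i + windowSize, records.size()); j++) {
                    pairs.add(new int[] { i, j });  // candidate pair (indices in sort order)
                }
            }
            return pairs;
        }
    }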

Algorithms do not select all duplicate pairs at once; instead, each selected candidate pair is pipelined to the similarity function and classified before the next candidate pair is selected. Algorithms can therefore be notified whether the previous pairs have been classified as duplicate or non-duplicate before they select the next pair. This is necessary for algorithms whose pair selection depends on previous classifications.
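
The contract behind this pipelining could look roughly like the following interface sketch (hypothetical names, not DuDe's actual API): the algorithm hands out one candidate pair at a time and is told how the previous pair was classified before it selects the next one:

    // Illustrative contract for pipelined pair selection with classification feedback.
    public interface PairSelectionAlgorithm<R> {

        /** Returns the next candidate pair, or null if no pairs are left. */
        CandidatePair<R> nextPair();

        /** Informs the algorithm whether the previously returned pair was a duplicate. */
        void notifyClassification(CandidatePair<R> pair, boolean isDuplicate);

        /** Simple holder for a pair of records. */
        final class CandidatePair<R> {
            public final R first;
            public final R second;
            public CandidatePair(R first, R second) { this.first = first; this.second = second; }
        }
    }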

Similarity Functions
Similarity functions are used to compare two records and calculate a similarity value between 0 and 1, where 1 denotes equality. We distinguish between three types of comparators:

  • Structure-based similarity functions can be used to compare objects based on their structure. This is especially interesting if records from different sources with different schemas are compared (e.g. calculating the similarity based on the number of equal attributes in two records).
  • Content-based similarity functions can be used to compare objects based on concrete attribute values. Examples are Levenshtein, SoundEx, and identity comparators (see the sketch after this list).
  • Aggregating similarity functions (e.g. min, max) can be used to combine different structure- or content-based comparators.
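
The sketch below illustrates a content-based and an aggregating similarity function (simplified, illustrative code, not DuDe's built-in comparators): a Levenshtein similarity normalized to [0, 1] and a max aggregator over several attribute similarities:

    // Illustrative similarity functions: the Levenshtein similarity normalizes the
    // edit distance into [0, 1], where 1 means equality; maxSimilarity aggregates.
    public class SimilaritySketch {

        static double levenshteinSimilarity(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i;
            for (int j = 0; j <= b.length(); j++) d[0][j] = j;
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                       d[i - 1][j - 1] + cost);
                }
            }
            int maxLen = Math.max(a.length(), b.length());
            return maxLen == 0 ? 1.0 : 1.0 - (double) d[a.length()][b.length()] / maxLen;
        }

        // Aggregating similarity: takes the maximum of several attribute similarities.
        static double maxSimilarity(double... similarities) {
            double max = 0.0;
            for (double s : similarities) max = Math.max(max, s);
            return max;
        }

        public static void main(String[] args) {
            System.out.println(levenshteinSimilarity("restaurant", "restuarant")); // 0.8
        }
    }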

Postprocessor
The postprocessor receives the classified record pairs and performs additional processing. Two important postprocessors are the transitive closure generator and the statistics component. The former calculates the transitive closure over all classified duplicates. The latter calculates key figures, such as runtime, the number of generated record pairs, or the number of classified duplicates. If a gold standard exists for a dataset, precision, recall, F-measure, etc. are calculated as well.
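
A simplified sketch of both postprocessing steps (illustrative code, not DuDe's implementation): a union-find structure computes the transitive closure over classified duplicates, and the F-measure is derived from the declared duplicates and a gold standard of true duplicate pairs:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Illustrative postprocessing: transitive closure via union-find, plus F-measure.
    public class PostprocessingSketch {
        private final Map<String, String> parent = new HashMap<>();

        String find(String id) {
            parent.putIfAbsent(id, id);
            String p = parent.get(id);
            if (!p.equals(id)) {
                p = find(p);
                parent.put(id, p);   // path compression
            }
            return p;
        }

        // Merge the clusters of two records that were classified as duplicates
        void union(String id1, String id2) {
            parent.put(find(id1), find(id2));
        }

        // Key figure given the declared duplicate pair keys and the gold standard pair keys
        static double fMeasure(Set<String> declared, Set<String> goldStandard) {
            long truePositives = declared.stream().filter(goldStandard::contains).count();
            double precision = declared.isEmpty() ? 0 : (double) truePositives / declared.size();
            double recall = goldStandard.isEmpty() ? 0 : (double) truePositives / goldStandard.size();
            return precision + recall == 0 ? 0 : 2 * precision * recall / (precision + recall);
        }
    }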

Output
There are several output formats for the record pairs, including a simple text output that writes each result pair on one line using a specified separator for the attribute values, and a JSON output, which uses JSON syntax. The CSV output can additionally write further information for record pairs, such as the calculated similarity or whether the pair has been classified as a duplicate. The statistics component has its own CSV output, which also allows the specification of additional attributes. These attributes can be used to describe the configuration of an experiment. All outputs can be written to the screen or to a file. The provided output components can easily be extended to meet experiment-specific requirements.
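
A minimal sketch of such a CSV output (illustrative, not DuDe's actual output component): one line per record pair with the calculated similarity and the classification, using a configurable separator:

    import java.io.PrintWriter;
    import java.io.Writer;

    // Illustrative CSV output for classified record pairs.
    public class CsvPairOutputSketch {
        private final PrintWriter out;
        private final String separator;

        public CsvPairOutputSketch(Writer target, String separator) {
            this.out = new PrintWriter(target, true);
            this.separator = separator;
            out.println(String.join(separator,
                    "recordId1", "recordId2", "similarity", "isDuplicate"));
        }

        public void write(String id1, String id2, double similarity, boolean isDuplicate) {
            out.println(String.join(separator,
                    id1, id2, Double.toString(similarity), Boolean.toString(isDuplicate)));
        }
    }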

DuDe Download

DuDe is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. 
DuDe is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. 
You should have received a copy of the GNU General Public License along with DuDe. If not, see <http://www.gnu.org/licenses/>.

Documentation and How-To Guides

You can find the documentation in our JavaDoc. Some helpful PDF files are available here:

Datasets

We have prepared three datasets that are frequently used for duplicate detection:

  • CORA
    This dataset contains bibliographic information about scientific papers and comprises 1879 records. We have added a unique identifier to each record and made some adjustments so that the file can be loaded with the DuDe XML extractor.
  • Restaurant data from the RIDDLE repository
    A collection of 864 restaurant records from the Fodor's and Zagat's restaurant guides that contains 112 duplicates.
  • CD information
    This dataset includes 9763 CDs randomly extracted from freeDB.

Dataset       Data            Gold Standard     Extractors    Data Adjustments
CORA          1879 records    64,578 records    txt-file      txt-file
Restaurant    864 records     112 records       txt-file      txt-file
CD            9763 records    299 records       txt-file      ---

Data Generators

The following data generators might be used to create artificial datasets:

Acknowledgments

We would like to thank all involved students, especially Matthias Pohl and Florian Thomas, who have made a great effort to implement parts of the framework.

Show Case: BibTeX

You can see DuDe in action with our BibTeX Live Demo. Upload your BibTeX file, find duplicates in it, merge them, and download a revised copy of your BibTeX file.