Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project Description

The evaluation of duplicate detection systems usually needs pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews. 

We propose an annealing standard, which is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more and more classifiers are evaluated against the annealing standard, more and more results are verified and validation becomes more and more confident.

Read the paper published in the Journal of Data and Information Quality (JDIQ) 2014.

 

We want to evaluate the annealing standard by deduplicating a large real-world data set. The task is to develop a classifier for the data set so that we can use the results to improve the quality of the annealing standard.

Website content:

Data Set

The data set was extracted from freeDB, a free CD and music database service for looking up textual metadata about music, audio, or data CDs. There are two files, one describing the discs and one describing the respective tracks. The task is to deduplicate the discs file. The tracks file only contains additional information.

Discs

The "disc_id" is a unique identifier that we have added. The "freedbdiscid" is a generated identifier that is not unique, i.e., there are different discs with the same freedbdiscid (e.g., freedbdiscid 1000464 refers to disc_id 728138 and disc_id 1549212, which in this case are indeed duplicates).

Nr|Attribute|Type|Sample
1|disc_id|BIGINT|1991212
2|freedbdiscid|BIGINT|3221869327
3|artist_name|VARCHAR(198)|The Purple Helmets
4|disc_title|VARCHAR(187)|Rise Again
5|genre_title|VARCHAR(104)|Rock
6|disc_released|INT|1989
7|disc_tracks|INT|15
8|disc_seconds|INT|2515
9|disc_language|VARCHAR(3)|eng
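Because several discs can share a freedbdiscid, one simple way to obtain candidate pairs is to group the discs by that attribute. The following is a minimal sketch, not the method used in this project: the function name `candidate_pairs` and the in-memory list of `(disc_id, freedbdiscid)` tuples are assumptions made for illustration. The ids are the ones mentioned above.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(discs):
    """Return all pairs of disc_ids that share a freedbdiscid.

    `discs` is an iterable of (disc_id, freedbdiscid) tuples.
    Each pair is returned as (smaller_id, larger_id).
    """
    by_freedbdiscid = defaultdict(list)
    for disc_id, freedbdiscid in discs:
        by_freedbdiscid[freedbdiscid].append(disc_id)

    pairs = set()
    for disc_ids in by_freedbdiscid.values():
        # Every combination inside one group is a candidate pair.
        for a, b in combinations(sorted(disc_ids), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical in-memory excerpt of the discs file.
discs = [
    (728138, 1000464),
    (1549212, 1000464),
    (1991212, 3221869327),
]
print(candidate_pairs(discs))  # {(728138, 1549212)}
```

Such pairs are only candidates: a shared freedbdiscid does not guarantee a duplicate, and duplicates may well carry different freedbdiscids, so a classifier still has to compare the remaining attributes.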

Tracks

The tracks file contains additional information for the discs. The attribute "disc_id" is a reference to the disc and "track_number" describes the order of the tracks. "artist_name" contains the artist name of the track, in contrast to "artist_name" in the discs file, which contains the artist of the disc (e.g., "Various").

Nr|Attribute|Type|Sample
1|disc_id|BIGINT|1991212
2|track_number|INT|1
3|track_title|VARCHAR(246)|Brand New Cadillac
4|artist_name|VARCHAR(203)|The Purple Helmets
5|track_seconds|INT|155

Duplicate Definition

In general, a duplicate is a pair of two object representations that represent the same real-world entity. Deciding whether two discs are a duplicate is a difficult task, sometimes even for humans. We provide the following guidelines for this decision, well aware that others might have a different view: 

A pair of discs is a duplicate if a customer/vendor views the two discs as being the same product. This is, for instance, the case if both discs could be produced from the same master disc.

We have prepared an Excel spreadsheet with examples of duplicates and non-duplicates that help establish a common understanding.

freeDB.xls

Annealing and Silver Standard

We ran our workflow with four classifiers that we built ourselves and present the resulting annealing standard here. For ease of use, we extracted the silver standard into separate files. Additionally, we provide the duplicates declared by each of the four classifiers. All files contain semicolon-separated pairs of disc ids.

Some pairs are undisputed among the classifiers but still do not appear in the annealing standard duplicates. These pairs were manually identified as unknown-artist CDs and therefore appear among the non-duplicates of the silver standard, even though the workflow did not require a manual inspection for them.
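The relationship between the classifier results and the annealing standard duplicates can be sketched with plain set operations. This is only an illustration under two assumptions: that "undisputed" means every classifier declared the pair, and that manually inspected non-duplicates are removed afterwards. The actual workflow is described in the paper; the function name and the pair ids below are made up.

```python
def split_by_agreement(submissions, manual_non_duplicates):
    """Split declared pairs into undisputed, disputed, and annealing duplicates.

    `submissions` is a non-empty list of sets of (id, id) pairs, one set
    per classifier. `manual_non_duplicates` is a set of pairs that a human
    has marked as non-duplicates.
    """
    declared_by_any = set().union(*submissions)
    undisputed = set.intersection(*submissions)  # declared by every classifier
    disputed = declared_by_any - undisputed      # at least one classifier disagrees
    # Undisputed pairs that a human rejected do not count as duplicates.
    annealing_duplicates = undisputed - manual_non_duplicates
    return undisputed, disputed, annealing_duplicates


# Made-up pairs for three hypothetical classifiers.
submissions = [
    {(1, 2), (3, 4), (5, 6)},
    {(1, 2), (3, 4)},
    {(1, 2), (3, 4), (7, 8)},
]
manual_non_duplicates = {(3, 4)}

undisputed, disputed, annealing = split_by_agreement(submissions, manual_non_duplicates)
print(sorted(undisputed))  # [(1, 2), (3, 4)]
print(sorted(disputed))    # [(5, 6), (7, 8)]
print(sorted(annealing))   # [(1, 2)]
```

In this toy example the pair (3, 4) is undisputed among the classifiers but is excluded from the annealing duplicates because it was manually marked as a non-duplicate.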

File|Content|Number of pairs|Size
Annealing standard duplicates|Undisputed duplicate pairs (DA)|123,854|1,900 KB
Silver standard duplicates|Manually inspected duplicate pairs (DS)|405|16 KB
Silver standard non-duplicates|Manually inspected non-duplicate pairs (NS)|1,243|28 KB
Submission 1|Declared duplicate pairs by classifier 1|127,660|1,900 KB
Submission 2|Declared duplicate pairs by classifier 2|126,193|1,900 KB
Submission 3|Declared duplicate pairs by classifier 3|127,122|1,900 KB
Submission 4|Declared duplicate pairs by classifier 4|128,485|1,900 KB
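Since all files are semicolon-separated pairs of disc ids, they can be read into sets of pairs with a few lines of code. The sketch below assumes one pair per line and no header row, which is not confirmed here; the function name `read_pairs` is hypothetical. Each pair is stored with the smaller id first, so that the order of the two ids within a line does not matter.

```python
import csv
import io


def read_pairs(lines):
    """Read semicolon-separated disc id pairs into a set of (small, large) tuples.

    `lines` can be an open text file or any iterable of strings.
    """
    pairs = set()
    for row in csv.reader(lines, delimiter=";"):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        a, b = int(row[0]), int(row[1])
        pairs.add((min(a, b), max(a, b)))
    return pairs


# In-memory stand-in for one of the files; both lines describe the same pair.
sample = io.StringIO("728138;1549212\n1549212;728138\n\n")
print(read_pairs(sample))  # {(728138, 1549212)}
```

For a real file, the same function can be called with a file object opened via `open(path, newline="")`.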

Build your own…

Feel free to build your own classifiers and evaluate them against the provided annealing standard. You are also invited to contribute manual inspections to enlarge the silver standard. Please let us know and we will update the data set accordingly.
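One possible way to compare a classifier's declared duplicates with the provided files is sketched below. It is an assumption-laden illustration, not the evaluation procedure of the paper: it treats DA and DS together as known duplicates, treats NS as known non-duplicates, and reports declared pairs outside all three sets separately as candidates for manual inspection. All names and pair ids are made up.

```python
def evaluate(declared, da, ds, ns):
    """Compare declared duplicate pairs with the annealing and silver standard.

    All arguments are sets of (smaller_id, larger_id) pairs.
    Returns precision, recall, and the set of declared pairs of unknown status.
    """
    known_duplicates = da | ds
    true_positives = declared & known_duplicates
    false_positives = declared & ns
    false_negatives = known_duplicates - declared
    # Declared pairs that appear in none of the files have unknown status.
    unknown = declared - known_duplicates - ns

    judged = len(true_positives) + len(false_positives)
    precision = len(true_positives) / judged if judged else 0.0
    expected = len(true_positives) + len(false_negatives)
    recall = len(true_positives) / expected if expected else 0.0
    return precision, recall, unknown


# Made-up pairs standing in for the contents of the files.
da = {(1, 2), (3, 4)}
ds = {(5, 6)}
ns = {(7, 8)}
declared = {(1, 2), (5, 6), (7, 8), (9, 10)}

precision, recall, unknown = evaluate(declared, da, ds, ns)
print(round(precision, 3), round(recall, 3), unknown)  # 0.667 0.667 {(9, 10)}
```

Precision here is computed only over declared pairs whose status is known, so the unknown pairs neither help nor hurt the score; they are the natural candidates for additional manual inspection.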