Annealing Standard
Project Description
The evaluation of duplicate detection systems usually needs pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews.
We propose an annealing standard, which is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more and more classifiers are evaluated against the annealing standard, more and more results are verified and validation becomes more and more confident.
For details, read the paper published in the Journal of Data and Information Quality (JDIQ), 2014.
We want to evaluate the annealing standard by deduplicating a large real-world data set. The task is to develop a classifier for the data set so that we can use the results to improve the quality of the annealing standard.
Website content:
Data Set
The data set was extracted from freeDB, a free CD and music database service for looking up textual metadata about music, audio, or data CDs. There are two files, one describing the discs and one describing the respective tracks. The task is to deduplicate the discs file. The tracks file only contains additional information.
| File | Records | Size |
| Discs | 750,000 | 66 MB |
| Tracks | 9,951,985 | 538 MB |
Discs
The "disc_id" is a unique identifier that we have added. The "freedbdiscid" is a generated identifier that is not unique, i.e., different discs can share the same freedbdiscid (e.g., freedbdiscid 1000464 refers to both disc_id 728138 and disc_id 1549212, which in this case are indeed duplicates).
| Nr | Attribute | Type | Sample |
| 1 | disc_id | BIGINT | 1991212 |
| 2 | freedbdiscid | BIGINT | 3221869327 |
| 3 | artist_name | VARCHAR(198) | The Purple Helmets |
| 4 | disc_title | VARCHAR(187) | Rise Again |
| 5 | genre_title | VARCHAR(104) | Rock |
| 6 | disc_released | INT | 1989 |
| 7 | disc_tracks | INT | 15 |
| 8 | disc_seconds | INT | 2515 |
| 9 | disc_language | VARCHAR(3) | eng |
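Because the freedbdiscid is not unique, one simple way to obtain candidate pairs is to group discs by it. The following is a minimal sketch, not part of the described workflow: the function name `candidate_pairs` is hypothetical, and the input is assumed to be already parsed into `(disc_id, freedbdiscid)` tuples. A shared freedbdiscid is only a hint, not proof of a duplicate, and duplicates with different freedbdiscid values are missed by this approach.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(discs):
    """Return all pairs of disc_ids that share a freedbdiscid.

    `discs` is an iterable of (disc_id, freedbdiscid) tuples.
    Each pair is returned as (smaller_id, larger_id).
    """
    groups = defaultdict(set)
    for disc_id, freedbdiscid in discs:
        groups[freedbdiscid].add(disc_id)

    pairs = set()
    for ids in groups.values():
        # combinations over a sorted list yields ordered pairs (a < b)
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Identifiers taken from the example and the sample row above.
sample = [
    (728138, 1000464),
    (1549212, 1000464),
    (1991212, 3221869327),
]
print(candidate_pairs(sample))
```

The resulting pairs still have to be classified, for instance by comparing artist_name, disc_title, and disc_tracks.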
Tracks
The tracks file contains additional information for the discs. The attribute "disc_id" is a reference to the disc, and "track_number" describes the order of the tracks. "artist_name" contains the artist name of the track, in contrast to "artist_name" in the discs file, which contains the artist of the disc (e.g., "Various").
| Nr | Attribute | Type | Sample |
| 1 | disc_id | BIGINT | 1991212 |
| 2 | track_number | INT | 1 |
| 3 | track_title | VARCHAR(246) | Brand New Cadillac |
| 4 | artist_name | VARCHAR(203) | The Purple Helmets |
| 5 | track_seconds | INT | 155 |
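To use the tracks as additional evidence, they can be attached to their disc via "disc_id" and ordered by "track_number". The sketch below assumes the rows are already parsed into `(disc_id, track_number, track_title)` tuples; the function name `tracks_by_disc` and the second track title in the usage example are hypothetical.

```python
from collections import defaultdict


def tracks_by_disc(track_rows):
    """Map each disc_id to its track titles, ordered by track_number."""
    grouped = defaultdict(list)
    for disc_id, track_number, track_title in track_rows:
        grouped[disc_id].append((track_number, track_title))
    return {
        disc_id: [title for _, title in sorted(items)]
        for disc_id, items in grouped.items()
    }


rows = [
    (1991212, 2, "Hypothetical Second Track"),  # made-up row for illustration
    (1991212, 1, "Brand New Cadillac"),         # sample row from the table above
]
print(tracks_by_disc(rows))
```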
Duplicate Definition
In general, a duplicate is a pair of two object representations that represent the same real-world entity. Deciding whether two discs are a duplicate is a difficult task, sometimes even for humans. We provide the following guidelines for this decision, well aware that others might have a different view:
A pair of discs is a duplicate if a customer or vendor would view the two discs as the same product. This is, for instance, the case if both discs could be produced from the same master disc.
We have prepared an Excel spreadsheet with examples of duplicates and non-duplicates to help establish a common understanding.
Annealing and Silver Standard
We ran our workflow using four classifiers that we built ourselves and now present the resulting annealing standard. For ease of use, we extracted the silver standard into separate files. Additionally, we provide the declared duplicates of the four classifiers. All files contain semicolon-separated pairs of disc ids.
Some pairs are undisputed among the classifiers but still do not appear in the annealing standard duplicates. These pairs were manually inspected and identified as unknown-artist CDs, so they occur in the non-duplicates of the silver standard, even though the workflow did not force them to undergo a manual inspection.
| File | Content | Number of pairs | Size |
| Annealing standard duplicates | Undisputed duplicate pairs (DA) | 123,854 | 1,900 KB |
| Silver standard duplicates | Manually inspected duplicate pairs (DS) | 405 | 16 KB |
| Silver standard non-duplicates | Manually inspected non-duplicate pairs (NS) | 1,243 | 28 KB |
| Submission 1 | Declared duplicate pairs by classifier 1 | 127,660 | 1,900 KB |
| Submission 2 | Declared duplicate pairs by classifier 2 | 126,193 | 1,900 KB |
| Submission 3 | Declared duplicate pairs by classifier 3 | 127,122 | 1,900 KB |
| Submission 4 | Declared duplicate pairs by classifier 4 | 128,485 | 1,900 KB |
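The relationship between the submissions and the annealing standard duplicates can be sketched as follows. This is an illustration under two assumptions, not the published workflow: "undisputed" is taken to mean that every classifier declared the pair, and pairs that were manually inspected as non-duplicates are removed afterwards. The function name `undisputed_duplicates` is hypothetical.

```python
def undisputed_duplicates(submissions, inspected_non_duplicates=frozenset()):
    """Pairs declared by every submission, minus inspected non-duplicates.

    Each submission is a set of (disc_id, disc_id) pairs. Pairs are
    normalized to (smaller_id, larger_id) so that order does not matter.
    """
    normalized = [{tuple(sorted(pair)) for pair in sub} for sub in submissions]
    if not normalized:
        return set()
    agreed = set.intersection(*normalized)
    rejected = {tuple(sorted(pair)) for pair in inspected_non_duplicates}
    return agreed - rejected


# Toy disc ids for illustration only.
subs = [
    {(1, 2), (3, 4), (5, 6)},
    {(2, 1), (3, 4), (5, 6)},
    {(1, 2), (3, 4), (5, 6), (7, 8)},
]
non_dups = {(5, 6)}  # undisputed, but manually inspected as a non-duplicate
print(undisputed_duplicates(subs, non_dups))
```

In the toy example, the pair (5, 6) is declared by all three classifiers but is excluded because it was manually rejected, which mirrors the situation described above.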
Build your own…
Feel free to build your own classifiers and evaluate them against the provided annealing standard. You are also invited to add manual inspections to extend the silver standard. Please let us know, and we will update the data set accordingly.
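A minimal sketch of such an evaluation is shown below. It assumes each file holds one semicolon-separated pair of disc ids per line, as stated above; whether the files carry a header line is not known, so non-numeric rows are skipped. The function names `read_pairs` and `silver_scores` are hypothetical, and the scores only cover pairs that were manually inspected.

```python
import csv
import io


def read_pairs(stream):
    """Read semicolon-separated disc id pairs into a set of (small, large)."""
    pairs = set()
    for row in csv.reader(stream, delimiter=";"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            a, b = int(row[0]), int(row[1])
        except ValueError:
            continue  # skip a possible header line
        pairs.add((min(a, b), max(a, b)))
    return pairs


def silver_scores(declared, silver_dups, silver_non_dups):
    """Precision and recall of `declared`, restricted to inspected pairs."""
    tp = len(declared & silver_dups)       # inspected duplicates found
    fp = len(declared & silver_non_dups)   # inspected non-duplicates declared
    fn = len(silver_dups - declared)       # inspected duplicates missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data for illustration; real input would be opened with open(path, newline="").
declared = read_pairs(io.StringIO("1;2\n3;4\n6;5\n"))
silver_dups = {(1, 2), (7, 8)}
silver_non_dups = {(5, 6)}
print(silver_scores(declared, silver_dups, silver_non_dups))
```

Declared pairs that appear in neither silver standard file, such as (3, 4) in the toy data, are ignored by these scores because their true status is unknown.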