Annealing Standard

Project Description

The evaluation of duplicate detection systems usually needs pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews.

We propose an annealing standard, which is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more classifiers are evaluated against the annealing standard, more results are verified and the validation becomes increasingly confident.

Read the paper published in the Journal of Data and Information Quality (JDIQ) 2014.

We want to evaluate the annealing standard by deduplicating a large real-world data set. The task is to develop a classifier for the data set so that we can use the results to improve the quality of the annealing standard.

Website content:

Data set
Duplicate definition
Annealing and Silver Standard
Build your own…

Data Set

The data set was extracted from freeDB, a free CD and music database service for looking up textual metadata about music, audio or data CDs. There are two files, one describing the discs and one describing the respective tracks. The task is to deduplicate the discs file. The tracks file only contains additional information.

File	Records	Size
Discs	750,000	66 MB
Tracks	9,951,985	538 MB

Discs

The "disc_id" is a unique identifier which we have added. The "freedbdiscid" is a generated identifier, which is not unique, i.e., different discs can have the same freedbdiscid (e.g., freedbdiscid 1000464 refers to disc_id 728138 and disc_id 1549212, which in this case are indeed duplicates).

Nr	Attribute	Type	Sample
1	disc_id	BIGINT	1991212
2	freedbdiscid	BIGINT	3221869327
3	artist_name	VARCHAR(198)	The Purple Helmets
4	disc_title	VARCHAR(187)	Rise Again
5	genre_title	VARCHAR(104)	Rock
6	disc_released	INT	1989
7	disc_tracks	INT	15
8	disc_seconds	INT	2515
9	disc_language	VARCHAR(3)	eng
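
Because the freedbdiscid is not unique, discs that share one can serve as a first set of candidate pairs. The following is a minimal sketch of that idea, not part of our workflow: the function name and the (disc_id, freedbdiscid) tuple layout are assumptions made for illustration, and a shared freedbdiscid is only a hint that still needs to be checked against the duplicate definition.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_freedbdiscid(discs):
    """Yield (disc_id, disc_id) pairs of discs that share a freedbdiscid.

    `discs` is an iterable of (disc_id, freedbdiscid) tuples
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for disc_id, freedbdiscid in discs:
        groups[freedbdiscid].append(disc_id)
    for ids in groups.values():
        # Sorting makes every pair appear as (smaller id, larger id).
        for a, b in combinations(sorted(ids), 2):
            yield (a, b)


# Ids taken from the example and the sample row on this page.
sample = [
    (728138, 1000464),
    (1549212, 1000464),
    (1991212, 3221869327),
]
pairs = list(candidate_pairs_by_freedbdiscid(sample))
print(pairs)  # [(728138, 1549212)]
```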

Tracks

The tracks file contains additional information for the discs. The attribute "disc_id" is a reference to the disc and "track_number" describes the order of the tracks. "artist_name" contains the artist name of the track, in contrast to "artist_name" in the discs file, which contains the artist of the disc (e.g., "Various").

Nr	Attribute	Type	Sample
1	disc_id	BIGINT	1991212
2	track_number	INT	1
3	track_title	VARCHAR(246)	Brand New Cadillac
4	artist_name	VARCHAR(203)	The Purple Helmets
5	track_seconds	INT	155
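
To use the track information when comparing discs, the tracks can be grouped by "disc_id" and ordered by "track_number". A minimal sketch, assuming the rows have already been read into (disc_id, track_number, track_title) tuples; the function name and the tuple layout are illustrative assumptions, and the second title below is a made-up placeholder.

```python
from collections import defaultdict


def track_titles_by_disc(tracks):
    """Map each disc_id to its track titles, ordered by track_number.

    `tracks` is an iterable of (disc_id, track_number, track_title) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_disc = defaultdict(list)
    for disc_id, track_number, track_title in tracks:
        by_disc[disc_id].append((track_number, track_title))
    return {
        disc_id: [title for _, title in sorted(items)]
        for disc_id, items in by_disc.items()
    }


rows = [
    (1991212, 2, "Placeholder Title"),   # made-up row, for illustration only
    (1991212, 1, "Brand New Cadillac"),  # sample row from the table above
]
titles = track_titles_by_disc(rows)
print(titles[1991212])  # ['Brand New Cadillac', 'Placeholder Title']
```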

Duplicate Definition

In general, a duplicate is a pair of object representations that represent the same real-world entity. Deciding whether two discs are duplicates is a difficult task, sometimes even for humans. We provide the following guidelines for this decision, well aware that others might have a different view:

A pair of discs is a duplicate if a customer/vendor views the two discs as being the same product. This is, for instance, the case if both discs could be produced using the same master disc.

We have prepared an Excel spreadsheet with examples of duplicates and non-duplicates, which helps to establish a common understanding.

freeDB.xls

Annealing and Silver Standard

We ran our workflow using four classifiers that we created ourselves and now present the resulting annealing standard. For ease of use, we extracted the silver standard into separate files. Additionally, we provide the declared duplicates of the four classifiers. All files are semicolon-separated pairs of disc ids.

There are pairs that are undisputed among the classifiers but still do not appear in the annealing standard duplicates. These pairs were manually inspected and found to be unknown-artist CDs. They therefore occur in the non-duplicates of the silver standard, even though the workflow did not force them to undergo a manual inspection.

File	Content	Number of pairs	Size
Annealing standard duplicates	Undisputed duplicate pairs (D_A)	123,854	1,900 KB
Silver standard duplicates	Manually inspected duplicate pairs (D_S)	405	16 KB
Silver standard non-duplicates	Manually inspected non-duplicate pairs (N_S)	1,243	28 KB
Submission 1	Declared duplicate pairs by classifier 1	127,660	1,900 KB
Submission 2	Declared duplicate pairs by classifier 2	126,193	1,900 KB
Submission 3	Declared duplicate pairs by classifier 3	127,122	1,900 KB
Submission 4	Declared duplicate pairs by classifier 4	128,485	1,900 KB
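
The pair files can be read with a few lines of code. The sketch below assumes one pair per line, no header line, and that the order of the two ids within a pair does not matter; none of these details is specified here, so please check them against the actual files. The function name is hypothetical.

```python
import io


def read_pairs(lines):
    """Parse semicolon-separated disc id pairs into a set of tuples.

    Each pair is stored as (smaller id, larger id), so that "a;b" and
    "b;a" count as the same pair. Empty lines are skipped.
    """
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        left, right = line.split(";")[:2]
        a, b = int(left), int(right)
        pairs.add((min(a, b), max(a, b)))
    return pairs


# In-memory stand-in for a pair file; with a real file, pass an open
# file handle instead of the StringIO object.
example = io.StringIO("728138;1549212\n1549212;728138\n\n")
pairs = read_pairs(example)
print(pairs)  # {(728138, 1549212)}
```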

Build your own…

Feel free to build your own classifier and evaluate it against the provided annealing standard. You are also invited to add manual inspections to enlarge the silver standard. Please let us know and we will update the data set accordingly.
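
One simple way to compare your declared duplicates with the provided files is sketched below. It is not the evaluation procedure from the paper: as an assumption for this sketch, it treats the union of D_A and D_S as known duplicates and N_S as known non-duplicates, and it reports declared pairs that appear in neither set as unverified, since those would be candidates for manual inspection. The function name and the toy ids are hypothetical.

```python
def compare_with_standard(declared, duplicates, non_duplicates):
    """Count how declared duplicate pairs relate to the known pair sets.

    All arguments are sets of (disc_id, disc_id) tuples with the ids of
    each pair in a consistent order (e.g. smaller id first).
    """
    declared = set(declared)
    confirmed = declared & duplicates
    refuted = declared & non_duplicates
    unverified = declared - duplicates - non_duplicates
    missed = duplicates - declared
    recall = len(confirmed) / len(duplicates) if duplicates else 0.0
    return {
        "confirmed": len(confirmed),
        "refuted": len(refuted),
        "unverified": len(unverified),
        "missed": len(missed),
        "recall": recall,
    }


# Toy pairs, not taken from the data set.
duplicates = {(1, 2), (3, 4), (5, 6)}        # stands in for D_A united with D_S
non_duplicates = {(7, 8)}                    # stands in for N_S
declared = {(1, 2), (3, 4), (7, 8), (9, 10)}

result = compare_with_standard(declared, duplicates, non_duplicates)
print(result["confirmed"], result["refuted"], result["unverified"], result["missed"])
# 2 1 1 1
```

A precision value is deliberately left out of this sketch, because the status of the unverified pairs is unknown until they have been inspected.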