CD Datasets

Content

Dataset 1 (9763 CDs)
Dataset 2 (1000 CDs)

Dataset 1

This dataset includes 9763 CDs randomly extracted from FreeDB.

Dataset
- The data was converted from plain text to XML and is packed into a ZIP archive.
- It is also available in a tab separated value (TSV) format. (9,763 objects - TSV format)
  - Same, but lower-cased and with special characters removed. (9,763 objects - TSV format)
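
The exact cleaning rule behind the lower-cased variant is not documented here. As a rough sketch of what such a normalization might look like, the following assumes that "special characters" means anything other than ASCII letters, digits, and spaces, and that accented letters are reduced to their base letter. The function name `normalize` is hypothetical, and it is meant to be applied per field, not per line, so that the tab separators survive.

```python
import re
import unicodedata


def normalize(value: str) -> str:
    """Lower-case one field and strip special characters.

    Assumed rule (not taken from the dataset documentation): keep only
    ASCII letters, digits, and single spaces.
    """
    # Decompose accented characters so the base letter survives the ASCII step.
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    lowered = ascii_only.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join(cleaned.split())


print(normalize("Björk - Début!"))
```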
Duplicates
- A list of all duplicates in the dataset. (298 objects - XML format)
- An updated list (2018) that adds a transitive duplicate pair missed in the original list. (299 objects - XML format)
- A further update (2018) that adds one more pair from the transitive closure. (300 objects - TSV format)
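
The updates to the duplicate list concern transitivity: if CD A duplicates B and B duplicates C, then (A, C) is a duplicate pair as well. A minimal sketch of how a pair list could be checked for such missing pairs is given below, using a union-find structure. The function name `transitive_closure` and the example identifiers are hypothetical, and this is not necessarily how the lists above were produced.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every pair implied by transitivity, as sorted (a, b) tuples with a < b."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Merge the clusters of the two items in every known duplicate pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group items by the root of their cluster.
    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), []).append(item)

    # Every two members of the same cluster form a duplicate pair.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return sorted(closed)


known = [("cd1", "cd2"), ("cd2", "cd3")]  # hypothetical identifiers
missing = set(transitive_closure(known)) - set(known)
print(missing)
```

Comparing the closure against the original list, as in the last lines, reveals any pair that is implied but not listed.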
Non-duplicates
- We generate non-duplicate pairs by following a systematic approach. (3,000 objects - TSV format)
  - Using an updated, further simplified approach across datasets. (3,000 objects - TSV format)
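
The approach used to generate the non-duplicate pairs is not described on this page. Purely as an illustration, one simple possibility is to sample random pairs of CD identifiers and reject any pair that appears in the (transitively closed) duplicate list. The function name `sample_non_duplicates`, its parameters, and the fixed seed are assumptions made for this sketch.

```python
import random


def sample_non_duplicates(ids, duplicate_pairs, n, seed=0):
    """Sample n distinct pairs of ids that are not known duplicates.

    Illustrative only: assumes duplicate_pairs is already transitively closed.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    known = {tuple(sorted(p)) for p in duplicate_pairs}
    ids = sorted(ids)
    # Conservative upper bound on how many non-duplicate pairs exist.
    max_pairs = len(ids) * (len(ids) - 1) // 2 - len(known)
    if n > max_pairs:
        raise ValueError("not enough non-duplicate pairs available")
    result = set()
    while len(result) < n:
        a, b = rng.sample(ids, 2)  # two distinct ids
        pair = tuple(sorted((a, b)))
        if pair not in known:
            result.add(pair)
    return sorted(result)
```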
Schema of the dataset
This is a PDF representation of the schema of the dataset.

Dataset 2

This dataset was generated by extracting 500 clean CD objects from the FreeDB database and creating 500 artificial duplicates with the Dirty XML Data Generator (one duplicate for each CD).

Dataset

Schema of the dataset
The schema of the dataset is listed below.

Sources

http://www.freedb.org/