CD Datasets
Content
Dataset 1
This dataset includes 9,763 CDs randomly extracted from FreeDB.
- Dataset
- The data was converted from plain text to XML and is packed into a ZIP archive.
- It is also available in a tab separated value (TSV) format. (9,763 objects - TSV format)
- Same, but lower-cased and with special characters removed. (9,763 objects - TSV format)
- Duplicates
- A list of all duplicates in the dataset. (298 objects - XML format)
- This is an updated list (2018) - we had missed a transitive duplicate pair. (299 objects - XML format)
- A further update (2018), including one more pair found through transitive closure. (300 objects - TSV format)
- Non-duplicates
- We generate non-duplicate pairs by following a systematic approach. (3,000 objects - TSV format)
- Using an updated, further simplified approach across datasets. (3,000 objects - TSV format)
- Schema of the dataset
This is a PDF representation of the schema of the dataset.
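Two of the steps mentioned for Dataset 1 can be sketched in Python: producing the lower-cased variant with special characters removed, and extending a duplicate list with the pairs implied by transitivity. This is only an illustration, not the procedure actually used to build the files. The function names `normalize` and `transitive_closure` are hypothetical, and the definition of "special characters" (anything other than letters, digits, underscores, and whitespace) is an assumption.

```python
import re
from collections import defaultdict
from itertools import combinations


def normalize(value: str) -> str:
    """Lower-case a field and drop special characters.

    Assumption: "special characters" means anything that is not a letter,
    digit, underscore, or whitespace; the dataset's actual rule may differ.
    """
    cleaned = re.sub(r"[^\w\s]", "", value.lower())
    # Collapse any runs of whitespace left behind by removed characters.
    return " ".join(cleaned.split())


def transitive_closure(pairs):
    """Return every duplicate pair implied by the given pairs.

    Uses a union-find structure: if (a, b) and (b, c) are duplicates,
    then a, b, and c end up in one cluster and (a, c) is reported too.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group all known identifiers by the root of their cluster.
    clusters = defaultdict(list)
    for item in parent:
        clusters[find(item)].append(item)

    # Emit every unordered pair within each cluster, in sorted order.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed
```

For example, with the made-up identifiers `cd1`, `cd2`, and `cd3`, calling `transitive_closure([("cd1", "cd2"), ("cd2", "cd3")])` also returns the pair `("cd1", "cd3")`, which is the kind of pair the updated duplicate lists add.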
Dataset 2
This dataset consists of 500 clean CD objects extracted from the FreeDB database and 500 duplicates artificially generated with the Dirty XML Data Generator (one duplicate for each CD).
- Schema of the dataset
The schema of the dataset is listed below.