Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Content

      Dataset 1

      This dataset contains 9,763 CDs randomly extracted from FreeDB.

      • Dataset
        • The data was converted from plain text to XML and is packed into a zip archive.
        • It is also available in tab-separated values (TSV) format. (9,763 objects - TSV format)
          • Same, but lower-cased and with special characters removed. (9,763 objects - TSV format)
      • Duplicates
        • A list of all duplicates in the dataset. (298 objects - XML format)
        • An updated list (2018) that adds a transitive duplicate pair missing from the original list. (299 objects - XML format)
        • A further update (2018) that adds one more pair from the transitive closure. (300 objects - TSV format)
      • Non-duplicates
      • Schema of the dataset
        This is a PDF representation of the schema of the dataset.
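
The two 2018 updates to the duplicate list each added a pair implied by transitivity: if A duplicates B and B duplicates C, then A and C are duplicates too. The following is a minimal sketch of how such missing pairs could be found with a union-find structure. It is not the procedure used to produce the lists above. The function name `transitive_closure` and the record IDs in the usage example are hypothetical, and the sketch assumes the duplicate list has already been read into pairs of IDs.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every duplicate pair implied by the given pairs.

    Each returned pair is a tuple (a, b) with a < b.
    """
    parent = {}

    def find(x):
        # Register unseen records as their own cluster representative.
        parent.setdefault(x, x)
        # Walk up to the representative, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of the two records in every known duplicate pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group records by cluster representative.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), []).append(record)

    # Every pair of records within one cluster is a duplicate pair.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


if __name__ == "__main__":
    # Hypothetical IDs, not taken from the dataset.
    listed = {("cd1", "cd2"), ("cd2", "cd3")}
    missing = transitive_closure(listed) - listed
    print(sorted(missing))
```

Subtracting the listed pairs from the closure leaves only the pairs that transitivity implies but the list omits. This works as long as the listed pairs are stored with the smaller ID first, matching the order the function returns.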

      Dataset 2

      This dataset consists of 500 clean CD objects extracted from the FreeDB database and 500 artificial duplicates generated with the Dirty XML Data Generator (one duplicate for each CD).