Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Description

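DBLP is a bibliographic database for computer science. The main data-quality problem in DBLP is assigning papers to the correct author entities: different authors can share a name, and one author can appear under several name aliases.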

Dataset 1:

This dataset provides bibliographical information about computer science journals and proceedings. It includes 50,000 objects.

Download:

Used in:

Usage:

If you would like to use this dataset, please cite our paper [2].

Dataset 2:

The data set has been constructed from parts of DBLP that were automatically cleaned (using fine-tuned heuristics) or manually cleaned (due to author requests), where different aliases for a person are known or ambiguous names have been resolved.

The data set consists of paper reference pairs that can be assigned to the following categories:

  • Two papers from the same author
  • Two papers from the same author with different name aliases (e.g., with/without middle initial)
  • Two papers from different authors with different names
  • Two papers from different authors with the same name

For each paper pair, the matching task is to decide whether the two papers were written by the same author. The data set contains 2,500 paper pairs per category (10,000 in total). This does not represent the original distribution of ambiguous or alias names in DBLP (where about 99.2 % of the author names are non-ambiguous), but makes the matching task more difficult and interesting.
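To illustrate why the balanced distribution makes the task harder, here is a minimal sketch of a naive baseline that predicts "same author" exactly when the two names are equal. It assumes that the first category ("two papers from the same author") means pairs with identical names, so that the four categories correspond to the four combinations of two boolean flags. The function names and the synthetic list of pairs are hypothetical stand-ins, not part of the data set.

```python
from collections import Counter


def category(same_entity: bool, same_name: bool) -> str:
    """Map the two boolean flags of a pair to one of the four categories."""
    if same_entity:
        return "same author, same name" if same_name else "same author, different aliases"
    return "different authors, same name" if same_name else "different authors, different names"


def name_baseline_accuracy(pairs) -> float:
    """Accuracy of predicting 'same author' exactly when the names are equal."""
    correct = sum(1 for same_entity, same_name in pairs if same_entity == same_name)
    return correct / len(pairs)


# Synthetic stand-in for the balanced data set: 2,500 pairs per category.
balanced = [
    (same_entity, same_name)
    for same_entity in (True, False)
    for same_name in (True, False)
    for _ in range(2500)
]

counts = Counter(category(e, n) for e, n in balanced)
accuracy = name_baseline_accuracy(balanced)
print(counts)
print(accuracy)
```

Under this assumption the baseline is right on the two categories where name equality and author identity agree (5,000 pairs) and wrong on the other two, so it reaches only 50 % accuracy on the balanced set, whereas it would be correct far more often on data where nearly all names are non-ambiguous.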

Download:

Format:

  • CSV file: Each line corresponds to one publication pair. One author of each publication has been selected for comparison.
  • Column descriptions:
    • sameentity (boolean): author1 is same entity as author2
    • samename (boolean): author1 and author2 have same name
    • authorname1, authorname2 (string): names of authors to be compared
    • key1, key2 (string): DBLP keys of publications
    • p1*, p2* (string): details of compared publications (p1, p2) as given in DBLP database
    • p[1|2]booktitlefull, p[1|2]journalfull (string): full name of the given journal or book title abbreviation (matched against a dictionary; may contain errors)
    • p[1|2][author|editor] (string): multi-valued attribute values for authors and editors, separated by pipe symbol "|"
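The column layout above could be read along the following lines. This is a sketch under several assumptions that the description does not confirm: a comma delimiter, a header row with the column names, boolean values written as text such as "true"/"false", and concrete column names like p1author derived from the p[1|2][author|editor] pattern. The sample rows are invented for illustration only.

```python
import csv
import io

# Assumption: the textual encoding of booleans is not documented.
TRUE_VALUES = {"true", "t", "1", "yes"}


def parse_bool(text: str) -> bool:
    """Interpret a CSV cell as a boolean (assumed encodings only)."""
    return text.strip().lower() in TRUE_VALUES


def split_multi(text: str) -> list:
    """Split a multi-valued cell on the pipe symbol and drop empty parts."""
    return [part.strip() for part in text.split("|") if part.strip()]


def load_pairs(handle, delimiter=","):
    """Read publication pairs from an open CSV file object."""
    rows = []
    for row in csv.DictReader(handle, delimiter=delimiter):
        row["sameentity"] = parse_bool(row["sameentity"])
        row["samename"] = parse_bool(row["samename"])
        # Hypothetical column names following the p[1|2][author|editor] pattern.
        for column in ("p1author", "p2author", "p1editor", "p2editor"):
            if column in row:
                row[column] = split_multi(row[column] or "")
        rows.append(row)
    return rows


# Invented sample data; names and keys are placeholders, not real DBLP entries.
SAMPLE = (
    "sameentity,samename,authorname1,authorname2,key1,key2,p1author,p2author\n"
    "true,false,A. B. Example,A. Example,conf/x/1,journals/y/2,"
    "A. B. Example|C. Other,A. Example\n"
)

pairs = load_pairs(io.StringIO(SAMPLE))
print(pairs[0]["sameentity"], pairs[0]["p1author"])
```

For the real file, pass an open file object instead of the in-memory sample and adjust the delimiter and boolean encoding to whatever the download actually uses.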

Usage:

If you would like to use this dataset, please cite our paper [1].

References

  • [1] Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). 243–248.
  • [2] A Duplicate Detection Benchmark for XML (and Relational) Data. Weis, Melanie; Naumann, Felix; Brosy, Franziska (2005).