Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students, we offer German-language lectures on database systems as well as paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration among students, groups, and universities. We strive to make most of our datasets and source code available.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

Description

DBLP is a bibliographic database for computer science. A central data quality problem in DBLP is assigning papers to the correct author entities.

Dataset 1:

This dataset provides bibliographical information about computer science journals and proceedings. It includes 50,000 objects.

Download:

DBLP Dataset 1

Used in:

A Duplicate Detection Benchmark for XML (and Relational) Data

Usage:

If you would like to use this dataset, please cite our paper [2].

Dataset 2:

The data set has been constructed from parts of DBLP that were automatically cleaned (using fine-tuned heuristics) or manually cleaned (due to author requests), where different aliases for a person are known or ambiguous names have been resolved.

The data set consists of paper reference pairs that can be assigned to the following categories:

- Two papers from the same author
- Two papers from the same author with different name aliases (e.g., with/without middle initial)
- Two papers from different authors with different names
- Two papers from different authors with the same name

For each paper pair, the matching task is to decide whether the two papers were written by the same author. The data set contains 2,500 paper pairs per category (10,000 in total). This does not represent the original distribution of ambiguous or alias names in DBLP (where about 99.2 % of the author names are non-ambiguous), but makes the matching task more difficult and interesting.
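The effect of the balanced categories can be illustrated with a minimal sketch. It assumes that the first category ("same author") means both papers carry the same author name, and it uses synthetic labels rather than the real file; the function names are hypothetical. Under these assumptions, a baseline that predicts "same author" whenever the names are identical gets only half of the pairs right, because it fails on every alias pair and every same-name pair.

```python
from collections import Counter


def pair_category(same_entity: bool, same_name: bool) -> str:
    """Map the two boolean labels of a pair to one of the four categories."""
    if same_entity and same_name:
        return "same author, same name"  # assumed reading of category 1
    if same_entity:
        return "same author, different aliases"
    if same_name:
        return "different authors, same name"
    return "different authors, different names"


def name_only_baseline(same_name: bool) -> bool:
    """Naive matcher: predict 'same author' iff the names are identical."""
    return same_name


# Synthetic labels mirroring the stated balance: 2,500 pairs per category.
labels = [(True, True), (True, False), (False, False), (False, True)] * 2500

counts = Counter(pair_category(entity, name) for entity, name in labels)
correct = sum(name_only_baseline(name) == entity for entity, name in labels)
accuracy = correct / len(labels)

print(counts)
print(f"name-only baseline accuracy: {accuracy:.2f}")
```

Any useful matcher therefore has to draw on the publication details (co-authors, venues, titles) and not on the names alone.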

Download:

DBLP Dataset 2

Format:

CSV file: Each line corresponds to one publication pair. One author of each publication has been selected for comparison.
Column descriptions:
- sameentity (boolean): author1 is same entity as author2
- samename (boolean): author1 and author2 have same name
- authorname1, authorname2 (string): names of authors to be compared
- key1, key2 (string): DBLP keys of publications
- p1*, p2* (string): details of compared publications (p1, p2) as given in DBLP database
- p[1|2]booktitlefull, p[1|2]journalfull (string): full names of given journal/book title abbreviation (matched to dictionary, may contain errors)
- p[1|2][author|editor] (string): multi-valued attribute values for authors and editors, separated by pipe symbol "|"
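
The column layout above could be read along the following lines. This is only a sketch under several assumptions: the file has a header row with the column names listed above, fields are comma-separated, and booleans are written as one of a few common tokens such as "true" or "1". The actual file may differ in all three respects. The sample row, including its names and keys, is invented for illustration and is not taken from the dataset.

```python
import csv
import io

# Invented sample with a subset of the columns; NOT real data from the file.
SAMPLE = (
    "sameentity,samename,authorname1,authorname2,key1,key2,p1author,p2author\n"
    "true,false,Jane Q. Doe,Jane Doe,example/key1,example/key2,"
    "Jane Q. Doe|John Roe,Jane Doe\n"
)

# Assumption: the boolean encoding is one of these tokens (case-insensitive).
TRUE_TOKENS = {"true", "t", "1", "yes"}


def parse_bool(text: str) -> bool:
    """Interpret a boolean column value under the assumed encoding."""
    return text.strip().lower() in TRUE_TOKENS


def split_multi(text: str) -> list[str]:
    """Split a multi-valued attribute on the pipe symbol, dropping empties."""
    return [value for value in text.split("|") if value]


def read_pairs(handle):
    """Yield one dictionary per publication pair."""
    for row in csv.DictReader(handle):
        yield {
            "same_entity": parse_bool(row["sameentity"]),
            "same_name": parse_bool(row["samename"]),
            "names": (row["authorname1"], row["authorname2"]),
            "keys": (row["key1"], row["key2"]),
            "p1_authors": split_multi(row.get("p1author", "")),
            "p2_authors": split_multi(row.get("p2author", "")),
        }


pairs = list(read_pairs(io.StringIO(SAMPLE)))
print(pairs[0])
```

For the downloaded file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle, for example `open(path, newline="", encoding="utf-8")`, where the path and the encoding are again assumptions to verify against the file.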

Usage:

If you would like to use this dataset, please cite our paper [1].

References

[1] Frequency-aware Similarity Measures. Lange, Dustin; Naumann, Felix (2011). pp. 243–248.

[2] A Duplicate Detection Benchmark for XML (and Relational) Data. Weis, Melanie; Naumann, Felix; Brosy, Franziska (2005).

Sources

DBLP: http://www.informatik.uni-trier.de/~ley/db/
