Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students, we offer German-language lectures on database systems in addition to paper- or project-oriented seminars. In a one-year bachelor project, students complete their studies in cooperation with external partners. For master students, we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make most of our datasets and source code available.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

Content

Dataset 1 (9763 CDs)
Dataset 2 (1000 CDs)

Dataset 1

This dataset includes 9763 CDs randomly extracted from freeDB.

Dataset
- The data was converted from plain text to XML and is packed into a ZIP archive.
- It is also available in a tab separated value (TSV) format. (9,763 objects - TSV format)
  - Same, but lower-cased and with special characters removed. (9,763 objects - TSV format)
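The simplified variant can be approximated with a small normalization step. The following is a minimal sketch, not the procedure used to produce the published file: the exact set of characters that were removed is not stated on this page, so the rule below (keep letters, digits, and spaces) and the function names `normalize_value` and `normalize_row` are assumptions for illustration.

```python
import re


def normalize_value(value: str) -> str:
    """Lower-case a field and drop everything except letters, digits, and spaces.

    Assumption: "special characters" means punctuation and symbols.
    """
    lowered = value.lower()
    # Replace punctuation, symbols, and underscores with a space.
    cleaned = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse the runs of whitespace left behind by the removal.
    return re.sub(r"\s+", " ", cleaned).strip()


def normalize_row(line: str) -> str:
    """Normalize every tab-separated field of one TSV line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(normalize_value(field) for field in fields)
```

Normalizing each field separately keeps the tab delimiters intact, so the column layout of the TSV file is unchanged.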
Duplicates
- A list of all duplicates in the dataset. (298 objects - XML format)
- An updated list (2018) that adds a transitive duplicate pair we had missed. (299 objects - XML format)
- A further update (2018) that adds one more pair from the transitive closure. (300 objects - TSV format)
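Whether a list of duplicate pairs is transitively closed can be checked with a union-find structure. The sketch below is an illustration under assumptions, not the tooling used to build the lists above: it assumes the pairs have already been read into tuples of record identifiers, and the identifiers in the usage example (`cd1`, `cd2`, and so on) are hypothetical.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every pair implied by the given duplicate pairs.

    If (a, b) and (b, c) are duplicates, then (a, c) is one as well.
    Pairs are returned as sorted tuples so each appears exactly once.
    """
    parent = {}

    def find(x):
        # Locate the cluster representative, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of the two records in each pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group records by their cluster representative.
    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), []).append(item)

    # Emit every pair within each cluster.
    closed = set()
    for members in clusters.values():
        closed.update(combinations(sorted(members), 2))
    return closed


# Usage with hypothetical identifiers: find pairs missing from a list.
listed = {("cd1", "cd2"), ("cd2", "cd3"), ("cd7", "cd8")}
missing = transitive_closure(listed) - listed
print(missing)
```

Comparing the closure with the original list shows any pairs that are implied but not listed explicitly.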
Non-duplicates
- We generate non-duplicate pairs by following a systematic approach. (3,000 objects - TSV format)
  - Using an updated, further simplified approach across datasets. (3,000 objects - TSV format)
Schema of the dataset
This is a PDF representation of the schema of the dataset.

Dataset 2

This dataset consists of 500 clean CD objects extracted from the FreeDB database and 500 artificially generated duplicates (one duplicate for each CD) created with the Dirty XML Data Generator.

Dataset

Schema of the dataset
Here you can find the schema of the dataset.

Sources

http://www.freedb.org/

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
E-Mail: office-naumann(at)hpi.de

To visit us, please see these directions.

Project highlights

Metanome: Big Data Profiling

Metis: Data Quality Assessment

Janus: Change exploration

KITQAR: AI and Data Quality

News

17.11.2025 | New book chapter about "Data Quality for Enterprise AI" published

01.11.2025 | Paper accepted at WOP@ISWC

29.09.2025 | Paper accepted at NeurIPS 2025

29.09.2025 | Paper accepted at SIGMOD 2026

09.07.2025 | Paper accepted in SIGMOD Record
