Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students we offer German lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make available most of our datasets and source code.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

Contact

Felix Naumann
Melanie Weis

Overview

One problem of data integration is the occurrence of several different representations of the same real-world object, which are called duplicates. The goal of this project is to devise algorithms that detect different representations of objects in XML data. To this end, we develop methods that consider descriptive data of an object as well as relationships to other objects, e.g., in children, parent, or sibling XML elements. Traditional relational approaches consider only data stored in a single relational table, i.e., previous methods do not consider relationships.
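The idea of combining an object's descriptive data with its relationships can be illustrated with a minimal sketch. It is not the project's actual similarity measure: the function names, the equal weighting, and the sample movie elements are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher


def text_similarity(a, b):
    """String similarity in [0, 1] on normalized text."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def element_similarity(e1, e2, text_tag, child_tag, weight=0.5):
    """Combine descriptive data with related (child) elements.

    `weight` is a hypothetical tuning parameter: the share of the score
    that comes from the element's own descriptive text.
    """
    own = text_similarity(e1.findtext(text_tag, ""), e2.findtext(text_tag, ""))
    kids1 = {c.text.strip().lower() for c in e1.findall(child_tag) if c.text}
    kids2 = {c.text.strip().lower() for c in e2.findall(child_tag) if c.text}
    union = kids1 | kids2
    # Jaccard overlap of the child values stands in for "relationships".
    related = len(kids1 & kids2) / len(union) if union else 0.0
    return weight * own + (1 - weight) * related


# Hypothetical sample data: m1 and m2 describe the same movie.
m1 = ET.fromstring(
    "<movie><title>The Matrix</title>"
    "<actor>Keanu Reeves</actor><actor>Carrie-Anne Moss</actor></movie>"
)
m2 = ET.fromstring(
    "<movie><title>Matrix, The</title>"
    "<actor>Keanu Reeves</actor><actor>Carrie-Anne Moss</actor></movie>"
)
m3 = ET.fromstring(
    "<movie><title>Speed</title>"
    "<actor>Keanu Reeves</actor><actor>Sandra Bullock</actor></movie>"
)

same = element_similarity(m1, m2, "title", "actor")
different = element_similarity(m1, m3, "title", "actor")
print(same > different)
```

In this sketch the shared actors raise the score of the true duplicate pair even though the titles are written differently, which is the kind of evidence a comparison restricted to a single table would not see.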

Data cleaning is the process of correcting errors in data, e.g., typographical errors, outdated information, or inconsistent formats. Duplicate detection is a crucial step in data cleaning, but we also consider further cleaning steps.

Duplicate Detection in Tree and Graph Data

We propose three duplicate detection algorithms for XML data. The goal of all three algorithms is to detect a maximum number of true duplicates without detecting false duplicates (effectiveness) in a reasonable amount of time (efficiency).

The top-down algorithm is useful for efficient and effective duplicate detection in hierarchical XML data, assuming that the nesting of XML elements reflects 1:N relationships in the real world. This is true, for instance, for XML elements that represent states and nest city elements as children, because a city can only be located in a single state. By contrast, movie and actor XML elements stand in an M:N relationship, although actors are nested under movies. In such scenarios, the top-down algorithm no longer performs effective duplicate detection, so we propose the bottom-up algorithm for them. In general, an XML document may reflect a graph structure, e.g., if key references are used. An actor XML element nested under a movie XML element may, for instance, be one of possibly many references to an actor element. We developed a third algorithm for scenarios where relationships form a graph. The algorithm exploits the additional relationships to improve effectiveness.
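The top-down idea for the state and city example can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published algorithm: the `name` attribute, the string-similarity threshold, the function names, and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold):
    """Treat two strings as matching if their similarity reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def top_down_duplicates(root, parent_tag, child_tag, threshold=0.8):
    """Compare children only under parents already classified as duplicates.

    This pruning relies on the 1:N assumption: a child (city) belongs to
    exactly one parent (state), so children of non-duplicate parents
    cannot be duplicates of each other.
    """
    parent_dups, child_dups = [], []
    for p1, p2 in combinations(root.findall(parent_tag), 2):
        if not similar(p1.get("name", ""), p2.get("name", ""), threshold):
            continue  # children of these two parents are never compared
        parent_dups.append((p1.get("name"), p2.get("name")))
        for c1 in p1.findall(child_tag):
            for c2 in p2.findall(child_tag):
                if similar(c1.get("name", ""), c2.get("name", ""), threshold):
                    child_dups.append((c1.get("name"), c2.get("name")))
    return parent_dups, child_dups


# Hypothetical sample document with one misspelled duplicate state.
doc = ET.fromstring(
    "<country>"
    '<state name="California"><city name="Los Angeles"/><city name="Springfield"/></state>'
    '<state name="Californa"><city name="Los Angles"/></state>'
    '<state name="Illinois"><city name="Springfield"/></state>'
    "</country>"
)

parent_dups, child_dups = top_down_duplicates(doc, "state", "city")
print(parent_dups)
print(child_dups)
```

The two Springfield elements are never compared because their states are not duplicates, which saves comparisons and avoids a false match. Under an M:N nesting such as actors under movies, the same pruning would skip true duplicates that appear under different parents, which is why a different strategy is needed there.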

XML Data Cleaning

We developed a system for XML data cleaning in cooperation with INRIA Futurs, France. This system, named XClean, allows a declarative specification of an XML cleaning process. The resulting cleaning program is compiled to an XQuery, which can then be executed on any XQuery processor. Further information on XClean is available at http://www.hpi.uni-potsdam.de/~naumann/xclean/

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
E-Mail: office-naumann(at)hpi.de

To visit us, please see these directions.
