Our group includes PostDocs, PhD students, and student assistants, and is headed by Prof. Felix Naumann. If you are interested in joining our team, please contact Felix Naumann.

For bachelor students we offer German-language lectures on database systems in addition to paper- or project-oriented seminars. Within a one-year bachelor project, students finalize their studies in cooperation with external partners. For master students we offer courses on information integration, data profiling, and information retrieval, complemented by specialized seminars and master projects, and we advise master theses.

Most of our research is conducted in the context of larger research projects, in collaboration across students, across groups, and across universities. We strive to make most of our datasets and source code available.

Please do not hesitate to reach out to us directly if you cannot find a paper, slides, or other research artifacts.

Content

Authors

Jan Hegewald, Felix Naumann, Melanie Weis

Description

This paper describes XStruct, a tool that automatically extracts the schema of XML files. You can find the tool here.

Abstract

XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence, much XML data does not have a schema or is not accompanied by its schema. To gain the benefits of having a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), this schema must be extracted.

In this paper we present an automatic technique, XStruct, for XML Schema extraction. Building on ideas from [1], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element's contents correctly but generalized to a reasonable degree. Our approach features several advantages over known techniques: XStruct scales to very large documents (beyond 1 GB) both in time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; and it detects datatypes and attributes. Experiments confirm these features and properties.

[1] J.-K. Min, J.-Y. Ahn, and C.-W. Chung. Efficient extraction of schemas for XML documents. Information Processing Letters, 85:7-12, 2003.
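
To make the idea of schema extraction more concrete, the following is a minimal sketch of inferring a content model per element from example XML. It is not XStruct's algorithm or its heuristics, which the abstract does not spell out. The function names (`collapse_repeats`, `guess_datatype`, `infer_schema`), the three datatypes, and the generalization rule (collapse runs of the same child into `x+`, fall back to a choice over all observed children when occurrences disagree) are assumptions made purely for illustration. The sketch also loads the whole document into memory, so it does not reflect the scalability described above.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import groupby


def collapse_repeats(tags):
    """Collapse runs of the same child tag: [a, b, b] -> ('a', 'b+')."""
    out = []
    for tag, run in groupby(tags):
        out.append(tag + "+" if len(list(run)) > 1 else tag)
    return tuple(out)


def guess_datatype(values):
    """Pick the narrowest of three illustrative datatypes that fits all values."""
    if values and all(re.fullmatch(r"[+-]?\d+", v) for v in values):
        return "integer"
    if values and all(re.fullmatch(r"[+-]?\d+(\.\d+)?", v) for v in values):
        return "decimal"
    return "string"


def infer_schema(xml_text):
    """Return {element name: (content model, sorted attribute names)}."""
    child_seqs = defaultdict(set)   # observed (collapsed) child sequences
    texts = defaultdict(list)       # text values of leaf elements
    attrs = defaultdict(set)        # attribute names seen per element
    for elem in ET.fromstring(xml_text).iter():
        child_seqs[elem.tag].add(collapse_repeats([c.tag for c in elem]))
        attrs[elem.tag].update(elem.attrib)
        if len(elem) == 0 and elem.text and elem.text.strip():
            texts[elem.tag].append(elem.text.strip())

    schema = {}
    for tag, seqs in child_seqs.items():
        if seqs == {()}:
            # Leaf element everywhere: describe it by a datatype.
            model = guess_datatype(texts[tag])
        elif len(seqs) == 1:
            # All occurrences agree: keep the exact sequence.
            model = ", ".join(next(iter(seqs)))
        else:
            # Occurrences disagree: crude over-generalization to a choice.
            names = sorted({t.rstrip("+") for seq in seqs for t in seq})
            model = "(" + " | ".join(names) + ")*"
        schema[tag] = (model, sorted(attrs[tag]))
    return schema


# Hypothetical sample document, used only to exercise the sketch.
SAMPLE = (
    "<lib>"
    "<book id='1'><title>A</title><author>X</author><author>Y</author>"
    "<year>2003</year></book>"
    "<book id='2'><title>B</title><author>Z</author><author>W</author>"
    "<year>2005</year></book>"
    "</lib>"
)
```

Calling `infer_schema(SAMPLE)` yields the content model `title, author+, year` with attribute `id` for `book`, and the datatype `integer` for `year`. The fallback to a choice over all children accepts every observed document but is far more permissive than a carefully generalized expression, which is exactly the trade-off between correctness and minimality that a real extractor has to balance.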

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
E-Mail: office-naumann(at)hpi.de

To visit us, please see these directions.
