Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Authors

Jan Hegewald, Felix Naumann, Melanie Weis

Description

This paper describes XStruct, a tool that automatically extracts a schema from XML files. You can find the tool here.

Abstract

XML is the de facto standard format for data exchange on the Web. While it is fairly simple to generate XML data, it is a complex task to design a schema and then guarantee that the generated data is valid according to that schema. As a consequence, much XML data does not have a schema or is not accompanied by its schema. To gain the benefits of having a schema (efficient querying and storage of XML data, semantic verification, data integration, etc.), this schema must be extracted.

In this paper we present an automatic technique, XStruct, for XML Schema extraction. Based on ideas of [1], XStruct extracts a schema for XML data by applying several heuristics to deduce regular expressions that are 1-unambiguous and describe each element's contents correctly but generalized to a reasonable degree. Our approach features several advantages over known techniques: XStruct scales to very large documents (beyond 1GB) both in time and memory consumption; it is able to extract a general, complete, correct, minimal, and understandable schema for multiple documents; it detects datatypes and attributes. Experiments confirm these features and properties.
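To make the idea of schema extraction more concrete, the following Python sketch summarizes an XML document per element name: which child elements occur and how often, which attributes appear, and a guessed datatype for text content. It is a deliberately simplified illustration and not XStruct's actual heuristics. In particular, it ignores the order of child elements, so it does not deduce regular expressions or check 1-unambiguity. The names `summarize` and `infer_datatype` and the three datatype labels are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Illustrative patterns only; real XML Schema datatypes are richer than this.
INTEGER = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)")


def infer_datatype(values):
    """Return the narrowest of 'integer', 'decimal', 'string' fitting all values."""
    if values and all(INTEGER.fullmatch(v) for v in values):
        return "integer"
    if values and all(DECIMAL.fullmatch(v) for v in values):
        return "decimal"
    return "string"


def summarize(xml_text):
    """Collect a per-element summary (hypothetical helper, not XStruct's algorithm)."""
    root = ET.fromstring(xml_text)
    child_counts = defaultdict(list)  # element name -> one Counter per instance
    attributes = defaultdict(set)     # element name -> attribute names seen
    texts = defaultdict(list)         # element name -> text values of leaf instances

    for elem in root.iter():
        child_counts[elem.tag].append(Counter(child.tag for child in elem))
        attributes[elem.tag].update(elem.attrib)
        text = (elem.text or "").strip()
        if text and len(elem) == 0:
            texts[elem.tag].append(text)

    summary = {}
    for tag, counters in child_counts.items():
        children = {}
        for child in sorted({name for c in counters for name in c}):
            lo = min(c[child] for c in counters)
            hi = max(c[child] for c in counters)
            # Occurrence marker in regular-expression style: "", "?", "+", "*".
            if hi == 1:
                children[child] = "" if lo == 1 else "?"
            else:
                children[child] = "+" if lo >= 1 else "*"
        summary[tag] = {
            "children": children,
            "attributes": sorted(attributes[tag]),
            "datatype": infer_datatype(texts[tag]) if texts[tag] else None,
        }
    return summary


sample = (
    "<lib>"
    "<book id='1'><title>A</title><year>2003</year>"
    "<author>X</author><author>Y</author></book>"
    "<book><title>B</title><year>1999</year></book>"
    "</lib>"
)
result = summarize(sample)
```

For the sample document, `result["book"]["children"]` marks `author` as repeatable and optional while `title` and `year` are required exactly once, and `result["year"]["datatype"]` is guessed as an integer. A full extractor would additionally have to respect child order and generalize across many instances without loading the whole document into memory.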

[1] J.-K. Min, J.-Y. Ahn, and C.-W. Chung. Efficient extraction of schemas for XML documents. Information Processing Letters, 85:7-12, 2003.