One problem in data integration is that several different representations of the same real-world object may occur; such representations are called duplicates. The goal of this project is to devise algorithms that detect different representations of the same object in XML data. To this end, we develop methods that consider an object's descriptive data as well as its relationships to other objects, e.g., in child, parent, or sibling XML elements. In contrast, traditional relational approaches consider only data stored in a single relational table, i.e., they do not take relationships into account.
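As a rough illustration of combining descriptive data with relationships, the following Python sketch scores pairs of XML elements. It is not the project's algorithm: the sample data, the tag names `title` and `actor`, the helper names, the equal weighting, and the use of `difflib` as the string measure are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical sample data: the first two <movie> elements describe the same film.
SAMPLE = """
<movies>
  <movie><title>The Matrix</title><actor>Keanu Reeves</actor><actor>Carrie-Anne Moss</actor></movie>
  <movie><title>Matrix, The</title><actor>Keanu Reeves</actor><actor>Carrie Anne Moss</actor></movie>
  <movie><title>Speed</title><actor>Keanu Reeves</actor><actor>Sandra Bullock</actor></movie>
</movies>
"""


def text_similarity(a, b):
    """String similarity in [0, 1]; difflib is only a stand-in measure here."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match_average(xs, ys):
    """Average, over the strings in xs, of the best similarity to any string in ys."""
    if not xs or not ys:
        # Two empty lists agree trivially; one empty list agrees with nothing.
        return 1.0 if not xs and not ys else 0.0
    return sum(max(text_similarity(x, y) for y in ys) for x in xs) / len(xs)


def element_similarity(e1, e2, desc_tag="title", rel_tag="actor", rel_weight=0.5):
    """Combine descriptive data (desc_tag) with related child elements (rel_tag)."""
    desc = text_similarity(e1.findtext(desc_tag, default=""),
                           e2.findtext(desc_tag, default=""))
    rel1 = [child.text or "" for child in e1.findall(rel_tag)]
    rel2 = [child.text or "" for child in e2.findall(rel_tag)]
    # Average both directions so that the score is symmetric.
    rel = 0.5 * (best_match_average(rel1, rel2) + best_match_average(rel2, rel1))
    return (1 - rel_weight) * desc + rel_weight * rel


movies = ET.fromstring(SAMPLE).findall("movie")
for i in range(len(movies)):
    for j in range(i + 1, len(movies)):
        print(i, j, round(element_similarity(movies[i], movies[j]), 3))
```

In this sketch the reordered title alone gives only a moderate score for the first two movies, while the nearly identical actor children raise the combined score well above that of the unrelated pairs. A measure restricted to the title field would miss this evidence.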
Data cleaning is the process of correcting errors in data, e.g., typographical errors, outdated information, or inconsistent formats. Duplicate detection is a crucial step in data cleaning, but we also consider further cleaning steps.