dirtyxml

Project Members: Sven Puhlmann, Felix Naumann, Melanie Weis

Web site: www.hpi.uni-potsdam.de/~naumann/projekte/completed_projects/dirtyxml.html

Integrating data from several sources usually requires algorithms that clean the integrated data, and testing such algorithms requires "dirty" sample data. The Dirty XML Data Generator is a tool written in Java that creates a dirty XML data file from a clean XML document and a set of parameters. Depending on the parameter set, the generated data contains errors of different types, such as duplicates and misspellings. It is intended for benchmarking algorithms that clean nested, integrated XML data.

The Dirty XML Data Generator was implemented by Sven Puhlmann in the context of a student research project.

Content

Main Features
Sample of dirty XML data generation
Terms of use
Download
Detailed Documentation

1. Main Features

Flexible and fast generation of dirty XML data

Extensible implementation

Algorithms that pollute character data in a specific way can be added by implementing a very simple Java interface.

New algorithms are detected automatically when they are used in the parameter XML file.

Clearly arranged parameter definition in an XML file with reusable components: the parameterised algorithm specifications

Parameters can be nested in the same way as the elements and attributes in the clean XML file.
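To illustrate the extension mechanism listed above, here is a minimal sketch of what a pluggable pollution algorithm could look like. The interface name PollutionAlgorithm, its method signature, and the class ReplaceWithDigit are hypothetical and chosen for illustration only; the generator's actual interface is described in the Technical Report and may differ.

```java
import java.util.Map;
import java.util.Random;

// Hypothetical extension point; the generator's real interface may use
// different names and a different signature.
interface PollutionAlgorithm {
    String pollute(String text, Map<String, String> params, Random rng);
}

// Example plug-in: replaces one randomly chosen character with a random digit.
class ReplaceWithDigit implements PollutionAlgorithm {
    @Override
    public String pollute(String text, Map<String, String> params, Random rng) {
        if (text.isEmpty()) {
            return text;
        }
        // Assumed parameter, modelled on includeFirstChar in the sample parameter file.
        boolean includeFirst =
                Boolean.parseBoolean(params.getOrDefault("includeFirstChar", "true"));
        int start = includeFirst ? 0 : 1;
        if (start >= text.length()) {
            return text; // no eligible position to change
        }
        int pos = start + rng.nextInt(text.length() - start);
        char digit = (char) ('0' + rng.nextInt(10));
        StringBuilder sb = new StringBuilder(text);
        sb.setCharAt(pos, digit);
        return sb.toString();
    }
}
```

Passing the random number generator in as an argument keeps the sketch reproducible: the same seed yields the same polluted value.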

2. Sample of dirty XML data generation

Suppose there is a clean XML file persons.xml containing a set of persons each with a name, an address and some additional data.

<?xml version="1.0"?>
<persons>
<person ID="1" age="57" sex="m">
    <firstname>Henry</firstname>
    <lastname>Evening</lastname>
    <address>157 Short St., Sydney NSW 2113, Australia</address>
</person>
<person ID="2" age="21" sex="f">
    <firstname>Olga</firstname>
    <lastname>Tschernikova</lastname>
    <address>101000 Moscow, Russia</address>
</person>
<person ID="17" sex="f">
    <firstname>Maria</firstname>
    <lastname>Lamprecht</lastname>
    <address>Meiereiplatz 27, 16223 Mehrow, Germany</address>
</person>
</persons>

To test a certain algorithm, we want to create a dirty XML file based on the clean one. For instance, it should contain some duplicates and perhaps lack a specified share of the data values. In addition, the values of some attributes and the text content of some elements should contain misspellings and other data errors.

To define how to pollute the data in the way described above, we write an XML file persons_params.xml containing a set of parameters:

<?xml version="1.0"?>
<dirtyXMLparameters
    xmlns:xs="http://www.w3.org/2001/XMLSchema-instance"
   xs:noNamespaceSchemaLocation="../dist/DirtyXMLParameters.xsd"
   valid4Desc="true"
   errorsInAncestors="false">

<algo name="swap1" baseAlgo="swapChars">
    <parameter name="includeFirstChar" value="true"/>
    <parameter name="includeLastChar" value="true"/>
    <parameter name="minSwaps" value="1"/>
    <parameter name="maxSwaps" value="2"/>
</algo>

<algo name="swap2" baseAlgo="swapChars">
    <parameter name="includeFirstChar" value="false"/>
    <parameter name="includeLastChar" value="true"/>
    <parameter name="minSwaps" value="2"/>
    <parameter name="maxSwaps" value="4"/>
</algo>

<algo name="del1" baseAlgo="deleteChar">
    <parameter name="includeFirstChar" value="false"/>
    <parameter name="includeLastChar" value="true"/>
</algo>

<algo name="ins1" baseAlgo="insertChar">
    <parameter name="includeFirstChar" value="false"/>
    <parameter name="includeLastChar" value="true"/>
    <parameter name="includeUpper" value="false"/>
    <parameter name="includeLower" value="true"/>
    <parameter name="includeDigits" value="false"/>
</algo>

<dupElement name="person" delProb="0" dupProb="100" maxDup="1">
    <attribute name="sex">
      <chars delProb="30" changeProb="75">
        <changeAlgo algoName="swap1" useProb="100"/>
      </chars>
    </attribute>
    <dupElement name="firstname" delProb="25" dupProb="60" maxDup="2">
      <chars delProb="30" changeProb="80">
        <changeAlgo algoName="del1" useProb="50"/>
        <changeAlgo algoName="ins1" useProb="50"/>
      </chars>
    </dupElement>
    <dupElement name="address" delProb="0" dupProb="30" maxDup="1">
      <chars delProb="0" changeProb="70">
        <changeAlgo algoName="swap2" useProb="80"/>
        <changeAlgo algoName="del1" useProb="20"/>
      </chars>
    </dupElement>
</dupElement>
</dirtyXMLparameters>

In the first part of the parameter file (lines 8 to 33), four algorithms are defined that build on the three base algorithms SwapChars, DeleteChar, and InsertChar (referenced as swapChars, deleteChar, and insertChar in the baseAlgo attribute). These parameterised algorithms are applied to the character data contained in elements and attribute values.
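As a rough illustration of how a parameterised base algorithm might behave, the following sketch implements one plausible reading of swapChars. It is not the generator's actual code. It assumes that a swap exchanges two neighbouring characters (consistent with the sample output further down, where Moscow becomes oMscow), that includeFirstChar and includeLastChar decide whether the first and last characters may take part in a swap, and that the number of swaps is drawn uniformly between minSwaps and maxSwaps.

```java
import java.util.Random;

class SwapCharsSketch {
    // Assumption: one "swap" exchanges two neighbouring characters.
    static String swapChars(String text, boolean includeFirstChar, boolean includeLastChar,
                            int minSwaps, int maxSwaps, Random rng) {
        // Allowed index range for the left character of a swapped pair.
        int lo = includeFirstChar ? 0 : 1;
        int hi = text.length() - (includeLastChar ? 2 : 3); // inclusive
        if (hi < lo) {
            return text; // too short to swap under these settings
        }
        // Assumption: the swap count is uniform in [minSwaps, maxSwaps].
        int swaps = minSwaps + rng.nextInt(maxSwaps - minSwaps + 1);
        char[] chars = text.toCharArray();
        for (int i = 0; i < swaps; i++) {
            int left = lo + rng.nextInt(hi - lo + 1);
            char tmp = chars[left];
            chars[left] = chars[left + 1];
            chars[left + 1] = tmp;
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // Mirrors the "swap2" definition: first character excluded, 2 to 4 swaps.
        System.out.println(
                swapChars("101000 Moscow, Russia", false, true, 2, 4, new Random()));
    }
}
```

Note that in this sketch two swaps may hit the same position and cancel each other out, so the result can occasionally equal the input.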

In line 6, errorsInAncestors="false" specifies that the original elements from which the duplicates are derived should not be polluted.

The second part of the file (beginning with line 35) specifies the elements from which duplicates should be created.

For example, the character data of the address element is polluted using two different algorithms, swap2 and del1, which are used with a probability of 0.8 and 0.2, respectively. Note that the useProb values must add up to 1 (that is, 100%).
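The requirement that the probabilities add up to 100% follows naturally if exactly one algorithm is chosen per change. A minimal sketch of such a weighted choice is shown below; the class ChangeAlgoPicker and its method are hypothetical, and whether the generator selects and validates in this way is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

class ChangeAlgoPicker {
    // Picks one algorithm name according to useProb weights given in percent.
    static String pick(Map<String, Integer> useProbs, Random rng) {
        int total = 0;
        for (int p : useProbs.values()) {
            total += p;
        }
        // Assumption: the sum is validated before any choice is made.
        if (total != 100) {
            throw new IllegalArgumentException(
                    "useProb values must add up to 100, got " + total);
        }
        int roll = rng.nextInt(100); // uniform in 0..99
        int cumulative = 0;
        for (Map.Entry<String, Integer> e : useProbs.entrySet()) {
            cumulative += e.getValue();
            if (roll < cumulative) {
                return e.getKey();
            }
        }
        throw new IllegalStateException("unreachable when the weights sum to 100");
    }

    public static void main(String[] args) {
        // Weights taken from the address element of the sample parameter file.
        Map<String, Integer> address = new LinkedHashMap<>();
        address.put("swap2", 80);
        address.put("del1", 20);
        System.out.println(pick(address, new Random()));
    }
}
```

With these weights, a roll of 0 to 79 selects swap2 and a roll of 80 to 99 selects del1.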

For a detailed explanation of the parameters please have a look at the Detailed Documentation.

Executing the Dirty XML Data Generator with the clean XML file, the parameter XML file, and the name of the dirty XML file (here: persons_dirty.xml) as input leads to the following result:

<?xml version="1.0" encoding="UTF-8"?>
<persons>
<person ID="1" age="57" sex="m">
    <firstname>Henry</firstname>
    <lastname>Evening</lastname>
    <address>157 Short St., Sydney NSW 2113, Australia</address>
</person>
<person ID="2" age="21" sex="f">
    <firstname>Olga</firstname>
    <lastname>Tschernikova</lastname>
    <address>101000 Moscow, Russia</address>
</person>
<person ID="17" sex="f">
    <firstname>Maria</firstname>
    <lastname>Lamprecht</lastname>
    <address>Meiereiplatz 27, 16223 Mehrow, Germany</address>
</person>
<person ID="1" age="57" sex="m">
    <firstname>Henry</firstname>
    <lastname>Evening</lastname>
    <address>157 Short St., Sydney NSW 2113, Australia</address>
    <firstname>Henry</firstname>
</person>
<person ID="2" age="21" sex="f">
    <firstname>Olga</firstname>
    <lastname>Tschernikova</lastname>
    <address>101000 Moscow, Russia</address>
    <firstname>Oga</firstname>
    <address>101000 oMscow ,Russia</address>
</person>
<person ID="17">
    <firstname>Maria</firstname>
    <lastname>Lamprecht</lastname>
    <address>Meiereiplatz 27, 16223 Mehrow, Germany</address>
</person>
</persons>

The first 17 lines contain the person elements of the source file, which have not been polluted (as requested with the errorsInAncestors attribute in the root element of the parameter file). Lines 18 to 35 contain duplicates of the same elements, polluted according to the parameters. For person we defined a duplication probability of 1 (dupProb="100") and that at most one duplicate should be created (maxDup="1").
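A quick way to check a generated file against these expectations is to count how many person elements share each ID. The sketch below uses only the XML parser shipped with the JDK; the class name DuplicateCounter is hypothetical and the code is not part of the generator. It relies on the ID attribute being left unchanged, which appears to be the case in this sample because ID is not listed in the parameter file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DuplicateCounter {
    // Counts how many <person> elements share each ID attribute value.
    static Map<String, Integer> countById(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        NodeList persons = doc.getElementsByTagName("person");
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < persons.getLength(); i++) {
            String id = ((Element) persons.item(i)).getAttribute("ID");
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Reduced stand-in for persons_dirty.xml: three originals plus one duplicate each.
        String xml = "<persons>"
                + "<person ID=\"1\"/><person ID=\"2\"/><person ID=\"17\"/>"
                + "<person ID=\"1\"/><person ID=\"2\"/><person ID=\"17\"/>"
                + "</persons>";
        System.out.println(countById(xml));
    }
}
```

For the sample above, every ID should be counted twice: once for the original and once for its duplicate.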

3. Terms of use

The software is free for academic purposes. We would very much appreciate a short note or feedback on the usage.

For commercial use please contact Felix Naumann.

4. Download

You can choose between:

the complete distribution containing the JAR file, the required JDOM library, an example and the full Technical Report of the student research project (in German) and
the JAR file only. In this case you need to download the JDOM library as well and add it to your classpath.

5. Detailed Documentation

For further information please read the Technical Report (in German).
