dirtyxml
Project Members: Sven Puhlmann, Felix Naumann, Melanie Weis
Web site: www.hpi.uni-potsdam.de/~naumann/projekte/completed_projects/dirtyxml.html
When data from several sources is integrated, cleaning algorithms are applied to the integrated result. Testing these algorithms requires "dirty" sample data. The Dirty XML Data Generator is a tool written in Java that creates a dirty XML data file from a clean XML document and a set of parameters. Depending on the parameter set, the generated data can contain errors of different types, such as duplicates and misspellings. It is used to benchmark algorithms that clean nested, integrated XML data.
The Dirty XML Data Generator was implemented by Sven Puhlmann in the context of a student research project.
1. Main Features
- Flexible and fast generation of dirty XML data
- Extensible implementation
- Algorithms that pollute character data in a specific way can be added by implementing a very simple Java interface.
- Clearly arranged parameter definition in an XML file with reusable components: the parameterised algorithm specifications
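The extension point mentioned above could look roughly like the following sketch. The interface name CharPolluter, its method signature, and both implementing classes are hypothetical and chosen for illustration only; this page does not give the actual interface of the tool, so please consult the Technical Report for the real one.

```java
import java.util.Random;

// Hypothetical interface: the real name and signature used by the
// Dirty XML Data Generator are not given on this page.
interface CharPolluter {
    /** Returns a polluted copy of the given character data. */
    String pollute(String text, Random rng);
}

// Illustrative implementation: swaps one pair of adjacent characters.
final class SwapAdjacentChars implements CharPolluter {
    @Override
    public String pollute(String text, Random rng) {
        if (text.length() < 2) {
            return text; // nothing to swap
        }
        int i = rng.nextInt(text.length() - 1); // index of the left character
        char[] chars = text.toCharArray();
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
        return new String(chars);
    }
}

// Illustrative implementation: deletes one randomly chosen character.
final class DeleteOneChar implements CharPolluter {
    @Override
    public String pollute(String text, Random rng) {
        if (text.isEmpty()) {
            return text; // nothing to delete
        }
        int i = rng.nextInt(text.length());
        return text.substring(0, i) + text.substring(i + 1);
    }
}
```

Passing the random number generator as an argument is a design choice of this sketch: it makes a pollution run reproducible when a fixed seed is used.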
2. Sample of dirty XML data generation
Suppose there is a clean XML file persons.xml containing a set of persons each with a name, an address and some additional data.
To test a certain algorithm, we want to create a dirty XML file based on the clean one. For instance, it should contain some duplicates and lack a specified share of the data values. In addition, the values of some attributes and the text content of some elements should contain misspellings and data errors.
To define how to pollute the data in the way described above, we write an XML file persons_params.xml containing a set of parameters:
In the first part of the parameter file (lines 8 to 33), four algorithms are defined that build on the three base algorithms SwapChars, DeleteChar, and InsertChar. These parameterised algorithms are applied to character data contained in elements and attribute values.
In line 6, errorsInAncestors="false" specifies that the original elements from which the duplicates are derived are not polluted.
The second part of the file (beginning with line 35) specifies the elements from which duplicates are created.
An example: the character data of the element address is polluted using two different algorithms (swap2 and del1), applied with probabilities of 0.8 and 0.2, respectively. Note that the probabilities must add up to 1 (that is, 100%).
For a detailed explanation of the parameters, see the Detailed Documentation.
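The choice between several algorithms according to their probabilities can be sketched as follows. The class WeightedAlgorithmChooser is hypothetical and not part of the tool. It only illustrates one common way to implement such a choice (a cumulative sum compared against a uniform random number), together with the check that the probabilities add up to 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical helper, not taken from the Dirty XML Data Generator.
final class WeightedAlgorithmChooser {
    private final Map<String, Double> weights = new LinkedHashMap<>();

    WeightedAlgorithmChooser(Map<String, Double> weights) {
        double sum = 0.0;
        for (double p : weights.values()) {
            if (p < 0.0) {
                throw new IllegalArgumentException("negative probability: " + p);
            }
            sum += p;
        }
        // Allow a small tolerance for floating-point rounding.
        if (Math.abs(sum - 1.0) > 1e-9) {
            throw new IllegalArgumentException("probabilities must add up to 1, got " + sum);
        }
        this.weights.putAll(weights);
    }

    /** Picks an algorithm name with the configured probability. */
    String choose(Random rng) {
        double r = rng.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            cumulative += e.getValue();
            last = e.getKey();
            if (r < cumulative) {
                return last;
            }
        }
        return last; // guards against rounding error in the cumulative sum
    }
}
```

With the weights swap2 = 0.8 and del1 = 0.2 from the example, roughly four out of five calls to choose would return swap2.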
Executing the Dirty XML Data Generator with the clean XML file, the parameter XML file, and the name of the dirty XML file (here: persons_dirty.xml) as input leads to the following result:
The first 17 lines contain the person elements of the source file, which have not been polluted (as requested with the errorsInAncestors attribute in the root element of the parameter file). Lines 18 to 35 contain the same elements, but polluted according to the parameters (we defined a duplication probability of 1 and a maximum of one duplicate per element).
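The structure of this result (unpolluted originals first, polluted duplicates afterwards) can be reproduced with a simplified sketch that works on plain strings instead of XML elements. Everything in it is an assumption made for illustration: the class DuplicateSketch, its parameter names, and in particular the rule that the number of duplicates is drawn uniformly between 1 and the maximum. The actual generator may decide this differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.UnaryOperator;

// Hypothetical, simplified model: records are plain strings, not XML elements.
final class DuplicateSketch {
    static List<String> generate(List<String> clean,
                                 double dupProbability,
                                 int maxDuplicates,
                                 boolean errorsInOriginals,
                                 UnaryOperator<String> polluter,
                                 Random rng) {
        List<String> originals = new ArrayList<>();
        List<String> duplicates = new ArrayList<>();
        for (String record : clean) {
            // Originals stay clean unless errors in them are requested.
            originals.add(errorsInOriginals ? polluter.apply(record) : record);
            if (maxDuplicates > 0 && rng.nextDouble() < dupProbability) {
                // Assumption: duplicate count is uniform in 1..maxDuplicates.
                int count = 1 + rng.nextInt(maxDuplicates);
                for (int i = 0; i < count; i++) {
                    duplicates.add(polluter.apply(record));
                }
            }
        }
        // Originals first, then the polluted duplicates.
        List<String> result = new ArrayList<>(originals);
        result.addAll(duplicates);
        return result;
    }
}
```

With a duplication probability of 1 and at most one duplicate, every record appears exactly twice in the output: once unchanged and once polluted.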
3. Terms of use
The software is free for academic purposes. We would very much appreciate a short note or feedback on the usage.
For commercial use please contact Felix Naumann.
4. Download
You can choose between:
- the complete distribution, containing the JAR file, the required JDOM library, an example, and the full Technical Report of the student research project (in German), and
- the JAR file only. In this case you also need to download the JDOM library and add it to your classpath.
5. Detailed Documentation
For further information please read the Technical Report (in German).