DAQS
Project members
- Tobias Vogel
DAQS (DAta Quality as a WebService) is a comprehensive data cleansing project. Its goal is to provide the full duplicate detection workflow via a web service with as little manual interaction as possible. The challenge is to enable the computer to make decisions that a human expert usually makes.
Workflow
The workflow consists mainly of three steps.
- In the problem classification phase, the available dataset is analyzed and the degree of missing information is estimated. (see below)
- If the semantics (a.k.a. fine-grained data types) of the dataset are unclear, classes have to be assigned to the attributes.
- The actual duplicate detection is performed with the similarity measures derived from the classes.
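The last step, deriving similarity measures from the attribute classes, can be illustrated with a minimal sketch. The class names (`name`, `zip_code`), the choice of measures, and the unweighted average are assumptions made for illustration only; they are not taken from DAQS.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive edit-based similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_similarity(a: str, b: str) -> float:
    """1.0 for identical values, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical mapping from attribute class to similarity measure.
SIMILARITY_BY_CLASS = {
    "name": string_similarity,
    "zip_code": exact_similarity,
}


def record_similarity(rec_a: dict, rec_b: dict, classes: dict) -> float:
    """Average the per-attribute similarities chosen by each attribute's class.

    `classes` maps attribute name -> class name, e.g. {"n": "name"}.
    """
    if not classes:
        return 0.0
    scores = [
        SIMILARITY_BY_CLASS[cls](rec_a[attr], rec_b[attr])
        for attr, cls in classes.items()
    ]
    return sum(scores) / len(scores)
```

Two records would then be reported as duplicate candidates when `record_similarity` exceeds a threshold, which would have to be tuned per dataset.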
1. Problem Classification
Four types of datasets can occur.
- The dataset has semantic annotations for its attributes (and a mapping/a separator). Consequently, duplicate detection can be performed nearly automatically.
- The dataset only has a mapping (and a separator). Thus, it is clear which attributes to compare, but not how.
- The dataset has no mapping. Only the tuples and attributes are distinguishable, but it is not clear which attributes to compare with which other attributes.
- The document is unstructured. Not even tuples or attributes can be recognized; they have to be extracted first.
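The four cases above form a simple decision cascade, sketched below. The enum names and the three boolean inputs are hypothetical; how DAQS actually detects structure, mappings, or annotations is not described here.

```python
from enum import Enum


class DatasetType(Enum):
    ANNOTATED = "semantic annotations and mapping available"
    MAPPED = "mapping available, semantics unknown"
    UNMAPPED = "tuples and attributes only, no mapping"
    UNSTRUCTURED = "no recognizable tuples or attributes"


def classify_problem(has_structure: bool, has_mapping: bool,
                     has_semantics: bool) -> DatasetType:
    """Return the dataset type, checking the most basic requirement first."""
    if not has_structure:
        return DatasetType.UNSTRUCTURED
    if not has_mapping:
        return DatasetType.UNMAPPED
    if not has_semantics:
        return DatasetType.MAPPED
    return DatasetType.ANNOTATED
```

The less information a dataset carries, the more of the workflow the service has to perform itself before duplicate detection can start.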
2. Attribute Classification
If no semantics are assigned to the dataset's attributes, they have to be assigned by the service. In DAQS, we use an instance-based as well as a machine learning classification approach to do that.
The figure shows the instance-based and machine learning classification classes.
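As a rough illustration of the instance-based idea, the sketch below looks up an attribute's values in a dictionary of known instances per class and picks the class with the most hits. The dictionary contents, the `min_share` threshold, and the function name are assumptions; the actual DAQS classifier may work differently.

```python
from collections import Counter


def classify_attribute(values, dictionary, min_share=0.5):
    """Assign a class to an attribute based on its values.

    `dictionary` maps class name -> set of known lowercase instances.
    Returns the class matching the most values, or None if fewer than
    `min_share` of the non-empty values match it.
    """
    non_null = [v.strip().lower() for v in values if v and v.strip()]
    if not non_null:
        return None
    hits = Counter()
    for value in non_null:
        for cls, known in dictionary.items():
            if value in known:
                hits[cls] += 1
    if not hits:
        return None
    best, count = hits.most_common(1)[0]
    return best if count / len(non_null) >= min_share else None
```

For example, with a hypothetical dictionary `{"first_name": {"tobias", "anna"}, "city": {"berlin", "potsdam"}}`, a column containing `"Tobias"`, `"Anna"`, and one unknown value would be classified as `first_name`.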
Datasets
| Usage | Dataset | Description | Source of the original dataset |
|---|---|---|---|
| Instance-based classification | Global Knowledge Dictionary | This is the (only) instance-based dataset. Some attributes are removed for confidentiality. Consequently, the classification results will be a bit worse. | |
| Machine learning classification | Global Knowledge | This is the training dataset for machine learning. It is a melange from different sources mentioned here, but without overlapping any of the other datasets. (Usually, only the first 500 tuples are used.) | |
| Machine learning classification | Clermont-Voters | This is a file of voters in Clermont County in the USA. | www.clarkcountynv.gov/Depts/election/Pages/VoterDataFiles.aspx |
| Machine learning classification | Fakenames | This is a generated dataset with all available attributes from fakenamegenerator.com. | fakenamegenerator.com |
| Machine learning classification | KrumnowTeusner | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | LebenNiepraschk | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | RichlyWehrmeyer | This is a dataset generated by students during an earlier information integration lecture. The data were mostly crawled from Wikipedia and other sources. | |
| Machine learning classification | List B | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | ualr.edu/eriq/downloads/ |
| Machine learning classification | List C | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | ualr.edu/eriq/downloads/ |
| Machine learning classification | Politiker | This dataset is crawled from Deutschland-API. | www.deutschland-api.de/Hauptseite |
| Machine learning classification | Mines | This dataset is an overview of some mines in the USA. | |
The original datasets often contained many null values. As we only used 500 tuples per dataset for our experiments, we randomly selected these 500 tuples from the best-filled tuples of each dataset.
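One way to realize such a selection is sketched below: rank the tuples by the number of non-empty attributes, keep a pool of the best-filled ones, and sample from that pool. The pool size (`pool_factor`), the definition of "empty", and all names are assumptions, since the exact selection criterion is not specified here.

```python
import random


def fill_degree(row) -> int:
    """Count attributes that are neither None nor blank."""
    return sum(1 for v in row if v is not None and str(v).strip() != "")


def sample_best_filled(rows, k=500, pool_factor=4, seed=None):
    """Randomly pick up to k tuples from the k * pool_factor best-filled ones."""
    ranked = sorted(rows, key=fill_degree, reverse=True)
    pool = ranked[: k * pool_factor]
    rng = random.Random(seed)  # a fixed seed makes the selection repeatable
    return rng.sample(pool, min(k, len(pool)))
```

Passing a fixed `seed` keeps the selected subset stable between experiment runs.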
Demo
A small demo was created for the wheelmap.org project.
Wheelmap-Client
Wheelmap-Service-Endpoint (RESTful)