Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project members

  • Tobias Vogel

DAQS (DAta Quality as a WebService) is a comprehensive data cleansing project. Its goal is to provide the full duplicate detection workflow via a web service with as little manual interaction as possible. The challenge is to enable the computer to make decisions that a human expert usually makes.

 

Workflow

The workflow consists mainly of three steps.

  1. In the problem classification phase, the available dataset is analyzed and the degree of missing information is estimated. (see below)
  2. If the semantics (a.k.a. fine-grained data types) of the dataset are unclear, classes have to be assigned to the attributes.
  3. The actual duplicate detection is performed with the similarity measures derived from the classes.
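
As a rough illustration of step 3, the following sketch selects a similarity measure per attribute class and averages the per-attribute scores of two tuples. The class names ("name", "zip", "phone"), the measures chosen for them, the averaging, and the threshold of 0.8 are assumptions made for this example; they are not taken from DAQS.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_similarity(a: str, b: str) -> float:
    """1.0 if both values are identical after trimming, else 0.0."""
    return 1.0 if a.strip() == b.strip() else 0.0


def digits_similarity(a: str, b: str) -> float:
    """Compare only the digits, ignoring formatting such as spaces or dashes."""
    digits_a = "".join(ch for ch in a if ch.isdigit())
    digits_b = "".join(ch for ch in b if ch.isdigit())
    return 1.0 if digits_a and digits_a == digits_b else 0.0


# Hypothetical attribute classes, each mapped to one similarity measure.
MEASURE_BY_CLASS = {
    "name": string_similarity,
    "zip": exact_similarity,
    "phone": digits_similarity,
}


def tuple_similarity(t1: dict, t2: dict, classes: dict) -> float:
    """Average the class-specific similarities over all classified attributes.

    `classes` maps attribute names to one of the class names above.
    """
    scores = [
        MEASURE_BY_CLASS[cls](t1[attr], t2[attr])
        for attr, cls in classes.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def is_duplicate(t1: dict, t2: dict, classes: dict, threshold: float = 0.8) -> bool:
    """Treat two tuples as duplicates if their average similarity reaches the threshold."""
    return tuple_similarity(t1, t2, classes) >= threshold
```

With classes such as `{"fullname": "name", "plz": "zip", "tel": "phone"}`, two records that differ only in a misspelled name and in the formatting of the phone number would be reported as duplicates under these assumptions.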

1. Problem Classification

There are four types of datasets that can occur.

  1. Datasets can have semantic annotations for the attributes (and a mapping/a separator). Consequently, the duplicate detection task can be performed nearly automatically.
  2. The datasets only have a mapping (and a separator). Thus, it is clear which attributes to compare, but not how to compare them.
  3. In datasets without a mapping, only the tuples and attributes are distinguishable, but it is not clear which attributes to compare with which other attributes.
  4. In case of unstructured documents, not even tuples/attributes can be recognized. They have to be extracted first.
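
The four cases can be read as a small decision procedure. The sketch below is one possible encoding; the enum, the three boolean inputs, and the lists of remaining steps are assumptions for illustration and do not describe how DAQS detects these properties.

```python
from enum import Enum


class DatasetType(Enum):
    ANNOTATED = 1      # semantic annotations and a mapping are present
    MAPPED = 2         # only a mapping (and a separator) is present
    UNMAPPED = 3       # tuples/attributes are distinguishable, but no mapping
    UNSTRUCTURED = 4   # not even tuples/attributes can be recognized


def classify_problem(has_structure: bool, has_mapping: bool,
                     has_annotations: bool) -> DatasetType:
    """Return the dataset type, checking the most basic property first."""
    if not has_structure:
        return DatasetType.UNSTRUCTURED
    if not has_mapping:
        return DatasetType.UNMAPPED
    if not has_annotations:
        return DatasetType.MAPPED
    return DatasetType.ANNOTATED


# Assumed work that remains for each type before duplicates can be reported.
REMAINING_STEPS = {
    DatasetType.ANNOTATED: ["duplicate detection"],
    DatasetType.MAPPED: ["attribute classification", "duplicate detection"],
    DatasetType.UNMAPPED: ["attribute mapping", "attribute classification",
                           "duplicate detection"],
    DatasetType.UNSTRUCTURED: ["tuple/attribute extraction", "attribute mapping",
                               "attribute classification", "duplicate detection"],
}
```

The less information a dataset carries, the more of these steps the service has to perform itself.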

2. Attribute Classification

If no semantics are assigned to the dataset's attributes, the service has to assign them. In DAQS, we use an instance-based as well as a machine learning classification approach to do that.

The figure shows the instance-based and machine learning classification classes.
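
To make the instance-based idea more concrete, here is a minimal sketch that assigns a class to an attribute by looking up its values in a dictionary of known instances. The dictionary contents, the class names, and the coverage threshold are hypothetical; the actual DAQS classifier may work differently.

```python
from collections import Counter

# Hypothetical knowledge dictionary: class name -> known instance values.
DICTIONARY = {
    "first_name": {"anna", "peter", "maria", "john"},
    "city": {"berlin", "potsdam", "hamburg"},
}


def classify_attribute(values, dictionary=DICTIONARY, min_coverage=0.5):
    """Return the class whose known instances cover most values, or None.

    A class is only assigned if at least `min_coverage` of the non-empty
    values of the attribute are found among that class's known instances.
    """
    normalized = [v.strip().lower() for v in values if v and v.strip()]
    if not normalized:
        return None
    hits = Counter()
    for value in normalized:
        for cls, known in dictionary.items():
            if value in known:
                hits[cls] += 1
    if not hits:
        return None
    best_class, best_hits = hits.most_common(1)[0]
    if best_hits / len(normalized) >= min_coverage:
        return best_class
    return None
```

For a column holding the values `["Berlin", "Potsdam", "Munich"]`, two of three values are found under "city", so the column would be labeled as a city attribute under these assumptions.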

Datasets

| Usage | Dataset | Description | Source of the original dataset |
|---|---|---|---|
| Instance-based classification | Global Knowledge Dictionary | This is the (only) instance-based dataset. Some attributes are removed for confidentiality. Consequently, the classification results will be a bit worse. | |
| Machine learning classification | Global Knowledge | This is the training dataset for machine learning. It is a melange from different sources mentioned here, but without overlapping any of the other datasets. (Usually, only the first 500 tuples are used.) | |
| Machine learning classification | Clermont-Voters | This is a file of voters in Clermont County in the USA. | www.clarkcountynv.gov/Depts/election/Pages/VoterDataFiles.aspx |
| Machine learning classification | Fakenames | This is a generated dataset with all available attributes from fakenamegenerator.com. | fakenamegenerator.com |
| Machine learning classification | KrumnowTeusner | This is a dataset generated by students during an earlier information integration lecture. Data are mostly crawled from Wikipedia and others. | |
| Machine learning classification | LebenNiepraschk | This is a dataset generated by students during an earlier information integration lecture. Data are mostly crawled from Wikipedia and others. | |
| Machine learning classification | RichlyWehrmeyer | This is a dataset generated by students during an earlier information integration lecture. Data are mostly crawled from Wikipedia and others. | |
| Machine learning classification | List B | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | ualr.edu/eriq/downloads/ |
| Machine learning classification | List C | This dataset comes from an information integration assignment of the University of Arkansas at Little Rock. | ualr.edu/eriq/downloads/ |
| Machine learning classification | Politiker | This dataset is crawled from Deutschland-API. | www.deutschland-api.de/Hauptseite |
| Machine learning classification | Mines | This dataset is an overview of some mines in the USA. | www.data.gov/raw/4137 |

Often, the original datasets contained many null values. Because we used only 500 tuples of each dataset in our experiments, we selected these 500 tuples randomly from the best-filled tuples of that dataset.
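
One way to implement such a selection is sketched below. How the pool of "best-filled" tuples was delimited is not stated here, so the `pool_factor` parameter (the pool holds `sample_size * pool_factor` tuples) is an assumption, as are the function names.

```python
import random


def fill_degree(row: dict) -> int:
    """Number of values in a tuple that are neither None nor an empty string."""
    return sum(1 for v in row.values() if v not in (None, ""))


def sample_best_filled(rows, sample_size=500, pool_factor=2, seed=None):
    """Randomly draw `sample_size` tuples from the best-filled tuples.

    The tuples are ranked by fill degree; the top `sample_size * pool_factor`
    form the pool from which the sample is drawn without replacement.
    """
    ranked = sorted(rows, key=fill_degree, reverse=True)
    pool = ranked[: sample_size * pool_factor]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```

Passing a fixed `seed` makes the selection reproducible across runs.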

Demo

A small demo was created for the wheelmap.org project.

Wheelmap-Client

Wheelmap-Service-Endpoint (RESTful)