Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Data Cleaning

Ziawasch Abedjan, TU Berlin

Abstract

Data cleaning is one of the most time-consuming and tedious steps in data-driven projects. Typically, it entails the identification of erroneous values and their correction. Effective error detection can significantly improve the subsequent correction step. Research on data cleaning has provided a variety of approaches, most of which require some prior knowledge about the dataset in order to set up and configure the approach with rules, sensitivity thresholds, or other parameters.

Often these approaches only cover a certain type of errors. Recently, novel machine learning techniques have been proposed to treat error detection as a classification task. These approaches still require large amounts of training data scaling with the size of the dataset to cover the variety of residing error types inside a dataset. In this talk, I will present our work in progress towards a holistic data cleaning system that significantly reduces the amount of required labels by leveraging label propagation techniques and meta-learning.
 

Biography

Ziawasch Abedjan is Juniorprofessor and head of the “Big Data Management” (BigDaMa) Group at TU Berlin. Prior to that, Ziawasch was a postdoc at the “Computer Science and Artificial Intelligence Laboratory” at MIT working on various data integration problems. Ziawasch received his PhD from the Hasso Plattner Institute in Potsdam, Germany. He is a recipient of the Best Dissertation Award of the University of Potsdam, the 2014 CIKM Best Student Paper Award, and the 2015 SIGMOD Best Demo Award. His research is funded by the DFG, the Federal Ministry for Research and Education, and the Federal Ministry of Transport, Building and Urban Development.

A recording of the presentation is available on Tele-Task.

Summary

written by Tobias Bredow, Alina Gries, and Kay Erik Jenß

Big data is everywhere, and data science has become an important topic in both society and industry. A 2016 study (CrowdFlower’s Data Science Report) found that data scientists spend 80% of their time on data preparation.
A data preparation pipeline consists of profiling, extraction, transformation, matching, discovery, and cleaning. In the lecture, Prof. Dr. Ziawasch Abedjan focused on data cleaning.

1 Error detection

Before detecting and correcting data errors, one has to answer the question of what data errors are. Basically, an error can be anything that a user does not want to have in the data set. Common types of errors are missing values, uniqueness errors such as the same ID in two different entries, representation errors such as a different order of first name and last name, contradictions (for example between age and date of birth), typos, incorrect values, and column shifts.

In the following we summarize the presented error detection algorithms.

Rule Violation Detector
Rules can be defined as functional dependencies on the data set. For example, the same zip code should always lead to the same city name. A value that violates such a rule can be considered an error. Later we will see that one can also use these rules to find corrections.
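
As a minimal sketch of such a rule check (not the detector used in the lecture), the following Python snippet flags all rows that take part in a violation of the functional dependency zip -> city. The function name `fd_violations` and the sample rows are assumptions made for illustration.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return indices of rows whose lhs value maps to more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    # A row is flagged when its lhs value is associated with several rhs values.
    return [i for i, row in enumerate(rows) if len(seen[row[lhs]]) > 1]

# Hypothetical sample data: the second row contains a typo in the city name.
rows = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "14482", "city": "Potsdm"},
    {"zip": "10623", "city": "Berlin"},
]
print(fd_violations(rows, "zip", "city"))  # [0, 1]
```

Both rows with zip code 14482 are flagged, because the rule alone cannot tell which of the two city names is the wrong one.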

Pattern Violation Detector
This algorithm uses more fine-grained rules, so-called patterns. For example, one could define a format with a regular expression for every column. Anything that violates a pattern can be considered an error.
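
A pattern check of this kind could look roughly as follows. The column names and regular expressions are hypothetical examples, not patterns taken from the lecture.

```python
import re

# Hypothetical per-column formats: a five-digit zip code and an ISO date.
PATTERNS = {
    "zip": re.compile(r"\d{5}"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}

def pattern_violations(rows, patterns):
    """Return (row index, column) pairs whose value does not fully match the column pattern."""
    flagged = []
    for i, row in enumerate(rows):
        for column, pattern in patterns.items():
            if pattern.fullmatch(str(row.get(column, ""))) is None:
                flagged.append((i, column))
    return flagged

sample = [
    {"zip": "14482", "date": "2019-11-05"},
    {"zip": "1448", "date": "05.11.2019"},
]
print(pattern_violations(sample, PATTERNS))  # [(1, 'zip'), (1, 'date')]
```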

Note, however, that when a value breaks a rule or a pattern, it is still not known what exactly is wrong with it.

Outlier Detection
Furthermore, there are statistical approaches such as outlier detection. They use heuristics to identify potential errors: a value that does not fit the distribution of its column can be considered an error. However, not every outlier is a data error. An outlier is only a hint that something might be an error.
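
One simple heuristic of this kind, chosen here only for illustration, flags values that lie far from the median, measured in units of the median absolute deviation (MAD). The cut-off `k=5.0` is an arbitrary assumption; the lecture did not prescribe a particular outlier test.

```python
from statistics import median

def mad_outliers(values, k=5.0):
    """Return indices of values farther than k MADs from the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No usable spread estimate, so this sketch flags nothing.
        return []
    return [i for i, v in enumerate(values) if abs(v - med) > k * mad]

# Hypothetical age column with one implausible entry.
ages = [23, 25, 27, 29, 31, 33, 35, 240]
print(mad_outliers(ages))  # [7]
```

A value such as 240 is flagged here, but a legitimate yet unusual value would be flagged in the same way, which is why an outlier is only a hint.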

Knowledge Base Violation Detector
This algorithm validates the data set against a master data set that contains mostly clean values.
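
A minimal sketch of such a lookup, assuming the master data is available as a simple zip-to-city mapping (the mapping and the function name `kb_violations` are hypothetical):

```python
# Hypothetical master data: trusted zip -> city pairs.
MASTER_CITY_BY_ZIP = {"14482": "Potsdam", "10623": "Berlin"}

def kb_violations(rows, master, key="zip", attribute="city"):
    """Flag rows whose attribute contradicts the master entry for the same key."""
    flagged = []
    for i, row in enumerate(rows):
        expected = master.get(row[key])
        # Keys unknown to the master data cannot be judged, so they are not flagged.
        if expected is not None and row[attribute] != expected:
            flagged.append(i)
    return flagged

records = [
    {"zip": "14482", "city": "Potsdam"},
    {"zip": "10623", "city": "Potsdam"},
    {"zip": "99999", "city": "Nowhere"},
]
print(kb_violations(records, MASTER_CITY_BY_ZIP))  # [1]
```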

Now one can ask which type of detection strategy is most effective at finding errors. In 2016, Prof. Dr. Ziawasch Abedjan and his group conducted a study to answer this question. They discovered that data sets rarely contain only one kind of issue; usually there is a mix of different problems. Therefore, there is no single best strategy, and all of them are needed. They also found that data sets contain not just a few errors but thousands, so automation is really needed to find them.

1.1 Possible directions for aggregating error detection algorithms

There are different directions to aggregate the results of multiple error detection algorithms. They can be divided into unsupervised methods and semi-supervised methods.

Unsupervised methods include majority voting, taking the union of the results, or requiring a minimum agreement, meaning that at least k strategies have to agree that a value is an error. Between agreement and union there is always a trade-off between precision and recall: the more tools have to agree that a value is an error, the more precise the result will be, but the fewer errors it will discover, and vice versa.
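
The three unsupervised variants can be expressed with a single threshold k, as the following sketch shows. The detector outputs are made-up 0/1 flags for four cells.

```python
def min_agreement(detector_outputs, k):
    """Flag a cell when at least k detectors flag it.

    detector_outputs: list of equally long 0/1 lists, one per detector.
    """
    votes = [sum(cell_flags) for cell_flags in zip(*detector_outputs)]
    return [1 if v >= k else 0 for v in votes]

# Hypothetical outputs of three detectors for the same four cells.
outputs = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
]
n = len(outputs)
union = min_agreement(outputs, 1)              # any detector suffices
majority = min_agreement(outputs, n // 2 + 1)  # more than half must agree
unanimous = min_agreement(outputs, n)          # all detectors must agree
print(union, majority, unanimous)
# [1, 1, 1, 0] [1, 0, 1, 0] [1, 0, 0, 0]
```

Raising k from 1 to n shrinks the set of flagged cells, which illustrates the move from high recall towards high precision.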

With semi-supervised methods, it is possible to learn the best combination of error detection techniques and thereby turn error detection into a classification task.

1.2 Implementation

The group's idea was to holistically combine multiple error detection strategies and classify each value t[i] of a tuple t as clean (0) or error (1). In more detail, they run multiple error detection strategies from different systems, and each strategy outputs a value for each data cell of the data set. This yields multiple binary matrices, so that for each tuple value there is a string of zeros and ones.
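
How the binary matrices turn into one string of zeros and ones per cell can be sketched as follows. The function name and the three tiny matrices are assumptions for illustration; the learning step on top of these vectors is not shown.

```python
def cell_feature_vectors(detector_matrices):
    """Stack per-detector 0/1 matrices into one feature vector per cell.

    detector_matrices: list of matrices (lists of rows) with identical shape.
    Returns a dict mapping (row, column) to a tuple of detector outputs.
    """
    n_rows = len(detector_matrices[0])
    n_cols = len(detector_matrices[0][0])
    return {
        (r, c): tuple(matrix[r][c] for matrix in detector_matrices)
        for r in range(n_rows)
        for c in range(n_cols)
    }

# Hypothetical outputs of three strategies on a 2x2 data set.
rule_hits = [[0, 1], [0, 0]]
pattern_hits = [[0, 1], [1, 0]]
outlier_hits = [[0, 0], [1, 0]]
features = cell_feature_vectors([rule_hits, pattern_hits, outlier_hits])
print(features[(0, 1)])  # (1, 1, 0)
print(features[(1, 0)])  # (0, 1, 1)
```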

To strengthen this approach, Prof. Dr. Ziawasch Abedjan and his group additionally enriched the feature vector with simple heuristics that can be generated from metadata, which represents characteristics of the data set.
He presented the following five metadata categories.

Data Completeness

Data Type Affiliation

Attribute Domains

Frequent Values

Multi-Column Dependencies

In their overall system, the research group combines these ideas: First, they generate the needed metadata for an input data set. Then they run multiple error detection strategies and generate the binary matrices. After that, they combine the generated data into feature vectors. Finally, they apply ensemble learning to obtain one final output.

1.3 Evaluation Methodology

To evaluate their system, the group used precision, recall, and the F-measure derived from these two metrics. Precision is the proportion of detected errors that are actual errors, and recall is the proportion of actual errors that were detected.
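
For reference, these metrics can be computed as in the sketch below, which assumes the balanced F1 variant of the F-measure (the harmonic mean of precision and recall). The sets of cell coordinates are made up.

```python
def precision_recall_f1(detected, actual):
    """Compute precision, recall and F1 for sets of detected and actual error cells."""
    detected, actual = set(detected), set(actual)
    true_positives = len(detected & actual)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: four detected cells, three actual errors, two in common.
detected_cells = {(0, 1), (1, 0), (2, 2), (3, 1)}
actual_errors = {(0, 1), (1, 0), (4, 0)}
p, r, f1 = precision_recall_f1(detected_cells, actual_errors)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```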

They then compared the resulting F-measure with those of popular error detection algorithms. This comparison shows that the stacking approach performs better than the common approaches, especially at detecting errors in an address data set, and similarly or better on a hospital data set. Including metadata in the stacking approach improves the results even further. However, their approach needs labeled data, a cost that is not reflected in the evaluation.

1.4 Generalizing labels

Overall, generalization in error detection is difficult, since a generalization would have to capture not only the syntactic but also the semantic aspects of a labeled value: the value may merely contain a typo, or it may be entirely wrong. However, many of these aspects are covered by different existing techniques, provided that they have the right configuration to capture certain aspects of errors. To generalize labels automatically, Prof. Dr. Ziawasch Abedjan and his group developed the Raha system.

1.5 Raha System: Automatic Algorithm Configuration

Instead of manually finding the best configuration for each algorithm, Raha generates a wide range of configurations for each of the algorithms used. This results in a large number of error detection strategies whose outputs capture the similarity of errors.
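
The idea of enumerating configurations can be illustrated with a small parameter grid. The algorithm names, parameters, and values below are invented for this sketch and are not Raha's actual configuration space.

```python
from itertools import product

# Hypothetical parameter grids for two detection algorithms.
PARAMETER_GRIDS = {
    "outlier": {"method": ["histogram", "gaussian"], "threshold": [1.5, 2.0, 3.0]},
    "pattern": {"character_class": ["digit", "letter", "punctuation"]},
}

def generate_strategies(grids):
    """Enumerate every (algorithm, configuration) pair of the given grids."""
    strategies = []
    for algorithm, grid in grids.items():
        names = list(grid)
        for values in product(*(grid[name] for name in names)):
            strategies.append((algorithm, dict(zip(names, values))))
    return strategies

all_strategies = generate_strategies(PARAMETER_GRIDS)
print(len(all_strategies))  # 9
print(all_strategies[0])    # ('outlier', {'method': 'histogram', 'threshold': 1.5})
```

Each resulting pair is one error detection strategy, and each strategy contributes one entry to the feature vector of a value.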

The output of these different strategies is then used to generate a feature vector for each data cell. Based on these feature vectors, the cells are then clustered. To determine whether the cells of a cluster are clean or dirty, the user needs to label these clusters. Since the clusters are formed separately for each column, and to maximize the knowledge gained from each labeled value, tuples are chosen for labeling so that they cover as many unlabeled clusters as possible. This is repeated until the user's labeling budget is exhausted.
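
The selection of tuples for labeling can be sketched as a greedy coverage procedure. This is a deterministic simplification written for illustration; the real system may select tuples differently, and the cluster assignment below is made up.

```python
def select_tuples(cluster_of, n_rows, n_cols, budget):
    """Greedily pick row indices that cover as many unlabeled clusters as possible.

    cluster_of maps (row, column) to a cluster id. Clusters exist per column,
    so a cluster is identified by the pair (column, cluster id).
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        best_row, best_gain = None, 0
        for r in range(n_rows):
            if r in chosen:
                continue
            row_clusters = {(c, cluster_of[(r, c)]) for c in range(n_cols)}
            gain = len(row_clusters - covered)
            if gain > best_gain:
                best_row, best_gain = r, gain
        if best_row is None:  # every cluster is already covered
            break
        chosen.append(best_row)
        covered |= {(c, cluster_of[(best_row, c)]) for c in range(n_cols)}
    return chosen

# Hypothetical clustering of a data set with four rows and two columns.
cluster_of = {
    (0, 0): 0, (0, 1): 0,
    (1, 0): 0, (1, 1): 1,
    (2, 0): 1, (2, 1): 1,
    (3, 0): 1, (3, 1): 0,
}
print(select_tuples(cluster_of, n_rows=4, n_cols=2, budget=2))  # [0, 2]
```

With a budget of two tuples, rows 0 and 2 together touch all four clusters, so every cluster receives a user label.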

Afterwards training and classification can be done.

To compare Raha with existing approaches, the group used an experimental setup with eight different data sets and compared several algorithms on precision, recall, F1-measure, and runtime.

The results of these experiments showed that with only 10 labeled tuples, Raha already outperforms all compared approaches. This gap widens as the number of labeled tuples increases.

Since running each algorithm with each configuration is rather expensive, the group used meta-learning: historical data was used to filter out configurations that had performed poorly in the past, so that no time is wasted executing these strategies.

The results showed that this approach reduced the runtime by a large margin while reducing the effectiveness only slightly.

The limitations of the approach Raha takes are:

It does not handle errors made by the user while labeling the clusters

It does not provide enough context for the user while labeling 

It does not correct found errors

There is no guarantee of how well it actually works

2 Error correction

Prof. Dr. Ziawasch Abedjan then went a step further in his discussion of how to clean data. After the errors in the data set have been discovered, a mechanism is needed to fix the corrupted tuples. Two rather different approaches were presented for this. The first, developed at the University of Waterloo, Canada, uses denial constraints, whereas the second uses statistical background knowledge to repair the data set.

2.1 Denial constraints approach

As the name suggests, this method relies on denial constraints. Denial constraints are now widely used to ensure the correctness of data because they are more general than most other integrity constraints (ICs). In addition, they strike a balance between complexity and expressiveness.
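
To illustrate what a denial constraint expresses, the following sketch checks a made-up constraint that forbids any pair of tuples in which the first has a higher salary but a lower tax rate than the second. The constraint, the attribute names, and the data are assumptions for illustration and do not come from the lecture.

```python
from itertools import permutations

def dc_violations(rows, predicate):
    """Return ordered index pairs (i, j) for which the forbidden predicate holds."""
    return [
        (i, j)
        for (i, t1), (j, t2) in permutations(enumerate(rows), 2)
        if predicate(t1, t2)
    ]

def higher_salary_lower_tax(t1, t2):
    # Forbidden combination: t1 earns more than t2 but has a lower tax rate.
    return t1["salary"] > t2["salary"] and t1["tax_rate"] < t2["tax_rate"]

employees = [
    {"salary": 30000, "tax_rate": 0.20},
    {"salary": 50000, "tax_rate": 0.30},
    {"salary": 70000, "tax_rate": 0.25},
]
print(dc_violations(employees, higher_salary_lower_tax))  # [(2, 1)]
```

A constraint with order comparisons like this one cannot be written as a functional dependency, which is one way in which denial constraints are more general.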

Figure 1: Architecture of the system. Source: Chu, Xu & Ilyas, Ihab & Papotti, Paolo. (2013). Holistic data cleaning: Putting violations into context.

These denial constraints are parsed by the presented system and, together with the data, used to build a conflict hypergraph, which captures the violations of the constraints.

In the next step, the cells to be repaired are selected and repaired according to the nodes in which they were grouped, so that the updates can be applied to the data.

The core idea of this method is to use the conflict hypergraph, whose nodes represent cells that take part in a constraint violation. Edges are then added to the graph, connecting cells that are involved in the same violation.

Once this is done, a minimum vertex cover is computed to isolate the cells that are most likely to have been corrupted. For each such cell, and for all subsequent cells, a stack is then built that contains all the changes to be considered, so that they can all be applied to that cell.
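
Finding an exact minimum vertex cover is computationally hard in general, so the sketch below uses a common greedy approximation: it repeatedly picks the cell that appears in the most violations not yet covered. Whether the presented system uses this particular heuristic is an assumption, and the hyperedges are invented.

```python
def greedy_vertex_cover(hyperedges):
    """Approximate a minimum vertex cover of a conflict hypergraph.

    hyperedges: iterable of collections of cells, one collection per violation.
    """
    remaining = [set(edge) for edge in hyperedges if edge]
    cover = []
    while remaining:
        counts = {}
        for edge in remaining:
            for cell in edge:
                counts[cell] = counts.get(cell, 0) + 1
        # Sorting first breaks ties by cell identifier, so the result is deterministic.
        best = max(sorted(counts), key=counts.get)
        cover.append(best)
        remaining = [edge for edge in remaining if best not in edge]
    return cover

# Hypothetical violations; each cell is a (row, attribute) pair.
violations = [
    [(0, "city"), (1, "city")],
    [(1, "city"), (2, "city")],
    [(1, "city"), (1, "state")],
    [(3, "salary"), (4, "tax")],
]
print(greedy_vertex_cover(violations))  # [(1, 'city'), (3, 'salary')]
```

The cell (1, 'city') takes part in three violations, so changing it is the most promising single repair.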

2.2 HoloClean

This approach relies on prior information that the system can process in order to make corrections on a statistical basis. This includes additional information about the data set to be repaired, constraints such as denial constraints and dependencies, and also external information such as the assignment of cities to states and postal codes discussed in the presentation.

Based on this information, a probability can be computed for each cell, indicating how likely it is that a certain value belongs there. Thus, from a finite set of candidate values for a cell, the one with the highest probability can be selected.
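
The selection of the most probable candidate can be illustrated with a deliberately naive estimate based on co-occurrence counts. This is not how HoloClean computes its probabilities; it is only a sketch of picking the most probable value from a finite candidate set, using invented data.

```python
from collections import Counter

def most_probable_value(rows, target, evidence, evidence_value):
    """Estimate P(target = v | evidence = evidence_value) from co-occurrence
    counts and return the most probable v together with its probability."""
    counts = Counter(
        row[target] for row in rows if row[evidence] == evidence_value
    )
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    value, count = counts.most_common(1)[0]
    return value, count / total

# Hypothetical observations: one of four rows with the same zip code has a typo.
observations = [
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Chicago"},
    {"zip": "60608", "city": "Cicago"},
    {"zip": "10623", "city": "Berlin"},
]
print(most_probable_value(observations, "city", "zip", "60608"))  # ('Chicago', 0.75)
```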

In summary, as the amount of data grows, new methods for error correction are emerging. This also means that data sets can be trusted more and more, even in important business scenarios. The approaches to error detection are shifting from traditional methods such as rule and pattern violations to ML-based approaches, which are semi-supervised and therefore still require human assistance in the form of labels.

Nevertheless, there are already very effective approaches, especially for correcting the errors in data sets, which solve the problems encountered so far.