Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Data Quality and Data Preparation

Summary written by Michael Schmitt & Jan Riedler

General introduction to the course

In the course "Lecture Series on HPI Research SoSe 2024", HPI professors present their research areas and current topics. The aim is to provide students with an overview of the breadth of research (groups) at HPI. The first lecture was given by Prof. Dr. Naumann on "data quality and data preparation".

The professor and his chair

Prof. Dr. Felix Naumann is a professor in the field of Information Systems and leads the corresponding chair at the HPI. His research group focuses on efficiently managing heterogeneous data and texts within large systems. This field includes the development of techniques for data profiling, data preparation, and data cleaning, especially for structured data. For example, the members of the group develop methods to extract elements from art, including people, places, and works, along with their interrelationships [Source].  

Prof. Naumann himself has been a Full Professor at HPI since 2006. Notably, he has held various visiting positions at prestigious institutions such as SAP's Innovation Center, AT&T Research, QCRI, and the IBM Almaden Research Center. Additionally, he served as Dean of Studies for the Digital Engineering Faculty and has been a speaker of the HPI Research School "Data Science and Engineering" since 2019. He has also delivered keynote speeches at various conferences and workshops, including CAiSE and EDBT. Among other awards, he has won the GI Dissertation Prize, the IBM Research Division Award and two Reviewer Awards (ICDE, SIGMOD) [Source].

A brief introduction to data quality and data preparation

Overview of the research problem 

This lecture focused on data quality and data preparation, which is itself a research area within the larger field of data engineering. To illustrate what data quality issues actually are and to better understand the impact of such errors, real-world examples are given in the following paragraphs.

One example of bad data quality is the registration form for buying tickets for the 2010 World Cup, where applicants had to indicate the country they were coming from. Because the website used an outdated country dropdown list, it was possible to register with countries that no longer existed. This led to data quality errors and to subsequent errors in data processing.

Such data quality errors are quite common and come in different types, such as duplicates, misspellings, or intentionally false information entered to bypass faulty form rules. These issues make data analysis more difficult and can have unwanted consequences, such as sending material to the same person twice in the case of duplicates.

Later in this summary, solutions will be presented that automate and refine data preparation and cleaning. A more detailed definition of the terms will also be given in the following sections.

Background and classification of the research area 

While the first scientists focused on empirical and experimental science (such as Alexander von Humboldt observing the Amazon rainforest and taking notes on it), science later became more theoretical in nature, for example through the creation of mathematical tools to build and validate models. With the start of the computational paradigm, much more complex simulations and mathematical experiments became possible, ultimately leading to the current, fourth paradigm of science, which is data-intensive. As mentioned in the lecture, Jim Gray was one of the leading researchers of this most recent paradigm.

A typical data science pipeline contains the following steps and challenges:

1.    Capture (measuring or acquiring the data)

2.    Extraction (extracting only the data needed from the raw data) 

3.    Curation (cleaning and preparing the data for subsequent tasks)  

4.    Storage (storing the data efficiently) 

5.    Search (making the data searchable in order to allow data discovery) 

6.    Sharing (formatting the data conveniently, providing metadata, making it accessible) 

The above steps can be referred to as data engineering, which is about providing all relevant data in an efficient way. Data scientists can then pick up from this work and continue with data analytics, machine learning, visualization, etc., which constitutes a whole research field in itself.

This summary is mostly about Step 3 (curation), as this is where most data cleaning and data preparation takes place. To highlight the importance of clean data: according to a survey from 2016 with about 80 participants, data scientists spend 60% of their time on cleaning and organizing data. The second-largest block of their time (19%) went into data collection. Interestingly, about 50% of the data scientists said that cleaning and organizing data is not only the largest part of their work but also the least enjoyable one.

This is where research on data cleaning and data preparation comes in, aiming to save data scientists from having to spend so much time preparing and cleaning their data.

Main ideas, definitions and methodologies used in proposed solutions 

It can be difficult to clearly distinguish data preparation from data cleaning, but in more technical terms, data preparation operates more on the syntactic level (formatting data, understanding units, data models, structure, etc.), while data cleaning belongs more to the semantic domain: understanding the data and correcting factual errors in it (e.g., telling a wrong address from the right one even though both are formatted correctly).

One example of data preparation would be changing a file's delimiter from comma-separated to tab-separated, while an example of data cleaning would be deduplicating entries in a database.
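
To make this distinction concrete, the following minimal Python sketch (with made-up file names and records, not a tool presented in the lecture) performs a syntactic preparation step, converting a comma-separated file to a tab-separated one, and a semantic cleaning step, removing exact duplicate records:

import csv

# Write a small comma-separated example file (invented data) so the snippet is self-contained.
with open("customers.csv", "w", newline="") as f:
    f.write("name,city\nAlice,Berlin\nBob,Potsdam\nAlice,Berlin\n")

# Data preparation (syntactic): convert the comma-separated file to a tab-separated one.
with open("customers.csv", newline="") as src, open("customers.tsv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src, delimiter=","):
        writer.writerow(row)

# Data cleaning (semantic): remove exact duplicate records.
with open("customers.tsv", newline="") as f:
    rows = [tuple(row) for row in csv.reader(f, delimiter="\t")]
header, records = rows[0], rows[1:]
deduplicated = list(dict.fromkeys(records))  # keeps the first occurrence, preserves order
print(deduplicated)  # [('Alice', 'Berlin'), ('Bob', 'Potsdam')]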

The following paragraphs continue the discussion on a more technical level, presenting the following topics to give some insight into the problems and solutions that have been developed in this field over time:

1.    Bad Files 

2.    Data Preparation 

3.    Bad Data 

4.    Data Cleaning 

Bad files and data preparation 

Bad files, i.e., files that do not follow the expected formatting rules, are a well-known problem. Fixing such errors across thousands of files, each potentially having its own unique issues, can be time-consuming.

Solutions have been developed in the area of data preparation, but many of the commercially available tools expect the data to already be loaded into the tool before one can work with it, which poses a problem in itself.

This is where the software "Mondrian", researched and developed at Prof. Naumann's chair, comes into play. It can extract multiple tables from a single spreadsheet by first rendering the spreadsheet as an image and then recognizing patterns and shapes. Mondrian also allows automatic, rule-based data extraction from multiple files that all follow the same template in a concise way.
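
To convey the general idea of locating several tables within one sheet (a strongly simplified sketch, not Mondrian's actual image-based algorithm), one can treat the spreadsheet as a grid and group connected non-empty cells into rectangular regions; the grid below is an invented toy example:

from collections import deque

# Toy spreadsheet as a grid of cell values; None marks an empty cell.
sheet = [
    ["Name", "City", None, None, "Year", "Total"],
    ["Alice", "Berlin", None, None, "2023", "42"],
    ["Bob", "Potsdam", None, None, "2024", "57"],
]

def table_regions(grid):
    """Group connected non-empty cells and return their bounding boxes."""
    seen = set()
    regions = []
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is None or (r, c) in seen:
                continue
            # Breadth-first search over horizontally/vertically adjacent non-empty cells.
            queue, cells = deque([(r, c)]), []
            seen.add((r, c))
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
                            and grid[nr][nc] is not None and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            rows = [cell[0] for cell in cells]
            cols = [cell[1] for cell in cells]
            regions.append((min(rows), min(cols), max(rows), max(cols)))
    return regions

print(table_regions(sheet))  # [(0, 0, 2, 1), (0, 4, 2, 5)] -> two separate tables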

Another solution is "ExtracTable", which can automatically ingest a large variety of raw data files, including text files and poorly structured CSV files, by detecting row patterns and thus separating their values into coherent columns.
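
The notion of a row pattern can be illustrated with a simple sketch (again an invented example, not ExtracTable's actual method): among several candidate delimiters, keep the one that splits every raw line into the same, maximal number of columns.

from collections import Counter

# Raw lines of a poorly structured file (made-up example data).
lines = [
    "ABC;123;Berlin",
    "DEF;456;Potsdam",
    "GHI;789;Hamburg",
]

def infer_delimiter(raw_lines, candidates=(",", ";", "\t", "|")):
    """Pick the candidate delimiter that yields a consistent column count > 1."""
    best = None
    for delim in candidates:
        counts = Counter(len(line.split(delim)) for line in raw_lines)
        columns, frequency = counts.most_common(1)[0]
        consistent = frequency == len(raw_lines) and columns > 1
        if consistent and (best is None or columns > best[1]):
            best = (delim, columns)
    return best[0] if best else None

delimiter = infer_delimiter(lines)
table = [line.split(delimiter) for line in lines]
print(delimiter)  # ';'
print(table)      # [['ABC', '123', 'Berlin'], ['DEF', '456', 'Potsdam'], ['GHI', '789', 'Hamburg']]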

The problem of ill-formed records, where different observations within one dataset may have different formats, also needs to be considered. To deal with this problem, Prof. Naumann's chair is working on a solution that uses pattern recognition (e.g., three uppercase letters followed by a sequence of three digits, and so on) to separate ill-formed observations from well-formed rows. For instance, the most frequent pattern could be declared the well-formed one, while all other rows are declared ill-formed and could potentially be repaired automatically in a second step.
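
A minimal sketch of this idea, using an invented identifier column rather than the chair's actual implementation, abstracts each value into a character-class pattern and flags all rows deviating from the most frequent pattern:

import re
from collections import Counter

# Made-up identifier column; the expected format is three uppercase letters
# followed by three digits, e.g. "ABC123".
values = ["ABC123", "XYZ987", "QRS45", "JKL456", "abc123"]

def pattern_of(value):
    """Abstract a value into a character-class pattern, e.g. 'ABC123' -> 'AAA000'."""
    value = re.sub(r"[A-Z]", "A", value)
    value = re.sub(r"[a-z]", "a", value)
    value = re.sub(r"[0-9]", "0", value)
    return value

patterns = [pattern_of(v) for v in values]
dominant, _ = Counter(patterns).most_common(1)[0]

well_formed = [v for v, p in zip(values, patterns) if p == dominant]
ill_formed = [v for v, p in zip(values, patterns) if p != dominant]
print(dominant)    # 'AAA000'
print(ill_formed)  # ['QRS45', 'abc123'] -> candidates for automatic repair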

"AggreCol" is a solution to automatically detect various types of aggregations in verbose CSV files. It supports sum, difference, average, division, and relative change. In the first stage, "AggreCol" detects adjacent aggregations of each function (e.g. sum) separately. An aggregation is adjacent if the set of cells in its range are adjacent to its aggregate. The second stage collects the individual detection results and removes the spurious aggregations with a set of pruning rules. In the last stage, it aims to recognize non-adjacent aggregations by skipping the aggregates of detected aggregations [Source]. Due to rounding errors aggregations sometimes are not exact making it even harder to recognize them.  

Bad data and solutions for data cleaning 

In the following, problems that can occur once the data has been loaded into a database are emphasised. A first example is the case of the driver "Prawo Jazdy", who seemed to be Ireland's worst driver, having collected a large number of speeding and parking tickets. Only after a while did the authorities realize that "Prawo Jazdy" is not an actual driver's name but the Polish term for driver's license.

Another example is incorrect postal addresses (e.g., due to relocations), which can lead to a large number of returned letters for companies with many customers. These companies still have to ship the documents to the right address, meaning a lot of manual work is needed to repair the data. In even worse cases, wrong addresses or data misinterpreted due to bad handwriting may lead to Americans not receiving their tax refunds, which are usually sent as a physical cheque posted by the IRS.

Of course, it is of interest to fix this kind of data quality issue. Data quality, or information quality in a broader sense, has been researched for a few decades now. One very generic definition of data quality is "fitness for use": if the data is fit for use, it has high quality; if it is not, it has low quality. A large number of data quality dimensions can be listed, such as accuracy, objectivity, believability, reputation, etc. One research paper identified a total of 179 dimensions and ultimately boiled them down to a set of 15 core data quality dimensions. The lecture continued by explaining some of these dimensions in more detail, concluding that data quality is not just about counting mistakes but that there is much more to it.

In their research, Prof. Naumann and his colleagues try to assess these dimensions. Additionally, in today's times of AI, new types of dimensions arise, such as the diversity of a data set used, e.g., to train a machine learning model. Privacy considerations are also becoming more and more important. Especially in machine learning pipelines, which include data acquisition, modelling, and deployment, new dimensions such as liability have to be assessed, if they are even measurable. Even though the concept of data quality touches on a broad field, the abstraction to "fitness for use" actually makes the problem more comprehensible.

After understanding the problem and discussing ways to measure data quality, methods for actually cleaning the data become of interest. Data cleaning requires two steps: detecting the errors and then repairing the data. So first, methods to detect errors have to be established, defining what to detect and how to detect it. It would of course make a lot of sense to clean the data at its source, but usually there is no access to the source.

The second step, data repair, involves several decisions of its own. First, the repair target has to be specified, i.e., what to repair, followed by questions such as whether issues should be detected one at a time or whether a more holistic approach is desirable. As with error detection, the repair step can be automated or human-guided. Another aspect to decide is whether the source data should be cleaned or whether cleaning techniques should, e.g., be built into a machine learning model (the target) so that it also works with other "uncleaned" data sets. These ideas are mostly drawn from the paper "Trends in Cleaning Relational Data: Consistency and Deduplication".

In order to detect errors, constraints can be used, i.e., problems are found by identifying data that violates a constraint. A constraint could, for example, be a uniqueness constraint, so that every duplicate is identified as an error. More complex functional constraints, which may depend on the actual data (such as: if two people have the same zip code, they must live in the same state), are also conceivable. Subsequently, minimal repairs can be made to the data so that the constraints are no longer violated. A number of solutions are available to address these issues. "HoloClean", for example, is a holistic cleaning system that works with many constraints and large amounts of data at the same time, offering a more advanced solution.
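
As an illustration of constraint-based error detection (a hedged sketch on invented records, not how HoloClean works internally), the following snippet checks a uniqueness constraint on an id column and the functional dependency zip code -> state:

from collections import Counter, defaultdict

# Invented example records: (id, zip_code, state).
people = [
    (1, "14482", "Brandenburg"),
    (2, "14482", "Berlin"),      # violates the functional dependency zip -> state
    (3, "10115", "Berlin"),
    (3, "20095", "Hamburg"),     # violates the uniqueness constraint on id
]

# Uniqueness constraint: every id may occur only once.
id_counts = Counter(person[0] for person in people)
duplicate_ids = [pid for pid, count in id_counts.items() if count > 1]

# Functional dependency: records with the same zip code must have the same state.
states_by_zip = defaultdict(set)
for _, zip_code, state in people:
    states_by_zip[zip_code].add(state)
fd_violations = [zip_code for zip_code, states in states_by_zip.items() if len(states) > 1]

print(duplicate_ids)   # [3]
print(fd_violations)   # ['14482']

A repair step would then change as few values as possible so that both checks pass again, which is the "minimal repair" idea mentioned above.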

Summary 

In the time of the fourth, data-intensive paradigm, data quality and data preparation are relevant problems: they are prevalent, time-consuming, and lead to a multitude of follow-up problems and errors. To separate the concepts of bad files and bad data, a number of explanations and examples were given above. Defining the relevant terms and problems helps in understanding and preventing the underlying issues. Fortunately, research groups such as Prof. Naumann's are tackling the problem and developing efficient solutions based on interesting approaches, some of which are publicly available.

In Prof. Naumann's view, data quality and data preparation are both frustrating and at the same time exciting research problems, and there is plenty of work and research still to be done. Prof. Naumann concluded the lecture with the offer to contact him in case of further interest, e.g., in order to start new research projects.