Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Preparation: A Survey of Commercial Tools

Abstract:

Raw data are often messy: they follow different encodings, records are not well structured, values ​​do not adhere to patterns, etc. Such data are in general not fit to be ingested by downstream applications, such as data analytics tools, or even by data management systems. The act of obtaining information from raw data relies on some data preparation process. Data preparation is integral to advanced data analysis and data management, not only for data science but for any data-driven applications. Existing data preparation tools are operational and useful, but there is still room for improvement and optimization. With increasing data volume and its messy nature, the demand for prepared data increases day by day. To cater to this demand, companies and researchers are developing techniques and tools for data preparation.
To better understand the available data preparation systems, we have conducted a survey to investigate (1) prominent data preparation tools, (2) distinctive tool features, (3) the need for preliminary data processing even for these tools and, (4) features and abilities that are still lacking. We conclude with an argument in support of automatic and intelligent data preparation beyond traditional and simplistic techniques.

The Need For Data Preparation

Raw data appears in many situations: logs, sensor output, government data, medical research data, climate data, geospatial data, etc. It accumulates in many places, such as file systems, data lakes or online repositories. 
In typical scenarios, raw data from various sources is accrued without any standardized formats or structure and with no specific target use-case; thus, it can appear messy, contain invalid characters, use different encodings, lack necessary columns, contain unwanted rows, have missing values, not follow valid patterns, etc. 

One example scenario in need of data preparation technology are data lakes to store heterogeneous raw data. They can turn into vast repositories or collections of unstructured, semi-structured, unformulated, messy, and unclean data. The large volumes of data in data lakes are compelling and can generate valuable information, provided they are thoroughly pre-processed, cleaned, and prepared. With the ever-increasing amount of raw data, the need for data preparation has become more apparent.

Preparing data yields many advantages, such as prompt error detection, improved analytics, improved data quality, enhanced scalability, accelerated data usage and more easy data collaboration.

Data Preparation vs. Data Cleaning

We define data preparation as the set of preprocessing operations performed in early stages of a data processing pipeline, i.e., data transformationsat the structural and syntactical levels. In contrast, data cleaning concerns sub-sequent data transformations and corrections at the semantic level.

Data Preparation Tasks

Data preparation is not a single step process. Rather, it usually comprises many individual preparation steps, implemented by what we call preparators, and which we have organized anew into six broader categories, defined here.

Data discovery is the process of analyzing and collecting data from different sources, for instance to match data patterns, find missing data, and locate outliers.

Data validation comprises rules and constraints to inspect the data, for instance for correctness, completeness, and other data quality constraints.

Data structuring encompasses tasks for the creation, representation and structuring of information. Examples include updating schema, detecting and changing encoding and, transform data by example.

Data enrichment adds value or supplementary information to existing data from separate sources. Typically, it involves augmenting existing data with new or derived data values using data lookups, primary key generation, and inserting metadata.

Data filtering generates a subset of the data under consideration, facilitating manual inspection and removing irregular data rows or values. Examples include extracting text parts, and keeping or deleting filtered rows.

Data cleaning refers to removal, addition, or replacement of less accurate or inaccurate data values with more suitable, accurate or representative values. Typical examples are deduplication, fill missing values, and removing whitespace.

Despite our definition, which distinguishes data preparation and cleaning, we include data cleaning steps here as well, as most data preparation tools also venture into this area.

This Project Makes the Following Contributions:


1. Organisation: We propose six broad categories of data preparation and identify 40 common data preparation steps, which we classify into those categories.
2. Documentation: We validate the availability of these features and broader categories for seven selected tools and document them in a feature matrix.
3. Evaluation: We evaluate the selected features of surveyed tools to identify whether the tool offers the stated functionalities or not.
4. Recommendation: We identify shortcomings of commercial data preparation tools in general and encourage researchers to explore further in the field of data preparation.

Discovered Tools with Asserted Data Preparation Capabilities

We collected notable commercial data preparation tools gathered from business reports and analyses, company portals, and online demonstration videos. Our preliminary investigation resulted in 42 initial commercial tools , which we then examined for the extent of their data preparation capabilities.

 

Selected Data Preparation Tools

Criteria for Selected Tools:

1. Domain specificity: tools that specifically address the data preparation task.
2. Comprehensiveness: the extent and sophistication to which tools adequately covered preparation features.
3. Guides and documentation: the availability of proper documentation for the tools, i.e., useful, up-to-date documentation with listings of features and how-to guides.
4. Trial availability: the availability of a trial version, giving us the opportunity to test the tools and validate their features.
5. GUI: the availability of a comprehensive and intuitive graphical user interface to select and apply preparations.
6. Customer assistance: compliant support teams that assisted users with generic and specific tool queries, when needed.

Preparator Matrix

Table below provides a feature matrix showing which preparator is supported by which tool in each of the six categories.

Further Features of Data Preparation Tools

In our survey of commercial tools, we came across many functional and non-functional system features that did not cater to our data preparation focus. Nonetheless, these features are important and interesting to explore and use. Therefore, we have collected them as well.

Contacts

The research project is conducted at the Data Preparation group. If you have any question, interesting idea, or request, please contact Mazhar Hameed.