Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Overview

Data preparation is the process of transforming data before they are passed to downstream tasks such as data analytics, data cleaning, and machine learning. Data often do not meet the requirements of these tasks, so users, from expert data scientists to novices, frequently resort to ad-hoc data preparation. Preparing data is reported to be labour-intensive and tedious, accounting for 50%-80% of the time spent in the whole data lifecycle.

We are exploring how to build a data preparation framework that achieves two goals:

  • Enable users to rapidly prepare data
  • Enable repeatability of scientific experiments by deriving a suitable data preparation specification

Taxonomy

We propose one taxonomy for metadata and one for preparators. We use the metadata and preparators defined in them to create standard specifications of data transformations.
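To make the idea of a specification more concrete, here is a minimal sketch of how one might be represented: metadata describing the input, plus an ordered list of preparator steps, serialized so that a preparation can be repeated. The class names, field names, and preparator names are all hypothetical and chosen for illustration; the project's actual specification format is not shown on this page.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Step:
    """One preparator invocation (the name and parameters are hypothetical)."""
    preparator: str
    params: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Specification:
    """Metadata about the input plus the ordered preparation steps."""
    metadata: Dict[str, Any]
    steps: List[Step]

    def to_json(self) -> str:
        # asdict converts nested dataclasses (the steps) into plain dicts.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "Specification":
        raw = json.loads(text)
        return Specification(raw["metadata"], [Step(**s) for s in raw["steps"]])


spec = Specification(
    metadata={"encoding": "latin-1", "preamble_lines": 2},
    steps=[
        Step("change_file_encoding", {"target": "utf-8"}),
        Step("remove_preamble"),
    ],
)
restored = Specification.from_json(spec.to_json())
```

Storing the specification as plain text in this way would allow the same steps to be replayed on the same input, which is the kind of repeatability the second goal describes.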

Metadata taxonomy

The metadata taxonomy covers the metadata that indicate properties of the data that are useful for data preparation.

Preparator taxonomy

In the preparator taxonomy, each green node represents a group of preparators, while each yellow node represents an actual preparator. We try to make the set of preparators in this taxonomy as complete as possible. In addition to a large number of general preparators, it includes preparators for various data models, namely the relational model, tree model, graph model, and flat files.

Preparator API

Change file encoding
Rename file
Change property
Remove preamble
Split file
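The page lists these preparators by name only, without signatures. As a rough illustration of what three of them might do, the following sketch implements simple versions of changing a file's encoding, removing a preamble, and splitting a file. The function names, parameters, and the rules used (a comment marker for the preamble, a separator line for splitting) are assumptions made for this example, not the project's API.

```python
from pathlib import Path
from typing import List


def change_file_encoding(path: Path, source: str, target: str) -> None:
    """Re-encode a text file in place from `source` to `target` encoding."""
    text = path.read_bytes().decode(source)
    path.write_bytes(text.encode(target))


def remove_preamble(lines: List[str], marker: str = "#") -> List[str]:
    """Drop leading lines that are blank or start with `marker`.

    Assumption: the preamble is a run of comment-like lines at the top.
    """
    start = 0
    while start < len(lines) and (
        not lines[start].strip() or lines[start].startswith(marker)
    ):
        start += 1
    return lines[start:]


def split_file(lines: List[str], separator: str) -> List[List[str]]:
    """Split lines into parts at every line equal to `separator`.

    Separator lines are discarded and empty parts are dropped.
    """
    parts: List[List[str]] = [[]]
    for line in lines:
        if line.strip() == separator:
            parts.append([])
        else:
            parts[-1].append(line)
    return [part for part in parts if part]
```

Only the leading block is treated as preamble here, so a line starting with the marker later in the file is kept.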

 

Contact

For further information on this project, please contact Lan Jiang.