Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Preparation

Data preparation is the process of transforming data before passing it to downstream tasks such as data analytics, data cleaning, and machine learning. Many datasets do not meet the requirements of these tasks, so users, from expert data scientists to novices, frequently resort to ad-hoc data preparation. Preparing data is reported to be labour-intensive and tedious, accounting for 50%-80% of the time spent across the whole data lifecycle.

We are building a data preparation framework with two goals:

  • Enable users to rapidly prepare data
  • Enable repeatability of scientific experiments by deriving suitable data preparation specifications

Taxonomy

We propose one taxonomy for metadata and one for preparators. We use the metadata and preparators they define to create standard specifications of data transformations. The complete taxonomies can be found here.

Team Members

Projects

Data Knoller - A systematic data preparation framework

Strudel - Structure Detection in Verbose CSV Files

  • Datasets
  • Annotation Tool

AggreCol - Aggregation Detection in Verbose CSV Files

  • Datasets

Verbose CSV File Normalization - Recognize the useful parts of an arbitrary verbose CSV file and transform them into a normalized table.

Mondrian - Detecting layout templates in complex multiregion files

Pollock - A Data Loading Benchmark

SURAGH - Syntactic Pattern Matching to Identify Ill-Formed Records

Survey - Data Preparation: A Survey of Commercial Tools

Publications

  • Jiang, Lan, Gerardo Vitagliano, and Felix Naumann. “Structure Detection in Verbose CSV Files”. In International Conference on Extending Database Technology (EDBT), 193–204, 2021. https://edbt2021proceedings.github.io/docs/p32.pdf.
  • Hameed, Mazhar, and Felix Naumann. “Data Preparation: A Survey of Commercial Tools”. SIGMOD Record 49, no. 3 (2020).
  • Jiang, Lan, Gerardo Vitagliano, and Felix Naumann. “A Scoring-Based Approach for Data Preparator Suggestion”. In Lernen, Wissen, Daten, Analysen (LWDA), 2454:6–9, 2019.

Related Work

We have compiled the related work for each research topic in data preparation. The full list can be found here.

Contact

For further information on this project, please feel free to contact us: Felix Naumann, Lan Jiang, Gerardo Vitagliano, Mazhar Hameed.