Data Preparation

Data preparation is the process of transforming data before passing it to downstream tasks, such as data analytics, data cleaning, and machine learning. Much data does not meet the requirements of these tasks, so users, from expert data scientists to novice data users, frequently resort to ad-hoc data preparation. Preparing data is reported to be labour-intensive and tedious, accounting for 50%-80% of the time spent in the whole data lifecycle.

We are exploring how to build a data preparation framework that achieves two goals:

Enable users to rapidly prepare data
Enable repeatability of scientific experiments by deriving a suitable data preparation specification

Taxonomy

We propose one taxonomy for metadata and one for preparators. We use the defined metadata and preparators to create standard specifications of data transformations. The complete taxonomies can be found here.

Team Members

Prof. Dr. Felix Naumann (Project leader)
Lan Jiang
Gerardo Vitagliano
Mazhar Hameed

Projects

Strudel - Structure detection in verbose CSV files
AggreCol - Aggregation detection in verbose CSV files
Mondrian - Detecting layout templates in complex multiregion files
Pollock - A data loading benchmark
MaGRiTTE - Learning structural embeddings of data files
Survey - Data preparation from industry perspective: A survey
Suragh - Detecting ill-formed records in CSV files
Tasheeh - Cleaning ill-formed records in CSV files

Publications

Mazhar Hameed, Gerardo Vitagliano, Fabian Panse, Felix Naumann: TASHEEH: Repairing Row-Structure in Raw CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2024
Mazhar Hameed, Gerardo Vitagliano, Felix Naumann: MORPHER: Structural Transformation of ill-formed Rows. Proceedings of the International Conference on Information and Knowledge Management (CIKM), 2023
Gerardo Vitagliano, Mazhar Hameed, Lucas Reisener, Lan Jiang, Eugene Wu, Felix Naumann: Pollock: A Data Loading Benchmark. Proceedings of the VLDB Endowment (PVLDB), 2023.
Gerardo Vitagliano, Mazhar Hameed, Felix Naumann: Structural embedding of data files with MaGRiTTE. Table Representation Learning Workshop at NeurIPS (TRL@NIPS), 2022.
Gerardo Vitagliano, Lucas Reisener, Lan Jiang, Mazhar Hameed, Felix Naumann: Mondrian: Spreadsheet Layout Detection. Proceedings of the International Conference on Management of Data (SIGMOD), 2022.
Lan Jiang, Gerardo Vitagliano, Mazhar Hameed, Felix Naumann: Aggregation Detection in CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022
Mazhar Hameed, Gerardo Vitagliano, Lan Jiang, Felix Naumann: SURAGH: Syntactic Pattern Matching to Identify Ill-Formed Records. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022.
Gerardo Vitagliano, Lan Jiang, Felix Naumann: Detecting Layout Templates in Complex Multiregion Files. Proceedings of the VLDB Endowment (PVLDB), 2022
[Paper] [ACM]
Lan Jiang, Gerardo Vitagliano, Felix Naumann: Structure Detection in Verbose CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2021
[Paper] [GitHub]
Mazhar Hameed, Felix Naumann: Data Preparation: A Survey of Commercial Tools. SIGMOD Record 49(3), 2020
[Paper] [ACM]
Ioannis Koumarelas, Lan Jiang, Felix Naumann: Data Preparation for Duplicate Detection. Journal of Data and Information Quality (JDIQ) 12(3): 1–24, 2020
Lan Jiang, Gerardo Vitagliano, Felix Naumann: A Scoring-based Approach for Data Preparator Suggestion. Lernen, Wissen, Daten, Analysen (LWDA), 2019
[Paper]

Related Work

We have compiled the related work for each individual research topic on data preparation. The full list can be found here.

Contact

For further information on this project, please feel free to contact us: Felix Naumann, Lan Jiang, Gerardo Vitagliano, Mazhar Hameed.
