Hasso-Plattner-InstitutSDG am HPI
Hasso-Plattner-InstitutDSG am HPI

Structural Preparation of Tabular Data Files

Gerardo Vitagliano

Information Systems Group
Hasso Plattner Institute

Office: F-2.05
Tel.: +49 03315509427
Email: gerardo.vitagliano(at)hpi.de
Links: Homepage

Member of the HPI Research School since December 2018

Supervisor: Prof. Dr. Felix Naumann

My research projects aim at characterizing the structural issues that require data preparation and design algorithms that can automatically perform or assist users in their data preparation tasks. Currently, I am actively researching on two projects: Pollock, to benchmark CSV data loading, and MaGRiTTE, to synthetically represent the structure of data files.

Structural Data Preparation

Data files are commonly created and distributed in the CSV file format. Although CSV files should follow a standard format, specified by the RFC4180, in practice they often use custom formats. For example, in the German locale floating point numbers are described with a comma, therefore a common choice for the value delimiter of fields is semicolon, to avoid confusion. Moreover, apart from non-standard dialects, some files contain metadata information in the form of preamble lines, footnotes or even multiple tables [6,7,8].

Because of this, loading data from CSV files is typically more cumbersome than simply parsing it according to the standard specification.

At the same time, because of its textual nature that grants flexibility for reading/writing operations and its non-proprietary nature, CSV is the most-common format used to distribute, consume and share data.

In a typical data-oriented workflow, different systems are at play: e.g. programming frameworks, for statistical and machine-learning operations; business intelligence tools, to build dashboards and data visualizations; database management systems, to store data.

Usually, in order to properly load data in any of these systems, users need to resort to cumbersome operations to address the structure of these files. For example, change delimiters to comma and possibly fix column formats accordingly, remove metadata lines, or extract the different tables. We refer to these operations as structural data preparation.

Pollock: A Data Loading Benchmark

The different, non-standard formats in which tabular data comes raise interesting research questions:

  • Can real-world tools successfully load data from non-standard CSV files?
  • What features affect most the (in)ability of a system to load data (and thus require preparation)?
  • How to measure the correctness of the data loading stage?

To address these questions, we designed a benchmark named Pollock, which stands for Polluted CSV benchmarK [2].

The term "pollution" describes a structural inconsistency that lets a file deviate from the CSV standard.

Pollock benchmarks a given system under test by systematically polluting a standard CSV file, loading it into the system, and measure its performances using a set of numerical metrics.

The Pollock Benchmark

We analyzed 3,712 CSV files crawled from open data portals, and identified a list of 2289 pollutions.
Drawing from the results of this real-world survey, we also created an exemplary source file to use as a starting point for the pollution process.

We analyze loaded files at three levels.

  •  Header: we consider a header to be loaded correctly, if the output files it contains all and only the columns of the input file.
  •  Record: since relational tables are not order-sensitive with respect to records, but columns are ordered, we consider records to be equal if they contains the same values in the same order.
  •  Cell: at the cell level, we do not consider record or column order, but consider two cells to be equal if their content is the same.

For every level we evaluate precision, recall, and F1-score comparing the input and the output of the loading. Additionally, we include the "success" measure, a binary measure of whether the system was able to load the file in memory.

The final benchmark score is obtained as the average of these scores across the original input file and the 2289 polluted versions of it, and finally sum across different scores.

As all the scores are in the range [0,1] with 1 being the hightest possible, the maximum score obtainable by a given system is 10.

We experiment with two weighting schemes for the average: a "simple" average, giving a weight of one to every polluted file, and a "weighted" average giving a weight to each polluted file depending on the frequency of the pollution in our real-world survey.

We used our benchmark to evaluate sixteen different real-world systems in four categories:

  • CSV parsing module from three different languages: CleverCSV, Pandas, and the standard library parser for Python; OpenCSV, Univocity, and CSVCommons for Java; Hypoparser and the standard library parser for R.
  • Relational databases: MariaDB, MySQL, PostgreSQL and SQLite.
  • Spreadsheet software: LibreOffice Calc, a commercial software defined as SpreadDesktop, and a web-based software defined as SpreadWeb (real names omitted due to licensing).
  • Business intelligence tools: a commercial tool defined as DataViz (real name omitted due to licensing).

The results of sixteen different real-world systems on the Pollock benchmark can be seen in the table below:

MaGRiTTE: Structural file embeddings

Preparing polluted files to leverage their data requires first and foremost to understand their structure, e.g., recognizing their dialect, extract metadata or detect the tabular regions within them.

While significant amount of work has been devoted to understanding the semantic of cells in tabular data files, a correct parsing is alway required, expecting knowledge, or consistency, of syntactical properties.

We propose MaGRiTTE [1, 3], standing for Machine Generated Representation of Tabular files with Transformer Encoders, a novel architecture to learn a representation of the structure of data files.

The architecture of MAGRiTTE is composed of two components. The first is a transformer-encoder architecture, based on BERT and pre-trained to learn row embeddings. To pre-train the transformer architecture on structural features, we propose two core adaptations: a novel tokenization stage and specialized training objectives.

The second stage is used to produce file-level embeddings, synthethizing local row- and column- features into a single file encoding vector. For this reason, we resort to a convolutional architecure, trained as a variational-autoencoder to produce file-level embeddings.
The image below shows an overview of the MaGRiTTE architecture.

    Research directions

    The topic of structural data preparation reserves many open and challenging research questions. Real-world files, as seen, are often polluted and require data preparation [3,5,8].

    One interesting question is: Is it possible to precisely characterize and quantify the degree of pollution of a file?

    First, being able to describe file pollutions with certainty allows precise engineering of the data preparation effort needed to consume the payload of a file. Plus, having a machine-readable description of the "structural problems" of a file, such as a feature vector, opens up the possibility of performing automatic pipelines for its preparation, either generating them ad-hoc or adapting them from known pipelines addressing similar problems.

    A combination of the MaGRiTTe model architecture with the Pollock benchmark may lead to a more systematic approach to reproduce the distribution of real-world pollutions in a set of files. Such results would directly benefit the Pollock project, allowing to benchmark systems under unforeseen, but realistic, structural file pollutions.


    1. G. Vitagliano, M. Hameed, F. Naumann: Learning file structure for data preparation. In preparation
    2. G. Vitagliano, L. Reisener, L. Jiang, M. Hameed, F. Naumann: Pollock: A Data Loading Benchmark. In preparation
    3. G. Vitagliano, M. Hameed, F. Naumann: Structural embedding of data files with MaGRiTTE. Table Representation Learning Workshop at NeurIPS (TRL@NIPS), 2022.
    4. G. Vitagliano, L. Reisener, L. Jiang, M. Hameed, F. Naumann: Mondrian: Spreadsheet Layout Detection. Proceedings of the International Conference on Management of Data (SIGMOD), 2022
    5. M. Hameed, G. Vitagliano, L. Jiang, F. Naumann: SURAGH: Syntactic Pattern Matching to Identify Ill-Formed Records. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022
    6. L. Jiang, G. Vitagliano, M. Hameed, F. Naumann: Aggregation Detection in CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2022
    7. G. Vitagliano, L. Jiang, F. Naumann: Detecting Layout Templates in Complex Multiregion Files. PVLDB (in press), doi: 10.14778/3494124.3494145 , 2021.
    8. L. Jiang, G. Vitagliano, F. Naumann: Structure Detection in Verbose CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2021
    9. L. Jiang, G. Vitagliano, F. Naumann: A Scoring-based Approach for Data Preparator Suggestion. Lernen, Wissen, Daten, Analysen (LWDA), 2019