Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Pollock: A Data Loading Benchmark

The lingua franca for data

In data-driven workflows, practicioners typically resort to a plethora of different systems:

  •  Wrangling tools, such as Trifacta, can be used for the early preparation and cleaning stages;
  •  Spreadsheet tools, such as Microsoft Excel, are often used to carry out data exploration and analysis;
  •  Programming frameworks, such as Pandas, Tensorflow, PyTorch, allow users to perform statistical and machine-learning operations;
  •  Business intelligence software, such as Tableau, to create interactive dashboards and data visualizations;
  •  Database management systems, such as MySQL,  are traditionally used to efficiently organize and store data;

For its flexibility, ease of use and parsing, the de-facto standard file format for the data flow between heterogeneous systems is CSV.
Albeit the structure of CSV files is defined in the RFC4180 document, often times files have a different structure, for example due to different locale specifications (e.g. comma used for floating point delimiters in Germany).
Moreover, data files may contain metadata information such as preambles or footnotes that should not be considered as part of the tabular content of CSV files.

 

Real world data, real world problems

In our research, we crawled 3712 publicly available CSV files distributed on two Open Data portals: GovUK and Mendeley Data.

The first gathers data produced by actors of the public sector of the United Kingdom, e.g., city councils, ministry cabinets, or hospitals. The second acts as a repository for research projects carried by scientifical institutions, e.g. universities or research institutes.

We analyzed the crawled CSV files with respect to their compliance to the RFC4180 standard under different aspects: at the file-level, at the table-level and at the row- and column-level.

The image below summarizes the results of our finding.

 

Results of our real-world csv survey
Results of our real-world csv survey

The results of our survey highlight how often real-world files do not follow the standard specification of the RFC4180 document with respect to different structural features.

The key question that arises from these results is: how do different tools handle loading files with such non-standard features?

 

Pollock: Pollluted CSV Benchmark

The goal of the Pollock project is to provide a benchmark to assess the capabilities of a systems in loading data files with a non-standard structure.
We introduce the concept of "pollution", a modification of the structure of the file to let it deviate from the standard.
Our benchmark is run by systematically polluting a CSV file with different pollutions, and trying to load it in a system under test.
The result of the loading, for each input polluted file, is then exported into an output CSV file and compared with the input.

 

The Pollock Benchmark pipeline
The Pollock Benchmark pipeline

For each loading step we calculate three measures:

  • Success: a binary measure if the system was able to load the file in memory.
  • Completeness: a score in [0,1] if all input cells are present in the output.
  • Coinciseness: a score [0,1] if no spurious cells are introduced in the data loading process.

The final benchmark score is obtained summing up the three different score: as Pollock is composed of a source file with 125 pollutions, the maximum score obtainable by a system is 375.

We experimented applying Pollock to benchmark four systems: a data science programming framework (DS), a business intelligence tool (BI), a spreadsheet software (SS), and a database management software (DB).
The results can be seen in the table below.

Benchmark results of different systems
Benchmark results of different systems

A detailed discussion of the benchmark, the pollution process and the experimental results can be found in a research paper, available in the GitHub project page of Pollock.

Resources

Code

Pollock is an open source project. Its code is available on the project page at GitHub.

Datasets

Here we provide links to the datasets and their annotations used for the real-world CSV survey. Each link points to a folder that contains three resources: the crawled csv files, the annotation for each of them regarding their structure in JSON format, and a csv containing the provenance metadata for each of the files in the datasets, e.g., their webpage url and link to download them.

Please note that since the crawled files have been crawled as of April 1st, 2021, the resources may not be available anymore at their original locations.

DatasetSource# FilesDescription
GovUKdata.gov.uk1,438A random sample of the CSV files available on the Open Data repository of the government of the United Kingdom, data.gov.uk.
Mendeleydata.mendeley.com2,274A random sample of the CSV files belonging to projects shared on the Mendeley Data portal.

Contacts

This research project is conducted at the Data Preparation group. If you have any question, interesting idea, or request, do not hesitate to contact Gerardo Vitagliano.

Publications

  1. G. Vitagliano, L. Jiang, M. Hameed, F. Naumann: Pollock: A Data Loading Benchmark. In preparation