Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Pollock: A Data Loading Benchmark

The lingua franca for data

In data-driven workflows, practicioners typically resort to a plethora of different systems:

  •  Wrangling tools, such as Trifacta, can be used for the early preparation and cleaning stages;
  •  Spreadsheet tools, such as Microsoft Excel, are often used to carry out data exploration and analysis;
  •  Programming frameworks, such as Pandas, Tensorflow, PyTorch, allow users to perform statistical and machine-learning operations;
  •  Business intelligence software, such as Tableau, to create interactive dashboards and data visualizations;
  •  Database management systems, such as MySQL,  are traditionally used to efficiently organize and store data;

For its flexibility, ease of use and parsing, the de-facto standard file format for the data flow between heterogeneous systems is CSV.
Albeit the structure of CSV files is defined in the RFC4180 document, often times files have a different structure, for example due to different locale specifications (e.g. comma used for floating point delimiters in Germany).
Moreover, data files may contain metadata information such as preambles or footnotes that should not be considered as part of the tabular content of CSV files.

 

Real world data, real world problems

In our research, we crawled 3712 publicly available CSV files distributed on two Open Data portals: GovUK and Mendeley Data.

The first gathers data produced by actors of the public sector of the United Kingdom, e.g., city councils, ministry cabinets, or hospitals. The second acts as a repository for research projects carried by scientifical institutions, e.g. universities or research institutes.

We analyzed the crawled CSV files with respect to their compliance to the RFC4180 standard under different aspects: at the file-level, at the table-level and at the row- and column-level.

The image below summarizes the results of our finding.

 

Results of our real-world csv survey
Results of our real-world csv survey

The results of our survey highlight how often real-world files do not follow the standard specification of the RFC4180 document with respect to different structural features.

The key question that arises from these results is: how do different tools handle loading files with such non-standard features?

 

Pollock: Pollluted CSV Benchmark

The goal of the Pollock project is to provide a benchmark to assess the capabilities of a systems in loading data files with a non-standard structure.
We introduce the concept of "pollution", a modification of the structure of the file to let it deviate from the standard.
Our benchmark is run by systematically polluting a CSV file with different pollutions, and trying to load it in a system under test.
The result of the loading, for each input polluted file, is then exported into an output CSV file and compared with the input.

 

The Pollock Benchmark pipeline
The Pollock Benchmark pipeline

We analyze loaded files at three levels.

  •  Header: we consider a header to be loaded correctly, if the output files it contains all and only the columns of the input file.
  •  Record: since relational tables are not order-sensitive with respect to records, but columns are ordered, we consider records to be equal if they contains the same values in the same order.
  •  Cell: at the cell level, we do not consider record or column order, but consider two cells to be equal if their content is the same.

For every level we evaluate precision, recall, and F1-score comparing the input and the output of the loading. Additionally, we include the "success" measure, a binary measure of whether the system was able to load the file in memory.

The final benchmark score is obtained as the average of these scores across the original input file and the 2289 polluted versions of it, and finally sum across different scores.

As all the scores are in the range [0,1] with 1 being the hightest possible, the maximum score obtainable by a given system is 10.

We experiment with two weighting schemes for the average: a "simple" average, giving a weight of one to every polluted file, and a "weighted" average giving a weight to each polluted file depending on the frequency of the pollution in our real-world survey.

We used our benchmark to evaluate sixteen different real-world systems in four categories:

  • CSV parsing module from three different languages: CleverCSV, Pandas, and the standard library parser for Python; OpenCSV, Univocity, and CSVCommons for Java; Hypoparser and the standard library parser for R.
  • Relational databases: MariaDB, MySQL, PostgreSQL and SQLite.
  • Spreadsheet software: LibreOffice Calc, a commercial software defined as SpreadDesktop, and a web-based software defined as SpreadWeb (real names omitted due to licensing).
  • Business intelligence tools: a commercial tool defined as DataViz (real name omitted due to licensing).

The results of sixteen different real-world systems on the Pollock benchmark can be seen in the table below:

Benchmark results of different systems
Benchmark results of different systems

A detailed discussion of the benchmark, the pollution process and the experimental results can be found in a research paper, available in the GitHub project page of Pollock.

Resources

Code

Pollock is an open source project. Its code is available on the project page at GitHub.

Datasets

Here we provide links to the datasets and their annotations used for the real-world CSV survey. Each link points to a folder that contains three resources: the crawled csv files, the annotation for each of them regarding their structure in JSON format, and a csv containing the provenance metadata for each of the files in the datasets, e.g., their webpage url and link to download them.

Please note that since the crawled files have been crawled as of April 1st, 2021, the resources may not be available anymore at their original locations.

DatasetSource# FilesDescription
GovUKdata.gov.uk1,438A random sample of the CSV files available on the Open Data repository of the government of the United Kingdom, data.gov.uk.
Mendeleydata.mendeley.com2,274A random sample of the CSV files belonging to projects shared on the Mendeley Data portal.

Publications

  • G. Vitagliano, L. Reisener, M. Hameed, L. Jiang, E. Wu, F. Naumann: Pollock: A Data Loading Benchmark. PVLDB, 16(8):1870-1882, 2023. doi: 10.14778/3594512.3594518

Contacts

This research project is conducted at the Data Preparation group. If you have any question, interesting idea, or request, do not hesitate to contact Gerardo Vitagliano.