
Structural Preparation of Tabular Data Files

Gerardo Vitagliano

Information Systems Group
Hasso Plattner Institute

Office: F-2.05
Tel.: +49 03315509427
Email: gerardo.vitagliano(at)hpi.de
Links: Homepage

Member of the HPI Research School since December 2018

Supervisor: Prof. Dr. Felix Naumann

My research projects aim at characterizing the structural issues that require data preparation and at designing algorithms that can automatically perform data preparation tasks or assist users with them. Currently, I am actively researching two projects: Pollock, to benchmark CSV data loading, and Mondrian, to recognize spreadsheet layout templates.

Structural Data Preparation

Data files are commonly created and distributed in the CSV file format. Although CSV files should follow a standard format, specified by RFC 4180, in practice they often use custom formats. For example, in the German locale floating-point numbers are written with a comma as the decimal separator, so a common choice for the field delimiter is the semicolon, to avoid confusion. Moreover, apart from non-standard dialects, some files contain metadata in the form of preamble lines, footnotes, or even multiple tables [3,4,6].

Because of this, loading data from CSV files is typically more cumbersome than simply parsing it according to the standard specification.
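Python's standard csv.Sniffer illustrates how a non-standard dialect can be detected before parsing; the German-locale sample below is made up for illustration:

```python
import csv
import io

# A non-standard CSV in a German locale (hypothetical sample data):
# semicolon-delimited fields, commas as decimal separators.
raw = "Stadt;Einwohner;Flaeche\nPotsdam;183154;188,6\nBerlin;3664088;891,8\n"

# csv.Sniffer inspects a text sample and guesses the dialect,
# so the file can be parsed without hard-coding the delimiter.
dialect = csv.Sniffer().sniff(raw)
rows = list(csv.reader(io.StringIO(raw), dialect))
```

Note that sniffing is a heuristic: it works on samples like this one, but ambiguous files (e.g., commas appearing both as delimiters and as decimal separators in every line) can still defeat it.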

At the same time, because it is a textual, non-proprietary format that is flexible to read and write, CSV is the most common format used to distribute, consume, and share data.

In a typical data-oriented workflow, different systems are at play: e.g. programming frameworks, for statistical and machine-learning operations; business intelligence tools, to build dashboards and data visualizations; database management systems, to store data.

Usually, in order to properly load data into any of these systems, users need to resort to cumbersome operations to address the structure of these files: for example, changing delimiters to commas (and possibly fixing column formats accordingly), removing metadata lines, or extracting the individual tables. We refer to these operations as structural data preparation.
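As a minimal sketch, such a preparation step might look like the following; the file contents, line counts, and helper name are hypothetical:

```python
import csv
import io

# A hypothetical polluted file: two preamble metadata lines, a footnote,
# and a semicolon dialect instead of the RFC 4180 comma.
polluted = (
    "Source: open data portal\n"
    "Exported: 2021-06-01\n"
    "city;population\n"
    "Potsdam;183154\n"
    "Note: population as of 2020\n"
)

def prepare(text, skip_header=2, skip_footer=1, delimiter=";"):
    """Drop preamble/footnote lines and rewrite the payload as standard CSV."""
    lines = text.splitlines()
    body = lines[skip_header:len(lines) - skip_footer]
    out = io.StringIO()
    writer = csv.writer(out)  # defaults: comma delimiter, CRLF line endings
    for line in body:
        writer.writerow(line.split(delimiter))
    return out.getvalue()

clean = prepare(polluted)
```

In practice the hard part is exactly what this sketch assumes away: knowing how many preamble and footnote lines to skip and which delimiter the file uses.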

Pollock: A Data Loading Benchmark

The many different, non-standard formats in which tabular data comes raise interesting research questions:

  • Can real-world tools successfully load data from non-standard CSV files?
  • Which features most affect a system's (in)ability to load data (and thus require preparation)?
  • How can the correctness of the data loading stage be measured?

To address these questions, we designed a benchmark named Pollock, which stands for Polluted CSV benchmarK [2].

The term "pollution" describes a structural inconsistency that makes a file deviate from the CSV standard.

Pollock benchmarks a given system under test by systematically polluting a standard CSV file, loading each polluted file into the system, and measuring its performance using a set of numerical metrics.
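The pollution process can be thought of as a set of transformations applied to a clean source file. The two pollutions below are simplified illustrations, not the actual transformations used in Pollock:

```python
# Two illustrative "pollutions": changing the field delimiter and
# injecting a preamble line before the payload.

def change_delimiter(csv_text, new_delimiter=";"):
    # Naive replacement: assumes no commas inside quoted fields.
    return csv_text.replace(",", new_delimiter)

def add_preamble(csv_text, preamble="Report generated automatically"):
    return preamble + "\n" + csv_text

source = "a,b,c\n1,2,3\n"
polluted_files = [pollute(source) for pollute in (change_delimiter, add_preamble)]
```

Each polluted file is then fed to the system under test, and its output is compared against the payload of the original source file.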

The Pollock Benchmark

We analyzed 3,712 CSV files crawled from open data portals, and identified a list of 124 pollutions.
Drawing from the results of this real-world survey, we also created an exemplary source file to use as a starting point for the pollution process.

The benchmark includes three metrics, measured in loading each polluted file:

  • Success: 0 or 1, indicating whether the system was able to load the file into memory.
  • Completeness: a number in [0,1], reflecting the ratio of input cells present in the output.
  • Conciseness: a number in [0,1], reflecting the ratio of output cells present in the input (i.e., penalizing spurious cells introduced in the data loading process).

The final benchmark score is the sum of these three metrics over the original input file and its 124 polluted versions. The maximum score obtainable by a given system is therefore 375.
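A minimal sketch of how these metrics could be computed, assuming cells are compared as multisets of values (the benchmark's actual matching procedure may differ, e.g., by comparing cells positionally):

```python
from collections import Counter

def pollock_scores(input_cells, output_cells):
    """Return (success, completeness, conciseness) for one loaded file.

    Cells are matched as multisets of values; a cell counts as matched
    if it appears in both the input and the system's output.
    """
    if not output_cells:          # nothing loaded at all
        return 0, 0.0, 0.0
    common = Counter(input_cells) & Counter(output_cells)
    matched = sum(common.values())
    completeness = matched / len(input_cells)   # input cells preserved
    conciseness = matched / len(output_cells)   # output cells that are genuine
    return 1, completeness, conciseness

# The system loaded "a" and "b", dropped "c" and "d", and invented "x".
success, completeness, conciseness = pollock_scores(
    ["a", "b", "c", "d"], ["a", "b", "x"]
)
```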

The results of four different real-world systems on the Pollock benchmark can be seen in the table below:

The four systems in the table are: a data science programming framework (DS), a business intelligence tool (BI), a spreadsheet software (SS), and a database management software (DB).

Mondrian: Multiregion Layout Detection

In general, polluted CSV files can contain multiple "regions" that serve different purposes: tables, footnotes, or additional metadata. We define such files as multiregion. In files that originate from the same source, region layouts may be repeated identically but with different data content. In light of automated data preparation, extraction, or integration, there is great value in recognizing the presence and layout of regions within a file and in discovering multiregion templates, i.e., file layouts that occur in multiple files.

We developed Mondrian [5], an automated approach to detect multiple regions in a spreadsheet, describe their layout using a graph representation, and compare these layouts with a similarity flooding-based algorithm.

First, the cells of a spreadsheet are converted into pixels, encoded with different colors, such that cells with similar syntax share similar colors (e.g., integer numbers are dark blue and floating point numbers are light blue, while strings are red).
Then, Mondrian detects regions by partitioning the image into groups of adjacent pixels and clustering these groups.
For each of the extracted regions, Mondrian calculates a "fingerprint" that reflects the region's syntactic and structural properties. This fingerprint is then used to compare regions across different files. Files that share at least one similar region are more likely to have the same layout, and are therefore selected as candidate instances of the same template.
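A toy version of the cell-to-color encoding step might look as follows; the color names and the order of the type checks are illustrative, not Mondrian's actual palette:

```python
# Map each cell value to a color by its syntactic type, so that
# syntactically similar cells get similar colors (sketch only).
def cell_color(value):
    try:
        int(value)
        return "dark_blue"       # integer numbers
    except ValueError:
        pass
    try:
        float(value)
        return "light_blue"      # floating-point numbers
    except ValueError:
        return "red"             # strings / everything else

# One spreadsheet row rendered as a row of colored pixels.
row = ["2021", "3.14", "total"]
pixels = [cell_color(v) for v in row]
```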

Each file layout is then described with a connected graph that encodes the information about the regions that it contains and their connectivity (i.e., how they are arranged relative to each other in the spreadsheet). Finally, using a graph similarity score, we measure the similarity between file layouts that had similar regions.
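As an illustration of the idea, the sketch below represents a layout as a set of labeled edges between regions and compares two layouts with a naive Jaccard overlap; this is a stand-in for, not a reproduction of, the similarity flooding-based comparison:

```python
# A toy layout graph: each edge (region_a, relation, region_b) records how
# two regions are arranged relative to each other in the spreadsheet.
def layout_similarity(edges_a, edges_b):
    """Jaccard overlap of labeled edges between two layout graphs."""
    a, b = set(edges_a), set(edges_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

file1 = [("header", "above", "table"), ("table", "above", "footnote")]
file2 = [("header", "above", "table")]
sim = layout_similarity(file1, file2)
```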

To improve its usability, Mondrian includes a graphical interface to assist end-users [1]. Users can load a given spreadsheet file, run the automated region detection and template recognition algorithms, visualize the results, and interactively adjust the detected regions and templates.

An overview of the Mondrian pipeline.

Research directions

The topic of structural data preparation holds many open and challenging research questions. As seen, real-world files are often polluted and require data preparation [3,4,7].

One interesting question is: Is it possible to precisely characterize and quantify the degree of pollution of a file?

First, being able to describe file pollutions with certainty allows precise estimation of the data preparation effort needed to consume the payload of a file. Second, a machine-readable description of the "structural problems" of a file, such as a feature vector, opens up the possibility of building automatic pipelines for its preparation, either generating them ad hoc or adapting known pipelines that address similar problems.

My ongoing and future research aims at obtaining a vectorial representation of the structural properties of a data file (e.g., the dialect, the number of columns, the cells in a row) and at systematically reproducing the distribution of real-world pollutions in a set of files. Such results would directly benefit the Pollock project, allowing systems to be benchmarked under unforeseen but realistic structural file pollutions.

Publications

  1. G. Vitagliano, L. Jiang, M. Hameed, F. Naumann: Mondrian: Spreadsheet Layout Detection. In preparation
  2. G. Vitagliano, L. Jiang, M. Hameed, F. Naumann: Pollock: A Data Loading Benchmark. In preparation
  3. M. Hameed, G. Vitagliano, L. Jiang, F. Naumann: SURAGH: Syntactic Pattern Matching to Identify Ill-Formed Records. Proceedings of the International Conference on Extending Database Technology (EDBT) - submitted, 2022
  4. L. Jiang, G. Vitagliano, M. Hameed, F. Naumann: Aggregation Detection in CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT) - submitted, 2022
  5. G. Vitagliano, L. Jiang, F. Naumann: Detecting Layout Templates in Complex Multiregion Files. PVLDB - accepted (2021).
  6. L. Jiang, G. Vitagliano, F. Naumann: Structure Detection in Verbose CSV Files. Proceedings of the International Conference on Extending Database Technology (EDBT), 2021
  7. L. Jiang, G. Vitagliano, F. Naumann: A Scoring-based Approach for Data Preparator Suggestion. Lernen, Wissen, Daten, Analysen (LWDA), 2019

Teaching