Data files are commonly created and distributed in the CSV format. Although CSV files should follow the standard format specified in RFC 4180, in practice they often use custom dialects. For example, in the German locale floating-point numbers are written with a decimal comma, so the semicolon is a common choice of field delimiter to avoid ambiguity. Moreover, beyond non-standard dialects, some files contain metadata in the form of preamble lines, footnotes, or even multiple tables [6,7,8].
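As a sketch of how such a dialect differs from the standard, the following snippet parses a hypothetical semicolon-delimited, decimal-comma file with Python's standard `csv` module (the data and column names are illustrative, not from any real dataset):

```python
import csv
import io

# Hypothetical German-locale CSV: semicolon-delimited fields,
# decimal commas inside the numeric values.
raw = "Stadt;Temperatur\nBerlin;21,5\nHamburg;19,3\n"

reader = csv.reader(io.StringIO(raw), delimiter=";")
header = next(reader)
# Convert the decimal comma to a decimal point before parsing floats.
rows = [(city, float(value.replace(",", "."))) for city, value in reader]
```

A parser configured for the RFC 4180 defaults (comma delimiter, decimal point) would instead read each of these lines as a single field.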
As a result, loading data from CSV files is typically more cumbersome than simply parsing them according to the standard specification.
At the same time, because it is textual, which grants flexibility for reading and writing, and non-proprietary, CSV is the most common format used to distribute, consume, and share data.
In a typical data-oriented workflow, several systems are at play: programming frameworks for statistical and machine-learning operations; business intelligence tools to build dashboards and data visualizations; and database management systems to store data.
To load data properly into any of these systems, users usually need to resort to cumbersome operations that address the structure of these files: for example, changing delimiters to commas (and fixing column formats accordingly), removing metadata lines, or extracting the individual tables. We refer to these operations as structural data preparation.
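One such operation, stripping preamble lines before the table proper, can be sketched as follows. The heuristic below (detecting where consecutive lines start agreeing on a field count greater than one) and the sample file are illustrative assumptions, not a method proposed in this work:

```python
import csv
import io

# Hypothetical file with two metadata lines before the actual table.
raw = (
    "Report generated 2021-01-01\n"
    "Source: internal database\n"
    "id;name\n"
    "1;Alice\n"
    "2;Bob\n"
)

def table_start(lines, delim=";"):
    """Return the index of the first line that looks like table data:
    more than one field, and the same field count as the next line."""
    counts = [len(line.split(delim)) for line in lines]
    for i in range(len(counts) - 1):
        if counts[i] > 1 and counts[i] == counts[i + 1]:
            return i
    return 0

lines = raw.splitlines()
start = table_start(lines)
rows = list(csv.reader(io.StringIO("\n".join(lines[start:])), delimiter=";"))
```

Real files defeat such simple heuristics easily (e.g. preambles that themselves contain delimiters), which is precisely why structural data preparation is a burden in practice.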