Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Structure Detection in Verbose CSV Files

Large amounts of data are stored in semi-structured files with ad-hoc layouts. Such data are valuable digital assets for various data-driven applications. This work introduces the notion of verbose CSV files: files that contain content serving different purposes in different positions, typically designed for human visual inspection or for collecting statistical reports. An important preliminary task for extracting information from such files is structure detection, in particular classifying lines or cells by their purpose. Because manual annotation is infeasible and error-prone for large files or large sets of files, automatic approaches are desirable. This work addresses both the line and the cell classification problems on verbose CSV files. The following figure shows a typical verbose CSV file and the six types of cell and line classes.

This work, "Structure Detection in Verbose CSV Files", was published at EDBT'21.

Strudel

To address the line/cell classification problem, we propose the Structure Detection in Verbose CSV Files (Strudel) approach, which is built on a multi-class random forest classifier. The following figure shows the architecture of the approach. Strudel first detects the dialect of a text file and, based on that dialect, creates a verbose CSV file from it. It then classifies the lines first and the cells within them second, using the proposed feature sets. In the figure, cells of different types are distinguished by color. We propose dedicated features to model the individual classes for both classification tasks. The features fall into three groups: 1) content features, which parse the values of cells or lines, such as cell length and number of words; 2) contextual features, which compare the inspected cell or line with its neighbors, such as the similarity of data types between lines/cells; 3) computational features, which connect lines/cells with each other by inspecting arithmetic correlations between them.
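To make the three feature groups more concrete, here is a minimal sketch in Python. It is not the Strudel implementation: the sample file, the function names, and the specific features (empty-cell ratio, numeric ratio, word count, data type similarity, column-sum check) are illustrative assumptions, and dialect detection is approximated with the standard library's csv.Sniffer, restricted here to delimiter guessing.

```python
import csv
import io

# Hypothetical verbose CSV content: a title line, a header line, two data
# lines, a line holding column totals, and a source note.
RAW = (
    "Table 1: Population by region;;\n"
    "Region;Men;Women\n"
    "North;10;12\n"
    "South;20;22\n"
    "Total;30;34\n"
    "Source: example data;;\n"
)

def detect_delimiter(text):
    """Guess the delimiter with csv.Sniffer (a stand-in for dialect detection)."""
    return csv.Sniffer().sniff(text, delimiters=";,\t|").delimiter

def is_number(cell):
    """True if the cell value parses as a float."""
    try:
        float(cell)
        return True
    except ValueError:
        return False

def content_features(row):
    """Content features: computed only from the values of the line itself."""
    non_empty = [c for c in row if c.strip()]
    return {
        "empty_ratio": 1 - len(non_empty) / len(row),
        "numeric_ratio": sum(is_number(c) for c in non_empty) / max(len(non_empty), 1),
        "word_count": sum(len(c.split()) for c in non_empty),
    }

def type_pattern(row):
    """Coarse data type per cell: 'e' empty, 'n' numeric, 's' string."""
    return ["e" if not c.strip() else "n" if is_number(c) else "s" for c in row]

def contextual_similarity(row, neighbor):
    """Contextual feature: share of columns whose coarse type matches the neighbor."""
    a, b = type_pattern(row), type_pattern(neighbor)
    return sum(x == y for x, y in zip(a, b)) / len(a)

def is_sum_of_lines(rows, target, start):
    """Computational feature: True if every numeric cell of rows[target] equals
    the sum of the same column over rows[start:target]."""
    checked = False
    for col, cell in enumerate(rows[target]):
        if not is_number(cell):
            continue
        total = sum(float(r[col]) for r in rows[start:target] if is_number(r[col]))
        if abs(total - float(cell)) > 1e-9:
            return False
        checked = True
    return checked

DELIMITER = detect_delimiter(RAW)
ROWS = list(csv.reader(io.StringIO(RAW), delimiter=DELIMITER))

for i, row in enumerate(ROWS):
    print(i, content_features(row))
print("line 3 vs. line 2 type similarity:", contextual_similarity(ROWS[3], ROWS[2]))
print("line 4 sums lines 2-3:", is_sum_of_lines(ROWS, 4, 2))
```

In a full pipeline, feature vectors of this kind would be computed for every line (and later every cell) and passed to a multi-class classifier; with scikit-learn, for example, RandomForestClassifier could play that role, although the sketch above deliberately stops before the training step.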

Resources

Datasets

Here we list the datasets used in our project, together with their annotations. Due to licensing restrictions, only publicly distributable datasets are listed. Each link points to a compressed JSON file that includes both the verbose CSV files and their annotations.

Dataset | # files | # lines | # non-empty cells | Description
SAUS | 223 | 11,598 | 157,767 | The Statistical Abstract of the United States (SAUS) from 2010.
CIUS | 269 | 34,556 | 367,172 | The Crime In the US Census Bureau (CIUS) from 2007 and 2017.
DeEx | 444 | 77,852 | 784,229 | A mixture of files from the ENRON, FUSE, and EUSES datasets.
Troy | 200 | 4,348 | 23,077 | 1000 tables collected from international statistical websites by DocLab graduate students in 2009-2010. This dataset includes 200 sample files from that collection.

Code

The source code is available on GitHub.

Annotation Tool

Coming soon.

Contact

If you have any questions about this project, please do not hesitate to contact Lan Jiang.