Structure Detection in Verbose CSV Files

Large amounts of data are stored in semi-structured files with ad-hoc layouts. Such data are valuable digital assets for many data-driven applications. This work introduces the notion of verbose CSV files: files that contain content serving different purposes at various positions, typically because they were designed for human visual inspection or for collecting statistical reports. An important preliminary task for extracting information from such files is structure detection, in particular classifying lines or cells by their purpose. Because manual annotation is infeasible and error-prone for large files or large collections of files, automatic approaches are desirable. This work addresses both the line and the cell classification problems on verbose CSV files. The following figure shows a typical verbose CSV file and the six cell and line classes.

This work, "Structure Detection in Verbose CSV Files", was published at EDBT'21.

Strudel

To address the line/cell classification problem, we propose the Structure Detection in Verbose CSV Files (Strudel) approach, which is built on a multi-class random forest classifier. The following figure shows the architecture of the approach. Strudel first detects the dialect of a text file and uses that dialect to create a verbose CSV file from it. It then classifies the lines, and afterwards the cells within them, using the proposed feature sets. Cells of different types are distinguished by color. We propose dedicated features to model the individual classes for both classification tasks. The features fall into three groups: 1) content features that parse the values of cells or lines, such as cell length and number of words; 2) contextual features that compare the inspected cell or line with its neighbors, such as the similarity of data types between lines/cells; 3) computational features that seek to connect lines/cells with each other by inspecting arithmetic correlations between them.
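The pipeline above can be illustrated with a minimal sketch. It is not Strudel's implementation: the standard library's csv.Sniffer stands in for the dialect detection step, the function names (detect_dialect, parse_lines, line_features, is_sum_of_lines_above) are hypothetical, the sample file is invented, and only one or two simplified features per group are shown. In the approach described here, feature vectors of this kind would be passed to a multi-class random forest classifier.

```python
import csv
import io

# Invented toy file (not taken from any dataset): a title line, a header
# line, two data lines, and a derived "Total" line.
SAMPLE = (
    "Table 1. Population by state;;\n"
    "State;2009;2010\n"
    "Alabama;4,708;4,779\n"
    "Alaska;698;710\n"
    "Total;5,406;5,489\n"
)


def detect_dialect(text):
    """Guess the delimiter with csv.Sniffer (a stand-in, not Strudel's detector)."""
    return csv.Sniffer().sniff(text, delimiters=",;\t|")


def parse_lines(text):
    """Split the text into lines of cells using the detected dialect."""
    return list(csv.reader(io.StringIO(text), detect_dialect(text)))


def to_number(value):
    """Parse a cell as a float, ignoring thousands separators."""
    return float(value.replace(",", ""))


def is_number(value):
    try:
        to_number(value)
        return True
    except ValueError:
        return False


def cell_type(value):
    """Coarse data type of a cell: 'empty', 'numeric', or 'text'."""
    if not value.strip():
        return "empty"
    return "numeric" if is_number(value) else "text"


def line_features(rows, i):
    """Content features of line i plus one contextual feature."""
    non_empty = [c.strip() for c in rows[i] if c.strip()]
    types = [cell_type(c) for c in rows[i]]
    features = {
        # Content features: computed from the values of the line itself.
        "non_empty_cells": len(non_empty),
        "mean_cell_length": (
            sum(len(c) for c in non_empty) / len(non_empty) if non_empty else 0.0
        ),
        "word_count": sum(len(c.split()) for c in non_empty),
        "numeric_ratio": (
            sum(is_number(c) for c in non_empty) / len(non_empty)
            if non_empty else 0.0
        ),
        # Contextual feature: share of columns whose data type matches the
        # previous line (0.0 for the first line, which has no predecessor).
        "type_similarity_prev": 0.0,
    }
    if i > 0:
        prev = [cell_type(c) for c in rows[i - 1]]
        width = max(len(types), len(prev))
        matches = sum(a == b for a, b in zip(types, prev))
        features["type_similarity_prev"] = matches / width if width else 1.0
    return features


def is_sum_of_lines_above(rows, i, span):
    """Computational feature: does every numeric cell of line i equal the
    column-wise sum of the `span` lines directly above it?"""
    if span <= 0 or span > i:
        return False
    checked = False
    for col, value in enumerate(rows[i]):
        if not is_number(value):
            continue
        addends = [r[col] for r in rows[i - span:i] if col < len(r)]
        if len(addends) != span or not all(is_number(a) for a in addends):
            return False
        if abs(sum(to_number(a) for a in addends) - to_number(value)) > 1e-9:
            return False
        checked = True
    return checked


if __name__ == "__main__":
    rows = parse_lines(SAMPLE)
    for i in range(len(rows)):
        print(rows[i], line_features(rows, i), is_sum_of_lines_above(rows, i, 2))
```

In this toy file the title line is padded with trailing delimiters so that every line has the same number of fields; csv.Sniffer relies on consistent delimiter counts and may fail on ragged input. The header cells "2009" and "2010" parse as numbers, which hints at why content features alone are not enough and contextual and computational features are needed.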

Resources

Datasets

Here we list the datasets and annotations used in our project. Due to licensing restrictions, only publicly distributable datasets are listed. Each link points to a compressed JSON file that includes both the verbose CSV files and their annotations.

Dataset	# files	# lines	# non-empty cells	Description
SAUS	223	11,598	157,767	The Statistical Abstract of the United States (SAUS) from 2010.
CIUS	269	34,556	367,172	The Crime In the US Census Bureau (CIUS) from 2007 and 2017.
DeEx	444	77,852	784,229	A mixture of files from ENRON, FUSE, and EUSES datasets.
Troy	200	4,348	23,077	1000 tables collected from international statistical websites by DocLab graduate students in 2009-2010. This dataset includes a sample of 200 of these files.

Code

The source code is available on GitHub.

Annotation Tool

Coming soon.

Contact

If you have any questions about this project, please do not hesitate to contact Lan Jiang.

