Verbose CSV File Normalization with Cell Classes

Lan Jiang

Information Systems
Hasso Plattner Institute

Office: Campus II, Building F, Room F-2.05
Tel.: +49 331 5509 1349
Email: lan.jiang(at)hpi.de
Links: Homepage

Supervisor: Prof. Dr. Felix Naumann
 

My research interests lie in data preparation for structured data. Data preparation, in general, is the process of obtaining, cleaning, transforming, and storing raw data before serving it to downstream applications. Plain-text files are widely used to store and exchange data. However, utilizing information from plain-text files can be troublesome, as the information is often not organized in a format favored by popular data loading and analysis tools. My research focuses on understanding the structures of verbose CSV files, one type of plain-text file, and on bringing the information within these files into normalized table form.

Introduction

Plain-text files are a popular format for storing data. Such files compile data without application-specific encodings and are therefore well suited for exchanging data among different applications. Plain-text files appear in various types; verbose CSV files are one such type. A CSV file is verbose if it contains not only table elements (headers, data), but also non-table elements (metadata, footnotes, group headers, aggregations). The following figure shows an example of a verbose CSV file.
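In case the figure does not render here, a small constructed example conveys the same idea; the file content and all values below are invented for illustration, with each line's element class noted in a comment:

```python
import csv
import io

# A constructed example of a verbose CSV file; the trailing comment on each
# line names the element class that line would receive.
verbose_csv = """\
Population by region, 2020       # metadata
Source: statistics office        # metadata
Region,Population                # header
Northern regions,                # group
North A,700                      # data
North B,500                      # data
Subtotal,1200                    # derived (700 + 500)
Southern regions,                # group
South A,800                      # data
Total,2000                       # derived (700 + 500 + 800)
Note: figures are rounded.       # note
"""

# Strip the class annotations before parsing the raw file content.
clean = "\n".join(line.split("#")[0].rstrip() for line in verbose_csv.splitlines())
rows = list(csv.reader(io.StringIO(clean)))
print(len(rows))  # → 11
```

A standard CSV parser reads all 11 lines as rows, but only the header and the four data lines form the actual table; everything else must be recognized and handled separately.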

The file contains apparently meaningful structured data (a table) that could be loaded into a database for querying, or integrated with data from other sources, if these data are correctly recognized and extracted from the file.

To make use of data stored in verbose CSV files, two questions remain to be answered: (i) given a verbose CSV file, how can the semantics of its structure be understood? (ii) once the structure is understood, which parts of the information should be used to construct a corresponding table? To answer these two questions, we highlight two studies, respectively: (i) recognizing the types of elements (lines or cells) in a verbose CSV file; (ii) given the types of elements, finding a sequence of operators that constructs a table in normalized form from the meaningful parts of the file.

Structure Detection in Verbose CSV Files

To understand the structures of verbose CSV files, we conduct this study to recognize the types of elements (lines or cells). The various types of lines and cells in the above figure show an expected output of this study. We have defined six different types of elements:

  • metadata: descriptive text above a table;
  • group: section headers of tables; in verbose CSV files, data are often separated into several parts, each led by such a group header;
  • header: the column labels in the top area of a table (or table section);
  • data: the content of a table that cannot be derived from any other elements;
  • derived: elements that aggregate the values of other numeric cells in the same table;
  • note: descriptive text that follows a table.

We propose Strudel (Structure Detection in Verbose CSV Files), which is grounded in a multi-class random forest classifier. We have designed a number of features for both line and cell classification, and feed them to a random forest classifier to obtain a model. Our features fall into three categories: (i) content features that use the values of the elements themselves; (ii) contextual features that use information from neighbouring elements; (iii) computational features that exploit the arithmetic relationships among elements.
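The three feature categories can be sketched as follows. The specific features here (numeric ratio, empty ratio, width match with the previous line, a sum-of-values-above check) are illustrative placeholders chosen for this sketch, not the actual Strudel feature set; in the full approach, such vectors would be fed to an off-the-shelf random forest classifier.

```python
def is_number(s: str) -> bool:
    """Return True if the cell value parses as a number."""
    try:
        float(s)
        return True
    except ValueError:
        return False

def line_features(lines, i):
    """Toy feature vector for line i, one feature per category plus one extra."""
    cells = lines[i].split(",")
    prev = lines[i - 1].split(",") if i > 0 else []
    # (i) content features: use only the line's own values
    numeric_ratio = sum(is_number(c) for c in cells) / len(cells)
    empty_ratio = sum(not c.strip() for c in cells) / len(cells)
    # (ii) contextual features: use information from neighbouring lines
    same_width_as_prev = float(len(cells) == len(prev))
    # (iii) computational features: arithmetic relations among elements, e.g.
    # does this line's numeric value equal the sum of all numbers above it?
    own = [float(c) for c in cells if is_number(c)]
    above = [float(c) for l in lines[:i] for c in l.split(",") if is_number(c)]
    is_sum_of_above = float(bool(own) and abs(sum(above) - own[-1]) < 1e-9)
    return [numeric_ratio, empty_ratio, same_width_as_prev, is_sum_of_above]

lines = ["Region,Population", "North,1200", "South,800", "Total,2000"]
print(line_features(lines, 3))  # → [0.5, 0.0, 1.0, 1.0]
```

The last feature fires on the "Total" line because 2000 = 1200 + 800, hinting at the derived (aggregation) class even though the line is otherwise shaped like a data line.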

We have tested the performance of our approach on five datasets from different domains, and obtained reasonable results. We have also compared our results with baselines and state-of-the-art approaches. Experiments have shown that our approach outperforms the competitors.

We have conducted an error analysis on the results of our approach, summarized a handful of causes of common misclassification cases, and recognized the effectiveness of the computational features neglected by earlier studies. This yields key insights for further structure-understanding research: (i) semantic features may be introduced to help boost performance; (ii) the aggregation-cell detection algorithm may be extended to recognize more aggregation functions.

On Normalizing Verbose CSV Files

Verbose CSV files are organized to ease human readability and understandability, and therefore use many non-table components, such as metadata, aggregations, and visual separators, for this purpose. However, such a form is hard, if not impossible, for a machine to process, as existing tools are designed to handle data with a definitive schema, such as relational tables or hierarchical data.

This study aims at transforming the meaningful information of an arbitrary verbose CSV file into a normalized table form, which we define as fulfilling the requirements of a relational table. To enable the transformations, we have defined a handful of possible operators, such as "merge two adjacent header lines", "fill down group headers", and "remove aggregation lines". We explore a Q-learning-based approach to learn, for a specific verbose CSV file, the best operator sequence to construct a normalized table form.
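A minimal tabular Q-learning sketch of this idea, under strong simplifying assumptions that are mine and not the study's: the file state is abstracted to three boolean flags (metadata present, group headers unfilled, aggregations present), the actions are three of the operators named above, and a hand-written reward grants +1 when an operator clears a flag, +10 when the table is fully normalized, and -1 for an operator that does nothing.

```python
import random

# Hypothetical operators; a state is (has_metadata, has_groups, has_aggregations).
OPERATORS = ["drop_metadata", "fill_down_groups", "remove_aggregations"]
CLEARS = {"drop_metadata": 0, "fill_down_groups": 1, "remove_aggregations": 2}

def step(state, action):
    """Apply one operator to the abstract file state; return (next_state, reward)."""
    idx = CLEARS[action]
    if state[idx]:  # operator is applicable: clear its flag
        nxt = tuple(0 if j == idx else f for j, f in enumerate(state))
        return nxt, 1 + (10 if not any(nxt) else 0)  # bonus when fully normalized
    return state, -1  # operator did nothing: penalty

def train(episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    """Epsilon-greedy tabular Q-learning over operator sequences."""
    random.seed(0)
    Q = {}
    for _ in range(episodes):
        state = (1, 1, 1)  # start: all non-table components still present
        while any(state):
            qs = Q.setdefault(state, {a: 0.0 for a in OPERATORS})
            a = random.choice(OPERATORS) if random.random() < eps else max(qs, key=qs.get)
            nxt, r = step(state, a)
            nxt_best = (max(Q.setdefault(nxt, {b: 0.0 for b in OPERATORS}).values())
                        if any(nxt) else 0.0)
            qs[a] += alpha * (r + gamma * nxt_best - qs[a])  # Q-learning update
            state = nxt
    return Q

Q = train()
best_first = max(Q[(1, 1, 1)], key=Q[(1, 1, 1)].get)
print(best_first)  # the operator the learned policy applies first
```

In this toy setting the three operators are symmetric, so any applicable operator is a reasonable first choice; in the actual problem, the state would have to capture the file's real structure and operator order matters (e.g. group headers must be filled down before the table is flattened).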

Publications

  • Data Preparation for Duplicate Detection. Koumarelas, Ioannis; Jiang, Lan; Naumann, Felix. In Journal of Data and Information Quality (JDIQ) (2020), 12(3), 1-24.
  • A Scoring-based Approach for Data Preparator Suggestion. Jiang, Lan; Vitagliano, Gerardo; Naumann, Felix (2019).
  • Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. Dürsch, Falco; Stebner, Axel; Windheuser, Fabian; Fischer, Maxi; Friedrich, Tim; Strelow, Nils; Bleifuß, Tobias; Harmouch, Hazar; Jiang, Lan; Papenbrock, Thorsten; Naumann, Felix (2019), 219-228.
  • Holistic Primary Key and Foreign Key Detection. Jiang, Lan; Naumann, Felix. In Journal of Intelligent Information Systems (2019).
  • Biterm pseudo document topic model for short text. Jiang, Lan; Lu, Hengyang; Xu, Ming; Wang, Chongjun (2016), 865-872.
  • Multi-label emotion classification for tweets in weibo: Method and application. Yang, Jun; Jiang, Lan; Wang, Chongjun; Xie, Junyuan (2014).