To understand the structure of verbose CSV files, this study recognizes the types of their elements, that is, lines or cells. The line and cell types shown in the above figure illustrate the expected output of this study. We define six element types: metadata is the descriptive text above a table; group elements are section headers within a table, since data in verbose CSV files are often separated into several parts, each led by such a group header; headers are the column labels in the top area of a table (or table section); data are the table content that cannot be derived from any other elements; derived elements aggregate the values of other numeric cells in the same table; notes are descriptive text that follows a table.
We propose Strudel (Structure Detection in Verbose CSV Files), an approach grounded on a multi-class random forest classifier. We designed a number of features for both line and cell classification and feed them to a random forest classifier to obtain a model. Our features fall into three categories: (i) content features that use the values of the elements; (ii) contextual features that use information from neighbouring elements; (iii) computational features that exploit the arithmetic relationships amongst elements.
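To make the three feature categories concrete, the following is a minimal sketch of a line-level feature extractor. It is not the Strudel feature set: the feature names (`empty_ratio`, `numeric_ratio`, `prev_empty`, `next_empty`, `sum_match`), the helper functions, and the sample file are all hypothetical and chosen only for illustration. It assumes the file has already been parsed into rows of cell strings.

```python
import csv
import io


def is_number(text):
    """True if the cell text parses as a float once thousands commas are removed."""
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def cell(rows, i, j):
    """Cell text at (i, j), or an empty string when the row is too short."""
    return rows[i][j] if j < len(rows[i]) else ""


def is_empty_line(rows, i):
    """True if line i has no non-blank cell."""
    return not any(c.strip() for c in rows[i])


def sums_cells_above(rows, i, j):
    """True if cell (i, j) equals the sum of the run of numeric cells directly above it."""
    terms = []
    k = i - 1
    while k >= 0 and is_number(cell(rows, k, j)):
        terms.append(float(cell(rows, k, j).replace(",", "")))
        k -= 1
    value = float(cell(rows, i, j).replace(",", ""))
    return len(terms) >= 2 and abs(sum(terms) - value) < 1e-6


def line_features(rows, i):
    """Hypothetical feature dictionary for line i (illustrative, not Strudel's features)."""
    width = max(1, max(len(r) for r in rows))
    filled = [c for c in rows[i] if c.strip()]
    numeric_cols = [j for j in range(len(rows[i])) if is_number(rows[i][j])]
    return {
        # (i) content features: computed from the line's own values
        "empty_ratio": 1 - len(filled) / width,
        "numeric_ratio": len(numeric_cols) / len(filled) if filled else 0.0,
        # (ii) contextual features: computed from the neighbouring lines
        "prev_empty": i == 0 or is_empty_line(rows, i - 1),
        "next_empty": i == len(rows) - 1 or is_empty_line(rows, i + 1),
        # (iii) computational feature: an arithmetic relationship with other cells
        "sum_match": any(sums_cells_above(rows, i, j) for j in numeric_cols),
    }


# Made-up verbose CSV file: a title line, a blank line, a table, and a note.
SAMPLE = (
    "Population by region 2020\n"
    "\n"
    "Region,Urban,Rural\n"
    "North,10,5\n"
    "South,20,15\n"
    "Total,30,20\n"
    "Source: hypothetical data\n"
)
rows = list(csv.reader(io.StringIO(SAMPLE)))
for i in range(len(rows)):
    print(i, line_features(rows, i))
```

Feature dictionaries of this kind, one per line or cell, could be converted to numeric vectors and passed together with the element labels to any multi-class random forest implementation to train a model.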
We tested our approach on five datasets from different domains and obtained reasonable results. We also compared it with baselines and state-of-the-art approaches; the experiments show that our approach outperforms these competitors.
We conducted an error analysis of the results of our approach and summarized a handful of causes of common misclassification cases. We also recognize the effectiveness of computational features, which former studies neglected, and draw key insights for further research on structure understanding: (i) semantic features may be introduced to help boost the performance; (ii) the aggregation cell detection algorithm may be extended to recognize more aggregation functions.
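As an illustration of insight (ii), the sketch below shows one way a detector could test several aggregation functions rather than a single one. It is a hypothetical example, not the aggregation cell detection algorithm used in Strudel: the `AGGREGATORS` table, the `explain_cell` function, and the restriction to contiguous cells directly above the candidate in one column are all assumptions made for brevity.

```python
import math
import statistics

# Hypothetical table of candidate aggregation functions; adding an entry
# here is all it takes to test for another function.
AGGREGATORS = {
    "sum": sum,
    "mean": statistics.mean,
    "min": min,
    "max": max,
}


def explain_cell(column, index, tolerance=1e-6):
    """Return the sorted names of aggregation functions that reproduce
    column[index] from a contiguous run of at least two cells directly above it."""
    target = column[index]
    matches = set()
    # Grow the window upwards: column[index-2:index], column[index-3:index], ...
    for start in range(index - 2, -1, -1):
        window = column[start:index]
        for name, func in AGGREGATORS.items():
            if math.isclose(func(window), target, abs_tol=tolerance):
                matches.add(name)
    return sorted(matches)


print(explain_cell([10.0, 20.0, 30.0], 2))
print(explain_cell([4.0, 8.0, 6.0], 2))
```

Functions such as `min` and `max` always return one of the input values, so an ordinary data cell that repeats a value above it would also match. A practical extension would therefore need further evidence, for example a keyword in the row label, before treating such a match as a derived cell.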