Datasets
Here we list the datasets and annotations used in our project. Because of licensing restrictions, only publicly distributable datasets are listed. Each link points to a compressed JSON file that contains both the verbose CSV files and their annotations.
The validation dataset comprises files from the Troy and the EUSES datasets, while the unseen dataset comprises files from the SAUS and the CIUS datasets.
Dataset | # Files | # Aggregations | Description |
Validation | 385 | 20,280 | The Statistical Abstract of the United States (SAUS) from 2010. |
Unseen | 81 | 5,854 | 1000 tables collected from international statistical websites by DocLab graduate students in 2009-2010. This dataset includes 200 sample files drawn from them. |
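Since each dataset is distributed as a compressed JSON file, a minimal Python sketch for opening one may be useful. It assumes the file is gzip-compressed, which is not stated here; a different compression format would need a different opener. The function names `load_compressed_json` and `summarize` are illustrative and not part of the project code, and no particular schema is assumed, so the sketch only reports the top-level structure.

```python
import gzip
import json
import sys
from pathlib import Path


def load_compressed_json(path):
    """Read a JSON file and return the decoded object.

    Assumption: the file is gzip-compressed and UTF-8 encoded.
    """
    with gzip.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(obj):
    """Describe the top-level structure without assuming any schema."""
    if isinstance(obj, dict):
        # JSON object keys are always strings, so sorting them is safe.
        return {"type": "dict", "size": len(obj), "sample_keys": sorted(obj)[:5]}
    if isinstance(obj, list):
        return {"type": "list", "size": len(obj)}
    return {"type": type(obj).__name__}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_dataset.py <path-to-downloaded-file>
    print(summarize(load_compressed_json(sys.argv[1])))
```

Inspecting the summary first shows how the CSV files and annotations are keyed before writing any code that depends on that layout.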
Code
The source code is now available on GitHub.