This page lists the datasets used in our project, together with their annotations. Because of licensing restrictions, only publicly distributable datasets are listed. Each link points to a compressed JSON file that contains both the verbose CSV files and their annotations.
Dataset | # files | # lines | # non-empty cells | Description |
--- | ---: | ---: | ---: | --- |
SAUS | 223 | 11,598 | 157,767 | The Statistical Abstract of the United States (SAUS) from 2010. |
CIUS | 269 | 34,556 | 367,172 | The Crime In the US Census Bureau (CIUS) from 2007 and 2017. |
DeEx | 444 | 77,852 | 784,229 | A mixture of files from the ENRON, FUSE, and EUSES datasets. |
Troy | 200 | 4,348 | 23,077 | A sample of 200 files from the 1,000 tables collected from international statistical websites by DocLab graduate students in 2009-2010. |
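
The compressed JSON files described above can be inspected with a few lines of Python. The following is only a minimal sketch: it assumes the archives are gzip-compressed, UTF-8 encoded, and contain a single JSON document, none of which is stated here. The helper names `load_compressed_json` and `summarize` are hypothetical, and no particular annotation schema is assumed; the sketch only reports the top-level shape of whatever it loads.

```python
import gzip
import json
from pathlib import Path


def load_compressed_json(path):
    """Read a gzip-compressed JSON file and return the decoded object."""
    # Assumption: the archive is gzip-compressed and UTF-8 encoded.
    # If the files use another compression format, swap gzip for the
    # matching standard-library module (for example bz2 or lzma).
    with gzip.open(Path(path), mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(obj):
    """Describe the top-level shape without assuming any schema."""
    if isinstance(obj, dict):
        # Show at most ten keys so large files stay readable.
        return {"type": "dict", "size": len(obj), "keys": sorted(map(str, obj))[:10]}
    if isinstance(obj, list):
        return {"type": "list", "size": len(obj)}
    return {"type": type(obj).__name__, "size": 1}
```

For example, `summarize(load_compressed_json("saus.json.gz"))` would print the top-level structure of a downloaded file, where `saus.json.gz` is a placeholder name rather than the actual file name behind the links.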