For our experimental evaluation, we use synthetic and real-world datasets from different domains.
We reuse the following datasets from the Metanome project:
The original datasets can be obtained from the Repeatability page for FD and OD publications.
The original DBLP dataset sources can be obtained from Uni Trier: https://dblp.uni-trier.de/xml/ (Accessed 2021-04-13). We transformed the XML into a single relational CSV table containing publication metadata. Some of the available attributes were removed (see the script on GitHub) before we used the table in our evaluation.
The IMDb titles dataset contains information about movies, TV series, and similar titles. A TSV file can be downloaded from https://www.imdb.com/interfaces/ (Accessed 2021-04-13).
The TPC-H lineitem dataset was generated using TPC-H 2.18.0 with a scale factor of 1. See http://www.tpc.org/tpch/ (Accessed 2021-04-13) for more details.
In our evaluation, we compared the runtime of DISTOD to two competitors: FASTOD-BID and DIST-FASTOD-BID. The implementation of DIST-FASTOD-BID does not support data types other than integers, but our datasets contain strings, dates, and decimals. For this reason, we preprocessed all datasets to run DIST-FASTOD-BID on them: We removed the headers and substituted each value with its hash value so that every value is mapped to an integer representation. This transformation keeps all constant bODs (FDs) intact because equal values receive equal hashes, but it may change order compatible bODs because hash values do not preserve the original value order. We provide the preprocessed datasets for repeatability. The CSV files were used for DISTOD and FASTOD-BID, while the JSON files were used for DIST-FASTOD-BID.
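The substitution step can be illustrated with a minimal sketch. It is not the script used for the evaluation: the function names `hash_value` and `substitute` are hypothetical, and CRC32 is an assumed stand-in for whatever hash function was actually applied. It shows only the two operations described above, dropping the header and replacing each value with an integer hash.

```python
import csv
import io
import zlib
from typing import List


def hash_value(value: str) -> int:
    """Map a string to a deterministic non-negative integer.

    CRC32 is an assumption made for this sketch; the original
    preprocessing may use a different hash function.
    """
    return zlib.crc32(value.encode("utf-8"))


def substitute(csv_text: str, has_header: bool = True) -> List[List[int]]:
    """Drop the header row (if present) and hash every remaining cell."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]
    return [[hash_value(cell) for cell in row] for row in rows]


# Hypothetical example data, not taken from the evaluation datasets.
example = "city,zip\nBerlin,10115\nPotsdam,14467\nBerlin,10115\n"
for row in substitute(example):
    print(row)
```

Equal input values always yield equal integers, so equality-based dependencies carry over (barring hash collisions, which could merge distinct values). The numeric order of the hashes is unrelated to the order of the original values, which is why order compatible bODs may differ after substitution.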
The following table summarizes the evaluation datasets after preprocessing:
| Dataset | Columns | Rows | Original Size (KiB) | Substituted Size (KiB) | Constant bODs (FDs) | Order compatible bODs |
|---|---|---|---|---|---|---|
| FD-reduced-short | 30 | 1,000 | 278 | 597 | | |
| NCVoter-long | 19 | 4,000,000 | 459,334 | 1,322,237 | | |
| Plista | 63 | 1,001 | 575 | 834 | ≥ 32,404 | ≥ 313,296 |