DISTOD (Efficient Distributed Discovery of Bidirectional Order Dependencies)

This is the repeatability page for our journal paper and VLDB poster on efficient and distributed discovery of bidirectional order dependencies [1].

This project is the result of Sebastian Schmidl's master thesis. The thesis explains the DISTOD algorithm with more technical details.

Authors

Sebastian Schmidl, Thorsten Papenbrock

Abstract

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query op-timization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets. [1]

Algorithm Source Code

The source code for our distributed data profiling algorithm DISTOD and the scripts for the experiments described in the paper [1] are hosted on Github.

In our evaluation, we compare the following three algorithms:

Our algorithm DISTOD [1]: https://git.io/distod
FASTOD-BID [2]: https://git.io/fastodbid (Accessed 2020-08-26)
DIST-FASTOD-BID [3]: https://git.io/dist-fastodbid (Accessed 2020-08-26)

Evaluation Datasets

For our experimental evaluation, we use synthetic and real-world datasets from different domains.

We reuse the following datasets from the Metanome project:

Abalone
Adult
Bridges
Chess
FD-reduced
Flight
Hepatitis
Horse
Iris
Letter
NCVoter
Plista

The original datasets can be obtained from the Repeatability page for FD and OD publications.

The original DBLP dataset sources can be obtained from Uni Trier: https://dblp.uni-trier.de/xml/ (Accessed 2021-04-13). We transformed the XML into a single relational CSV table containing publication metadata. Some of the available attributes were removed (see script on Github) before using it in our evaluation.

The IMDb titles dataset contains information about movie, TV-series, and similar titles. A TSV file can be downloaded from https://www.imdb.com/interfaces/ (Accessed 2021-04-13).

The TPC-H lineitem dataset was generated using TPC-H 2.18.0 with a scale factor of 1. See http://www.tpc.org/tpch/ (Accessed 2021-04-13) for more details.

In our evaluation, we compared the runtime of DISTOD to two competitors: FASTOD-BID [2] and DIST-FASTOD-BID [3]. The implementation of DIST-FASTOD-BID does not support data types other than integers, but our datasets contain strings, dates, and decimals. For this reason, we preprocessed all datasets to run DIST-FASTOD-BID on them: We removed the headers and substituted all values with their hash value so that each value is mapped to an integer representation. This transformation keeps all constant bODs (FDs) intact, but may change order compatible bODs. We provide the preprocessed datasets for repeatability in this share. The CSV-files were used for DISTOD and FASTOD-BID, while the json-files were used for DIST-FASTOD-BID.

The following table summarizes the evaluation datasets after preprocessing:

Dataset	Columns	Rows	Original Size (KiB)	Substituted Size (KiB)	Constant bODs (FDs)	Order compatible bODs
Abalone	9	4.177	187	753	137	318
Adult	15	32.561	3.527	9.682	78	1.140
Bridges	13	108	6	28	142	1.097
Chess	7	28.056	519	3.955	1	0
DBLP	13	8.224.309	1.237.373	1.071.536	31	664
FD-reduced-short	30	1.000	278	597
FD-reduced-long	30	250.000	69.580	149.269	89.571	742
Flight-short	30	1.000	135	576	5.976	40.740
Hepatitis	20	155	7	61	8.250	54.766
Horse	29	300	25	168	128.727	2.231.321
IMDb titles	8	7.734.683	628.014	1.129.785	8	26
Iris	5	150	4	15	4	7
Letter	17	20.000	696	6.635	61	2.202
NCVoter-short	19	999.999	112.578	326.519	572	4.362
NCVoter-long	19	4.000.000	459.334	1.322.237
Plista	63	1.001	575	834	≥ 32.404	≥ 313.296
TPC-H lineitem	16	6.001.215	736.193	1.911.435	3.984	13.760

References

Sebastian Schmidl and Thorsten Papenbrock. 2021. Efficient distributed discovery of bidirectional order dependencies. The VLDB Journal. [ doi ]
Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, and Divesh Srivastava. 2018. Effective and complete discovery of bidirectional order dependencies via set-based axioms.The VLDB Journal, 27, 4, 573–591. [ doi ]
Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proceedings of the VLDB Endowment, 12, 11, 1624–1636. [ doi ]