DISTOD (Efficient Distributed Discovery of Bidirectional Order Dependencies)
This is the repeatability page for our journal paper and VLDB poster on efficient and distributed discovery of bidirectional order dependencies [1].
This project is the result of Sebastian Schmidl's master thesis. The thesis explains the DISTOD algorithm with more technical details.
Authors
Sebastian Schmidl, Thorsten Papenbrock
Abstract
Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query op-timization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets. [1]
Algorithm Source Code
The source code for our distributed data profiling algorithm DISTOD and the scripts for the experiments described in the paper [1] are hosted on Github.
In our evaluation, we compare the following three algorithms:
- Our algorithm DISTOD [1]: https://git.io/distod
- FASTOD-BID [2]: https://git.io/fastodbid (Accessed 2020-08-26)
- DIST-FASTOD-BID [3]: https://git.io/dist-fastodbid (Accessed 2020-08-26)
Evaluation Datasets
For our experimental evaluation, we use synthetic and real-world datasets from different domains.
We reuse the following datasets from the Metanome project:
- Abalone
- Adult
- Bridges
- Chess
- FD-reduced
- Flight
- Hepatitis
- Horse
- Iris
- Letter
- NCVoter
- Plista
The original datasets can be obtained from the Repeatability page for FD and OD publications.
The original DBLP dataset sources can be obtained from Uni Trier: https://dblp.uni-trier.de/xml/ (Accessed 2021-04-13). We transformed the XML into a single relational CSV table containing publication metadata. Some of the available attributes were removed (see script on Github) before using it in our evaluation.
The IMDb titles dataset contains information about movie, TV-series, and similar titles. A TSV file can be downloaded from https://www.imdb.com/interfaces/ (Accessed 2021-04-13).
The TPC-H lineitem dataset was generated using TPC-H 2.18.0 with a scale factor of 1. See http://www.tpc.org/tpch/ (Accessed 2021-04-13) for more details.
In our evaluation, we compared the runtime of DISTOD to two competitors: FASTOD-BID [2] and DIST-FASTOD-BID [3]. The implementation of DIST-FASTOD-BID does not support data types other than integers, but our datasets contain strings, dates, and decimals. For this reason, we preprocessed all datasets to run DIST-FASTOD-BID on them: We removed the headers and substituted all values with their hash value so that each value is mapped to an integer representation. This transformation keeps all constant bODs (FDs) intact, but may change order compatible bODs. We provide the preprocessed datasets for repeatability in this share. The CSV-files were used for DISTOD and FASTOD-BID, while the json-files were used for DIST-FASTOD-BID.
The following table summarizes the evaluation datasets after preprocessing:
| Dataset | Columns | Rows | Original Size (KiB) | Substituted Size (KiB) | Constant bODs (FDs) | Order compatible bODs |
|---|---|---|---|---|---|---|
| Abalone | 9 | 4.177 | 187 | 753 | 137 | 318 |
| Adult | 15 | 32.561 | 3.527 | 9.682 | 78 | 1.140 |
| Bridges | 13 | 108 | 6 | 28 | 142 | 1.097 |
| Chess | 7 | 28.056 | 519 | 3.955 | 1 | 0 |
| DBLP | 13 | 8.224.309 | 1.237.373 | 1.071.536 | 31 | 664 |
| FD-reduced-short | 30 | 1.000 | 278 | 597 | ||
| FD-reduced-long | 30 | 250.000 | 69.580 | 149.269 | 89.571 | 742 |
| Flight-short | 30 | 1.000 | 135 | 576 | 5.976 | 40.740 |
| Hepatitis | 20 | 155 | 7 | 61 | 8.250 | 54.766 |
| Horse | 29 | 300 | 25 | 168 | 128.727 | 2.231.321 |
| IMDb titles | 8 | 7.734.683 | 628.014 | 1.129.785 | 8 | 26 |
| Iris | 5 | 150 | 4 | 15 | 4 | 7 |
| Letter | 17 | 20.000 | 696 | 6.635 | 61 | 2.202 |
| NCVoter-short | 19 | 999.999 | 112.578 | 326.519 | 572 | 4.362 |
| NCVoter-long | 19 | 4.000.000 | 459.334 | 1.322.237 | ||
| Plista | 63 | 1.001 | 575 | 834 | ≥ 32.404 | ≥ 313.296 |
| TPC-H lineitem | 16 | 6.001.215 | 736.193 | 1.911.435 | 3.984 | 13.760 |
References
- Sebastian Schmidl and Thorsten Papenbrock. 2021. Efficient distributed discovery of bidirectional order dependencies. The VLDB Journal. [ doi ]
- Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, and Divesh Srivastava. 2018. Effective and complete discovery of bidirectional order dependencies via set-based axioms.The VLDB Journal, 27, 4, 573–591. [ doi ]
- Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proceedings of the VLDB Endowment, 12, 11, 1624–1636. [ doi ]