DISTOD (Efficient Distributed Discovery of Bidirectional Order Dependencies)

This is the repeatability page for our journal paper and VLDB poster on efficient and distributed discovery of bidirectional order dependencies [1].

This project is the result of Sebastian Schmidl's master thesis. The thesis explains the DISTOD algorithm with more technical details.

Authors

Sebastian Schmidl, Thorsten Papenbrock

Abstract

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query op-timization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets. [1]

Algorithm Source Code

The source code for our distributed data profiling algorithm DISTOD and the scripts for the experiments described in the paper [1] are hosted on Github.

In our evaluation, we compare the following three algorithms:

Evaluation Datasets

For our experimental evaluation, we use synthetic and real-world datasets from different domains.

We reuse the following datasets from the Metanome project:

  • Abalone
  • Adult
  • Bridges
  • Chess
  • FD-reduced
  • Flight
  • Hepatitis
  • Horse
  • Iris
  • Letter
  • NCVoter
  • Plista

The original datasets can be obtained from the Repeatability page for FD and OD publications.

The original DBLP dataset sources can be obtained from Uni Trier: https://dblp.uni-trier.de/xml/ (Accessed 2021-04-13). We transformed the XML into a single relational CSV table containing publication metadata. Some of the available attributes were removed (see script on Github) before using it in our evaluation.

The IMDb titles dataset contains information about movie, TV-series, and similar titles. A TSV file can be downloaded from https://www.imdb.com/interfaces/ (Accessed 2021-04-13).

The TPC-H lineitem dataset was generated using TPC-H 2.18.0 with a scale factor of 1. See http://www.tpc.org/tpch/ (Accessed 2021-04-13) for more details.

In our evaluation, we compared the runtime of DISTOD to two competitors: FASTOD-BID [2] and DIST-FASTOD-BID [3]. The implementation of DIST-FASTOD-BID does not support data types other than integers, but our datasets contain strings, dates, and decimals. For this reason, we preprocessed all datasets to run DIST-FASTOD-BID on them: We removed the headers and substituted all values with their hash value so that each value is mapped to an integer representation. This transformation keeps all constant bODs (FDs) intact, but may change order compatible bODs. We provide the preprocessed datasets for repeatability in this share. The CSV-files were used for DISTOD and FASTOD-BID, while the json-files were used for DIST-FASTOD-BID.

The following table summarizes the evaluation datasets after preprocessing:

Dataset Columns Rows Original Size (KiB) Substituted Size (KiB) Constant bODs (FDs) Order compatible bODs
Abalone 9 4.177 187 753 137 318
Adult 15 32.561 3.527 9.682 78 1.140
Bridges 13 108 6 28 142 1.097
Chess 7 28.056 519 3.955 1 0
DBLP 13 8.224.309 1.237.373 1.071.536 31 664
FD-reduced-short 30 1.000 278 597    
FD-reduced-long 30 250.000 69.580 149.269 89.571 742
Flight-short 30 1.000 135 576 5.976 40.740
Hepatitis 20 155 7 61 8.250 54.766
Horse 29 300 25 168 128.727 2.231.321
IMDb titles 8 7.734.683 628.014 1.129.785 8 26
Iris 5 150 4 15 4 7
Letter 17 20.000 696 6.635 61 2.202
NCVoter-short 19 999.999 112.578 326.519 572 4.362
NCVoter-long 19 4.000.000 459.334 1.322.237    
Plista 63 1.001 575 834 ≥ 32.404 ≥ 313.296
TPC-H lineitem 16 6.001.215 736.193 1.911.435 3.984 13.760

 

References

  1. Sebastian Schmidl and Thorsten Papenbrock. 2021. Efficient distributed discovery of bidirectional order dependencies. The VLDB Journal. [ doi ]
  2. Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, and Divesh Srivastava. 2018. Effective and complete discovery of bidirectional order dependencies via set-based axioms.The VLDB Journal, 27, 4, 573–591. [ doi ]
  3. Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proceedings of the VLDB Endowment, 12, 11, 1624–1636. [ doi ]