Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

DISTOD (Efficient Distributed Discovery of Bidirectional Order Dependencies)

This is the repeatability page for our journal paper and VLDB poster on efficient and distributed discovery of bidirectional order dependencies [1].

This project is the result of Sebastian Schmidl's master thesis. The thesis explains the DISTOD algorithm with more technical details.

Abstract

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending order. The knowledge about order relationships is useful for many data management tasks, such as query op-timization, data cleaning, or consistency checking. Because the bODs of a specific dataset are usually not explicitly given, they need to be discovered. The discovery of all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes, though, which is why existing bOD discovery algorithms cannot process datasets of practically relevant size in a reasonable time. In this paper, we propose the distributed bOD discovery algorithm DISTOD, whose execution time scales with the available hardware. DISTOD is a scalable, robust, and elastic bOD discovery approach that combines efficient pruning techniques for bOD candidates in set-based canonical form with a novel, reactive, and distributed search strategy. Our evaluation on various datasets shows that DISTOD outperforms both single-threaded and distributed state-of-the-art bOD discovery algorithms by up to orders of magnitude; it can, in particular, process much larger datasets. [1]

Algorithm Source Code

The source code for our distributed data profiling algorithm DISTOD and the scripts for the experiments described in the paper [1] are hosted on Github.

In our evaluation, we compare the following three algorithms:

Evaluation Datasets

For our experimental evaluation, we use synthetic and real-world datasets from different domains.

We reuse the following datasets from the Metanome project:

  • Abalone
  • Adult
  • Bridges
  • Chess
  • FD-reduced
  • Flight
  • Hepatitis
  • Horse
  • Iris
  • Letter
  • NCVoter
  • Plista

The original datasets can be obtained from the Repeatability page for FD and OD publications.

The original DBLP dataset sources can be obtained from Uni Trier: https://dblp.uni-trier.de/xml/ (Accessed 2021-04-13). We transformed the XML into a single relational CSV table containing publication metadata. Some of the available attributes were removed (see script on Github) before using it in our evaluation.

The IMDb titles dataset contains information about movie, TV-series, and similar titles. A TSV file can be downloaded from https://www.imdb.com/interfaces/ (Accessed 2021-04-13).

The TPC-H lineitem dataset was generated using TPC-H 2.18.0 with a scale factor of 1. See http://www.tpc.org/tpch/ (Accessed 2021-04-13) for more details.

In our evaluation, we compared the runtime of DISTOD to two competitors: FASTOD-BID [2] and DIST-FASTOD-BID [3]. The implementation of DIST-FASTOD-BID does not support data types other than integers, but our datasets contain strings, dates, and decimals. For this reason, we preprocessed all datasets to run DIST-FASTOD-BID on them: We removed the headers and substituted all values with their hash value so that each value is mapped to an integer representation. This transformation keeps all constant bODs (FDs) intact, but may change order compatible bODs. We provide the preprocessed datasets for repeatability in this share. The CSV-files were used for DISTOD and FASTOD-BID, while the json-files were used for DIST-FASTOD-BID.

The following table summarizes the evaluation datasets after preprocessing:

DatasetColumnsRowsOriginal Size (KiB)Substituted Size (KiB)Constant bODs (FDs)Order compatible bODs
Abalone94.177187753137318
Adult1532.5613.5279.682781.140
Bridges131086281421.097
Chess728.0565193.95510
DBLP138.224.3091.237.3731.071.53631664
FD-reduced-short301.000278597  
FD-reduced-long30250.00069.580149.26989.571742
Flight-short301.0001355765.97640.740
Hepatitis201557618.25054.766
Horse2930025168128.7272.231.321
IMDb titles87.734.683628.0141.129.785826
Iris515041547
Letter1720.0006966.635612.202
NCVoter-short19999.999112.578326.5195724.362
NCVoter-long194.000.000459.3341.322.237  
Plista631.001575834≥ 32.404≥ 313.296
TPC-H lineitem166.001.215736.1931.911.4353.98413.760

 

References

  1. Sebastian Schmidl and Thorsten Papenbrock. 2021. Efficient distributed discovery of bidirectional order dependencies. The VLDB Journal. [ doi ]
  2. Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, and Divesh Srivastava. 2018. Effective and complete discovery of bidirectional order dependencies via set-based axioms.The VLDB Journal, 27, 4, 573–591. [ doi ]
  3. Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proceedings of the VLDB Endowment, 12, 11, 1624–1636. [ doi ]