Optimized Theta-Join Processing through Candidate Pruning and Workload Distribution

This is the repeatability page for our BTW 2021 conference paper on efficient theta-join processing within our actor database prototype A²DB.

Content

Authors
Abstract
Algorithm Source Code
Evaluation Data

Authors

Julian Weise, Sebastian Schmidl, Thorsten Papenbrock

Abstract

The Theta-Join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, bloom-filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <,≤,≠,≥,>.

In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A²DB. Our evaluation on various real-world and synthetic datasets shows that A²DB significantly outperforms existing single-machine DBMSs including PostgreSQL and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries. [1]

Algorithm Source Code

The source code for A²DB can be found on Github.

Evaluation Data

For our experiments, we use synthetic and real-world datasets, which are differently sized subsets of four base datasets listed in the table below.

We link to the used SQL queries for each dataset in the column Queries.

Dataset	# Rows	# Columns	Size on disk	Queries
TPC-H (2020-08-08)	6 001 215	25	1 639 MB	Link (2020-11-30)
DataSF (2020-08-08)	968 373	22	197 MB	Link (2020-11-30)
Flight (2020-08-08)	7 268 232	15	701 MB	Link (2020-11-30)
Cloud (2020-08-08)	384 584 555	28	521 MB	Link (2020-11-30)

Publication

Data dependencies for query optimization: a survey. Kossmann, Jan; Papenbrock, Thorsten; Naumann, Felix in VLDB Journal (2021).

[ BibTeX ]

Optimized Theta-Join Processing through Candidate Pruning and Workload Distribution

Content

Authors

Abstract

Algorithm Source Code

Evaluation Data

Publication

Chair

News

29.04.2024 | Paper accepted at ITISE 2024

03.04.2024 | Congratulations to the EDBT Best Paper Award!

05.03.2024 | Another Paper marked as reproducible by pVLDB Reproducibility Committee

21.01.2024 | Paper accepted at W-NUT 2024

19.12.2023 | Congratulations Dr. Gerardo Vitagliano!

Project highlights

People and open positions