Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
  
 

Optimized Theta-Join Processing through Candidate Pruning and Workload Distribution

This is the repeatability page for our BTW 2021 conference paper on efficient theta-join processing within our actor database prototype A2DB.

Content

  1. Authors
  2. Abstract
  3. Algorithm Source Code
  4. Evaluation Data

Authors

Abstract

The Theta-Join is a powerful operation to connect tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is an expensive operation, because basically all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter for non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, bloom-filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <,≤,≠,≥,>.

In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is baked into our distributed in-memory database system prototype A2DB. Our evaluation on various real-world and synthetic datasets shows that A2DB significantly outperforms existing single-machine DBMSs including PostgreSQL and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries. [more (Link to publication is tbd)]

Algorithm Source Code

The source code for A2DB can be found on Github.

Evaluation Data

For our experiments, we use synthetic and real-world datasets, which are differently sized subsets of four base datasets listed in the table below.

We link to the used SQL queries for each dataset in the column Queries.

Dataset# Rows# ColumnsSize on diskQueries
TPC-H (2020-08-08)6 001 215251 639 MBLink (2020-11-30)
DataSF (2020-08-08)968 37322197 MBLink (2020-11-30)
Flight (2020-08-08)7 268 23215701 MBLink (2020-11-30)
Cloud (2020-08-08)384 584 55528521 MBLink (2020-11-30)