Optimized Theta-Join Processing through Candidate Pruning and Workload Distribution

This is the repeatability page for our BTW 2021 conference paper on efficient theta-join processing within our actor database prototype A²DB.

Content

Authors
Abstract
Algorithm Source Code
Evaluation Data

Authors

Julian Weise, Sebastian Schmidl, Thorsten Papenbrock

Abstract

The theta-join is a powerful operation that connects tuples of different relational tables based on arbitrary conditions. The operation is a fundamental requirement for many data-driven use cases, such as data cleaning, consistency checking, and hypothesis testing. However, processing theta-joins without equality predicates is expensive, because virtually all database management systems (DBMSs) translate theta-joins into a Cartesian product with a post-filter that removes non-matching tuple pairs. This seems to be necessary, because most join optimization techniques, such as indexing, hashing, Bloom filters, or sorting, do not work for theta-joins with combinations of inequality predicates based on <, ≤, ≠, ≥, and >.
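To make the cost of the baseline concrete, the following is a minimal Python sketch of a theta-join evaluated as a Cartesian product with a post-filter. The relations `orders` and `shipments`, their (id, start, end) layout, and the interval-overlap condition are hypothetical and chosen only for illustration; they are not taken from the paper or its datasets.

```python
from itertools import product

def naive_theta_join(left, right, predicate):
    """Enumerate the full Cartesian product and keep only matching pairs."""
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]

# Hypothetical toy relations with the layout (id, start, end).
orders = [(1, 10, 20), (2, 15, 30), (3, 40, 50)]
shipments = [(7, 5, 12), (8, 25, 45), (9, 60, 70)]

# Join condition built only from inequalities (interval overlap):
# l.start < r.end AND l.end > r.start
def overlaps(l, r):
    return l[1] < r[2] and l[2] > r[1]

result = naive_theta_join(orders, shipments, overlaps)
matched_ids = [(l[0], r[0]) for l, r in result]
print(matched_ids)
```

All nine candidate pairs are inspected here although only three of them satisfy the condition, so the work grows with the product of the two table sizes regardless of how selective the join is.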

In this paper, we therefore study and evaluate optimization approaches for the efficient execution of theta-joins. More specifically, we propose a theta-join algorithm that exploits the high selectivity of theta-joins to prune most join candidates early; the algorithm also parallelizes and distributes the processing (over CPU cores and compute nodes, respectively) for scalable query processing. The algorithm is integrated into our distributed in-memory database system prototype A²DB. Our evaluation on various real-world and synthetic datasets shows that A²DB significantly outperforms existing single-machine DBMSs, including PostgreSQL, and distributed data processing systems, such as Apache SparkSQL, in processing highly selective theta-join queries. [1]
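The general idea of pruning candidates early and splitting the work can be illustrated with a small Python sketch. This is not the A²DB algorithm, whose details are given in the paper and the source code; it is a generic illustration under stated assumptions: the join condition is `l[0] < r[0]` plus a residual predicate, one side is sorted on the compared attribute so that a binary search skips tuples that cannot match, and the other side is split into chunks that are processed by a thread pool. All function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def join_chunk(left_chunk, right_sorted, right_keys, residual):
    """Join one chunk of the left side for: l[0] < r[0] AND residual(l, r)."""
    out = []
    for l in left_chunk:
        # Binary search: every right tuple before `start` has key <= l[0]
        # and is pruned without being inspected.
        start = bisect_right(right_keys, l[0])
        for r in right_sorted[start:]:
            if residual(l, r):  # remaining predicates, checked on survivors only
                out.append((l, r))
    return out

def pruned_theta_join(left, right, residual, workers=2):
    right_sorted = sorted(right, key=lambda r: r[0])
    right_keys = [r[0] for r in right_sorted]
    size = max(1, -(-len(left) // workers))  # ceiling division
    chunks = [left[i:i + size] for i in range(0, len(left), size)]
    # Threads only illustrate the partitioning; in CPython they do not give
    # CPU parallelism, and a real system would use processes or nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(
            lambda c: join_chunk(c, right_sorted, right_keys, residual), chunks
        )
        return [pair for part in parts for pair in part]

# Hypothetical toy relations with the layout (number, label).
left = [(5, "a"), (20, "b"), (35, "c")]
right = [(10, "x"), (30, "y"), (30, "b"), (1, "z")]

# Condition: l[0] < r[0] AND l[1] != r[1]
pairs = pruned_theta_join(left, right, lambda l, r: l[1] != r[1])
print(len(pairs))
```

The more selective the sorted predicate is, the fewer candidates reach the residual check, which is the effect that a pruning-based approach relies on.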

Algorithm Source Code

The source code for A²DB can be found on GitHub.

Evaluation Data

For our experiments, we use synthetic and real-world datasets, which are differently sized subsets of four base datasets listed in the table below.

The Queries column links to the SQL queries used for each dataset.

Dataset	# Rows	# Columns	Size on disk	Queries
TPC-H (2020-08-08)	6 001 215	25	1 639 MB	Link (2020-11-30)
DataSF (2020-08-08)	968 373	22	197 MB	Link (2020-11-30)
Flight (2020-08-08)	7 268 232	15	701 MB	Link (2020-11-30)
Cloud (2020-08-08)	384 584 555	28	521 MB	Link (2020-11-30)

Publication

Data dependencies for query optimization: a survey. Kossmann, Jan; Papenbrock, Thorsten; Naumann, Felix in VLDB Journal (2021).

[ BibTeX ]

Chair

Prof. Dr. Felix Naumann

Information Systems

E-Mail: felix.naumann(at)hpi.de

Assistant: Diana Stephan

Office: Campus II, House F, F-2.01
Tel.: +49 (0)331 5509-280
E-Mail: office-naumann(at)hpi.de

To visit us, please see these directions.
