Open theses

The information systems group is always looking for good master's students to advise on their master's theses. If you are interested in any of our research topics, please contact Felix Naumann or any of the researchers in our team directly to arrange a meeting. There, we can discuss any of the topics listed below or find new topics, or you can suggest a topic of your own. Please note that the list below is only a small sample of possible thesis topics and ideas.

For more information about writing a master's thesis in our group, please see here.

Univariate Anomaly Detection in Time Series

Efficient, Distributed, and Holistic Discovery of Data Dependencies

Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values, the columns' data types, or the most frequent patterns within a column. Metadata that are more difficult to compute involve multiple columns, such as correlations or data dependencies. Research in data dependency discovery has focused on developing efficient algorithms for individual dependency types, such as unique column combinations (UCCs), functional dependencies (FDs), order dependencies (ODs), or inclusion dependencies (INDs). However, many downstream tasks, such as data exploration and query optimization, need information about several types of dependencies at the same time, which requires executing multiple discovery algorithms on a given input dataset.

The goal of this master's thesis is to develop a holistic algorithm that discovers UCCs, FDs, ODs, and INDs simultaneously. A holistic approach for these four types can optimize execution orders, share intermediate results for additional search space pruning, and re-use temporary data structures. This work can build on the entire research history of our chair in data profiling, including individual discovery algorithms for all mentioned data dependencies and corresponding datasets. [1] already shows that a holistic approach is feasible, but this work should extend the idea to more dependency types and allow the processing of larger datasets.
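To illustrate what sharing intermediate results can mean, the following minimal sketch groups rows by a column combination once and then uses the same grouping to check both a UCC candidate and all FD candidates with that left-hand side. It is a naive illustration and not the algorithm from [1]; the toy relation, the function names `partition` and `profile`, and the `max_lhs_size` parameter are assumptions made for this example. The sketch reports all valid dependencies up to the given size, not only the minimal ones.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy relation; column names and values are made up for illustration.
ROWS = [
    {"id": 1, "zip": "14482", "city": "Potsdam", "name": "Ada"},
    {"id": 2, "zip": "14482", "city": "Potsdam", "name": "Bob"},
    {"id": 3, "zip": "10115", "city": "Berlin", "name": "Ada"},
    {"id": 4, "zip": "10117", "city": "Berlin", "name": "Cem"},
]
COLUMNS = ["id", "zip", "city", "name"]


def partition(rows, columns):
    """Group row indices by their values in the given columns."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[tuple(row[c] for c in columns)].append(index)
    return list(groups.values())


def profile(rows, columns, max_lhs_size=2):
    """Return (uccs, fds) for all column combinations up to max_lhs_size.

    Each partition is computed once and reused for the UCC check and
    for every FD check with the same left-hand side.
    """
    uccs, fds = [], []
    for size in range(1, max_lhs_size + 1):
        for lhs in combinations(columns, size):
            groups = partition(rows, lhs)  # shared intermediate result
            # UCC check: no two rows agree on all columns of lhs.
            if all(len(group) == 1 for group in groups):
                uccs.append(lhs)
            # FD check: rows that agree on lhs also agree on rhs.
            for rhs in columns:
                if rhs in lhs:
                    continue
                if all(len({rows[i][rhs] for i in group}) == 1 for group in groups):
                    fds.append((lhs, rhs))
    return uccs, fds


uccs, fds = profile(ROWS, COLUMNS)
```

In this toy relation, `zip` determines `city`, but `city` does not determine `zip`, because Berlin appears with two zip codes. A real holistic algorithm would additionally prune the search space, for example by skipping supersets of known UCCs.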

Since the discovery of UCCs, FDs, ODs, and INDs entails an exponential search space, which renders the space and time complexity of discovery algorithms exponential as well, this work must consider the scalability of the algorithm to be able to process larger datasets. In comparison to [1], we will therefore focus on the parallelization, caching, and distribution aspects of the algorithm.

[1]: Ehrlich, Roick, Schulze, Zwiener, Papenbrock, Naumann. Holistic Data Profiling: Simultaneous Discovery of Various Metadata. EDBT. 2016. https://openproceedings.org/2016/conf/edbt/paper-20.pdf

For more information please contact Sebastian Schmidl or Youri Kaminsky.

Distributed Discovery of Denial Constraints

Denial constraints (DCs) are the de facto language to specify integrity constraints [1], which have various uses, such as database design, data integration, query optimization, or data cleaning. DCs generalize unique column combinations, functional dependencies, and order dependencies. Each DC defines a set of predicates that cannot all hold true simultaneously. This expressive power, however, comes at the cost of a very large search space, so discovering DCs is computationally expensive. Researchers have developed efficient algorithms to discover DCs, such as FastDC [2], BFastDC [3], HYDRA [4], or DCFinder [1]. Their application is limited to rather small datasets, though, because executing those algorithms on datasets with around 1 million rows and 20 attributes already takes hours [1]. Research for other data dependencies has shown that dynamic parallelization and distribution techniques can decrease the algorithm runtimes enough to make the processing of larger datasets possible [5].
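As a small illustration of what a DC expresses, the sketch below validates one DC by naively comparing all ordered tuple pairs. It is not one of the discovery algorithms cited above; the toy data, the constraint, and the function name `violations` are assumptions made for this example. The DC states that no two tuples may have the same state while the first has a higher salary and a lower tax than the second. For simplicity, each predicate compares the same column of both tuples, although DCs in general also allow predicates across different columns.

```python
import operator
from itertools import permutations

# Hypothetical toy relation; values are made up for illustration.
ROWS = [
    {"state": "NY", "salary": 50000, "tax": 5000},
    {"state": "NY", "salary": 60000, "tax": 6500},
    {"state": "NY", "salary": 70000, "tax": 6000},
    {"state": "CA", "salary": 70000, "tax": 4000},
]

# not (t.state = t'.state and t.salary > t'.salary and t.tax < t'.tax)
DC = [
    ("state", operator.eq),
    ("salary", operator.gt),
    ("tax", operator.lt),
]


def violations(rows, dc):
    """Return all ordered index pairs (i, j) for which every predicate holds."""
    return [
        (i, j)
        for i, j in permutations(range(len(rows)), 2)
        if all(op(rows[i][column], rows[j][column]) for column, op in dc)
    ]


print(violations(ROWS, DC))  # [(2, 1)]
```

The third row violates the DC together with the second row: both are in NY, and the third row has the higher salary but the lower tax. Even this validation of a single, given DC inspects a number of tuple pairs that is quadratic in the number of rows, which hints at why discovering all valid DCs is expensive.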

The goal of this master's thesis is to develop a distributed algorithm to efficiently discover denial constraints in large datasets.

[1] Eduardo H. M. Pena, Eduardo C. de Almeida, and Felix Naumann. Discovery of Approximate (and Exact) Denial Constraints. PVLDB, 13(3) (2019). DOI:10.14778/3368289.3368293
[2] Xu Chu, Ihab F. Ilyas, and Paolo Papotti. Discovering denial constraints. PVLDB, 6(13) (2013). DOI:10.14778/2536258.2536262
[3] Hai Liu, Dongqing Xiao, Pankaj Didwania, and Mohamed Y. Eltabakh. Exploiting soft and hard correlations in big data query optimization. PVLDB, 9(12) (2016). DOI:10.14778/2994509.2994519
[4] Tobias Bleifuß, Sebastian Kruse, and Felix Naumann. Efficient denial constraint discovery with Hydra. PVLDB, 11(3) (2017). DOI:10.14778/3157794.3157800
[5] Sebastian Schmidl and Thorsten Papenbrock. Efficient Distributed Discovery of Bidirectional Order Dependencies. The VLDB Journal (2022). DOI:10.1007/s00778-021-00683-4

For more information, please contact Sebastian Schmidl or Youri Kaminsky. Supervision will be in cooperation with Eduardo Pena from Universidade Tecnológica Federal do Paraná (UTFPR) in Toledo, Brazil.

Combined Filtering for Speeding up Joins

Database systems aim to execute workloads on a given dataset as efficiently as possible. Especially for analytical workloads, where complex queries access many tuples in multiple tables, choosing a decent execution plan is challenging [1]. Joining tables is costly [2], and various query optimization techniques have been proposed to make this operation efficient. One of these optimization techniques is the additional execution of a semi-join before an inner join (semi-join reduction) [3]. The semi-join can be executed more efficiently and might decrease the number of input tuples for the inner join, leading to better performance. After applying the semi-join, only tuples that are guaranteed to have a join partner are fed to the inner join.
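The following minimal sketch illustrates semi-join reduction on in-memory lists of dictionaries. It is only meant to show the idea and does not reflect how Hyrise or any other system implements its operators; the table contents and the function names `semi_join` and `hash_join` are assumptions made for this example.

```python
# Hypothetical toy tables; contents are made up for illustration.
ORDERS = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 12},
    {"order_id": 4, "customer_id": 10},
]
# Customers that remain after some selective predicate was applied.
CUSTOMERS = [{"customer_id": 10, "segment": "A"}]


def semi_join(left, right, key):
    """Keep only the left rows whose key value occurs in the right input."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def hash_join(left, right, key):
    """Inner join: build a hash table on the right input, probe with the left."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


reduced_orders = semi_join(ORDERS, CUSTOMERS, "customer_id")
result = hash_join(reduced_orders, CUSTOMERS, "customer_id")
```

Here, the semi-join passes only two of the four orders to the inner join, and the join result is the same as without the reduction. Whether the reduction pays off depends on how many tuples it removes compared to the cost of the extra pass.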

However, the additional semi-join adds execution overhead itself. Thus, different techniques have been proposed that replace the semi-join reduction and aim for a better balance between overhead and benefit: (i) Bit vector filtering using Bloom filters [4] is a probabilistic approach that filters incoming tuples by hashing the join key and combining the hash values into a bitmask. Its usefulness heavily depends on the chosen bitmask size and hash functions. (ii) Data-induced predicates (diPs) [5] filter one join input by the minimal and maximal join key present in the other input. This technique is well-suited if one join input is filtered by a predicate correlating with the join key. Deciding whether and when to use one of the techniques is intricate [5, 6, 7].

Combining both techniques seems promising to increase the performance of analytical queries. In the proposed master's thesis, we want to evaluate the application of joint bit vector filtering and diPs. The prototype will be implemented using the Hyrise in-memory research database management system.
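One possible way to picture such a combined filter is the following minimal sketch, which first applies a cheap min/max range check and then probes a small Bloom filter. It is an illustration under stated assumptions and not the Hyrise implementation: the class name `CombinedFilter`, the default bitmask size, the number of hash functions, and the use of seeded BLAKE2b hashes are all choices made for this example.

```python
import hashlib


class CombinedFilter:
    """Min/max range check plus Bloom filter over the build side's join keys."""

    def __init__(self, build_keys, num_bits=64, num_hashes=2):
        keys = list(build_keys)
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # Min/max of the build side; None if the build side is empty.
        self.min_key = min(keys) if keys else None
        self.max_key = max(keys) if keys else None
        # The bitmask is stored in a single Python integer.
        self.bits = 0
        for key in keys:
            for position in self._positions(key):
                self.bits |= 1 << position

    def _positions(self, key):
        """Derive num_hashes bit positions from seeded hashes of the key."""
        for seed in range(self.num_hashes):
            data = f"{seed}:{key}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def might_contain(self, key):
        """False means the key has no join partner; True means it may have one."""
        if self.min_key is None:
            return False
        if key < self.min_key or key > self.max_key:
            return False  # rejected by the min/max check without hashing
        return all((self.bits >> p) & 1 for p in self._positions(key))


# Usage: filter the probe side before it reaches the join.
build_keys = [100, 105, 250]
probe_keys = [7, 100, 180, 250, 999]
combined = CombinedFilter(build_keys)
candidates = [key for key in probe_keys if combined.might_contain(key)]
```

In this usage example, the keys 7 and 999 are rejected by the range check alone, while 100 and 250 always pass because Bloom filters produce no false negatives. The key 180 lies inside the range, so whether it passes depends on the bitmask size and hash functions, which is exactly the tuning question raised above.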

Research opportunities
We want to answer the following research questions:

How can we construct beneficial Bloom filters efficiently? What is the influence of Bloom filter size, different hash functions, etc.?
When should the combined filter be applied in query optimization? The challenge is adding the filter to both join inputs, which has implications for operator scheduling due to mutual filtering and potential dynamic pruning using the min/max join key (the latter is already implemented).

Contribution
Multiple adaptations to the existing codebase are required to answer the research questions. For instance, you have to

Develop a new table scan implementation using both bit vector probing and min/max filtering.
Efficiently construct and evaluate valuable Bloom filters, measuring the impact of Bloom filter size, hash functions, etc.
Represent the combined filter in the logical query plan.
Derive an optimizer rule to find well-suited heuristics when to apply the filter.

Prerequisites

In-depth understanding of database systems, e.g., completed DBS II or DYOD courses.
Solid C++ knowledge.

For more information do not hesitate to contact Daniel Lindner.

[1] Yannis E. Ioannidis. Query Optimization. CSUR. 28(1) (1996). DOI: 10.1145/234313.234367
[2] Markus Dreseler, Martin Boissier, Tilmann Rabl, Matthias Uflacker. Quantifying TPC-H Choke Points and Their Optimizations. PVLDB 13(8) (2020). DOI: 10.14778/3389133.3389138
[3] Philip A. Bernstein, Dah-Ming W. Chiu. Using Semi-Joins to Solve Relational Queries. JACM 28(1) (1981). DOI: 10.1145/322234.322238
[4] James K. Mullin. Optimal Semijoins for Distributed Database Systems. TSE 16(5) (1990). DOI: 10.1109/32.52778
[5] Laurel J. Orr, Srikanth Kandula, Surajit Chaudhuri. Pushing Data-Induced Predicates Through Joins in Big-Data Clusters. PVLDB 13(3) (2019). DOI: 10.14778/3368289.3368292
[6] Goetz Graefe, Diane L. Davison. Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution. TSE 19(8) (1993). DOI: 10.1109/32.238579
[7] Bailu Ding, Surajit Chaudhuri, Vivek R. Narasayya. Bitvector-aware Query Optimization for Decision Support Queries. SIGMOD (2020). DOI: 10.1145/3318464.3389769
