State-of-the-art database management systems (DBMSs) efficiently process large amounts of data by storing and processing them purely in memory, applying sophisticated query optimization techniques, using massive parallelization, or combining these approaches. The queried datasets are often large and contain data dependencies. An example of such a data dependency is the unique column combination (UCC) on students’ matriculation numbers: by definition, the column contains no duplicates. While research proposes numerous dependency-based optimization techniques, DBMSs actually apply only a few of them.
This project will provide deeper insight into database internals and dependency-based query optimization. Pruning, i.e., excluding irrelevant data from processing, saves I/O and computation. Especially in cloud environments, transferring data is time-consuming and costly. Dependency-based optimizations can, e.g., rewrite joins to linear scans on the result of a subquery. However, the subquery’s result is not known beforehand, so dynamic pruning during query execution is required. Together with data-induced predicates (diPs) that pre-filter joined tables, dynamic pruning accelerates query execution and speeds up dependency-based optimizations. Thus, we will investigate in which cases and workloads diPs are valuable and where they have drawbacks, for example, because they add dependencies to the execution order of operators.
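To make the interplay of join rewriting and dynamic pruning concrete, the following is a minimal sketch in Python. It is not Hyrise code, and all names (`Chunk`, `lookup_key`, `scan_with_dynamic_pruning`, `join_as_scan`) as well as the chunk layout with per-chunk minimum and maximum keys are assumptions made for illustration. The sketch assumes a UCC on the dimension table's `label` column, so the subquery yields at most one join key; that key is only known at execution time and is then used to skip chunks of the larger table.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A horizontal partition of (key, payload) rows; assumed to be non-empty."""
    rows: list

    def __post_init__(self):
        # Per-chunk min/max statistics on the key column (illustrative layout).
        keys = [key for key, _ in self.rows]
        self.min_key = min(keys)
        self.max_key = max(keys)


def lookup_key(dim_rows, label):
    """Subquery: find the join key for a label. The assumed UCC on `label`
    guarantees at most one match."""
    matches = [key for key, row_label in dim_rows if row_label == label]
    assert len(matches) <= 1, "UCC on label is violated"
    return matches[0] if matches else None


def scan_with_dynamic_pruning(chunks, key):
    """Scan for rows with the given key, skipping every chunk whose
    [min_key, max_key] range cannot contain it."""
    result, scanned_chunks = [], 0
    for chunk in chunks:
        if key < chunk.min_key or key > chunk.max_key:
            continue  # pruned at execution time, once the key is known
        scanned_chunks += 1
        result.extend(row for row in chunk.rows if row[0] == key)
    return result, scanned_chunks


def join_as_scan(fact_chunks, dim_rows, label):
    """Replace 'fact JOIN dim ON key WHERE dim.label = ?' by a scan on fact."""
    key = lookup_key(dim_rows, label)
    if key is None:
        return [], 0
    return scan_with_dynamic_pruning(fact_chunks, key)


# Hypothetical example data.
dim = [(10, "a"), (20, "b"), (30, "c")]
fact_chunks = [
    Chunk([(10, "x"), (10, "y"), (12, "z")]),
    Chunk([(20, "p"), (25, "q")]),
    Chunk([(28, "r"), (30, "s")]),
]

rows, scanned = join_as_scan(fact_chunks, dim, "b")
print(rows, scanned)  # [(20, 'p')] 1
```

In this toy setting, only one of the three chunks is scanned. The sketch also hints at the drawback mentioned above: the scan on the larger table cannot start before the subquery has finished, which adds a dependency to the execution order of operators.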
We will implement dynamic pruning and data-induced predicates in the in-memory DBMS Hyrise and measure their impact with and without dependency-based optimization techniques.