Hasso-Plattner-Institut
Prof. Dr. h.c. mult. Hasso Plattner
 

Master's Project: Parallelization and Query Plan Optimizations for the TPC-DS benchmark

Description

For relational databases, new features are quite rare. SQL has been largely unchanged for years, so all that users care about is more performance for less cost. Typically, standardized benchmarks are used to compare the throughput of two competing database systems or that of an old and a new version of a database. One of these benchmarks is the TPC-DS benchmark, which simulates queries as seen in decision support systems. Compared to the TPC-H benchmark, which we already support, the TPC-DS poses more challenges as queries are more complex and the input data is skewed.

In this project, we will take the TPC-DS benchmark as a yardstick for improving our own database, Hyrise. The focus will be (1) on improving the scalability of the system, i.e., using additional CPU cores as efficiently as possible and (2) on optimizing the query plans so that more efficient execution paths are chosen.

As we already have a benchmark framework in place, it will be a matter of days before we can look at first performance numbers. From there, we can track our improvements and will have measurable successes early in the project.

We will not perform any “throw-away work”, but aim for results that can be integrated into the main code base and will improve the overall project. After this project, there will be opportunities to dive deeper into identified issues as part of Master’s theses.

Goals

A number of goals can be addressed independently. We will select goals depending on the number of students, their interests, and our progress during the project:

  1. Implement the TPC-DS benchmark (generating data, parameterizing queries)
  2. Add SQL features that might be missing for the execution of all TPC-DS queries
  3. Analyze the query plans generated for the TPC-DS and teach the optimizer to generate more efficient plans
  4. Identify control flow inefficiencies and improve the operators to a point where we are entirely memory-bound
  5. Improve the scheduler and the placement of data on different NUMA nodes in order to increase the TPC-DS throughput in systems with many cores

Existing Infrastructure

We will not waste any time in setting up dependencies, as most of the setup needed for this project already exists. This includes

  • a database system that can execute most queries out of the box,
  • a benchmark framework, which automatically executes queries in parallel, tracks their execution time and reports the results,
  • scripts for comparing multiple benchmark runs, e.g., for tracking the improvement made by a commit; as well as scripts that plot the throughput with varying number of CPUs or the improvement over time, and
  • a code base with a high degree of test coverage (>90%) and a CI server that enforces code quality for all pull requests

Learning Goals

During this project, you will gain insights into how databases execute complex queries. This will be helpful even if you do not plan to continue building your own database, because it also enables you as a user of databases to write more efficient SQL queries.

Furthermore, you will get a better understanding of multithreading both on a single processor and on systems with up to 16 CPUs and 480 logical cores.

Finally, as you will have to work with existing components such as the query optimizer and the scheduler, you will learn to familiarize yourself with an existing code base.

Prerequisites

Prior understanding of the fundamentals of databases (e.g., from the Datenbanksysteme lecture, the Trends and Concepts online class, or the Develop your own Database seminar) is expected as well as knowledge of C++.

Contact

You are welcome to contact one of us via mail or visit us in the villa.

Matthias Uflacker, Markus Dreseler, Jan Koßmann