Hasso-Plattner-Institut
Prof. Dr. h.c. Hasso Plattner
 

Research and Implementation of Database Concepts

General Information

  • Teaching staff: Thomas Bodner, Martin Boissier, Stefan Halfpap, Marcel Weisgut, Jan Koßmann, Dr. Daniel Ritter, Dr. Michael Perscheid
  • 4 weekly semester hours (Semesterwochenstunden, SWS) - 6 ECTS (graded)
  • First meeting: 25 October 2021
  • Room: Online via Zoom (Passcode: 96714552) and A1.2 (changed!) - choose whatever suits you
  • Time: Monday 15:15 (applies only to the first meeting; afterwards, appointments are scheduled with the project supervisor)
  • Enrollment: 1 October until 22 October 2021
  • Exam date: no exam, see schedule below
  • Specialization areas:
    • ITSE: BPET, OSIS, ITSE-Analyse, ITSE-Maintenance
    • DATA: Scalable Data Systems

 

About this Seminar

Our database research seminar invites students who are interested in working on research-related topics in the area of database systems and, in particular, on our research database systems Hyrise and Skyrise. An introduction is given in the Hyrise and Skyrise research papers and the open source Hyrise repository.

Logistics

  • In the first meeting, we will introduce the instructors and present the different topics.
  • The first meeting will be held online.
  • Submit your topic preferences by October 31, 2021. Topic assignments will be announced on November 1, 2021. (Details will be discussed in the first meeting.)
  • Subsequent meetings will be held within the individual project groups. Depending on your and your supervisor's preferences, they can take place online or in person.

Example Topics

This list of topics is not exhaustive and we are happy to discuss research projects based on your previous experience and personal interests.

  • Partial Indexes: Indexes can improve the latency of database queries. However, indexes have to be maintained and introduce memory overhead. If tables are horizontally partitioned, indexing only frequently accessed partitions can reduce the memory footprint while still making it possible to locate frequently accessed data efficiently. This project will look into partial indexes that are built on only a subset of a table's partitions. We will analyze the latencies of lookup and maintenance operations and the memory footprint compared to traditional indexes created on entire tables.
  • In-Memory Pipelined Query Execution: The pipelined query execution model passes intermediate results between query operators one tuple or one batch at a time, rather than in their entirety. This reduces the memory footprint and enables parallelism along pipelines of operators. In this work, we remodel the query execution within the FaaS-based Skyrise workers to pipeline intermediates.
  • Database Node Placement in the Cloud: Database systems are increasingly deployed and run in the cloud. In this setting, we have to assign database VMs (or containers) to physical resources (large interconnected storage and compute servers) with respect to their performance goals and availability requirements (e.g., by using different availability zones). Such assignment problems are usually hard to solve optimally. However, greedy heuristics and decomposition approaches using mathematical optimization may enable us to obtain reasonable solutions quickly.
  • Analyzing Traces of Serverless Query Execution: An inherent issue of serverless software systems is the observability of their inner mechanics, rendering debugging and profiling efforts cumbersome. Skyrise has a monitoring subsystem that generates a myriad of logs, metrics, and traces per query executed in parallel on hundreds to thousands of workers. This topic is about effectively and efficiently analyzing these artifacts to help database developers better understand serverless query execution.
  • Efficient Histograms: Histograms are used in database systems to estimate cardinalities during query optimization. Improving their accuracy can thus have a significant impact on performance. Hyrise builds histograms for entire columns, making their creation rather expensive, especially for 100 GB+ data sets. In this year’s seminar, you will implement alternative histograms and sampling. The evaluation is done using well-known metrics such as the q-error and end-to-end benchmarks.
  • Incorporating Distributed Plans into Query Optimization: Query processing on scalable cloud infrastructure presents new opportunities and challenges for query optimization. A query optimizer for this environment may exploit the parallelism of the underlying infrastructure but must be aware of data partitioning, and thus data distribution and data shuffling during query execution. This project looks into rewrite rules for both logical and physical query plans, based on heuristics and costs.
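
As a rough illustration of the q-error metric mentioned in the histogram topic, the following Python sketch builds a simple equi-width histogram, estimates the cardinality of a `value < x` predicate, and compares the estimate with the true count. The function names, the equi-width layout, the uniformity assumption within a bin, and the clamping of cardinalities to at least 1 are assumptions made for this example; none of it is taken from Hyrise.

```python
def q_error(estimated: float, actual: float) -> float:
    """Return max(est/actual, actual/est); 1.0 means a perfect estimate."""
    # Assumption: clamp both values to 1 to avoid division by zero.
    est = max(float(estimated), 1.0)
    act = max(float(actual), 1.0)
    return max(est / act, act / est)


def build_equi_width_histogram(values, num_bins):
    """Return (lower bound, bin width, per-bin counts) for a numeric column."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all values identical; any positive width works
    counts = [0] * num_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return lo, width, counts


def estimate_less_than(histogram, x):
    """Estimate how many values are < x, assuming uniformity within a bin."""
    lo, width, counts = histogram
    if x <= lo:
        return 0.0
    pos = (x - lo) / width
    full_bins = int(pos)
    if full_bins >= len(counts):
        return float(sum(counts))
    # Fully covered bins plus a proportional share of the partially covered bin.
    return sum(counts[:full_bins]) + counts[full_bins] * (pos - full_bins)


if __name__ == "__main__":
    column = list(range(100))
    hist = build_equi_width_histogram(column, num_bins=10)
    estimate = estimate_less_than(hist, 50)
    actual = sum(1 for v in column if v < 50)
    print(f"estimate={estimate:.2f}, actual={actual}, "
          f"q-error={q_error(estimate, actual):.3f}")
```

The q-error is symmetric, so over- and underestimation by the same factor are penalized equally. This is one reason it is commonly used for evaluating cardinality estimates.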

Learning Goals

Participants will deepen their understanding of data management technologies and improve their systems development skills by working with a large existing code base. Additionally, they will gain experience in the scientific method and scientific writing, which will serve as preparation for their upcoming master’s theses.

Seminar Schedule

  1. Topics: During the first week of the lecture period, potential topics will be presented by the supervisors and chosen by the participants. The topics can be worked on alone or in groups of two.
  2. Familiarization: The participants are expected to familiarize themselves with the chosen topic and study recent publications that are provided by the supervisors.
  3. Project: Afterwards, participants will implement and evaluate their approaches with guidance from the supervisors.
  4. Final Presentations: Presentations of approximately 30 minutes (~20 min. presentation + 10 min. Q&A) will be held after the end of the lecture period on February 28, 2022 (expected).
  5. Scientific Report: In the end, a scientific report (4-8 pages, depending on the group size, in ACM format) should put the targeted problem into context (challenges, motivation, and related work), document the approach taken, and present evaluations as well as lessons learned to answer the research questions raised. The expected date for the final report is March 20, 2022.

Prerequisites

  • Good knowledge of C++ and/or Python
  • Basic knowledge of database systems (e.g., DBS or TuK I lectures)
  • Prior attendance of the Develop Your Own Database seminar is beneficial but not required

Grading

  • 50% project result and presentation
  • 40% scientific report
  • 10% personal engagement