Hasso-Plattner-Institut
Prof. Dr. h.c. mult. Hasso Plattner
 

Research and Implementation of Database Concepts

General Information

 

About this Seminar

Our database research seminar invites students who are interested in working on research-related topics in the area of database systems, in particular on our research database systems Hyrise and Skyrise. The Hyrise and Skyrise research papers and the open-source Hyrise repository provide an introduction.

Logistics

  • In the first meeting, we will introduce the instructors and present the different topics.
  • The first meeting will be held online.
  • Subsequent meetings will be held within the individual groups. Depending on your and your instructor's preferences, they can take place online or in person.

Example Topics

This list of topics is not exhaustive and we are happy to discuss research projects based on your previous experience and personal interests.

  • What-If Optimizer: Query optimizers aim at generating the most efficient execution plan for declarative queries based on the underlying data and configuration, e.g., indexes. What-if optimizers fulfill the same task, but they consider hypothetical, non-existing configurations and data distributions instead. The returned information is vital for self-driving database systems that adjust their configuration autonomously.
  • Histograms: Histograms are indispensable for accurate cardinality estimations in database systems. At the same time, they can be expensive to create and update. We will look into local histograms that are persisted on disk (allowing faster data loads and recoveries) and that can be merged (while retaining accuracy) and updated.
  • Tracking Memory Allocations: In-memory databases need to carefully manage their memory resources. While system profilers such as perf or VTune help us understand where memory is allocated, they lack semantic context (i.e., which table the memory was allocated for). By tracking memory allocations directly in the application using polymorphic memory resources, we can enrich the allocations with additional context information. This helps us track down memory waste and optimize resource allocation in scenarios where DRAM capacity limits are reached.
  • Cost-Performance Tradeoffs in Query Execution on Cloud Functions: To enable Skyrise's prospective query optimizer to trade off cost and performance of queries, we introduce related degrees of freedom into its execution engine. We allow for pre-provisioning of cloud functions to avoid cold-start latencies. We support interleaved materialized execution to reduce the impact of stragglers. Finally, we add a staged data exchange operator for reduced parallelism and storage request cost.
  • Object Metadata Management for Cloud Storage: Cloud object storage systems, such as Amazon S3, can cost-efficiently store terabytes to petabytes of data in thousands to millions of objects. However, they provide only weak data consistency guarantees, simplistic data access APIs, and poor request latencies. To enable effective and efficient relational query processing on top of these cloud object stores, we design and implement a table format for Skyrise on top of commonly used columnar file formats, such as Apache ORC, that supports concurrency control, fast statistics lookups, and data pruning.
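
To give a rough idea of the histogram topic above, the following minimal Python sketch merges two local histograms and uses the result for a cardinality estimate. It is an illustration only and not taken from Hyrise: the class EquiWidthHistogram, the function merge, and the sample counts are hypothetical, and the sketch assumes that both histograms share identical bin boundaries. Under that assumption, merging is plain addition of counts and loses no accuracy; local histograms with differing boundaries are the harder case.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EquiWidthHistogram:
    """Hypothetical histogram over [lower, upper) with equally wide bins."""

    lower: float
    upper: float
    counts: List[int]

    @property
    def bin_width(self) -> float:
        return (self.upper - self.lower) / len(self.counts)

    def estimate_less_than(self, value: float) -> float:
        """Estimate how many rows satisfy `column < value`."""
        if value <= self.lower:
            return 0.0
        if value >= self.upper:
            return float(sum(self.counts))
        position = (value - self.lower) / self.bin_width
        # Clamp to guard against floating-point rounding near the upper bound.
        full_bins = min(int(position), len(self.counts) - 1)
        fraction = position - full_bins
        # Bins below the value count fully; values in the partially
        # covered bin are assumed to be uniformly distributed.
        return sum(self.counts[:full_bins]) + fraction * self.counts[full_bins]


def merge(left: EquiWidthHistogram, right: EquiWidthHistogram) -> EquiWidthHistogram:
    """Merge two local histograms that share identical bin boundaries."""
    left_shape = (left.lower, left.upper, len(left.counts))
    right_shape = (right.lower, right.upper, len(right.counts))
    if left_shape != right_shape:
        raise ValueError("this sketch only merges histograms with identical bins")
    merged_counts = [a + b for a, b in zip(left.counts, right.counts)]
    return EquiWidthHistogram(left.lower, left.upper, merged_counts)


# Two local histograms (e.g., one per data chunk) with made-up counts.
chunk_a = EquiWidthHistogram(0.0, 100.0, [10, 20, 30, 40])
chunk_b = EquiWidthHistogram(0.0, 100.0, [5, 5, 5, 5])
table = merge(chunk_a, chunk_b)

print(table.counts)                    # [15, 25, 35, 45]
print(table.estimate_less_than(50.0))  # 40.0
```

The estimate for `column < 50` covers exactly the first two merged bins (15 + 25 rows). For a value inside a bin, such as 37.5, the sketch interpolates linearly within that bin.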

Learning Goals

Participants will deepen their understanding of data management technologies and improve their systems development skills by working with a large existing code base. Additionally, they will gain experience in the scientific method and scientific writing, which will serve as preparation for their upcoming master's theses.

Seminar Schedule

  1. Topics: During the first week of the lecture period, potential topics will be presented by the supervisors and chosen by the participants. The topics can be worked on alone or in groups of two.
  2. Familiarization: The participants are expected to familiarize themselves with the chosen topic and study recent publications that are provided by the supervisors.
  3. Project: Afterwards, participants implement and evaluate their approach with guidance from the supervisors.
  4. Final Presentation: Presentations of approximately 20 minutes (15 min. presentation + 5 min. Q&A) will be held at the end of the lecture period.
  5. Scientific Report: In the end, a scientific report of 4-8 pages in IEEE format (depending on the group size) should put the targeted problem into context (challenges, motivation, and related work), document the approach taken, and present evaluations as well as learnings to answer the research questions raised.

Prerequisites

  • Good knowledge of C++ and/or Python
  • Basic knowledge of database systems (e.g., DBS or TuK I lectures)
  • Previous attendance of the Develop Your Own Database seminar is beneficial but not required

Grading

  • 50% project result and presentation
  • 40% scientific report
  • 10% personal engagement