Hasso-Plattner-Institut
Prof. Dr. h.c. Hasso Plattner

Open Master Theses in In-Memory Data Management

We are looking for interested students to tackle the following master's thesis topics in the area of in-memory data management:

 

Transactional Optimizations for Hyrise

A key characteristic of transactional database workloads is a high number of single row accesses, e.g., "Give me order #123". Finding these records can be greatly accelerated by using inverted indexes. In a database system that employs Multi-Version Concurrency Control, such an index can contain multiple entries for this order. Only one of these entries is valid for the current transaction. An optimized scan has to take both the possibility of invalidated data and the existence of non-indexed parts of the table into account.
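As a rough illustration (a minimal Python sketch, not Hyrise's actual data structures), an MVCC-aware point lookup has to merge index probes with a scan of the non-indexed tail of the table and then filter by version visibility. The names and the begin/end-TID scheme below are simplifying assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class RowVersion:
    key: int
    value: str
    begin_tid: int                  # transaction that created this version
    end_tid: float = float("inf")   # transaction that invalidated it (if any)

@dataclass
class VersionedTable:
    rows: list = field(default_factory=list)
    index: dict = field(default_factory=dict)  # key -> list of row positions
    indexed_up_to: int = 0  # positions >= this offset are not indexed yet

    def insert(self, row, indexed=True):
        # Simplification: indexed rows are assumed to precede the unindexed tail.
        self.rows.append(row)
        pos = len(self.rows) - 1
        if indexed:
            self.index.setdefault(row.key, []).append(pos)
            self.indexed_up_to = pos + 1

    def visible(self, row, snapshot_tid):
        # A version is visible if it was created before, and not yet
        # invalidated at, the transaction's snapshot.
        return row.begin_tid <= snapshot_tid < row.end_tid

    def lookup(self, key, snapshot_tid):
        # 1. Probe the index; it may return invalidated versions of the key.
        candidates = [self.rows[p] for p in self.index.get(key, [])]
        # 2. Also scan the non-indexed tail of the table.
        candidates += [r for r in self.rows[self.indexed_up_to:] if r.key == key]
        # 3. Keep only the versions visible to this transaction.
        return [r for r in candidates if self.visible(r, snapshot_tid)]
```

An optimized scan in the thesis would replace steps 1-3 with a single operator that avoids materializing invalidated candidates in the first place.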

Another challenge in the world of transactional workloads is the cost of query parsing and optimization. While analytical queries are dominated by the time spent in the execution engine, transactional queries typically have a very short run time. As such, the time spent in the parser and optimizer becomes more relevant and should be reduced.
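One standard mitigation, shown here as a hedged Python sketch (Hyrise's actual caching may differ, and the names are our own), is to cache the optimized plan for parameterized statements so that repeated transactional queries skip the parser and optimizer entirely:

```python
import time

def parse_and_optimize(sql):
    # Stand-in for an expensive parser/optimizer pass.
    time.sleep(0.01)
    return ("plan", sql.strip().lower())

class PlanCache:
    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get_plan(self, sql):
        # Normalize whitespace and case so trivially different query strings
        # map to the same cached plan.
        key = " ".join(sql.split()).lower()
        plan = self._cache.get(key)
        if plan is None:
            self.misses += 1
            plan = parse_and_optimize(sql)
            self._cache[key] = plan
        else:
            self.hits += 1
        return plan
```

With parameterized queries ("WHERE id = ?"), a handful of cached plans can serve millions of transactional requests, which is why this cost factor is worth measuring.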

The goal of this thesis is to analyze the biggest cost factors in transactional workloads and to optimize them.

Contact: Markus Dreseler


 

Autonomous Database Systems

Increasing volumes of data, varying workloads, and complex systems make database administration increasingly challenging for human database administrators. Autonomous or self-driving database systems utilize their knowledge of processed workloads, the stored data, and other runtime information to support database administrators in their tasks or to optimize the system's configuration without any human intervention. For example, such systems are capable of selecting indexes that substantially improve the system's runtime performance.
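Index selection illustrates how such systems can work internally. One common approach (sketched below in Python; the cost model and all names are hypothetical) is a greedy heuristic that repeatedly adds the candidate index with the best estimated benefit, subject to a storage budget:

```python
def greedy_index_selection(candidates, workload_cost, budget):
    """Greedily pick index candidates while they reduce estimated workload
    cost, up to a storage budget. `candidates` maps index name -> size;
    `workload_cost(indexes)` is a hypothetical what-if cost model."""
    chosen = set()
    used = 0
    current = workload_cost(chosen)
    improved = True
    while improved:
        improved = False
        best, best_cost = None, current
        for name, size in candidates.items():
            if name in chosen or used + size > budget:
                continue
            cost = workload_cost(chosen | {name})
            if cost < best_cost:
                best, best_cost = name, cost
        if best is not None:
            chosen.add(best)
            used += candidates[best]
            current = best_cost
            improved = True
    return chosen, current
```

The challenges above show up directly here: the what-if cost calls dominate runtime (ii), the heuristic gives no optimality guarantee (i), and the choices must be explainable to an administrator (iii).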

There are various challenges regarding autonomous approaches: (i) achieving robust and efficient optimization by relying on heuristics, optimization, or machine learning methods; (ii) integrating such approaches into database systems with acceptable implementation and runtime overhead; and (iii) mitigating trust issues of database administrators and users that are caused by non-explainable decisions made by autonomous systems.

There are several potential topics for master's theses available in the areas above.

Contact: Jan Kossmann


 

Workload-driven Replication

In replication schemes, replica nodes process queries on snapshots of the master. By analyzing the workload, we can identify query access patterns and replicate data according to its access frequency. We offer to investigate how to optimize individual replication nodes in scale-out scenarios,

  • e.g., to lower the overall memory footprint by partial replication,
  • or to increase the analytical throughput by specialized indexes.
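To make the idea concrete, the following Python sketch (our own simplification, not an existing system) assigns queries to replicas by descending frequency and lets each replica store only the data fragments its queries touch, which is exactly where the memory savings of partial replication come from:

```python
def partition_queries(query_freq, query_fragments, num_replicas):
    """Assign queries to replicas, most frequent first, preferring the
    replica that already stores most of the fragments a query touches.
    Each replica then only replicates the fragments its queries access."""
    replicas = [{"load": 0, "fragments": set(), "queries": []}
                for _ in range(num_replicas)]
    for q in sorted(query_freq, key=query_freq.get, reverse=True):
        needed = set(query_fragments[q])
        # Pick the replica needing the fewest extra fragments; break ties by load.
        best = min(replicas,
                   key=lambda r: (len(needed - r["fragments"]), r["load"]))
        best["queries"].append(q)
        best["fragments"] |= needed
        best["load"] += query_freq[q]
    return replicas
```

In the example below, two partial replicas store 3 fragment copies in total, whereas full replication on two nodes would store 6.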

Contact: Stefan Halfpap


Optimized Data Structures for In-Memory Trajectory Data Management

In recent years, rapid advances in location-acquisition technologies have led to large amounts of time-stamped location data. Positioning technologies like Global Positioning System (GPS)-based, communication network-based (e.g., 4G or Wi-Fi), and proximity-based (e.g., Radio Frequency Identification) systems enable the tracking of various moving objects, such as vehicles and people. A trajectory is represented by a series of chronologically ordered sampling points. Each sampling point contains spatial information, represented by a multidimensional coordinate in a geographical space, and temporal information, represented by a timestamp. Trajectory data is the foundation for a wide spectrum of services driven and improved by trajectory data mining. By analyzing the movement behavior of individuals or groups of moving objects in large-scale trajectory data, improvements in various fields of application can be achieved.

However, it is a challenging task to manage, store, and process trajectory data. Based on the characteristics of spatio-temporal trajectory data, there are four key challenges: the data volume, the high update rate (data velocity), the query latency of analytical queries, and the inherent inaccuracy of the data. For these reasons, it is a nontrivial task to manage and store the vast amounts of data that accumulate rapidly, especially under hybrid transactional and analytical workloads (so-called HTAP or mixed workloads), which are challenging with respect to both space and time complexity.

  • Compression
    The scope of this topic is the analysis and evaluation of different trajectory compression techniques for columnar in-memory databases.
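As one classic example of such a technique (lossy and purely spatial here; the columnar-storage aspects of the thesis are out of scope in this sketch), the Douglas-Peucker algorithm drops sampling points that deviate less than a threshold from the straight line between a segment's endpoints:

```python
import math

def _point_line_dist(p, a, b):
    # Perpendicular distance from point p to the line through a and b.
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Lossy trajectory compression: keep only points that deviate more
    than `epsilon` from the line between the segment's endpoints."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the line between the endpoints.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    # Recurse on both halves; drop the duplicated split point.
    left = douglas_peucker(points[:idx + 1], epsilon)
    right = douglas_peucker(points[idx:], epsilon)
    return left[:-1] + right
```

Evaluating such algorithms for a columnar in-memory database additionally involves their interaction with dictionary encoding and the time dimension of the sampling points, which this sketch ignores.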

Contact: Keven Richly


 

Enterprise Streaming Benchmark - Result Validation

In recent years, new technologies have been developed that can handle and analyze data streams, e.g., Apache Flink or Apache Spark Streaming. To compare different systems and architectures, we are developing the Enterprise Streaming Benchmark. After running the queries/workload on the system under test, it is crucial to check whether the produced results are correct. Moreover, performance KPIs need to be calculated. The technologies used for this are Scala and Akka.

The goal of the thesis is to identify ways to validate more complex queries, which do not have a validation yet due to certain technical challenges. Furthermore, relevant performance KPIs and ways of computing them should be developed. The work on the master's thesis includes:

  • Working with Scala and Akka  
  • Understanding data stream processing system concepts
  • Extending the validation and KPI calculation application and comparing different validation approaches with regard to runtime
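The validation and KPI steps can be illustrated as follows (a Python sketch for readability; the actual benchmark tooling uses Scala and Akka, and the function names here are our own):

```python
def validate_results(expected, actual, tolerance=1e-6):
    """Compare produced query results against expected ones, allowing a
    numeric tolerance; returns (is_valid, list of mismatch descriptions)."""
    mismatches = []
    for key, exp in expected.items():
        act = actual.get(key)
        if act is None:
            mismatches.append(f"missing result for {key}")
        elif abs(act - exp) > tolerance:
            mismatches.append(f"{key}: expected {exp}, got {act}")
    return (not mismatches, mismatches)

def latency_kpis(ingest_ts, output_ts):
    """A simple performance KPI: per-record end-to-end latency statistics."""
    latencies = sorted(out - ing for ing, out in zip(ingest_ts, output_ts))
    n = len(latencies)
    return {
        "mean": sum(latencies) / n,
        "p95": latencies[min(n - 1, int(0.95 * n))],
        "max": latencies[-1],
    }
```

The hard part the thesis addresses is what "expected" means for complex queries, e.g., when windowing and out-of-order events make the correct result non-deterministic.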

Moreover, we offer positions as research assistant/Hiwi for the described topic as well as further areas in the context of the Enterprise Streaming Benchmark.

Contact: Guenter Hesse