Prof. Dr. h.c. Hasso Plattner

Master Thesis Topic Areas

Please find our list of available master's thesis topics below. If you are interested in any of these topics, please feel free to contact the responsible research assistant for further information.

A Storage Engine for HTAP Workloads on Standard Columnar Data Formats

The database and data science communities are jointly standardizing column-oriented data layouts to facilitate data interchange between their systems. These layouts enable efficient data exchange by maximizing language interoperability and minimizing (de)serialization overhead. They also enable efficient analytical data processing by supporting vectorized execution, data locality, and compression. However, they are designed for strictly read-only workloads. As a result, they cannot directly serve enterprise applications, which also write data via transactions but would otherwise benefit greatly from the ability to share data easily.
In this work, we integrate the Apache Arrow in-memory data format, as an exemplary standard columnar format, into the storage engine of our research HTAP DBMS Hyrise. We extend Arrow to support transaction processing while preserving its desirable properties, and we remodel Hyrise's concurrency control mechanism to accommodate Arrow.
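The core difficulty, adding transactional semantics to an immutable columnar format, can be illustrated with a minimal multiversion concurrency control (MVCC) sketch (hypothetical structures and transaction IDs; Hyrise's actual mechanism differs): the value buffer stays immutable, matching Arrow's read-only layout, and visibility is tracked in separate version vectors.

```python
# Minimal MVCC sketch over an immutable column (hypothetical data).
# Each row carries begin/end transaction IDs; the values themselves
# are never modified, matching Arrow's read-only buffers.

values   = [100, 200, 300]          # immutable value buffer
begin_tx = [1, 1, 1]                # tx id that created each row
end_tx   = [None, 4, None]          # tx id that invalidated each row

def visible(row, tx_id):
    """A row is visible to tx_id if it was created at or before tx_id
    and not yet invalidated as of tx_id."""
    return begin_tx[row] <= tx_id and (
        end_tx[row] is None or end_tx[row] > tx_id)

def snapshot(tx_id):
    return [values[r] for r in range(len(values)) if visible(r, tx_id)]

# Transaction 3 still sees the row deleted by transaction 4;
# transaction 5 does not.
print(snapshot(3))  # [100, 200, 300]
print(snapshot(5))  # [100, 300]
```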

Contact: Thomas Bodner

Enterprise Streaming Benchmark - Result Validation

In recent years, new technologies have been developed that can handle and analyze data streams, e.g., Apache Flink or Apache Spark Streaming. To compare different systems and architectures, we are developing the Enterprise Streaming Benchmark. After running the queries/workload on the system under test, it is crucial to check whether the produced results are correct. Moreover, performance KPIs need to be calculated. The technologies used for this are Scala and Akka.

The goal of the thesis is to identify ways to validate more complex queries, which do not have a validation yet due to certain technical challenges. Furthermore, relevant performance KPIs and ways of computing them should be developed. The work on the master's thesis includes:

  • Working with Scala and Akka
  • Understanding data stream processing system concepts
  • Extending the validation and KPI calculation application and comparing different validation approaches with respect to runtime
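As a rough illustration of the validation task, a result check could compare the output of the system under test against a reference result while ignoring record order (streaming systems rarely guarantee ordering) and tolerating small floating-point differences in aggregates. The data and function names below are hypothetical; the actual benchmark tooling is written in Scala/Akka.

```python
# Hypothetical validation sketch: compare produced keyed aggregates
# against a reference result, order-insensitively and with a
# numeric tolerance.

def validate(actual, expected, tol=1e-6):
    if len(actual) != len(expected):
        return False
    exp = dict(expected)
    # Match on keys, compare aggregate values with tolerance.
    return all(k in exp and abs(v - exp[k]) <= tol for k, v in actual)

reference = [("machine-1", 41.5), ("machine-2", 38.0)]
produced  = [("machine-2", 38.0), ("machine-1", 41.5000001)]
print(validate(produced, reference))  # True
```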

Moreover, we offer positions as a research assistant (HiWi) for the described topic as well as for further areas in the context of the Enterprise Streaming Benchmark.

Contact: Guenter Hesse

Workload-driven Replication

In replication schemes, replica nodes process queries on snapshots of the master. By analyzing the workload, we can identify query access patterns and replicate data according to its access frequencies. We offer to investigate how to optimize individual replication nodes in scale-out scenarios,

  • e.g., to lower the overall memory footprint by partial replication,
  • or to increase the analytical throughput by specialized indexes.
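The kind of workload-driven decision involved can be sketched as a greedy assignment of table fragments to replicas based on their access frequencies (table names, frequencies, and the balancing strategy below are all made up for illustration; real partial-replication schemes also consider query containment and data overlap):

```python
# Hypothetical sketch: assign fragments to replicas greedily by
# access frequency, so that load is balanced and each replica only
# holds part of the data (partial replication).

freqs = {"orders": 50, "lineitem": 30, "customer": 15, "nation": 5}

def assign(fragments, n_replicas):
    replicas = [{"load": 0, "frags": []} for _ in range(n_replicas)]
    # Place the hottest fragments first, each on the least-loaded replica.
    for frag, f in sorted(fragments.items(), key=lambda kv: -kv[1]):
        target = min(replicas, key=lambda r: r["load"])
        target["load"] += f
        target["frags"].append(frag)
    return replicas

for r in assign(freqs, 2):
    print(r["load"], r["frags"])
```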

Contact: Stefan Halfpap

Combining Machine Learning and External Knowledge for Analyzing Gene Expression Profiles

Gene expression is the cellular process by which information from specific sections of the DNA, i.e., genes, is used to synthesize functional products such as proteins, which catalyze the metabolic processes in our cells. Analyzing gene expression profiles is of particular interest to researchers, as such profiles provide insights into cell processes and gene functions and can thus improve disease diagnosis and treatment.

Nowadays, gene expression profiles of several thousand genes from several hundred tissue samples can be generated. These data sets require computational tools applying machine learning techniques for a meaningful analysis. On the other hand, many publicly available databases contain curated biomedical information, e.g., on protein-disease interactions.

Contact: Cindy Perscheid

Topic Area: Association Rule Mining on Gene Expression Data (Contact: Cindy Perscheid)

Association rule mining, or itemset mining, is applied to gene expression data to identify correlations between the expression levels of different genes. A derived rule has the form GeneA (up) → GeneB (up), meaning that if GeneA is upregulated, then typically GeneB is upregulated as well. This information helps researchers derive unknown gene functions and better understand regulatory processes in cells for different disease types. The number of rules resulting from such analyses is typically reduced with standard interestingness measures, e.g., support and confidence. These measures are driven by statistical analyses of the data sets. However, the interestingness of a gene or a resulting rule should also take into account its biological relevance, which can only be derived from external sources. Possible topics for a master's thesis are:

  • Application of association rule mining to gene expression data, considering computational feasibility, e.g., high data dimensionality combined with comparatively low numbers of transactions
  • Definition of a subjective interestingness measure for association rules with a special focus on their biological relevance, e.g., by incorporating external knowledge
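The standard statistical measures mentioned above, support and confidence, can be sketched in a few lines (toy samples below; real gene expression data sets contain thousands of genes, which is exactly the feasibility challenge):

```python
# Support and confidence for a rule GeneA(up) -> GeneB(up),
# computed over a tiny toy set of samples (hypothetical data).
# Each sample is the set of items (discretized expression states)
# observed in one tissue sample.

samples = [
    {"GeneA_up", "GeneB_up", "GeneC_up"},
    {"GeneA_up", "GeneB_up"},
    {"GeneA_up"},
    {"GeneB_up"},
]

def support(itemset):
    """Fraction of samples containing the whole itemset."""
    return sum(itemset <= s for s in samples) / len(samples)

def confidence(antecedent, consequent):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"GeneA_up", "GeneB_up"}))        # 0.5
print(confidence({"GeneA_up"}, {"GeneB_up"}))   # ~0.667
```

A biology-aware, subjective measure would reweight or re-rank rules beyond these purely frequency-based scores, e.g., using curated pathway databases.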

Topic Area: Biclustering on Gene Expression Data (Contact: Cindy Perscheid)

Currently, clustering and classification are applied to gene expression data to identify specific expression profiles, e.g., for a particular cancer type. Traditional clustering assigns each gene to a single cluster. A gene, however, participates on average in 10 cell processes. Traditional clustering therefore cannot appropriately reflect the correlations between genes, as it shows only one specific view on the data. Biclustering makes it possible to identify overlapping clusters and subspaces in gene expression data, reflecting the underlying cell processes much better. The number of resulting biclusters must be filtered with interestingness measures. These measures are driven by statistical analyses of the data sets. However, the interestingness of a bicluster should also take into account its biological relevance, which can only be derived from external sources. Possible topics for a master's thesis are:

  • Visualization of biclustering results
  • Definition of a subjective ranking measure for biclusters with special focus on their biological relevance, e.g. corresponding to known cell processes
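For intuition, one widely used statistical score for biclusters is the mean squared residue of Cheng and Church; a subjective, biology-aware ranking would complement such a purely statistical score. The matrix below is a toy example.

```python
# Mean squared residue (Cheng & Church) of a candidate bicluster,
# given as row and column index sets of an expression matrix.
# A perfectly "additive" submatrix has residue 0.

def msr(matrix, rows, cols):
    sub = [[matrix[r][c] for c in cols] for r in rows]
    n, m = len(rows), len(cols)
    row_mean = [sum(row) / m for row in sub]
    col_mean = [sum(sub[i][j] for i in range(n)) / n for j in range(m)]
    mean = sum(map(sum, sub)) / (n * m)
    return sum(
        (sub[i][j] - row_mean[i] - col_mean[j] + mean) ** 2
        for i in range(n) for j in range(m)) / (n * m)

expr = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],   # rows 0 and 1 shift by a constant: coherent
    [9.0, 1.0, 5.0],   # row 2 does not fit the pattern
]
print(msr(expr, [0, 1], [0, 1, 2]))  # 0.0
```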

Tracing and Sampling Memory Accesses and the Conflict between Accuracy and Performance

High-capacity NVRAM will soon enter the storage pyramid between DRAM and SSDs. It allows for cheaper main memory, but will initially be slower than DRAM. We expect data structures to be placed either on DRAM or on NVRAM, depending on how they are used and with the goal of minimizing the impact of NVRAM’s higher latency. In our research group, we have developed a system that automatically migrates data between DRAM and NVRAM. To do so efficiently, we need to understand how data is accessed. This includes the frequency and recency of accesses as well as their type, such as sequential versus random accesses.

Many approaches exist to trace memory accesses at runtime. They vary in their accuracy and in the overhead they impose on the execution. For instance, interrupting the program on every load and store captures all memory accesses, but comes with a runtime cost that is prohibitive for live applications. Various other approaches use hardware counters, modifications to the page management, or code hot patching.
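The accuracy/overhead trade-off can be illustrated with a toy sampling tracer that records only every k-th access of a synthetic access stream and scales the counts back up. This is a pure simulation, not an actual hardware- or page-protection-based tracer.

```python
import random

# Toy sketch: sample every k-th access of a synthetic stream and
# estimate per-address frequencies. Higher sampling rates mean
# lower overhead but noisier estimates.

def sample_trace(accesses, rate):
    counts = {}
    for i, addr in enumerate(accesses):
        if i % rate == 0:                 # cheap periodic sample
            counts[addr] = counts.get(addr, 0) + 1
    # Scale back up to estimate true access counts.
    return {a: c * rate for a, c in counts.items()}

random.seed(0)                            # deterministic toy stream
stream = [0x1000] * 900 + [0x2000] * 100  # one hot, one cold address
random.shuffle(stream)

est = sample_trace(stream, rate=10)
# The hot address dominates the estimate, at 1/10th the tracing cost.
print(est)
```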

The goal of this work is to (1) compare and evaluate different approaches and (2) build a library that unifies different approaches behind a common frontend.

Contact: Markus Dreseler

Data Management for Non-Volatile Memories

Storage Class Memory (SCM) is a new class of byte-addressable, persistent storage media that blurs the line between memory and storage due to its memory-like latency (~100 ns). It is expected to lead to revolutionary new programming paradigms that give memory-like, byte-level access to non-volatile storage. On the memory side, sharing data across processes and ensuring consistent address spaces across server reboots become important issues to be addressed. On the storage side, atomicity of updates, controlling the visibility of in-flight updates, versioning, and failure/disaster recovery become key data management challenges.
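One of the storage-side challenges, atomicity of updates, is commonly addressed with write-ahead logging; a minimal simulation of the required ordering (the log record must be durable before the update is applied) might look as follows, with `flush` standing in for the cache-line flush and fence instructions real SCM programming requires:

```python
# Sketch of atomic, durable updates via redo logging. On real
# persistent memory, flush() would be cache-line flush + fence
# instructions; here it just appends to a simulated durable area.

log, data, durable = [], {"x": 0}, []

def flush(entry):
    durable.append(entry)      # simulate a persistence barrier

def atomic_update(key, value):
    record = ("redo", key, value)
    log.append(record)
    flush(record)              # WAL rule: log reaches SCM first
    data[key] = value          # only now does the update become visible

atomic_update("x", 42)
# After a crash, replaying `durable` redo records recovers `data`.
```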

  • Data Structures for In-Memory Column Stores using Non-Volatile Memories
    The goal of this master's thesis is to investigate applications of SCM in the context of in-memory column stores: How can in-memory databases profit from large amounts of SCM, and what kinds of data structures are needed to address the scale and possible distribution of data in such systems, especially in the context of transaction processing, logging, and recovery?
  • Distributed In-Memory Column Stores using Non-Volatile Memories
    Distributed database systems leveraging fast interconnects and keeping all data in DRAM scale well, but as memory is volatile, such systems typically achieve durability by replicating data across multiple machines. This thesis will investigate the potential of distributed systems using non-volatile memories as well as how concepts and data structures can be adapted to exploit the durability of SCM.

Contact: Markus Dreseler

Optimized Data Structures for In-Memory Trajectory Data Management

In recent years, rapid advances in location-acquisition technologies have led to large amounts of time-stamped location data. Positioning technologies such as Global Positioning System (GPS)-based, communication-network-based (e.g., 4G or Wi-Fi), and proximity-based (e.g., Radio Frequency Identification) systems enable the tracking of various moving objects, such as vehicles and people. A trajectory is represented by a series of chronologically ordered sampling points. Each sampling point contains spatial information, represented by a multidimensional coordinate in a geographical space, and temporal information, represented by a timestamp. Trajectory data is the foundation for a wide spectrum of services driven and improved by trajectory data mining. By analyzing the movement behavior of individuals or groups of moving objects in large-scale trajectory data, improvements in various fields of application can be achieved.

However, it is a challenging task to manage, store, and process trajectory data. Based on the characteristics of spatio-temporal trajectory data, there are four key challenges: the data volume, the high update rate (data velocity), the query latency of analytical queries, and the inherent inaccuracy of the data. For these reasons, it is a nontrivial task to manage and store the vast amounts of data that accumulate rapidly, especially under hybrid transactional and analytical workloads (so-called HTAP or mixed workloads), which are challenging in terms of space and time complexity.

  • Compression
    The scope of this topic is the analysis and evaluation of different trajectory compression techniques for columnar in-memory databases.
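One classic technique in this space is Ramer-Douglas-Peucker line simplification, which drops sampling points whose removal changes the trajectory shape by less than a tolerance. The sketch below is spatial-only; time-aware variants also exist and would be part of such an evaluation.

```python
import math

# Ramer-Douglas-Peucker simplification of a 2D point sequence.

def perp_dist(p, a, b):
    """Perpendicular distance of point p from the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    num = abs((by - ay) * px - (bx - ax) * py + bx * ay - by * ax)
    den = math.hypot(bx - ax, by - ay)
    return num / den if den else math.hypot(px - ax, py - ay)

def rdp(points, eps):
    if len(points) < 3:
        return points
    # Find the point farthest from the start-end chord.
    idx, dmax = max(
        ((i, perp_dist(points[i], points[0], points[-1]))
         for i in range(1, len(points) - 1)), key=lambda t: t[1])
    if dmax <= eps:
        return [points[0], points[-1]]   # all in-between points dropped
    return rdp(points[:idx + 1], eps)[:-1] + rdp(points[idx:], eps)

track = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)]
print(rdp(track, eps=0.5))  # [(0, 0), (2, -0.1), (3, 5), (5, 7)]
```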

Contact: Keven Richly

Parallel Execution Strategies for Causal Structure Learning

Learning the causal relationships in observational data provides relevant insights for researchers in many domains, such as genomics or manufacturing.
Determining the causal structures, in particular using constraint-based approaches on high-dimensional datasets, becomes a challenge with regard to single-threaded execution times. To overcome this obstacle, we investigate parallel execution strategies on multi-core and heterogeneous, GPU-accelerated systems. In this context, we offer work on different challenges, e.g.,

  • derive, implement & evaluate a GPU-based implementation for multinomial data
  • optimize an existing GPU-based implementation for multivariate normal distributed data for the use in a multi-GPU system
  • investigate & optimize memory utilization for existing causal structure learning algorithms
  • ...
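For intuition, the skeleton phase of a constraint-based (PC-style) algorithm starts from a complete graph and removes edges between variables that test as independent. The toy level-0 sketch below uses a plain correlation threshold instead of a proper statistical test; note that each edge test is independent of the others, which is precisely what parallel and GPU implementations exploit.

```python
# Toy level-0 skeleton step of a PC-style algorithm (hypothetical
# data and threshold). Real implementations also condition on
# neighbor subsets at higher levels and use proper independence tests.

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def skeleton_level0(data, alpha=0.1):
    cols = list(data)
    edges = {(a, b) for i, a in enumerate(cols) for b in cols[i + 1:]}
    # Each edge test is independent -> trivially parallelizable.
    return {e for e in edges if abs(corr(data[e[0]], data[e[1]])) > alpha}

data = {
    "X": [1, 2, 3, 4, 5],
    "Y": [2, 4, 6, 8, 10],   # perfectly correlated with X
    "Z": [5, 3, 4, 3, 5],    # uncorrelated with X and Y
}
print(skeleton_level0(data))  # {('X', 'Y')}
```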

Contact: Christopher Hagedorn