Prof. Dr. Holger Karl

Open Bachelor and Master Theses Topics

Topic areas: 

BA/MA: Topics suitable primarily for a Bachelor or Master thesis; using them otherwise is usually possible as well but needs to be discussed. 

Orchestration and Management

Orchestration and management refer to handling pieces of software, individually or in combination, when they are deployed into a distributed system. A typical example is managing microservices. 

MA: Build chain for multi-version executables

Services are being deployed not only into conventional clouds but increasingly also into new systems like edge clouds, "far" clouds, or so-called fog computing setups. These systems feature highly heterogeneous devices of very different capabilities, with very different connectivity. Dealing with data flow is a problem in such contexts, but so is dealing with software distribution and deployment. One idea is to flexibly distribute different versions of software artefacts, ranging from full-fledged virtual machine images down to mere source code. When deploying such a generalized form of a component, it needs to be built on an edge device: possibly compiling from source code, possibly just downloading Docker layers, etc.

To address this idea, this thesis has two goals. First, design and prototype a build toolchain that is capable of building artefacts based on generalized descriptions of software; this toolchain should leverage and encompass existing CI/CD concepts as far as useful. Second, obtain an understanding of the performance characteristics of using this toolchain on different types of devices, for representative examples of typical microservice software. 
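To give a flavor of what "generalized descriptions" could mean, here is a minimal, purely hypothetical sketch: an artefact lists the forms it is available in, and a dispatcher picks the cheapest form the target device supports. All names (Artefact, the form strings) are illustrative, not part of any existing toolchain.

```python
# Hypothetical sketch: dispatch a generalized artefact description to a build
# strategy, depending on what the target device can handle.
from dataclasses import dataclass

@dataclass
class Artefact:
    name: str
    forms: list  # e.g. ["vm-image", "container", "source"]

def pick_build_strategy(artefact, device_caps):
    """Prefer the cheapest form the device supports: use a ready VM image,
    assemble container layers, or compile from source as a last resort."""
    preference = ["vm-image", "container", "source"]
    for form in preference:
        if form in artefact.forms and form in device_caps:
            return form
    raise ValueError(f"no buildable form of {artefact.name} for this device")

svc = Artefact("auth-service", ["container", "source"])
print(pick_build_strategy(svc, {"container", "source"}))  # container
print(pick_build_strategy(svc, {"source"}))               # source
```

A real toolchain would, of course, also weigh build time and device resources, which is exactly the kind of trade-off the thesis should characterize.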


  • Familiar with build toolchains (Make, Maven, etc.) and microservice software engineering concepts
  • Good knowledge of Linux OS, shell scripts, OS API.
  • Familiarity with cloud computing and typical toolchains clearly a plus 

MA: Placement / Scaling with moveable infrastructure

When deploying and running microservices (or closely related network function chains) in edge or core clouds, typical assumptions about these kinds of infrastructure prevail: it is dependable, does not fail, and does not move. On that basis, many so-called orchestration algorithms have been designed; these algorithms decide, e.g., how many instances of a service to run, where each instance runs, and which instance serves which data flow.

This mindset, however, changes with new types of infrastructure: vehicles can be seen as a moving cloud, but only vehicles in the vicinity of a particular intersection may be of interest. Fleets of drones can similarly act as (very simple, very specialized) service providers, but only for a few minutes at a time: they need to hand over service execution once they run out of battery power and have to be replaced by another drone. For such volatile, evolving infrastructures, there is very little in the literature about suitable orchestration concepts.

The goal of this thesis is hence to identify a suitable model for volatile infrastructure, to cast some typical orchestration problems into that model, and to design and evaluate solutions for them. As this is a fairly open area, the topic itself is fairly open, and evolving the concept is clearly part of the thesis assignment. 
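One conceivable starting point for such a model, sketched here purely as an assumption: nodes (vehicles, drones) are only available during time windows, and a placement decision must, at each step, keep a service on a node that is still up. The greedy migrate-to-longest-lived rule below is a toy baseline, not a proposed solution.

```python
# Illustrative model of volatile infrastructure: each node has an
# (arrival, departure) availability window.
def available(nodes, t):
    """nodes: dict name -> (arrive, depart); return nodes up at time t."""
    return [n for n, (a, d) in nodes.items() if a <= t < d]

def place(nodes, horizon):
    """Greedy placement: keep the current node while it is available,
    otherwise migrate to the available node that stays up longest."""
    plan, current = [], None
    for t in range(horizon):
        up = available(nodes, t)
        if current not in up:
            current = max(up, key=lambda n: nodes[n][1]) if up else None
        plan.append(current)
    return plan

fleet = {"drone-a": (0, 3), "drone-b": (2, 6), "car-c": (4, 8)}
print(place(fleet, 8))
```

Even this toy already exposes the questions the thesis should address: when to migrate proactively, what state transfer costs, and how to handle gaps with no node available.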


  • Familiar with cloud computing concepts and microservices / network function virtualization 
  • Ideally also familiar with vehicle-to-anything (V2X) communication
  • Good modeling skills
  • Some experience in one of: optimization problems, heuristic design, machine learning is useful 

MA: A Simian Army meets Machine Learning - Introducing errors into learning and inference

In conventional distributed software systems, the deliberate introduction of faults has proven to be a powerful tool to ensure that programmers prepare for actual malfunctions of such systems. A popular example of this approach is the so-called "Simian Army" concept developed by Netflix: so-called "monkeys" are little programs that inject misbehavior, from simply killing components of a microservice to disconnecting an entire data center from the network. This Simian Army is (according to Netflix's claims) part of their operational system and has substantially improved the resilience and dependability of their systems by ensuring that programmers actually prepare for the worst, since they experience it every day. 

The idea for this thesis is to evaluate whether this approach can be translated to the case where the components are not developed by programmers but realized as machine-learning agents. While introducing random variation into learning input is a standard technique, we want to check here whether more substantial perturbations, akin to this Simian Army, make sense in ML-controlled or ML-realized environments as well. This can pertain to service components that are part of a user-facing application; it can also pertain to control and management software of a platform itself. These options should be explored in the thesis. Fault injection could happen during training, during inference, or during continuous training, at different system levels. 

The ideal outcome of the thesis is (1) a characterization of which types of services can profit from which type of fault injection and (2) a prototypical implementation of a subset of such services plus fault injection, with a demonstration of improved dependability. 
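As a minimal sketch of what "monkey"-style fault injection around an ML component could look like (all names and probabilities invented for illustration): a wrapper that, with some probability, kills the inference component or corrupts its input, forcing the surrounding system to cope.

```python
# Toy chaos wrapper around an inference function; purely illustrative.
import random

def chaos_wrapper(predict, p_noise=0.1, p_kill=0.05, rng=None):
    rng = rng or random.Random()
    def wrapped(x):
        r = rng.random()
        if r < p_kill:                        # simulate a killed component
            raise RuntimeError("inference component killed by monkey")
        if r < p_kill + p_noise:              # simulate corrupted input
            x = [v + rng.gauss(0, 1.0) for v in x]
        return predict(x)
    return wrapped

model = lambda x: sum(x)          # stand-in for a real inference function
chaotic = chaos_wrapper(model, rng=random.Random(42))
results = []
for _ in range(20):
    try:
        results.append(chaotic([1.0, 2.0]))
    except RuntimeError:
        results.append(None)      # caller must handle failures explicitly
print(results)
```

The interesting thesis questions start where this sketch stops: which perturbations are meaningful for training versus inference, and at which system level they should be injected.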


  • Familiar with cloud computing concepts and microservices / network function virtualization 
  • Some software engineering experience, in the sense of DevOps and continuous integration/delivery
  • Some machine-learning background 
  • Good programming skills; distributed platforms a plus  



Profiling

In the context here, profiling is the process of obtaining quantitative data about a piece of software. For example: to handle a given load, how many resources are necessary? Profiling appears in various contexts, and profiling data is usually a stepping stone for management and orchestration systems. 


MA: Workload characterization of ML workloads

When trying to run machine-learning workloads in resource-limited environments, an understanding of their performance characteristics is useful: how much does an ML workload profit from additional cores or additional memory? What does that mean, specifically, for inference or training? If, e.g., training can be split over multiple machines, what data flows ensue, and what are the performance impacts? Overall, how malleable are these workloads? 

The goal of this thesis is to extend the notion of a performance profile (so far used mostly for conventional applications) to ML workloads. Then, an existing profiling environment should be extended to deal with such workloads, and the concept should be proven by example characterizations of representative workloads.
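The basic mechanics of recording such a profile can be sketched as follows; the toy CPU-bound workload below stands in for real training or inference steps, and the function names are invented for illustration.

```python
# Sketch: measure how a toy workload scales with the number of workers,
# yielding one data point of a performance profile per configuration.
import time
from multiprocessing import Pool

def work(n):
    return sum(i * i for i in range(n))

def profile(total=400_000, workers_list=(1, 2, 4)):
    profile_data = {}
    for w in workers_list:
        chunks = [total // w] * w          # split the workload evenly
        start = time.perf_counter()
        with Pool(w) as pool:
            pool.map(work, chunks)
        profile_data[w] = time.perf_counter() - start
    return profile_data                    # workers -> wall-clock seconds

if __name__ == "__main__":
    print(profile())
```

A real profiling environment would additionally record memory, data flows between machines, and model-level metrics such as accuracy per unit time.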


  • Very good understanding of machine-learning techniques and practical implementations
  • Good understanding of concepts like microservices
  • Good implementation and system skills (e.g., scripting) are a clear plus


Machine Learning

Topics here pertain both to the application of machine learning to operate a network or a distributed system and to running machine-learning applications inside a distributed system. 

MA: Manage competing ML workflows

Suppose there are limited resources available, for example, in an edge cloud environment. Suppose further that these resources should be shared among conventional applications (e.g., web services), machine-learning inference applications, and machine-learning training applications. In a limited environment, tradeoffs between these applications will be necessary. 

The goal of this thesis is to devise a resource management approach that assigns resources to these competing applications and takes their varying requirements into account. E.g., an ML training application might well be postponed somewhat, but at some point, model accuracy will deteriorate rapidly. Hence, a new concept of fair resource allocation is necessary; that concept needs to be developed and realized by the resource management approach. A proof-of-concept realization should demonstrate that the desired goals are indeed achieved, using representative examples of both conventional and ML applications.
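As a reference point, the classic max-min fair allocation (mentioned in the requirements below) can be computed by progressive filling; the thesis would have to extend such a notion with ML-specific urgency, e.g., training jobs that tolerate postponement only up to a point. The demand numbers are invented.

```python
# Progressive filling: repeatedly split the remaining capacity equally among
# still-unsatisfied demands, until capacity or demands run out.
def max_min_fair(capacity, demands):
    """Allocate `capacity` so that no application can gain without an
    application with an equal or smaller share losing."""
    alloc = {name: 0.0 for name in demands}
    remaining = dict(demands)
    cap = capacity
    while remaining and cap > 1e-9:
        share = cap / len(remaining)
        for name in list(remaining):
            give = min(share, remaining[name])
            alloc[name] += give
            cap -= give
            remaining[name] -= give
            if remaining[name] <= 1e-9:
                del remaining[name]
    return alloc

demands = {"web": 2.0, "inference": 4.0, "training": 10.0}
print(max_min_fair(10.0, demands))  # web fully served; the rest split evenly
```

The point of the thesis is precisely that this notion is too static for ML workloads: "training" above might deserve more than its fair share shortly before its accuracy deadline.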


  • Very good understanding of machine-learning techniques and practical implementations
  • Very good modeling skills  
  • Good understanding of resource management concepts, e.g., various fairness concepts like max-min fairness
  • Good implementation and practical system skills

MA: Line-rate ML

Machine-learning applications can be used in situations where very fast decisions are necessary, e.g., when operating on individual packets in a router or switch. A conventional approach - receive the packet, copy it into user space, let an ML inference application work on it, and inject the packet back into the network stack - works fine if there is ample time. But when packets need to be processed at the speed at which they arrive, without causing delay - so-called "line-rate processing" - such simplistic approaches do not suffice.

The goal of this thesis is therefore to investigate techniques by which ML-based inference (possibly also input into learning) can be achieved at line rate. The thesis entails concept development, prototypical implementation, example selection, and demonstration of a proof of concept. 
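To make the time pressure concrete, a back-of-envelope sketch: the per-packet budget at line rate, together with a "model" flattened to a few integer comparisons, roughly the complexity class that fits a P4-style match-action pipeline. The classification rules are invented for illustration.

```python
# Back-of-envelope per-packet time budget and a threshold-only classifier.
def per_packet_budget_ns(link_gbps, packet_bytes):
    """Time available per packet if every packet is handled in order."""
    bits_per_packet = packet_bytes * 8
    return bits_per_packet / (link_gbps * 1e9) * 1e9   # nanoseconds

def classify(pkt_len, dst_port):
    """A 'model' reduced to threshold tests: no floats, no matrix math."""
    if dst_port == 53 and pkt_len > 512:
        return "suspicious"
    if pkt_len < 64:
        return "runt"
    return "normal"

print(round(per_packet_budget_ns(10, 1500)))  # 1200 ns at 10 Gbit/s
print(classify(600, 53))                      # suspicious
```

At roughly a microsecond per full-size packet on a 10 Gbit/s link - and far less for small packets or faster links - a round trip through user space is clearly off the table, which motivates the in-path techniques this thesis should explore.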


  • Very good understanding of machine-learning techniques and practical implementations
  • Very good operating-system-level implementation skills (e.g., device drivers) are a real plus! 
  • Experience with low-level hardware (e.g., network drivers, P4) are a real plus! 

BA: Quality of Learning

For conventional applications (e.g., three-tier Web servers, video streaming, gaming), the notions of Quality of Service and Quality of Experience are well understood: they describe quantitative, low-level measurable metrics (like data rate) or user-perceived metrics (like the mean opinion score for video quality).

For machine-learning applications, specifically when training happens in a distributed fashion, such metrics are not well developed and not tied in with these conventional metrics. It is the goal of this thesis to develop concepts for suitable machine-learning metrics (e.g., quality of learning, model accuracy, ...) and to characterize example applications using these metrics.
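One candidate metric, sketched here purely as an example of the kind of concept the thesis could develop, is time-to-accuracy: how long training needs to first reach a target accuracy. The accuracy trace below is made up; a real study would log it from training runs.

```python
# Hypothetical "quality of learning" metric: first time a target accuracy
# is reached in a (time, accuracy) trace, or None if it never is.
def time_to_accuracy(trace, target):
    for t, acc in trace:
        if acc >= target:
            return t
    return None

trace = [(10, 0.42), (20, 0.61), (30, 0.74), (40, 0.81), (50, 0.83)]
print(time_to_accuracy(trace, 0.80))  # 40
print(time_to_accuracy(trace, 0.95))  # None
```

Tying such a metric to conventional QoS metrics - e.g., how time-to-accuracy degrades as the available data rate shrinks - is exactly the connection the thesis should establish.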


  • Very good understanding of machine-learning techniques and practical implementations

BA: Distributed ML over Bundle

Distributed machine learning has received considerable attention, e.g., in the form of Federated Learning (Google). Most schemes assume constant network connectivity to exchange data or model updates. But what happens if distributed learning takes place in an environment where devices are only intermittently connected?

For such environments, protocols to exchange data do exist, for example, the Bundle protocol from the Delay-Tolerant Networking community. The goal of this thesis is to take scenarios for distributed ML and check what happens if the underlying network is intermittently connected and a protocol like Bundle is used. How does this affect learning progress? Can data forwarding be prioritized meaningfully, knowing that it is machine-learning related?
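The core issue can be illustrated with a toy federated-averaging round in which only the updates that made it through the network contribute; the one-parameter model and connectivity pattern are invented and have nothing to do with the Bundle protocol itself.

```python
# Toy federated averaging under intermittent connectivity: rounds in which
# no client update arrives leave the global model unchanged.
def fed_avg_round(global_w, client_updates):
    if not client_updates:
        return global_w          # network partition: model stays put
    return sum(client_updates) / len(client_updates)

# Each inner list: client updates that arrived in that round.
rounds = [[1.0, 3.0], [], [4.0], [], []]
w = 0.0
history = []
for updates in rounds:
    w = fed_avg_round(w, updates)
    history.append(w)
print(history)  # [2.0, 2.0, 4.0, 4.0, 4.0]
```

The thesis questions sit on top of this picture: how stale updates delivered late by a store-and-forward protocol should be weighted, and whether ML-aware prioritization of bundles speeds up convergence.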


  • Good understanding of machine-learning techniques and practical implementations
  • Good understanding of networking basics
  • Good practical implementation skills 


Wireless networks

Wireless networks have characteristic challenges not present in wired networks: Users move around, wireless channel quality can change rapidly, the transmission techniques are considerably more complex and have substantially more control options. In particular, the resource management problems become harder but must be solved faster. 

Machine learning for Resource Management in CoMP networks

Coordinated multi-point (CoMP) is a cellular transmission technique where a mobile user is supported by multiple base stations simultaneously, for example, to stabilize throughput for users at the edge of wireless cells. This entails complex resource management and scheduling problems (which PRBs of which cell to use, how to schedule users across multiple cells depending on the specific CoMP technique, ...). In prior work, we have tackled the downlink CoMP problem using reinforcement learning in a simplified model. The goal of this thesis would be to make the wireless model more precise (possibly restructuring the learning problem in multiple ways) or to look at the uplink case. To undertake this thesis, you should have at least some prior exposure to machine learning and networking (wireless networking strongly recommended). 

Assigned Topics and ongoing theses

Completed Topics

MA: Terraform goes IoT (COMPLETED)

Cloud computing happens in a competitive environment: Multiple vendors offer cloud resources under incompatible APIs. On the other hand, spreading an application (e.g., a microservice-based chain of components) over multiple clouds from different vendors can have commercial and technical advantages. Terraform [1] is a popular tool to bridge API gaps between vendors and hide them under a uniform interface.

The goal of this thesis is to extend this idea to also incorporate IoT devices and very slim-lined "far cloud" scenarios into Terraform: extend Terraform in such a fashion that components can be deployed in such contexts as well. This entails obtaining an understanding of the options to run software on such devices and of Terraform itself. A subgoal is to design a proper extension; the proof of concept lies in demonstrating the ability to run a cloud application via Terraform on either a conventional cloud or an IoT/far cloud.


  • Good knowledge of Linux OS, shell scripts, OS API. 
  • Experience with cloud computing and typical toolchains clearly a plus.

MA: Orchestrate microservice chains with WebAssemblies (COMPLETED)

Deploying microservices in a complex environment comprising core, edge, and far clouds requires so-called orchestration functions: decide how many instances of a component are needed to deal with load, where which component runs, which instance deals with which traffic flows, etc. This entails lifecycle management of these components: starting, stopping, migrating, state transfer, etc. Typically, components are realized as virtual machine images or containers, which are relatively easy to manage but heavyweight.

An alternative idea is to use WebAssemblies [1]. As they come from a browser context, it is not clear whether they are suitable to act as components in such chains. The goal of this thesis is to develop a concept of how to integrate WebAssemblies into such chains, which lifecycle management approaches are suitable, and how they can be orchestrated. As a proof of concept, this orchestration functionality should be integrated into a common open-source orchestrator, e.g., Open Source MANO [2]. 


  • Familiar with microservices, virtual function chains, or similar concepts
  • Good software engineering skills   
  • Good knowledge of Linux OS, shell scripts, OS API.
  • Familiarity with cloud computing and typical toolchains clearly a plus