Hasso-Plattner-Institut
Prof. Dr. Holger Karl
 

Open Bachelor and Master Thesis Topics

Topic areas: 

BA/MA: Topics marked BA or MA are primarily suitable for a Bachelor or Master thesis, respectively; using them otherwise is usually possible but needs to be discussed. 


 

Orchestration and Management

Orchestration and management refers to handling pieces of software, individually or in combination, when they are deployed into a distributed system. A typical example is handling microservices. 

MA: Implement a chaotic network (reserved)

Today's networks are increasingly built from off-the-shelf building blocks that are configured and administered by software. This development requires optimized engineering and operating processes: classical software engineering tools (keywords: CI/CD pipeline, DevOps) and practices from distributed systems (keyword: Chaos Engineering). For this thesis, you integrate both ideas into a simple emulation testbed and discuss your design decisions. 

  • Task: Build a continuous delivery (CD) pipeline and implement and evaluate a 'simian army' on top.
  • Background: Netflix introduced the concept of a 'simian army' that randomly disrupts the deployment environment. This forces the development team to see failure as the normal case and to build more resilient solutions. We want to implement this idea in an emulation setup for networks. Your tasks are twofold: first, build a simple CD pipeline to deploy a network configuration into an emulated, running network. Second, disrupt the running network randomly by 'simian soldiers' that, e.g., shut down or delay switches, links, and servers, or report unused resources; see the sketch after this list.
  • Question: Building a CD pipeline with a simian army for reconfiguring a software-defined network: what does a good system design look like?
  • Literature: Netflix's blog post and the open-source software Containernet are good starting points.
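
To make the 'simian soldier' idea concrete, below is a minimal sketch on top of Containernet's Mininet-derived Python API. This is an illustration only: the topology, the Docker image, and all timing parameters are assumptions, not part of the assignment.

```python
#!/usr/bin/env python3
"""A minimal 'simian soldier': randomly fails links in a small
Containernet emulation. Everything here is illustrative."""
import random
import time

from mininet.net import Containernet
from mininet.node import Controller
from mininet.log import setLogLevel


def build_net():
    """Two Docker 'servers' behind two switches."""
    net = Containernet(controller=Controller)
    net.addController('c0')
    d1 = net.addDocker('d1', ip='10.0.0.251', dimage='ubuntu:trusty')
    d2 = net.addDocker('d2', ip='10.0.0.252', dimage='ubuntu:trusty')
    s1, s2 = net.addSwitch('s1'), net.addSwitch('s2')
    net.addLink(d1, s1)
    net.addLink(s1, s2)
    net.addLink(s2, d2)
    net.start()
    return net


def link_monkey(net, rounds=5, outage=3.0):
    """Each round: pick a random link, fail it, later restore it."""
    for _ in range(rounds):
        link = random.choice(net.links)
        a, b = link.intf1.node.name, link.intf2.node.name
        net.configLinkStatus(a, b, 'down')
        time.sleep(outage)
        net.configLinkStatus(a, b, 'up')
        time.sleep(outage)


if __name__ == '__main__':
    setLogLevel('info')
    net = build_net()
    try:
        link_monkey(net)
    finally:
        net.stop()
```

A real simian army would run such soldiers continuously against a network deployed through the CD pipeline rather than against a hand-built topology.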

MA: Build chain for multi-version executables

Services are being deployed into conventional clouds, but more and more also into new systems like edge clouds, "far" clouds, or so-called fog computing setups. These systems feature highly heterogeneous devices of very different capabilities, with very different connectivity. Dealing with data flow is a problem in such contexts, but so is dealing with software distribution and deployment. One idea is to flexibly distribute different versions of software artefacts, ranging from full-fledged virtual machine images down to mere source code. When deploying such a generalized form of a component, it needs to be built on an edge device: possibly compiling from source code, possibly just downloading Docker layers, etc.

To address this idea, this thesis has two goals. First, design and prototype a build toolchain that is capable of building artefacts based on generalized descriptions of software (a toy sketch of such a description follows below); this toolchain should leverage and encompass existing CI/CD concepts as much as useful. Second, obtain an understanding of the performance characteristics of using this toolchain on different types of devices, for representative examples of typical microservice software. 
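
To illustrate what a 'generalized description' could mean, here is a hypothetical sketch of a build dispatcher in Python. The description format, field names, registry/repository URLs, and build strategies are invented for illustration; designing the real format is part of the thesis.

```python
"""Hypothetical 'generalized artefact' and a build dispatcher for it.
The description format and all names are invented for illustration."""
import subprocess

ARTEFACT = {
    'name': 'demo-service',
    'versions': [
        # heaviest first: prebuilt image, then plain source code
        {'kind': 'docker-image', 'ref': 'registry.example/demo:1.0'},
        {'kind': 'source', 'repo': 'https://git.example/demo.git',
         'build': ['make', 'all']},
    ],
}


def build(artefact, caps):
    """Pick the first version the target device can handle and build it."""
    for version in artefact['versions']:
        if version['kind'] == 'docker-image' and caps.get('docker'):
            subprocess.run(['docker', 'pull', version['ref']], check=True)
            return version['ref']
        if version['kind'] == 'source' and caps.get('compiler'):
            subprocess.run(['git', 'clone', version['repo'],
                            artefact['name']], check=True)
            subprocess.run(version['build'], cwd=artefact['name'], check=True)
            return artefact['name']
    raise RuntimeError('no buildable version for this device')


# Example: an edge device that has Docker but no compiler toolchain
if __name__ == '__main__':
    print(build(ARTEFACT, {'docker': True, 'compiler': False}))
```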

Prerequisites:

  • Familiar with build toolchains (Make, Maven, etc.) and microservice software engineering concepts
  • Good knowledge of Linux OS, shell scripts, OS API.
  • Familiarity with cloud computing and typical toolchains clearly a plus 
     

MA: Placement / Scaling with moveable infrastructure

When deploying and running microservices (or closely related network function chains) to and in edge or core clouds, typical assumptions about these kinds of infrastructure prevail: the infrastructure is dependable, does not fail, does not move. On that basis, many so-called orchestration algorithms have been designed; these algorithms decide, e.g., how many instances of a service to run, where each instance runs, and which instance serves which data flow.

This mindset, however, changes with new types of infrastructure: vehicles can be seen as a moving cloud, but only vehicles in the vicinity of a particular intersection may be of interest. Fleets of drones can similarly act as (very simple, very specialized) service providers, but they need to hand over service execution once they run out of battery power and have to be replaced by another drone for a few minutes. For such volatile, evolving infrastructures, there is very little in the literature about suitable orchestration concepts.

The goal of this thesis is hence to identify a suitable model for volatile infrastructure, to cast some typical orchestration problems into that model, and to design solutions and evaluate their performance. As this is a fairly open area, the topic itself is fairly open, and evolving the concept is clearly part of the thesis assignment. A toy model sketch follows below. 
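
As a minimal illustration of what 'volatile infrastructure' could mean formally, here is a toy time-varying substrate model, assuming the networkx library; node lifetimes and capacities are made-up.

```python
"""Toy model of a volatile substrate: snapshots of an infrastructure
graph in which drones appear and disappear. Assumes networkx; all
lifetimes and capacities are made-up."""
import random

import networkx as nx


def substrate_at(t, horizon=10, n_drones=5, seed=0):
    """Return the infrastructure graph available at time step t."""
    rng = random.Random(seed)   # same seed -> one consistent trajectory
    g = nx.Graph()
    g.add_node('edge-cloud', cpu=16)            # fixed infrastructure
    for i in range(n_drones):
        arrive = rng.randrange(horizon)
        depart = arrive + rng.randrange(2, 5)   # short, battery-limited life
        if arrive <= t < depart:
            g.add_node(f'drone{i}', cpu=1)
            g.add_edge('edge-cloud', f'drone{i}', bw=10)
    return g


for t in range(10):
    present = sorted(n for n in substrate_at(t) if n != 'edge-cloud')
    print(f't={t}: {present}')
```

An orchestration algorithm would then be evaluated against such a trajectory of snapshots rather than against a single static graph.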

Prerequisites: 

  • Familiar with cloud computing concepts and microservices / network function virtualization 
  • Ideally also familiar with vehicle-to-everything (V2X) communication   
  • Good modeling skills
  • Some experience in one of: optimization problems, heuristic design, machine learning is useful 
     

MA: A Simian Army meets Machine Learning - Introducing errors into learning and inference

In conventional distributed software systems, the deliberate introduction of faults has proven to be a powerful tool to ensure that programmers prepare for actual malfunctions of such systems. A popular example of this approach is the so-called "Simian Army" concept developed by Netflix: so-called "monkeys" are little programs that inject misbehavior, from simply killing components of a microservice to disconnecting an entire data center from the network. This Simian Army is (according to Netflix's claims) part of their operational system and has substantially improved the resilience and dependability of their systems by ensuring programmers actually prepare for the worst, since they experience it every day. 

The idea for this thesis is to evaluate whether this idea can be translated to the case where the components are not developed by programmers but realized as machine-learning agents. While introducing random variation in learning input is a standard technique, we want to check here whether more substantial perturbations, akin to this Simian Army, make sense in ML-controlled or ML-realized environments as well. This can pertain to service components that are part of a user-facing application; it can also pertain to control and management software of a platform itself. These options should be explored in the thesis. Fault injection could happen during training, during inference, or during continuous training, at different system levels; a toy sketch of inference-time fault injection follows below. 
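
As a minimal illustration, here is a toy sketch of inference-time fault injection in Python/NumPy; the 'model' (a single matrix multiply) and the fault types are assumptions for illustration only.

```python
"""Toy inference-time fault injection; the 'model' and the fault types
are illustrative assumptions."""
import numpy as np

rng = np.random.default_rng(42)


def model(x, w):
    """Stand-in model: a single linear layer."""
    return x @ w


def chaos_infer(x, w, p_fault=0.3):
    """Run inference, but with probability p_fault inject one fault."""
    if rng.random() > p_fault:
        return model(x, w)
    fault = rng.choice(['noisy_input', 'stale_weights', 'kill'])
    if fault == 'noisy_input':            # corrupted input data
        return model(x + rng.normal(0, 0.5, x.shape), w)
    if fault == 'stale_weights':          # a partially lost model update
        mask = rng.random(w.shape) < 0.2
        return model(x, np.where(mask, 0.0, w))
    raise TimeoutError('inference replica killed')   # component failure


x = rng.normal(size=(4, 3))
w = rng.normal(size=(3, 2))
try:
    print(chaos_infer(x, w))
except TimeoutError as err:
    print('caller must tolerate:', err)
```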

The ideal outcome of the thesis is (1) a characterization which types of services can profit from what type of fault injection and (2) a prototypical implementation of a subset of such services plus fault injection, with a demonstration of improved dependability. 

Prerequisites: 

  • Familiar with cloud computing concepts and microservices / network function virtualization 
  • Some software engineering experience, in the sense of DevOps and continuous integration/delivery
  • Some machine-learning background 
  • Good programming skills; distributed platforms a plus  

 

Reconfigurable Optical Networks & Orchestration

Classic electrically switched networks forward packets based on header information using electrical circuits. However, over large distances and at high data rates, packets are often transported as an optical signal on glass fiber. This requires an optical-to-electrical conversion at the ends of each fiber. Optically switched networks direct the optical signal directly from the incoming to the outgoing port without such a conversion, e.g., by a system of small mirrors. This can improve the network's utilization and reduce cost, energy consumption, and queuing delays compared to electrically switched networks. Modern optical switches reconfigure the light-paths on a millisecond time scale. This has already opened the door to a fine-grained optimization of the network topology for observed traffic patterns. 
We expect even more gains if we control the traffic pattern e.g. by deliberately placing microservices in the network. We offer three thesis topics that focus on algorithmic approaches to jointly optimize the network topology and the traffic pattern.

MA: Formalize and evaluate an ILP

  • Task: Formalize and evaluate a linear programming approach to jointly place functions and reconfigure an optical network.
  • Background: In the setting above, the problem of finding an optimal light-path topology and an optimal microservice placement can be described as an integer linear (mathematical) program, or ILP for short. ILPs are a powerful tool to describe and solve real-world problems and a popular, well-studied optimization technique. A toy formulation is sketched after this list.
  • Question: How much can the maximal link utilization be reduced by jointly optimizing topology and traffic matrix via an ILP?
  • Literature: Google's paper Jupiter evolving is a good starting point.
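
To give a flavour of such a formulation, here is a toy joint placement-plus-topology ILP in Python, assuming the PuLP library. It allows single-hop routing only, places a single service instance, and uses made-up traffic volumes; the actual model in the thesis would be substantially richer.

```python
"""Toy joint placement + light-path ILP sketch, assuming PuLP.
Single-hop routing, one service instance, made-up numbers."""
import pulp

nodes = ['A', 'B', 'C', 'D']
sources = {'A': 40.0, 'B': 30.0}   # Gbit/s toward the service, made up
ports = 1                          # outgoing transceivers per source
cap = 100.0                        # Gbit/s per light-path

pairs = [(s, n) for s in sources for n in nodes if n != s]

prob = pulp.LpProblem('joint_placement_topology', pulp.LpMinimize)
x = pulp.LpVariable.dicts('place', nodes, cat='Binary')   # service site
y = pulp.LpVariable.dicts('lp', pairs, cat='Binary')      # light-paths
f = pulp.LpVariable.dicts('flow', pairs, lowBound=0, upBound=1)
u = pulp.LpVariable('u_max', lowBound=0)
prob += u                                   # minimize max utilization

prob += pulp.lpSum(x[n] for n in nodes) == 1     # exactly one instance
for s, vol in sources.items():
    prob += pulp.lpSum(f[(s, n)] for n in nodes if n != s) == 1
    prob += pulp.lpSum(y[(s, n)] for n in nodes if n != s) <= ports
    for n in nodes:
        if n == s:
            continue
        prob += f[(s, n)] <= y[(s, n)]      # need the light-path...
        prob += f[(s, n)] <= x[n]           # ...and the service there
        prob += vol * f[(s, n)] <= cap * u  # utilization bound

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print('service placed at:', [n for n in nodes if x[n].value() > 0.5])
print('light-paths:', [p for p in pairs if y[p].value() > 0.5])
print('max link utilization:', u.value())
```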

MA: Study heuristic algorithms

  • Task: Implement and evaluate multiple heuristic algorithms. 
  • Background: The joint optimization of traffic and topology as described above is a hard problem. Classical optimization methods can lead to long execution times for larger problem instances. To overcome this issue, we propose a ping-pong metaheuristic: alternate between a heuristic optimizing the traffic pattern and a heuristic optimizing the light-path topology (see the skeleton after this list).
  • Question: Are there any pairs of heuristics particularly suited for the ping-pong metaheuristic and what are the resulting gains? 
  • Literature: An older survey on light-path topology design and a more recent survey on controlling network traffic, in terms of allocating resources for virtualized network functions, are starting points. If a paywall causes problems, contact valentin.kirchner{at}hpi{dot}de. 
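
Here is a skeleton of the ping-pong alternation, with the two inner heuristics replaced by trivial stand-ins on a toy cost function (assumptions for illustration; real topology and traffic heuristics would take their place):

```python
"""Skeleton of the ping-pong metaheuristic as alternating minimization
on a toy cost; the two 'heuristics' below are trivial stand-ins."""


def cost(topo, traffic):
    # Toy surrogate for, e.g., maximum link utilization.
    return (topo - 3.0) ** 2 + (traffic - topo) ** 2


def traffic_heuristic(topo):
    """Stand-in: best traffic pattern for a fixed topology."""
    return topo


def topology_heuristic(traffic):
    """Stand-in: best light-path topology for fixed traffic."""
    return (3.0 + traffic) / 2.0


def ping_pong(topo=0.0, traffic=0.0, rounds=20, eps=1e-6):
    prev = cost(topo, traffic)
    for i in range(rounds):
        traffic = traffic_heuristic(topo)
        topo = topology_heuristic(traffic)
        now = cost(topo, traffic)
        if prev - now < eps:    # stop once alternating stops helping
            return topo, traffic, i
        prev = now
    return topo, traffic, rounds


print(ping_pong())   # converges toward topo = traffic = 3
```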

MA: Train a reinforcement learning agent

  • Task: Implement and evaluate a reinforcement learning (RL) solution.
  • Background: In the setting above, we can think of an aggregated traffic flow as a directed acyclic graph which needs to be embedded into the network. Such traffic flows might arrive as requests to the network's management system. That system then needs to decide online which requests to accept and whether it should reconfigure the network to (hopefully) accept more requests in the future. As there is a decision to make at each time step, with the aim of maximizing a reward over a larger time horizon, the problem seems suitable for a reinforcement learning approach; a tabular toy example follows after this list.
  • Question: What gains in terms of accepted requests can be achieved by using an RL agent to decide on admission and reconfiguration?
  • Literature: This article is a starting point (access via HPI's network).
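
As a minimal illustration of the decision structure, here is a toy tabular Q-learning agent for a made-up admission/reconfiguration environment; states, rewards, and dynamics are invented for illustration.

```python
"""Toy tabular Q-learning for admission control; the 'network' with
four capacity states is a made-up stand-in."""
import random

rng = random.Random(0)
ACTIONS = ('accept', 'reject', 'reconfigure')


def step(free, action):
    """Made-up environment: one unit-size request arrives per step."""
    if action == 'accept' and free > 0:
        return free - 1, 1.0            # reward for an accepted request
    if action == 'reconfigure':
        return min(free + 1, 3), -0.2   # frees capacity at a small cost
    return free, 0.0                    # reject, or accept that blocks


Q = {(c, a): 0.0 for c in range(4) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1
state = 3
for _ in range(20000):
    act = (rng.choice(ACTIONS) if rng.random() < eps
           else max(ACTIONS, key=lambda a: Q[(state, a)]))
    nxt, reward = step(state, act)
    target = reward + gamma * max(Q[(nxt, b)] for b in ACTIONS)
    Q[(state, act)] += alpha * (target - Q[(state, act)])
    state = nxt

for c in range(4):   # learned policy per remaining capacity
    print('capacity', c, '->', max(ACTIONS, key=lambda a: Q[(c, a)]))
```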

 

Profiling

In the context here, profiling is the process of obtaining quantitative data about a piece of software. For example: to handle a given amount of load, how many resources are necessary? Profiling appears in various contexts, and profiling data is usually a stepping stone for management and orchestration systems. 

 

MA: Workload characterization of ML workloads

When trying to run machine-learning workloads in resource-limited environments, an understanding of their performance characteristics is useful: how much does an ML workload profit from additional cores or additional memory? What does that mean, specifically, for inference or training? If, e.g., training can be split over multiple machines, what data flows ensue, and what are the performance impacts? Overall, how malleable are these workloads? 

The goal of this thesis is to extend the notion of a performance profile (so far used mostly for conventional applications) to ML workloads. Then, an existing profiling environment should be extended to deal with such workloads, and the concept should be proven by example characterizations of representative workloads. A micro-example of what one profile measurement could look like follows below.
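
Here is a hedged micro-sketch measuring throughput of a toy CPU-bound 'workload' as a function of worker processes; a real profile would substitute actual training or inference steps for the dummy kernel and would also vary memory, I/O, and network dimensions.

```python
"""Throughput of a toy CPU-bound 'workload' vs. worker processes; a
real profile would substitute actual ML training/inference steps."""
import time
from concurrent.futures import ProcessPoolExecutor


def kernel(n):
    """CPU-bound stand-in for one training step."""
    s = 0
    for i in range(n):
        s += i * i
    return s


def profile(workers, tasks=16, n=300_000):
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(kernel, [n] * tasks))
    return tasks / (time.perf_counter() - start)


if __name__ == '__main__':
    for w in (1, 2, 4, 8):
        print(f'{w} workers: {profile(w):6.1f} steps/s')
```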

Prerequisites:

  • Very good understanding of machine-learning techniques and practical implementations
  • Good understanding of concepts like microservices
  • Good implementation and system skills (e.g., scripting) are a clear plus
     

 

Machine Learning


Topics here pertain both to the application of machine learning to operate a network / a distributed system and to running machine-learning applications inside a distributed system. 

MA: Manage competing ML workflows


Suppose there are limited resources available, for example, in an edge cloud environment. Suppose further that these resources should be shared among conventional applications (e.g., web services), machine-learning inference applications, and machine-learning training applications. In a limited environment, tradeoffs between these applications will be necessary. 

The goal of this thesis is to devise a resource management approach that assigns resources to these competing applications and takes their varying requirements into account. E.g., an ML training application might well be postponed somewhat, but at some point, model accuracy will deteriorate rapidly. Hence, a new concept of fair resource allocation is necessary; that concept needs to be developed and realized by the resource management approach. A proof-of-concept realization should demonstrate that the desired goals are indeed achieved, using representative examples of both conventional and ML applications. (A classic baseline, max-min fairness, is sketched below.)
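
As a classic baseline to build on, here is a sketch of progressive-filling max-min fairness; the workloads and their demands are made-up numbers.

```python
"""Progressive-filling max-min fairness among competing workloads;
demands (in 'CPU cores') are made-up numbers."""


def max_min_share(capacity, demands):
    """Repeatedly split the remaining capacity equally among the
    workloads whose demand is not yet satisfied."""
    alloc = {w: 0.0 for w in demands}
    active = set(demands)
    while active and capacity > 1e-9:
        share = capacity / len(active)
        for w in sorted(active):
            give = min(share, demands[w] - alloc[w])
            alloc[w] += give
            capacity -= give
            if demands[w] - alloc[w] <= 1e-9:
                active.discard(w)
    return alloc


# a web service, ML inference, and ML training competing for 8 cores
print(max_min_share(8.0, {'web': 2.0, 'infer': 3.0, 'train': 6.0}))
```

The thesis would replace these static demands with time-varying requirements, e.g., a training job whose urgency grows as its model accuracy deteriorates.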

Prerequisites:

  • Very good understanding of machine-learning techniques and practical implementations
  • Very good modeling skills  
  • Good understanding of resource management concepts, e.g., various fairness concepts like max-min fairness
  • Good implementation and practical system skills
     

MA: Line-rate ML

Machine-learning applications can be used in situations where very fast decisions are necessary, e.g., when operating on individual packets in a router or switch. A conventional approach - receive the packet, copy it into user space, let an ML inference application work on it, and inject the packet back into the network stack - works fine if there is ample time. But when packets need to be processed at the speed at which they arrive, without causing delay - so-called processing "at line rate" - such simplistic approaches do not suffice.

The goal of this thesis is therefore to investigate techniques for achieving ML-based inference (possibly also input into learning) at line rate. The thesis entails concept development, prototypical implementation, example selection, and demonstration of a proof of concept; one possible direction is sketched below. 
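
One possible direction, sketched here purely as an illustration, is to 'compile' a trained model into flat comparisons of the kind that data-plane primitives (e.g., P4 tables) could express. The tree below is hand-made, not trained, and the packet features are invented.

```python
"""Illustrative only: 'compiling' a tiny decision tree into flat
comparisons of the kind a data plane could execute."""

# hand-made stump over two per-packet features
TREE = ('pkt_len', 512,
        ('iat_us', 50, 'attack', 'video'),   # small packets
        'bulk')                              # large packets


def classify(features, node=TREE):
    """Plain comparisons: no packet copy, no user-space round trip."""
    if isinstance(node, str):
        return node                      # leaf: the predicted class
    feat, threshold, low, high = node
    nxt = low if features[feat] <= threshold else high
    return classify(features, nxt)


print(classify({'pkt_len': 100, 'iat_us': 20}))    # -> attack
print(classify({'pkt_len': 1400, 'iat_us': 20}))   # -> bulk
```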

      
Prerequisites:

  • Very good understanding of machine-learning techniques and practical implementations
  • Very good operating-system-level implementation skills (e.g., device drivers) are a real plus! 
  • Experience with low-level hardware (e.g., network drivers, P4) is a real plus! 
     

MA: Machine learning for Resource Management in CoMP networks

Coordinated multi-point (CoMP) is a cellular transmission technique where a mobile user is supported by multiple base stations simultaneously, for example, to stabilize throughput for users at the edge of wireless cells. This entails complex resource management and scheduling problems (which physical resource blocks (PRBs) of which cell to use; how to schedule these users across multiple cells, depending on the specific CoMP technique; ...). In prior work, we have tackled the downlink CoMP problem by using reinforcement learning in a simplified model. The goal of this thesis would be to make the wireless model more precise (possibly restructuring the learning problem in multiple ways) or to look at the uplink case. To undertake this thesis, you should have at least some prior exposure to machine learning and networking (wireless networking strongly recommended). A naive non-learning baseline is sketched below. 
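
As a naive non-learning baseline for comparison, here is a toy greedy multi-cell PRB scheduler; cells, users, and channel qualities are all made-up.

```python
"""Naive baseline: per cell, give each PRB to the user with the best
(made-up) channel quality; a CoMP user may be scheduled by several
cells at once. All numbers are illustrative."""
import random

rng = random.Random(1)
cells, users, prbs = ['c1', 'c2'], ['u1', 'u2', 'u3'], range(6)
# made-up channel quality per (cell, user, PRB)
q = {(c, u, p): rng.random() for c in cells for u in users for p in prbs}

schedule = {(c, p): max(users, key=lambda u: q[(c, u, p)])
            for c in cells for p in prbs}

for u in users:   # a user served by both cells is a 'CoMP' user here
    grants = sorted((c, p) for (c, p), v in schedule.items() if v == u)
    print(u, grants)
```

An RL agent would replace this per-PRB greedy rule with a policy that accounts for the coupling between cells and the chosen CoMP technique.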


 

Assigned Topics and Ongoing Theses

Completed Topics

MA: Terraform goes IoT (COMPLETED)

Cloud computing happens in a competitive environment: Multiple vendors offer cloud resources under incompatible APIs. On the other hand, spreading an application (e.g., a microservice-based chain of components) over multiple clouds from different vendors can have commercial and technical advantages. Terraform [1] is a popular tool to bridge API gaps between vendors and hide them under a uniform interface.

The goal of this thesis is to extend this idea and incorporate IoT devices and very slim-lined "far cloud" scenarios into Terraform: extend Terraform in such a fashion that components can be deployed in such contexts as well. This entails obtaining an understanding of the options for running software on such devices, and of Terraform itself. A subgoal is to design a proper extension; the proof of concept lies in demonstrating the ability to run a cloud application via Terraform on either a conventional cloud or an IoT/far cloud.


Prerequisites: 

  • Good knowledge of Linux OS, shell scripts, OS API. 
  • Experience with cloud computing and typical toolchains clearly a plus.
     

MA: Orchestrate microservice chains with WebAssemblies (COMPLETED)

Deploying microservices in a complex environment comprising core, edge, and far clouds requires so-called orchestration functions: decide how many instances of a component are needed to deal with the load, where which component runs, which instance deals with which traffic flows, etc. This entails lifecycle management of these components: starting, stopping, migrating, state transfer, etc. Typically, components are realized as virtual machine images or containers, which are relatively easy to manage but heavy-weight.

An alternative idea is to use WebAssembly modules [1]. As they come from a browser context, it is not clear whether they are suitable to act as components in such chains. The goal of this thesis is to develop a concept for how to integrate WebAssembly modules into such chains, which lifecycle management approaches are suitable, and how they can be orchestrated. As a proof of concept, this orchestration functionality should be integrated into a common open-source orchestrator, e.g., Open-Source MANO [2]. A minimal host-side instantiation sketch follows below. 
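
As a minimal host-side illustration, here is a sketch that compiles and invokes a WebAssembly module using the wasmtime-py embedding; treat the exact API names as an assumption, since they vary across wasmtime-py releases.

```python
"""Host-side lifecycle sketch with wasmtime-py; API names are an
assumption and may differ across wasmtime-py releases."""
from wasmtime import Engine, Store, Module, Instance

WAT = """
(module
  (func (export "add") (param i32 i32) (result i32)
    local.get 0
    local.get 1
    i32.add))
"""

engine = Engine()
store = Store(engine)
module = Module(engine, WAT)            # 'build': compile the module once
instance = Instance(store, module, [])  # 'start': instantiate it
add = instance.exports(store)["add"]    # look up an exported function
print(add(store, 2, 3))                 # invoke the component: prints 5
```

Orchestration would wrap exactly these compile/instantiate/invoke steps in start, stop, and migrate operations.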

Prerequisites: 

  • Familiar with microservices, virtual function chains, or similar concepts
  • Good software engineering skills   
  • Good knowledge of Linux OS, shell scripts, OS API.
  • Familiarity with cloud computing and typical toolchains clearly a plus