Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Internet Technologies & Softwarization

Summary written by Isabel Kurth & Matthias Kind

About the Speaker

Prof. Dr. Holger Karl has been a professor of internet technology and softwarization at the Hasso Plattner Institute since July 2021. He completed his studies in Computer Science at the Karlsruhe Institute of Technology and received his PhD from Humboldt University of Berlin in 1999. Following his doctorate, he served as a research assistant in the Telecommunication Networks group at TU Berlin and later became a professor in the Computer Networks group at Paderborn University. Prof. Karl is a renowned expert in mobile communication networks and optimization problems; his research focuses on network softwarization, mobile and wireless networking, and data centers. One of his significant projects is the Open6GHub, a collaborative initiative involving 14 partners coordinated by the DFKI, in which his group contributes to network intelligence and programmable infrastructure using machine learning techniques.

About the Talk

This lecture explores the application of machine learning (ML) techniques to optimize mobile network operations, focusing on dense urban environments. It presents a scenario involving Cooperative Multipoint (CoMP) transmission in mobile networks, where User Equipment (UEs) can connect to multiple base stations simultaneously. The optimization problem aims to maximize the Quality of Experience (QoE) for all UEs, framed as a Partially Observable Markov Decision Process (POMDP).

Networks and ML?

Summary written by Isabel Kurth, Matthias Kind, and Tobias Jordan

In his lecture, Prof. Karl delves into mobile networks, starting with an introduction to mobile terminals and cells to set the stage for discussing related difficulties and solutions.

CoMP: Cooperative Multipoint

Figure 1: Mobile terminals and cells [Prof. Karl]

Mobile networks, composed of mobile terminals and cells, must manage shared resources among many terminals and base stations. A key technique is Cooperative Multipoint (CoMP), which enables a piece of user equipment (UE) to connect to multiple cells and receive data from them simultaneously (see Figure 1). This approach enhances the capacity of mobile networks and particularly benefits UEs at cell boundaries. However, the resources of a cell must be shared among all UEs connected to it, so there is a balance to strike between improving a UE's data rate by connecting it to multiple cells and reducing the resources available at each cell.
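As a toy illustration of this trade-off, suppose each cell's capacity is split equally among the UEs connected to it. This equal-split rule and the capacity numbers are simplifying assumptions for illustration only; real schedulers and channel qualities are far more involved.

```python
# Toy model of the CoMP trade-off: a cell's capacity is split equally among
# the UEs connected to it, and a UE's total rate is the sum of its shares.
# Capacity units and the equal-split rule are illustrative assumptions.

def ue_rates(connections, cell_capacity=100.0):
    """connections maps each UE id to the set of cell ids it connects to."""
    load = {}                                # number of UEs sharing each cell
    for cells in connections.values():
        for c in cells:
            load[c] = load.get(c, 0) + 1
    return {ue: sum(cell_capacity / load[c] for c in cells)
            for ue, cells in connections.items()}

# ue1 grabbing a second cell raises its own rate but shrinks ue2's share:
alone  = ue_rates({"ue1": {"A"}, "ue2": {"B"}})       # ue1: 100, ue2: 100
shared = ue_rates({"ue1": {"A", "B"}, "ue2": {"B"}})  # ue1: 150, ue2: 50
```

The second allocation improves ue1's rate at ue2's expense, which is exactly the tension the resource allocation problem must resolve.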

Challenge: Dense cells with moving UEs

Prof. Karl's research investigates a scenario where UEs are in dense cells using CoMP joint transmission, and their channel states change as they move (see Figure 2). All UEs connected to a single base station compete for resources. The goal is to find a heterogeneous resource allocation that maximizes the Quality of Experience (QoE). This involves addressing questions about the number of cells each UE should connect to and which specific cells should serve each UE.

The challenge is that allocating radio resources in mobile networks—determining which UEs should be served by how many and which cells—is a complex combinatorial optimization problem. This task is NP-hard and therefore difficult to solve with classical methods. Traditional approaches to solving these optimization issues require strict assumptions about perfect knowledge of UEs’ movements and the radio system, information that is typically unavailable. Prof. Karl aims to improve resource allocation in mobile communication networks to enhance overall performance and efficiency.

Figure 2: Dense cells scenario [Prof. Karl]
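To get a feel for why exhaustive search is hopeless here, one can simply count the candidate allocations: if each of U UEs may independently connect to any subset of C cells, there are (2^C)^U possibilities. This is a deliberately crude count (it includes infeasible combinations), used only to show the growth.

```python
# Size of the assignment space: each of `num_ues` UEs independently picks a
# subset of `num_cells` cells. Crude upper bound, for illustration only.

def num_allocations(num_ues, num_cells):
    return (2 ** num_cells) ** num_ues

small  = num_allocations(num_ues=5,  num_cells=4)   # 1,048,576
larger = num_allocations(num_ues=20, num_cells=10)  # about 1.6e60
```

Even the tiny five-UE instance exceeds a million candidates, which is why exhaustive methods cannot meet the millisecond-scale deadlines of a real scheduler.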

Key Optimization Goals

Maximizing the QoE can be interpreted in multiple ways. Prof. Karl presents several of these interpretations along with their downsides:

  1. Maximizing Data Rate: The initial idea. However, this approach often leads to a “birthday cake syndrome,” where the benefit of improving an already good connection diminishes rapidly (1 KB/s → 2 KB/s matters far more than 100 KB/s → 101 KB/s).
  2. Max-Min Fairness: Another approach is to maximize the minimum data rate across all UEs. This “Robin Hood” approach takes resources from well-connected UEs to support those with poor connections, promoting equality, but potentially at the cost of overall network efficiency if the QoE of a badly positioned UE is improved at all costs.
  3. Sum of Logarithms of Data Rates: This method, known as proportional fairness, ensures a fairer distribution of resources by assigning diminishing returns to already high data rates.

Given these considerations, Prof. Karl advocates for maximizing the sum of the logarithms of the data rates. This strategy balances fairness and efficiency, ensuring that resources are allocated in a way that proportionally benefits all users.
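The difference between these objectives can be seen on a two-UE toy example; the rates below are arbitrary units chosen for illustration, not measurements from the lecture.

```python
import math

# Two allocations with the same total rate: the sum-rate objective cannot
# tell them apart, while the log utility (proportional fairness) and the
# max-min objective both prefer helping the weak UE.

def sum_rate(rates):    return sum(rates)
def min_rate(rates):    return min(rates)
def log_utility(rates): return sum(math.log(r) for r in rates)

unequal  = [1.0, 100.0]  # one starved UE, one very well-served UE
balanced = [10.0, 91.0]  # shift some resources toward the weak UE

assert sum_rate(unequal) == sum_rate(balanced)       # both total 101
assert log_utility(balanced) > log_utility(unequal)  # fairness rewarded
assert min_rate(balanced) > min_rate(unequal)
```

The log utility strictly prefers the balanced allocation even though the total rate is unchanged, which is exactly the behavior the lecture argues for.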

Dynamic Nature of the Problem

The resource allocation problem is further complicated by the dynamic nature of mobile networks. The reception quality can change drastically due to various factors, including the movement of UEs and environmental changes (e.g., moving vehicles or obstacles). While the channel state can change dramatically over small distances (tens of centimeters to a few meters, λ/2), it is generally stable over very short timescales (around 10 milliseconds). LTE networks, for instance, operate on a 10ms scheduling interval, assuming channel conditions remain relatively constant within this timeframe.
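As a quick sanity check on these scales, λ/2 can be computed directly from the carrier frequency. The frequencies below are typical mobile bands chosen for illustration, not values from the lecture.

```python
# Half-wavelength (the coherence-distance scale) for a carrier frequency.

C = 299_792_458  # speed of light in m/s

def half_wavelength_cm(freq_hz):
    return (C / freq_hz) / 2 * 100  # result in centimetres

# At 800 MHz, lambda/2 is roughly 18.7 cm; at 2.6 GHz, roughly 5.8 cm, so a
# moving UE crosses this distance in a fraction of a second.
low_band  = half_wavelength_cm(800e6)
high_band = half_wavelength_cm(2.6e9)
```

This is why the channel state of a walking or driving UE can look completely different from one scheduling interval to the next.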

Traditional optimization methods fall short due to the complexity and rapid changes in the network: beyond the NP-hardness already mentioned, which makes solving the problem within these narrow time scales infeasible, the system operates with delayed and partial observations of the environment. Information about a channel's state can only be obtained through probing, and it may become stale before new measurements can be gathered.

Partially Observable Markov Decision Process (POMDP)

To address these challenges, the system is modeled as a Partially Observable Markov Decision Process with the following components:

  • UEs and Base Stations: The primary actors in the system.
  • State: Defined by which UEs are served by which base stations.
  • Environment: The reception quality of links, which is not fully controllable.
  • Actions: Changes in radio resource allocation, i.e., connecting UEs to different base stations.
  • Solution Algorithm: The research uses stochastic gradient descent/ascent methods to find optimal solutions within this framework.

Figure 3: POMDP [Prof. Karl]
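These components could map onto a Gym-style environment interface roughly as follows. The reward and observation here are simplified stand-ins (a toy log-rate reward, random probes), not the actual model from the research.

```python
import math, random

class CoMPEnv:
    """Sketch of the POMDP: the state is the UE-to-cell assignment, actions
    toggle single connections, and observations are noisy channel probes."""

    def __init__(self, num_ues=2, num_cells=3):
        self.num_ues, self.num_cells = num_ues, num_cells

    def reset(self):
        # State: which cells serve which UE (initially none).
        self.assignment = {ue: set() for ue in range(self.num_ues)}
        return self._observe()

    def step(self, action):
        ue, cell = action                  # toggle one UE-cell connection
        self.assignment[ue] ^= {cell}
        # Placeholder reward in the spirit of "sum of log data rates".
        reward = sum(math.log(1 + len(cells))
                     for cells in self.assignment.values())
        return self._observe(), reward

    def _observe(self):
        # Partial observability: the agent only sees noisy per-cell probes,
        # never the true reception quality of every link.
        return [random.random() for _ in range(self.num_cells)]

env = CoMPEnv()
obs = env.reset()
obs, reward = env.step((0, 1))   # connect UE 0 to cell 1
```

An RL agent would repeatedly call step with allocation changes and learn from the reward signal, without ever observing the full channel state.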

Given the inherent uncertainty and partial observability, machine learning becomes a crucial tool to efficiently use CoMP. However, domain knowledge is essential to develop and deploy the machine learning models effectively.

While in theory, resource allocation could be managed either centrally or in a distributed manner across UEs, practical considerations favor a more controlled approach. Distributed approaches require trust in UEs to report their environment honestly and not hoard resources. A centralized approach, while theoretically feasible, poses scalability and latency challenges.

Machine Learning Approaches

The lecture presents three DRL approaches developed to address this problem:

  1. DeepCoMP: A centralized approach where a single agent observes the entire system state and makes decisions for all UEs.
  2. D3-CoMP: A fully distributed approach where each UE has its own independent DRL agent.
  3. DD-CoMP: A hybrid approach with centralized training but distributed inference, allowing agents to share experiences during training.

Figure 4: Central and decentral CoMP approaches [Prof. Karl]

 

Each approach has its advantages and drawbacks. The centralized approach can make more coordinated decisions, but it suffers from scalability issues and requires global information, which cannot realistically be gathered within the necessary time frame. The distributed approaches are more scalable and practical, but they require trust in individual devices (“Never trust a phone.”) and risk greedy behavior from individual agents. DD-CoMP requires more synchronization because agents must share information during training, while D3-CoMP has a huge training overhead because every device trains its own agent independently.
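The training-overhead difference between the two distributed variants comes down to whether agents share parameters. A deliberately minimal sketch of that distinction, where the Policy class is a stand-in for a DRL agent rather than the actual DeepCoMP code:

```python
class Policy:
    """Stand-in for a DRL agent's trainable policy."""
    def __init__(self):
        self.weights = [0.0]

    def update(self, experience):
        self.weights[0] += experience  # placeholder for a gradient step

num_ues = 3

# D3-CoMP: fully independent agents; one UE's experience never helps another.
d3_policies = [Policy() for _ in range(num_ues)]

# DD-CoMP: centralized training via one shared policy; every UE's experience
# updates the same weights, while inference still uses local observations.
shared = Policy()
dd_policies = [shared] * num_ues

dd_policies[0].update(1.0)   # visible to all UEs under DD-CoMP
d3_policies[0].update(1.0)   # affects only UE 0 under D3-CoMP
```

Sharing one policy is what lets DD-CoMP pool experience across UEs, at the price of the synchronization traffic mentioned above.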

Implementation and Evaluation

The research team developed a prototype implementation using Python 3.8, TensorFlow 2, and Ray RLlib. They also created a visualization tool to demonstrate the effectiveness of their approaches. The code and documentation are available on GitHub, promoting reproducibility and further research.

The evaluation results indicate that the DRL approaches outperform existing heuristics. Interestingly, the distributed approaches (D3-CoMP and DD-CoMP) learn faster, but the centralized approach (DeepCoMP) ultimately achieves better performance when given enough training time. The lecture emphasizes that these results are promising but have only been shown in simplified scenarios so far.

A significant portion of the lecture discusses the challenges of evaluating ML systems in network contexts. Prof. Karl contrasts simulation and emulation approaches, discussing tools like MiniNet, ContainerNet, and FaultyNet for network emulation. MiniNet is a widely used network emulator; ContainerNet extends it to create performance-limited points of presence for service chains using Docker containers; and FaultyNet injects faults into the network for more realistic emulation. To make FaultyNet usable for machine learning, GPU integration is currently under development at the chair and is almost ready for release.

Figure 5: Discrete event simulator [projectguideline.com]

Another approach is evaluation by simulation using discrete event simulators such as NS/3 or OMNeT++. Unfortunately, these are challenging to combine with ML frameworks because they are typically written in a different programming language than the ML tooling (C++ vs. Python). NS/3-AI provides a Python integration for NS/3, but it is not really a usable option since it lacks important features of the standalone version. The DEFIANCE bachelor project at Prof. Karl's chair is currently working on a solution to this problem.
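At its core, a discrete event simulator is just a clock plus a priority queue of scheduled callbacks; the hard part is exchanging observations and actions with a Python ML agent at each decision event. A minimal sketch of the simulator side, not based on NS/3 or OMNeT++ internals:

```python
import heapq, itertools

class Simulator:
    """Minimal discrete-event loop: events fire in timestamp order."""
    def __init__(self):
        self.queue, self.now = [], 0.0
        self._tie = itertools.count()   # tie-breaker for equal timestamps

    def schedule(self, delay, callback):
        heapq.heappush(self.queue, (self.now + delay, next(self._tie), callback))

    def run(self, until):
        while self.queue and self.queue[0][0] <= until:
            self.now, _, callback = heapq.heappop(self.queue)
            callback()

sim = Simulator()
probe_times = []

def probe():                     # e.g. a periodic channel probe, where an
    probe_times.append(sim.now)  # ML agent would be queried for an action
    sim.schedule(10.0, probe)    # LTE-style 10 ms scheduling interval

sim.schedule(0.0, probe)
sim.run(until=35.0)
# probe_times is now [0.0, 10.0, 20.0, 30.0]
```

In a real NS/3 or OMNeT++ setup this loop lives in C++, which is precisely why bridging it to a Python training framework requires dedicated integration work.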

Outlook

The latter part of the lecture explores future research directions, flipping the perspective to consider how networks can help ML, not just how ML can help networks. This includes challenges in distributed ML training and inference within networks, and optimizing resource allocation for competing ML training runs. Prof. Karl presents an example problem of allocating resources between two models that need continuous retraining, considering factors like diminishing returns of reward over learning episodes.

The lecture concludes by emphasizing that while ML techniques show great promise in improving mobile network operations, there are still many challenges to overcome. The integration of ML and networking opens up a rich area for future research, with many degrees of freedom in distributed ML workloads presenting opportunities for optimization. The intersection of ML and networking is a complex but potentially very rewarding area of study, with practical applications in improving the performance and efficiency of mobile networks.

Nevertheless, Prof. Karl concludes by emphasizing that while machine learning is exciting, its successful application to networking requires deep domain knowledge. The real challenges lie in understanding system details, data availability, and building robust evaluation tools. Simply getting something to “work in Python” is insufficient for serious research. This underscores the need for rigorous, domain-specific approaches when applying ML to complex network systems, highlighting the importance of engaging deeply with both ML techniques and network intricacies.