Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Zsolt István is a Full Professor at the Technical University of Darmstadt, Germany. In his research, he focuses on Distributed and Networked Systems topics, as part of the Systems@TUDa Group. Earlier, he was an Associate Professor at the IT University of Copenhagen, Denmark, and an Assistant Research Professor at the IMDEA Software Institute in Madrid, Spain. Zsolt received his PhD degree from the Systems Group at ETH Zurich, Switzerland. 

About the Talk

Modern data-intensive applications, such as Analytical Database Management Systems or Machine Learning pipelines, are increasingly run as distributed systems in the public cloud or in enterprise datacenters. Distribution helps with scaling compute and storage resources but also introduces various data movement bottlenecks. Placing parts of the computation closer to the network can reduce these bottlenecks and allow data-intensive systems to scale better. SmartNICs, that is, network interface cards with compute capabilities, enable such close-to-network computation and are becoming common in the cloud. This talk is composed of three parts: First, we look at the driving forces behind the distributed architectures that are now standard in the cloud and motivate why computation close to the network is necessary. Second, we cover the design spectrum of SmartNICs, explaining what their internal architecture can look like and what specific processing elements they can incorporate. In the third part of the talk, we sample from recent research projects that successfully leverage SmartNICs to make applications more efficient and more scalable.

SmartNICs in the Cloud: The Why, What and How of In-network Processing

Summary written by Magnus Menger and Heinrich Timphus

Overview

In his talk, Zsolt István gives an engaging introduction to the topic of SmartNICs (smart network interface cards), explaining what they are and how they can be used to make cloud computing more efficient by reducing CPU load and data movement. The talk consists of two parts: In the first part, he gives some background on modern scalable architectures and the bottlenecks and overheads we still face. He then shows two examples of how processing close to the network can help overcome these challenges. In the second part, he gives an overview of the design space of SmartNICs, showing how and where you can add "smartness" to a NIC and when doing so is actually useful.

Background

Cloud computing is becoming more relevant every day. While it started out as a cost-effective way of outsourcing virtual machines to commodity hardware, it now serves a dual purpose. On the one hand, many companies run most of their compute-intensive jobs in the cloud. On the other hand, the cloud providers themselves run their own applications and core services in their clouds.

This great variety of workloads results in widely differing requirements for CPU, memory, and storage, which makes provisioning resources very challenging. Also, there is often a gap between the average and peak requirements of a cloud workload. This leads to underutilized resources, which result in large costs.

One way to avoid these problems is co-locating multiple workloads and using disaggregation to gain flexibility. Today, many applications are designed as a collection of microservices. This enables better and more flexible provisioning and therefore improved scalability. But this distributed design in turn introduces new challenges: it can lead to data movement bottlenecks.

At first glance, this might not seem like a big problem, since network hardware is getting faster all the time. Ethernet speeds of 100 Gbps are already common in clouds, and 400 Gbps is coming. But disaggregation does not only require fast networks, it also incurs a so-called "data center tax". This term describes CPU cycles that are not spent on the main task but on things like moving data in and out, reformatting data, or encrypting and decrypting it. And as of today, CPU performance is increasing much more slowly than network speed, so the CPU can become a bottleneck for disaggregated systems in the cloud.
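To get a feeling for why this matters, the following back-of-envelope calculation compares the byte rate of a fast link with the cycle budget of a few CPU cores. All numbers are illustrative assumptions, not figures from the talk.

```python
# Rough back-of-envelope estimate of the CPU budget per byte at line rate.
# All numbers below are illustrative assumptions, not measurements.

link_gbps = 100                     # assumed network speed: 100 Gbit/s Ethernet
bytes_per_s = link_gbps * 1e9 / 8   # = 12.5 GB/s arriving over the link

cpu_ghz = 3.0                       # assumed clock speed of one core
cores_for_networking = 4            # assumed cores reserved for packet handling
cycles_per_s = cpu_ghz * 1e9 * cores_for_networking

cycles_per_byte = cycles_per_s / bytes_per_s
print(f"CPU budget: {cycles_per_byte:.2f} cycles per byte at line rate")
# Roughly one cycle per byte leaves almost no room for copying, reformatting,
# or encrypting data -- exactly the "data center tax" described above.
```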

A way to make further scaling possible is to reduce the data center tax using specialized network devices. Offloading some processing to the NIC frees up CPU cycles. Zsolt István presents two examples where this was done successfully.

Azure Accelerated Networking

Cloud providers have to apply many networking rules to ensure security, flexibility, performance, and isolation. There needs to be some virtualization of the networking to ensure that each virtual machine thinks it is alone on the network.

To achieve this in the Azure cloud, Microsoft's software-defined networking (SDN) execution required several CPU cores. To avoid this, they developed specialized hardware. Their SmartNICs combined regular network processing with programmable SDN rule evaluation. The per-flow rule execution was offloaded to the network card, which reduced CPU overhead. The result was overall lower latency, higher bandwidth, and increased efficiency thanks to the specialized hardware.
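The following sketch illustrates the general idea of per-flow rule evaluation with a flow cache; the rule fields, actions, and structure are simplifying assumptions and do not reflect Microsoft's actual implementation.

```python
# Simplified model of per-flow SDN rule matching, the kind of work that
# Azure Accelerated Networking moves from CPU cores onto the SmartNIC.
# Field names and actions are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    src_ip: str
    dst_ip: str
    dst_port: int

# The full policy is only consulted for the first packet of a flow ...
def evaluate_rules(flow: Flow) -> str:
    if flow.dst_port == 22:
        return "drop"                        # e.g. block SSH from outside
    if flow.dst_ip.startswith("10."):
        return "rewrite-to-virtual-network"  # address virtualization
    return "allow"

# ... and the decision is cached per flow, so subsequent packets of the same
# flow can be handled at line rate without re-running the whole rule set.
flow_cache: dict[Flow, str] = {}

def handle_packet(flow: Flow) -> str:
    if flow not in flow_cache:
        flow_cache[flow] = evaluate_rules(flow)  # slow path (offloaded)
    return flow_cache[flow]                      # fast path, per packet

print(handle_packet(Flow("192.168.1.5", "10.0.0.7", 443)))
```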

Since 2015, Microsoft has deployed this technology in all new servers. There are now millions of machines using accelerated networking in Azure. Today, the same hardware is also used by Microsoft to offload some machine learning operations.

SQL Filter Offloading

In disaggregated architectures, the storage is often attached via the network. Many applications do not actually need to move all data to the compute layer. Some of the operations that make this possible are common enough to be worth accelerating with specialized networking hardware at the storage level. This reduces network overheads while also reducing CPU requirements and allows for more predictable performance.

One example where this technology has been used commercially is Amazon AQUA. In his talk, Zsolt István explains the technology using Caribou as an example. Caribou is a distributed storage solution that he helped develop and which works similarly to AQUA.

Caribou allows pushing down filter operations in SQL queries to the storage layer. There, they are executed by specialized hardware that enables processing at "line rate". This term means that the computation is at least as fast as the network and never slows down retrieval. In hardware, the algorithms can be redesigned to become bandwidth-bound instead of compute-bound.
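A minimal sketch of what filter pushdown buys, assuming a toy row format and a hypothetical storage-side scan function (this is not Caribou's or AQUA's real interface):

```python
# Minimal sketch of SQL filter pushdown to the storage layer.
# The interface and data are assumptions chosen for illustration.

rows = [
    {"id": 1, "country": "DE", "revenue": 120},
    {"id": 2, "country": "US", "revenue": 80},
    {"id": 3, "country": "DE", "revenue": 45},
]

def storage_scan_with_pushdown(table, predicate):
    """Runs on the storage node: only matching rows cross the network."""
    return [row for row in table if predicate(row)]

# Without pushdown, all three rows travel to the compute node and are
# filtered there; with pushdown, only the matching row is transferred.
# SELECT * FROM sales WHERE country = 'DE' AND revenue > 100
result = storage_scan_with_pushdown(
    rows, lambda r: r["country"] == "DE" and r["revenue"] > 100
)
print(result)  # [{'id': 1, 'country': 'DE', 'revenue': 120}]
```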

Overview of the design space

After understanding the problems that motivate the use of SmartNICs and seeing these examples of their successful use, open questions remain about when and how we can design SmartNICs. To answer them, we start by taking a look at how a conventional NIC works.

On a very basic level, a NIC translates between physical signals (like changing voltages on a wire) and the in-memory representation of data. It consists of three logical layers:

  • Virtualization / Steering layer: Controls how data is sent to the cores that need it and how the interface is virtualized (e.g. Open vSwitch support)
  • Network layer: Controls how nodes are identified and how they talk to each other (e.g. IPv4, IPv6)
  • Physical layer: Controls how bits are encoded on the transfer medium (e.g. Ethernet)

Logically, these are two separate layers, but in terms of actual chip design, the network and virtualization layers are often very closely integrated. Usually, the NIC is connected to the CPU via PCIe. There are three main design concepts for where additional processing can be added to a NIC.
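As a rough mental model, the receive path through these three layers can be sketched as follows; the function names and packet format are purely illustrative and not real driver APIs.

```python
# Toy model of a packet's receive path through the three logical NIC layers.
# Function names and the packet format are illustrative only.

def physical_layer(signal: bytes) -> bytes:
    """Decode bits from the wire (e.g. Ethernet framing)."""
    return signal  # pretend the signal is already a decoded frame

def network_layer(frame: bytes) -> dict:
    """Parse addressing so nodes can identify each other (e.g. IPv4/IPv6)."""
    return {"dst_vm": frame[0], "payload": frame[1:]}

def steering_layer(packet: dict, queues: dict) -> None:
    """Deliver the payload to the queue of the core / VM that needs it."""
    queues.setdefault(packet["dst_vm"], []).append(packet["payload"])

queues: dict = {}
steering_layer(network_layer(physical_layer(b"\x01hello")), queues)
print(queues)  # {1: [b'hello']}
```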

Design 1: Smart Component fully integrated in NIC functionality

One possibility is to integrate the smart component into all three layers of the NIC. This gives the highest flexibility by allowing offloading at any level (application, protocol, ...). It results in low latency and high throughput thanks to the tight integration. The main drawback is the high engineering overhead that comes with this approach: NIC chip designs have to be modified, resulting in lead times of five or more years. Also, for physical reasons, the complexity of the offloading is limited by the NIC chip size. Because of these disadvantages, this approach is rarely used in practice.

Design 2: Smart Component between Network wire and NIC functionality

Another option is to add the smart component between the network layer and the physical layer components.

This is much easier to engineer and works at line rate on the packet level, much like a smart switch. The drawback of this approach is that you can only work at the packet level. For higher-level operations, the smart component would need to reimplement application-level logic.

Conceptually, this design was also used for Azure Accelerated Networking. In general, this approach works well if you want to offload operations between the operating system / hypervisor and the infrastructure that do not need application-level concepts. Such operations can be offloaded to reduce CPU load and increase efficiency.
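The packet-level restriction can be illustrated with a small sketch; the packet format and the idea of a record split across packets are assumptions chosen for illustration.

```python
# Sketch of a "bump in the wire" element (Design 2): it sees individual
# packets at line rate, but application-level records may span packets,
# so touching them would require reimplementing reassembly and parsing.
# The packet format below is an assumption for illustration.

def bump_in_the_wire(packet: dict) -> dict:
    # Packet-level work is easy: rewrite headers, count, filter, encrypt ...
    packet["vlan"] = 42
    return packet

packets = [
    {"seq": 0, "payload": b'{"user": "al'},  # one application record,
    {"seq": 1, "payload": b'ice"}'},          # split across two packets
]

# To filter on the "user" field, the smart component would first have to
# reassemble and parse the record -- i.e. duplicate application logic.
stream = b"".join(
    bump_in_the_wire(p)["payload"] for p in sorted(packets, key=lambda p: p["seq"])
)
print(stream)  # b'{"user": "alice"}'
```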

Design 3: Smart Component between CPU and NIC functionality

The third option is to add the smart component in front of the virtualization / steering layer component. This is the most common approach and also the one used in Zsolt István's group.

This structure of a new main component in front of the NIC functionality is only a logical view. In practice, it is often infeasible to send all data through the smart component. Usually, the smart component is connected via PCIe, so the CPU and NIC can send only the relevant data to it.

This design approach is very easy to engineer, since it basically means just adding another PCIe device. It allows for very flexible scheduling and offloading and provides a natural way to offload application-level operations. A drawback of this approach is that the internal PCIe bandwidth can become a bottleneck. This means that bandwidth and latency cannot be guaranteed.

This design was used in the SQL offloading examples presented before. In general, this architecture allows offloading of application- and operating-system-level operations. With RPC infrastructure operations such as compression and serialization, we can reduce packet sizes and congestion. Application-level operations allow us to reduce data movement and query latency.
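A minimal sketch of this "lookaside" style of offloading, assuming a hypothetical SmartNIC compression engine (zlib stands in for the hardware; this is not a real vendor API):

```python
# Sketch of lookaside offloading (Design 3): the CPU hands only the relevant
# buffers to the PCIe-attached smart component instead of routing all traffic
# through it. The SmartNIC class is a hypothetical stand-in, not a vendor API.
import zlib

class SmartNIC:
    def compress(self, buf: bytes) -> bytes:
        # In reality this would run on the NIC's accelerator, freeing CPU
        # cycles; here zlib stands in for the offloaded compression engine.
        return zlib.compress(buf)

def send_rpc(nic: SmartNIC, payload: bytes) -> int:
    compressed = nic.compress(payload)  # only this buffer crosses PCIe
    # ... hand the compressed payload to the normal NIC send path ...
    return len(compressed)

payload = b"row-oriented data " * 1000
print("wire bytes:", send_rpc(SmartNIC(), payload), "instead of", len(payload))
```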

What Processing Elements to use?

Regardless of the chosen design concept, it is important that the hardware used requires little energy and allows highly parallel processing. When choosing the hardware, there is a trade-off between flexibility and performance. In order of increasing flexibility, the options are:

  • Configurable "accelerator" ASICs for things like cryptography or compression (e.g. Nvidia ConnectX-7)
  • Programmable hardware like FPGAs (e.g. AMD Alveo)
  • Multi-core ARM processors (e.g. Nvidia BlueField-2)

Conclusion and Summary

Disaggregation enables scalable architectures in the cloud. At the same time, it introduces new data movement overheads where the CPU can become a bottleneck. Offloading some operations to SmartNICs can make systems more efficient. SmartNICs have a large design space with multiple approaches for where to add smart components. The choice of architecture determines the level of operations for which they can be used.

The main goal of SmartNICs is not to accelerate computation, even though they are sometimes called accelerators. Usually, the CPU would be nominally faster, but at the same time it would be less efficient and would "pull up" functionality from software. The real benefit of SmartNICs is that they remove the CPU from the critical path, reduce data movement, and help by shifting bottlenecks.

References

This summary is based on the lecture "SmartNICs in the Cloud - The Why, What and How of In-network Processing" by Zsolt István in the HPI Lecture Series on Database Research 2023/2024. All figures are taken from the lecture slides. If sources were provided for them, they are also part of the figure.

[1] Zsolt István's Homepage. zistvan.github.io