Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Prof. Dr. Tilmann Rabl has worked in database research since 2007 and received his Ph.D. from the University of Passau in 2011. After finishing his Ph.D. thesis on scalability and data allocation in cluster databases, he continued his work as a postdoctoral researcher in the Middleware Systems Research Group at the University of Toronto. In 2015, he joined the Database Systems and Information Management group at Technische Universität Berlin as a senior researcher and visiting professor and held the position of Vice Director of the Intelligent Analytics for Massive Data group at the German Research Center for Artificial Intelligence. Since 2019, he has held the chair for Data Engineering Systems at the Digital Engineering Faculty of the University of Potsdam and the Hasso Plattner Institute. His research focuses on the efficiency of database systems, real-time analytics, hardware-efficient data processing, and benchmarking.

About the Talk

Sensors, user input, and monitoring produce events at very high rates that are hard to process with traditional data management systems. In many applications, such as network monitoring, data is most valuable at its generation time and becomes stale quite quickly. Therefore, timely stream processing is often of high economic value, but it can also be life-saving, as in digital health applications.

In this talk, after opening the lecture and introducing the logistics, we will discuss efficient stream processing. We will first point out inefficiencies in current stream processing engines and discuss the reasons for these inefficiencies, which are rooted in hardware design. We will then explain how to generate more efficient code and, using the example of SIMD computations, discuss portable optimizations for code generation.

Stream Processing

Summary written by Klara Munz, Fabien Kavuganyi and Mika Hoppe

Classical database systems take a fixed, finite amount of information as input. Both the input of a query and its result are tables.

In contrast, stream processing systems take an unbounded, continuous data stream as input. Such low-latency systems can be utilized in numerous domains, including the processing of data from the Internet of Things (IoT) and machine learning pipelines. As of now, research systems achieve hardware utilization that is orders of magnitude better than that of production systems.
This article will elaborate on techniques and research to improve utilization in stream processing systems.

What is Stream Processing?

In a stream processing system, individual records are passed through filter, aggregation, or join operators, as shown in Figure 1. These operators are either stateless or stateful: aggregation and join operators are stateful, as their output depends on more than one individual record. The final operator pushes records to a sink, which emits the result stream.

Figure 1: Conceptual visualization of a stream processing job [6]

Usage and benefits of stream processing in the real-world

The need for hardware optimization can be demonstrated by looking at Singles' Day at Alibaba in 2020. On November 11, Alibaba processed over four billion records per second using Apache Flink [1]. As Apache Flink is a scale-out system, more queries require more servers. Alibaba therefore utilized 1.5 million CPU cores in the form of 93,750 16-core virtual machines, each costing approximately $1 per hour, amounting to approximately $2.25 million for that day. Each virtual machine received approximately 42 thousand events per second.

By using scale-up research stream processing systems, Alibaba could have processed over 100 million events per second per machine, so 40 machines would have sufficed. Using 52-core instances at approximately $6 per hour, the cost would have been around $5,760 per day, a saving of more than $2 million.

The Iterator and the Query Compilation Model

There are multiple models for processing data in a database system, and the trade-offs that determine the appropriateness of each approach differ widely between classical and stream processing systems. The iterator model looks at each event individually and pushes it through the predefined pipeline [3]. As a consequence, it incurs many virtual method calls and poor cache locality for both data and code. For that reason, Apache Flink, which uses the iterator model, is slow compared to research systems. To improve efficiency, a query compilation model can be used instead [4]. In the query compilation model, all essential operators of a query are compiled into a single, compact binary. The resulting program has good cache locality and few to no virtual method calls.

Aspects of stream processing research

First, as data is processed incrementally rather than in batches, applications can differ from those of traditional database systems. Exploring and experimenting with these novel application domains can yield valuable insights.

Second, research can be done at the operator level, looking at how aggregations and joins can be done efficiently in stream processing. As a result, organizations may be able to use resources more cost-effectively and sustainably.

Third, research may examine the semantics of the stream. This is helpful when dealing with large amounts of data or when the system must wait for specific events or inputs to occur.

Last, execution strategy research can improve query execution through multi-query processing. In addition, hardware optimization can be performed by targeting and using newly available hardware.

SIMD compiler intrinsics

What is Single Instruction Multiple Data (SIMD)?

Today's CPUs have multiple cores that execute multiple instructions per cycle. To process multiple data objects in parallel, the single instruction multiple data (SIMD) registers within each core can be utilized [2]. These registers perform the same operation on multiple data points simultaneously.

What’s the problem with SIMD code today?

Different processor types come with different SIMD instruction sets. Correctly utilizing the concrete instruction set of a given processor can speed up table scans, hash tables, and sorting within a database. However, as each instruction set requires its own SIMD code, the resulting code is difficult to develop, test, and benchmark. Therefore, handwritten SIMD code is not often used. More commonly, the application code is translated into SIMD intrinsics using a library. These intrinsics are translated into a compiler representation, which is then compiled into assembly [2].

To improve performance, compiler intrinsics can be used directly instead of library-based SIMD intrinsics [7]. Such code can be written using GCC vector extensions; the compiler then takes care of the platform-specific instruction selection.

The efficiency of this approach was benchmarked by unpacking packed 9-bit integers to 32 bits using shuffling and shifting [2]. As shown in Figure 2, the handwritten SIMD code is, on average, slower than the code vectorized via compiler intrinsics. As of now, there are certain edge cases for which the SIMD compiler intrinsics do not work well because certain operations are not supported.

Figure 2: Performance difference for vectorized code with different CPUs [2]

Application of SIMD intrinsics to the Velox engine

Compiler intrinsics have been used to optimize Velox, a query engine developed by Meta [5]. While achieving the same performance as handwritten SIMD code, using only compiler intrinsics allowed 54 platform-specific functions and hundreds of lines of SIMD code to be removed [2].

Summary

The lecture given by Prof. Rabl focused on the challenges of economics and efficiency in stream processing systems, especially big data processing systems like Apache Flink. To this end, it discussed the iterator and query compilation models in stream processing. It also foregrounded open research aspects and challenges in SIMD processing.

Prof. Rabl highlighted the need for efficient use of hardware resources to handle large amounts of data flow, as encountered at Alibaba, where billions of records are processed every second. He pointed out that traditional stream processing engines are often inefficient due to the iterator-based models they employ, which involve many virtual method calls and poor cache locality.

The lecture emphasized the importance of using SIMD instructions to optimize database operations and noted the challenges of writing and maintaining SIMD-optimized code across different hardware architectures. It also discussed the financial aspects of efficiency in cloud-based systems, pointing out that cloud providers have little incentive to optimize individual server performance.

The primary focus of the presentation was to demonstrate how the sophisticated techniques in stream processing and hardware implementation can result in more effective, faster, and more economical database systems, particularly in large data processing environments.


References

[1]      Alibaba Cloud Community. 2024. Four Billion Records per Second! What is Behind Alibaba Double 11 - Flink Stream-Batch Unification Practice during Double 11 for the Very First Time (January 2024). Retrieved January 10, 2024 from ​www.alibabacloud.com​/​blog/​four-billion-records-per-second-stream-batch-integration-implementation-of-alibaba-cloud-realtime-compute-for-apache-flink-during-double-11_596962.

[2]      Lawrence Benson, Tilmann Rabl, and Richard Ebeling. 2023. Evaluating SIMD Compiler-Intrinsics for Database Systems. Hasso Plattner Institute, University of Potsdam.

[3]      G. Graefe. 1994. Volcano-an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6, 1, 120–135. DOI: doi.org/10.1109/69.273032.

[4]      Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. Proc. VLDB Endow. 4, 9, 539–550. DOI: doi.org/10.14778/2002938.2002940.

[5]      Pedro Pedreira. 2023. Introducing Velox: An open source unified execution engine. Meta (Mar. 2023).

[6]      Tilmann Rabl. 2023. Hardware Efficient Stream Processing. Lecture Series on Database Research WiSe 2023/24.

[7]      Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. 2009. SIMD-Scan: Ultra Fast in-Memory Table Scan using on-Chip Vector Processing Units. Proc. VLDB Endow. 2, 1, 385–394. DOI: doi.org/10.14778/1687627.1687671.