Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Data Processing on Modern Hardware

About the Speaker

Prof. Dr. Tilmann Rabl has been conducting research on databases since 2007. He began his research career at the University of Passau, where he wrote his doctoral thesis on Efficiency in Cluster Database Systems. After receiving his PhD in 2011, he worked for four years as a postdoctoral researcher in the Middleware Systems Research Group at the University of Toronto. From 2015 to 2019, he worked as a senior researcher and, from 2017, as a visiting professor at the Database Systems and Information Management Group (DIMA) at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). Since 2019, he has been a professor at the Hasso Plattner Institute, where he holds the Chair of Data Engineering. Additionally, he is a co-founder of bankmark, a company specializing in database benchmarking [1].

About the Talk

Today, virtually everything is data-driven, necessitating fast data processing and analysis. As the value of data diminishes over time, quick processing is crucial.

However, hardware performance is no longer increasing as it used to. Historically, CPU performance improved consistently with increasing numbers of transistors, higher frequencies, and better single-thread performance. This allowed software to improve performance without significant changes. Around the mid-2000s, the industry encountered a power wall where further frequency increases became economically unviable due to power consumption and heat dissipation limits. As shown in Figure 1, the solution was to parallelize by adding more cores instead of further increasing the frequency of a single core. This resulted in the need to change the software to leverage parallel processing capabilities.

Figure 1: The historical trend in microprocessor development: Until the mid-2000s, the frequency of a single core increased steadily. Afterward, the number of cores increased, while the frequency remained constant [2].

GPU Accelerated Data Processing

To meet the increasing demands on data processing systems, specialized hardware such as GPUs (Graphics Processing Units) can be utilized. A GPU is a specialized coprocessing chip, an accelerator for specific types of operations, originally built for rendering images.

A key feature of GPUs is their significantly higher number of cores compared to CPUs, which enables them to perform vastly more computing operations simultaneously. Modern GPUs can process data at up to 3 terabytes per second (TB/s), whereas conventional CPUs typically reach around 100 gigabytes per second (GB/s) [2]. In data processing, however, the volume of data often exceeds the capacity of the GPU's onboard memory, so data must be supplied from external sources.

In traditional computer architectures, the GPU is attached to the CPU as a coprocessor via the PCIe (Peripheral Component Interconnect Express) interface; the CPU, in turn, is connected to the system memory. The maximum data transfer rate of PCIe is only 32 GB/s [2], a fraction of the GPU's processing capability. The connection between the GPU and the system memory therefore becomes a significant bottleneck that limits the overall performance of data processing systems [3].
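A back-of-the-envelope calculation makes this mismatch concrete. The sketch below (plain C++) uses the rounded bandwidth figures quoted above; the 1 TB table size is an assumed example, not a number from the talk.

    // Rough comparison of on-device processing time vs. PCIe transfer
    // time for scanning a table. Bandwidths are the rounded values
    // quoted in the talk [2]; the table size is a made-up example.
    #include <cstdio>

    int main() {
        const double table_gb  = 1024.0;  // hypothetical 1 TB input table
        const double gpu_gbps  = 3000.0;  // ~3 TB/s on-device bandwidth
        const double pcie_gbps = 32.0;    // ~32 GB/s across PCIe

        printf("GPU processing: %5.2f s\n", table_gb / gpu_gbps);   // ~0.34 s
        printf("PCIe transfer:  %5.2f s\n", table_gb / pcie_gbps);  // ~32 s
        return 0;
    }

In this example, moving the data across PCIe takes roughly a hundred times longer than the GPU needs to process it, which is exactly the bottleneck described above.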

GPU Interconnect

To prevent the data connection from becoming a bottleneck [7], GPU manufacturers have developed specialized high-speed interconnects, such as NVLink 4.0, which significantly increase data transfer rates between GPUs as well as between GPUs and CPUs, supporting up to 900 GB/s [2, 4, 5]. This interconnect technology reduces latency and increases bandwidth for data-intensive applications, ensuring that GPUs can operate at their full potential without being held back by slow data transfers.

To fully utilize the system's maximum capacity, a comprehensive understanding of individual hardware components and the overall hardware topology is essential. This includes knowledge of how the GPUs, CPUs, memory, and interconnects are arranged and how they communicate with each other. A well-designed hardware topology can minimize data transfer bottlenecks and maximize computational efficiency, allowing for optimal performance in tasks such as machine learning, scientific simulations, and large-scale data processing.
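As a minimal illustration of inspecting such a topology, the CUDA runtime API can enumerate the GPUs in a system and report which device pairs support direct peer-to-peer memory access (for example, over NVLink); the exact output depends on the machine at hand.

    // Enumerate GPUs and check peer-to-peer accessibility between all
    // device pairs using the CUDA runtime API.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s\n", i, prop.name);
            for (int j = 0; j < n; ++j) {
                if (i == j) continue;
                int canAccess = 0;
                cudaDeviceCanAccessPeer(&canAccess, i, j);
                printf("  peer access %d -> %d: %s\n", i, j,
                       canAccess ? "yes" : "no");
            }
        }
        return 0;
    }

On NVIDIA systems, nvidia-smi topo -m prints a similar interconnect matrix, including whether two devices are linked via NVLink or PCIe.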

Scalable GPU-based Join

One research topic from Professor Rabl's group focuses on leveraging the advanced processing capabilities of GPUs and improved interconnects, such as NVLink, for efficient data management of arbitrary data volumes. One of the most crucial operations in data management for relational databases is the join operation, which combines tables of data based on a shared key.

Traditionally, this join operation can be performed through a nested loop, where for each key in the first table, a search is conducted across the entirety of the second table for corresponding keys. This method, while straightforward, is computationally expensive and time-consuming, especially as the size of the tables increases. A more efficient approach is the hash join operation, which constructs a hash table from the keys of one table. The keys of the second table are then hashed and matched against this hash table, allowing for faster lookup and matching [6].
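The contrast between the two approaches can be sketched in a few lines of C++; the tables and their contents below are made up for illustration. The nested-loop join costs O(|R| * |S|) comparisons, whereas the hash join performs a single pass over each table, O(|R| + |S|).

    // Minimal single-threaded hash join: build a hash table over one
    // table, then probe it with the keys of the other.
    #include <cstdio>
    #include <unordered_map>
    #include <vector>

    struct Row { int key; int payload; };

    int main() {
        std::vector<Row> build_side = {{1, 10}, {2, 20}, {3, 30}};
        std::vector<Row> probe_side = {{2, 200}, {3, 300}, {4, 400}};

        // Build phase: hash every key of one (ideally the smaller) table.
        std::unordered_multimap<int, int> hash_table;
        for (const Row& r : build_side)
            hash_table.emplace(r.key, r.payload);

        // Probe phase: each probe row costs a hash lookup instead of a
        // scan over the entire build table.
        for (const Row& s : probe_side)
            for (auto [it, end] = hash_table.equal_range(s.key); it != end; ++it)
                printf("match on key %d: (%d, %d)\n", s.key, it->second, s.payload);
        return 0;
    }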

For the primary tasks within the hash join operation—namely the construction of the hash table and probing for matches—it is possible to leverage the parallel processing power of GPUs. When using the GPU for probing, the hash table is first loaded into the GPU's memory. The data to be matched, residing in the CPU memory, is transferred to the GPU using high-speed interconnects like NVLink. This setup allows the GPU to hash the incoming data and check for matches against the hash table in parallel. The parallel nature of GPUs significantly accelerates this process, making it much faster than performing the same task on a CPU, as illustrated in Figure 2. The increased speed and efficiency are particularly beneficial for large-scale data processing tasks.

Figure 2: Throughput performance of different interconnects [7].
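A hedged sketch of such a probe kernel in CUDA is shown below: one thread per probe tuple, with the hash table stored as an open-addressing array (linear probing) in GPU memory. The table layout and the EMPTY sentinel are illustrative assumptions, not the scheme used in [7], and the sketch assumes the table is never completely full.

    #include <cuda_runtime.h>

    constexpr int EMPTY = -1;  // sentinel marking an unused slot

    __device__ unsigned hash_key(int key, unsigned capacity) {
        return static_cast<unsigned>(key) % capacity;
    }

    // One thread per probe tuple: hash the key and scan the table with
    // linear probing until a match or an empty slot is found.
    __global__ void probe(const int* ht_keys, const int* ht_vals,
                          unsigned capacity, const int* probe_keys,
                          int* out_vals, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned slot = hash_key(probe_keys[i], capacity);
        while (ht_keys[slot] != EMPTY) {
            if (ht_keys[slot] == probe_keys[i]) {
                out_vals[i] = ht_vals[slot];  // match: emit build payload
                return;
            }
            slot = (slot + 1) % capacity;
        }
        out_vals[i] = EMPTY;  // no matching build-side key
    }

After the hash table and the probe keys have been copied to the device, the kernel could be launched with, e.g., probe<<<(n + 255) / 256, 256>>>(...), so that thousands of probe tuples are processed concurrently.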

By using the GPU to build the hash table, significant improvements in throughput can be achieved compared to the CPU, provided that the hash table fits within the GPU's memory. Once the hash table exceeds this capacity, however, the efficiency of the GPU-based hash join diminishes: as Figure 3 shows, throughput drops abruptly at that point. When the hash table is too large, data must be transferred back and forth between GPU and CPU memory, leading to increased latency and reduced performance.

Figure 3: The throughput performance of CPU-based and GPU-based hash join operations, highlighting the impact of GPU memory limitations on performance [7].

Despite this limitation, Professor Rabl's group has demonstrated in recent research that performance improvements can still be achieved even when the hash table does not fit into GPU memory, by employing smart partitioning techniques such as breaking the large hash table into smaller chunks that individually fit into GPU memory [7].
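A high-level, host-side sketch of this partitioning idea is given below: both inputs are split by a hash of the key so that matching rows land in partition pairs that can be joined independently, each with a hash table small enough for GPU memory. This is a simplified illustration, not the exact algorithm from [7]; the fanout and the per-partition join stub are assumptions.

    #include <cstdio>
    #include <vector>

    struct Row { int key; int payload; };
    using Table = std::vector<Row>;

    // Stand-in for the per-partition GPU hash join (see the kernel
    // sketch above); a stub keeps this illustration self-contained.
    void join_partition(const Table& build, const Table& probe) {
        std::printf("joining partition: %zu x %zu rows\n",
                    build.size(), probe.size());
    }

    // Split a table by a hash of the key so that matching rows always
    // end up in partitions with the same index.
    std::vector<Table> partition(const Table& t, int fanout) {
        std::vector<Table> parts(fanout);
        for (const Row& r : t)
            parts[static_cast<unsigned>(r.key) % fanout].push_back(r);
        return parts;
    }

    int main() {
        Table build = {{1, 10}, {2, 20}, {3, 30}, {4, 40}};
        Table probe = {{2, 200}, {4, 400}};
        const int fanout = 2;  // chosen so each partition fits in GPU memory
        auto build_parts = partition(build, fanout);
        auto probe_parts = partition(probe, fanout);
        for (int p = 0; p < fanout; ++p)
            join_partition(build_parts[p], probe_parts[p]);
        return 0;
    }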

Summary

Professor Rabl's lecture emphasizes the need for specialized hardware and a thorough understanding of the underlying architecture to meet the rising demands for data processing in today's data-driven world. For example, he demonstrated that the time a typical CPU needs for a join operation can be significantly reduced by utilizing the parallel processing capabilities of a GPU in combination with specialized interconnects.

References

[1] HPI. Available online: hpi.de/rabl/team/prof-dr-tilmann-rabl.html
[2] Tilmann Rabl. 2024. Lecture: Data Processing on Modern Hardware.
[3] Chris Gregg and Kim M. Hazelwood. 2011. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer. In ISPASS. IEEE, 134–144. doi.org/10.1109/ISPASS.2011.5762730
[4] S. Tyagi and M. Swany. 2023. Flexible Communication for Optimal Distributed Learning over Unpredictable Networks. In 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 925–935. doi.org/10.1109/BigData59044.2023.10386724
[5] NVIDIA NVLink. Available online: www.nvidia.com/en-us/data-center/nvlink/
[6] Spyros Blanas, Yinan Li, and Jignesh M. Patel. 2011. Design and evaluation of main memory hash join algorithms for multi-core CPUs. In SIGMOD. ACM, 37–48. doi.org/10.1145/1989323.1989328
[7] Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, and Volker Markl. 2020. Pump up the volume: Processing large data on GPUs with fast interconnects. In SIGMOD. ACM, 1633–1649.