Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Flexible Vector Processing for Data Science Engines

Wolfgang Lehner, TU Dresden

The Speaker

Prof. Dr.-Ing. Wolfgang Lehner is a German data scientist and currently head of the Database Technology group at Technische Universität Dresden.

He received his Master's degree in Computer Science in 1995 from the University of Erlangen-Nuremberg. After earning his master's degree, he continued as a research assistant in the Database Systems group in Erlangen-Nuremberg. After receiving his Ph.D., he continued his research within the Business Intelligence group at the IBM Almaden Research Center in San Jose, where he was involved in projects on adding materialized view support and multiple-query optimization techniques to the core engine of the IBM DB2/UDB database system.

After that, he returned to Erlangen-Nuremberg as a senior researcher within his former group, researching the topic of exploiting database technologies to support complex message-based notification systems. From October 2000 to February 2002 he held the professorship for database systems at the University of Halle-Wittenberg. During this time, he also finished his postdoctoral studies with a thesis on subscription systems. Since the end of 2002, he has been teaching, researching, and working on various industrial projects at TU Dresden. His field ranges from hardware-near processor design to visually supported data exploration. In his career as a scientist, he has collaborated on over 509 publications and has been cited over 8,375 times so far, achieving a notable h-index of 45.

Summary

Written by Adnan Kadric, Caterina Mandel & Gerd Rössler

Introduction

In an increasingly digital world, data is increasingly becoming the most important resource. Nowadays, data systems have to handle data from all kinds of digital sources: text vs. images, structured vs. unstructured, raw vs. edited vs. defined. Therefore, the slogan "Variety is king" is used to describe the data used in today's systems.

"Variety is King" does not only refer to the data. It also refers to the hardware components that are responsible for processing, storing, and transmitting data. Together with volume and variety, variability forms the three dimensions of the design space for architectures fulfilling the need of current applications. The architectures of today's data systems thus do not only have to support volume and variety, but also variability. Volume refers to the scalability of systems: single query vs. overall system performance, scheduling and data placement, and concurrency control, while variety describes the degree of heterogeneity of the components used. The term variability denotes the ability to reconfigure the system at runtime and by that aiming at reducing, increasing or also replacing components used to build a system. This architecture is also called Scale-"Flex“ and brings up the question on the software side of how to build a highly scalable, highly elastic, and highly robust data-science engine? The challenge is to build a solid system on top of a very elastic composition of hardware components.

In the following, we will take a look at the underlying technology and protocols. This will be followed by an in-depth look at the approach presented to enable composable systems. We will then present vector libraries provided to enable efficient vectorization.

Background

Before taking a closer look at the approach of the research group at TU Dresden, we will provide an overview of the underlying technique and protocols.

Composable systems, or composable infrastructure, are a software-defined approach to disaggregation. Disaggregation abstracts resources from the hardware, so that the developer is not bound to the limits of the underlying machines. Software platforms or designated APIs group resources into pools according to the needs of the application and make them available on demand. Instead of being accessible only through a single computer or server, the resources are virtualized into common memory spaces spanning the hardware.

Compute Express Link (CXL) is an open interconnect standard for enabling efficient, coherent memory accesses between a host, such as a CPU, and a device, such as a hardware accelerator, that is handling an intensive workload. The protocol is designed to be an industry-open standard interface for high-speed communication. The idea behind the standard is that the application sees only one contiguous memory region that it can read, write, and access directly. This is accomplished by translating the different memory units into a single virtual address space. The standard defines three protocols (CXL.io, CXL.mem, CXL.cache) that are dynamically multiplexed together before being transported via PCIe 5.0.

Approach

The research group under Prof. Lehner has been addressing the research question of how to build a highly scalable, highly elastic, and highly robust data-science engine for years and has come up with a twofold solution on different scales. As the solution on the smaller scale, referred to as "xPU-Scale", the group presents MorphStore. MorphStore is an in-memory smart storage system that is single-core by design. The focus is on compression and vectorization: MorphStore supports numerous lightweight integer compression algorithms as well as vectorization concepts. Consequently, the query processing concept is also compression-aware. The underlying processing model is operator-at-a-time, which allows parallelization at a low abstraction level.

The solution on the large scale, "Rack-Scale", is more extensive: the aim of ERIS is to flexibly combine MorphStore systems. The combination of both is meant to enable Scale-"Flex" and thereby pave the path to composable systems. Since the properties of MorphStore are also used in ERIS, only the single-core approach will be presented in the following. MorphStore's compression-aware processing rests on the following principles:

  1. All intermediate results generated during query processing should be able to be represented using a lightweight data compression algorithm. This enables the continuous use of compression throughout the query execution.
  2. Data characteristics affect the choice of the compression scheme. However, since these characteristics can change during processing, it should be possible to find a suitable scheme for each intermediate result.
  3. Complete decompression of the input data should be avoided, as this would severely limit the benefits achievable through compression.
  4. To reduce the computational cost of compression and decompression, the processing model heavily applies vectorization. On mainstream CPUs, this vectorization is done using SIMD extensions.

Vectorization in Database Systems

When applications have to perform the same instructions on many data points, modern hardware can make use of Single Instruction Multiple Data (SIMD) execution. SIMD allows loading several data points into a vector register and performing the same operation on all of them simultaneously. A common use case for this parallelization technique is changing the brightness of an image, where a value is added to the RGB values of each pixel. It turns out that common database operations can also be expressed as sequences of SIMD operations. This results in significant performance improvements, as the number of instructions for loading data and performing operations is heavily reduced. In the following, we will present various use cases in which vectorization proves to be beneficial.
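To make the idea concrete, here is a minimal sketch in C++ using SSE2 intrinsics; the function name and the scalar tail handling are illustrative only and not taken from any particular engine:

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Add a constant to every value of a column, four 32-bit values per instruction.
void add_constant_simd(uint32_t* values, std::size_t n, uint32_t c) {
    const __m128i vc = _mm_set1_epi32(static_cast<int>(c));  // broadcast the constant
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(values + i));
        v = _mm_add_epi32(v, vc);                             // four additions at once
        _mm_storeu_si128(reinterpret_cast<__m128i*>(values + i), v);
    }
    for (; i < n; ++i) values[i] += c;                        // remaining values, scalar
}
```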

Figure 1: RLE - Performance Comparison

Run Length Encoding (RLE)

Run Length Encoding is a common technique for data compression. Subsequent occurrences of the same value are combined into a run. A run is represented only by its run value and the run length. Thus, storage cost is decreased if the data contains long runs of the same value.

Computing the run-length encoding is not a particularly complex task. However, utilizing SIMD operations can speed up the computation: multiple subsequent values are loaded into a SIMD register and concurrently compared to the current run value. This way, every instruction can process up to four values (assuming 32-bit integer values and 128-bit SIMD registers). This approach works well on datasets with long run lengths. However, if the run lengths are short, a lot of load instructions are performed.
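A minimal sketch of this idea, using SSE2 intrinsics and hypothetical names (not MorphStore's actual implementation), might look as follows:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>
#include <vector>
#include <emmintrin.h> // SSE2

// Compress a column into (run value, run length) pairs. Runs are extended four
// values at a time by comparing a SIMD register against the current run value.
std::vector<std::pair<uint32_t, uint32_t>> rle_encode(const uint32_t* in, std::size_t n) {
    std::vector<std::pair<uint32_t, uint32_t>> runs;
    std::size_t i = 0;
    while (i < n) {
        const uint32_t runValue = in[i];
        uint32_t runLength = 1;
        ++i;
        const __m128i vRun = _mm_set1_epi32(static_cast<int>(runValue));
        // Extend the run in blocks of four as long as all four values match.
        while (i + 4 <= n) {
            __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
            __m128i eq = _mm_cmpeq_epi32(v, vRun);
            if (_mm_movemask_epi8(eq) != 0xFFFF) break;   // at least one mismatch
            runLength += 4;
            i += 4;
        }
        // Finish the run with scalar comparisons.
        while (i < n && in[i] == runValue) { ++runLength; ++i; }
        runs.emplace_back(runValue, runLength);
    }
    return runs;
}
```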

New algorithms have been proposed to address this issue. Their cost is dominated by the vector operations themselves rather than by the number of load instructions, so performance is no longer bound to the run-length characteristics of the dataset. Choosing the "best" algorithm is a question of robustness vs. maximum performance; one should consider multiple aspects such as the dataset and the capabilities of the underlying hardware.

Figure 2 : BitWeaving - Performance Comparison

BitWeaving

Filtering columns is an everyday task of all database systems. Filtering amounts to evaluating a predicate (e.g. <, >, ≥, ≤, =) for each value in the same way. As explained, this offers a great opportunity for parallelization (e.g. using SIMD instructions).

While plain SIMD processing already yields performance gains, more specialized algorithms, such as BitWeaving, have been proposed. BitWeaving makes use of "intra-cycle" parallelism. The main idea is to store multiple values in one processor word and then perform the scan on all of them at once. If all values can be encoded with 3 bits, this approach can fit eight values into one 32-bit processor word and evaluate the predicate on them simultaneously. Compared to a scalar scan, where each value is loaded individually (wasting 29 bits per processor word), BitWeaving filters up to an order of magnitude faster. The ratio of the processor's word length to the bit length of the encoded values determines the performance increase.
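The following minimal sketch illustrates this word-packing idea for an equality predicate, assuming 3-bit codes stored in 4-bit fields whose top bit serves as a delimiter. It is a simplification in the spirit of BitWeaving/H; all constants and names are illustrative, not the original implementation:

```cpp
#include <cstdint>
#include <cstdio>

// Eight 3-bit codes packed into one 32-bit word, each in a 4-bit field whose
// most significant bit acts as a delimiter and is stored as 0.
constexpr uint32_t kFieldBits = 4;
constexpr uint32_t kDataMask  = 0x77777777u; // low 3 bits of every field
constexpr uint32_t kDelimMask = 0x88888888u; // delimiter bit of every field

// Broadcast a 3-bit constant into all eight fields.
uint32_t broadcast(uint32_t code) {
    uint32_t w = 0;
    for (int i = 0; i < 8; ++i) w |= (code & 0x7u) << (i * kFieldBits);
    return w;
}

// Evaluate "field == code" for all eight fields with a handful of instructions.
// The returned word has the delimiter bit set for every matching field.
uint32_t equal_scan_word(uint32_t packed, uint32_t code) {
    uint32_t z = packed ^ broadcast(code);          // a field is 0 iff it matches
    uint32_t carry = (z & kDataMask) + kDataMask;   // delimiter becomes 1 iff field != 0
    return ~carry & kDelimMask;                     // delimiter is 1 iff field == code
}

int main() {
    // Pack the codes 5, 2, 5, 7, 0, 5, 3, 1 (lowest field first) and scan for 5.
    const uint32_t codes[8] = {5, 2, 5, 7, 0, 5, 3, 1};
    uint32_t word = 0;
    for (int i = 0; i < 8; ++i) word |= codes[i] << (i * kFieldBits);

    const uint32_t hits = equal_scan_word(word, 5);
    for (int i = 0; i < 8; ++i)
        std::printf("field %d matches: %u\n", i, (hits >> (i * kFieldBits + 3)) & 1u);
    return 0;
}
```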

Challenges

At first glance, vectorization seems to be a powerful tool for speeding up standard database tasks. And it is! Filtering or hashing can be implemented in a parallel manner and therefore executed faster. However, developers face a lot of new challenges when it comes to SIMD programming.

First and foremost, hardware diversity: different processor architectures provide different SIMD instruction sets and vector lengths. As we have seen, the algorithms heavily depend on these parameters. That is why programmers have to adapt their code and provide a separate implementation for each architecture to maximize performance. It may even be the case that an architecture does not support vectorization at all. Combining different machines into a flexible and scalable system therefore comes with its own problems.

Likewise, the performance of optimized algorithms like BitWeaving and RLE is strongly influenced by the input data (e.g. bit length of values or average run length) and hardware constraints (e.g. processor word lengths or vector sizes). Choosing the best algorithm for a given system and input is not easy, and predicting the speedup for different vector sizes is not trivial. We will now see how the team around Wolfgang Lehner tries to handle these problems.

Vector Libraries

At the moment, vector processing is still bound to constraints of the underlying hardware: for example, the vector size and the syntactic dialect of the underlying hardware determine the efficiency of vectorization. To make vectorization as efficient as possible, the idea is to provide the same kind of virtual view that operating systems provide for other resources.

Thus, we want to decouple what is available on the hardware side from what is used or required on the software side. This decoupling is meant to be achieved by providing three libraries: the Template Vector Library (TVL), the Virtual Vector Library (VVL), and the Vector Sharing Library (VSL). TVL is built on top of the hardware layer, while VVL and VSL are both built on top of TVL (as shown in Figure 3).

Figure 3 : Vector Libraries

Template Vector Library

The first step towards the decoupling is to abstract the details of the underlying hardware; this can be seen as the syntactic prerequisite. The Template Vector Library (TVL) provides this kind of abstraction for vector-based programming. That way, TVL allows a single code base for database operators. Within the code base, only one line, containing the processing style, has to be adapted to the underlying hardware.

TVL works as follows: a set of primitives is mapped to different backends supporting different hardware platforms. The program is implemented against the primitives, and the mapping takes place during compilation. This compile-time mapping makes it possible, for example, to generate and debug a scalar version of the code and then execute the vectorized version, which is particularly useful for debugging. The primitives are parameterized by the processing style, which includes the hardware dialect (e.g. scalar) and the vector size. As mentioned above, this specification is the only part of the code base that has to be changed. TVL is open source and available on GitHub.
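A minimal sketch of the idea behind such a template-based abstraction is shown below; the names scalar_style, sse128_style, and the primitive signatures are hypothetical and only illustrate the concept of a processing style, while the actual TVL primitives and interfaces differ:

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h> // SSE2

// A "processing style" bundles a vector type and the primitives for it.
struct scalar_style {
    using vector_t = uint32_t;
    static constexpr std::size_t lanes = 1;
    static vector_t set1(uint32_t x)          { return x; }
    static vector_t load(const uint32_t* p)   { return *p; }
    static uint32_t cmp_eq_mask(vector_t a, vector_t b) { return a == b ? 1u : 0u; }
};

struct sse128_style {
    using vector_t = __m128i;
    static constexpr std::size_t lanes = 4;
    static vector_t set1(uint32_t x)          { return _mm_set1_epi32(static_cast<int>(x)); }
    static vector_t load(const uint32_t* p)   { return _mm_loadu_si128(reinterpret_cast<const __m128i*>(p)); }
    static uint32_t cmp_eq_mask(vector_t a, vector_t b) {
        // one result bit per 32-bit lane
        return static_cast<uint32_t>(_mm_movemask_ps(_mm_castsi128_ps(_mm_cmpeq_epi32(a, b))));
    }
};

// The operator is written once against the primitives; switching between the
// scalar and the vectorized build only changes the processing style.
template <class Style>
std::size_t count_equal(const uint32_t* data, std::size_t n, uint32_t key) {
    const auto vkey = Style::set1(key);
    std::size_t hits = 0, i = 0;
    for (; i + Style::lanes <= n; i += Style::lanes)
        hits += __builtin_popcount(Style::cmp_eq_mask(Style::load(data + i), vkey)); // GCC/Clang builtin
    for (; i < n; ++i) hits += (data[i] == key);  // scalar tail
    return hits;
}
```

With this pattern, count_equal<scalar_style>(...) and count_equal<sse128_style>(...) run the same operator code on different backends, mirroring the single-line change of the processing style described above.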

Experiments show that the overhead of this compile-time mapping is negligible.

Virtual Vector Library

After abstracting the syntactic specifics of the underlying hardware, the next step is to decouple the application vector size from the hardware vector size. The Virtual Vector Library (VVL) is built on top of TVL and enables mapping virtual vectors to hardware implementations, using the same primitives as TVL. This enables parallel, sequential, or mixed execution, depending on what is beneficial for the application at hand, with the goal of allowing runtime reconfiguration. To enable the mapping of virtual vectors, the processing style is extended by the virtual vector style, which includes information about the vector length, the vector extension, and the thread count. That way, the system can resolve larger virtual vectors into smaller ones on the hardware side.

Again, as only the processing style has to be adapted, there is no need to change the actual program logic. At the same time, changing only the processing style leads to a high degree of freedom: arbitrary combinations of the three dimensions (elements per hardware vector, vectors per core, and the number of threads/cores running in parallel) are possible. Thus, the mapping of virtual vectors becomes an optimization problem.
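As a hypothetical sketch of the virtual-vector idea, the following shows a 512-bit "virtual" vector backed by four 128-bit hardware vectors; the struct and function names are invented for illustration, whereas the real VVL extends the TVL processing style instead:

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h> // SSE2

// A 512-bit "virtual" vector (16 x uint32) backed by four 128-bit hardware vectors.
struct virtual_vec512 {
    __m128i part[4]; // four hardware vectors back one virtual vector
};

virtual_vec512 load512(const uint32_t* p) {
    virtual_vec512 v;
    for (int k = 0; k < 4; ++k)
        v.part[k] = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + 4 * k));
    return v;
}

virtual_vec512 add512(virtual_vec512 a, virtual_vec512 b) {
    virtual_vec512 r;
    // Sequential mapping onto one core; the same four steps could just as well be
    // distributed across threads, which is exactly the degree of freedom described above.
    for (int k = 0; k < 4; ++k)
        r.part[k] = _mm_add_epi32(a.part[k], b.part[k]);
    return r;
}

void store512(uint32_t* p, virtual_vec512 v) {
    for (int k = 0; k < 4; ++k)
        _mm_storeu_si128(reinterpret_cast<__m128i*>(p + 4 * k), v.part[k]);
}
```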

Vector Sharing Library

Since experiments show that bigger vectors are not always better and smaller vectors can also be beneficial, depending on the application, there should also be a way to perform the mapping in the opposite direction. The Vector Sharing Library (VSL) therefore enables small vectors on the application layer to be merged and executed jointly on the hardware layer. This is done by using vector registers as hardware resources for sharing data. VSL is, similar to VVL, also built on top of TVL (as shown in Figure 3).

This approach especially applies to highly selective queries that start with a lot of data but end with only a few rows. Large vectors are beneficial at the beginning, but towards the end large parts of the vectors remain unused. The idea is to optimize the workload through multiple-query execution, which means evaluating multiple predicates against one data object, thus sharing the query instead of the data. Accordingly, this is referred to as SIMQ: Same Instruction Multiple Query. An obvious shortcoming of this approach is that there is no more data parallelism. To address this, a combination of both approaches is proposed. The design space is then defined by the degree to which data sharing and query sharing, respectively, are used. By moving within this design space, the best degree of fine-grained parallelism can be found.
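A minimal sketch of the SIMQ idea, with invented names and result layout, might look as follows: the four lanes of one 128-bit register hold the equality keys of four different queries, and each data value is broadcast and compared against all of them with a single instruction.

```cpp
#include <cstdint>
#include <cstddef>
#include <emmintrin.h> // SSE2

// Evaluate four equality predicates (one per query) over the same column.
void simq_equal_scan(const uint32_t* data, std::size_t n,
                     const uint32_t keys[4],   // one key per query
                     uint64_t hitCount[4]) {
    const __m128i vkeys = _mm_loadu_si128(reinterpret_cast<const __m128i*>(keys));
    for (std::size_t i = 0; i < n; ++i) {
        const __m128i vdata = _mm_set1_epi32(static_cast<int>(data[i])); // broadcast one value
        const __m128i eq    = _mm_cmpeq_epi32(vdata, vkeys);             // four queries at once
        const int mask      = _mm_movemask_ps(_mm_castsi128_ps(eq));     // one bit per query
        for (int q = 0; q < 4; ++q) hitCount[q] += (mask >> q) & 1;
    }
}
```

Note that each iteration processes only a single data value, illustrating the loss of data parallelism; mixing, for example, two data lanes with two query lanes corresponds to moving within the design space described above.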

Coming back to the overarching goal, it can be said that vectors are used as a technical vehicle for optimization. Through VSL, not only data parallelism but also the sharing of queries becomes applicable.

Summary

Given the challenge that applications and infrastructures have to be provisioned and scaled ever faster, composable systems in combination with vectorization are a promising approach. Providing cache-coherent memory across multiple hardware components opens up new possibilities at the software level of processing data. The idea behind vectorization, that large amounts of data can be processed in parallel with a small set of instructions, is a promising programming paradigm. But different hardware also brings different problems that need to be solved.

The provided vector libraries aim at decoupling what is provided by the underlying hardware from what is required on the software side. While the Template Vector Library provides the syntactic prerequisites by abstracting the details of the underlying hardware, the Virtual Vector Library and the Vector Sharing Library are concerned with decoupling the vector size used by the application from the hardware vector size. VVL does this by mapping virtual vectors to hardware vectors, thus enabling the use of larger vectors than provided by the hardware; VSL enables merging and jointly executing smaller vectors. In general, the vector libraries allow leveraging vectors as a technical vehicle for optimization.

The approach of the research group shows great potential and leaves room for many new ideas. It remains to be seen what ideas the research team will implement in the future and what results they will present. If interested, the current status of the research can be followed here.

References

[1] Wolfgang Lehner. (2021). Flexible Vector Processing for Database Engines - Presentation.

[2] Li, Y., & Patel, J. (2013). Bitweaving: Fast scans for main memory data processing. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (pp. 289–300).

[3] TVL GitHub Repository (2021). github.com/MorphStore/TVLLib.git.