Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Dimensions of Hardware Parallelism and Exploiting Them for Data-Intensive Systems

Pinar Tözün, IT University of Copenhagen, Denmark

Abstract

Pinar Tözün's research revolves around the hardware-software ecosystem for data-intensive systems. Out of that broad field, she gave a talk on different parallelism techniques and their characteristics. Each of them exploits different hardware mechanisms and therefore requires different handling. In this summary, we go through each technique, explain how it works, and outline its strengths and weaknesses.

Before starting with the techniques, we summarize the evolution of their technical basis, the CPU. At first, CPUs followed a trend toward more complex, and therefore faster, cores. This trend ran into problems, most notably the exponentially increasing power consumption at higher operating frequencies, known as the power wall. The trend therefore shifted to CPUs with more cores running at the same speed. Finally, as these CPUs also hit hardware-specific limits, systems started using multiple CPUs for a task.

Biography

Pinar Tözün is an Associate Professor at the IT University of Copenhagen. In her research, she focuses on the hardware-software ecosystem for data-intensive systems.

A recording of the presentation is available on Tele-Task.

Summary

written by Tim Kuffner, Hendrik Patzlaff, and Nils Thamm

While today's systems commonly consist of multiple multicore CPUs, parallelization on a single core still plays an important role. For that reason, we first explain different parallelization techniques on a single core, the so-called implicit parallelism. The second part then focuses on parallelism between multiple cores or even multiple CPUs, the so-called explicit parallelism.

The reason why parallelism on a single core still plays an important role is that instructions often spend most of their time waiting for data-related work of other hardware components. If scheduled correctly, these idle CPU cycles can be used by other instructions. Accessing data from the hard drive can take thousands of cycles, which shows that there is a lot of potential for parallelizing instructions so that this computation time is not lost.

Figure 1: Example of instruction pipelining (Ailamaki, 2017)

Luckily for the developer, this kind of parallelism is mostly taken care of by the hardware vendor; as the name suggests, it happens implicitly at the system level. The most basic way to achieve it is called instruction pipelining. With instruction pipelining, the CPU schedules instructions so that their different stages overlap. An instruction usually does not consist only of the pure execution, but requires fetching and decoding beforehand and a write back to memory afterward. Figure 1 shows an example of how the stages of several instructions can overlap: while one instruction is in its memory-access stage, the CPU writes back the result of another, decodes a third, and fetches a new one. This results in a lower overall cycle count for multiple instructions.

Also handled implicitly, from the application's point of view, is the scheduling of multiple tasks at once on different cores, as long as cores are free. On each core, instruction pipelining can again be used at the same time, so in theory the performance gain from pipelining is multiplied by the number of cores. In practice, instructions cannot be pipelined or distributed across cores in such a simple way.

Both of these techniques obviously apply to applications running in parallel. However, it is also possible to use this parallelization for instructions from the same application. Not all instructions of a single application depend on each other, so non-dependent instructions can be scheduled at the same time. Moreover, this characteristic makes it possible to change the execution order completely. Most modern CPUs use this so-called out-of-order execution to schedule instructions depending on the availability of their data, which reduces the stall time spent waiting for data even further.
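As a small illustration of this effect (not from the talk), the following C++ sketch sums the same array twice: once with a single accumulator, where every addition depends on the previous one, and once with four independent accumulators whose additions an out-of-order core can overlap. The function names and data are invented for the example.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// One long dependency chain: each add must wait for the previous result.
double sum_chained(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent chains: the core can execute several adds in parallel.
double sum_independent(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}

int main() {
    std::vector<double> v(1 << 24, 1.0);
    auto time = [&](double (*f)(const std::vector<double>&), const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        double s = f(v);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("%s: sum=%.0f, %lld ms\n", name, s, ms);
    };
    time(sum_chained, "chained");
    time(sum_independent, "independent");
}
```

On typical out-of-order hardware, the version with independent accumulators tends to be noticeably faster at the same optimization level, precisely because the CPU can keep several additions in flight at once.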

Looking further at multiple applications running on an operating system, simultaneous multithreading can bring additional benefits. Modern, more advanced CPUs allow two or more instruction streams to share a core at the same time by enlarging the register set so that the state of each stream can be kept.

Without keeping this state, a context switch would be necessary each time the CPU switches the executed instruction stream. During such a switch, the old register and instruction values are evicted and the new ones are loaded from memory or the caches. While this is still cheaper than waiting for data from memory, it costs a few cycles. With simultaneous multithreading active, this penalty can be avoided: the CPU can switch every cycle at no cost. On the other hand, the other resources, such as the caches, cannot easily be multiplied in capacity. Even with increased sizes, each application gets a smaller share than it would without simultaneous multithreading, and if handled incorrectly this results in even lower performance. It is therefore not uncommon to disable this feature, since the operating system usually takes care of the management, which makes it less predictable.
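A minimal way to see simultaneous multithreading from software, assuming a standard C++ toolchain, is to query the number of logical hardware threads; the sketch below only reads this value and is not part of the original talk.

```cpp
#include <cstdio>
#include <thread>

int main() {
    // Reports the number of *logical* hardware threads. With SMT enabled this is
    // typically twice the number of physical cores; with SMT disabled it equals
    // the core count. A return value of 0 means "unknown".
    unsigned logical = std::thread::hardware_concurrency();
    std::printf("logical hardware threads: %u\n", logical);
}
```

Sizing a thread pool to this number fills all logical threads, but because SMT siblings share caches and execution units, memory-heavy workloads are sometimes faster with one thread per physical core, which is one reason the feature is occasionally disabled, as noted above.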

Figure 2: Concept of single instruction, multiple data (Ailamaki, 2017)

Apart from parallelizing instructions, it is also possible to parallelize data. This means that one instruction processes multiple data elements at the same time. Figure 2 shows the conceptual idea of the single instruction, multiple data (SIMD) technique: for each instruction, several data elements are loaded at once. Since the benefit highly depends on the type of instruction and data, modern CPUs only provide the hardware; the usage and management have to be done by the application or operating system. Still, it counts as an implicit parallelization technique because the parallelization leverages hardware-specific features and is executed on a single core.
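As a hedged illustration of how an application can use the SIMD hardware directly, the following C++ sketch adds two float arrays with SSE intrinsics (x86 only); the array contents are arbitrary example data.

```cpp
#include <immintrin.h>   // SSE intrinsics (x86)
#include <cstdio>

int main() {
    alignas(16) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(16) float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    alignas(16) float c[8];

    // Each iteration issues one SIMD add that processes four floats at once
    // (128-bit SSE registers).
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);   // single instruction, four additions
        _mm_store_ps(&c[i], vc);
    }

    for (float x : c) std::printf("%.0f ", x);
    std::printf("\n");
}
```

Compilers can often emit such instructions automatically (auto-vectorization), but database kernels frequently use intrinsics explicitly to make sure the data-parallel form is actually generated.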

How exactly can we exploit these techniques for database systems?

In the following, we focus on transaction processing systems. On a 4-way issue Intel processor you can theoretically execute four instructions per cycle, but most applications achieve not much more than one instruction per cycle, so there is a lot of room for improvement. The reason for the low number of instructions per cycle is the high stall time, the time in which no instruction can be processed because memory has to be accessed for instructions or data. This behavior does not only occur with the Shore-MT transaction processing system; commercial systems and in-memory systems with a newer code base also spend a large part of their time in stall cycles. These systems can be very good and achieve high throughput, but you could get even more out of them by using the hardware as well as possible.
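To make the "instructions per cycle" figure tangible, the following Linux-only C++ sketch reads the hardware instruction and cycle counters around a placeholder workload via the perf_event_open system call; it is a minimal illustration, not the methodology behind the cited measurements.

```cpp
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Open one hardware counter for the calling thread, user space only.
static int open_counter(uint64_t config) {
    perf_event_attr attr{};
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main() {
    int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    int instrs = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (cycles < 0 || instrs < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
    ioctl(instrs, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

    // Placeholder workload whose IPC we want to inspect.
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < 100000000ULL; ++i) sink += i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t c = 0, n = 0;
    if (read(cycles, &c, sizeof(c)) != (ssize_t)sizeof(c) ||
        read(instrs, &n, sizeof(n)) != (ssize_t)sizeof(n)) {
        std::perror("read");
        return 1;
    }
    std::printf("instructions per cycle: %.2f\n", (double)n / (double)c);
}
```

On Linux, the perf stat command-line tool reports the same counters without writing any code.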

This problem can be tackled by looking at what the stall cycles are spent on. In most cases, they are caused by instruction-related memory accesses. If you then take a closer look at these memory accesses, you will see that most instances of a transaction generally access the same instructions, but almost never the same data. Even across different transactions this phenomenon occurs, because transactions are made up of different atomic components that use the same instructions.

You can take advantage of the fact that instances of the same transaction have many instructions in common by processing transactions in such a way that the instructions in the L1 instruction cache do not need to be reloaded each time. To do so, transactions are broken down into pieces small enough that the instructions of each piece fit into the L1 cache. If more than one transaction is to be processed, the first part of a transaction is executed on one core and the transaction then migrates to another core, on which its second part is executed. The instructions for the first part stay in the L1 cache of the first core and can immediately be reused by the next transaction starting its first part there. The instructions needed for the following parts can likewise be kept warm in the L1 caches of the other cores, avoiding the downtime of waiting for instructions to be fetched. As a result, the caches need to be refilled less often and there is less stall time for retrieving instructions.
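The following sketch illustrates this idea in a very reduced form, assuming hypothetical two-phase transactions: each worker thread specializes in one phase, so its instruction footprint stays small, and transactions migrate from one worker's queue to the next. Thread pinning, real transaction logic, and the actual mechanism presented in the talk are omitted.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical transaction: just an id.
struct Txn { int id; };

// Small blocking queue used to hand transactions from one phase to the next.
template <typename T>
class BlockingQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) { { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); } cv_.notify_one(); }
    void close()   { { std::lock_guard<std::mutex> l(m_); closed_ = true; } cv_.notify_all(); }
    bool pop(T& out) {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front()); q_.pop();
        return true;
    }
};

int main() {
    // Two phases, e.g. "index lookup" and "update + commit"; each runs on its own
    // thread (ideally pinned to its own core, so its instructions stay hot in L1-I).
    BlockingQueue<Txn> phase1_in, phase2_in;

    std::thread phase1([&] {
        Txn t;
        while (phase1_in.pop(t)) {
            // ... phase-1 logic with a small instruction footprint ...
            phase2_in.push(t);          // migrate the transaction to the next core
        }
        phase2_in.close();
    });

    std::thread phase2([&] {
        Txn t;
        while (phase2_in.pop(t)) {
            // ... phase-2 logic, reusing instructions cached by earlier transactions ...
            std::printf("txn %d committed\n", t.id);
        }
    });

    for (int i = 0; i < 8; ++i) phase1_in.push(Txn{i});
    phase1_in.close();
    phase1.join();
    phase2.join();
}
```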

However, there are also downsides to this approach. The first transaction has to load each of its instructions, because no transaction before it has cached them, and it additionally bears the cost of migrating to another core. So if low latency is more important than high throughput, this method is not recommended. Furthermore, when migrating to another core, the data must be reloaded as well. Loading the data when migrating to a core on the same socket is not as expensive as reloading the instructions; when migrating to a core on a different socket, however, this cost becomes significant and the entire approach is no longer substantially faster.

So as long as we only add more cores on a single socket, the described method gives an almost ideal ratio of throughput to the number of threads. But as soon as we use several processors on different sockets, as is often the case in newer systems, the high cost of accessing remote data comes into play and worsens that ratio. It is therefore important to find better metrics for reasoning about scalability, because the throughput that is possible on paper does not necessarily tell us whether future systems will scale as well as today's systems do.

When looking deeper into the cause of this effect, a critical section analysis is the next step to take. A critical section is a section of the code that accesses shared data. To avoid conflicts, such a section can only be executed by one thread at a time. There are several kinds of critical sections and strategies to tackle them, which are described in the following paragraphs.

The first kind is the unbounded critical section, which occurs when shared data is accessed and every thread may try to enter the critical section at any time. This leads to the problem that contention for the critical section increases when the system is scaled up, that is, when the number of processors or cores is increased, because all running threads could try to enter it at the same point in time.
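A minimal sketch of an unbounded critical section, using a plain mutex around shared state that every worker may enter at any time; the counter merely stands in for shared system state such as a lock table.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Shared state guarded by a single lock: every thread that wants to update it
// enters the same critical section, so contention grows with the thread count.
std::mutex state_mutex;
long shared_counter = 0;

void worker(int ops) {
    for (int i = 0; i < ops; ++i) {
        std::lock_guard<std::mutex> guard(state_mutex);  // unbounded: any thread, any time
        ++shared_counter;
    }
}

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < n; ++t) threads.emplace_back(worker, 1000000);
    for (auto& th : threads) th.join();
}
```

With more threads, more of them queue up at this one lock, which is exactly the scaling problem described above.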

The next kind is the cooperative critical section. In this case, the work of different threads is aggregated, and the aggregated instructions, such as commits in a database, are all carried out together at a later point in time. This way of coping with critical sections means that fewer threads try to enter the critical section in an uncoordinated fashion. Its upside is increased parallelism, as many different requests can be handled at once; a disadvantage is the difficulty of managing the access to the critical section and the aggregation itself. One example where this is used are group commits.
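The sketch below shows the cooperative pattern in the spirit of a group commit, under the assumption that the expensive part is a single "log flush": workers only append their commit request to a shared buffer, and one thread flushes the whole batch at once. It is an illustration, not the implementation of any particular system.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::mutex buffer_mutex;
std::condition_variable buffer_cv;
std::vector<int> pending_commits;   // commit requests waiting to be flushed
bool done = false;

// Worker side: the critical section only enqueues a request (short and cheap).
void commit(int txn_id) {
    std::lock_guard<std::mutex> guard(buffer_mutex);
    pending_commits.push_back(txn_id);
    buffer_cv.notify_one();
}

// Committer side: one flush covers a whole batch (stand-in for a single log write).
void group_committer() {
    std::unique_lock<std::mutex> lock(buffer_mutex);
    while (!done || !pending_commits.empty()) {
        buffer_cv.wait(lock, [] { return done || !pending_commits.empty(); });
        std::vector<int> batch;
        batch.swap(pending_commits);          // take the whole group
        lock.unlock();
        if (!batch.empty())
            std::printf("flushed %zu commits\n", batch.size());
        lock.lock();
    }
}

int main() {
    std::thread committer(group_committer);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([t] { for (int i = 0; i < 1000; ++i) commit(t * 1000 + i); });
    for (auto& w : workers) w.join();
    { std::lock_guard<std::mutex> g(buffer_mutex); done = true; }
    buffer_cv.notify_one();
    committer.join();
}
```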

The last kind presented is the fixed critical section, for which there is a given, fixed order of access. It corresponds to the producer-consumer problem, and no parallelism is involved. The goal is to reduce the number of unbounded critical sections by transforming them into cooperative and fixed ones. When looking at shared-everything database management systems, there are two places threads can access where critical sections occur: the shared system state, which monitors the proper execution of transactions and commits, and the shared data space. If the data space is not managed properly, a lot of unpredictable concurrent data accesses can happen.

To illustrate the number of critical sections, consider a test scenario in which a customer is probed and their balance is updated. This leads to more than 70 critical sections, of which more than 75% are unbounded. To minimize those, physiological partitioning (PLP) can be used. With PLP, worker threads are assigned specific data ranges that only they access, and the index structure is split into sub-indexes to grant fast access to the data.

This method is limited by the structure of the stored data. The data has to be separable in order to decide which thread is responsible for which data range. For example, names could be split alphabetically by their first letter. Deciding on the separation points is not easy and highly depends on the data, so good knowledge of the data is required. Sections of data could also be assigned to a group of threads, which makes some critical sections possible again.
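The following sketch shows the partitioning idea with the alphabetical example from above: requests are routed to one of four key ranges, and each range is processed by exactly one thread, so the data itself needs no latches. The request type and customer names are invented for the example; real PLP additionally partitions the index into sub-indexes.

```cpp
#include <array>
#include <cctype>
#include <cstdio>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical request: add `delta` to the balance of the customer called `name`.
struct Request { std::string name; double delta; };

constexpr int kPartitions = 4;   // e.g. names a-g, h-n, o-u, v-z

// Route a request to the worker that owns its key range.
int partition_of(const std::string& name) {
    char c = name.empty() ? 'a' : (char)std::tolower((unsigned char)name[0]);
    if (c < 'a' || c > 'z') c = 'a';
    return (c - 'a') / 7;        // 7 letters per range -> 0..3
}

int main() {
    std::vector<Request> requests = {
        {"alice", 5.0}, {"oscar", -2.0}, {"zoe", 1.5}, {"bob", 3.0}};

    // Dispatch phase: the only shared step is sorting requests into per-partition inboxes.
    std::array<std::vector<Request>, kPartitions> inbox;
    for (const Request& r : requests) inbox[partition_of(r.name)].push_back(r);

    // Execution phase: each partition (records + sub-index) is touched by exactly one
    // thread, so no latches are needed on the data itself.
    std::array<std::map<std::string, double>, kPartitions> balances;
    std::vector<std::thread> workers;
    for (int p = 0; p < kPartitions; ++p)
        workers.emplace_back([&, p] {
            for (const Request& r : inbox[p]) balances[p][r.name] += r.delta;
        });
    for (auto& w : workers) w.join();

    for (const auto& part : balances)
        for (const auto& [name, bal] : part)
            std::printf("%s: %.2f\n", name.c_str(), bal);
}
```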

By using PLP, the number of critical sections can be reduced by 70%. However, PLP does not help on a multicore machine with several processors: in such servers, the remaining unbounded critical sections, which are based on lock-free or atomic mechanisms, still become a bottleneck.

Having reduced the number of critical sections does not solve all problems, as sharing data among cores on different processors also leads to high latency: it is about ten times as high as accessing another core on the same processor.
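One common way to soften this effect is to place each partition's memory on the socket whose cores work on it. The sketch below, assuming Linux with libnuma installed (compile with -lnuma), only demonstrates the allocation call; it does not reproduce any system from the talk.

```cpp
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("NUMA not available on this system\n");
        return 0;
    }
    int nodes = numa_max_node() + 1;
    std::printf("NUMA nodes: %d\n", nodes);

    // Place each partition's memory on the socket whose cores will work on it,
    // so worker threads mostly see local-memory latency instead of remote accesses.
    const size_t partition_bytes = 64 * 1024 * 1024;
    for (int node = 0; node < nodes; ++node) {
        void* partition = numa_alloc_onnode(partition_bytes, node);
        if (partition == nullptr) continue;
        // ... build this partition's data structures inside the allocation ...
        numa_free(partition, partition_bytes);
    }
}
```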

A question for further research is the possibility of adding heterogeneity. Adding more and more cores to one processor does not work, as they cannot all be powered at the same time. One possible solution is to use many lightweight cores; these may, in the end, consume more power when running for long hours than high-performance processors running for a short period of time. An alternative is to use diverse cores, which is the better long-term solution, as specialized processors can take over the tasks they are best at.

The problems that arise when using diverse cores concern their proper exploitation: scheduling becomes difficult, and the complex energy management is not easy either and currently a challenge for further research.

In summary, hardware parallelism is a complex topic that needs to be considered when thinking about data-intensive work. Even though hardware gets faster every year, this alone does not tame the massive amounts of data being processed. There are several ways to handle parallelism, yet when writing software most people have ignored them and simply relied on hardware performance increases while developing new data processing algorithms. Putting more and more processors and cores into one system increases the amount of explicit parallelism, which can even slow the system down when the hardware is scaled up. In data processing, the growing amount of data leads to server farms being scaled up in order to increase computation speed; as described in this article, this does not always pay off and can cause problems if the underlying software is not taken into account. When designing systems, one should therefore always be aware of hardware parallelism and the effects that can result from it.

References:

Ailamaki, Anastasia / Erietta Liarou / Pinar Tözün / Danica Porobic / Iraklis Psaroudakis (2017). Databases on Modern Hardware: How to Stop Underutilization and Love Multicores. Synthesis Lectures on Data Management. Morgan & Claypool Publishers.