Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Data Structure Engineering

Viktor Leis, Uni Erlangen

Introduction of the speaker

Viktor Leis is a professor at the Friedrich-Alexander-Universität Erlangen-Nürnberg, where he has headed the chair for Data Management since April 2021. Generally, he is interested in high-performance database systems. For example, he currently works on LeanStore with his research group; LeanStore is a high-performance storage engine for flash SSDs.

Engaging overview of the lecture

  • Goal, problem, solution
  • Background
  • Main ideas, methodology, approach
  • Main results
  • Summary

Summary

Written by Quirin Ertz, Quentin Kuth & Francois Thibon

Data Structure Engineering

Data Structure Engineering is about making data structures (search trees, hash tables, priority queues) efficient and robust in practice. In theory, the emphasis is on asymptotic complexity (e.g., O(log n)), but that alone is not enough for good performance. In practice, because of the characteristics of real hardware, specializing data structures can bring large gains. Instead of taking ready-built implementations, data structures can still be optimized for special use cases and special requirements.

Motivational example:

We have a data structure on which we perform 10 million random lookups. Let’s compare which kind of tree is the most efficient: the Red-Black tree (std::map in C++, a type of tree optimized for main memory) or the B+tree (optimized for disk). The Red-Black tree takes nearly 9 seconds for all 10 million lookups, while the B+tree takes around 2 seconds. By the end of the lecture, we will be able to say why the B+tree is so much faster than the Red-Black tree.
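A minimal sketch of the std::map side of this benchmark (illustrative code, not the lecture’s actual benchmark; the B+tree measurement would swap std::map for a B+tree implementation):

  #include <algorithm>
  #include <chrono>
  #include <cstddef>
  #include <cstdint>
  #include <iostream>
  #include <map>
  #include <random>
  #include <vector>

  int main() {
      constexpr std::size_t n = 10'000'000;
      std::mt19937_64 rng(42);

      // Build the tree with n random 64-bit keys.
      std::map<std::uint64_t, std::uint64_t> tree;
      std::vector<std::uint64_t> keys;
      keys.reserve(n);
      for (std::size_t i = 0; i < n; ++i) {
          keys.push_back(rng());
          tree.emplace(keys.back(), i);
      }

      // Shuffle so that the lookups hit the tree in random order.
      std::shuffle(keys.begin(), keys.end(), rng);

      auto start = std::chrono::steady_clock::now();
      std::uint64_t checksum = 0;
      for (auto k : keys)
          checksum += tree.find(k)->second;
      auto end = std::chrono::steady_clock::now();

      std::cout << "checksum " << checksum << ", "
                << std::chrono::duration<double>(end - start).count()
                << " s for " << n << " random lookups\n";
  }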

 

  1. Understanding Modern CPUs

As Moore’s law shows us, the number of transistors in CPUs has exploded over the past 50 years: from the order of 10^3 in the early 1970s to the order of 10^10 in 2020. While the clock frequency of microprocessors is stagnating, engineers have found several other ways to increase speed: exploiting parallelism, increasing the number of cores, and using caches.

  • CPU counters

The CPU counters are really important hardware performance counters that help engineers understand and improve performance. CPU counters can count the number of instructions that have been executed in a specific code fragment, as well as the number of “mistakes” made by the CPU, such as cache misses and branch mispredictions.
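On Linux, these counters can be inspected with the perf tool; for example (a usage sketch, with ./lookup_benchmark as a placeholder for the compiled benchmark binary):

  perf stat -e instructions,cycles,cache-misses,branch-misses ./lookup_benchmark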

Looking at the CPU counter results for the motivational example (B+tree vs. Red-Black tree), we see that even though the Red-Black tree executes about 100 fewer instructions than the B+tree, it incurs far more misses at every cache level. We can now answer the question why the B+tree is much faster than the Red-Black tree: its lookup access pattern is more cache-friendly, which leads to fewer cache misses and therefore to better speed.

  2. Case study: trie data structures
  • The 3 main data structure paradigms according to Prof. Leis

Hash table: data in unsorted order & fast point lookups (but no efficient range queries)

Search tree: data in sorted order & balanced

Trie: data in sorted order & unbalanced (its depth depends on the key length, not on the number of keys)

  • Adaptive Radix Tree (ART)

The ART is a very efficient type of trie data structure. The outstanding property of this tree is that, no matter how big our database is, the lookup complexity depends only on the key length. The data structure is simple to understand: it requires no balancing process, and its adaptively sized nodes allow a wide fanout. Compared with the B+tree and the Red-Black tree from the motivational example, the lookup run with ART would take less than 1 second according to Prof. Leis.
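A minimal sketch of the plain radix-tree (trie) idea that ART builds on, with a fixed 256-way fanout (illustrative code; ART’s actual contribution is replacing the fixed 256-slot array with adaptively sized node types):

  #include <array>
  #include <memory>
  #include <string>

  // One child pointer per possible key byte; ART replaces this fixed
  // 256-slot array with adaptive node types (Node4/16/48/256).
  struct Node {
      std::array<std::unique_ptr<Node>, 256> children{};
      bool isTerminal = false;  // a key ends at this node
  };

  // Insert walks/creates one node per key byte.
  void insert(Node* root, const std::string& key) {
      Node* node = root;
      for (unsigned char byte : key) {
          if (!node->children[byte])
              node->children[byte] = std::make_unique<Node>();
          node = node->children[byte].get();
      }
      node->isTerminal = true;
  }

  // Lookup visits one node per key byte: O(key length),
  // independent of the number of stored keys.
  bool lookup(const Node* root, const std::string& key) {
      const Node* node = root;
      for (unsigned char byte : key) {
          node = node->children[byte].get();
          if (!node) return false;
      }
      return node->isTerminal;
  }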

ART is 10 years old; since then, new tries have been created, including these two:

- Height Optimized Trie (HOT)

- Fast Succinct Trie (a read-only data structure, optimized for space; the basis of SuRF)

Conclusion of the 3 paradigms part:

Asymptotically speaking, in terms of access time, hashing is constant (O(1)), search trees are logarithmic in the number of keys (O(log n)), and tries are linear in the length of the key (O(k): the longer the key, the more nodes are traversed in the trie).

Another interesting observation is that the computational models behind search trees and tries are completely different: for search trees, we assume that comparing two keys takes constant time, whereas in a trie we assume that each key is a string of variable length.

 

  3. Scalable Synchronization

a. Concurrent programming

The questions we can ask ourselves:

- How does hardware synchronize data access?

- How does software control synchronization?

- How can we synchronize any data structure?

One really important concept when it comes to synchronization is cache coherency. In a CPU with several cores, each core has its own L1 and L2 cache. The CPU hides the fact that the cores have this private memory and presents a single coherent view of memory; this is the job of the cache coherency protocol.

Cache coherency is implemented by a protocol such as MESI: each cache line is mapped to one of the states Modified, Exclusive, Shared, or Invalid. It is not a performance problem if multiple threads read the same cache line; the issue is modification. If several threads write to the same cache line, we get a big loss of performance. This issue leads to the golden rule of concurrent programming:

“Avoid reading or writing cache lines that are frequently written (more than 10,000 times/s) by other cores.”

Explaining this golden rule:

- Both the CPU and the programming language have a concurrency model: what are the semantics of your program in the presence of concurrent changes to memory?

- x86 has a strong memory model: the machine code that we give to the CPU is mostly executed as given, except that writes are buffered in a store buffer.

- If we use C++, we do not need to know every detail of the hardware model; a few rules suffice, which leads us to the C++ concurrency model on x86.
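A typical way to violate the golden rule is false sharing: two counters that happen to share a cache line are written by different cores, so the line constantly ping-pongs between the cores’ caches. A minimal sketch (illustrative code, not from the lecture; alignas(64) assumes 64-byte cache lines):

  #include <atomic>
  #include <cstdint>
  #include <thread>

  // Bad: both counters share one cache line, so two writer threads
  // keep invalidating each other's copy (golden rule violated).
  struct SharedLine {
      std::atomic<std::uint64_t> a{0};
      std::atomic<std::uint64_t> b{0};
  };

  // Good: each counter gets its own 64-byte cache line.
  struct PaddedCounters {
      alignas(64) std::atomic<std::uint64_t> a{0};
      alignas(64) std::atomic<std::uint64_t> b{0};
  };

  int main() {
      PaddedCounters c;  // swap in SharedLine to observe the slowdown
      std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) c.a++; });
      std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) c.b++; });
      t1.join();
      t2.join();
  }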

 

b. C++ concurrency model on x86

Among the main rules of the C++ concurrency model on x86, the golden rule remains the most important one: following the rules below while violating the golden rule is useless. Always follow the golden rule!

1. Data races are undefined behavior. C++ defines a data race as unsynchronized concurrent access to the same variable or memory location where at least one access is a write; shared data that is accessed concurrently must use std::atomic.

2. Sequentially-consistent loads are very fast on x86. We do not need to worry about the performance of loads; they are always fast on x86.

3. Sequentially-consistent stores are slow. We can therefore ask C++ not to use sequentially-consistent stores but to give us the release memory order instead. Such stores can be buffered and delayed, which makes them much cheaper (e.g., for unlocking a lock).

4. Special atomic read-modify-write operations (e.g., compare-and-swap, fetch-and-add) are always sequentially consistent. (A small spinlock sketch below illustrates rules 2 to 4.)
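The following is a minimal test-and-test-and-set spinlock sketch (illustrative code, not from the lecture) showing these rules in action: the acquisition uses a sequentially-consistent atomic exchange (rule 4), the waiting loop uses cheap loads (rule 2), and the unlock is an inexpensive release store (rule 3).

  #include <atomic>

  // Minimal test-and-test-and-set spinlock (illustrative sketch).
  struct SpinLock {
      std::atomic<bool> locked{false};

      void lock() {
          // exchange is an atomic read-modify-write operation:
          // sequentially consistent by default (rule 4).
          while (locked.exchange(true)) {
              // Spin on plain loads: sequentially-consistent loads are
              // cheap on x86 (rule 2), and reading instead of writing
              // reduces traffic on the contended cache line.
              while (locked.load()) { /* busy wait */ }
          }
      }

      void unlock() {
          // A release store suffices here and is much cheaper than a
          // sequentially-consistent store (rule 3): it may be buffered
          // and does not flush the store buffer.
          locked.store(false, std::memory_order_release);
      }
  };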

 

c. Synchronization protocols

How do we synchronize a data structure?

Can we design synchronization protocols that work for any data structure?

Optimistic lock coupling can be combined with a traditional mutex, which yields a hybrid lock mode that is flexible and handy. Lock-free synchronization can answer the issue of physical contention; this method can be used for many data structures, but we then need to design the entire data structure around the lock-free approach!
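The core of optimistic lock coupling is a version-based optimistic lock per node: writers set a lock bit and bump a version counter, while readers proceed without writing and validate the version afterwards. A minimal sketch of such a lock (illustrative names and code; real implementations, e.g. in ART or LeanStore, add restart logic and couple the validation across parent and child nodes):

  #include <atomic>
  #include <cstdint>

  // Version-based optimistic lock: the lowest bit marks "locked",
  // the remaining bits count completed writes. Unlocked versions
  // are even, locked versions are odd.
  struct OptLock {
      std::atomic<std::uint64_t> version{0};

      void lockExclusive() {
          for (;;) {
              std::uint64_t v = version.load();
              // Only lock if the lock bit is currently clear.
              if ((v & 1) == 0 && version.compare_exchange_weak(v, v | 1))
                  return;
          }
      }

      void unlockExclusive() {
          // Clearing the lock bit and bumping the version in one step
          // invalidates all concurrent optimistic readers.
          version.fetch_add(1);
      }

      // Readers: remember the version before reading the node...
      std::uint64_t readLockOrRestart() const {
          std::uint64_t v = version.load();
          while (v & 1) v = version.load();  // wait while a writer holds the lock
          return v;
      }

      // ...and validate it afterwards; false means the read raced
      // with a writer and must be restarted.
      bool validate(std::uint64_t v) const {
          return version.load() == v;
      }
  };

Because readers never write to shared memory, lookups on read-mostly data structures respect the golden rule: no cache line is written unless the structure is actually modified.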

CONCLUSION

Understanding the asymptotics is not enough to get good performance from our data structures. We need to understand the hardware!

We need to know exactly what we are looking for in order to select the right data structure! Building our own custom data structure (if it is well implemented) leads to big performance and space gains.

Three trie data structures worth using: ART, SuRF, and HOT.

Optimistic lock coupling is really convenient for most data structures, so we should not hesitate to use it. It often fits our needs well.