Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

B-trees and how this data structure changed the way of storing data and accessing it efficiently

A summary by Mona Sobhani.

Fifty years ago, Rudolf Bayer and Edward M. McCreight, both working at the Boeing Research Labs, invented a data structure called “B-trees” that is nowadays one of the most used data structures in database systems. It has changed the way of storing data and thereby enormously reduced the IO-costs for searching, deleting and inserting data in a database. This blog post aims to show how the structure of the B-tree is organized, how an algorithm on this structure works and how this data structure has evolved since the invention in 1971, when they published it in a scientific paper which is called “Organization and Maintenance of Large Ordered Indices”.

Structure and characteristics of the B-tree

Before inventing the structure of a B-tree, data was organized with index tables. The more data, the longer the index table got or even became a two-level index. Indexing helped to find data faster. For example, if you are searching for specific data, you would use the index to see what block of data you would have to access (load into memory), to find the data. This allows you to not access all the blocks on the disk but just one block, where you will find the data you are looking for.

If the index table is getting too large, one can use a two-level index where the first index points to another index in the second level. One can even use more than a two-level index. When there are too many levels, a B-tree comes into hand. Having a lot of data, or in this case many primary keys, a B-tree organizes the keys in the nodes of a balanced tree.

Each node consists of an index-key and a pointer to the data block, where the data can be found. If we use the same example as before, how do you we now access the data? Looking for a specific key, we now do a search similarly to a binary search. Start at the root of the key and continue either at the left child node or right child node depending on the key number. The B-tree is balanced and therefore assures us to find every key in O(log n). But before looking at the efficiency of a B-tree, we will have a look at the structure and the conditions of a B-tree.

The following figure should help to understand the structure of the B-tree, which Bayer and McCreight published in their paper.

Figure 1: B-tree data structure
Figure 1: B-tree data structure

Bayer and McCreight describe the organization of the B-tree as follows:

“The index is organized in pages of fixed size capable of holding up to 2k keys, but pages need only be partially filled. Pages are the blocks of information transferred between main store and backup store. The pages themselves are the nodes of a rather specialized tree, a so-called B-tree, […].” [1]

Now that we have understood the structure of the B-tree a little better, we will take a look at the rules and conditions of a B-tree, which I will list in the following.

  • The keys in the node are always sorted If a node has n keys, it has n+1 pointers
  • The root always has a minimum of two pointers
  • The left-sided pointer of a key points to a node with keys that are have a smaller size
  • The right-sided pointer of a key points to a node with keys that are greater or equal to the key size
  • The internal nodes (every node except the root and the leaf’s) have a minimum and maximum number of containing keys
  • The B-tree is always balanced

Insert new key into the B-tree

There are three main algorithms that allow the B-tree to stay balanced. I will explain one of those algorithms, the one that inserts a new key into a B-tree.

Whenever a key is inserted into a B-tree, the node where the key will be inserted will be searched for similarly to a binary search. Let 4 be the key that should be inserted. If the key in the root is smaller than the inserted key, here root here is 2, then the key will be inserted into the child node on the right side, once there is space left (see figure 2). Once one node already contains the maximum number of keys it can have (see insertion of 5 in figure 2), the node will be split in two nodes and the keys will be divided into two new nodes containing each of the keys. Remember, that the nodes are not densely packed with keys, so that the insertion becomes efficient because the nodes do not always need to be split when a key is inserted.

 Figure 2: Insertion of keys 4 and 5
Figure 2: Insertion of keys 4 and 5

This insertion algorithm allows the B-tree to stay balanced and therefore assures that the search of a key will always be in O(log n) because the length of the leaf nodes have the same height.

What has changed since 1971?

There have been many deviations that have been established since the B-tree was invented in 1971. Nowadays the most used B-tree is the so-called B+-tree. The B+-tree does not have pointers from every node like the B-tree. It only has pointers from leaf nodes to data. This is an advantage because more keys fit in one node and therefore the height of the tree will stay shorter.

The leaf nodes of the B+-tree have pointers among themselves and allow a linear full scan of the data by passing through the leaf nodes. Compared to a B-tree, a sequential scan is more efficient using a B+-tree, because a B-tree would have to traverse through all the existing nodes in the tree.

What are the advantages of this data structure?

Now that we have looked at the structure and another variant of the B-tree, you may ask yourself, why this data structure is one of the most used ones. Before the invention of the B-trees, data was organized with multilevel indexing and accessing data with the help of index tables. The B-tree was invented to store large data in a way that reading and writing on big data blocks from disk remains in logarithmic time.

The B-tree data structure is an efficient solution for organizing data and accessing it faster than through index pages. When data is too large to be kept in main memory, a B-tree will be kept in main memory and only specific blocks of data will be accessed on disk and loaded into main memory for further computations. The algorithms for inserting, deleting and searching allow the B-tree to stay balanced and therefore always ensures to find data in logarithmic time, which is the biggest advantage of the B-tree.

References:

If not referenced differently, all figures and quotes are from the paper “Organization and Maintenance of Large Ordered Indices” by Bayer and McCreight
Figure 2: en.wikipedia.org/wiki/B-tree