Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

The R-Tree: A dynamic index structure for spatial searching

This article summarizes Antonin Guttman’s paper “R-Trees: A Dynamic Index Structure for Spatial Searching, SIGMOD 1984”. It names the motivation behind the R-tree and provides an overview of alternatives.

by Wanda Baltzer, bachelor student of the Hasso Plattner Institute, Potsdam

Motivation

Handling spatial data efficiently is an essential technique required by many applications dealing with multi-dimensional data, such as geo-data, multimedia, and spatio-temporal data. This problem has become more and more important as we try to handle more than the common two dimensions, e.g., when working with robotics and virtual reality [1]. Examples of applications that handle such data include computer-aided design, geographical information systems, radio frequency identification, and traffic control systems. All of these applications store large amounts of data, which makes efficient handling of search and update operations very important [2].

The R-tree [6] was proposed in 1984 by Antonin Guttman with the primary goal of handling geometrical data efficiently. It poses a substantial improvement to other data structures, which already tried to handle multi-dimensional point data. Some of these data structures are discussed in Guttman’s paper:

“Cell methods are not good for dynamic structures because the cell boundaries must be decided in advance. Quad trees and k-d trees do not take paging of secondary memory into account. K-D-B trees are designed for paged memory but are useful only for point data. The use of index intervals has been suggested […], but this method cannot be used in multiple dimensions.” [6]

While some of these examples provide better performance than the R-tree under certain circumstances, e.g., the quadtree when creating and updating indices [3], the R-tree has proven to be a good alternative, due to its outstanding overall performance and simple implementation [2].

The Data Structure

Guttman’s data structure represents data objects by intervals in several dimensions and works with a spatial database. This database consists of tuples, which represent spatial objects and have a unique identifier. The R-tree then refers to these identifiers in the index records. Leaf nodes contain index record entries with a tuple-identifier referring to a tuple in the database and an n-dimensional rectangle. Non-leaf nodes contain entries with a child-pointer as the address of a lower node in the tree and an n-dimensional rectangle covering all rectangles in the lower node’s entries.

This n-dimensional rectangle is assembled by n closed-bound intervals. So if you have a two-dimensional rectangle with the two intervals [2, 5], [1, 4], you receive a rectangle that looks like this:

Figure 1: a two-dimensional rectangle with intervals
Figure 1: a two-dimensional rectangle with intervals

Each R-tree has a maximum number of entries in a node. This number is referred to as M. Depending on the maximum number of entries, the minimum number is calculated with m = M/2. This property is used to decide whether the tree has to be reorganized. Another property of the R-tree is that the rectangle in the index records has to be the smallest rectangle possible while containing all corresponding tuples or rectangles.

To summarize the above, the R-tree consists of leaf node entries containing identifiers referring to the data tuples in the database, and of non-leaf node entries, which point to an address of a child node. A leaf node entry includes a rectangle, which encloses the data object, whereas a non-leaf node entry includes a rectangle that encloses all entries of the child node it points to. This structure is illustrated in Figure 2.

Figure 2a: structure of the R-tree [6]
Figure 2a: structure of the R-tree [6]
Figure 2b: structure of the R-tree [6]
Figure 2b: structure of the R-tree [6]

Searching and Updating Algorithms Guttman’s paper gives algorithms for searching and updating the index structure. The following presents an overview of how the data structure works.

Firstly, the search algorithm. When searching for all index records whose rectangles overlap a search rectangle, the algorithm checks each entry of the considered node for whether the rectangle overlaps the search rectangle. If so, the search algorithm again checks the corresponding child node for an overlap. This continues until the considered node is a leaf. If the algorithm then finds a rectangle that overlaps the search rectangle, this entry is returned as a qualifying record.

There are other search algorithms, which might be useful in some applications. Using the described search algorithm, these operations are easy to implement. An example of another search algorithm is FindLeaf, which searches for a certain entry in the tree.

However, there is a disadvantage of the algorithm:

“[…] more than one subtree under a node visited may need to be searched, hence it is not possible to guarantee good worst-case performance. Nevertheless with most kinds of data the update algorithms will maintain the tree in a form that allows the search algorithm to eliminate irrelevant regions of the indexed space, and examine only data near the search area.” [6]

One of these update algorithms is CondenseTree. It is invoked in the deletion algorithm after an entry has been deleted from a leaf node. It then checks whether this leaf node contains too few entries. If it has less than m entries, the node and the pointer in the parent node are eliminated. The eliminated node is stored temporarily in an array, Q. The algorithm then adjusts all covering rectangles, checks the number of entries of the parent node, and, if necessary, repeats all steps. Finally, the entries of the nodes stored in Q are reinserted into the tree.

There is a similar update operation in the B-tree, where two or more adjacent nodes are merged. Such an approach could have been possible. Nevertheless, Guttman chose a different one.

“We chose re-insertion instead for two reasons: first, it accomplishes the same thing and is easier to implement because the Insert routine can be used. Efficiency should be comparable because pages needed during re-insertion usually will be the same ones visited during the preceding search and will already be in memory. The second reason is that reinsertion incrementally refines the spatial structure of the tree, and prevents gradual deterioration that might occur if each entry were located permanently under the same parent node.” [6]

AdjustTree is another updating algorithm, which is required in the insertion algorithm to adjust the covering rectangles and propagate necessary node splits. Node splits are needed when a new entry is added to a node already containing M entries. The M+1 entries are then divided between two nodes. Here it must be prevented that both nodes need to be examined during subsequent searches. The decision on which node to visit is based on whether the covering rectangle overlaps the search rectangle. That is why the area of the two created rectangles after the split has to be minimized, as shown in Figure 3.

Figure 3: Splitting [6]
Figure 3: Splitting [6]

Node Splitting Algorithms

The paper gives three possible algorithms for node splitting. The first one is the exhaustive algorithm, which generates all possible splits and chooses the best. This takes quite long, and thus the other two described algorithms are better alternatives.

The second one is a quadratic-cost algorithm, which tries to find a small-area split while not insisting on the one with the smallest area. This algorithm chooses the two entries of the M+1 entries that would waste the most area if both were put in the same group. These two elements are the first elements of the two new groups. The remaining entries are then assigned to the two groups by calculating the area expansion required to add each entry to each group and then assigning the entry that shows the greatest difference of the two groups.

The third algorithm, a linear-cost algorithm, is quite similar to the quadratic algorithm. The only difference is in the process of choosing the two entries that are supposed to be the first elements of the groups. Instead of calculating the inefficiency of grouping entries together and then choosing the most wasteful pair, the linear algorithm searches for extreme rectangles along all dimensions and selects the most extreme pair. Extreme rectangles, according to Guttman’s interpretation, are rectangles with the highest low side or the lowest high side.

Outlook on Alternatives

Other scientists continuously worked on improving this data structure regarding specific queries or applications. Each variation focuses on certain aspects, such as optimal insertion or range query retrieval [2]. Some combined the R-tree with other index structures to obtain their advantages as well.

The R+-tree [2, 4] avoids that several subtrees have to be searched during the execution of a point location query. It focuses on small rectangles as large rectangles can worsen the performance for range queries due to the increased possibility of overlaps.

In 1990, the R*-tree [2, 4, 5] was published. The R-tree focuses on the minimization of the area of the rectangles. The R*-tree considers additional criteria: the minimization of overlap, the minimization of margins, and the maximization of storage utilization. The main idea is to find the best possible combination while following an engineering approach. Doing so, the data structure achieves much better performance for all queries than the R-tree. It is widespread in both research and commercial areas and is regarded as the most robust variation of the R-tree.

Another variant is the Priority R-tree [2, 4] that has a guaranteed worst-case performance. PR-trees perform similar to the R-tree regarding queries and outperform every variation of the R-tree when handling extreme data. However, there are faster variants, if the main wish is efficient bulk-loading.

To summarize, many applications in different sectors need to handle multi-dimensional data, which is why indexing spatial data is still an important area of research. Due to its excellent performance and simple implementation, the R-tree is widespread in modern applications and has resulted in a large number of variations.

For further research, please regard my source for the data structure in its original form [6] and publications on overviews of the different variations [2, 4].

References:
[1] www.bowdoin.edu/~ltoma/teaching/cs340/spring08/ (04.01.2020)
[2] Balasubramanian, Sugumaran, A State-of-Art in R-Tree Variants for Spatial Indexing, International Journal of Computer Applications 2012
[3] Kothuri, Ravada, Abugov, Quadtree and R-tree Indexes in Oracle Spatial: A Comparison using GIS Data, SIGMOD 2002
[4] www.bowdoin.edu/~ltoma/teaching/cs340/spring08/Papers/Rtree-chap2.pdf (05.01.2020)
[5] Beckmann, Kriegel, Schneider, Seeger, The R*-tree: an efficient and robust access method for points and rectangles, SIGMOD 1990
[6] Guttman, R-Trees: A Dynamic Index Structure for Spatial Searching, SIGMOD 1984