Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Access Path Selection in a Relational Database Management System

By Andrea Nathansen

Access Path Selection in a Relational Database Management System, Selinger et al., SIGMOD 1979

Nowadays, most database developers enjoy the luxury of SQL statements and do not have to think about low-level query execution plans. The optimizer automatically chooses the execution order and methods for these high-level statements. While the optimizer is an essential part of database management systems today, its basic concept was already invented 40 years ago. The paper by Selinger et al. describes the original ideas and methods behind the optimizer that we take for granted in modern systems.

The need for an automatic optimizer is a result of supporting high-level queries and providing automatic query processing. After a user has typed an SQL query, the system processes it in four phases: parsing, optimization, code generation, and execution. The optimizer receives a parse tree for the query from the parsing step and creates a minimum-cost execution plan, which it passes on to the code generation component. Minimum cost means minimal disk I/O and CPU utilization. As the data is organized in pages on disk, we measure disk I/O by the number of pages the system has to touch during query execution. A disk access is far more expensive than a CPU operation, which makes disk I/O the main bottleneck here.
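As a rough sketch of this cost model: the paper combines the number of page fetches with a weighted count of CPU operations into a single cost figure. The function name and the concrete weight below are illustrative assumptions, not the paper's implementation.

# Minimal sketch of the cost model: disk I/O (page fetches) dominates, CPU work
# enters as a weighted term. The weight value is only an illustrative assumption.
def plan_cost(page_fetches: int, cpu_operations: int, w: float = 0.01) -> float:
    """Estimated cost of one candidate execution plan."""
    return page_fetches + w * cpu_operations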

During the optimization process, the optimizer creates multiple execution plans that are equivalent regarding their resulting output. It computes the estimated cost for each of those plans and chooses the cheapest one. For its calculations, it uses statistics on the relations and available access paths.

“For each relation T,
- NCARD(T), the cardinality of relation T.
- TCARD(T), the number of pages in the segment that hold tuples of relation T.
- P(T), the fraction of data pages in the segment that hold tuples of relation T.
P(T) = TCARD(T) / (no. of non-empty pages in the segment).

For each index I on relation T,
- ICARD(I), number of distinct keys in index I.
- NINDX(I), the number of pages in index I.”

There are several parts of a query that require optimizing decisions.

Scanning Relations

When executing a query, the system first needs to scan the referenced relations. A relation is stored on one or more pages on disk, and multiple pages form a segment. A relation cannot span more than one segment, but a segment can contain numerous relations, so tuples of different relations may be interleaved on a page.

The basic scan method is the segment scan. The system reads the relation sequentially from the segment pages. That means, it loads each page of the segment once to obtain every tuple on this page that belongs to the specified relation. This leads to a total cost of one I/O per segment page.

If there are indices for one or more attributes of a relation, the system can also scan the relation via index scan. It loads the index and reads the tuples of the relation by following the pointers stored on the index pages. As the order of tuple keys in the index can differ from the order of tuples on disk, the system likely has to fetch each page multiple times. The cost for the worst case is one I/O per tuple, which means NCARD(T), plus the cost of loading the index.
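As a small sketch (the function names are illustrative), the two scan cost estimates can be expressed directly in terms of the statistics listed above:

# Scan cost estimates in page fetches, using the statistics from above.
def segment_scan_cost(segment_pages: int) -> int:
    # One I/O per page of the segment that stores the relation.
    return segment_pages

def index_scan_cost_worst_case(ncard: int, nindx: int) -> int:
    # Worst case: every tuple requires its own page fetch (NCARD(T)),
    # plus reading the NINDX(I) pages of the index itself.
    return ncard + nindx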

After scanning a relation via index scan, the resulting tuples are ordered by the index attribute. A query can require a certain interesting order, which is defined by its GROUP BY and ORDER BY clauses. When executing queries with interesting orders, using an appropriate index saves the additional cost of sorting. In this case, an index scan is probably cheaper than a segment scan followed by a sort.

Index scans can also be useful if the given query contains a selection. In case of a simple selection clause of the form “attribute comparison-operator value”, the selection can take place as the first operation during the scan. The system can evaluate the selection in the index and has to fetch only those tuples from disk which fulfill the selection criterion. The estimated fraction of tuples that remain after the selection, relative to the number of tuples in the relation, is called the selectivity factor. Depending on the form of the selection clause, the optimizer uses different formulas for computing the selectivity factor (see Table 1, Selectivity factors, in the paper).
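The sketch below translates a few entries of that table into code. The fallback constants (1/10 and 1/3) are the paper's defaults for cases without suitable statistics; the function shapes themselves are illustrative assumptions.

# A few of the selectivity factor formulas from Table 1 of the paper, as a sketch.
# ICARD(I) is the number of distinct keys in an index on the column.
def sel_equals(icard=None):
    # column = value: 1 / ICARD if an index exists, otherwise the default 1/10
    return 1.0 / icard if icard else 1.0 / 10

def sel_greater(value, low_key=None, high_key=None):
    # column > value: interpolate over the key range if the column is arithmetic
    # and the range is known, otherwise the default 1/3
    if low_key is not None and high_key is not None and high_key > low_key:
        return (high_key - value) / (high_key - low_key)
    return 1.0 / 3

def sel_and(f1, f2):
    return f1 * f2            # (pred1) AND (pred2)

def sel_or(f1, f2):
    return f1 + f2 - f1 * f2  # (pred1) OR (pred2)

def sel_not(f):
    return 1.0 - f            # NOT pred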

For each relation specified in the query, the optimizer computes the cost of all available access paths (segment scan and indices) and chooses the cheapest one with regard to required interesting orders.
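Putting these pieces together, the per-relation choice can be sketched as follows; the candidate paths, their costs, and the sort cost are made-up illustrative values.

# Sketch of access path selection for one relation: estimate the cost of every
# candidate access path and keep the cheapest, taking a required interesting
# order into account. All numbers are made-up illustrative values.
candidate_paths = [
    (100, None, "segment scan"),              # one I/O per segment page, no order
    (5020, "dept", "index scan on dept"),     # worst case: NCARD + NINDX fetches
    (300, "salary", "index scan on salary"),  # selection applied in the index
]

# Without a required order, simply take the cheapest path.
cheapest = min(candidate_paths, key=lambda p: p[0])

# With a required order (e.g. ORDER BY salary), compare the cheapest path that
# already produces that order against the cheapest unordered path plus a sort.
required_order, sort_cost = "salary", 250
ordered = min((p for p in candidate_paths if p[1] == required_order), key=lambda p: p[0])
best_for_order = ordered if ordered[0] <= cheapest[0] + sort_cost else cheapest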

Join Methods

As joins can result in outputs with high cardinality, they are a major cost factor in query execution. The optimizer has to choose the best join method and, given a join of more than two relations, the best join order.

The default join method is the nested loop join. The optimizer designates one of the two relations as the inner relation; the other becomes the outer relation. The system scans the outer relation once. For each tuple of the outer relation, it scans the inner relation and finds the pairs that match the join predicate.

“Let C-outer(path1) be the cost of scanning the outer relation via path1, and N be the cardinality of the outer relation tuples which satisfy the applicable predicates. N is computed by:
N = (product of the cardinalities of all relations T of the join so far) * (product of the selectivity factors of all applicable predicates).
Let C-inner(path2) be the cost of scanning the inner relation, applying all applicable predicates. [...] Then the cost of a nested loop join is
C-nested-loop-join(path1, path2) = C-outer(path1) + N * C-inner(path2)”
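The quoted formulas translate directly into code; as a sketch, assuming the scan costs and selectivity factors have already been estimated:

from math import prod

# N = (product of the cardinalities of all relations T of the join so far)
#     * (product of the selectivity factors of all applicable predicates)
def qualifying_cardinality(cardinalities, selectivity_factors):
    return prod(cardinalities) * prod(selectivity_factors)

# C-nested-loop-join(path1, path2) = C-outer(path1) + N * C-inner(path2)
def nested_loop_join_cost(c_outer, c_inner, n):
    return c_outer + n * c_inner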

Another method, the merge scan join, requires both relations to be sorted in join column order. The system scans the sorted relations in a synchronized way to obtain matching tuple pairs. Due to the sorting, it needs to scan each relation only once, so the resulting cost consists of the I/O for performing the sorts, if necessary, plus the I/O for one scan of each relation.
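Following this description, the merge scan join cost can be sketched the same way; whether a sort is needed depends on existing orders, which is expressed here with optional arguments (an illustrative simplification).

# Sketch of the merge scan join cost as described above: sort each input if it is
# not already in join column order, then scan each sorted input once.
def merge_scan_join_cost(scan_outer, scan_inner, sort_outer=0.0, sort_inner=0.0):
    # sort_* is 0 when the input already arrives in the required interesting order,
    # e.g. from an index scan on the join column.
    return sort_outer + sort_inner + scan_outer + scan_inner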

Join Order

If a join includes more than two relations, the system decomposes it into nested joins of two relations. This means it first joins two relations, then joins the composite result with the third relation, and so on. While the composite result of different join orders is always the same, the cost of the intermediate steps can vary greatly. To choose the order with minimal cost, the optimizer creates a tree of possible orders and computes the cost of each path.

This tree can have a size exponential in the number of joined relations. To reduce this size, the optimizer does not consider paths that contain unnecessary early Cartesian products; it only considers paths that first join all relations with matching join attributes and then perform unavoidable Cartesian products. Another way of reducing the tree size uses the fact that different join orders for a set of relations are equivalent regarding the composite result and its cardinality. The optimizer also takes into account different interesting orders that can be used for merge scan joins; the join columns and the GROUP BY and ORDER BY clauses specify these interesting orders. For each set of joined relations, the optimizer considers an unordered join and a join for each possible interesting order of the result. Because of the equivalence regarding the result, the optimizer only needs to store, for each of these joins, the result cardinality and the cheapest solution for obtaining this result (including the join order and the methods for joins and scans) together with its cost.

After creating the tree, the optimizer chooses the cheapest solution for the complete join of all relations, with consideration of possible required interesting orders if the query also includes a GROUP BY or ORDER BY clause.
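A compact sketch of this search in the style of dynamic programming: for every set of relations, only the cheapest plan producing that set is kept. Interesting orders and the Cartesian product pruning are omitted for brevity, and the cost and cardinality estimators are placeholders, not the paper's formulas.

from itertools import combinations

def scan_cost(relation):
    return relation["pages"]               # placeholder: segment scan cost

def join_cost(left_plan, inner_relation):
    # placeholder: nested loop join, outer plan cost + qualifying tuples * inner scan
    return left_plan["cost"] + left_plan["card"] * scan_cost(inner_relation)

def best_join_order(relations):
    by_name = {r["name"]: r for r in relations}
    names = list(by_name)
    # best[set of relation names] = cheapest plan that produces exactly this set
    best = {
        frozenset([n]): {"cost": scan_cost(by_name[n]), "card": by_name[n]["card"], "order": [n]}
        for n in names
    }
    for size in range(2, len(names) + 1):
        for subset in map(frozenset, combinations(names, size)):
            for inner in subset:            # relation joined last in this plan
                left = best[subset - {inner}]
                cost = join_cost(left, by_name[inner])
                if subset not in best or cost < best[subset]["cost"]:
                    best[subset] = {
                        "cost": cost,
                        # result cardinality; selectivity factors omitted in this sketch
                        "card": left["card"] * by_name[inner]["card"],
                        "order": left["order"] + [inner],
                    }
    return best[frozenset(names)]

# Example with made-up statistics:
print(best_join_order([
    {"name": "EMP", "card": 10_000, "pages": 100},
    {"name": "DEPT", "card": 100, "pages": 5},
    {"name": "JOB", "card": 50, "pages": 2},
]))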

Nested Queries

A nested query, or subquery, is a query that is part of a WHERE clause in a top level query. For example:

“SELECT NAME
FROM EMPLOYEE
WHERE SALARY =
(SELECT AVG(SALARY)
FROM EMPLOYEE)”

The naive approach would execute a subquery once per tuple of the top level query, but this can be optimized in some cases.

If a subquery does not reference any tuple or value from the top level query, the system only has to evaluate it once, before executing the top level query. The system can integrate the result of the subquery as a scalar value or a temporary list into the top level query. In this case, the optimizer records that the subquery is to be evaluated first.

When multiple subqueries are nested without any references to outer queries, the system evaluates them in nesting order, starting from the deepest to the outermost. If a deeper subquery references an outer query that is not its direct predecessor in the nesting, the system executes it only once for each tuple of the referenced query and not for every tuple of its direct predecessor. The optimizer checks those references and saves the optimal execution order for the nested queries.
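As a toy illustration (the data and the department-based correlation are made up for illustration, not taken from the paper), here is the difference between a subquery that is evaluated once and one that must be re-evaluated per tuple of the referenced outer query:

# Toy illustration of subquery evaluation with made-up data.
employees = [
    {"name": "A", "dept": 1, "salary": 100},
    {"name": "B", "dept": 1, "salary": 200},
    {"name": "C", "dept": 2, "salary": 300},
]

# Uncorrelated subquery (SELECT AVG(SALARY) FROM EMPLOYEE): evaluated once,
# its scalar result is reused for every tuple of the top level query.
avg_salary = sum(e["salary"] for e in employees) / len(employees)
result_uncorrelated = [e["name"] for e in employees if e["salary"] == avg_salary]

# Correlated subquery: it references a value of the outer tuple (here, its
# department), so it must be re-evaluated for each tuple of the referenced query.
def dept_avg(dept):
    salaries = [e["salary"] for e in employees if e["dept"] == dept]
    return sum(salaries) / len(salaries)

result_correlated = [e["name"] for e in employees if e["salary"] >= dept_avg(e["dept"])]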

What has changed since then?

The methods and ideas of the optimizer described by Selinger et al. laid the foundation for query optimization in database management systems. Over the years, many researchers have developed the concept further.

For example, Graefe and Ward propose dynamic query evaluation plans for programs that are usually compiled only once [3]. To adapt to changes in the data, their idea is to move away from saving only the single optimal query execution plan that the optimizer creates during compilation. Instead, their system saves multiple alternative execution plans, or it recompiles the program from time to time and thus reoptimizes the execution plans.

To improve the plans the optimizer creates, Mannino et al. propose taking more statistics about the data into account during optimization [2]. Their system makes fewer assumptions about the data (such as a uniform distribution of every attribute) and uses additional histograms or distribution tables.

Other researchers adapted the concept of the optimizer to modern systems that differ from traditional database management systems, for example column-based database systems [4] and in-memory database systems [5].

Although the details of the concept and the implementation of optimizers vary across different systems, automatic query optimization is still an essential aspect of database systems today.

References

[1] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, T. G. Price, “Access Path Selection in a Relational Database Management System,” in Proceedings of the 1979 ACM SIGMOD international conference on Management of data, Boston, MA, 1979, pp. 23-34.

[2] M. V. Mannino, P. Chu, T. Sager, “Statistical profile estimation in database systems,” ACM Computing Surveys, vol. 20, no. 3, pp. 191-221, Sep. 1988.

[3] G. Graefe, K. Ward, “Dynamic query evaluation plans,” in Proceedings of the 1989 ACM SIGMOD international conference on Management of data, Portland, OR, 1989, pp. 358-366.

[4] N. Tran, A. Lamb, L. Shrinivas, S. Bodagala, J. Dave, “The Vertica Query Optimizer: The Case for Specialized Query Optimizers,” in 2014 IEEE 30th International Conference on Data Engineering, Chicago, IL, 2014, pp. 1108-1119.

[5] K.-Y. Whang, R. Krishnamurthy, “Query Optimization in a Memory-Resident Domain Relational Calculus Database System,” ACM Transactions on Database Systems, vol. 15, no. 1, pp. 67-95, Mar. 1990.