Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

PL/SQL and UDFs are Lousy - But we can do something about it

Torsten Grust, Uni Tübingen

Speaker Introduction

Prof. Torsten Grust is a Professor and the head of the Data Research Group at the University of Tübingen since 2008. Before this, Prof. Grust worked as a professor of Database Systems at the TU Muenchen and TU Clausthal. In 1999, he received his Ph.D. in Computer Science from the University of Konstanz, Germany.

He and his team's research is centered around databases and is mostly concerned about query languages, examples for this are query optimization, rewriting, transformation and compilation. While doing their research they often take concepts from procedural and functional languages and bring them closer to the relational database systems [1].

Summary

Written by Kamran Ali, Reiner Stolle & Hendrik Patzlaff

Introduction

Working with large amounts of data stored in databases, there are several ways to handle data processing. We can distinguish two main approaches. One is to fetch the raw data from the database and to do all processing work in the programming language of choice. Another approach is to use SQL to its full capabilities and let the relational database management system (RDBMS) do the major preprocessing work to only directly handle the required, preprocessed data in the programming language of choice.

The first approach seems more simple to many people working with data. They often choose to ignore the advanced features of SQL that RDBMS provides. This leads to ignoring advanced database features, like indexes that help speed up query processing.

Using the full capabilities of SQL reduces the processing time drastically and leads to transferring fewer data between the database and processing unit. As proposed by Stonebreaker, this way the computation is moved close to the data.

SQL as a fundamentally declarative language requires concise programming and efforts to reach the desired result. It is not the easiest programming language to write, and thus, an intermediate language was introduced.

It encourages implementing logic directly on the database and provides an easier interface for users. This intermediate language is called PL/SQL.

PL/SQL is one step toward doing computation directly at the data. Using PL/SQL requires context switches between SQL and PL/SQL which leads to a reduction in processing time that is used for the data manipulation job. Thus, PL/SQL is not as efficient as using raw SQL.

SQL allows writing user-defined functions (UDFs) to define complex operations on data directly on top of the RDBMS. The implementation of such functions in SQL does not handle them as real functions, but as part of the plan, and thus does not optimize function calls and leads to non-optimal performance.

The talk by Prof. Grust coped with both mentioned problems and described approaches on how to handle them. The first one is how to make PL/SQL usage more efficient by compiling PL/SQL to SQL. The second question is how UDFs can be optimized in SQL to use real function calls, especially in recursive functions. To solve both problems, Prof. Grust transfers methods of functional programming into the space of database research.

Background

In this section, we describe the technologies and developments in SQL and PL/SQL Prof. Grust's research is based on.

Turing Completeness of SQL

SQL Recursion was introduced in the late 90s and since its launch, it has played a huge role in data handling on the database side. Before the introduction of recursion, SQL was not as capable as other programming languages. Recursion introduced countless possibilities into programming with SQL for manipulating a large amount of data. SQL:1999 supports recursion and makes the programming language Turing complete.

Internally, recursive calls in SQL use common table expressions and generate a run table to store intermediate recursion data.

As an example, in Figure 1 the Dynamic Time Warping (DTW) algorithm is implemented in SQL. It is used in time series analysis and is an algorithm for measuring similarity between two temporal sequences, which may vary in speed.

Figure 1: SQL representation of DTW algorithm. [1]

Enhancing SQL with PL/SQL

Writing complicated queries is not that simple to execute and understand as we saw with the example of DTW above, so in the struggle of moving the logic closer to the data let's see what other options do we have.

One of the first alternatives would be PL/SQL, in layman terms this is just a mixture of SQL and Programming Language to manipulate data. PL/SQL is an extension to SQL and enhances its features by providing more easily understandable methods and functions to process data. Code written in PL/SQL is not necessarily faster or more efficient but certainly is more comprehensible.Figure 2 shows the same DTW algorithm implemented in PL/SQL.

Figure 2: DTW algorithm implemented in Imperative Programing and SQL (PL/SQL) [2]

Elements of PL/SQL:

  • Stateful variables: Just like in any other imperative programming language such as Python, declaring, reading, and updating variables is one of the fundamentals. PL/SQL also provides this feature in the same way and variables can be used throughout the code.
  • Availability of Iterative control flow: Conditional statements, loops, and iterative methods are a very important part of imperative programming as well, and PL/SQL provides this feature too, constructs like breaking a loop or exiting a loop when conditions are met, are also there in PL/SQL.
  • Embedded SQL statements: Unlike regular imperative programming languages embedded SQL statements are possible, for example, variables can be assigned with direct results from SQL queries.

PL/SQL's flaws

As stated earlier, PL/SQL allows the use of a procedural language in combination with SQL. This lets users implement arbitrary complex data structures such as loops and conditionals. As a result, PL/SQL provides more functionality to SQL. Yet, due to the additional functionality, PL/SQL is also flawed in terms of computational efficiency. This is due to the dual nature of the language, it must be able to switch between the procedural interpreter and SQL as a query language.

For example, an arbitrary user-defined function is designated and invokes PL/SQL. The user-defined function is called to interpret PL/SQL code. Once the function reaches an embedded SQL construct, a switch is done to the SQL database side. As soon as the database side is finished, another switch takes place and the interpretation of the PL/SQL code is resumed. This continues until the operation is completed. The constant back and forth between PL/SQL and SQL leads to inefficient runtime.

Figure 3: example of PL/SQL and SQL switching [3]

Even more complex PL/SQL constructs lead to more compilation and make the performance even worse.

For instance, a query with a binding that uses many invocations of another query will have slow computation. In this specific example, once such a construct reaches an arbitrary control flow such as a while loop, the iterations of the aforementioned loop cause additional switches.

The University of Tübingen Database Research team has performed experiments with PostgreSQL and concluded that each evaluation of queries entails the following stages: planning, optimizing, instantiating, running, and tearing down. Query caching can reduce this process. This minor optimization is done by getting rid of the need to plan and optimize queries if they appear multiple times.

Due to this tedious evaluation process, PL/SQL exhibits inherent inefficiencies at runtime. In the profiling of PostgreSQL, Prof. Grust and his team created a runtime profile for a typical iterative UDF and determined that the overhead of the internal functions ExecutorStart() and ExecutorEnd(), which are responsible for the evaluation process, had an overall runtime of 40%. This runtime accounts only for switching and is therefore inefficient time used.

Due to its runtime inefficiency, PL/SQL is not well received by the developer community.

Problem of working with SQL UDFs

Textbook recursive algorithms, like the Dynamic-time-warping algorithm (DTW) can be defined in SQL in a functional style. Therefore, the CREATE FUNCTION syntax is used where each recursion path is defined. Thus, the function can be used in SQL code.

Invoking recursive UDF functions in SQL is not done efficiently. For each invocation of the recursive function, the query in the UDF body is analyzed and a query plan is built. As a second step, the query plan is instantiated, evaluated, and torn down.

Analyzing the exemplary DTW function with the parameters dtw(5,5), 2524 function calls are generated. Working with DBMS, like PostgreSQL, SQL functions are only inlined to a depth of two. As a result, we only use a fraction of the runtime of UDFs to evaluate the recursive function and most of the work for query plan building, initialization, and teardown.

Due to that, we see that SQL supports UDFs, but does not encourage programming with such kinds of functions. Working with SQL, we treat functions as part of the plan rather than as real functions and thus, do not use UDFs like functions are used in functional programming languages.

Research methods and results

During the lecture, Prof. Grust introduced two research findings. One is an approach to compiling PL/SQL to SQL and the second one is converting recursive SQL UDFs into SQL that uses function calls as they are used in functional programming languages.

Compilation of PL/SQL to SQL

As priorly mentioned, part of Prof. Grust's research is to convert the PL/SQL into more efficient plain SQL.

This is done in multiple steps and with specific goals in mind. Two main goals are to remove the switching between SQL and PL/SQL, as well as translate the arbitrary PL/SQL data structure into plain SQL.

Firstly, a top-level query Q_0 is implemented, as well as a query Q_f which embodies the former arbitrary imperative PL/SQL function. Q_f consists of embedded Queries, as well as the imperative part which is translated as Q_imp. In this form, the evaluation process is only required once. This is because there are no switches from PL/SQL to SQL. The reason this transformation is possible is that SQL is Turing complete since the addition of WITH RECURSIVE. In its transcribed form SQL uses WITH RECURSIVE to map arbitrary control flows. It does this by abstracting complex control flows into a single loop computation.

Figure 4, 5: the left figure demonstrates the code without SQL/ PL/SQL switches and the right figure demonstrates the desired simplified single loop computation which is expressed with WITH RECURSIVE [4], [5]

The computation chain of PL/SQL to SQL

The computation chain of PL/SQL to SQL entails the following steps

  1. Initial imperative function
  2. Translation into Single Static Assignment Form (SSA)
  3. Conversion into Administrative Normal Form (ANF)
  4. Single Recursive function
  5. Plain SQL (using WITH RECURSIVE)

In the computation chain, we start with PL/SQL code, this can be an arbitrary complex control flow for example a nested for loop. In the second step, the imperative code is changed into its Single Static Assignment (SSA) form. The code is translated into an equivalent imperative form which primarily uses the keyword goto to substitute the complex control flows. The resulting code uses simple conditionals. Additionally, the SSA also includes assigning variables once. After this step, the outcome is translated into the Administrative Normal Form (ANF). In the ANF, there are mutually recursive functions, so there are only tail calls. After a form is established where functions recursively call each other. The code is given a single recursive function called run() which evokes the entire used code. The resulting tail can directly be translated into SQL using WITH RECURSIVE, as it suits the restricted control flow that is desired in plain SQL.

Optimization

The main upside of plain SQL is its efficiency in comparison with PL/SQL. The more complex a PL/SQL program is, the more performance is won by utilizing plain SQL. Figure 6 depicts the performance gain reached by using plain SQL compared to PL/SQL.

Figure 6: figure demonstrates PL/SQL vs recursive SQL run time in relation to iterations[6]

Improving WITH RECURSIVE

As seen from a previous subsection, code that was originally created in PL/SQL is translated into SQL in its final step using WITH RECURSIVE. While using WITH RECURSIVE to optimize the runtime dramatically increases the speed of computation, the common table expression computes the entire history of computation that has happened while it did its job. There is though no value in the accumulation of rows in the table expression, because only the answer is of importance, and nothing upwards of the stack is viewed. WITH RECURSIVE, in this sense, does too much, because it returns everything instead of only the desired answer.

To overcome this limitation, Prof. Torsten Grust and his team made a local change to the POSTGRESQL kernel and instantiated the desired function with a recursive construct they called WITH ITERATE. WITH ITERATE works like WITH RECURSIVE without memory. The working table only carries the first row. This way it does not allocate buffer space.

Though the use of WITH ITERATE is also situational. It is suitable for many iterations because there are no large run tables and little buffer space is optimal.

Figure 7: space efficiency of WITH ITERATE in comparison to WITH RECURSIVE [7]

Evaluation of the approach

Once PL/SQL is completely translated into SQL, no traces remain. Generally speaking, it is a source-to-source compilation on top of the RDBMS. This methodology of getting the compilation closer to the Database side is not exclusive to SQL and can be applied to Oracle Servers as well as other database systems. The compilation chain also admits SSA-style imperative languages, for instance, languages like python.

Usage of recursive User-Defined-Functions (UDF) in SQL

In the following paragraphs, we describe how recursive UDFs can be compiled to SQL and why this approach is useful.

Treat UDFs like functions in SQL

The approach of Prof. Grust is to treat UDFs as functions in SQL. To do so, it uses proven techniques out of the space of functional programming.

Transforming UDFs in SQL to use real function calls is a four-step process:

  1. Decomposition of the UDF
  2. Transformation into Continuation-Passing-Style
  3. Defunctionalization
  4. Building a single loop
Decomposition of the UDF

The first step is to decompose the UDF into several subexpressions. Each subexpression is represented by an identifier and can be used individually. To do so, we look at the function from a functional programming perspective. Some parts are used to identify base or recursive cases, others are recursive function invocations and some can be select-from-where blocks. None of them are unwrapped at the moment. An example of the decomposition of the floyd function is shown below. Figure 8 depicts the floyd algorithm written in a functional programming language decomposed into 4 parts.

Figure 8: Transition of function into functional programming [8]
Transformation into Continuation-Passing-Style (CPS)

The second step uses continuations to make the function tail-recursive. It uses the continuation parameter k, shown as the function parameter below. to base, function calls upon each other. Even though we reach tail-recursive functions, the function still is a higher-order function.

Figure 9: Transition into CPS [9]
Defunctionalization

Transforming higher-order functions into first-order functions is done by defunctionalization. Functions can be considered as data, like in the example shown below. Therefore, we represent higher-order function calls as push and pop operations on the data stack. The pushed data mutually relies on each other and on environmental parameters.

Figure 10: Defunctionalization [10]
Building a single loop

Mutual invocation of functions is a problem. WITH RECURSIVE can only handle simple control flow graphs. Here, for building a single loop inlining is required. This is done by a trampoline function that builds a single loop of all function calls described in the earlier representation. A comparison between the different control flows is depicted in Figure 11 Thus, this process works similar to the WITH RECURSIVE approach described earlier. The single loop generation can be seen as an iterative interpreter for recursive UDFs.

Figure 11: Different control flows for function execution [11]

Evaluation of the approach

Using the approach described above, we reach a speedup of one order of magnitude. The average overhead of a classical UDF evaluation in SQL is 95%. The DTW function can be sped up by a factor of 15.6.

The approach treats UDFs as functions and does not require a change in the RDBMS kernel, as it only transforms SQL to SQL.

Algorithms that can be described in SQL only, are among others CYK parsing, distance vector routing, traffic simulation. With this in mind, the approach by Prof. Grust allows large improvements in working with data-intensive workloads. Thus, it moves the computation closer to the data, as proposed by Mike Stonebraker.

Summary

As a part of his work as the lead of the database research group at University Tübingen, Professor Torsten Grust works at the intersection of database research and programming language technology. This is done by taking interesting aspects of programming languages (PLs) and extracting them back into the field of databases. The goal is to improve the execution performance of SQL queries. As SQL is Turing complete since the addition of recursion in SQL:1999, operations are moved as close to the data as possible.

In his lecture, Prof. Grust provides us with insights on how PL/SQL, a procedural language to complement SQL, as well as UDFs cause inefficiencies at runtime, and how we can change them into plain SQL to improve query execution performance. Working with PL/SQL, context switching between PL/SQL and SQL is computationally expensive and needs to be prevented as far as possible. Prof. Grust presents his approach to compile PL/SQL code to pure SQL code which decreases the runtime by more than 50%. One presented approach is to introduce tail recursion for the WITH RECURSIVE SQL function to reduce the amount of memory used. Furthermore, an approach to treat user-defined functions (UDFs) as real function calls is presented, which allows a speedup in one order of magnitude.

Sources

Content is based on Prof. Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute. A recording of the lecture is available at https://www.tele-task.de/lecture/video/8906/.

Further sources:

[1] Grust, T. (2021). Datenbanksysteme Tübingen | Torsten Grust. Accessed on Dec. 18. 2021, https://db.inf.uni-tuebingen.de/team/TorstenGrust.html

Images

[1] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 11

[2] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 14

[3] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 28

[4] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 31

[5] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 32

[6] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 40

[7] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 44

[8] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 55

[9] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 56

[10] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 57

[11] Dr. Torsten Grust's lecture PL/SQL and UDFs are Lousy - But we can do something about it held on Dec. 07 2021 during the Lecture Series on Database Research at Hasso Plattner Institute, page 58