Background
A Common Table Expression (CTE) is a named temporary result set in a query that can be referenced within other statements. It defines a query that can be referred to later within a larger SQL Statement. One can initiate it using the keyword WITH and then provide a name and optionally specify some column names. This is followed by the keyword AS and then the temporary query:
WITH cte_name (column1, column2, ...) AS ( -- CTE query here SELECT column1, column2, ... FROM some_table WHERE some_condition ) -- Main query that references the CTE SELECT * FROM cte_name;
A recursive CTE is a query, that refers to itself. A recursive CTE is structured like a normal CTE and has the keyword WITH RECURSIVE to start it off. It basically consists of two different parts, an anchor member, that is the initial query which is the starting point for the recursion and a recursive member which refers to the CTE itself. The UNION ALL combines these two parts:
WITH RECURSIVE cte_name (column1, column2, ...) AS ( -- Anchor member (initial query) SELECT column1, column2, ... FROM some_table WHERE condition UNION ALL -- Recursive member (refers to the CTE itself) SELECT column1, column2, ... FROM cte_name WHERE condition ) -- Final query that references the CTE SELECT * FROM cte_name;
Problems
While the WITH RECURSIVE keyword is a part of SQL since 1999, it is still considered an obscure and difficult to understand feature by many.
Even some of the most experienced SQL programmers struggle with it. As shown in the lecture, the programmer named ‘snth’ tried to implement a calculation for the fibonacci series with SQLs WITH RECURSIVE and was puzzled by it’s behavior.
A big part of the difficulty comes from the requirement that the recursive member of the CTE (in the following called q∞) has to be monotonic. This excludes many potentially useful operations like aggregations, groupings and more.
If a query developer requires these operations, they use workarounds. Which aren’t always straightforward and often lead to unreadable syntax or unwanted behavior.
Additionaly, the semi-naive evaluation of the query means that the recursive query only ever gets those rows as input, which the previous evaluation returned. While this can be useful for some algorithms, if a query does need to inspect the history of previous results, there is no built in way to do that.
If those results are needed, developers try to build in their own version of history management, for example by adding additional columns to store previous results. This not only makes crafting such queries very difficult, it also leads to inefficiencies.
Solution
Grust proposed a revolutionary solution: the WITH ITERATIVE keyword.
Per default WITH ITERATIVE does not create history of all results anymore. This makes it clear that really only the previous result is available for evaluation. Now, instead of returning a table of all intermediate results, only the last non-zero result is returned. This means, that much less memory is used during the recursion. Additionally, new keywords are specified to allow q∞ to access and maintain some sort of history.
WITH ITERATIVE tab(c1,...cn) AS ( initial_query() UNION ALL recursive_query(tab) ) SELECT * FROM tab;
Goal
The aim is to use the simple and loop-based behavior everybody knows from CTEs and leverage the already existing CTE infrastructur, but also lift some restrictions on q∞, mainly the monotonicity.
Main ideas, methodology, approach
Variant 1: WITH ITERATIVE … KEY
The first proposed variant involves using WITH ITERATIVE in conjunction with an updatable, keyed dictionary. This approach allows the query author to control the distribution of keys within the dictionary. When a new key is assigned for rows in each iteration (for example, based on the cumulative iteration count), it requires a significant amount of space as the dictionary effectively operates in append-only mode. However, if keys from previous iterations are reused, the upsert behavior comes into play, potentially keeping the dictionary compact. This overwriting process bears similarity to imperative programming languages, where the current value of a variable can also be updated through overwriting. In each iteration, q∞ can read the entire dictionary, ensuring continued access to the data.
WITH ITERATIVE tab(c1,...cn) KEY(k1,...,kn) AS ( initial_query() UNION ALL recursive_query(tab, RECURRING(tab)) ) SELECT * FROM tab;
This approach was then used by Professor Grust to showcase a basic Graph-based algorithm. He wanted to map each node in a graph to a component. In a component one node can always reach another node via the edges, if two nodes are not in the same component, they can also not reach each other. This algorithm was then described and the runtime and space consumption of the WITH ITERATIVE approach was compared to the vanilla WITH RECURSIVE function. The runtime of the proposed approach is always one to two magnitues lower than the vanilla one. The same goes for the memory consumption, where the variant with the keyed dictionary always consums less memory.
Variant 2: WITH ITERATIVE … TTL
The second proposed variant is the usage of WITH ITERATIVE in combination with a time to live (TTL) value.
Now the DBMS collects two history-tables. One like it does with WITH RECURSIVE, where it appends every result and which is not accessible by q∞. And a second one, which is accessible inside the query by using RECURRING(<table_name>). After every iteration the ttl value of all rows in there is aged (decremented) and those which reach zero are removed from that table.
WITH ITERATIVE tab(ttl, c1, ...cn) TTL(ttl) AS ( initial_query() UNION ALL recursive_query(tab, RECURRING(tab)) ) SELECT * from tab;
This allows a query developer to make of some previous results in a controlled manner, as they do not have to manage this history inside custom columns of the result. Additionally, because they can control the value of ttl they have some control over the required memory.
Professor Grust then presented an algorithm that parses a string, with the condition that the grammar implemented by the parser must be in Chomsky normal form (CNF). The TTL algorithm demonstrated remarkable efficiency, being 2 to 3 times faster than naive expiry while simultaneously expiring a significantly larger number of rows in the process.
An straightforward example for a use for TTL would be computing the Fibonacci series:
WITH ITERATIVE fib(ttl, n , F) TTL(ttl) AS ( (SELECT 1 AS ttl, 0 AS n, 0 AS F UNION ALL SELECT 2 AS ttl, 1 AS n, 1 AS F) UNION ALL SELECT 2 AS ttl, 1 + MAX(n) AS n, SUM(F) FROM RECURRING(fib) AS F ) TABLE fib LIMIT 10;
Here the anchor member (also called q1) creates the two base cases for Fibonacci as two rows: (1, 0, 0), (2, 1, 1). Then q∞ accesses all currently “alive” rows by selecting from RECURRING(fib) and calculates the next row for the sequence.
By setting ttl to 2 for each new row, this CTE ensures that each row is only available to the next two calls of q∞.
Summary
In summary, Torsten Grust’s lecture proposed a groundbreaking solution to challenges associated with recursive queries in SQL, introducing the WITH ITERATIVE keyword. This innovation addresses the complexity and limitations of WITH RECURSIVE, providing a more intuitive and efficient approach.
The lecture presented two variants of the solution, emphasizing reduced memory consumption and improved runtime. Demonstrations of graph-based and string-parsing algorithms showcased the remarkable advantages of WITH ITERATIVE over traditional methods. Overall, the lecture marks a significant advancement in the field of database query processing, offering a more accessible and powerful tool for developers grappling with recursive queries.