Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

Learned DBMS Components - Carsten Binnig

Carsten Binnig, Data Management Lab, TU Darmstadt

Abstract

Machine learning models have started to replace the hand-tuning of database management system components for optimal performance. It started with workload-driven models, which are trained on collected query results and predict the outputs for new queries. Next came data-driven learning; these models do not need training queries, as they learn from the data characteristics alone. Both approaches are limited: workload-driven models need to run additional queries and be retrained, while data-driven models lack the workload information that some tasks require. In this talk, Carsten Binnig introduces zero-shot learning as an approach to train models that generalize DBMS components to unseen databases without the need to retrain the model.

Biography

Carsten Binnig is a Professor at the Data Management Lab, which is part of the Department of Computer Science at TU Darmstadt. He received his PhD from the University of Heidelberg in 2008. His current research focuses on distributed and parallel data management systems for next-generation data management as well as AI research in connection with database management systems.

A recording of the presentation is available on Tele-Task.

Summary

written by Larissa Hoffäller, Julius von Streit, and Jessica Ziegler

Historically, database management system components such as indexes, cost models, and schedulers had to be hand-tuned for optimal performance on a database. Advances in machine learning enabled "Learned DBMS Components 1.0". These ML models, which have been successfully used for many DBMS components, are based on workload-driven learning. Unfortunately, the training phase in workload-driven learning takes a very long time, has to be rerun when the underlying data changes, and does not generalize well. This led to the creation of data-driven learned components that learn the multivariate data distribution over complex relational databases and can quickly be retrained. This approach is, however, limited to tasks that do not consider workloads. To overcome that limitation, a zero-shot approach was introduced, which yields machine learning models that easily generalize DBMS components to unseen databases.

Workload-Driven Learning

Workload-driven learning relies on running queries on a database using the classical (hand-tuned) component and observing its behavior. The observed results are eventually used to train a machine learning model, which can then act as a substitute for the classical component.

Application: Cardinality Estimation

Figures 1 and 2 illustrate workload-driven learning for predicting cardinalities of SQL queries, also known as the "cardinality estimation problem". To collect training data for the ML model, many queries have to be run on the database to observe the resulting cardinalities. The ML model is then trained on the queries together with their observed cardinalities. After sufficient training, the ML model can predict the cardinalities of queries in place of the classical component.

Figure 1: Overview of workload-driven learning for databases.
Figure 2: Training data for cardinality estimation using workload-driven learning.
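
As a rough, hypothetical sketch of this pipeline (the table, queries, featurization, and model choice below are illustrative and not taken from the talk), one could collect query/cardinality pairs and fit a regression model as follows:

    import sqlite3
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Build a toy table to run training queries against.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (price REAL, quantity INTEGER)")
    rng = np.random.default_rng(0)
    con.executemany("INSERT INTO orders VALUES (?, ?)",
                    [(float(p), int(q)) for p, q in
                     zip(rng.uniform(0, 100, 10_000), rng.integers(1, 50, 10_000))])

    # 1) Run many training queries and observe the true cardinalities.
    X, y = [], []
    for lo, hi in rng.uniform(0, 100, size=(500, 2)):
        lo, hi = float(min(lo, hi)), float(max(lo, hi))
        card = con.execute("SELECT COUNT(*) FROM orders WHERE price BETWEEN ? AND ?",
                           (lo, hi)).fetchone()[0]
        X.append([lo, hi, hi - lo])      # simple featurization of the range predicate
        y.append(np.log1p(card))         # predict log-cardinality for stability

    # 2) Train the learned component on the collected query results.
    model = GradientBoostingRegressor().fit(np.array(X), np.array(y))

    # 3) The trained model now substitutes the classical estimator.
    est = np.expm1(model.predict([[10.0, 30.0, 20.0]]))[0]
    print(f"estimated cardinality for price BETWEEN 10 AND 30: {est:.0f}")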

Challenges of Workload-driven Learning

Workload-driven learning has critical drawbacks. While the pure training of the model requires only a few minutes, the initial generation of the training data can take much longer (several hours or days). Moreover, this effort has to be repeated: as soon as the underlying data is updated, queries have to be rerun so that the model reflects the changed behavior of the component. Additionally, the ML models do not generalize to new tasks, which means a model cannot be transferred to other components; this again requires the re-collection of training data. "Learned DBMS Components 2.0" aims to avoid the high cost and inflexibility of workload-driven learning.

Learned DBMS Components 2.0: Data-driven Learning

Data-driven learning is, alongside zero-shot learning, one of the main contributions enabling "Learned DBMS Components 2.0". Unlike the workload-driven approach, data-driven learning relies solely on the data in the relational database. In the training phase, the model learns the multivariate data distribution, i.e. the joint distribution across multiple dimensions such as attributes and tables, of the relational database. The learned distribution is then used to solve the dedicated task (such as cardinality estimation). It is therefore not suitable for tasks that depend on the workload. For applicable tasks such as cardinality estimation, indexing, and data cleaning (which are independent of the workload), data-driven learning outperforms workload-driven learning in terms of speed.
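
As a minimal illustration of the idea (not the RSPN construction from the talk; the histogram below merely stands in for a learned joint distribution), a cardinality estimate is obtained as the table size multiplied by the selectivity predicted from the learned distribution:

    import numpy as np

    rng = np.random.default_rng(0)
    n_rows = 100_000
    price = rng.uniform(0, 100, n_rows)
    quantity = rng.integers(1, 50, n_rows)

    # "Training": learn the multivariate distribution of (price, quantity)
    # directly from the data -- no training queries are executed.
    hist, price_edges, qty_edges = np.histogram2d(price, quantity, bins=20)
    joint = hist / n_rows                              # P(price bin, quantity bin)

    # "Inference": selectivity of (price < 20 AND quantity > 40).
    p_bins = price_edges[:-1] < 20
    q_bins = qty_edges[:-1] > 40
    selectivity = joint[np.ix_(p_bins, q_bins)].sum()

    print("estimated cardinality:", int(selectivity * n_rows))
    print("true cardinality:     ", int(np.sum((price < 20) & (quantity > 40))))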

At its core, data-driven learning uses Relational Sum-Product Networks (RSPNs), a new class of deep probabilistic models over databases. RSPNs are derived from Sum-Product Networks (SPNs) and tailored to relational DBMSs. They take advantage of the fact that SPNs capture data characteristics quite accurately; unlike plain SPNs, they also support updates.

RSPNs are created recursively, resulting in a tree data structure. To learn a data-driven model, a table (or joined table) is repeatedly split into row clusters and column clusters until all clusters are independent of each other. Each time rows or columns are separated into clusters, a "plus node" (for row clusters) or a "multiply node" (for column clusters) is created, with the two new clusters as its children. The leaves of the RSPN tree hold the data of the individual column clusters. When a query is evaluated, it is propagated down from the root to the leaves; the node types and the edge weights then specify how the partial results are combined on the way back up to the root.

Figure 3: Example of how an RSPN is created.
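
The following toy sketch shows how such a sum-product structure can answer a selectivity query; it is heavily simplified compared to real RSPNs, and the node structure and numbers are invented. Product ("multiply") nodes combine independent column clusters, sum ("plus") nodes mix row clusters with weights, and leaves hold single-column distributions:

    # Toy sum-product evaluation (illustrative only, not the actual RSPN algorithm).
    def leaf(column, distribution):
        return {"type": "leaf", "column": column, "dist": distribution}

    def product(*children):                    # column split: independence assumed
        return {"type": "product", "children": children}

    def weighted_sum(weights, *children):      # row split: mixture of row clusters
        return {"type": "sum", "weights": weights, "children": children}

    def prob(node, predicate):
        """P(predicate) under the model; predicate maps column -> set of values."""
        if node["type"] == "leaf":
            values = predicate.get(node["column"])
            if values is None:                 # column not constrained by the query
                return 1.0
            return sum(node["dist"].get(v, 0.0) for v in values)
        if node["type"] == "product":
            p = 1.0
            for child in node["children"]:
                p *= prob(child, predicate)
            return p
        return sum(w * prob(c, predicate)
                   for w, c in zip(node["weights"], node["children"]))

    # Two row clusters, each splitting the columns "region" and "status".
    spn = weighted_sum(
        [0.3, 0.7],
        product(leaf("region", {"EU": 0.8, "US": 0.2}),
                leaf("status", {"open": 0.1, "closed": 0.9})),
        product(leaf("region", {"EU": 0.1, "US": 0.9}),
                leaf("status", {"open": 0.6, "closed": 0.4})),
    )

    n_rows = 1_000_000
    sel = prob(spn, {"region": {"EU"}, "status": {"open"}})
    print("estimated cardinality:", sel * n_rows)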

By using partial models and tuple factors, RSPNs can also handle tables with multiple joins. In this case, a model is not built for every possible join. Instead, only parts of the schema are modeled by an RSPN (a partial model) and annotated with tuple factors. If a query touches only part of an RSPN or spans several RSPNs, the tuple factors make it possible to compute the query result. This keeps the model sizes deliberately small while still generalizing to larger joins.

Figure 4: Example of RSPNs.
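
A small worked example of this idea, with invented numbers: instead of modeling the full join, a per-table model is combined with an average fan-out (the tuple factor) annotated on the join:

    # Invented numbers, purely to illustrate combining a per-table model with a
    # tuple (fan-out) factor instead of modeling the full join.
    customers = 1_000_000                  # |customer|
    sel_europe = 0.25                      # P(region = 'EU') from the customer RSPN
    avg_orders_per_customer = 4.2          # tuple factor annotated on the join edge

    # Estimated size of: customer JOIN orders WHERE customer.region = 'EU'
    join_cardinality = customers * sel_europe * avg_orders_per_customer
    print(join_cardinality)                # 1,050,000 tuples (estimate)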

Besides cardinality estimation, data-driven learning has multiple applications, such as query answering, indexing, data cleaning, and potentially other tasks where knowledge of the data distribution is sufficient. Nevertheless, data-driven learning cannot solve tasks that require workload information; it can, however, be used to enrich zero-shot learning.

Learned DBMS Components 2.0: Zero-Shot Learning

A major challenge of the previous approaches based on workload- and data-driven learning is that the collection of training data has to be repeated for every database that should be supported. Advances in transfer learning, e.g. GPT-3 for natural language processing, inspired zero-shot learning. Applied to database components, this new approach tackles the challenge of training data collection: a zero-shot model is trained on a variety of databases and can support new, unseen databases without retraining. Zero-shot learning thus combines the advantages of workload-driven and data-driven learning.

Figure 5 illustrates the high-level idea of zero-shot learning. In the training phase, a representative workload is executed and training data is collected on several different databases; graph neural networks are used as the machine learning technique. The high cost of this training phase is a one-time effort, as the model does not need to be retrained for new databases. A new way of representing queries, called transferable featurization, enables the pre-trained zero-shot model to generalize and make predictions for new queries on unseen databases. The model captures information about query plans, their runtimes, and how changes in the database configuration influence query behavior.

Figure 5: Overview of zero-shot learning for databases.

Zero-shot learning can easily be extended to few-shot learning by fine-tuning the model with a few training queries on an unseen database.

Key Challenges of Zero-Shot Learning for Databases

Making zero-shot learning work for databases faces multiple challenges. First of all, a transferable representation of databases and queries is needed. The query representations of workload-driven models are not transferable to unseen databases because the encodings differ from database to database. For example, attribute names are usually one-hot encoded, so the same encoding carries different semantics in different databases. This calls for a new, transferable query representation: a query graph encoding with transferable features that generalizes across databases. The entire query is encoded as a graph (as seen in Figure 6), whose node types represent plan operators, predicates, tables, columns, and so on. The nodes are annotated with transferable features (e.g. the data type of a column instead of a one-hot encoding), which remain consistent across databases.

Figure 6: Overview of zero-shot encoding based on graph encodings containing transferable features.
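
A hedged sketch of such a graph encoding for a simple query is shown below; the node types and feature names are assumptions chosen for illustration, not the exact featurization from the paper:

    # Illustrative transferable encoding of:
    #   SELECT COUNT(*) FROM orders WHERE price < 20
    # Node features avoid database-specific identifiers (no one-hot column ids).
    import networkx as nx

    g = nx.DiGraph()
    g.add_node("agg",  node_type="plan_op",   op="Aggregate")
    g.add_node("scan", node_type="plan_op",   op="SeqScan")
    g.add_node("tab",  node_type="table",     rows=1_000_000, pages=12_000)
    g.add_node("col",  node_type="column",    data_type="float", ndv=5_000)
    g.add_node("pred", node_type="predicate", operator="<", literal_dtype="float",
               estimated_selectivity=0.2)     # supplied by a data-driven model

    g.add_edge("scan", "agg")                 # plan edge: scan feeds the aggregate
    g.add_edge("tab",  "scan")                # scan reads this table
    g.add_edge("col",  "pred")                # predicate references this column
    g.add_edge("pred", "scan")                # predicate is evaluated by the scan

    # The same node/edge types and features exist for any database, so a model
    # trained on these graphs can be applied to plans from unseen databases.
    print(g.nodes(data=True))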

Another key challenge is the collection of sufficient training data and the resulting robustness of the trained model. The question is: how many databases and workloads are necessary to train a model that generalizes properly across databases? A relatively small number of databases is sufficient for the model to generalize robustly and already outperform existing approaches. The required amount of training data can be estimated by evaluating the generalization performance on holdout databases: once that performance stagnates on unseen data, the training data can be considered sufficient. A further question is how to assemble useful training databases and workloads, which should ideally cover diverse characteristics.
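
One plausible way to operationalize this stopping criterion is sketched below; train_on and q_error_on are placeholders for the real training and evaluation routines, and the error curve is simulated:

    # Sketch of deciding "how many training databases are enough" via holdout
    # databases; the stopping rule is the point being illustrated.
    def train_on(databases):
        # placeholder: would train a zero-shot model on workloads from `databases`
        return {"trained_on": list(databases)}

    def q_error_on(model, holdout_dbs):
        # placeholder: would return the median Q-error on the holdout databases
        return 10.0 / (1 + len(model["trained_on"]))   # fake, monotone error curve

    all_dbs = [f"db_{i}" for i in range(1, 31)]
    train_dbs, holdout_dbs = all_dbs[:25], all_dbs[25:]

    prev_err, chosen = float("inf"), None
    for k in range(1, len(train_dbs) + 1):
        model = train_on(train_dbs[:k])
        err = q_error_on(model, holdout_dbs)
        if prev_err - err < 0.05:          # generalization error has stagnated
            chosen = k
            break
        prev_err = err

    print("databases needed for robust generalization:", chosen)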

Lastly, data characteristics (e.g. data distributions) and system characteristics (e.g. runtime complexity) should be kept separate rather than internalized in one model. This separation is necessary because data characteristics differ for new databases and are therefore not transferable. Hence, the data characteristics (e.g. estimated cardinalities of operators) should be captured in an additional data-driven model, which then informs the zero-shot model. This does not contradict the zero-shot advantage of one-time model training: data-driven models are derived from the data itself without collecting the output of training queries, so updating the data-driven model for the new database is enough.

Example: Zero-Shot Physical Cost Estimation

Zero-shot learning can be applied to multiple database tasks, e.g. physical cost estimation, design tuning, query optimization, and planning. In the example of physical cost estimation, the runtime of a given query plan is predicted using the transferable query representation. Physical cost estimation was previously done with simple cost models or workload-driven models, which required running many queries to train a model. Zero-shot learning improves on this by generalizing across databases and workloads, so the same model can be used on unseen databases to obtain predictions. Each node of the graph represents a physical operator, which captures the differences in runtime complexity; as transferable features, further nodes represent tables, columns, aggregations, and predicates. The zero-shot model learns the system characteristics separately from the data characteristics: it only encodes the predicate structure and uses estimated cardinalities from a data-driven model. A GNN-like architecture over this graph ultimately produces a runtime prediction for a query plan on a new database.
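
The following stripped-down sketch illustrates the message-passing idea in plain Python; the plan, features, and weights are invented (a real zero-shot model learns its weights), and the estimated cardinalities enter as node features supplied by a data-driven model rather than being learned by the cost model:

    # Toy message passing over a plan graph for cost estimation (illustrative only).
    import numpy as np

    # Plan: SeqScan(orders) and SeqScan(customer) feed a Hash Join, then an Aggregate.
    # Node features: [op_is_scan, op_is_join, op_is_agg, log_estimated_cardinality]
    nodes = {
        "scan_orders":   np.array([1.0, 0.0, 0.0, np.log(1_000_000)]),
        "scan_customer": np.array([1.0, 0.0, 0.0, np.log(100_000)]),
        "join":          np.array([0.0, 1.0, 0.0, np.log(900_000)]),
        "agg":           np.array([0.0, 0.0, 1.0, np.log(1)]),
    }
    children = {"join": ["scan_orders", "scan_customer"], "agg": ["join"],
                "scan_orders": [], "scan_customer": []}

    rng = np.random.default_rng(0)
    W_self, W_child = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
    w_out = rng.normal(size=4)

    def embed(node):
        """Bottom-up message passing: combine a node with its children's embeddings."""
        child_sum = sum((embed(c) for c in children[node]), np.zeros(4))
        return np.tanh(W_self @ nodes[node] + W_child @ child_sum)

    runtime_estimate = float(w_out @ embed("agg"))   # read out at the plan root
    print("predicted runtime (arbitrary units):", runtime_estimate)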

Results and Evaluation

The zero-shot approach was evaluated in the context of physical cost estimation, envisioned to be a building block for zero-shot models for many other DBMS tasks. The main premise is that a trained zero-shot cost estimation model can predict the query runtimes of a new database out-of-the-box. The prototype system designed to evaluate the approach used the transferable query representation previously introduced.

Unlike workload-driven components, the zero-shot model learns system characteristics separately from data characteristics; it therefore only encodes the predicate structure and uses estimated cardinalities from a data-driven model or the query optimizer as input for the cost estimation.

The evaluation model was trained on 19 databases, as performance was found to stagnate beyond that point. This indicates that even a moderate number of databases provides sufficient training data for the model to generalize to other databases.

The zero-shot model was compared to two workload-driven approaches (MSCN and E2E) as well as a simple linear model that scales the internal cost metric of the Postgres optimizer to actual runtimes. Additionally, the zero-shot model was tested both with estimated cardinalities obtained from learned estimators and with exact cardinalities. The workload-driven models were trained on different training sets, ranging from "small" sets of 100 queries to large sets of 50,000 queries. The training workload for the zero-shot model was comparable to the large training sets of the workload-driven models, but only had to be executed once and could then generalize to many databases, whereas the workload-driven training had to be repeated for every new database.

The approaches were compared using the median Q-error, which indicates by what factor a prediction deviates from the actual runtime. Since the zero-shot approach is not trained on the database used for evaluation while the workload-driven approaches are, the results for different numbers of training queries are compared. The benchmarks used for the comparison are scale, synthetic, and JOB-light.
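
For reference, the Q-error of a single prediction is the factor by which the estimate deviates from the true value in either direction (this is the standard definition, not specific to the talk):

    # Q-error: factor by which an estimate deviates from the true value (>= 1).
    def q_error(estimate, actual):
        return max(estimate, actual) / min(estimate, actual)

    print(q_error(2.0, 8.0))   # 4.0: the runtime estimate is off by a factor of 4
    print(q_error(8.0, 2.0))   # 4.0: over- and underestimation are treated alike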

Figure 7: Comparison of estimation errors of workload-driven and zero-shot driven cost models.

As seen in Figure 7, the zero-shot approach is very accurate compared with the workload-driven approaches, even when those are provided with large training sets. E2E models cannot match the zero-shot approach, and while MSCN can be quite accurate, the median Q-error hides the pronounced outliers that MSCN produces on queries with high variance.

Conclusion

This talk presented workload-driven learning, the first approach to using ML for optimizing the performance of DBMS components, along with its drawbacks. It then showed the potential of "Learned DBMS Components 2.0" by introducing data-driven and zero-shot learning. These new approaches can be trained on a moderate number of databases and then generalize to unseen databases.

References

B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig. Deepdb: Learn from data, not from queries! Proc. VLDB Endow., 13(7):992-1005, Mar. 2020.

B. Hilprecht and C. Binnig. One model to rule them all: Towards zero-shot learning for databases. arXiv preprint arXiv:2105.00642, May 2021.