Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Carsten Binnig is a Full Professor in the Computer Science department at TU Darmstadt and a Visiting Researcher at the Google Systems Research Group. Carsten received his Ph.D. at the University of Heidelberg in 2008. Afterwards, he spent time as a postdoctoral researcher in the Systems Group at ETH Zurich and at SAP working on in-memory databases. Currently, his research focus is on the design of scalable data systems on modern hardware as well as machine learning for scalable data systems. His work has been awarded a Google Faculty Award, as well as multiple best paper and best demo awards.

About the Talk

Database Management Systems (DBMSs) are the backbone for managing large volumes of data efficiently in the cloud. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, have to solve non-trivial problems. To tackle such problems, recent work has outlined a new direction of so-called learned DBMSs, where core parts of a DBMS are replaced by machine learning (ML) models, which have been shown to provide significant performance benefits. However, a major drawback of current approaches to learned DBMS components is that they not only cause very high overhead for training an ML model to replace a DBMS component, but that this overhead occurs repeatedly, which renders these approaches far from practical. In the first part of the talk, I will present our vision of zero-shot learned DBMSs to tackle these issues. In the second part, I will then outline very recent work on ML-augmented DBMSs that extend DBMSs with new capabilities such as seamless querying of multimodal data composed of tables, text, and images.

Towards ML-Augmented Cloud Database Systems

Summary written by Tendis Mende, Simon Shabo and Sandro Speh

Why is my query taking so long?

Getting the result of a database query can take a lot of time. Traditionally, a user might go to the administrator of the on-premise database system and complain about the performance. The admin might then adapt the database configuration to improve the performance of the required workload. But with cloud offerings like Snowflake or Amazon Redshift providing databases as a service, this manual tuning role no longer exists. Ways to automatically optimize the database configuration for the users’ workload could therefore be of great help to improve the performance and accuracy of databases as a service. Carsten Binnig proposes learned approaches to improve the performance of cloud database systems.

What are learned database systems?

“A learned database is a database that self-adjusts to the data and queries of the customer to provide better performance.”

Machine learning uses input data to train models that predict a result. In the database context, a fitting example is predicting the runtime of a database query. To gather training data, one runs queries against the database and records their durations. These pairs of query statement and runtime are then used to train a model. Given an unseen query, the model can then predict, based on the queries it has been trained on, how long the query will take.
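To make this concrete, here is a minimal sketch of workload-driven runtime prediction, not Prof. Binnig's actual pipeline: queries are executed once to collect runtimes, a few hand-picked features are extracted from each query, and a standard regression model (scikit-learn here, purely as an example) learns to predict runtimes for new queries. The workload and the features are made up for illustration.

```python
# Minimal sketch of workload-driven training: collect (query, runtime) pairs,
# featurize the queries, and fit a regression model to predict runtimes.
from sklearn.ensemble import GradientBoostingRegressor

def featurize(sql: str) -> list[float]:
    """Toy featurization: keyword counts as a stand-in for a real query
    encoding (real systems encode plans, predicates, cardinalities, ...)."""
    s = sql.lower()
    return [s.count("join"), s.count("where"), s.count("group by"), len(s)]

# Hypothetical training workload: (query, observed runtime in seconds).
workload = [
    ("SELECT * FROM orders WHERE price > 100", 0.8),
    ("SELECT * FROM orders o JOIN customers c ON o.cid = c.id", 2.4),
    ("SELECT cid, SUM(price) FROM orders GROUP BY cid", 1.3),
]

X = [featurize(q) for q, _ in workload]
y = [runtime for _, runtime in workload]
model = GradientBoostingRegressor().fit(X, y)

# Predict the runtime of an unseen query against the same database.
print(model.predict([featurize("SELECT * FROM orders WHERE price > 10")]))
```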

This simple example illustrates workload-driven training, where many queries are executed so that a model can be trained on the observed results. It is the first of the three learning approaches Prof. Binnig presented in his talk.

Furthermore, researchers have shown that learned approaches can be applied successfully to a large spectrum of database components. Not only runtime prediction, but also query operators, query optimizers, and schedulers can be improved with learned approaches. Researchers at Google have shown that a learned index can outperform a traditional B-Tree index by up to 70% in speed while saving an order of magnitude of memory. So there is a lot of potential, but how can this be done?
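As a rough illustration of the learned-index idea (in the spirit of the Google work, not its actual implementation), the sketch below fits a simple linear model to the sorted keys so that the model predicts a key's position, and a small local search corrects the prediction; real learned indexes use hierarchies of models rather than a single line.

```python
# Sketch of a learned index: a model predicts where a key sits in a sorted
# array, and a bounded local search fixes up the prediction error.
import numpy as np

keys = np.sort(np.random.randint(0, 1_000_000, size=100_000))

# Fit a line approximating the cumulative distribution of the keys.
slope, intercept = np.polyfit(keys, np.arange(len(keys)), deg=1)

def lookup(key: int, max_error: int = 2000) -> int:
    """Predict the position, correct locally, and fall back to a full binary
    search if the prediction error exceeds the search window."""
    pos = int(slope * key + intercept)
    lo, hi = max(0, pos - max_error), min(len(keys), pos + max_error)
    i = lo + int(np.searchsorted(keys[lo:hi], key))
    if i >= len(keys) or keys[i] != key:
        i = int(np.searchsorted(keys, key))
    return i

idx = lookup(int(keys[4242]))
assert keys[idx] == keys[4242]
```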

How can models learn?

As already mentioned, Prof. Binnig listed three categories of learned models. The first one, workload-driven training, has to be trained with a lot of data and has to be retrained whenever the underlying data changes. The other two approaches try to avoid these high training costs.

The second approach, data-driven learning, avoids these costs by learning from the data itself rather than from queries. It uses the available data to train models, for example for cardinality estimation. Cardinality estimates tell the optimizer how large intermediate results will be, so that it can order the operators of a query (for instance its joins) to produce the smallest intermediate results, thereby reducing the input cardinality of the next operator. Traditional cardinality estimates are often far off; hence, there is a lot of room for improvement.
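The following toy example, which is not DeepDB but only illustrates the motivation for data-driven models, shows why per-column statistics mislead the optimizer: under the independence assumption, two correlated predicates are badly underestimated, whereas an estimate derived from the data itself (here simply a uniform sample of the table) captures the correlation.

```python
# Why learn from the data? Independence assumptions miss column correlations.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
city = rng.choice(["Berlin", "Potsdam"], size=n, p=[0.9, 0.1])
# Correlated column: in this toy table the area code is determined by the city.
area_code = np.where(city == "Berlin", "030", "0331")

# Query: SELECT * FROM customers WHERE city = 'Potsdam' AND area_code = '0331'
true_card = np.sum((city == "Potsdam") & (area_code == "0331"))

# Classical estimate: per-column selectivities multiplied under independence.
est_independent = np.mean(city == "Potsdam") * np.mean(area_code == "0331") * n

# Data-driven estimate from a 1% sample of the table, which keeps the correlation.
sample = rng.choice(n, size=n // 100, replace=False)
sel_sample = np.mean((city[sample] == "Potsdam") & (area_code[sample] == "0331"))
est_sample = sel_sample * n

print(true_card, int(est_independent), int(est_sample))  # independence is ~10x off
```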

The third approach is called zero-shot learning. It tries to build more generally applicable models that also work on databases they have not seen before. The previous approaches are all tied to one or more specific databases, with their attributes and table names, on which they were trained. Therefore, the model needs to be retrained whenever the data changes and cannot be used with other databases. In zero-shot learning, the model instead receives the training data encoded in a more general way, so that what it learns carries over to other databases rather than applying only to queries on a single specific database.

The learning approaches presented by Prof. Binnig

Are these approaches successful?

For data-driven learning, Prof. Binnig showed two works in which significant improvements were measured. The multi-set convolutional network by Kipf et al. achieved significantly better accuracy than classical techniques for cardinality estimation. Unfortunately, this network required a lot of queries to obtain its training data and had to be retrained whenever the data changed. Prof. Binnig's group developed DeepDB, a data-driven approach that learns the data distribution within and across tables. Traditionally, histograms are used to capture the data distribution, but they fail to capture correlations between the columns of a table. With their approach, the researchers trained models that give cardinality estimates for a single table and that can also be combined to estimate the cardinalities of queries spanning multiple tables. The results are significant: the estimates are about four times more precise than PostgreSQL's, with an error of around 30%, while the model remains small at only a few megabytes.
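As a back-of-the-envelope illustration of how per-table estimates can be combined for multi-table queries, the snippet below uses the classical independence-based join formula; DeepDB itself learns sum-product networks over the data rather than applying this formula, so this is only meant to show where learned per-table selectivities could plug in.

```python
# Combining per-table filter selectivities into a join cardinality estimate
# (textbook formula, shown purely for illustration).
def join_cardinality(rows_r: float, sel_r: float,
                     rows_s: float, sel_s: float,
                     distinct_join_keys: float) -> float:
    """Estimated size of R joined with S, given per-table filter selectivities."""
    return (rows_r * sel_r) * (rows_s * sel_s) / distinct_join_keys

# orders (10M rows, filter keeps 5%) joined with customers (1M rows, filter keeps
# 20%) on customer_id with about 1M distinct customer ids.
print(join_cardinality(10_000_000, 0.05, 1_000_000, 0.20, 1_000_000))  # ~100,000
```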

The core challenge for zero-shot learning is to find a way to encode the input data so generally that a model trained on it can later be applied to unknown databases. For query runtime prediction, Binnig and Hilprecht found such an encoding: instead of the query text, they use the query plan, which carries metadata about the inputs such as row counts and table widths. The hope is that the model learns, for example, that a query runs longer the larger its input data is. Because this metadata is also available in the query plans of other databases, the model can be transferred to unseen databases. In their evaluation, they showed that their runtime estimates are much more accurate than those of PostgreSQL. Furthermore, they tested on real-world data instead of the synthetically created data that many benchmarks in this field rely on.
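The sketch below illustrates this encoding idea under stated assumptions: each plan operator is described only by transferable features such as its operator type, estimated input rows, and tuple width, never by table or column names. The class names and feature choices are illustrative and are not the actual encoding used by Hilprecht and Binnig.

```python
# Database-agnostic featurization of a query plan for zero-shot runtime prediction.
from dataclasses import dataclass

@dataclass
class PlanNode:
    op: str              # e.g. "SeqScan", "HashJoin", "Aggregate"
    est_rows: float      # estimated input cardinality
    tuple_width: float   # bytes per row
    children: tuple["PlanNode", ...] = ()

OPS = ["SeqScan", "IndexScan", "HashJoin", "Aggregate", "Sort"]

def encode(node: PlanNode) -> list[float]:
    """One-hot operator type plus size statistics, summed over the plan tree
    (a real model would use a graph encoding instead of a flat sum)."""
    vec = [float(node.op == op) for op in OPS]
    vec += [node.est_rows, node.est_rows * node.tuple_width]
    for child in node.children:
        vec = [a + b for a, b in zip(vec, encode(child))]
    return vec

# A toy plan: hash join of two sequential scans, followed by an aggregation.
plan = PlanNode("Aggregate", 1e4, 16, (
    PlanNode("HashJoin", 1e6, 64, (
        PlanNode("SeqScan", 1e6, 48),
        PlanNode("SeqScan", 1e5, 32),
    )),
))
features = encode(plan)  # could be fed to a regression model for runtime
print(features)
```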

So, what should I take away from this?

As databases move to the cloud, manual optimization is no longer feasible. Instead, automated ways to adapt to the customers’ workloads are needed. Machine learning offers multiple ways to do this, of which Prof. Binnig presented three: workload-driven, data-driven, and zero-shot learning. The first requires a lot of training data and therefore incurs high costs for training (and retraining). Data-driven learning works only for use cases where the required knowledge can be learned from the data itself, and zero-shot learning uses a generalized encoding of the input data to make its models applicable to unknown databases. All of these approaches showed significant improvements over traditional database techniques.

What should I be excited about?

Binnig is excited about the next steps: using machine learning not only to make databases faster, but also smarter. Using language models like GPT, further research will try to allow querying data that is not present in the tables. Since GPT-style models have a lot of common knowledge encoded in them, they can themselves be seen as databases. With them, queries that go beyond the data stored in the tables could be answered, and data of types that databases cannot yet interpret (like images) could be queried and interpreted. There are lots of exciting things to look forward to and plenty of reasons to stay updated on the topic of ML-augmented cloud database systems.