Why is my query taking so long?
Getting the result of a database query can take a lot of time. Traditionally, a user would go to the on-premises database administrator and complain about performance, and the administrator would adapt the database configuration to better fit the workload. But with cloud vendors like Snowflake or Amazon Redshift offering databases as a service, this manual tuning role no longer exists. Ways to optimize the database configuration for a user's workload automatically could therefore greatly improve the performance of databases as a service. Carsten Binnig proposes learned approaches to improve the performance of cloud database systems.
What are learned database systems?
“A learned database is a database that self-adjusts to the data and queries of the customer to provide better performance”
Machine learning uses input data to train models that predict a result. In the database context, a fitting example is predicting the runtime of a database query. To gather training data, one runs queries against the database and records their durations. These pairs of query statement and runtime are then used to train the model. Given an unseen query, the model can then predict, based on the queries it was trained on, how long the query is going to take.
This simple illustration of a trained model is called workload-driven training: many queries are run, and a model is trained on the results. It is the first of three learning modes Prof. Binnig mentioned in his talk.
Furthermore, researchers have successfully applied learned approaches to a large spectrum of database components. Not only runtime prediction, but also query operators, query optimizers, and schedulers can be improved this way. At Google, a learned index was shown to outperform a traditional B-Tree index by up to 70% in speed while using an order of magnitude less memory. So there is a lot of potential, but how can this be done?
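The core intuition behind a learned index can be shown in a few lines. This is a deliberately simplified sketch, not Google's actual implementation: a linear model predicts the position of a key in a sorted array, and a short local search corrects the prediction.

```python
def fit_linear(keys):
    """Least-squares line mapping key -> position in the sorted array.
    This model replaces the B-Tree's inner nodes."""
    n = len(keys)
    mx = sum(keys) / n
    my = (n - 1) / 2  # mean of positions 0..n-1
    slope = sum((x - mx) * (y - my) for x, y in zip(keys, range(n))) / \
            sum((x - mx) ** 2 for x in keys)
    return slope, my - slope * mx

def lookup(keys, model, key):
    """Predict a position, then scan locally to find the exact slot."""
    slope, intercept = model
    pos = min(max(int(slope * key + intercept), 0), len(keys) - 1)
    # Correct the model's prediction with a short linear search.
    while pos > 0 and keys[pos] > key:
        pos -= 1
    while pos < len(keys) - 1 and keys[pos] < key:
        pos += 1
    return pos if keys[pos] == key else None

keys = sorted([3, 8, 15, 16, 23, 42, 57, 64])
model = fit_linear(keys)
print(lookup(keys, model, 23))  # prints 4, the position of key 23
```

The memory saving comes from storing only two floats (slope and intercept) instead of a tree of inner nodes; real learned indexes use a hierarchy of such models with bounded error.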
How can models learn?
As already mentioned, Prof. Binnig listed three categories of learned models. The first one, workload-driven training, requires a lot of training data and has to be retrained whenever the underlying data changes. The two other approaches try to avoid these high training costs.
The second approach, data-driven learning, avoids these costs by learning from the data itself rather than from the queries. It trains models on the stored data, for example for cardinality estimation. Cardinality estimates tell the optimizer how large intermediate results will be, so it can reorder the commutative operators in a query to produce the smallest intermediate results first, reducing the input cardinality for the next operator. Traditional cardinality estimates are often far off; hence, there is a lot of room for improvement.
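To make concrete how cardinality estimates drive this reordering, here is a small sketch (the predicates and selectivity numbers are made up for illustration): given estimated selectivities for a set of commutative filters, applying the most selective filter first keeps every intermediate result as small as possible.

```python
def order_filters(input_card, filters):
    """filters: {predicate: estimated selectivity in [0, 1]}.
    Apply the most selective predicates first and report the
    estimated cardinality after each step."""
    plan = sorted(filters.items(), key=lambda kv: kv[1])
    card = input_card
    intermediates = []
    for predicate, selectivity in plan:
        card = card * selectivity
        intermediates.append((predicate, round(card)))
    return intermediates

# A 1M-row table with three commutative predicates and their
# (hypothetical) estimated selectivities.
plan = order_filters(1_000_000, {
    "country = 'DE'": 0.05,
    "age > 18": 0.8,
    "status = 'active'": 0.3,
})
print(plan)
```

If the estimated selectivities are far off, as they often are with traditional histogram-based estimators, the optimizer picks a bad order, which is exactly where a learned, data-driven estimator can help.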
The third approach is called zero-shot learning. It tries to build more generally applicable models that also work on databases they have never seen. The previous models are tied to the specific databases, attribute names, and table names they were trained on; they must be retrained when the data changes and cannot be used with other databases. In zero-shot learning, the model receives the training data encoded in a more general, transferable way, so that it can make predictions for other databases as well, not only for queries against a single specific database.
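The intuition behind such transferable encodings can be sketched as follows; this is my own illustrative simplification, not the encoding from the talk. Instead of identifiers tied to one schema (table and column names, which a model would memorize), the query is described only by structural features and data statistics that exist in every database.

```python
def transferable_encoding(query: str, table_rows: int) -> list[float]:
    """Encode a query using only schema-independent features, so a model
    trained on one database can be applied to another (zero-shot).
    The chosen features here are illustrative assumptions."""
    q = query.upper()
    return [
        float(q.count("JOIN")),      # plan shape: number of joins
        float(q.count("WHERE")),     # plan shape: filtering present
        float(q.count("GROUP BY")),  # plan shape: aggregation present
        float(table_rows),           # data statistics instead of table names
    ]

# The encoding never mentions 'sales' or 'stores', only structure and size.
print(transferable_encoding(
    "SELECT * FROM sales s JOIN stores t ON s.sid = t.id",
    table_rows=500_000))
```

Because nothing in the feature vector refers to a particular schema, the same trained model can, in principle, score queries on a database it has never seen.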