Instructors: Prof. Tilmann Rabl, Ricardo Salazar
Currently, companies and organizations of all sizes produce enormous amounts of data, i.e., Big Data, which they store as structured and semi-structured data in Database Management Systems (DBMSs). These systems allow users to efficiently analyze the data using traditional SQL operators such as GROUP BY, RANK, and AVG. Nonetheless, more complex tasks that leverage the power of Machine Learning (ML) to extract insights from the data, such as customer segmentation, product recommendation, and fraud detection, require iterative computations and custom transformations and are therefore executed in systems better suited to those tasks, such as Scikit-learn, R, PyTorch, or TensorFlow.
While ML training requires iterative computations and several passes over the data, ML inference does not, and therefore pushing the model inference step into the DBMS offers several advantages. Kläbe et al. identified five benefits of executing ML inference in the DBMS: reduced data transfer, exploitation of server hardware, scalability, query integration, and control over sensitive data. The poor interplay between DBMSs and ML runtimes has led to increasing interest in industry and academia in improving the performance of ML training and inference. For instance, some works focus on reducing or removing data movement between DBMSs and ML runtimes by extending the SQL language with ML operators. Other works focus on efficiently executing User Defined Functions (UDFs) written in guest languages like Python inside a DBMS (SQL Server, Postgres, DuckDB). Moreover, some works enable cross-optimization between relational and linear algebra operators by optimizing an Intermediate Representation (IR) and mapping tasks to either the data engine or an ML runtime.
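To make the UDF approach concrete, the following is a minimal sketch of in-DBMS inference using SQLite's scalar UDF mechanism from the Python standard library. The model here is a hypothetical toy linear model (the weights, bias, table, and `predict` function are illustrative placeholders, not from any of the referenced systems); a real deployment would register a trained model in an engine such as Postgres or DuckDB instead.

```python
import sqlite3

# Hypothetical pre-trained linear model; weights and bias are placeholders.
WEIGHTS = [0.4, -0.2]
BIAS = 0.1

def predict(x1, x2):
    """Score one row with the toy model; stands in for real ML inference."""
    return WEIGHTS[0] * x1 + WEIGHTS[1] * x2 + BIAS

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (x1 REAL, x2 REAL)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1.0, 2.0), (3.0, 0.5)])

# Register the Python function as a scalar UDF, so inference runs inside
# the DBMS query instead of shipping raw rows to an external ML runtime.
con.create_function("predict", 2, predict)

scores = con.execute("SELECT predict(x1, x2) FROM customers").fetchall()
```

The scores can then feed directly into further relational processing (filters, joins, GROUP BY) in the same query, which is the query-integration benefit discussed above.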
The goal of this seminar is to evaluate current approaches to In-DBMS ML inference. Students will implement a solution for each framework (SQL extension, UDF execution, IR), weighing the trade-offs of the different approaches in terms of end-to-end latency, resource consumption, and effectiveness. The implemented solutions should allow the execution of complex ML inference pipelines in a DBMS.
Students will gain insights into implementation challenges and design choices in each of the proposed frameworks. Moreover, students will get an understanding of the challenges and opportunities of performing ML inference inside a DBMS.
This seminar targets students interested in the intersection of data management and ML systems. Familiarity with ML runtimes (Scikit-learn, PyTorch, TensorFlow, etc.) and with ML training and inference is beneficial. Experience with relational DBMSs like Postgres and proficiency in SQL are also favorable.
Please contact Ricardo Salazar (email@example.com) with any questions.
References:
- Towards a unified architecture for in-RDBMS analytics
- In-Database Machine Learning with SQL on GPUs
- Integrating Association Rule Mining with Relational Database Systems: Alternatives and Implications
- Declarative recursive computation on an RDBMS: or, why you should use a database for distributed machine learning
- Deep Integration of Machine Learning Into Column Stores
- One-pass data mining algorithms in a DBMS with UDFs
- Exploration of Approaches for In-Database ML
- Applying Machine Learning Models to Scalable DataFrames with Grizzly
- Extending Relational Query Processing with ML Inference
- End-to-end Optimization of Machine Learning Prediction Queries
- An Intermediate Representation for Optimizing Machine Learning Pipelines