Stefan Hagedorn

Affiliation: TU Ilmenau
Title: Making DataFrames get a move on: DBMS support for DataFrame operations
Slides: PDF

Abstract

SQL has been the standard language for querying relational databases for decades. DBMSs are highly optimized for storing and querying large amounts of data, but complex analysis tasks are often difficult or even impossible to express in SQL. For data science and analytics tasks, other languages and libraries, such as Python and Pandas, have become increasingly popular. However, because these Python scripts are executed on client PCs with much weaker hardware than the database server, a data scientist has to take care of buffer management for larger-than-RAM datasets and of parallelism for faster execution - problems that the DBMS has already solved.
In this talk we present Grizzly, an approach to executing operations on DataFrames inside a database system, and highlight challenges and opportunities for modern data analytics tasks. Grizzly produces SQL queries for operations on DataFrames, moving complexity from workstations to database servers. It makes it possible not only to access data already stored in a database, but also to combine it with external data from files, to execute user-defined functions, and to perform a "model join" that easily applies pre-trained machine learning models to data, all inside the database system.
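
The idea of producing SQL from DataFrame operations can be illustrated with a minimal sketch. This is not Grizzly's actual API: the `LazyFrame` class, its `filter`, `select`, `to_sql`, and `collect` methods, and the `events` table are all hypothetical names chosen for illustration. The sketch assumes that operations are only recorded on the client and that the database (here an in-memory SQLite instance standing in for a database server) does the actual work when the result is requested.

```python
import sqlite3


class LazyFrame:
    """Hypothetical DataFrame stand-in that records operations instead of running them."""

    def __init__(self, source, columns="*", predicates=()):
        self.source = source              # table name to read from
        self.columns = columns            # projection list as SQL text
        self.predicates = tuple(predicates)

    def filter(self, predicate):
        # Record the predicate; nothing is executed yet.
        return LazyFrame(self.source, self.columns, self.predicates + (predicate,))

    def select(self, *columns):
        # Record the projection; nothing is executed yet.
        return LazyFrame(self.source, ", ".join(columns), self.predicates)

    def to_sql(self):
        # Translate the recorded operations into a single SQL query.
        # Predicates are pasted in as raw SQL text, which is acceptable
        # only in a sketch with trusted input.
        sql = f"SELECT {self.columns} FROM {self.source}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

    def collect(self, connection):
        # Only now is the query shipped to the database and executed there.
        return connection.execute(self.to_sql()).fetchall()


# In-memory SQLite database with made-up example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, city TEXT, temp REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "Ilmenau", 21.5), (2, "Erfurt", 18.0), (3, "Ilmenau", 25.0)],
)

frame = (
    LazyFrame("events")
    .filter("city = 'Ilmenau'")
    .filter("temp > 22")
    .select("id", "temp")
)
query = frame.to_sql()
rows = frame.collect(conn)
print(query)
print(rows)
```

Because nothing runs until `collect` is called, the client never loads the table itself; only the rows matching the generated query are returned.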

Short CV

Stefan Hagedorn is a research associate at the Databases and Information Systems Group at TU Ilmenau. He worked on several projects in the fields of Semantic Web technologies, spatial data processing, and Big Data, and received his PhD in 2020 for his thesis on "Efficient Processing of Large Scale Spatio-temporal Data". Since 2020 he has been the managing director of the Thuringian Center for Learning Systems and Robotics.