Instructors: Prof. Tilmann Rabl, Ilin Tolovski
Description
Current open source and industry developments show a trend toward collaborative environments such as Jupyter Notebooks and Google Colab for developing data processing and machine learning pipelines. These tools provide a convenient interface for developing and running individual parts of a pipeline at a time. However, local state storage and locally restricted data access pose a significant challenge when pipelines are meant to be developed and run on multiple platforms. Limited data sharing capacity hinders the interoperability of data processing pipelines, limiting cooperation, extending the development lifecycle, and often producing unreliable results. Database-backed access gives users a consistent view of the data and provides a shared access point without unnecessary data transfers. Additionally, it opens up new opportunities to use the optimizations provided by the RDBMS to move parts of the workload away from the client, thus increasing runtime performance.
In this project, you will develop tools that rewrite flat-file operators to relational operators and move their execution to the RDBMS server. Specifically, you will look into the implementations of filters, aggregations, and joins in commonly used libraries such as pandas, NumPy, and scikit-learn, and develop rewrite rules that translate them into SQL statements. You will also look into the automated execution of such queries on the RDBMS server and the caching of intermediate results.
The projects will be written in Python and partially in SQL. You will learn more about the implementation of database operators and about interoperability in collaborative environments.
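To illustrate the core idea, the sketch below expresses the same filter once as a pandas operation over locally loaded data and once as a SQL statement executed inside the database. The table and column names (trips, fare, dist) are invented for the example, and an in-memory SQLite database stands in for the RDBMS server.

    import sqlite3
    import pandas as pd

    con = sqlite3.connect(":memory:")
    df = pd.DataFrame({"fare": [5.0, 12.5, 30.0], "dist": [1.2, 3.4, 9.9]})
    df.to_sql("trips", con, index=False)

    # Flat-file style: materialize everything client-side, then filter in pandas.
    local = df[df["fare"] > 10.0]

    # Rewritten: push the filter into the database as a WHERE clause.
    pushed = pd.read_sql_query("SELECT * FROM trips WHERE fare > 10.0", con)

    # Both routes yield the same rows (sorted here, since SQL row order
    # is not guaranteed without ORDER BY).
    assert (local.sort_values("fare").reset_index(drop=True)
                 .equals(pushed.sort_values("fare").reset_index(drop=True)))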
Structure
Project
This seminar will be structured around project topics in the field of data processing interoperability and database operator translation. Students can work in groups of two to develop a project idea, implement it, and evaluate it. At the end of the course, the students will present their findings and hand in a written report on their topic. We offer the possibility to publish the project results at a topic-related conference.
Paper presentations
In this course, the students will have the opportunity to prepare discussion sessions on state-of-the-art research in data management for data processing pipelines. This includes studying a research paper in detail, presenting it to the group, highlighting valuable insights, and leading the subsequent discussion. To be adequately prepared for this, we will first discuss best practices for reading, writing, and presenting scientific papers. Ideally, the papers presented in our sessions will cover the related work of the chosen project topics. Every week, each student will summarize one of the presented papers in a one-pager.
Grading
- Project + report - 60%
- Final presentation - 20%
- Paper presentations - 20%
Project Topics
As part of this seminar, we offer the following project ideas. You are welcome to propose your own ideas as well.
Rewrite rules for filter-based operators in scikit-learn and pandas
In this project, the students will create rewrite rules for several filtering operators available in scikit-learn and pandas. You will transform the operators into SQL or relational algebra statements that accelerate the preprocessing operations in data management or machine learning pipelines. A set of curated pipelines will be available for developing the rules and evaluating against. The evaluation should compare the runtime of the original and rewritten pipelines and showcase the potential performance benefits of the rewrites.
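One possible shape for such a rule, sketched under the assumption that the filter has already been extracted from the pipeline: a single-column comparison such as df[df["fare"] > 10.0] maps to a parameterized WHERE clause. The function and operator table below are illustrative, not part of any existing library.

    # Hypothetical rewrite rule: df[df[column] <op> value] -> SQL.
    OPS = {"gt": ">", "ge": ">=", "lt": "<", "le": "<=", "eq": "=", "ne": "<>"}

    def rewrite_comparison_filter(table, column, op, value):
        """Translate a single-column comparison filter into a parameterized query."""
        if op not in OPS:
            raise ValueError(f"unsupported operator: {op}")
        return f"SELECT * FROM {table} WHERE {column} {OPS[op]} ?", [value]

    # df[df["fare"] > 10.0] over a table 'trips' becomes:
    sql, params = rewrite_comparison_filter("trips", "fare", "gt", 10.0)
    # sql == "SELECT * FROM trips WHERE fare > ?", params == [10.0]

Filter-style feature selectors in scikit-learn (e.g., VarianceThreshold) could similarly be mapped to aggregate queries that compute the selection criterion inside the database.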
Rewrite rules for join and aggregation operators in scikit-learn and pandas
In a similar fashion, the students in this project will focus on aggregation- and join-based operators in scikit-learn and pandas. The rewrite rules will be used to accelerate data processing operations in data management and machine learning workloads. A set of curated pipelines will be available for developing the rules and evaluating against. The evaluation will compare the runtime of the original and rewritten pipelines and showcase the potential performance benefits of the rewrites.
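A minimal sketch of the target translation, using a hypothetical orders/customers schema and SQLite as a stand-in server: the pandas merge-plus-groupby pair collapses into a single JOIN + GROUP BY statement that the RDBMS can plan and optimize as a whole.

    import sqlite3
    import pandas as pd

    con = sqlite3.connect(":memory:")
    orders = pd.DataFrame({"cid": [1, 1, 2], "amount": [10.0, 5.0, 7.5]})
    customers = pd.DataFrame({"cid": [1, 2], "region": ["EU", "US"]})
    orders.to_sql("orders", con, index=False)
    customers.to_sql("customers", con, index=False)

    # Client-side pandas version: join, then aggregate.
    local = (orders.merge(customers, on="cid")
                   .groupby("region", as_index=False)["amount"].sum())

    # Rewritten version: one statement executed by the RDBMS.
    pushed = pd.read_sql_query(
        """SELECT c.region, SUM(o.amount) AS amount
           FROM orders o JOIN customers c ON o.cid = c.cid
           GROUP BY c.region
           ORDER BY c.region""", con)

    assert local.sort_values("region").reset_index(drop=True).equals(pushed)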
Execution of data processing pipelines in the RDBMS
Database-backed data access increases the interoperability of data processing pipelines by acting as a central data access point. Having the data stored in a relational database opens up the possibility to accelerate certain data processing operations that are usually executed client-side. In this project, the students will work on a system that pushes relational operators from the pipeline down to the RDBMS server for execution. Additionally, frequently accessed intermediate results will be cached so that they can be quickly retrieved to accelerate the pipeline runtime. The evaluation will compare the runtime of operators executed on the client against the server side, as well as the end-to-end runtime of the original and rewritten pipelines.
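A rough sketch of what such a component could look like, assuming a DB-API connection and SQLite as a stand-in for the server; the class name and the caching scheme (materializing intermediate results as tables keyed by a hash of the query text) are illustrative choices, not a prescribed design.

    import hashlib
    import sqlite3
    import pandas as pd

    class PushdownExecutor:
        """Execute pushed-down queries on the server and cache their results."""

        def __init__(self, con):
            self.con = con

        def run(self, query):
            # Key the cache on the query text; a real system would normalize
            # the query and track the freshness of the underlying tables.
            name = "cache_" + hashlib.sha1(query.encode()).hexdigest()[:12]
            exists = self.con.execute(
                "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
                (name,)).fetchone()
            if exists is None:
                # First run: execute server-side and materialize the result.
                self.con.execute(f"CREATE TABLE {name} AS {query}")
            # Later operators read the cached table instead of recomputing.
            return pd.read_sql_query(f"SELECT * FROM {name}", self.con)

    con = sqlite3.connect(":memory:")
    pd.DataFrame({"x": [1, 2, 3]}).to_sql("t", con, index=False)
    ex = PushdownExecutor(con)
    first = ex.run("SELECT x * 2 AS x2 FROM t")   # computed, then materialized
    second = ex.run("SELECT x * 2 AS x2 FROM t")  # served from the cached table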