Interoperability in Data Processing Pipelines

Seminar, Winter Semester 2022/23

Instructors: Prof. Tilmann Rabl, Ilin Tolovski

Description

Current open source and industry developments show a trend toward increased use of collaborative environments, such as Jupyter Notebooks and Google Colab, for developing data processing and machine learning pipelines. Such tools provide a convenient interface for developing and running separate parts of a pipeline one at a time. However, state storage and locally restricted data access pose a significant challenge when pipelines are meant to be developed and run on multiple platforms. Limited data sharing capacity hinders the interoperability of data processing pipelines, limiting cooperation, extending the development lifecycle, and often producing unreliable results. Database-backed access gives users a consistent state of the data and provides a shared access point without unnecessary data transfers. Additionally, it opens up new opportunities to use the optimizations provided by the RDBMS to move parts of the workload away from the user's machine, thus improving runtime performance.

In this project, you will develop tools to rewrite flat file operators into relational operators and move their execution to the RDBMS server. Specifically, you will look into the implementations of filters, aggregations, and joins in commonly used libraries such as pandas, NumPy, and scikit-learn, and develop rewrite rules to translate them into SQL statements. In addition, you will look into the automated execution of such queries on the RDBMS server and the caching of intermediate results.

The projects will be written in Python and partially in SQL. You will learn more about the implementation of database operators and about interoperability in collaborative environments.

Structure

Project

This seminar is structured around project topics in the fields of data processing interoperability and database operator translation. Students work in groups of two to develop a project idea, implement it, and evaluate it. At the end of the course, students present their findings and hand in a written report on their topic. We offer the possibility to publish the project results at a topic-related conference.

Paper presentations

In this course, students will have the opportunity to prepare discussion sessions on state-of-the-art research in data management for data processing pipelines. This includes studying a research paper in detail, presenting it to the group, contributing valuable insights, and leading the subsequent discussion. To prepare for this, we will first discuss best practices for reading, writing, and presenting scientific papers. Ideally, the papers presented in our sessions will cover the related work of the chosen project topics. Every week, each student will need to summarize one of the presented papers in a one-pager.

Grading

Project + report - 60%
Final presentation - 20%
Paper presentations - 20%

Project Topics

As a part of this seminar, we offer the following project ideas. You are welcome to propose your own ideas as well.

Rewrite rules for filter-based operators in scikit-learn and pandas

In this project, you will create rewrite rules for several filtering operators available in scikit-learn and pandas. You will transform the operators into SQL or relational algebra statements that are used to accelerate the preprocessing operations in data management or machine learning pipelines. A set of curated pipelines will be available for developing the rules and evaluating them. The evaluation should cover the runtime of the newly created pipelines and showcase the potential performance benefits of the rewritten pipelines.
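As an illustration of what such a rewrite rule might look like, the following minimal sketch translates a predicate string in the style accepted by pandas `DataFrame.query` into a parameterized SQL `WHERE` clause. The helper names `to_where` and `filter_rows`, the `people` table, and its sample rows are hypothetical, and SQLite (from the Python standard library) stands in for the RDBMS server. Only comparisons of a column against a constant, combined with `and`/`or`, are handled.

```python
import ast
import sqlite3

# Assumed mapping from Python comparison nodes to SQL operators.
_CMP = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
        ast.Eq: "=", ast.NotEq: "<>"}


def to_where(expr):
    """Translate a simple query-style predicate into (where_sql, params)."""
    params = []

    def visit(node):
        if isinstance(node, ast.BoolOp):
            joiner = " AND " if isinstance(node.op, ast.And) else " OR "
            return "(" + joiner.join([visit(v) for v in node.values]) + ")"
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            left, right = node.left, node.comparators[0]
            op = _CMP.get(type(node.ops[0]))
            if op and isinstance(left, ast.Name) and isinstance(right, ast.Constant):
                params.append(right.value)  # value is bound, not inlined
                return f'"{left.id}" {op} ?'
        raise ValueError("unsupported expression")

    return visit(ast.parse(expr, mode="eval").body), params


def filter_rows(conn, table, expr):
    """Run the rewritten filter on the database instead of on the client."""
    where, params = to_where(expr)
    return conn.execute(f'SELECT * FROM "{table}" WHERE {where}', params).fetchall()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [("Ada", 36, "Berlin"), ("Bob", 25, "Berlin"), ("Cy", 41, "Potsdam")])

# Counterpart of df.query("age > 30 and city == 'Berlin'")
rows = filter_rows(conn, "people", "age > 30 and city == 'Berlin'")
```

A complete rule set would also need to handle missing values, since comparisons against `NaN` in pandas and against `NULL` in SQL do not behave identically in every case.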

Rewrite rules for join and aggregation operators in scikit-learn and pandas

Similarly, in this project you will focus on aggregation and join-based operators in scikit-learn and pandas. The rewrite rules will be used to accelerate data processing operations in data management and machine learning workloads. A set of curated pipelines will be available for developing the rules and evaluating them. The evaluation will cover the runtime of the newly created pipelines and showcase the potential performance benefits of the rewritten pipelines.
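One possible shape for such a rule is sketched below: it maps a pandas chain of the form `left.merge(right, on=..., how=...).groupby(by).agg(...)` to a single SQL statement. The function `merge_groupby_to_sql`, the table names, and the sample data are hypothetical, only a subset of `how` values and aggregation functions is covered, and SQLite again stands in for the RDBMS server.

```python
import sqlite3

# Assumed subsets of pandas `how` values and aggregation names.
_JOIN = {"inner": "INNER JOIN", "left": "LEFT JOIN"}
_AGG = {"sum": "SUM", "mean": "AVG", "count": "COUNT", "min": "MIN", "max": "MAX"}


def merge_groupby_to_sql(left, right, on, by, aggs, how="inner"):
    """Hypothetical rule for left.merge(right, on=on, how=how).groupby(by).agg(aggs)."""
    select = [f'"{by}"'] + [
        f'{_AGG[fn]}("{col}") AS "{col}_{fn}"' for col, fn in aggs.items()
    ]
    return (
        f'SELECT {", ".join(select)} '
        f'FROM "{left}" {_JOIN[how]} "{right}" USING ("{on}") '
        f'GROUP BY "{by}" ORDER BY "{by}"'  # pandas groupby sorts group keys by default
    )


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, country TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "DE"), (2, "FR")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0)])

# Counterpart of:
# orders.merge(customers, on="customer_id").groupby("country").agg({"amount": "sum"})
sql = merge_groupby_to_sql("orders", "customers", on="customer_id",
                           by="country", aggs={"amount": "sum"})
result = conn.execute(sql).fetchall()
```

The sketch ignores semantic differences that a real rule would have to address. For example, pandas `merge` matches null join keys with each other while SQL joins do not, and pandas `groupby` drops null group keys by default while SQL `GROUP BY` keeps them as a group.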

Execution of data processing pipelines in the RDBMS

Database-backed data access increases the interoperability of data processing pipelines by acting as a central data access point. Having the data stored in a relational database opens up the possibility of accelerating certain data processing operations that are usually executed on the client side. In this project, you will work on a system that pushes relational operators from the pipeline to the RDBMS server for execution. Additionally, frequently accessed intermediate results will be cached so that they can be retrieved quickly, reducing the pipeline runtime. The evaluation will compare the runtime of the operators executed on the client against their runtime on the server, as well as the end-to-end runtime of the original and the rewritten pipelines.
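The combination of server-side execution and result caching could be prototyped along the following lines. This is a minimal sketch under several assumptions: the class `CachingExecutor` is hypothetical, the cache is a plain in-memory dictionary on the client keyed by the SQL text and its parameters, cache invalidation is manual, and SQLite stands in for the RDBMS server.

```python
import sqlite3


class CachingExecutor:
    """Hypothetical helper: runs queries on the database and caches their results."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}        # (sql, params) -> list of result rows
        self.server_calls = 0  # number of queries actually sent to the database

    def run(self, sql, params=()):
        key = (sql, tuple(params))
        if key not in self.cache:
            self.server_calls += 1
            self.cache[key] = self.conn.execute(sql, params).fetchall()
        return self.cache[key]

    def invalidate(self):
        """Drop all cached results, e.g. after the underlying tables change."""
        self.cache.clear()


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [("a", 1.0), ("a", 3.0), ("b", 10.0)])

executor = CachingExecutor(conn)
query = "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
first = executor.run(query)   # executed on the database
second = executor.run(query)  # served from the cache
```

A client-side dictionary is only one design option. Materializing intermediate results as temporary tables on the server would keep the data close to subsequent operators, at the cost of more involved invalidation.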

Schedule

Week 1: Introduction to the seminar: Data Management in ML Systems
- Seminar logistics
- Data processing pipelines
- Interoperability & sharing
- Pipeline rewriting
- Project topics
- Literature & references
Week 2: How to read, write and present a scientific paper
Week 3: Paper presentations
Week 4: Paper presentations
Week 5: Proposal presentations
Week 6: Paper presentations
Week 7: Paper presentations
Week 8: Project meeting
Week 9: Project meeting / Paper presentations
Christmas Break: 19.12.2022 - 30.12.2022
Week 10: Project meeting
Week 11: Intermediate presentation
Week 12: Project meeting
Week 13: Project meeting
Week 14: Project meeting
Week 15: Final presentations 06.02.2023
Deadline for reports: 20.02.2023

Announcements

The course will be conducted on-site at HPI. The lectures will take place on Tuesdays at 13:30 in Room F-E.06 (Campus II).
The course is managed via Moodle, where we will post announcements and share course materials.
HPI Moodle Course
The course is limited to 12 students.
If you have any questions, please contact me at ilin.tolovski (at) hpi.de
