Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Talk

The trend towards data-centric AI leads to increasingly complex, composite machine learning (ML) pipelines with outer loops for data integration and cleaning, data programming and augmentation, model and feature selection, hyperparameter tuning and cross-validation, as well as data validation and ML model debugging. Interestingly, state-of-the-art techniques for data integration, cleaning, and augmentation, as well as for model debugging, are often themselves based on machine learning, which motivates their integration into ML systems. In this talk, we make a case for optimizing compiler infrastructure in Apache SystemDS and DAPHNE, two sibling open-source ML systems. We discuss recent feature highlights and how they all fit together. The covered topics range from linear-algebra-based data cleaning pipeline enumeration and slice finding, through lineage-based reuse and workload-aware redundancy exploitation, to federated learning, vectorized execution on heterogeneous hardware devices, and extensibility.
 

About the Speaker

Matthias Boehm is a full professor of large-scale data engineering at Technische Universität Berlin and the BIFOLD research center. His cross-organizational research group focuses on high-level, data-science-centric abstractions as well as systems and tools to execute data science tasks in an efficient and scalable manner. From 2018 through 2022, Matthias was a BMK-endowed professor of data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany, in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.