Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

DAPHNE in a nutshell

The European project "Integrated Data Analysis Pipelines for Large-Scale Data Management, HPC, and Machine Learning" (DAPHNE) aims to create open and extensible systems support for integrated data analysis pipelines that combine data management, high-performance computing (HPC), and machine learning (ML).

Abstract

Modern data-driven applications leverage large, heterogeneous data collections to find interesting patterns, and build robust machine learning (ML) models for accurate predictions. Large data sizes and advanced analytics spurred the development and adoption of data-parallel computation frameworks like Apache Spark or Flink as well as distributed ML systems like MLlib, TensorFlow, or PyTorch. A key observation is that these new systems share many techniques with traditional high-performance computing (HPC), and the architecture of underlying hardware clusters converges.

Yet, the programming paradigms, cluster resource management, as well as data formats and representations differ substantially across data management, HPC, and ML software stacks. There is a trend, though, toward complex data analysis pipelines that combine these different systems. Examples are workflows of distributed data pre-processing, tuned HPC libraries, and dedicated ML systems, but also HPC applications that leverage ML models for more cost-effective simulation. Major obstacles are (1) limited development productivity for integrated analysis pipelines due to different programming models and separate cluster environments, (2) unnecessary data movement overhead and underutilization due to separate, statically provisioned clusters, and (3) the lack of a common system infrastructure with good interoperability.

For these reasons, DAPHNE’s overall objective is the definition of an open and extensible systems infrastructure for integrated data analysis pipelines. It aims to build a reference implementation of language abstractions (i.e., APIs and a domain-specific language), an intermediate representation, as well as compilation and runtime techniques with support for integrating and scheduling heterogeneous accelerator and storage devices. A variety of real-world, high-impact use cases, datasets, and a new benchmark will be used for qualitative and quantitative analysis compared to the state of the art.
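To make the integration problem concrete, the following sketch shows how such a pipeline is typically stitched together today from separate stages, each owning its own data representation. This is illustrative Python, not DAPHNE or DaphneDSL code; the stage labels are assumptions used to mirror the three domains named above.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stage 1: "data management" -- cleaning/preparation of raw records.
raw = rng.normal(size=(100, 4))
raw[::10, 1] = np.nan                      # inject some missing values
clean = raw[~np.isnan(raw).any(axis=1)]    # drop incomplete rows

# Stage 2: "HPC kernel" -- a dense linear-algebra step (normal equations).
X, y = clean[:, :3], clean[:, 3]
theta = np.linalg.solve(X.T @ X, X.T @ y)  # least-squares fit

# Stage 3: "ML inference" -- score data with the fitted model.
preds = X @ theta
rmse = float(np.sqrt(np.mean((preds - y) ** 2)))
print(f"rows kept: {len(clean)}, model coefficients: {theta.shape[0]}")
```

In a single NumPy process the handoffs are cheap, but once each stage runs in a different system on a different cluster, every arrow between stages becomes a format conversion and a data transfer, which is exactly the overhead a shared infrastructure and intermediate representation target.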

Main Objectives

System Architecture, APIs and DSL: Improve the productivity of developing integrated data analysis pipelines via appropriate APIs and a domain-specific language, as well as an overall system architecture for seamless integration with existing data processing frameworks, HPC libraries, and ML systems. A major goal is an open, extensible reference implementation of the necessary compiler and runtime infrastructure to simplify the integration of current and future state-of-the-art methods.

Hierarchical Scheduling and Task Planning: Improve the utilization of existing computing clusters, multiple heterogeneous hardware devices, and capabilities of modern storage and memory technologies through improved scheduling as well as static (compile time) task planning. In this context, we also aim to automatically leverage interesting data characteristics such as the sorting order, degree of redundancy, and matrix/tensor sparsity.

Use Cases and Benchmarking: The technological results will be evaluated on a variety of real-world use cases and datasets as well as a new benchmark developed as part of the DAPHNE project. We aim to improve the accuracy and runtime of these use cases, which combine data management, machine learning, and HPC – this exploratory analysis serves as a qualitative study of productivity improvements. The variety of real-world use cases will further be generalized into a benchmark for integrated data analysis pipelines, quantifying the progress compared to the state of the art.

End-to-end Benchmarking

As members of DAPHNE, our research group is particularly interested in the benchmarking and analysis aspects of the DAPHNE project. We aim at generalizing DAPHNE's use case studies and similar applications for designing a new benchmark on integrated data analysis pipelines.

Currently, this dimension is being explored in the bachelor project "End-to-end ML System Benchmarking". Its main objective is to build a platform for benchmarking and analyzing SystemDS and other ML systems, taking into account the different stages of ML workflows, such as data preparation, data cleaning, model training, and inference.
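A minimal sketch of what stage-wise, end-to-end measurement means in practice: each workflow stage is timed separately, so data preparation and cleaning show up in the results alongside training and inference. This is an illustrative harness, not the project's actual benchmarking platform; the stage names follow the list above.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    # Record wall-clock time spent inside the `with` block under `name`.
    start = time.perf_counter()
    yield
    timings[name] = time.perf_counter() - start

data = list(range(100_000))

with stage("data_preparation"):
    prepared = [x / 100_000 for x in data]

with stage("data_cleaning"):
    cleaned = [x for x in prepared if x > 0.0]

with stage("model_training"):          # stand-in for a real training loop
    model = sum(cleaned) / len(cleaned)

with stage("inference"):
    preds = [model for _ in range(10)]

for name, secs in timings.items():
    print(f"{name:>16s}: {secs * 1e3:8.3f} ms")
```

Reporting per-stage times rather than a single end-to-end number makes it visible where a given ML system actually spends its time, which is the comparison such a benchmark is meant to enable.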

Official project website: https://daphne-eu.github.io/