Hasso-Plattner-Institut
Prof. Dr. Tilmann Rabl
 

About the Speaker

Matthias Boehm is a full professor for large-scale data engineering at Technische Universität Berlin and the BIFOLD research center. His cross-organizational research group focuses on high-level, data science-centric abstractions as well as systems and tools to execute these tasks in an efficient and scalable manner. From 2018 through 2022, Matthias was a BMK-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the co-located Know-Center GmbH. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing.
 

About the Talk

The trend towards data-centric AI leads to increasingly complex, composite machine learning (ML) pipelines with outer loops for data integration and cleaning, data programming and augmentation, model and feature selection, hyper-parameter tuning and cross validation, as well as data validation and ML model debugging. Interestingly, state-of-the-art techniques for data integration, cleaning, and augmentation as well as model debugging are often based on machine learning themselves, which motivates their integration into ML systems. In this talk, we make a case for optimizing compiler infrastructure in Apache SystemDS and DAPHNE as two sibling open-source ML systems. We discuss recent feature highlights and how they all fit together. The covered topics range from linear-algebra-based data cleaning pipeline enumeration and slice finding; over lineage-based reuse and workload-aware redundancy exploitation; to federated learning, vectorized execution on heterogeneous HW devices, and extensibility.
 

ML System Infrastructure for Data-centric ML Pipelines

Summary written by Tobias Jordan and Jessica Ziegler

In his talk, Prof. Matthias Boehm presents his research group's work on system infrastructure for data-centric machine learning (ML) pipelines. First, he introduces the composite data-centric ML pipeline. Given that several steps of this pipeline are ML operations themselves, he then shows by example how several activities of the data science lifecycle can be modeled as tensor computations on top of an ML system. Building on this introduction and on data independence, the main goal of his research, he presents five contributions and one current project of his research group. Finally, he outlines the need for holistic redundancy exploitation.

Data-Centric ML Pipeline

A traditional ML pipeline consists of two reusable functions: train() and predict(). Given a dataset X (the input), with labels y (the numerical or categorical answers) and an iterative training procedure, a model is first trained to learn the data characteristics of the given dataset X and the given labels y (model training). Then, the model is able to predict answers for given, potentially unseen data. The accuracy of these predictions can subsequently be evaluated with a percentage score (model scoring).

A small ML pipeline consists of two reusable functions: train() and predict().
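The two functions can be sketched as follows. This is a minimal illustration, not code from the talk: the one-feature least-squares model, the closed-form fit (instead of an iterative procedure), and the R^2 score are all assumptions chosen to keep the example short.

```python
def train(X, y):
    """Fit y ~ w * x + b by ordinary least squares (closed form, one feature)."""
    n = len(X)
    mean_x = sum(X) / n
    mean_y = sum(y) / n
    cov_xy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
    var_x = sum((x - mean_x) ** 2 for x in X)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(X, model):
    """Apply the trained model to given, potentially unseen inputs."""
    w, b = model
    return [w * x + b for x in X]


def score(y_true, y_pred):
    """Model scoring: coefficient of determination R^2 (an assumed metric)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data generated from y = 2x + 1
X_train, y_train = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
model = train(X_train, y_train)
y_hat = predict([5.0, 6.0], model)
```

Everything the following sections add (cleaning, tuning, debugging) wraps around these two calls rather than changing them.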

However, real ML pipelines typically include more functionalities, such as data preparation. A data-centric ML pipeline wraps data-centric functionality as outer loops around the ML pipeline. This composition includes steps before the training as well as post-training activities, which are hierarchically composed as library functions on top of ML systems. Taking data engineering tasks into account, data preparation can even be preceded by additional tasks such as aligning multi-modal data.

In particular, the data-centric ML pipeline follows the premise that a high-quality dataset is of higher importance than small algorithmic details of actual models to be built. Thus, steps before training can include data integration, data cleaning, data programming and data augmentation. Post-training activities include validation of data and models, model debugging, deployment, and scoring, whose results can again be used for another round of model training and scoring.

To summarize, the data-centric pipeline can include the following steps:

  1. Data Engineering: Align multi-modal data. Automatically generate custom I/O handlers for custom data formats. [2]
  2. Data Preparation: Prepare input data by transforming strings and input features into a numerical representation. For efficiency: Apply parallel feature transformations or select the top-k cleaning pipelines.
  3. Data Integration & Data Cleaning: Correctly, completely, and efficiently merge data and content from different, heterogeneous sources into a standardized and structured set of information. [3] Detect and correct errors in data sets. [4]
  4. Data Programming & Augmentation: Automatically annotate data [5] and synthetically generate more labeled data from an already labeled dataset.
  5. Model and Feature Selection: Choose the relevant features to train models and choose the best model out of multiple available models.
  6. Hyper-parameter Tuning & Cross-validation: Find the best parametrization that generalizes well to the validation data at hand.
  7. Model Training: Learn the data characteristics of the given dataset X and the given labels y.
  8. Prediction: Predict answers for given or unseen data.
  9. Data Validation, Model Validation & Debugging: Prove the correctness of the used data and trained model. Detect and remove faults in the ML pipeline under test (e.g. slice finding).
  10. Deployment & Scoring: Ship the pipeline in potentially resource-constrained environments. Evaluate the accuracy of the trained model.
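To illustrate how such steps wrap around training as outer loops, the following sketch nests cross-validation (step 6) inside a grid search over one hyper-parameter. The ridge model without intercept, the fold assignment, and all function names are assumptions made for this example only.

```python
def train(X, y, lam):
    """One-feature ridge regression without intercept: w = <x,y> / (<x,x> + lam)."""
    return sum(x * t for x, t in zip(X, y)) / (sum(x * x for x in X) + lam)


def predict(X, w):
    return [w * x for x in X]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(X, y, lam, k):
    """Outer loop 1: k-fold cross-validation around train() and predict()."""
    errors = []
    for fold in range(k):
        held_out = set(range(fold, len(X), k))  # every k-th row forms the test fold
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [t for i, t in enumerate(y) if i not in held_out]
        X_te = [x for i, x in enumerate(X) if i in held_out]
        y_te = [t for i, t in enumerate(y) if i in held_out]
        w = train(X_tr, y_tr, lam)
        errors.append(mse(y_te, predict(X_te, w)))
    return sum(errors) / k


def tune(X, y, candidates, k=3):
    """Outer loop 2: grid search over the regularization strength."""
    return min(candidates, key=lambda lam: cross_validate(X, y, lam, k))


X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * x for x in X]  # noise-free toy data, so no regularization wins
best_lam = tune(X, y, candidates=[0.0, 1.0, 10.0])
final_model = train(X, y, best_lam)
```

Note that train() runs nine times here on overlapping data (three candidates times three folds), which is the kind of redundancy the reuse techniques discussed later aim to exploit.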

ML Systems

As a key observation, state-of-the-art algorithms for data cleaning and augmentation (in data-centric ML pipelines) can themselves be based on machine learning. Thus, building those primitives on top of an ML system is advantageous.

In a narrow sense, an ML system can be defined as a system running ML algorithms such as classification, clustering, or neural networks. However, in view of the fact that ML systems use compilation techniques, runtime techniques, and hardware accelerators, this definition is too simple. Thus, in a broad sense, the definition of an ML system comprises the range from high-level ML applications down to low-level compiler, runtime, and accelerator strategies.

An ML system comprises the range from high-level ML applications down to low-level compiler, runtime, and accelerator strategies.

Tensor computations

Various elements of the data science lifecycle such as query processing, data science, simulation and sampling can be mapped to linear algebra operations, also called tensor computations.

Advantages

Combined with the usage of ML systems, this offers several advantages:

  • Simplicity: Coarse-grained data structures and operations (such as frame, matrix, and tensor) are used. The complexity of the system infrastructure is reduced.
  • Reuse of compiler/runtime techniques: Instead of specialized systems and algorithms, there are general solutions. Commonly used and optimized compilers and runtime techniques are in place.
  • Performance and scalability: Thanks to the uniformity, parallelization strategies are simple. In addition, specialized, rapidly developing techniques of ML systems such as hardware accelerators and distributed backends can be used.

Query processing, data science, simulation, and sampling can be expressed as tensor computations, profiting from compiler and runtime optimizations. Libraries for tensor computations on specific hardware can be built once and reused.

 

Activities from the data science lifecycle can be cleanly mapped to linear algebra operations. This includes data augmentation, graph processing, ML algorithms, and considerations of fairness and explainability. Two activities are presented in more detail.
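As one possible illustration of the mapping for query processing, a group-by aggregation can be written as a single matrix product with a one-hot encoded grouping column. The table, the column names, and the helper functions below are invented for this sketch; plain Python lists stand in for the matrices of an ML system.

```python
def matmul(A, B):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def one_hot(keys):
    """Encode a categorical column as an n x g indicator matrix."""
    groups = sorted(set(keys))
    return groups, [[1.0 if k == g else 0.0 for g in groups] for k in keys]


# SELECT city, SUM(sales), COUNT(*) FROM t GROUP BY city
city = ["Berlin", "Graz", "Berlin", "Potsdam", "Graz"]
sales = [10.0, 20.0, 30.0, 40.0, 50.0]

groups, G = one_hot(city)
V = [[s, 1.0] for s in sales]   # column 0: values, column 1: ones for COUNT
agg = matmul(transpose(G), V)   # g x 2 matrix holding [SUM, COUNT] per group
result = dict(zip(groups, agg))
```

Once the query has this form, any optimization of the matrix product (sparsity exploitation, parallelization, accelerators) applies to the aggregation as well.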

Top-k cleaning pipelines

Based on a library of robust, parameterized data cleaning primitives (such as imputeFD(double) or mice()), cleaning pipelines composed of these primitives can be generated automatically.

This requires a dataset and a target application that signals the quality of each generated pipeline. Different pipelines of data cleaning primitives can then be enumerated as directed acyclic graphs (DAGs). Using an evolutionary algorithm, the parameters of the target application can be adjusted via hyper-parameter optimization. Out of the generated pipelines, the top-k pipelines are selected for further consideration (fine-tuning, debugging, selection, deployment) by the human in the loop.

Based on a library of parameterized data cleaning primitives, cleaning pipelines can be automatically generated.
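The enumerate-score-select idea can be sketched as follows. This is a deliberately simplified stand-in: it enumerates linear chains exhaustively instead of searching DAGs with an evolutionary algorithm, and the three primitives, the toy column, and the target score are hypothetical.

```python
from itertools import permutations
from statistics import mean, median


def impute_mean(col):
    m = mean(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def impute_median(col):
    m = median(v for v in col if v is not None)
    return [m if v is None else v for v in col]


def clip_outliers(col, factor=3.0):
    """Clip values further than factor * MAD away from the median."""
    present = [v for v in col if v is not None]
    med = median(present)
    mad = median(abs(v - med) for v in present)
    lo, hi = med - factor * mad, med + factor * mad
    return [v if v is None else min(max(v, lo), hi) for v in col]


PRIMITIVES = [impute_mean, impute_median, clip_outliers]  # hypothetical library


def target_score(col, reference_mean=2.0):
    """Assumed target application: how well the mean is estimated."""
    if any(v is None for v in col):
        return float("-inf")  # pipelines that leave missing values are invalid
    return -abs(mean(col) - reference_mean)


def run(pipeline, col):
    for step in pipeline:
        col = step(col)
    return col


def top_k_pipelines(col, k, max_len=2):
    candidates = [p for n in range(1, max_len + 1) for p in permutations(PRIMITIVES, n)]
    scored = [(target_score(run(p, col)), tuple(f.__name__ for f in p)) for p in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


dirty = [1.0, 2.0, None, 3.0, 100.0, 2.0]  # one missing value, one outlier
best = top_k_pipelines(dirty, k=2)
```

On this toy column the order of primitives matters: imputing before clipping scores differently from clipping before imputing, which is why pipelines rather than individual primitives are ranked.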

 

SliceLine for Model Debugging

To assess the accuracy of a model and provide pointers for improvement, the top-k worst slices are to be found, i.e., the subsets of the data on which the model performs worst. A slice is a conjunction of predicates on attributes of the dataset, e.g., degree=PhD AND gender=female. Represented as linear algebra operations on matrices,

  1. the lattice of slice combinations is enumerated level by level from top to bottom, with the number of predicates per slice increasing at each level; the full lattice is exponential in size
  2. non-frequent slices in the lattice are identified and pruned along with all nodes reachable from them.

Encoded as matrices and implemented as matrix multiplication, the full lattice of slice combinations is enumerated and pruned of non-frequent items.
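A two-level version of this procedure might look like the sketch below. It follows the idea of evaluating all candidate slices with one matrix product over one-hot encoded data, but the toy data, the minimum support, and the ranking by average error are assumptions; the actual SliceLine scoring function is not reproduced here.

```python
from itertools import combinations


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


# One-hot encoded dataset: each column is one predicate
predicates = ["degree=PhD", "degree=MSc", "gender=female", "gender=male"]
X = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]
errors = [[0.9], [0.8], [0.1], [0.2], [0.1], [0.1]]  # per-row model error
min_support = 2


def slice_rows(index_sets):
    """Encode each slice as a 0/1 row over the predicate columns."""
    d = len(predicates)
    return [[1 if j in idx else 0 for j in range(d)] for idx in index_sets]


def evaluate(S, level):
    """Return (sizes, total errors) for all slices in S using matrix products."""
    hits = matmul(X, transpose(S))  # number of matching predicates per row and slice
    member = [[1 if v == level else 0 for v in row] for row in hits]
    sizes = [sum(col) for col in zip(*member)]
    errs = [r[0] for r in matmul(transpose(member), errors)]
    return sizes, errs


# Level 1: single predicates; keep only frequent ones
level1 = [(j,) for j in range(len(predicates))]
sizes1, _ = evaluate(slice_rows(level1), level=1)
frequent1 = [s[0] for s, n in zip(level1, sizes1) if n >= min_support]

# Level 2: conjunctions built only from frequent level-1 predicates
level2 = list(combinations(frequent1, 2))
sizes2, errs2 = evaluate(slice_rows(level2), level=2)
frequent2 = [(idx, n, e / n) for idx, n, e in zip(level2, sizes2, errs2) if n >= min_support]

worst = max(frequent2, key=lambda t: t[2])  # highest average error
worst_slice = " AND ".join(predicates[j] for j in worst[0])
```

Contradictory conjunctions such as degree=PhD AND degree=MSc match no rows, so the support threshold removes them without special handling.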

 

System Infrastructure for Data-centric ML Pipelines

Data Independence

(Physical) data independence was already addressed by Codd, who stated that application programs should be independent of growth in data types and changes in data representation. Hellerstein expands on this: Whenever the environment changes more quickly than the applications running on it, a declarative infrastructure should be used so that the system can adapt to the new environment, e.g., by recompiling, without affecting the application. The changing environment refers to four types of variation in data:

  • Data representations: Data can be stored in dense, sparse, or compressed data structures. Sparsity can be exploited by ML pipeline steps ranging from algorithms to hardware.
  • Data placement: Data can be placed locally or in a distributed fashion. A spectrum of hardware accelerators with differing performance, reconfigurability, and energy efficiency can be used.
  • Data (value) types: Different vendors introduced specialized data types, e.g., for floating point representations.
  • Data modalities: An increasing number of modalities (such as text, time series, image, and speech) and their associated data structures have to be dealt with.

Whenever the environment (representation, placement, types, modalities of data) changes more quickly than the application running on it, declarative infrastructure should be used.
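For the first kind of variation, data representation, the principle can be sketched with an abstract matrix type that picks a dense or a sparse (CSR) layout internally while the calling code stays the same. The Matrix class and its sparsity threshold are hypothetical and do not correspond to the API of any real system.

```python
def to_csr(dense):
    """Convert a dense row-major matrix into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def matvec_dense(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matvec_csr(csr, x):
    """Matrix-vector product that touches only the stored non-zero values."""
    values, col_idx, row_ptr = csr
    return [
        sum(values[p] * x[col_idx[p]] for p in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


class Matrix:
    """Hypothetical abstract matrix that hides its physical representation."""

    SPARSITY_THRESHOLD = 0.4  # assumed cut-off, not taken from any real system

    def __init__(self, dense):
        nnz = sum(1 for row in dense for v in row if v != 0)
        cells = len(dense) * len(dense[0])
        self.is_sparse = nnz / cells < self.SPARSITY_THRESHOLD
        self.data = to_csr(dense) if self.is_sparse else dense

    def matvec(self, x):
        if self.is_sparse:
            return matvec_csr(self.data, x)
        return matvec_dense(self.data, x)


# The "application" is identical for both inputs; only the representation differs.
A = Matrix([[0, 0, 3], [0, 0, 0], [1, 0, 0]])
B = Matrix([[1, 2], [3, 4]])
result_a = A.matvec([1, 2, 3])
result_b = B.matvec([1, 1])
```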

1. Apache SystemDS

Apache SystemDS serves as an example of a data-independent machine-learning system. The system embraces changing backends, such as transitioning from MapReduce to Spark, by implementing algorithms in an implementation-agnostic manner. This adaptability is facilitated by a domain-specific language (DML) with an R-like syntax and abstract data types, shielding algorithms from the impact of backend changes. Additionally, the system features an optimizing compiler that generates hybrid runtime plans, seamlessly combining local and distributed operations for optimal performance. SystemDS extends its versatility by supporting various backends like Flink, GPUs, and emerging federated learning.

The optimizing compiler translates hardware-independent DML scripts to hybrid query plans in a “Write once, run anywhere” fashion.
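How a compiler might arrive at a hybrid plan can be sketched with a simple rule: estimate the memory each operation needs and run it locally only if the estimate fits a budget. The rule, the Op type, the labels LOCAL and DISTRIBUTED, and the 8 GiB budget are assumptions for illustration and not the actual SystemDS cost model.

```python
from dataclasses import dataclass

BYTES_PER_CELL = 8  # dense double-precision values


@dataclass
class Op:
    name: str
    input_shapes: list   # list of (rows, cols) tuples
    output_shape: tuple  # (rows, cols)


def memory_estimate(op):
    """Worst-case dense size of all inputs plus the output, in bytes."""
    cells = sum(r * c for r, c in op.input_shapes)
    cells += op.output_shape[0] * op.output_shape[1]
    return cells * BYTES_PER_CELL


def compile_plan(ops, local_budget_bytes):
    """Assign each operation to local or distributed execution (assumed rule)."""
    return [
        (op.name, "LOCAL" if memory_estimate(op) <= local_budget_bytes else "DISTRIBUTED")
        for op in ops
    ]


# Linear regression via normal equations on a 10^8 x 100 feature matrix X
n, d = 10**8, 100
script = [
    Op("t(X) %*% X", [(n, d)], (d, d)),
    Op("t(X) %*% y", [(n, d), (n, 1)], (d, 1)),
    Op("solve(A, b)", [(d, d), (d, 1)], (d, 1)),
]
plan = compile_plan(script, local_budget_bytes=8 * 1024**3)
```

The script itself never mentions a backend. With a smaller X or a larger budget, the same three lines would compile to a purely local plan.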

 

SystemDS was forked from IBM's open-sourced SystemML, which went through the Apache Incubator and became an Apache Top-Level Project. The fork was rescoped to meet the requirements of the end-to-end data science lifecycle and was later invited to be merged back into SystemML, which was then renamed to SystemDS. Serving as an umbrella project, SystemDS integrates research efforts. It introduces built-in functions through its domain-specific language, offering users higher-level abstractions and routing to superior algorithms, while the compiler collapses these abstractions into efficient execution plans.

2. Multi-level Lineage Tracing & Reuse

To enable multi-level lineage tracing and reuse in data-centric ML pipelines, the framework traces lineage dynamically during the execution of individual operations, capturing how intermediate results were computed, even in the presence of non-determinism. A lineage graph is constructed for every live variable and records the sequence of operations that produced it. The lineage graph can be serialized into a log, which can be shared among team members and simplifies debugging because it captures the entire history of computations leading to a specific result. The serialized lineage graph can also be deserialized and used to reconstruct a program that yields the same intermediates for the same inputs, including stored seeds, which fosters reproducibility. This approach not only enables full reuse of intermediates, eliminating redundant computation, but also allows for partial reuse. Partial reuse is particularly valuable when a larger model has already been trained for some features: additional features can then be trained without recomputing the entire model.
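Full reuse can be sketched with a cache that is keyed by the lineage of a value rather than by the value itself. The classes and function names below are hypothetical, inputs are identified by name (which assumes that a name uniquely identifies its data), and non-determinism, seeds, and partial reuse are not covered.

```python
class Lineage:
    """Lineage item: an opcode plus the lineage of its inputs (a DAG node)."""

    def __init__(self, opcode, inputs=(), literal=None):
        self.opcode, self.inputs, self.literal = opcode, tuple(inputs), literal

    def key(self):
        """Serialize the lineage DAG rooted at this node into a hashable trace."""
        if self.opcode == "input":
            return ("input", self.literal)
        return (self.opcode,) + tuple(i.key() for i in self.inputs)


class Traced:
    """A value together with the lineage that describes how it was computed."""

    def __init__(self, value, lineage):
        self.value, self.lineage = value, lineage


CACHE = {}
STATS = {"computed": 0, "reused": 0}


def read(name, value):
    return Traced(value, Lineage("input", literal=name))


def execute(opcode, fn, *args):
    """Run fn unless an intermediate with identical lineage is already cached."""
    lineage = Lineage(opcode, [a.lineage for a in args])
    key = lineage.key()
    if key in CACHE:
        STATS["reused"] += 1
    else:
        STATS["computed"] += 1
        CACHE[key] = fn(*[a.value for a in args])
    return Traced(CACHE[key], lineage)


X = read("X", [1.0, 2.0, 3.0])
y = read("y", [2.0, 4.0, 6.0])


def ridge(lam_name, lam_value):
    lam = read(lam_name, lam_value)
    xx = execute("sum_sq", lambda v: sum(a * a for a in v), X)
    xy = execute("dot", lambda a, b: sum(p * q for p, q in zip(a, b)), X, y)
    denom = execute("add", lambda a, b: a + b, xx, lam)
    return execute("div", lambda a, b: a / b, xy, denom)


w1 = ridge("lam=0", 0.0)    # computes all four operations
w2 = ridge("lam=14", 14.0)  # reuses sum_sq and dot, recomputes add and div
```

The key of w2.lineage is also a complete, serializable description of how w2 was derived, which is what makes the same structure usable for debugging and reproducibility.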

3. Compressed Linear Algebra Extended

Compressed Linear Algebra Extended plays an essential role in data-centric ML pipelines. By introducing a lossless compression framework across multiple tiers, it addresses the challenge of fitting large datasets into memory. The framework covers single-node configurations, distributed clusters with distributed caching, and hardware accelerators. Notably, the system executes linear algebra operations directly on the compressed data representation, similar to query processing on compressed data structures. To exploit redundancy, it targets both data redundancy, e.g., a small number of distinct values, and structural redundancy, e.g., recurring patterns within the data. Operating in a workload-aware manner, the system chooses its compression strategy based on the linear algebra primitives executed by the user script. When data fits into memory, a lighter compression is applied to speed up computation, while stronger compression is used when data exceeds cache capacities in order to improve caching efficiency. An example of such a compression scheme is dense dictionary coding, which stores each distinct value once in a dictionary and replaces the values themselves with compact codes that reference it.
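Dictionary coding and operations on the compressed form can be sketched for single columns as follows. The DDCColumn class is a hypothetical, heavily simplified stand-in (one column per group, Python lists instead of packed code arrays) and not the SystemDS implementation.

```python
class DDCColumn:
    """Dense dictionary coding of one column: distinct values plus one code per row."""

    def __init__(self, dictionary, codes):
        self.dictionary, self.codes = dictionary, codes

    @classmethod
    def compress(cls, values):
        dictionary = sorted(set(values))
        index = {v: i for i, v in enumerate(dictionary)}
        return cls(dictionary, [index[v] for v in values])

    def decompress(self):
        return [self.dictionary[c] for c in self.codes]

    def scale(self, factor):
        """Scalar multiplication touches only the dictionary; codes are shared."""
        return DDCColumn([factor * v for v in self.dictionary], self.codes)

    def total(self):
        """Column sum via per-code counts: one multiplication per distinct value."""
        counts = [0] * len(self.dictionary)
        for c in self.codes:
            counts[c] += 1
        return sum(n * v for n, v in zip(counts, self.dictionary))


def matvec(columns, v):
    """Compute X @ v where X is stored as a list of compressed columns."""
    out = [0.0] * len(columns[0].codes)
    for col, weight in zip(columns, v):
        scaled = [weight * d for d in col.dictionary]  # pre-scale distinct values once
        for i, c in enumerate(col.codes):
            out[i] += scaled[c]  # lookup and add instead of multiply per cell
    return out


raw_a = [7.0, 7.0, 3.0, 7.0, 3.0, 3.0]
raw_b = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
col_a, col_b = DDCColumn.compress(raw_a), DDCColumn.compress(raw_b)

product = matvec([col_a, col_b], [2.0, 10.0])
doubled = col_a.scale(2.0)
```

None of the three operations decompresses a column first, and their cost depends in part on the number of distinct values rather than only on the number of rows.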

4. Federated Learning in SystemDS

Federated Learning in SystemDS emerged as a collaborative project involving Siemens, DFKI, TU Berlin, and other contributors. It enables the training of machine learning models and complete end-to-end data science pipelines on federated data. In this paradigm, data remains with its owners, preserving data ownership and privacy. Matrices are represented as federated metadata objects that map slices of a matrix to different nodes. The implementation allows iterative machine learning algorithms to run seamlessly. When operations are executed on these federated metadata objects, subqueries are issued to the corresponding servers and their federated workers. The workers execute the computations on their local data and return intermediates, which can be retained for subsequent operations.

The coordinator (e.g., a laptop) executes computation on federated metadata objects, initiating workers on corresponding nodes.
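The interaction between coordinator and workers can be sketched for linear regression via normal equations. In this illustration, in-process Python objects stand in for remote workers, so there is no networking and no actual privacy enforcement; the class names, the row ranges, and the two-column example are assumptions.

```python
class Worker:
    """Holds a private row partition; only small aggregates leave the site."""

    def __init__(self, X, y):
        self._X, self._y = X, y

    def gram(self):
        """Local t(X) %*% X, a d x d matrix."""
        d = len(self._X[0])
        return [[sum(r[i] * r[j] for r in self._X) for j in range(d)] for i in range(d)]

    def xty(self):
        """Local t(X) %*% y, a vector of length d."""
        d = len(self._X[0])
        return [sum(r[i] * t for r, t in zip(self._X, self._y)) for i in range(d)]


class FederatedMatrix:
    """Coordinator-side metadata object that maps row ranges to workers."""

    def __init__(self, parts):
        self.parts = parts  # list of ((row_begin, row_end), worker)

    def gram(self):
        partials = [worker.gram() for _, worker in self.parts]
        return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

    def xty(self):
        partials = [worker.xty() for _, worker in self.parts]
        return [sum(vals) for vals in zip(*partials)]


def solve_2x2(A, b):
    """Cramer's rule; sufficient for the two-parameter example below."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (b[0] * A[1][1] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - b[0] * A[1][0]) / det,
    ]


# Two sites hold disjoint rows of the same table; columns are [intercept, x], y = 1 + 2x
site_a = Worker([[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0])
site_b = Worker([[1.0, 2.0], [1.0, 3.0]], [5.0, 7.0])
fed_X = FederatedMatrix([((0, 2), site_a), ((2, 4), site_b)])

weights = solve_2x2(fed_X.gram(), fed_X.xty())
```

The coordinator only ever sees the 2 x 2 and 2 x 1 aggregates, whose size is independent of the number of rows held at each site.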

The system supports federated hierarchies, for example, allowing federated objects to point to production sites that point to other entities, creating a hierarchical structure inspired by organizational setups, as seen at Siemens. A diverse range of experiments has been conducted involving traditional machine learning tasks like regression, classification, clustering, and dimensionality reduction, as well as advanced techniques such as feedforward and convolutional neural networks. Tests extended to both local-area networks (LAN) and wide-area networks (WAN), with distances reaching up to 1000 km. Local execution serves as the baseline, revealing that while one worker incurs overhead compared to the baseline, the federated setting outperforms the baseline due to increased compute resources. Notably, the system performs favorably compared to popular tools like scikit-learn and TensorFlow.

 

Crucially, leveraging a mapping to linear algebra, the federated environment extends its capabilities to include data-cleaning pipelines with minimal effort, showcasing the flexibility and efficiency of the Federated Learning in SystemDS paradigm.

5. Fine-grained Device Placement in DAPHNE

The DAPHNE project is a collaborative effort involving numerous partners across Europe. It focuses on building an open and extensible system infrastructure for integrated data analysis pipelines. These pipelines are complex and combine data management, query processing, high-performance computing (HPC), simulations, and the training and scoring of multiple machine learning models.

DAPHNE's vectorized execution engine fuses sequential operations into a single pipeline, reducing overhead.

 

At the heart of this initiative is a vectorized execution engine that employs operator fusion, combining operations into pipelines rather than executing them one after another. This fusion reduces overhead because large intermediate matrices no longer need to be materialized, which minimizes reads and writes to memory. A distinctive feature of this execution engine is its ability to select the granularity at which data passes through the pipelines. This flexibility allows a single data object to be partitioned across multiple devices such as CPUs, GPUs, and FPGAs. Notably, the abstractions derived from federated learning also find application here, providing control over the allocation of data partitions to specific devices.
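The difference between operator-at-a-time and fused, tile-wise execution can be sketched on a small squared-error computation. The function names, the tile size, and the round-robin device assignment are assumptions for illustration; the device names are plain labels, and everything runs on the CPU.

```python
def unfused(x, y, w):
    """Operator-at-a-time: every step materializes a full-length intermediate."""
    pred = [w * v for v in x]
    diff = [p - t for p, t in zip(pred, y)]
    squared = [d * d for d in diff]
    return sum(squared)


def fused(x, y, w, tile_size, devices=("CPU", "GPU")):
    """Fused pipeline: each tile flows through all operators without intermediates."""
    total = 0.0
    placement = []
    for tile_no, start in enumerate(range(0, len(x), tile_size)):
        x_tile = x[start:start + tile_size]
        y_tile = y[start:start + tile_size]
        # Hypothetical placement decision: tiles are assigned round-robin.
        placement.append((start, devices[tile_no % len(devices)]))
        total += sum((w * a - b) ** 2 for a, b in zip(x_tile, y_tile))
    return total, placement


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 11.0]
loss_unfused = unfused(x, y, w=2.0)
loss_fused, placement = fused(x, y, w=2.0, tile_size=2)
```

The tile size is the granularity knob described above: larger tiles amortize scheduling overhead, smaller tiles give more freedom in spreading one data object over several devices.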

LAURYN: Towards Holistic Redundancy Exploitation

Despite its rejection as an ERC consolidator grant proposal in 2023, LAURYN remains an essential project for the TU Berlin Data Science Laboratory (DAMS Lab), which is committed to refining its concepts. The overarching motivation comes from the fact that diverse redundancy-exploiting techniques exist for data-centric machine learning pipelines. These techniques include resource allocation and elasticity, data sampling and composition, sparsity exploitation, lossy and lossless compression, weight pruning, and connection sampling. Each area presents formidable challenges and is typically tackled independently, involving extensive trial and error.

 

The proposed LAURYN project aims to transcend this fragmented approach by jointly incorporating multiple strategies into a unified framework. It addresses the complexities of resource allocation and elasticity, leveraging sampling, data distillation, augmentation-as-a-kernel, sparsity exploitation, and both lossy and lossless compression. Its innovative approach involves making lossy decisions of strategies and seamlessly integrating them into the pipeline training process. It automates the application of lossless approaches, such as sparsity exploitation and compression, at the system level.