Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Description

In this project seminar, we develop a prototype system that runs end-to-end machine learning pipelines on clusters of edge devices with limited compute power to encourage sustainable hardware usage.

IMPORTANT NOTE: Due to the current COVID-19 situation, we need to start this seminar in online mode. This means that we use Jitsi web meetings for onboarding and our regular group sessions. Please find the details about our organization below.

Background

Artificial intelligence plays an increasingly important role in industry. Many successful modern companies draw their strength from data analytics and machine learning to optimize their business processes, risk management, and decision-making. The topic has become so relevant that it is now a political concern and the subject of various national AI strategies. Large companies already leverage AI or have the resources to do so. Small and medium-sized companies, on the other hand, cannot afford powerful servers or expensive GPU clusters for data analytics and are often reluctant to use cloud services for data protection and privacy reasons. However, in our experience, they often have spare, low-spec hardware at their disposal, such as disused computers, laptops, or phones. In this project, we develop a distributed system that runs an entire data analytics pipeline, including data preparation, feature extraction, and model learning, on such devices. Edge devices in the context of our project are low-spec, potentially aged commodity computers and smartphones as well as resource-saving low-energy devices. The resulting prototype should enable small companies to use their spare hardware for state-of-the-art machine learning and data analytics. The project should also encourage more sustainable hardware management, where functioning hardware is not needlessly replaced with more powerful hardware just for the sake of analytical experiments.

Machine Learning

A machine learning model and, hence, artificial intelligence is the product of a multi-step data engineering and training process. Each step usually requires a significant amount of resources, i.e., CPU power and memory. To execute this process without high-performance hardware, we need to find resource-saving solutions for each of the following steps:

  1. Data Preparation & Cleaning
    Machine learning is a process that follows the GIGO principle: “Garbage In, Garbage Out”. If the training data is flawed, the derived model will probably be flawed as well. For this reason, the first step is to automatically repair structural integrity, i.e., table shape, data types, and value formats (data preparation), and to resolve data quality issues, such as duplicates and missing values (data cleaning). Some of these operations, such as duplicate detection, are complex tasks (in O(n^2)) and, therefore, a challenge for edge device hardware.
  2. Data Profiling & Analysis
    Feature selection and feature generation are two important preparation steps for machine learning. In many cases, the raw data does not represent these features explicitly, so they need to be extracted from it. We therefore discover implicit metadata, such as functional dependencies and value constraints (data profiling), and supplementary statistics, such as aggregates and histograms (data analysis), that may serve as features in subsequent training steps. Many data profiling tasks are in O(2^n), i.e., exponential, and thus particularly challenging for edge device hardware.
  3. Data Classification & Prediction
    The actual training process is the third and last step of the pipeline. Here, different machine learning models, such as Logistic Regression, Random Forests, Support Vector Machines, or Naive Bayes, can be trained for different purposes, such as classification or prediction. All these training processes are data-intensive and, therefore, have high resource requirements. Fitting them onto low-spec hardware is a challenge and might require alternative training approaches.
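
The complexity remarks in steps 1 and 2 can be made concrete with a small sketch. It is only an illustration under our own assumptions, not the seminar's prototype: the function names, the blocking key, and the toy data are hypothetical. The first part contrasts all-pairs duplicate detection with a simple blocking strategy that compares only records sharing a key; the second part checks a single functional dependency and enumerates the candidate left-hand sides, whose number grows exponentially with the number of attributes.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Quadratic baseline: every record is compared with every other one."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Compare only records that share a blocking key.

    `key` is a hypothetical, user-chosen function; a poor choice can
    miss true duplicates that land in different blocks.
    """
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[key(record)].append(index)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def fd_holds(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        left = tuple(row[attr] for attr in lhs)
        # Remember the first rhs value per lhs value; any deviation violates the FD.
        if seen.setdefault(left, row[rhs]) != row[rhs]:
            return False
    return True


def candidate_lhs(attributes):
    """All non-empty attribute subsets: 2^n - 1 candidates for n attributes."""
    return [
        subset
        for size in range(1, len(attributes) + 1)
        for subset in combinations(attributes, size)
    ]


# Toy data, invented for this illustration.
names = ["Anna Schmidt", "Anna Schmitt", "Bernd Maier", "Bernd Meier", "Clara Wolf"]
rows = [
    {"zip": "14482", "city": "Potsdam", "name": "A"},
    {"zip": "14482", "city": "Potsdam", "name": "B"},
    {"zip": "10115", "city": "Berlin", "name": "A"},
]

baseline = all_pairs(names)
blocked = blocked_pairs(names, key=lambda name: name[0].lower())
```

On this toy input, blocking by first letter shrinks the candidate set considerably, at the cost of possibly missing duplicates whose keys differ. Real pipelines would need more robust keys and a similarity measure for the remaining pairs.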

Project Description

For most of the tasks in the multi-step training process, parallel and distributed algorithms exist. These algorithms are, however, usually optimized for powerful systems and struggle with low-spec hardware, i.e., they cannot cope with heterogeneity, starve on slow processors, and quickly exhaust the available main memory. We therefore aim to find solutions for these tasks that are more resource-aware and, hence, run on heterogeneous edge device clusters. For this, we exploit reactive programming approaches that dynamically adapt to resource bottlenecks and special data characteristics. Reactive strategies can dynamically change the data engineering and training processes based on intermediate results, and they can be used to perform hyperparameter tuning at runtime. We also look into data compression and summarization techniques to fit our processes onto weaker systems, and we use instance selection and approximation techniques to cope with very large input datasets. The result will be a working end-to-end model training prototype that runs in a distributed fashion on edge device clusters. We measure its execution time, evaluate how far we can reduce the resource consumption, and assess the quality of the trained models.
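
As one possible illustration of instance selection under tight memory limits, the following sketch uses reservoir sampling (Algorithm R) to keep a uniform random sample of fixed size from a dataset that is read only once. This is a standard technique that we picked as an example, not necessarily the one used in the project; the function name and the parameters `k` and `seed` are our own assumptions.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of at most k items from an iterable.

    Memory usage is O(k), independent of the stream length, which is
    the property that matters on devices with little RAM.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds are inclusive
            if j < k:
                sample[j] = item
    return sample


# Usage: sample 100 instances from a (simulated) stream of 10,000 rows.
selected = reservoir_sample(range(10_000), k=100)
short = reservoir_sample(range(5), k=100)
```

A model trained on such a sample is an approximation, so the quality of the trained model would have to be evaluated against a run on the full data.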

Hardware

The project is planned for a duration of one semester and up to eight participants. For the development and evaluation of the planned machine learning prototype for edge devices, we use three systems:

  1. Server cluster
    The server cluster is a traditional, homogeneous server cluster as used by many research institutes. Ours has 12 nodes with 10 physical cores and 32 GB RAM each. This cluster will produce baseline measurements against which we can compare the results produced on the low-spec clusters.
  2. Commodity cluster
    The commodity cluster is a heterogeneous cluster of (currently 8) decommissioned desktop computers. The machines are about 5-15 years old and have dual-core to quad-core CPUs as well as 2-6 GB of RAM. The commodity cluster represents exactly the target system of our use case, namely spare hardware that is usually disposed of.
  3. Raspberry Pi cluster
    The Raspberry Pi cluster is composed of 12 Raspberry Pi 4 Model B devices. The Pis have dual cores and 4 GB RAM. Every Pi is extremely energy efficient and can represent the capabilities of a typical edge device in our experiments. The Pi cluster is used to systematically evaluate the planned prototype on truly low-spec hardware without heterogeneity biases.

Techniques

We utilize reactive programming (as implemented by the Orleans and Akka libraries) and state-of-the-art machine learning tools (such as PyTorch or SparkML). We also consider standard techniques for data partitioning, workload distribution, dynamic hyperparameter tuning, federated learning, parameter servers, and distributed system development.
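
To give a flavor of data partitioning on a heterogeneous cluster, the sketch below splits a dataset across nodes in proportion to a capacity score per node. It is a minimal example under our own assumptions: the node names, the capacity scores, and the function `proportional_shares` are hypothetical, and a reactive system would presumably adjust such shares at runtime instead of fixing them up front.

```python
def proportional_shares(total_rows, capacities):
    """Assign row counts to nodes in proportion to their capacity scores.

    Uses the largest-remainder method so that the shares are integers
    and always add up to `total_rows`.
    """
    total_capacity = sum(capacities.values())
    exact = {
        node: total_rows * capacity / total_capacity
        for node, capacity in capacities.items()
    }
    shares = {node: int(value) for node, value in exact.items()}
    leftover = total_rows - sum(shares.values())
    # Hand the remaining rows to the nodes with the largest fractional parts.
    by_remainder = sorted(
        capacities, key=lambda node: exact[node] - shares[node], reverse=True
    )
    for node in by_remainder[:leftover]:
        shares[node] += 1
    return shares


# Hypothetical capacity scores, e.g. derived from core count and RAM.
cluster = {"desktop-a": 4, "desktop-b": 2, "pi-1": 1, "pi-2": 1}
plan = proportional_shares(1000, cluster)
uneven = proportional_shares(10, {"a": 1, "b": 1, "c": 1})
```

A static split like this ignores stragglers and changing load; it mainly shows why equal-sized partitions are a poor fit when node capabilities differ.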

Goals

The goal of this project is to develop a prototype that runs end-to-end machine learning pipelines on edge devices. Ideally, each team of two students contributes a set of modules to this pipeline. We summarize our results in the form of a scientific paper, “Sustainable Machine Learning on Edge Device Clusters”, to which each team contributes about three to four pages about their modules and evaluations. The development of a single prototype requires regular meetings and close collaboration among all teams. We will have weekly meetings and at least two larger presentations to share our progress, solutions, and potential obstacles.

Organization

The organizational details for the seminar are as follows:

  • Project seminar for master students
  • 6 credit points, 4 SWS
  • Weekly meetings
  • Supervisors: Phillip Wenig and Thorsten Papenbrock
  • Time: Tuesdays, 9:15 - 10:45 AM
  • Location: F-2.10, Building F, 2nd Floor, Campus II
    During the COVID-19 lockdown, these meetings will take place online!

Please consider this page as an introduction to the topic and the seminar. Please send an informal email to Thorsten Papenbrock by Tuesday 28.04.2020 to register for the course. The email should include the topic(s) of the course that you are interested in (1., 2. and/or 3.), the students that you would like to team up with in case you want to join as a team (both of you need to register), and any prior knowledge that is relevant to this course (e.g., HPI courses in the data engineering or machine learning area). We will let you know as soon as possible whether you can officially register for the course (on Tuesday 21.04.2020 at the latest). Note that because the semester starts very quickly and without a proper introductory week, we might need to choose the up to eight participants on a first-come, first-served basis if we receive more than eight applications.

To start the project, our first meeting will be on Tuesday, 28.04.2020, 9:15 - 10:45 AM via Jitsi Meet. We will send the conference link to all accepted participants via email on the same day before the meeting starts. In this online session, we answer first questions, discuss the organization of the seminar, and assign tasks and teams. Please check prior to our first meeting that you can run Jitsi in your web browser and that you have a working microphone. It would also be great if you had a webcam so that we can see each other.

The grading will be based on the following tasks:

  • (10%) Active participation during all seminar events.
  • (60%) Research and development success w.r.t. your pipeline modules including:
    • (20%) Implementation
    • (20%) Evaluation
    • (20%) Paper writing (~1.5 pages per person)
  • (30%) Presentations including:
    • (15%) Midterm presentation
    • (15%) Final presentation