Reliable Distributed Systems Engineering

Computer systems up until the turn of the century became constantly faster without any particular effort simply because the hardware they were running on increased its clock speed with every new release. But this free lunch is over! Today's CPUs stall at around 3 GHz and software developers need to break new grounds to make their products faster. The most popular approach for this is to design software with parallelization and distributed computing in mind, because the number of computing elements (transistors, cores, CPUs, GPUs, cluster nodes etc.) in modern computer systems still increases constantly. This paradigm shift in software development, however, introduces various challenges:

How do we change a particular algorithm to efficiently execute on various, possibly independent and heterogeneous computing elements?
How do we utilize all available resources in an optimal way?
How does the algorithm deal with the increased error susceptibility of a parallel/distributed system?
How can the algorithm support elasticity, i.e., sets of computing resources that change at runtime?
How does a parallel/distributed system control its resource consumption in terms of overall memory and CPU usage?
How do we start and terminate such systems?
How do we debug, monitor, and profile them?

Certain frameworks for parallel/distributed programming, such as Spark, Flink, and Storm, solve a couple of these questions already, but they enforce a certain programming model that does not fit for all algorithmic tasks. Other frameworks, such as Akka and Orleans, leave these questions to the programmer but also offer much more flexibility for the algorithmic design.

In this seminar, we investigate four specific algorithmic tasks that have no trivial distributed solutions. Each of these tasks is assigned to a team of two students, who will design a novel, distributed algorithm for it. These algorithms should be implemented using the actor programming model in Akka, Orleans, or Erlang and they should be able to achieve the following tasks:

Optimize resource utilization: All nodes should contribute to the overall task. Situations where cluster nodes are idle while others are not should be avoided.
(work scheduling, load balancing, cluster metrics, ...)
Tolerate node failures: Failing nodes should not cause the computation to fail. Instead, their subtasks should be re-scheduled and, if needed, their states be recovered.
(reliable message sending, master-worker pattern, consensus techniques, ...)
Enable cluster growth: Nodes should be able to enter the cluster and the algorithmic computations. An already running algorithm should dynamically provide them with work.
(dynamic membership management, work stealing, work scheduling, ...)
Resolve memory overflows: Memory issues should be dynamically detected and resolved. Applying drastical measures to avoid OOM situations is allowed.
(disk spilling, result pruning, dynamic compression, ...)
Start and stop in a clean way: The distributed processing should be easy to start. When it ends, all nodes shutdown cleanly and the result is stored safely.
(scripted startup, reaper pattern, coordinated shutdown, ...)

The individual implementations should be evaluated w.r.t. their performance, elasticity, robustness, and scalability. The results of this project, i.e., the algorithm descriptions and experimental evaluation should be written down in a paper style seminar report by each team.

Project Results

Prerequisites

For this seminar, participants require the following prerequisites (or must catch up on these at the beginning of the seminar):

Database knowledge (ideally Database System I and Database Systems II)
Actor programming knowledge (ideally Distributed Data Analytics or Distributed Data Management)

Organisation

Extent: 4 SWS
Location: F-E.06, Campus II
Dates: Tuesdays, 09:15 - 10:45 AM
Maximal 8 participants (4 teams á 2 students)
The first date serves as an introduction to the topic and the seminar. Subsequently, you can register for the course through an informal email by April 12 to Thorsten Papenbrock. In case of more than eight registrations, we have to pick slots randomly.
Your tasks
- (10%) Active participation at all meetings
- (20%) Implementation of a distributed algorithm for your team-specific challenge
- (10%) Evaluation of scalability and robustness of your distributed algorithm
- (40%) A midterm and a final presentation
- (20%) A written summary/documentation of your distributed algorithm
  (2-4 pages per person according to ACM template)

Time Table

Date	Topic
09. April	Introduction: Topic Presentation
16. April	Kick-off: Topic Selection & Team Building
23. April	Pitches: Mini Topic and Approach Presentations & Discussion
30. April	Start of by-weekly meeting process
11. June	Intermediate Presentations
09. July	Final Presentations