Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Distributed Data Management

Description

The free lunch is over! Computer systems up until the turn of the century became constantly faster without any particular effort simply because the hardware they were running on increased its clock speed with every new release. This trend has changed and today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of contained transistors (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their given tasks in parallel in order to expect noticeable performance gains. Distributed computing, i.e., the distribution of work on (potentially) physically isolated compute nodes is the most extreme method of parallelization.

Big data analytics and management are a multi-million dollar markets that grow constantly! The ability to control and utilize large amounts of data is the most valuable ability of today's computer systems. Because data volumes grow so rapidly and with them the complexity of questions they should answer, data analytics, i.e., the ability of extracting any kind of information from the data becomes increasingly difficult. As data analytics systems cannot hope for their hardware getting any faster to cope with performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We start by discussing fundamental concepts in distributed computing, such das data models, encoding formats, messaging, data replication and partitioning, fault tollerance, and batch- and stream processing. In between, we consider different practical systems from the Big Data Landscape, such as Akka and Spark. In the end, we concentrate on data management aspects, such as distributed database management system architectures and distributed query optimization.

Prerequisites

To take this class, a basic understanding of object-oriented programming, data structures and relational databases is required. If you cannot programm fluently in at least one object-oriented language (e.g. Java, C#, C++, Python, Ruby etc.) taking this class will be hard.

During the exercises, participants will need to program in Java and Scala. In preparation of the course, we therefore recomment to get familiar with these two languages or refresh your knowledge. We expect no deep knowledge, but you should know the basic language constructs and be able to create, for instance, a Maven Java programm and an SBT Scala programm that both read a file and count all words in these two files.

Attendence

Due to the ongoing COVID-19 situation, the Distributed Data Management lecture will be streamed live and recorded on tele-TASK. We highly recommend to attend the live sessions to ask questions and actively contribute to the course. Once the registration periode is over, though, we will also create a mailing list with all participants. Please use the list to actively ask and answer questions.

The DDM lecture is also hosted in the HPI Moodle System. Please also register for the course in Moodle so you can use the forum to find an exercise partner.

The recording will be available on TeleTask. The live stream can be joined with the following two links:

 Montags: (link)Mittwochs: (link)
Time:11:00 - 12:3013:30 - 15:00
Meeting ID:929 1188 0838961 8615 5958
Entrance code:791970491884

Schedule

The slides of the lectures will be posted in the schedule below. Please check the schedule regularly. If a lecture needs to be cancled, it will be marked in the list below. We will also have two sessions, in which we discuss your homework submissions. For these sessions, it will be necessary that at least one person of each homework-team is present either in person or online to present your solution. So please make sure that you don't miss these dates!

DateSubject
12.04.2021Introduction
14.04.2021Foundations
19.04.2021Encoding
21.04.2021Communication
26.04.2021Communication
28.04.2021Communication
03.05.2021Hands-on Akka Actor Programming
05.05.2021Hands-on Akka Actor Programming
10.05.2021<cancled>
12.05.2021Hands-on Akka Actor Programming
17.05.2021Data Models and Query Languages
19.05.2021Storage and Retrieval
24.05.2021<public holiday>
26.05.2021Replication <at 9:15 !>
31.05.2021Partitioning
02.06.2021Distributed Systems
07.06.2021Distributed Systems
09.06.2021Consistency and Consensus
14.06.2021Transactions
16.06.2021Batch Processing
21.06.2021Batch Processing
23.06.2021Batch Processing
28.06.2021Hands-on Spark Batch Processing (tpch.zip)
30.06.2021Hands-on Spark Batch Processing (tpch.zip)
05.07.2021Homework Evaluation Akka
07.07.2021Stream Processing
12.07.2021Stream Processing
14.07.2021Distributed Database Management Systems
19.07.2021Distributed Query Optimization
21.07.2021Homework Evaluation Spark and Exam Preparation

Exam

The final grade will be determined in a written exam. The exam will take place on 26.07.2021 in the time from 13:00 to 16:00 in all three HPI lecture halls. Note that the precise exam planning needs to wait until we have a clearer picture of the corona situation in July. If the situation remains tense, the exam might need to be written in multiple rounds or we write an open-book-exam, which everyone can take at home.

The prerequisite for admission to the exam is the successful completion of all exercises! An exercise is completed by implementing and, then, submitting a functioning algorithm that solves a given task within given rules; an ecercise is failed by not submitting a solution, disobeying the rules, or clearly not solving the task; a sub-optimal performance and small programming mistakes do not let an exercise fail.

To practice for the exam, you can use the exams of WS_2017/18 (solution), WS_2018/19 (solution), and WS_2019/20 (solution). You can also find check-yourself questions in the slides (solution). This is the exam that you took at the end of this course SS_2021 (solution)

Literature

Course books:

  • Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Martin Kleppmann, 2017, 978-1449373320
  • Distributed Systems, Maarten van Steen and Andrew S. Tanenbaum, 2017, 978-1543057386
  • Principles of Distributed Database Systems, M. Tamer Özsu and Patrick Valduriez, 2011, 978-1441988331

Further reading:

  • Web-Scale Data Management for the Cloud, Wolfgang Lehner and Kai-Uwe Sattler, 2013, 1489997717
  • Introduction to Parallel Computing, Zbigniew J. Czech, 2017, 978-1107174399
  • Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services, Brendan Burns, 2017, 978-1491983645
  • Spark: Big Data Cluster Computing in Production, Ilya Ganelin and Ema Orhian and Kai Sasaki and Brennon York, 2016, 978-1119254010
  • Reactive Messaging Patterns with the Actor Model, Vaughn Vernon, 2015, 978-0133846836
  • Mining Massive Datasets, Jure Leskovec and Anand Rajaraman and Jeffrey David Ullman, 2014, 978-1107077232
  • Algorithmische Geometrie, Rolf Klein, 2005, 978-3540209560