Distributed Data Management

Description

The free lunch is over! Computer systems up until the turn of the century became constantly faster without any particular effort simply because the hardware they were running on increased its clock speed with every new release. This trend has changed and today's CPUs stall at around 3 GHz. The size of modern computer systems in terms of contained transistors (cores in CPUs/GPUs, CPUs/GPUs in compute nodes, compute nodes in clusters), however, still increases constantly. This caused a paradigm shift in writing software: instead of optimizing code for a single thread, applications now need to solve their given tasks in parallel in order to expect noticeable performance gains. Distributed computing, i.e., the distribution of work on (potentially) physically isolated compute nodes is the most extreme method of parallelization.

Big data analytics and management are a multi-million dollar markets that grow constantly! The ability to control and utilize large amounts of data is the most valuable ability of today's computer systems. Because data volumes grow so rapidly and with them the complexity of questions they should answer, data analytics, i.e., the ability of extracting any kind of information from the data becomes increasingly difficult. As data analytics systems cannot hope for their hardware getting any faster to cope with performance problems, they need to embrace new software trends that let their performance scale with the still increasing number of processing elements.

In this lecture, we take a look at various technologies involved in building distributed, data-intensive systems. We start by discussing fundamental concepts in distributed computing, such das data models, encoding formats, messaging, data replication and partitioning, fault tollerance, and batch- and stream processing. In between, we consider different practical systems from the Big Data Landscape, such as Akka and Spark. In the end, we concentrate on data management aspects, such as distributed database management system architectures and distributed query optimization.

Prerequisites

To take this class, a basic understanding of object-oriented programming, data structures and relational databases is required. If you cannot programm fluently in at least one object-oriented language (e.g. Java, C#, C++, Python, Ruby etc.) taking this class will be hard.

During the exercises, participants will need to program in Java and Scala. In preparation of the course, we therefore recomment to get familiar with these two languages or refresh your knowledge. We expect no deep knowledge, but you should know the basic language constructs and be able to create, for instance, a Maven Java programm and an SBT Scala programm that both read a file and count all words in these two files.

Attendence

Due to the ongoing COVID-19 situation, the Distributed Data Management lecture will be streamed live and recorded on tele-TASK. We highly recommend to attend the live sessions to ask questions and actively contribute to the course. Once the registration periode is over, though, we will also create a mailing list with all participants. Please use the list to actively ask and answer questions.

The DDM lecture is also hosted in the HPI Moodle System. Please also register for the course in Moodle so you can use the forum to find an exercise partner.

The recording will be available on TeleTask. The live stream can be joined with the following two links:

	Montags: (link)	Mittwochs: (link)
Time:	11:00 - 12:30	13:30 - 15:00
Meeting ID:	929 1188 0838	961 8615 5958
Entrance code:	791970	491884

Schedule

The slides of the lectures will be posted in the schedule below. Please check the schedule regularly. If a lecture needs to be cancled, it will be marked in the list below. We will also have two sessions, in which we discuss your homework submissions. For these sessions, it will be necessary that at least one person of each homework-team is present either in person or online to present your solution. So please make sure that you don't miss these dates!

Date	Subject
12.04.2021	Introduction
14.04.2021	Foundations
19.04.2021	Encoding
21.04.2021	Communication
26.04.2021	Communication
28.04.2021	Communication
03.05.2021	Hands-on Akka Actor Programming
05.05.2021	Hands-on Akka Actor Programming
10.05.2021	<cancled>
12.05.2021	Hands-on Akka Actor Programming
17.05.2021	Data Models and Query Languages
19.05.2021	Storage and Retrieval
24.05.2021	<public holiday>
26.05.2021	Replication <at 9:15 !>
31.05.2021	Partitioning
02.06.2021	Distributed Systems
07.06.2021	Distributed Systems
09.06.2021	Consistency and Consensus
14.06.2021	Transactions
16.06.2021	Batch Processing
21.06.2021	Batch Processing
23.06.2021	Batch Processing
28.06.2021	Hands-on Spark Batch Processing (tpch.zip)
30.06.2021	Hands-on Spark Batch Processing (tpch.zip)
05.07.2021	Homework Evaluation Akka
07.07.2021	Stream Processing
12.07.2021	Stream Processing
14.07.2021	Distributed Database Management Systems
19.07.2021	Distributed Query Optimization
21.07.2021	Homework Evaluation Spark and Exam Preparation

Exam

The final grade will be determined in a written exam. The exam will take place on 26.07.2021 in the time from 13:00 to 16:00 in all three HPI lecture halls. Note that the precise exam planning needs to wait until we have a clearer picture of the corona situation in July. If the situation remains tense, the exam might need to be written in multiple rounds or we write an open-book-exam, which everyone can take at home.

The prerequisite for admission to the exam is the successful completion of all exercises! An exercise is completed by implementing and, then, submitting a functioning algorithm that solves a given task within given rules; an ecercise is failed by not submitting a solution, disobeying the rules, or clearly not solving the task; a sub-optimal performance and small programming mistakes do not let an exercise fail.

To practice for the exam, you can use the exams of WS_2017/18 (solution), WS_2018/19 (solution), and WS_2019/20 (solution). You can also find check-yourself questions in the slides (solution). This is the exam that you took at the end of this course SS_2021 (solution)

Literature

Course books:

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, Martin Kleppmann, 2017, 978-1449373320
Distributed Systems, Maarten van Steen and Andrew S. Tanenbaum, 2017, 978-1543057386
Principles of Distributed Database Systems, M. Tamer Özsu and Patrick Valduriez, 2011, 978-1441988331