Prof. Dr. Tilmann Rabl, Dr. Pedro Silva, Ilin Tolovski
The amount of data that can be generated and stored in academic and industrial projects and applications is increasing rapidly. Big data analytics technologies have established themselves as a solution to the scalability problems of traditional database systems. The vast amounts of newly collected data, however, are usually not as easily analyzed as curated, structured data in a data warehouse. Typically, these data are noisy, vary in format and velocity, and need to be analyzed with techniques from statistics and machine learning rather than pure SQL-like aggregations and drill-downs. Moreover, the results of the analyses are frequently models that are used for decision making and prediction. The complete process of big data analysis is described as a pipeline, which includes data recording, cleaning,
In this lecture, we will discuss big data systems, i.e., infrastructures that are used to handle all steps in typical big data processing pipelines. We will learn about data center infrastructure and scale-out software systems. The software discussed will cover the full big data stack, i.e., distributed file systems, MapReduce, key-value stores, stream processing, graph processing, and ML systems.
- Course management will be done using the HPI Moodle.
- Non-HPI participants: please send us an email to get access to the Moodle.
- All lectures are recorded and will be available in Tele-Task.
- The on-site part of the course will be held on Tuesday and Thursday, 11:00 - 12:30, in Building F, Room E-0.6.
- Attending the on-site sessions is not required; all material and exercises will be available online. However, we recommend the on-site sessions for Q&A, etc.
- The first week will be fully online.
- Due to the Corona situation, all course content will be online until further notice.
Welcome (Tele-Task)
Introduction (online; the link is available on the Moodle course page)
In this course, we want to provide an overview of big data technology. We will discuss the big data software stack, which forms the basis for most big data systems, and then give an overview of the variety of big data processing systems. You will learn how big data systems are composed as well as the internal architecture of each system.
The course will be held in a hybrid flipped classroom setup. All lectures will be recorded and provided as videos in Tele-Task. We will have weekly meetings during the lecture hours (Tuesday and Thursday, 11:00 - 12:30, in Building F, Room E-0.6), which we will also stream and record. The weekly meetings will be Q&A sessions and exercise discussions. The attendees will be split into two groups in the first week, one group for Tuesday and one for Thursday. You can participate online or on site. We will monitor the Corona situation closely and adapt the organization as required. Generally, if you feel sick or uneasy about on-site teaching, we recommend virtual participation.
There is no course book; the slides contain several pointers to reference literature. In general, we can recommend the following books, which cover aspects of the course:
- Database Systems: The Complete Book, Garcia-Molina, Ullman, and Widom
- Streaming Systems, Akidau, Chernyak, Lax
- The Art of Computer Systems Performance Analysis, Jain
The course requires a basic understanding of database systems (e.g., DBS I) and Java programming skills.
The grade will be determined 100% by the final exam. The time and location of the exam will be announced at least 6 weeks in advance. The prerequisite for admission to the exam is the successful completion of the exercises and programming exercises. In case of low participation, the exam can be replaced by an oral examination.
Asking questions is greatly encouraged, in the live sessions, in the Moodle forums, by mail, etc. If you feel the question can be relevant to anybody else in the course, which it usually is, please use the forum so everybody can benefit from the answer. You are free to discuss our exercises with each other, but do not share solutions.
We have a zero-tolerance policy for plagiarism, copying, or other forms of dishonesty. Plagiarism will result in immediately failing the course. We use code checkers for your programming assignment submissions. Giving other students access to your solutions also constitutes academic misconduct.
Especially in an online setting, it is important to be mindful of your communication. We want this course to be a safe and fun learning experience. Please be respectful and considerate in your communication with your peers and instructors.