Hasso-Plattner-Institut

Distributed Big Data Analytics (Summer Semester 2015)

Lecturer: Prof. Dr. Felix Naumann (Information Systems)
Course website: https://hpi.de/naumann/teaching/current-courses/ss-15/distributed-big-data-analytics.html

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline: 24.04.2015
  • Course format: Seminar
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 12

Degree Programs & Modules

IT-Systems Engineering BA


This seminar gives participating students the opportunity to gain experience in developing distributed data analysis programs. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already a widely adopted means of coping with some of the challenges summarized under the term Big Data: the Map/Reduce paradigm is flexible enough to deal with arbitrary data formats (Variety), and its programs can be transparently distributed across computer clusters to handle large amounts of data (Volume).
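For readers new to the paradigm, the following is a minimal single-machine sketch of the three Map/Reduce phases (map, shuffle, reduce) applied to word counting. It uses only the Java standard library (Java 9 or later) and is not the Hadoop, Spark, or Flink API; the class and method names (`MapReduceSketch`, `mapPhase`, `shufflePhase`, `reducePhase`, `wordCount`) are hypothetical and chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustrative in-memory sketch; a real framework would run these phases across a cluster.
final class MapReduceSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> mapPhase(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    // Shuffle phase: group all emitted values by their key.
    static Map<String, List<Integer>> shufflePhase(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reduce phase: combine all values of one key into a single result.
    static int reducePhase(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Chains the three phases over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = lines.stream()
                .flatMap(line -> mapPhase(line).stream())
                .collect(Collectors.toList());
        Map<String, Integer> counts = new TreeMap<>();
        shufflePhase(pairs).forEach((word, ones) -> counts.put(word, reducePhase(ones)));
        return counts;
    }

    public static void main(String[] args) {
        // prints {big=2, clusters=1, data=1}
        System.out.println(wordCount(List.of("big data", "big clusters")));
    }
}
```

Because the map and reduce functions only see one line or one key at a time, a framework is free to run them on different machines, which is what allows such programs to be distributed transparently.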

In this seminar, we look at two more recent frameworks, Apache Spark and Apache Flink. Both extend the classical Map/Reduce paradigm and surpass Hadoop in both flexibility and performance. The older of the two, Apache Spark, has already gained considerable attention. Both platforms are suited to more or less the same types of data analysis tasks, yet they differ in their execution strategies. Our goal is therefore to obtain an objective comparison between Flink and Spark by implementing and optimizing a representative set of problems on both platforms and comparing these implementations in terms of performance.

Each team of two students will be responsible for one problem class, implement appropriate algorithms on both platforms, and evaluate their performance. Over the course of the seminar, students will get to know important concepts of distributed computing, e.g. the Map/Reduce paradigm and data locality, and will also work with different technologies, e.g. the aforementioned Spark and Flink and the distributed file system HDFS. To speed up learning, we encourage students to share their insights and help each other during the regular group meetings.

Problem classes

  • Business Analytics
  • Data Cleansing
  • Data Profiling
  • Data Mining
  • Graph Mining
  • Text Mining
  • Machine Learning


Prerequisites

  • Programming experience in Java and/or Scala

Learning and Teaching Methods

The course teaches participants to develop distributed data analysis algorithms within the prominent Map/Reduce paradigm. This involves the following tasks:

  • implement a simple and a complex problem from a specific domain on both Apache Flink and Apache Spark (both require either Java or Scala programming skills)
  • evaluate your implementations in terms of performance
  • employ obtained insights to compare both platforms
  • actively participate in group meetings so as to learn from other teams and let other teams learn from you


  • individual: active participation during group meetings and individual consolidations
  • each team: an intermediate presentation demonstrating your implementation on the first platform
  • each team: a final presentation demonstrating your implementation on the second platform, including a comparison to the first implementation
  • all teams together: a submission-ready paper that compares Flink and Spark by combining each team's individual insights


The up-to-date schedule is available on the course website.