Hasso-Plattner-Institut (HPI)

Mining Massive Datasets (Summer Semester 2016)

Lecturer: Prof. Dr. Felix Naumann (Information Systems)
Course website: https://hpi.de/naumann/teaching/teaching/ss-16/mining-massive-datasets.html

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline: 22.04.2016
  • Course format: Seminar
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA
  • IT-Systems Engineering A
  • IT-Systems Engineering B
  • IT-Systems Engineering C
  • IT-Systems Engineering D
IT-Systems Engineering BA


In this course we will develop large-scale data mining techniques and research prototypes. Each team of three students must identify a challenging big data problem and solve it using a distributed computing framework. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already widely adopted. In this seminar, each group will apply one of two more recent frameworks, Apache Spark or Apache Flink, to its challenge. Students will have access to an Amazon EC2 computing cluster. For this reason, we can work with only a small number of students, and enrollment is limited.

This is a project course. There will be only a few weekly lectures and only one or two introductory lectures. We will spend the semester working in teams on different projects related to large-scale data mining. Teams will meet frequently with their assigned mentor.


Prerequisites: programming experience in Java and/or Scala.

Teaching and Learning Methods

The course teaches participants to solve a self-chosen data mining problem by developing distributed algorithms, for example with the prominent Map/Reduce paradigm.
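
For readers new to the Map/Reduce paradigm mentioned above, the following is a minimal single-machine sketch in plain Java. It is not taken from the course material. It assumes the classic word count task as the example problem, and the class and method names (`WordCountSketch`, `map`, `reduce`, `wordCount`) are hypothetical. A real project would express the same phases with a distributed framework, which would spread them across a cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration: word count expressed as map, shuffle, and reduce phases.
class WordCountSketch {

    // Map phase: turn one input line into (word, 1) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .<Map.Entry<String, Integer>>map(word -> new SimpleEntry<>(word, 1));
    }

    // Reduce phase: combine all counts that were emitted for one word.
    static int reduce(List<Integer> counts) {
        return counts.stream().mapToInt(Integer::intValue).sum();
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        // Shuffle phase: group the emitted values by key (here: by word).
        Map<String, List<Integer>> grouped = lines.stream()
                .flatMap(WordCountSketch::map)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Apply the reduce function once per key.
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to mine or not to mine");
        System.out.println(wordCount(lines)); // prints {be=2, mine=2, not=2, or=2, to=4}
    }
}
```

In this sketch the map and reduce functions only ever see one line or one key at a time. That independence is what allows a framework to run them in parallel on many machines.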


  • individual: active participation during group meetings and individual consultations
  • each team: a pitch of the selected data and identified problem
  • each team: a presentation of the big data challenge (incl. proof of concept)
  • each team: an intermediate presentation demonstrating your first insights regarding your distributed implementation
  • each team: a final presentation demonstrating your solution
  • each team: a report (5 pages) on your solution. The report should document, discuss, and evaluate the solution, showing its strengths and weaknesses along with your suggestions and comments


The up-to-date schedule is available on the course website.