Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Project seminar on Mining Massive Datasets

Organization

  • Project seminar for master students
  • 6 credit points, 4 SWS (semester hours per week)
  • Weekly meetings: either as group meetings or individual team meetings with a supervisor
  • Supervisors
    • Toni Grütze
    • Sebastian Kruse
    • Dr. Alexander Albrecht and Dr. Christoph Böhm will join the seminar as supervisors. Both are big data experts with many years of industrial/research experience and co-founders of bakdata. Bakdata is an independent IT service provider with a strong focus on data-oriented software solutions. Since its foundation in 2013, bakdata has attracted major international companies, such as GfK, Elsevier, and OTTO.

Description

In this course we will develop large-scale data mining techniques and research prototypes. Each team, consisting of three students, must identify a challenging big data problem and solve it using a distributed computing framework. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already widely adopted. In this seminar, each group will apply one of the two more recent frameworks, Apache Spark or Apache Flink, to solve its challenge. Students will have access to an Amazon EC2 computing cluster. Hence, we can only work with a small number of students, and enrollment will be limited.
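For illustration, a minimal word-count job in the Map/Reduce style, written against Spark's Scala RDD API, could look like the following sketch (class name and the argument-based input/output paths are placeholders, not course material):

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal Map/Reduce-style word count with the Spark RDD API (illustrative sketch).
  object WordCount {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
      val counts = sc.textFile(args(0))              // input: one line per record
        .flatMap(_.toLowerCase.split("\\W+"))        // map: split each line into words
        .filter(_.nonEmpty)
        .map(word => (word, 1))                      // emit (word, 1) pairs
        .reduceByKey(_ + _)                          // reduce: sum the counts per word
      counts.saveAsTextFile(args(1))                 // write results to the output path
      sc.stop()
    }
  }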

This is a project course: apart from one or two introductory sessions, there will be only a few lectures. We will spend the semester working in teams on different large-scale data mining projects. Teams will meet frequently with their assigned mentor.

Projects

  • Improving Shared Knowledge - GitHub
  • Mining Amazon Reviews for better Product Placement - GitHub
  • NYC Taxi Prediction - GitHub
  • Twitter News Feed - GitHub

Learning objective

Solve a self-chosen data mining problem by developing distributed algorithms, for example based on the prominent Map/Reduce paradigm.

Tasks

  • Find an interesting dataset for a typical data mining problem
  • Identify a challenge to be solved based on your dataset (profiling)
  • Implement your solution on either Apache Flink or Apache Spark (both require Java or Scala programming skills); a minimal Flink sketch follows this list
  • Evaluate your implementation in terms of performance and result quality
  • Actively participate in group meetings so as to learn from other teams and let other teams learn from you
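To give an impression of what such an implementation can look like, here is an equivalent word-count sketch using Flink's Scala DataSet API (class name and argument-based paths are again placeholders, not a prescribed solution):

  import org.apache.flink.api.scala._

  // Equivalent word count with the Flink DataSet API (illustrative sketch).
  object FlinkWordCount {
    def main(args: Array[String]): Unit = {
      val env = ExecutionEnvironment.getExecutionEnvironment
      val counts = env.readTextFile(args(0))         // read the input file line by line
        .flatMap(_.toLowerCase.split("\\W+"))        // split each line into words
        .filter(_.nonEmpty)
        .map(word => (word, 1))                      // emit (word, 1) pairs
        .groupBy(0)                                  // group by the word (tuple field 0)
        .sum(1)                                      // sum the counts (tuple field 1)
      counts.writeAsCsv(args(1))                     // write (word, count) pairs as CSV
      env.execute("FlinkWordCount")
    }
  }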

Deliverables

  • individual: active participation during group meetings and individual consultations
  • each team: a pitch of the selected data and identified problem
  • each team: a presentation of the big data challenge (incl. proof of concept) -- 10+5 min
  • each team: an intermediate presentation demonstrating your first insights regarding your distributed implementation -- 15+5 min
  • each team: a final presentation demonstrating your solution -- 15+5 min
  • each team: code & documentation (on GitHub). The documentation should explain how to execute and evaluate your solution and discuss the strengths and weaknesses of the implementation.

Schedule

Date       | Topic
2016-04-15 | introduction lecture
2016-04-18 | application due
2016-04-22 | project pitch (each student), voting, and team building
2016-04-29 | bakdata
2016-05-06 | individual meeting
2016-05-13 | individual meeting
2016-05-20 | proof of concept: What is the potential of the data, and how will that affect your task?
2016-05-27 | Hands-on Spark / Flink & AWS
2016-06-03 | individual meeting
2016-06-10 | individual meeting
2016-06-17 | individual meeting
2016-06-24 | individual meeting
2016-07-01 | intermediate presentation - pragmatic solution (postponed)
2016-07-08 | individual meeting
2016-07-15 | individual meeting
2016-07-22 | final presentation - tweaked solution
tbd.       | finalize your documentation (GitHub)