Project seminar on Mining Massive Datasets

Organization

Project seminar for master's students
6 credit points, 4 SWS
Weekly meetings: either as group meetings or individual team meetings with a supervisor
Supervisors
- Toni Grütze
- Sebastian Kruse
- Dr. Alexander Albrecht and Dr. Christoph Böhm will join the seminar as supervisors. Both are big data experts with many years of industrial/research experience and co-founders of bakdata. Bakdata is an independent IT service provider with a strong focus on data-oriented software solutions. Since its foundation in 2013, bakdata has attracted major international companies, such as GfK, Elsevier, and OTTO.

Description

In this course we will develop large-scale data mining techniques and research prototypes. Each team of three students must identify a challenging big data problem and solve it using a distributed computing framework. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already widely adopted. In this seminar, each group will apply one of two more recent frameworks, Apache Spark or Apache Flink, to solve its challenge. Students will have access to an Amazon EC2 computing cluster. Since cluster capacity is limited, we can work with only a small number of students, and enrollment will be limited.

This is a project course. There will be only a few lectures, including one or two introductory ones. We will spend the term working in teams on different large-scale data mining projects. Teams will meet frequently with their assigned mentor.

Projects

Improving Shared Knowledge - GitHub
Mining Amazon Reviews for better Product Placement - GitHub
NYC Taxi Prediction - GitHub
Twitter News Feed - GitHub

Learning objective

Solve a self-chosen data mining problem by developing distributed algorithms, for example with the prominent Map/Reduce paradigm.
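
To make the Map/Reduce paradigm concrete, here is a minimal single-machine sketch of a word count in plain Java. It is not Spark, Flink, or Hadoop code: the class `WordCountSketch` and its methods `map`, `reduce`, and `run` are hypothetical names chosen for illustration, and the "shuffle" step is simulated with an in-memory map. A distributed framework would run the same three phases across many machines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class WordCountSketch {

    // Map phase: emit one (word, 1) pair for every token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(word -> !word.isEmpty())
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: sum all counts that were emitted for one word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by key (the "shuffle"),
    // then reduce each group. Everything happens in memory here.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to mine or not to mine");
        System.out.println(run(lines)); // {be=2, mine=2, not=2, or=2, to=4}
    }
}
```

The point of the split is that `map` looks at one record at a time and `reduce` at one key at a time, so both phases can be parallelized independently; only the grouping step needs to move data between workers.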

Tasks

Find an interesting dataset for a typical data mining problem
Identify a challenge to be solved based on your dataset (profiling)
Implement your solution on either Apache Flink or Apache Spark (both require Java or Scala programming skills)
Evaluate your implementations in terms of performance and result quality
Actively participate in group meetings so as to learn from other teams and let other teams learn from you
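
As one possible way to approach the evaluation task, the following sketch measures wall-clock runtime and computes precision, recall, and F1 against a labeled reference set. The choice of these metrics is an assumption, not a course requirement; the right quality measure depends on the problem each team selects. The class `EvaluationSketch` and all method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

public class EvaluationSketch {

    // Performance: wall-clock time of a single run, in milliseconds.
    // For a real measurement, repeat the run and report an average.
    static <T> long timeMillis(Supplier<T> job) {
        long start = System.nanoTime();
        job.get();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // Number of predicted items that also appear in the reference set.
    static int hits(Set<String> predicted, Set<String> expected) {
        Set<String> common = new HashSet<>(predicted);
        common.retainAll(expected);
        return common.size();
    }

    // Result quality: fraction of predicted items that are correct.
    static double precision(Set<String> predicted, Set<String> expected) {
        return predicted.isEmpty() ? 0.0 : (double) hits(predicted, expected) / predicted.size();
    }

    // Result quality: fraction of expected items that were found.
    static double recall(Set<String> predicted, Set<String> expected) {
        return expected.isEmpty() ? 0.0 : (double) hits(predicted, expected) / expected.size();
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    public static void main(String[] args) {
        Set<String> predicted = Set.of("a", "b", "c", "d");
        Set<String> expected = Set.of("a", "b", "e");
        double p = precision(predicted, expected);
        double r = recall(predicted, expected);
        System.out.printf("precision=%.2f recall=%.2f f1=%.2f%n", p, r, f1(p, r));
        System.out.println("runtime ms: " + timeMillis(() -> recall(predicted, expected)));
    }
}
```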

Deliverables

individual: active participation during group meetings and individual consultations
each team: a pitch of the selected data and identified problem
each team: a presentation of the big data challenge (incl. proof of concept) -- 10+5 min
each team: an intermediate presentation demonstrating your first insights regarding your distributed implementation -- 15+5 min
each team: a final presentation demonstrating your solution -- 15+5 min
each team: code & documentation (on GitHub). The documentation should contain information on how to execute and evaluate your solution. Furthermore, it should also show strengths and weaknesses of the implementation.

Schedule

Date|Topic
2016-04-15|introduction lecture
2016-04-18|application due
2016-04-22|project pitch (each student), voting, and team building
2016-04-29|bakdata
2016-05-06|individual meeting
2016-05-13|individual meeting
2016-05-20|proof of concept: What is the potential of the data, and how will that affect your task?
2016-05-27|hands-on Spark / Flink & AWS
2016-06-03|individual meeting
2016-06-10|individual meeting
2016-06-17|individual meeting
2016-06-24|individual meeting
2016-07-01|intermediate presentation - pragmatic solution (postponed)
2016-07-08|individual meeting
2016-07-15|individual meeting
2016-07-22|final presentation - tweaked solution
TBD|finalize your documentation (GitHub)