Project seminar on Mining Massive Datasets
Organization
- Project seminar for master students
- 6 credit points, 4 SWS
- Weekly meetings: either as group meetings or individual team meetings with a supervisor
- Supervisors:
  - Toni Grütze
  - Sebastian Kruse
- Dr. Alexander Albrecht and Dr. Christoph Böhm will join the seminar as supervisors. Both are big data experts with many years of industrial/research experience and co-founders of bakdata. Bakdata is an independent IT service provider with a strong focus on data-oriented software solutions. Since its foundation in 2013, bakdata has attracted major international companies, such as GfK, Elsevier, and OTTO.
Description
In this course, we will develop large-scale data mining techniques and research prototypes. Each team of three students must identify a challenging big data problem and solve it using a distributed computing framework. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already widely adopted. In this seminar, each group will apply one of two more recent frameworks, Apache Spark or Apache Flink, to solve its challenge. Students will have access to an Amazon EC2 computing cluster. Because of this, we can work with only a small number of students, and enrollment will be limited.
This is a project course. There will be only a few lectures, including one or two introductory ones. We will spend the quarter working in teams on different large-scale data mining projects. Teams will meet frequently with their assigned mentor.
Projects
Learning objective
Solve a self-chosen data mining problem by developing distributed algorithms, for example with the prominent Map/Reduce paradigm.
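To make the Map/Reduce paradigm mentioned above concrete, here is a minimal, framework-free sketch of its three phases (map, shuffle, reduce) applied to word counting. It is written in plain Python for brevity and runs on a single machine; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are illustrative assumptions and are not part of Hadoop, Spark, or Flink. In the seminar itself, the same structure would be expressed with the Java or Scala APIs of the chosen framework.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit one (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key.

    A distributed framework performs this step across machines;
    here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

Because the map function looks at one line at a time and the reduce function at one key at a time, both phases can be run in parallel on partitions of the data, which is what the distributed frameworks automate.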
Tasks
- Find an interesting dataset for a typical data mining problem
- Identify a challenge to be solved based on your dataset (profiling)
- Implement your solution either on Apache Flink or Apache Spark (both require either Java or Scala programming skills)
- Evaluate your implementations in terms of performance and result quality
- Actively participate in group meetings so as to learn from other teams and let other teams learn from you
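As a rough illustration of the evaluation task above, the following sketch measures wall-clock runtime and compares a result set against a reference set. It is only one possible approach: the course description does not prescribe specific metrics, so the use of precision, recall, and F1, as well as the helper names `timed` and `precision_recall_f1`, are assumptions made for this example.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def precision_recall_f1(predicted, expected):
    """Compare a predicted result set against a reference (gold) set.

    Assumes both arguments are collections of hashable items.
    Returns 0.0 for a measure whose denominator would be zero.
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    result, seconds = timed(sorted, [3, 1, 2])
    print(result, f"{seconds:.6f}s")
    print(precision_recall_f1({1, 2, 3}, {2, 3, 4}))
```

For a distributed job, the timed call would typically wrap the complete job submission, and the measurement would be repeated for different cluster sizes and input sizes to show how the solution scales.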
Deliverables
- individual: active participation during group meetings and individual consultations
- each team: a pitch of the selected data and identified problem
- each team: a presentation of the big data challenge (incl. proof of concept) -- 10+5 min
- each team: an intermediate presentation demonstrating your first insights regarding your distributed implementation -- 15+5 min
- each team: a final presentation demonstrating your solution -- 15+5 min
- each team: code & documentation (on GitHub). The documentation should contain information on how to execute and evaluate your solution. Furthermore, it should also show strengths and weaknesses of the implementation.
Schedule
| Date | Topic |
| 2016-04-15 | introduction lecture |
| 2016-04-18 | application due |
| 2016-04-22 | project pitch (each student), voting, and team building |
| 2016-04-29 | bakdata |
| 2016-05-06 | individual meeting |
| 2016-05-13 | individual meeting |
| 2016-05-20 | proof of concept: What is the potential of the data, and how will that affect your task? |
| 2016-05-27 | Hands-on Spark / Flink & AWS |
| 2016-06-03 | individual meeting |
| 2016-06-10 | individual meeting |
| 2016-06-17 | individual meeting |
| 2016-06-24 | individual meeting |
| 2016-07-01 | intermediate presentation - pragmatic solution (postponed) |
| 2016-07-08 | individual meeting |
| 2016-07-15 | individual meeting |
| 2016-07-22 | final presentation - tweaked solution |
| tbd. | finalize your documentation (GitHub) |