Mining Massive Datasets (Summer Semester 2016)

Lecturer: Prof. Dr. Felix Naumann (Information Systems)
Course website: https://hpi.de/naumann/teaching/teaching/ss-16/mining-massive-datasets.html

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment deadline: 22 April 2016
Course format: Seminar
Enrollment type: Compulsory elective module
Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

IT-Systems Engineering A
IT-Systems Engineering B
IT-Systems Engineering C
IT-Systems Engineering D

IT-Systems Engineering BA

Description

In this course we will develop large-scale data mining techniques and research prototypes. Each team, consisting of three students, must identify a challenging big data problem and solve it using a distributed computing framework. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is already widely adopted. In this seminar, each group will apply one of two more recent frameworks, Apache Spark or Apache Flink, to solve its challenge. Students will have access to an Amazon EC2 computing cluster. As a result, we can work with only a small number of students, and enrollment is limited.

This is a project course. There will be only a few weekly lectures and only one or two introductory lectures. We will spend the semester working in teams on different projects related to large-scale data mining. Teams will meet frequently with their assigned mentor.

Prerequisites

Programming experience in Java and/or Scala

Learning and Teaching Methods

The course teaches participants to solve a self-chosen data mining problem by developing distributed algorithms, for example with the prominent Map/Reduce paradigm.

Assessment

individual: active participation during group meetings and individual consultations
each team: a pitch of the selected data and identified problem
each team: a presentation of the Big-Data challenge (incl. proof of concept)
each team: an intermediate presentation demonstrating your first insights regarding your distributed implementation
each team: a final presentation demonstrating your solution
each team: a report (5 pages) on your solution. The report should document, discuss, and evaluate the solution, showing its strengths and weaknesses and including your suggestions and comments

Schedule

The up-to-date schedule is available on the course page.
