Large-Scale Data Analysis in the Cloud

At a Glance

A very large RDF dataset
Multiple data analysis tasks
Map/Reduce and Hadoop on Amazon's Elastic Compute Cloud
100 dollars in Amazon credit per student
A competition

Description

In this seminar, the students' task is to implement analyses for a very large RDF dataset. We will use the well-known Map/Reduce paradigm for parallelization, so that initial computations and tests on a subset of the data can be performed on our in-house Hadoop cluster. Final results will be computed on Amazon's Elastic Compute Cloud (EC2).
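To illustrate the paradigm, here is a minimal sketch in plain Java that imitates the map, shuffle, and reduce phases in a single process. It does not use the Hadoop API, and it rests on assumptions that are not part of the seminar description: the input is taken to be in N-Triples form (one `<subject> <predicate> <object> .` statement per line), and the example task, counting how often each predicate occurs, is purely hypothetical. The class and method names (`PredicateCount`, `map`, `reduce`, `run`) are chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PredicateCount {

    // Map phase: emit (predicate, 1) for one input line.
    // Assumption: N-Triples-like input, tokens separated by whitespace.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return out; // skip blank lines and comments
        }
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length == 3) {
            out.add(Map.entry(parts[1], 1)); // parts[1] is the predicate
        }
        return out;
    }

    // Reduce phase: sum all counts emitted for one predicate.
    static int reduce(String predicate, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map on every line, group values by key (the "shuffle"),
    // then run reduce once per key. A real framework distributes these steps.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample triples, for illustration only.
        List<String> sample = List.of(
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
            "<http://example.org/a> <http://example.org/name> \"Alice\" .",
            "<http://example.org/b> <http://example.org/knows> <http://example.org/c> .");
        run(sample).forEach((predicate, n) -> System.out.println(predicate + "\t" + n));
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, both steps can be run in parallel over partitions of the data, which is the property a framework such as Hadoop exploits.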

The seminar will be organized as a competition. We will form teams. Each team deals with the same set of (ranked) problems. The team that solves the most higher-ranking problems most efficiently will win the competition.

Organization

12 master students, 6 teams (2 members each)
we provide a dataset and a ranked set of tasks
you work in teams to solve these tasks
your team may need to trade off quality of individual solutions for solving all tasks during the course of the seminar
teams compete against each other
a team must not include more than one student who attended a previous MapReduce seminar
interested students are required to attend the first meeting
Supervisors: Johannes Lorey, Christoph Böhm
date: 22.04.2010 (first session)
place: A-1.1.

Educational Objective

understand how to handle and preprocess very large datasets
implement parallel algorithms using Map/Reduce
find the most efficient solutions
work with a real-world cloud and deal with resource constraints
autonomous teamwork

Requirements

You are expected to attend all sessions and personal meetings.
You have to design and implement Map/Reduce solutions in Java.
Give a talk about your solutions. Your fellow students are asked to discuss and comment on your work and results.
Submit a report (5 pages) on your solutions. The report should document, discuss and evaluate your solutions, showing strengths and weaknesses, your suggestions and comments ...
Your final grade is affected by your implementation, its efficiency, your talk, your report, your participation in discussions and your attendance.

Schedule

Date	Topic	Slides
22.04.	organizational issues, introduction to Hadoop and Map/Reduce, problem presentation
26.04.	deadline for team structure and topic request (via e-mail to respective lecturer)
06.05.	introduction to AWS, EC2, S3, Elastic MapReduce demo	ReferenceSheet
10.06.	presentation of intermediate results
17.06.	guest lecture "Nephele/PACTs" by Stephan Ewen (TU Berlin) - start: 14:00
22.07.	final presentation 2 (teams 3,4 Informationssysteme, team 2 Betriebssysteme) - start: 13:30 in room A-1.1	Karran/Metzler Linkhorst/Wehrmeyer Richter/Thiele
23.07.	final presentation 1 (teams 1,2 Informationssysteme, team 1 Betriebssysteme) - start: 09:15 in room A-1.2	Jacob/Kny Fenz/Pohl
31.08.	submission of reports

Hadoop Cluster Assignment

Date	Team
17.-20.5.	Dandy Fenz, Matthias Pohl
21.-24.5.	Matthias Jacob, Eyk Kny
25.-28.5.	Benjamin Karran, Richard Metzler
29.5.-1.6.	Martin Linkhorst, Stefan Wehrmeyer

2.-5.6.	Dandy Fenz, Matthias Pohl
6.-9.6.	Matthias Jacob, Eyk Kny
10.-13.6.	Benjamin Karran, Richard Metzler
14.-17.6.	Martin Linkhorst, Stefan Wehrmeyer

18.-19.6.	Dandy Fenz, Matthias Pohl
20.-21.6.	Matthias Jacob, Eyk Kny
22.-23.6.	Benjamin Karran, Richard Metzler
24.-25.6.	Martin Linkhorst, Stefan Wehrmeyer

Feel free to ask your fellow students to swap time slots.