Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

At a Glance

  • A very large RDF dataset
  • Multiple data analysis tasks
  • Map/Reduce and Hadoop on Amazon's Elastic Compute Cloud
  • 100 dollars in Amazon credit per student
  • A competition

Description

In this seminar, the students' task is to implement analyses of a very large RDF dataset. We will use the well-known Map/Reduce paradigm for parallelization. Initial computations and tests on a subset of the data can be performed on our in-house Hadoop cluster. Final results will be computed on Amazon's Elastic Compute Cloud (EC2).

The seminar will be organized as a competition. We will form teams, and each team works on the same set of ranked problems. The team that solves the most high-ranking problems most efficiently wins the competition.
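
To illustrate the Map/Reduce paradigm mentioned above, here is a minimal sketch in plain Java that counts how often each predicate occurs in a handful of RDF triples. It is only an illustration under stated assumptions: the input is assumed to be N-Triples-like lines ("subject predicate object ."), the counting task is a hypothetical example and not one of the seminar's ranked tasks, and the class and method names (PredicateCount, map, reduce, countPredicates) are made up for this sketch. It deliberately does not use the Hadoop API; it simulates the map, shuffle, and reduce phases in memory so it can be run without a cluster.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical example: an in-memory simulation of Map/Reduce, not Hadoop code.
class PredicateCount {

    // Map phase: emit (predicate, 1) for one N-Triples-like line (assumed format).
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return out; // skip blank lines and comments
        }
        // Split into subject, predicate, and the rest (object plus trailing dot).
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length == 3) {
            out.add(new AbstractMap.SimpleEntry<>(parts[1], 1));
        }
        return out;
    }

    // Reduce phase: sum all counts emitted for one predicate.
    static int reduce(String predicate, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: run map on every line, group values by key (shuffle), then reduce.
    static Map<String, Integer> countPredicates(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample triples for illustration only.
        List<String> sample = Arrays.asList(
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
            "<http://example.org/b> <http://example.org/knows> <http://example.org/c> .",
            "<http://example.org/a> <http://example.org/name> \"Alice Example\" .");
        System.out.println(countPredicates(sample));
    }
}
```

On a real cluster, the map and reduce steps would run in parallel on different machines and the framework would do the grouping. The split into an independent per-line map step and a per-key reduce step is what makes that possible.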

Organization

  • 12 master's students, 6 teams (2 members each)
  • we provide a dataset and a ranked set of tasks
  • you work in teams to solve these tasks
  • your team may need to trade off the quality of individual solutions against solving all tasks during the course of the seminar
  • teams compete against each other
  • a team must not include more than one student who attended a previous MapReduce seminar
  • interested students are required to attend the first meeting
  • supervisors: Johannes Lorey, Christoph Böhm
  • date: 22.04.2010 (first session)
  • place: A-1.1

Educational Objective

  • understand how to handle and preprocess very large datasets
  • implement parallel algorithms using Map/Reduce
  • find the most efficient solutions
  • work with a real-world cloud and deal with resource constraints
  • autonomous teamwork

Requirements

  • You are expected to attend all sessions and personal meetings.
  • You have to design and implement Map/Reduce solutions in Java.
  • Give a talk about your solutions. Your fellow students are asked to discuss and comment on your work and results.
  • Submit a report (5 pages) on your solutions. The report should document, discuss, and evaluate your solutions, showing their strengths and weaknesses along with your suggestions and comments.
  • Your final grade depends on your implementation, its efficiency, your talk, your report, your participation in discussions, and your attendance.

Schedule

Date | Topic | Slides
22.04. | organizational issues; introduction to Hadoop and Map/Reduce; problem presentation |
26.04. | deadline for team structure and topic request (via e-mail to the respective lecturer) |
06.05. | introduction to AWS, EC2, S3, Elastic MapReduce; demo | ReferenceSheet
10.06. | presentation of intermediate results |
17.06. | guest lecture "Nephele/PACTs" by Stephan Ewen (TU Berlin), start: 14:00 |
22.07. | final presentation 2 (teams 3, 4 Informationssysteme; team 2 Betriebssysteme), start: 13:30 in room A-1.1 | Karran/Metzler, Linkhorst/Wehrmeyer, Richter/Thiele
23.07. | final presentation 1 (teams 1, 2 Informationssysteme; team 1 Betriebssysteme), start: 09:15 in room A-1.2 | Jacob/Kny, Fenz/Pohl
31.08. | submission of reports |

Hadoop Cluster Assignment

Date | Team
17.-20.5. | Dandy Fenz, Matthias Pohl
21.-24.5. | Matthias Jacob, Eyk Kny
25.-28.5. | Benjamin Karran, Richard Metzler
29.5.-1.6. | Martin Linkhorst, Stefan Wehrmeyer
2.-5.6. | Dandy Fenz, Matthias Pohl
6.-9.6. | Matthias Jacob, Eyk Kny
10.-13.6. | Benjamin Karran, Richard Metzler
14.-17.6. | Martin Linkhorst, Stefan Wehrmeyer
18.-19.6. | Dandy Fenz, Matthias Pohl
20.-21.6. | Matthias Jacob, Eyk Kny
22.-23.6. | Benjamin Karran, Richard Metzler
24.-25.6. | Martin Linkhorst, Stefan Wehrmeyer

Feel free to ask your fellow students to swap time slots.