Large-Scale Data Analysis in the Cloud
At a Glance
- A very large RDF dataset
- Multiple data analysis tasks
- Map/Reduce and Hadoop on Amazon's Elastic Compute Cloud
- 100 dollars in Amazon credit per student
- A competition
Description
In this seminar, the students' task is to implement analyses for a very large RDF dataset. We will use the well-known Map/Reduce paradigm for parallelization, so that initial computations and tests on a data subset can be performed on our in-house Hadoop cluster. Final results will be computed on Amazon's Elastic Compute Cloud (EC2).
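To make the Map/Reduce paradigm concrete, here is a minimal plain-Java sketch that counts how often each predicate occurs in a handful of triples. It only imitates the map, shuffle, and reduce phases in memory and does not use the Hadoop API; a real job would implement Hadoop's mapper and reducer classes instead. The N-Triples-style input (one `<subject> <predicate> <object> .` statement per line), the `example.org` IRIs, and all class and method names are assumptions made for illustration, not part of the seminar material.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of a predicate count; no Hadoop dependency.
final class PredicateCountSketch {

    // Map phase: emit (predicate, 1) for one line of the assumed form "<s> <p> <o> ."
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return out; // skip blank lines and comments
        }
        // Limit 3 keeps an object literal that contains spaces in one piece.
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length == 3) {
            out.add(Map.entry(parts[1], 1));
        }
        return out;
    }

    // Reduce phase: sum all counts emitted for one predicate.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: run map, group by key (shuffle), run reduce.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample triples for illustration only.
        List<String> sample = List.of(
            "<http://example.org/a> <http://example.org/knows> <http://example.org/b> .",
            "<http://example.org/b> <http://example.org/knows> <http://example.org/c> .",
            "<http://example.org/a> <http://example.org/name> \"Alice Example\" .");
        run(sample).forEach((predicate, count) ->
            System.out.println(predicate + "\t" + count));
    }
}
```

Because `map` looks at one line at a time and `reduce` at one key at a time, both phases can be spread across many machines, which is what makes the paradigm suitable for datasets that do not fit on a single node.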
The seminar will be organized as a competition. We will form teams, and each team works on the same set of ranked problems. The team that solves the most high-ranking problems most efficiently will win the competition.
Organization
- 12 master's students, 6 teams (2 members each)
- we provide a dataset and a ranked set of tasks
- you work in teams to solve these tasks
- your team may need to trade off quality of individual solutions for solving all tasks during the course of the seminar
- teams compete against each other
- a team must not contain more than one student who attended a former MapReduce seminar
- interested students are required to attend the first meeting
- Supervisors: Johannes Lorey, Christoph Böhm
- date: 22.04.2010 (first session)
- place: A-1.1.
Educational Objective
- understand how to handle and preprocess very large datasets
- implement parallel algorithms using Map/Reduce
- find the most efficient solutions
- work with a real-world cloud and deal with resource constraints
- autonomous teamwork
Requirements
- You are expected to attend all sessions and personal meetings.
- You have to design and implement Map/Reduce solutions in Java.
- Give a talk about your solutions. Your fellow students are asked to discuss and comment on your work and results.
- Submit a report (5 pages) on your solutions. The report should document, discuss and evaluate your solutions, showing strengths and weaknesses, your suggestions and comments ...
- Your final grade is affected by your implementation, its efficiency, your talk, your report, your participation in discussions and your attendance.
Schedule
| Date | Topic | Slides |
|---|---|---|
| 22.04. | | |
| 26.04. | | |
| 06.05. | | ReferenceSheet |
| 10.06. | | |
| 17.06. | | |
| 22.07. | | Richter/Thiele |
| 23.07. | | |
| 31.08. | | |
Hadoop Cluster Assignment
| Date | Team |
|---|---|
| 17.-20.5. | Dandy Fenz, Matthias Pohl |
| 21.-24.5. | Matthias Jacob, Eyk Kny |
| 25.-28.5. | Benjamin Karran, Richard Metzler |
| 29.5.-1.6. | Martin Linkhorst, Stefan Wehrmeyer |
| 2.-5.6. | Dandy Fenz, Matthias Pohl |
| 6.-9.6. | Matthias Jacob, Eyk Kny |
| 10.-13.6. | Benjamin Karran, Richard Metzler |
| 14.-17.6. | Martin Linkhorst, Stefan Wehrmeyer |
| 18.-19.6. | Dandy Fenz, Matthias Pohl |
| 20.-21.6. | Matthias Jacob, Eyk Kny |
| 22.-23.6. | Benjamin Karran, Richard Metzler |
| 24.-25.6. | Martin Linkhorst, Stefan Wehrmeyer |
Feel free to ask your fellow students to swap time slots.