Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Scalable Data Analysis Algorithms

In this seminar, we implement large-scale data processing tasks on two MapReduce [1] platforms. Each team of two students will study a large-scale data analysis problem, implement a solution in both Hadoop [2] and Stratosphere [3], and compare the efficiency and scale-out properties of the two solutions.

The maximum number of students is 8, resulting in 4 teams.

Time schedule

To participate, please join us at the first meeting on Oct 18 in H-2.58.

  • October 18: Topic introduction
  • October 22: Submission of topic wishlist
  • October 24: Notification
  • November 15: Paper presentation and implementation ideas (15+5 min)
  • December 20: Intermediate presentation (15+5 min)
  • February 10, 9:15: Final presentation (30+10 min)
  • March 25: Final report (6-8 pages)
  • Consultations once per week: a few mandatory ones (before the presentations and before submission of the report), the others optional

Topics

We offer four topics, all of which are described in detail in [4].

Link Analysis

Calculate PageRank on a cluster efficiently (Chapter 5.2) and implement one extension countering link spam (either TrustRank (5.4.4) or SpamMass (5.4.5)).
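As a starting point, the basic PageRank iteration can be sketched on a single machine as follows. This is only an illustrative sketch, not the cluster implementation the topic asks for: the function name `pagerank`, the parameter `beta` (the probability of following a link rather than teleporting), and the choice to spread rank lost at dead ends uniformly over all pages are assumptions made here. The two commented phases hint at how the loop might be split into a map and a reduce step.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Iterative PageRank with taxation (hypothetical helper).

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # "Map" phase: every page sends an equal share of its rank
        # along each of its out-links.
        contrib = {page: 0.0 for page in pages}
        for src in pages:
            targets = links.get(src, [])
            for dst in targets:
                contrib[dst] += rank[src] / len(targets)
        # "Reduce" phase: sum the incoming shares and apply taxation.
        # Rank not passed along links (teleport share plus dead-end mass)
        # is redistributed uniformly, which is an assumption of this sketch.
        leaked = 1.0 - beta * sum(contrib.values())
        rank = {page: beta * contrib[page] + leaked / n for page in pages}
    return rank


# Small example graph; the ranks always sum to 1.
example_ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

On a cluster, the rank vector and the link structure would no longer fit in one dictionary, which is where the partitioning questions discussed in Chapter 5.2 come in.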

Clustering in Non-Euclidean Space

Clustering groups similar items according to a distance measure. Chapter 7.5 introduces clustering in non-Euclidean space, and Section 7.6.6 briefly outlines a parallel implementation.
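One consequence of leaving Euclidean space is that points cannot be averaged, so a cluster has no centroid. A common substitute is a clustroid: an actual member of the cluster that is closest to the others. The sketch below illustrates this idea only; it is not the algorithm of Chapter 7.5. The Jaccard distance on sets and the sum-of-squared-distances criterion are choices made for this example, and the function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are considered identical
    return 1.0 - len(a & b) / len(union)


def clustroid(points, distance=jaccard_distance):
    """Return the point with the smallest sum of squared distances
    to all points in the cluster."""
    def rowsum(p):
        return sum(distance(p, q) ** 2 for q in points)
    return min(points, key=rowsum)


cluster = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b"}, {"x", "y", "z"}]
representative = clustroid(cluster)  # {"a", "b"} for this data
```

Any other distance measure (for example, edit distance on strings) can be passed as the `distance` argument without changing `clustroid`.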

Frequent Itemsets

Frequent itemsets (Chapter 6) are sets of items that frequently co-occur in a large data set, e.g., books that are regularly bought together at Amazon. The SON algorithm parallelizes well with MapReduce, as described in Chapter 6.4.4.
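The two-pass structure of SON can be sketched in a few lines. In the first pass, each chunk of baskets is mined independently with a support threshold scaled down to the chunk's share of the data, which yields candidate itemsets; in the second pass, the candidates are counted over all baskets and filtered by the full threshold. The sketch below runs sequentially and makes several assumptions: the names `son` and `local_frequent`, the round-robin chunking, the `max_size` limit, and the brute-force local miner (a stand-in for an in-memory algorithm such as A-Priori) are all chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def local_frequent(chunk, threshold, max_size):
    """Brute-force stand-in for an in-memory miner such as A-Priori."""
    counts = Counter()
    for basket in chunk:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset for itemset, count in counts.items() if count >= threshold}


def son(baskets, support, num_chunks=2, max_size=2):
    """Return {itemset: count} for itemsets occurring in >= support baskets."""
    if not baskets:
        return {}
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]
    # Pass 1 (one map task per chunk): find locally frequent itemsets
    # using a threshold scaled to the chunk's share of the baskets.
    candidates = set()
    for chunk in chunks:
        local_threshold = support * len(chunk) / len(baskets)
        candidates |= local_frequent(chunk, local_threshold, max_size)
    # Pass 2: count every candidate over all chunks and keep the
    # itemsets that reach the global support threshold.
    totals = Counter()
    for chunk in chunks:
        for basket in chunk:
            items = set(basket)
            for candidate in candidates:
                if items.issuperset(candidate):
                    totals[candidate] += 1
    return {itemset: n for itemset, n in totals.items() if n >= support}


baskets = [["bread", "milk"], ["bread", "milk", "eggs"],
           ["milk", "eggs"], ["bread", "milk"]]
frequent = son(baskets, support=3)
# frequent == {("bread",): 3, ("milk",): 4, ("bread", "milk"): 3}
```

The first pass can produce false positives but no false negatives, because an itemset that is frequent overall must reach the scaled threshold in at least one chunk; the second pass removes the false positives.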

Collaborative Filtering

Collaborative filtering is a technique for recommending items to users based on a large knowledge base of previous user-item relations, e.g., purchases or ratings. Chapter 9 covers recommendation systems in general; one parallel approach is parallel stochastic gradient descent.
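To make the gradient descent idea concrete, the sketch below factorizes a small rating matrix into user and item factor vectors with plain, sequential stochastic gradient descent. It is an assumption-laden illustration rather than the parallel variant the topic targets: the function names, the latent-factor model with L2 regularization, and all hyperparameter values are chosen here for the example. A parallel version would additionally have to decide how ratings and factor vectors are partitioned across workers.

```python
import random


def sgd_factorize(ratings, num_users, num_items, rank=2,
                  learning_rate=0.05, regularization=0.02,
                  epochs=200, seed=0):
    """Fit factors U and V so that dot(U[u], V[i]) approximates each rating.

    ratings is a list of (user_index, item_index, rating) triples.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_users)]
    V = [[rng.uniform(0.0, 0.5) for _ in range(rank)] for _ in range(num_items)]
    samples = list(ratings)  # copy so the caller's list is not shuffled
    for _ in range(epochs):
        rng.shuffle(samples)
        for user, item, rating in samples:
            pu, qi = U[user], V[item]
            error = rating - sum(a * b for a, b in zip(pu, qi))
            for k in range(rank):
                pu_k, qi_k = pu[k], qi[k]
                # Gradient step on the squared error with L2 regularization.
                pu[k] += learning_rate * (error * qi_k - regularization * pu_k)
                qi[k] += learning_rate * (error * pu_k - regularization * qi_k)
    return U, V


def predict(U, V, user, item):
    """Predicted rating: dot product of the user and item factors."""
    return sum(a * b for a, b in zip(U[user], V[item]))


known = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
         (1, 1, 1.0), (2, 0, 1.0), (2, 1, 5.0)]
U, V = sgd_factorize(known, num_users=3, num_items=2)
```

Each update touches only one user vector and one item vector, which is the property that parallel variants exploit.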

Slides

Date | Topic | Slides
October 18, 2011 | Introduction | pdf
November 15, 2011 | Topic Presentation | Page Rank, Collaborative Filtering, Clustering, Frequent Itemsets
December 23, 2011 | Intermediate Presentation | Page Rank, Collaborative Filtering, Clustering, Frequent Itemsets
January 3, 2012 | Stratosphere Introduction | pdf
February 10, 2012 | Final Presentation | Page Rank, Collaborative Filtering, Clustering, Frequent Itemsets

Grading Process

  • 6 credit points (LP)
  • Paper presentation and implementation ideas (15+5 min)
  • Intermediate presentation (15+5 min)
  • Final presentation (30+10 min)
  • Final report (6-8 pages)
  • Participation in seminars and consultations

Literature

Overview

[1] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51.

[2] http://hadoop.apache.org/

[3] http://www.stratosphere.eu/

[4] Anand Rajaraman and Jeff Ullman. 2010. Mining of Massive Datasets. http://infolab.stanford.edu/~ullman/mmds.html

[5] Jens Dittrich, Jorge-Arnulfo Quiané-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, and Jörg Schad. 2010. Hadoop++: making a yellow elephant run like a cheetah (without it even noticing). In Proceedings of the VLDB Endowment.