Hasso-Plattner-Institut
Prof. Dr. Felix Naumann

Distributed Big Data Analytics


Organization

  • Project seminar for Master students
  • 6 credit points, 4 SWS
  • Weekly meetings: either group meeting or individual team meetings with a supervisor
  • Supervisors

Description

This seminar gives its participating students the opportunity to gain experience with the development of distributed data analysis programs. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is an already widely adopted means to cope with some of the challenges summarized by the term Big Data [1]: the Map/Reduce paradigm is flexible enough to deal with arbitrary data formats (Variety), and its programs can be transparently distributed across computer clusters to handle large amounts of data (Volume).
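To give a first impression of the paradigm, the following is a minimal local sketch of the classic Map/Reduce word-count pattern in plain Java. It only simulates the two phases on in-memory collections; an actual Hadoop, Spark, or Flink job would additionally handle cluster distribution, shuffling, and HDFS I/O, all of which are omitted here. The class and method names are illustrative, not part of any framework API.

```java
import java.util.*;
import java.util.stream.*;

// Local simulation of the Map/Reduce word-count pattern
// (no Hadoop/Spark/Flink dependency; distribution is omitted).
public class WordCount {

    // Map phase: emit a (word, 1) pair for every word in a line.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: sum the counts per word. In a real framework the
    // grouping by key happens in a distributed shuffle; here the
    // groupingBy collector stands in for it.
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(WordCount::map)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(List.of("big data big analytics", "data"));
        System.out.println(counts.get("big"));  // prints 2
        System.out.println(counts.get("data")); // prints 2
    }
}
```

Both Spark and Flink express this same pattern through higher-level operators (e.g. flatMap followed by a grouped aggregation), which is one reason the two platforms can be compared on identical tasks.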

In this seminar, we want to have a look at two more recent frameworks, Apache Spark and Apache Flink. Both extend the classical Map/Reduce paradigm and surpass Hadoop in both flexibility and performance. Accordingly, the older of the two, Apache Spark, has already gained considerable attention [2]. Both platforms are suited to more or less the same types of data analysis tasks, yet they differ in their execution strategies. Our goal is therefore to obtain an objective comparison between Flink and Spark by implementing and optimizing a representative set of problems on both platforms and comparing these implementations in terms of performance.

Each team, consisting of two students, will be responsible for one problem class, implement appropriate algorithms on both platforms, and evaluate their performance. Over the course of the seminar, the students will get to know important concepts of distributed computing, e.g., the Map/Reduce paradigm and data locality, and also gain hands-on experience with different technologies, e.g., the aforementioned Spark and Flink and the distributed file system HDFS. To foster learning, we encourage the students to share their insights and help each other during the regular group meetings.

Course Objectives

Learning objective

  • learn to develop distributed data analysis algorithms within the prominent Map/Reduce paradigm

Tasks

  • implement a simple and a complex problem from a specific domain on both Apache Flink and Apache Spark (both require either Java or Scala programming skills)
  • evaluate your implementations in terms of performance
  • employ obtained insights to compare both platforms
  • actively participate in group meetings so as to learn from other teams and let other teams learn from you

Deliverables

  • individual: active participation during group meetings and individual consultations
  • each team: an intermediate presentation demonstrating your implementation on the first platform
  • each team: a final presentation demonstrating your implementation on the second platform, including a comparison to the first implementation
  • all teams together: a submission-ready paper that compares Flink and Spark by combining each team's individual insights

Problem classes

  • Business Analytics
  • Data Cleansing
  • Data Profiling
  • Data Mining
  • Graph Mining
  • Text Mining
  • Machine Learning

Schedule

Date                 | Topic
---------------------|---------------------------------------------
Apr 13, 2015         | Course presentation
Apr 20, 2015         | Introduction to distributed data processing
Apr 27, 2015         | Group meeting
around May 4, 2015   | Individual team meetings
May 11, 2015         | Group meeting
May 18, 2015         | Group meeting
around May 25, 2015  | Individual team meetings
Jun 1, 2015          | Intermediate presentation
Jun 8, 2015          | Group meeting
around Jun 15, 2015  | Individual team meetings
Jun 22, 2015         | Group meeting
around Jun 29, 2015  | Individual team meetings
around Jul 6, 2015   | Individual team meetings
Jul 13, 2015         | Final presentation
tbd                  | Individual team meetings/group meeting
tbd                  | Hand-in of the paper