This seminar aims to give participating students the opportunity to gain experience with the development of distributed data analysis programs. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is an already widely adopted means to cope with some of the challenges summarized under the term Big Data [1]: the Map/Reduce paradigm is flexible enough to deal with arbitrary data formats (Variety), and its programs can be transparently distributed across computer clusters to handle large amounts of data (Volume).
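To illustrate the paradigm, the following is a minimal sketch of the classic word-count example over plain Scala collections; it is only meant to show the map/shuffle/reduce structure, which a framework like Hadoop distributes transparently across a cluster, and the example documents are made up.

    // Word count expressed as a map phase followed by a reduce phase.
    // On a cluster, the grouping (shuffle) and the per-group reduction
    // would be handled by the framework; here everything runs locally.
    object WordCountSketch extends App {
      val documents = Seq("to be or not to be", "to map and to reduce")

      // Map phase: emit a (word, 1) pair for every word occurrence.
      val pairs = documents.flatMap(_.split("\\s+")).map(word => (word, 1))

      // Shuffle: group all pairs by their key (the word).
      // Reduce phase: sum the counts within each group.
      val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

      counts.foreach { case (word, n) => println(s"$word: $n") }
    }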
In this seminar, we want to have a look at two more recent frameworks, Apache Spark and Apache Flink. Both extend the classical Map/Reduce paradigm and surpass Hadoop in terms of both flexibility and performance. Accordingly, the older of the two, Apache Spark, has already gained considerable attention [2]. While both platforms are suited to largely the same types of data analysis tasks, they differ in their execution strategies. Thus, our goal is to obtain an objective comparison between Flink and Spark by implementing and optimizing a representative set of different problems on both platforms and comparing the implementations in terms of performance.
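As a taste of how similar the two APIs look on the surface while the execution differs underneath, here is a sketch of word count in the Scala APIs of both systems; the HDFS paths and application names are placeholders, and the Flink variant assumes the batch DataSet API.

    // Spark (Scala RDD API): builds a lineage of lazy transformations
    // that is executed in stages when an action (saveAsTextFile) runs.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile("hdfs:///tmp/input.txt")        // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///tmp/counts")     // placeholder path

    // Flink (Scala DataSet API): builds a dataflow plan that is optimized
    // and executed as a pipelined job when env.execute() is called.
    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.readTextFile("hdfs:///tmp/input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)
      .writeAsText("hdfs:///tmp/counts")        // placeholder path
    env.execute("WordCount")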
Each team of two students will be responsible for one problem class, implement appropriate algorithms on both platforms, and evaluate their performance. Over the course of the seminar, the students will get to know important concepts of distributed computing, e.g. the Map/Reduce paradigm and data locality, and will also gain hands-on experience with different technologies, e.g. the aforementioned Spark and Flink and the distributed filesystem HDFS. To reinforce this learning, we encourage the students to share their insights and help each other during the regular group meetings.