This seminar aims to give participating students the opportunity to gain experience with the development of distributed data analysis programs. Distributed computing, especially with the Map/Reduce framework Apache Hadoop, is an already widely adopted means to cope with some of the challenges summarized under the term Big Data [1]: the Map/Reduce paradigm is flexible enough to deal with arbitrary data formats (Variety), and its programs can be transparently distributed across computer clusters to handle large amounts of data (Volume).
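To illustrate the paradigm, the following is a minimal sketch of the classic word-count example over plain Scala collections; it is only meant to show the map/shuffle/reduce structure, which a framework like Hadoop distributes transparently across a cluster, and the example documents are made up.

    // Word count expressed as a map phase followed by a reduce phase.
    // On a cluster, the grouping (shuffle) and the per-group reduction
    // would be handled by the framework; here everything runs locally.
    object WordCountSketch extends App {
      val documents = Seq("to be or not to be", "to map and to reduce")

      // Map phase: emit a (word, 1) pair for every word occurrence.
      val pairs = documents.flatMap(_.split("\\s+")).map(word => (word, 1))

      // Shuffle: group all pairs by their key (the word).
      // Reduce phase: sum the counts within each group.
      val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

      counts.foreach { case (word, n) => println(s"$word: $n") }
    }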
In this seminar, we want to have a look at two more recent frameworks, Apache Spark and Apache Flink. Both extend the classical Map/Reduce paradigm and surpass Hadoop in terms of both flexibility and performance. Accordingly, the older of the two, Apache Spark, has already gained considerable attention [2]. While both platforms are suited to largely the same types of data analysis tasks, they differ in their execution strategies. Thus, our goal is to obtain an objective comparison between Flink and Spark by implementing and optimizing a representative set of different problems on both platforms and comparing the implementations in terms of performance.
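As a taste of how similar the two APIs look on the surface while the execution differs underneath, here is a sketch of word count in the Scala APIs of both systems; the HDFS paths and application names are placeholders, and the Flink variant assumes the batch DataSet API.

    // Spark (Scala RDD API): builds a lineage of lazy transformations
    // that is executed in stages when an action (saveAsTextFile) runs.
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    sc.textFile("hdfs:///tmp/input.txt")        // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///tmp/counts")     // placeholder path

    // Flink (Scala DataSet API): builds a dataflow plan that is optimized
    // and executed as a pipelined job when env.execute() is called.
    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    env.readTextFile("hdfs:///tmp/input.txt")   // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .groupBy(0)
      .sum(1)
      .writeAsText("hdfs:///tmp/counts")        // placeholder path
    env.execute("WordCount")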
Each team of two students will be responsible for one problem class, implement appropriate algorithms on both platforms, and evaluate their performance. Over the course of the seminar, the students will get to know important concepts of distributed computing, e.g. the Map/Reduce paradigm and data locality, and will also gain hands-on experience with different technologies, e.g. the aforementioned Spark and Flink and the distributed filesystem HDFS. To reinforce this learning, we encourage the students to share their insights and help each other during the regular group meetings.