Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Description

In many big data scenarios, data arrives at high speed as a never-ending stream of events. For these data streams, decisions often need to be made on the fly. Unfortunately, common algorithms are rarely applicable to streaming data: most were designed for offline settings, i.e., the entire data set needs to be scanned and processed (possibly multiple times) before a decision can be made. Therefore, novel algorithms for data streams are needed.
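To make the contrast between offline and streaming processing concrete, here is a minimal sketch of a one-pass computation. It is not taken from any of the seminar papers: it uses Welford's method for a running mean and variance, and the names `RunningStats` and `flag_outliers` as well as the threshold of three standard deviations are assumptions chosen purely for illustration. Each event is judged using only the events seen before it, with constant memory.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass mean and variance (Welford's method) with O(1) memory."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 until two values have been seen.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def flag_outliers(events, threshold=3.0):
    """Yield (value, is_outlier) per event without looking at later events.

    The threshold (in standard deviations) is an illustrative assumption.
    """
    stats = RunningStats()
    for x in events:
        std = math.sqrt(stats.variance)
        is_outlier = (
            stats.count >= 2
            and std > 0
            and abs(x - stats.mean) > threshold * std
        )
        yield x, is_outlier   # decide first ...
        stats.update(x)       # ... then fold the event into the state


if __name__ == "__main__":
    for value, flagged in flag_outliers([10, 11, 9, 10, 50]):
        print(value, flagged)
```

An offline variant would first scan the whole data set to compute mean and variance and only then classify each value, which is impossible when the stream never ends.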

In this seminar, we implement, evaluate, and ideally improve streaming algorithms from current research projects. We will look at data stream mining, recommendations for data streams, and algorithms for graph streams, where edges and vertices arrive in a streaming fashion. Students will develop data streaming techniques and implement prototypes based on current research projects: each team, consisting of two students, chooses and presents a challenging research task and implements the proposed solution using the streaming framework Apache Kafka with Kafka Streams. Students may select one of the papers listed here:

  • Chaitanya Manapragada, Geoffrey I. Webb, and Mahsa Salehi, Extremely Fast Decision Tree, KDD 2018.
    See also https://github.com/chaitanya-m/kdd2018.git
  • Kai Sheng Tai, Vatsal Sharan, Peter Bailis, and Gregory Valiant, Sketching Linear Classifiers over Data Streams, SIGMOD 2018.
    See also https://github.com/stanford-futuredata/wmsketch
  • Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig, Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing, SIGMOD 2018.
    See also https://github.com/streamclassifier/ColdFilter
  • Aneesh Sharma, Jerry Jiang, Praveen Bommannavar, Brian Larson, and Jimmy Lin, GraphJet: Real-Time Content Recommendations at Twitter, VLDB 2016.
    See also https://github.com/twitter/GraphJet
  • Xiangmin Zhou, Dong Qin, Xiaolu Lu, Lei Chen, and Yanchun Zhang, Online Social Media Recommendation over Streams, ICDE 2019.
  • Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra, SpotLight: Detecting Anomalies in Streaming Graphs, KDD 2018.

This is a first selection of current research about data streams (tbc). We welcome further suggestions.

This is a project seminar: There will be a few weekly lectures including an introductory lecture and an invited talk from industry about Stream Processing with Apache Kafka. Teams will frequently meet with the assigned supervisor.

Now that the project has finished successfully, the teams' implementations are available online:

Topic list

In teams of two students, you will complete the following tasks:

  • Active participation during all seminar events.
  • Short presentation of the selected research paper.
  • Intermediate presentations demonstrating insights regarding your research prototype.
  • Regular meetings with advisor.
  • Implementation of a research prototype with Kafka and Kafka Streams.
  • Final presentation demonstrating your solution.
  • Code & documentation (on GitHub). The documentation should contain information on how to execute and evaluate your solution. Furthermore, it should also show strengths and weaknesses of the implementation.

Organization

  • Project seminar for master students
  • 6 credit points, 4 SWS
  • Weekly meetings: either as group meetings or individual team meetings with a supervisor
  • Supervisors: Dr. Alexander Albrecht and Dr. Thorsten Papenbrock
  • The first session serves as an introduction to the topic and the seminar. Afterwards, you can register for the course by sending an informal email to Thorsten Papenbrock by April 12. If more than eight students register, slots will be assigned randomly.

In teams of two students, you will complete the following tasks (weights in parentheses):

  • (10%) Active participation during all seminar events.
  • (10%) Short presentation of the selected research paper.
  • (15%) Intermediate presentation demonstrating insights regarding your research prototype.
  • (00%) Regular meetings with advisor.
  • (20%) Implementation of a research prototype with Kafka and Kafka Streams.
  • (15%) Final presentation demonstrating your solution.
  • (30%) Code & documentation (on GitHub). The documentation should contain information on how to execute and evaluate your solution. Furthermore, it should also show strengths and weaknesses of the implementation.

Time Table

When: Wednesdays, 11:00 AM - 12:30 PM

Where: Campus II, Building F, Room F-2-10

Date        Topic
10. April   Introduction: Paper Presentations & Kafka
17. April   Kick-off: Paper Selection & Team Building
24. April   Guest Speaker Michael Noll (Confluent): "Kafka in Theory and Practice" (G-3.E.15/16)
01. May     -
08. May     Guest Speaker Arvid Heise (bakdata): "Kafka Streams with Q&A"
15. May     First Presentations: Paper & Implementation Approach
12. June    Intermediate Presentations
17. July    Final Presentations