Mining streaming data (Summer Semester 2019)

Lecturers: Prof. Dr. Felix Naumann (Information Systems), Dr. Thorsten Papenbrock (Information Systems), Dr. Alexander Albrecht (Information Systems)
Course website: https://hpi.de/naumann/teaching/teaching/ss-19/mining-streaming-data.html

General Information

Weekly hours per semester: 4
ECTS: 6
Graded: Yes
Enrollment deadline: 26.04.2019
Course format: Lecture / Seminar
Enrollment type: Compulsory elective module
Language of instruction: English
Maximum number of participants: 8

Degree Programs, Module Groups & Modules

IT-Systems Engineering MA

OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-K Concepts and Methods
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-S Specialization
OSIS: Operating Systems & Information Systems Technology
- HPI-OSIS-T Techniques and Tools
IT-Systems Engineering
- HPI-ITSE-A Analysis
IT-Systems Engineering
- HPI-ITSE-E Design
IT-Systems Engineering
- HPI-ITSE-K Construction
IT-Systems Engineering
- HPI-ITSE-M Maintenance

Data Engineering MA

SCAL: Scalable Data Systems
- HPI-SCAL-K Concepts and Methods
SCAL: Scalable Data Systems
- HPI-SCAL-T Techniques and Tools
SCAL: Scalable Data Systems
- HPI-SCAL-S Specialization
DATA: Data Analytics
- HPI-DATA-T Techniques and Tools
DATA: Data Analytics
- HPI-DATA-S Specialization
DATA: Data Analytics
- HPI-DATA-K Concepts and Methods

Description

In many big data scenarios, data arrives at high speed as a never-ending stream of
events. For these data streams, decisions often need to be made on the fly.
Unfortunately, common algorithms are rarely applicable in scenarios with streaming data.
Most algorithms were designed for offline settings, i.e., the entire data set needs to be
scanned and processed (possibly multiple times) before a decision can be made.
Therefore, novel algorithms for data streams are needed. In this seminar, we implement,
evaluate, and ideally improve streaming algorithms from current research projects. We will
look at data stream mining, recommendations for data streams, and algorithms for graph
streams where edges and vertices arrive in a streaming fashion.
Students will develop data streaming techniques and implement prototypes based on current
research projects: Each team, consisting of two students, chooses and presents a
challenging research task and implements the proposed solution using the streaming
framework Apache Kafka with Kafka Streams.
Students may select one of the papers listed here. This is a first selection of current research
about data streams (tbc). We welcome further suggestions.
● Chaitanya Manapragada, Geoffrey I. Webb, and Mahsa Salehi, Extremely Fast
Decision Tree, KDD 2018.
See also github.com/chaitanya-m/kdd2018.git
● Kai Sheng Tai, Vatsal Sharan, Peter Bailis, and Gregory Valiant, Sketching Linear
Classifiers over Data Streams, SIGMOD 2018.
See also github.com/stanford-futuredata/wmsketch
● Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig,
Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing,
SIGMOD 2018.
See also github.com/streamclassifier/ColdFilter
● Aneesh Sharma, Jerry Jiang, Praveen Bommannavar, Brian Larson, and Jimmy Lin,
GraphJet: Real-Time Content Recommendations at Twitter, VLDB 2016.
See also github.com/twitter/GraphJet
● Xiangmin Zhou, Dong Qin, Xiaolu Lu, Lei Chen, and Yanchun Zhang, Online Social
Media Recommendation over Streams, ICDE 2019.
● Dhivya Eswaran, Christos Faloutsos, Sudipto Guha, and Nina Mishra, SpotLight:
Detecting Anomalies in Streaming Graphs, KDD 2018.
This is a project seminar: There will be a few weekly lectures, including an introductory
lecture and an invited talk from industry about stream processing with Apache Kafka.
Teams will meet frequently with their assigned supervisor.

Prerequisites

For this seminar, participants need the following background:

Database knowledge (ideally Database Systems I and Database Systems II)
Data streaming and distributed programming knowledge (ideally Distributed Data Analytics or Distributed Data Management)

Assessment

In teams of two students, you will complete the following tasks:

(10%) Active participation during all seminar events.
(10%) Short presentation of the selected research paper.
(15%) Intermediate presentation demonstrating insights regarding your research prototype.
(0%) Regular meetings with advisor.
(20%) Implementation of a research prototype with Kafka and Kafka Streams.
(15%) Final presentation demonstrating your solution.
(30%) Code & documentation (on GitHub). The documentation should explain how to execute and evaluate your solution. It should also discuss the strengths and weaknesses of the implementation.

Dates

See the website of the research group
