Hasso-Plattner-Institut

Practical Applications of Multimedia Retrieval (Winter Semester 2016/2017)

Lecturers: Prof. Dr. Christoph Meinel (Internet Technologies and Systems), Dr. Haojin Yang (Internet Technologies and Systems)
Tutors: Dr. Haojin Yang, Christian Bartz

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline: 28.10.2016
  • Teaching format: SP
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering BA
IT-Systems Engineering MA
  • IT-Systems Engineering A
  • IT-Systems Engineering B
  • IT-Systems Engineering C
  • IT-Systems Engineering D
  • IT-Systems Engineering Analyse


In the last decade, digital libraries and web video portals have become increasingly popular, and the amount of video data available on the World Wide Web (WWW) is growing rapidly. According to official statistics from the video portal YouTube, more than 400 hours of video are uploaded every minute. Efficiently retrieving video data on the web or within large video archives has therefore become an important and challenging task.

In our current research, we focus on video analysis and multimedia information retrieval (MIR) using Deep Learning techniques. Deep Learning (DL), an area of machine learning that has emerged since 2006, has already had an impact on a wide range of multimedia information processing tasks. Recently, DL-based techniques have achieved substantial progress in fields including computer vision, speech recognition, image classification and natural language processing (NLP).

Topics in this seminar:

  • Human identity verification using deep facial representation. In modern face recognition, the conventional pipeline consists of four stages: face detection -> frontal face alignment -> facial representation -> classification. Convolutional Neural Networks (CNNs) have taken the computer vision community by storm, significantly improving the state of the art in many applications. In this project, we will develop a face verification solution based on a deep facial model. Existing frontal face alignment methods should be studied, and an efficient implementation is expected.
  • Indoor human activity recognition. The number of surveillance cameras, the importance of video analytics, the storage time for surveillance data and the strategic value of video surveillance are all increasing significantly. Indoor human activity recognition is an important part of event detection in surveillance videos. LIRIS provides a typical human activity recognition dataset containing grayscale, RGB and depth videos that show people performing various everyday activities (discussing, making telephone calls, handing over an item, etc.).
  • German word vector generation and potential applications. A "word vector" is a distributed representation of a word. Word vectors derive from Deep Learning techniques and have recently become popular in various Natural Language Processing (NLP) applications. So far, most successful work on word vectors has been based on English. In this seminar topic, we aim to train our own German word vectors, evaluate their quality and apply them in real applications.
  • DRAW: Deep network for image generation. Deep Learning approaches are "data-driven" machine learning approaches that need huge amounts of data to be trained successfully on a specific task. It is said that a deep neural network needs at least 1000 labelled samples per class to achieve acceptable performance and around 1 million labelled samples to outperform humans on the task in question. Obtaining enough training data is very challenging, as it is not feasible to manually label millions of real-world samples. One solution is to use artificially generated samples that are indistinguishable from real-world samples. We want to look at the so-called "DRAW" network, which is capable of generating samples containing text. We want to implement this architecture and evaluate whether it can be used to generate labelled examples for the task of scene text recognition.
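
To make the last stage of the face verification pipeline more concrete (the same comparison is also commonly applied to word vectors), here is a minimal Python sketch. It assumes that each face has already been turned into a fixed-length feature vector by some trained model, which is not shown. Cosine similarity is one common choice for comparing such vectors and is not necessarily the method used in the seminar projects; the function names and the threshold of 0.5 are purely illustrative assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def same_identity(emb_a: Sequence[float], emb_b: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: accept the pair if similarity >= threshold.

    The threshold is an assumption; in practice it would be tuned on
    held-out pairs of matching and non-matching faces.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy 3-dimensional "embeddings"; real facial representations are much longer.
anchor = [0.9, 0.1, 0.3]
probe = [0.8, 0.2, 0.4]
print(same_identity(anchor, probe))  # prints True (similarity is about 0.98)
```

Raising the threshold makes the check stricter (fewer false accepts, more false rejects), which is the trade-off a verification system has to balance.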


  • Strong interest in video/image processing, machine learning (Deep Learning) and/or computer vision

  • Software development in C/C++ or Python

  • Experience with OpenCV and machine learning applications is a plus


  • Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning", online version: http://www.deeplearningbook.org/
  • cs231n tutorials: Convolutional Neural Networks for Visual Recognition
  • Caffe: Deep learning framework by the BVLC
  • Chainer: A flexible framework of neural networks
  • ENCP/CNNdroid: Open Source Library for GPU-Accelerated Execution of Trained Deep Convolutional Neural Networks on Android
  • Y. Taigman, M. Yang, M. Ranzato and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification", Facebook AI Research, Menlo Park, CA, USA
  • Dong Yi, Zhen Lei, Shengcai Liao and Stan Z. Li, "Learning Face Representation from Scratch"
  • "DRAW: A Recurrent Neural Network for Image Generation"
  • "Survey on the attention based RNN model and its applications in computer vision"


The final evaluation will be based on:

  • Initial implementation / idea presentation, 10%

  • Final presentation, 20%

  • Report/Documentation, 12-18 pages, 30%

  • Implementation, 40%

  • Participation in the seminar (bonus points)


Thursday, 13:30-15:00

Room H-2.58

  • 20.10.2016, 13:30-15:00: Presentation of the topics (PDF)
  • By 27.10.2016: Choice of topics (registration via Doodle)
  • Announcement of topic and group assignments
  • Individual meetings with the supervisor
  • Early December: Technology talks and guided discussion (15+5 min each)
  • Presentation of the final results (15+5 min each)
  • By the end of February: Submission of implementation and documentation
  • By the end of March: Grading