Hasso-Plattner-InstitutSDG am HPI

Practical Video Analyses (Summer Semester 2016)

Lecturers: Prof. Dr. Christoph Meinel (Internet Technologies and Systems), Dr. Haojin Yang (Internet Technologies and Systems)
Tutor: Dr. Haojin Yang

General Information

  • Weekly hours per semester: 4
  • ECTS: 6
  • Graded: Yes
  • Enrollment deadline: 22.04.2016
  • Teaching format: Seminar
  • Enrollment type: Compulsory elective module
  • Maximum number of participants: 12

Degree Programs, Module Groups & Modules

IT-Systems Engineering BA
IT-Systems Engineering MA
  • IT-Systems Engineering A
  • IT-Systems Engineering B
  • IT-Systems Engineering C
  • IT-Systems Engineering D


Description

Over the last decade, digital libraries and web video portals have become increasingly popular, and the amount of video data available on the World Wide Web (WWW) is growing rapidly. According to the official statistics of the video portal YouTube, more than 300 hours of video are uploaded every minute. Efficiently retrieving video data on the web or within large video archives has therefore become an important and challenging task.

In our current research, we focus on video analysis and multimedia information retrieval (MIR) using deep learning techniques. Deep learning (DL), a relatively new area of machine learning (since 2006), has already influenced a wide range of multimedia information processing tasks. Recently, DL-based techniques have achieved substantial progress in fields including speech recognition, image classification, and language processing.

Topics in this seminar:

  • Multi-Lingual Automatic Subtitle Generator. The demand for subtitles for online lectures is high. In this seminar, we aim to develop an automatic subtitle generator that supports multiple input languages, such as English, German, and Chinese. The task begins with implementing an ASR (Automatic Speech Recognition) tool to transcribe the lecture speech, followed by a "Neural Network + Word Vector" approach that segments the transcript into sentence units of suitable length, which can be used as subtitle items. For the last part of the project, either a user-assisted subtitle editing platform or an automatic translation tool can be chosen.
  • Human Identity Verification Using Deep Facial Representation. In modern face recognition, the conventional pipeline consists of four stages: face detection -> frontal face alignment -> facial representation -> classification. Convolutional Neural Networks (CNNs) have taken the computer vision community by storm, significantly improving the state of the art in many applications. In this project, we will develop a robust deep facial model based on CNNs that provides a higher-level facial representation. In addition, a human identity verification demo system is expected, which will be built on an existing prototype system.
  • Neural Visual Translator: A Real-time System for Image Captioning. In this project, students are expected to implement a real-time system, which we call the "neural visual translator", that automatically generates descriptions for an input image. It will be based on the Caffe framework and our previous research work "Image Captioning with Deep Bidirectional LSTMs" (available on arXiv: http://arxiv.org/abs/1604.00790). The task is divided into two sub-tasks: (1) a required task: “translate” an image into a sentence (input: image, output: sentence); (2) an optional task: convert the “translated” sentence further into speech (input: image, output: sentence accompanied by audio). We provide data, a GPU machine, pre-trained recognition models, and the Python scripts for prediction.
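
For the identity verification topic above, the last stage of the pipeline (deciding whether two facial representations belong to the same person) can be illustrated with a minimal sketch. It assumes that the CNN has already produced a fixed-length feature vector per face and that cosine similarity with a fixed threshold is used for the decision; this is one common choice, not necessarily the method of the existing prototype. The function names, the toy vectors, and the threshold value are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)


def same_identity(features_a, features_b, threshold=0.5):
    """Accept the pair as the same person if the similarity reaches the threshold.

    The default threshold is a placeholder (assumption); in practice it would
    be tuned on a held-out set of matching and non-matching face pairs.
    """
    return cosine_similarity(features_a, features_b) >= threshold


# Toy 4-dimensional vectors standing in for CNN face descriptors (assumption).
enrolled = [0.9, 0.1, 0.3, 0.4]
probe_same = [0.8, 0.2, 0.3, 0.5]
probe_other = [-0.2, 0.9, -0.4, 0.1]

print(same_identity(enrolled, probe_same))
print(same_identity(enrolled, probe_other))
```

Real descriptors typically have hundreds or thousands of dimensions, and the threshold trades off false accepts against false rejects, so it should be chosen from validation data rather than fixed in advance.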


Prerequisites

  • Strong interest in video/image processing, machine learning (deep learning), and/or computer vision

  • Software development experience in C/C++ or Python

  • Experience with OpenCV and machine learning applications is a plus


Literature

  • Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville, "Deep Learning", online version: http://www.iro.umontreal.ca/~bengioy/dlbook/
  • Caffe: Deep learning framework by the BVLC
  • Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the Gap to Human-Level Performance in Face Verification", Facebook AI Research, Menlo Park, CA, USA
  • Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li, "Learning Face Representation from Scratch"
  • Cheng Wang et al., "Image Captioning with Deep Bidirectional LSTMs"


The final evaluation will be based on:

  • Initial implementation / idea presentation, 10%

  • Final presentation, 20%

  • Report/Documentation, 12-18 pages, 30%

  • Implementation, 40%

  • Participation in the seminar (bonus points)


Schedule

Monday, 13:30-15:00

Room H-2.57

11.04.2016 13:30-15:00

Presentation of the topics (PDF)

18.04.2016, by 23:59

Topic selection (sign up on Doodle)


Announcement of topic and group assignments


Individual meetings with the supervisor


Technology talks and guided discussion (15+10 min each)

18.07.2016 (12:45-14:15)

Presentation of the final results (15+10 min each)

Early August

Submission of implementation and documentation

By the end of August

Grading