Deep Learning of Multimodal Representations

Multimodal data is a collection of different types of data, such as text, images, video and audio, that together convey the shared semantic meaning of information from multiple sources. In recent years, the amount of such multimodal data has grown rapidly, posing a great challenge to multimedia analysis. There is a pressing need to process multimodal data intelligently and to extract different types of knowledge from it. The goal of this thesis is to develop deep learning models that automatically learn representations from multimodal data in order to solve high-level tasks. The major tasks this thesis explores are a ranking task (multimodal and cross-modal retrieval), a discriminative task (human action recognition) and a generative task (image captioning).

Some progress has been made in applying machine learning techniques to multimodal data. Existing approaches are often based either on carefully designed features for representing data or on shallow models for capturing the correlations between different modalities. However, these models encounter difficulties in establishing mapping relationships across modalities in a high-level semantic space. To address these shortcomings of conventional methods, this thesis develops deep learning architectures and models. With them we can not only automatically learn deep semantic representations from multiple modalities but also explore the latent relationships across modalities. We also investigate the learning of joint representations for multimodal data, which helps boost the performance of a single modality.
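To make the idea of a shared semantic space concrete, the following is a minimal sketch of cross-modal retrieval, not the thesis's actual model. It assumes that each modality has its own mapping into a common space and that images are ranked by cosine similarity to a text query. The function names (`project`, `cosine`, `rank_images`), the linear maps `W_TEXT` and `W_IMAGE`, and all feature values are hypothetical; in a deep model the maps would be learned networks rather than fixed matrices.

```python
import math


def project(matrix, vector):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_images(text_feature, image_features, w_text, w_image):
    """Return image indices ordered from most to least similar to the text query."""
    query = project(w_text, text_feature)
    scores = [cosine(query, project(w_image, f)) for f in image_features]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy values; stand-ins for learned, modality-specific encoders.
W_TEXT = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # 3-d text feature -> 2-d shared space
W_IMAGE = [[0.0, 1.0], [1.0, 0.0]]           # 2-d image feature -> 2-d shared space
IMAGES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
QUERY_TEXT = [1.0, 0.0, 5.0]

print(rank_images(QUERY_TEXT, IMAGES, W_TEXT, W_IMAGE))
```

Because both modalities end up in the same space, the same ranking function works in the other direction (image query, text candidates) by swapping the two maps.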

The models introduced in this thesis are primarily built by combining multiple basic deep neural networks, such as multilayer perceptrons (MLP), convolutional neural networks (CNN) and recurrent neural networks (RNN), or by extending these networks to multimodal scenarios that involve text, image, video and audio. The three major chapters of this thesis explore, respectively: (1) Visual-textual representation learning. This chapter aims to learn the relationship between images and their associated textual descriptions or tags. Such visual-textual correlations are essential in multimodal and cross-modal retrieval problems. (2) Video representation learning. Here we propose two approaches to learning video representations from multiple modalities, such as spatial, temporal and auditory information. The first approach uses metric learning, which leverages video-level similarity to learn discriminative video representations. The second approach explores the fusion of deep learning representations from spatial, temporal and auditory information and shows that such a fusion boosts action recognition performance. (3) Visual-language representation learning. This chapter designs an encoder-decoder architecture to connect images and word sequences. The learned visual-language models are able to generate novel sentence descriptions for a given input image.
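The two video approaches described above can be illustrated with a small sketch under stated assumptions. The abstract does not say which metric learning objective or which fusion scheme the thesis uses, so the code below swaps in two common stand-ins: a pairwise contrastive loss on video embeddings, and score-level (late) fusion that averages per-modality class probabilities. The thesis may instead fuse at the representation level. All names, weights and logits are hypothetical.

```python
import math


def contrastive_loss(a, b, same_label, margin=1.0):
    """Pairwise contrastive loss (one common convention, without the 1/2 factor).

    Pulls embeddings of similar videos together and pushes dissimilar ones
    at least `margin` apart.
    """
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if same_label:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(logits_by_modality, weights):
    """Weighted average of per-modality class probabilities."""
    n_classes = len(next(iter(logits_by_modality.values())))
    total_weight = sum(weights[name] for name in logits_by_modality)
    fused = [0.0] * n_classes
    for name, logits in logits_by_modality.items():
        probs = softmax(logits)
        for c in range(n_classes):
            fused[c] += weights[name] * probs[c] / total_weight
    return fused


# Hypothetical logits for a two-class action recognition problem.
LOGITS = {
    "spatial": [2.0, 0.0],   # e.g. from a CNN on RGB frames
    "temporal": [0.0, 1.0],  # e.g. from a network on motion input
    "auditory": [0.0, 0.0],  # e.g. from an audio network
}
WEIGHTS = {"spatial": 1.0, "temporal": 1.0, "auditory": 1.0}

fused = late_fusion(LOGITS, WEIGHTS)
print(fused.index(max(fused)))  # index of the predicted class after fusion
```

Equal weights are the simplest choice; in practice they would be tuned on validation data or replaced by a learned fusion layer.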

In this thesis, the effectiveness and generality of our proposed models are evaluated on multiple benchmark datasets. The extensive experiments show that our methods achieve highly competitive or state-of-the-art performance.
