Cheng Wang

Deep Learning of Multimodal Representations

A collection of different types of data, such as text, images, video, and audio, is called multimodal data; it can comprehensively illustrate the common semantic meaning of information from multiple sources. In recent years, the amount of such multimodal data has grown rapidly, posing a great challenge to multimedia analysis. There is a pressing need to intelligently process multimodal data and to extract different types of knowledge from it. The goal of this thesis is to develop deep learning models that automatically learn representations from multimodal data in order to solve high-level tasks. The major tasks this thesis explores include a ranking task (multimodal and cross-modal retrieval), a discriminative task (human action recognition), and a generative task (image captioning).

There has been some progress in delivering machine learning techniques for multimodal data. Existing approaches are often based either on well-designed features for representing data or on shallow models for capturing the correlations between different modalities. However, these models encounter difficulties in establishing mapping relationships across modalities in a high-level semantic space. To address the aforementioned shortcomings of conventional methods, in this thesis we develop deep learning architectures and models. Through them we can not only automatically learn deep semantic representations from multiple modalities but also explore the latent relationships across modalities. We also investigate the learning of joint representations for multimodal data, which is beneficial for boosting performance beyond that of any single modality.

The models introduced in this thesis are primarily built by combining multiple basic deep neural networks, such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs), or by extending these networks to multimodal scenarios that involve text, image, video, and audio. The three major chapters of this thesis respectively explore: (1) Visual-textual representation learning. This chapter aims to learn the relationship between images and their associated textual descriptions or tags. Such visual-textual correlations are essential in multimodal and cross-modal retrieval problems. (2) Video representation learning. Here we propose two approaches to learning video representations from multiple modalities, namely spatial, temporal, and auditory information. In the first approach, we use metric learning, which leverages video-level similarity to learn discriminative video representations. The second approach explores the fusion of deep representations learned from spatial, temporal, and auditory information and shows that such fusion boosts action recognition performance. (3) Visual-language representation learning. This chapter designs an encoder-decoder architecture to connect images and word sequences. The learned visual-language models are capable of generating novel sentence descriptions for a given input image.
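The metric-learning idea in (2) can be illustrated with a standard triplet loss, which encourages representations of clips showing the same action to lie closer together than representations of clips showing different actions. The sketch below is illustrative only: the toy vectors and the margin value are assumptions, not the thesis's actual formulation or features.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: the anchor should be closer to the
    positive (same action class) than to the negative (different
    action class) by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 3-D video representations (hypothetical values).
a = [0.0, 0.0, 0.0]   # anchor clip
p = [0.1, 0.0, 0.0]   # clip of the same action
n = [3.0, 4.0, 0.0]   # clip of a different action

print(triplet_loss(a, p, n))  # 0.0: the margin constraint is already satisfied
```

Minimizing this loss over many triplets shapes the representation space so that video-level similarity reflects action identity, which is the property the retrieval and recognition tasks rely on.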
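The fusion approach in (2) can be sketched as weighted late fusion: each modality stream (spatial, temporal, auditory) produces per-class scores, and the streams are combined into a single prediction. The stream names, score values, and weights below are hypothetical, chosen only to show the mechanics; the thesis's actual fusion scheme may differ.

```python
def late_fusion(stream_scores, weights):
    """Weighted late fusion: combine per-class score vectors from
    several modality streams into a single fused score vector."""
    n_classes = len(next(iter(stream_scores.values())))
    fused = [0.0] * n_classes
    for name, scores in stream_scores.items():
        w = weights[name]
        for i, s in enumerate(scores):
            fused[i] += w * s
    return fused

# Hypothetical per-class scores from three streams on a 3-class problem.
scores = {
    "spatial":  [0.6, 0.3, 0.1],
    "temporal": [0.5, 0.4, 0.1],
    "audio":    [0.2, 0.7, 0.1],
}
weights = {"spatial": 0.5, "temporal": 0.3, "audio": 0.2}

fused = late_fusion(scores, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

Here the fused prediction follows the spatial and temporal streams (class 0) even though the audio stream alone favors class 1; in practice the weights would be tuned, or the fusion replaced by a learned layer over the concatenated stream representations.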

In this thesis, the effectiveness and generality of the proposed models are evaluated on multiple benchmark datasets. Extensive experiments show that our methods achieve highly competitive or state-of-the-art performance.