Hasso-Plattner-Institut
 

Haojin Yang

Deep Representation Learning for Multimedia Data Analysis

In the last decade, due to the rapid development of digital devices, Internet bandwidth and social networks, an enormous amount of multimedia data has been created on the WWW (World Wide Web). According to publicly available statistics, more than 400 hours of video are uploaded to YouTube every minute, and 350 million photos are uploaded to Facebook every day; by 2021, video traffic is expected to make up more than 82% of all consumer Internet traffic. There is thus a pressing need to develop automated technologies for analyzing and indexing this "big multimedia data" more accurately and efficiently. One of the most prominent current approaches is deep learning, which is recognized as a particularly efficient machine learning method for multimedia data.
Deep learning (DL) is a subfield of Machine Learning and Artificial Intelligence, based on a set of algorithms that attempt to learn representations of data and model their high-level abstractions. Since 2006, DL has attracted more and more attention in both academia and industry. Recently, DL has produced record-breaking results in a broad range of areas, such as beating human players in strategy games like Go (Google's AlphaGo), autonomous driving, and achieving dermatologist-level classification of skin cancer.
In this Habilitationsschrift, I mainly address the following research problems:

Natural scene text detection and recognition with deep learning. In this work, we developed two automatic scene text recognition systems, SceneTextReg and SEE, following a supervised and a semi-supervised processing scheme, respectively. We designed novel neural network architectures and achieved promising results in both recognition accuracy and efficiency.
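For illustration only: the recognition stage of such systems typically converts per-timestep character probabilities into a text string. The sketch below uses CTC-style greedy decoding, a common choice for sequence-based text recognizers; it is a generic example, not necessarily the exact decoding scheme of SceneTextReg or SEE, and the character set and probabilities are made up.

    import numpy as np

    CHARS = ["-", "t", "e", "x"]          # index 0 acts as the CTC blank symbol

    def greedy_ctc_decode(probs):
        """Collapse repeated symbols, then drop blanks."""
        best = probs.argmax(axis=1)       # most likely symbol per timestep
        out, prev = [], None
        for idx in best:
            if idx != prev and idx != 0:  # skip repeats and blanks
                out.append(CHARS[idx])
            prev = idx
        return "".join(out)

    probs = np.array([[0.1, 0.8, 0.05, 0.05],   # 't'
                      [0.1, 0.8, 0.05, 0.05],   # 't' again, collapsed
                      [0.7, 0.1, 0.1, 0.1],     # blank
                      [0.1, 0.1, 0.7, 0.1],     # 'e'
                      [0.1, 0.1, 0.1, 0.7]])    # 'x'
    print(greedy_ctc_decode(probs))             # -> "tex"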
Deep representation learning for multimodal data. We studied two sub-topics: visual-textual feature fusion in multimodal and cross-modal document retrieval tasks, and visual-language feature learning with image captioning as its use case. The developed captioning model robustly generates new sentence descriptions for a given image in a very efficient way.
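As a rough illustration of visual-textual feature fusion, a common baseline is to L2-normalize the feature vector of each modality and concatenate them into one joint representation. The sketch below is a generic example, not the fusion architecture developed in this thesis; the feature dimensions (2048 for a CNN image feature, 300 for a text embedding) are assumptions.

    import numpy as np

    def fuse(visual_feat, text_feat):
        """L2-normalize each modality, then concatenate (late-fusion baseline)."""
        v = visual_feat / (np.linalg.norm(visual_feat) + 1e-8)
        t = text_feat / (np.linalg.norm(text_feat) + 1e-8)
        return np.concatenate([v, t])

    rng = np.random.default_rng(0)
    joint = fuse(rng.random(2048), rng.random(300))  # e.g. CNN + word-embedding features
    print(joint.shape)                               # (2348,)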
We developed BMXNet, an open-source Binary Neural Network (BNN) implementation based on the well-known deep learning framework Apache MXNet. We further conducted an extensive study on training strategies and the execution efficiency of BNNs on image classification tasks. We presented meaningful scientific insights and made our models and code publicly available; these can serve as a solid foundation for future research work. (A minimal sketch of the underlying binarization idea closes this abstract.)

Operability and accuracy of all proposed methods have been evaluated using publicly available benchmark data sets. While designing and developing theoretical algorithms, we also explored how to apply these algorithms to practical applications. We investigated their applicability in two use cases, namely automatic online lecture analysis and medical image segmentation. The results demonstrate that such techniques might significantly impact or even disrupt traditional industries and our daily lives.
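As a closing illustration of the binarization idea behind the BNN work above: real-valued weights are mapped to {-1, +1}, so dot products reduce to sign flips and additions (XNOR and popcount operations on suitable hardware). The sketch below is a generic NumPy example, not BMXNet's actual API.

    import numpy as np

    def binarize(w):
        """Sign binarization: map real-valued weights to {-1, +1}.
        During training, full-precision weights are typically retained and
        gradients pass through the sign via a straight-through estimator."""
        return np.where(w >= 0.0, 1.0, -1.0)

    rng = np.random.default_rng(0)
    real_w = rng.standard_normal((4, 3))  # latent full-precision weights
    x = np.sign(rng.standard_normal(4))   # binarized input activations

    y = x @ binarize(real_w)              # multiply-accumulate becomes add/subtract
    print(y)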