

Multimodal Deep Learning
Jiquan Ngiam1 jngiam@cs.stanford.edu
Aditya Khosla1 aditya86@cs.stanford.edu
Mingyu Kim1 minkyu89@cs.stanford.edu
Juhan Nam1 juhan@ccrma.stanford.edu
Honglak Lee2 honglak@eecs.umich.edu
Andrew Y. Ng1 ang@cs.stanford.edu
1 Computer Science Department, Stanford University, Stanford, CA 94305, USA
2 Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109, USA
Abstract
Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.
1. Introduction
In speech recognition, humans are known to integrate audio-visual information in order to understand speech. This was first exemplified in the McGurk effect (McGurk & MacDonald, 1976) where a visual /ga/ with a voiced /ba/ is perceived as /da/ by most subjects. In particular, the visual modality provides information on the place of articulation and muscle movements (Summerfield, 1992) which can often help to disambiguate between speech with similar acoustics (e.g., the unvoiced consonants /p/ and /k/ ).
Multimodal learning involves relating information from multiple sources. For example, images and 3-d depth scans are correlated at first-order as depth discontinuities often manifest as strong edges in images. Conversely, audio and visual data for speech recognition have correlations at a “mid-level”, as phonemes and visemes (lip pose and motions); it can be difficult to relate raw pixels to audio waveforms or spectrograms.
In this paper, we are interested in modeling “mid-level” relationships, thus we choose to use audio-visual speech classification to validate our methods. In particular, we focus on learning representations for speech audio which are coupled with videos of the lips.
We will consider the learning settings shown in Figure 1. The overall task can be divided into three phases – feature learning, supervised training, and testing. A simple linear classifier is used for supervised training and testing to examine different feature learning models with multimodal data. In particular, we consider three learning settings – multimodal fusion, cross modality learning, and shared representation learning.
In the multimodal fusion setting, data from all modalities is available at all phases; this represents the typical setting considered in most prior work in audiovisual speech recognition (Potamianos et al., 2004). In cross modality learning, data from multiple modalities is available only during feature learning; during the supervised training and testing phase, only data from a single modality is provided. For this setting, the aim is to learn better single modality representations given unlabeled data from multiple modalities. Last, we consider a shared representation learning setting, which is unique in that different modalities are presented for supervised training and testing. This setting allows us to evaluate if the feature representations can capture correlations across different modalities. Specifically, studying this setting allows us to assess whether the learned representations are modality-invariant.
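To make the three settings concrete, the following minimal Python sketch (ours, not part of the paper) records which modalities are assumed available in each phase; "A" denotes audio and "V" denotes video.

# Illustrative summary only; the dictionary keys and structure are ours.
# Cross modality learning and shared representation learning also have
# symmetric variants with the roles of A and V swapped.
LEARNING_SETTINGS = {
    "multimodal_fusion": {
        "feature_learning": "A+V", "supervised_training": "A+V", "testing": "A+V",
    },
    "cross_modality_learning": {
        "feature_learning": "A+V", "supervised_training": "V", "testing": "V",
    },
    "shared_representation_learning": {
        "feature_learning": "A+V", "supervised_training": "A", "testing": "V",
    },
}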
In the following sections, we first describe the building blocks of our model. We then present different multimodal learning models leading to a deep network that is able to perform the various multimodal learning tasks. Finally, we report experimental results and conclude.
2. Background
Recent work on deep learning (Hinton & Salakhutdinov, 2006; Salakhutdinov & Hinton, 2009) has examined how deep sigmoidal networks can be trained to produce useful representations for handwritten digits and text. The key idea is to use greedy layer-wise training with Restricted Boltzmann Machines (RBMs) followed by fine-tuning. We use an extension of RBMs with sparsity (Lee et al., 2007), which have been shown to learn meaningful features for digits and natural images. In the next section, we review the sparse RBM, which is used as a layer-wise building block for our models.
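Since the greedy layer-wise recipe underlies the models that follow, the NumPy sketch below illustrates it under simplifying assumptions: binary RBM units, 1-step contrastive divergence (CD-1), and no sparsity penalty or fine-tuning. Hyper-parameters and function names are ours, not the authors'.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.01):
    """Train a binary RBM with 1-step contrastive divergence (CD-1).

    Simplified sketch: the paper uses sparse RBMs and additional
    fine-tuning that are not shown here.
    """
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_visible)  # visible biases
    for _ in range(epochs):
        # Positive phase: hidden activations given the data.
        h_prob = sigmoid(data @ W + b)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step (reconstruction).
        v_recon = sigmoid(h_sample @ W.T + c)
        h_recon = sigmoid(v_recon @ W + b)
        # CD-1 gradient approximation.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
        b += lr * (h_prob.mean(0) - h_recon.mean(0))
        c += lr * (data.mean(0) - v_recon.mean(0))
    return W, b

def greedy_layerwise_pretrain(data, layer_sizes):
    """Stack RBMs: each layer is trained on the hidden activations of the
    previous layer, as in greedy layer-wise pretraining."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # representation fed to the next layer
    return layers

# Toy usage: two-layer pretraining on random binary "data".
toy = (rng.random((100, 64)) < 0.5).astype(float)
stack = greedy_layerwise_pretrain(toy, [32, 16])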
2.1 Sparse restricted Boltzmann machines
The RBM is an undirected graphical model with hidden variables (h) and visible variables (v) (Figure 2a).
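The remainder of this subsection is truncated in this excerpt. For reference, a standard formulation of the Gaussian-visible, binary-hidden RBM used for real-valued inputs (following Lee et al., 2007) is given below; the constants and the sparsity penalty shown are the commonly used forms rather than a verbatim copy of the paper's equations.

% Gaussian-visible, binary-hidden RBM energy (standard form):
\[
  -\log P(\mathbf{v},\mathbf{h}) \;\propto\; E(\mathbf{v},\mathbf{h})
  = \frac{1}{2\sigma^2}\sum_i v_i^2
    - \frac{1}{\sigma^2}\Big(\sum_i c_i v_i + \sum_j b_j h_j
    + \sum_{i,j} v_i W_{ij} h_j\Big).
\]
% Sparsity is typically encouraged by penalizing the deviation of the mean
% hidden activation from a small target value \rho over m training examples:
\[
  \lambda \sum_j \Big(\rho - \tfrac{1}{m}\sum_{l=1}^{m}
  \mathbb{E}\big[h_j \mid \mathbf{v}^{(l)}\big]\Big)^2 .
\]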
