Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Attention-based models are useful for multimodal processing.
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
The resulting models are able to perform unimodal inference on both video and audio benchmarks.

Paper Content

Introduction

Humans and other animals integrate multiple modalities to build their view of the world
Humans can process information and perform tasks with only one modality
Most multimodal models need all modalities to operate
Zorro is a multimodal Transformer architecture that can operate with single or multiple modalities
Zorro has separate unimodal and multimodal representation streams
Zorro can be pre-trained with audio-visual contrastive loss
Zorro achieves state-of-the-art performance on relevant benchmarks

Multimodal perception is challenging
Convolutional neural networks have been used to fuse activations from different modalities
Self-supervised audio-visual learning has been used to employ cross-modality similarity
Transformer architectures have been used to process different modalities
Zorro masking produces unimodal outputs without running the model multiple times
Zorro regulates cross-modality communication by masking latent connections
Zorro is used for supervised and self-supervised training

Architecture

Zorro architecture consists of three main blocks
Masking binary tensor used to specify which vectors are connected
Mask applied to self-attention and decoding cross-attention
Four outputs: audio-specific, video-specific, fusion-specific, global
Zorro-Swin and Zorro-HiP variants proposed
Contrastive audio-visual methods learn representations by aligning audio and video into common embedding space
Self-supervised loss contrasting unimodal representations with multimodal one

Experiments

Evaluated Zorro architecture on multiple settings
Training and evaluation procedures, as well as main datasets used
Evaluated against state-of-the-art models on 3 audiovisual and 1 vision benchmark
Ablated main design decisions and compared different architectures

Experimental details

Zorro is pre-trained using self-supervision and standard supervision
Four datasets are used for pre-training: AudioSet, YouTube-8M, ACAV-100M, and ImageNet-21k
Zorro is evaluated on AudioSet, VGGSound, and Kinetics-400
Zorro is also evaluated on unimodal fine-tuning tasks: Kinetics-400 for vision and ESC-50 for audio
Inputs to the model are video and audio

State-of-the-art comparison

Evaluated Zorro against state-of-the-art methods
Trained Zorro from scratch on AudioSet-2M using audio and visual modalities
Zorro matches or outperforms other methods trained on AudioSet-2M
Zorro performs similarly to supervised state-of-the-art when pre-trained only with self-supervision
Zorro performs comparably with SOTA models when initialized with ViT pre-trained on ImageNet-21k
Zorro performs similarly to MBT when pre-trained on YouTube-8M
Zorro can be trained using unimodal self-supervised methods
Zorro performs only 2.4% worse than when using audio-visual input when fine-tuned on Kinetics-400 (video only)
Compared Zorro with state-of-the-art in two settings: when labels are not used in pre-training or when labels are used

Architecture comparison

Zorro-Swin performs best when trained from scratch
Swin trains 25% faster than ViT
HiP is the fastest of the three
ViT is the best when finetuned after contrastive pre-training
Swin and HiP are faster and retain most of the performance

Zorro model flexibility

Zorro can produce meaningful unimodal outputs when fed with unimodal data
Models without unimodal output suffer when one modality is missing
Zorro and using two separate modality streams perform well when only one modality is provided
Zorro can adapt to unimodal training and provide useful initialization for multi-modal fine-tuning

Masking configurations

Four different types of attention masking are studied
Data independent streams are evaluated, where both models share weights but modalities are not connected
Input level fusion is evaluated, which consists of no masking in the model
Bottleneck masking is evaluated, where the fusion tokens can attend to each modality’s tokens
Zorro masking is compared to the other masking strategies
Modality independent streams are crucial for self-supervised training
Separate modality streams are useful for supervised learning
Zorro performs better when modality streams are independently treated
Zorro is trained using equation 3, which produces a slight decrease in performance

Conclusion

Introduced Zorro, a novel Transformer masking configuration for simultaneous unimodal and multimodal training and inference
Generates both unimodal and multimodal outputs
Improves performance when trained with supervised loss
Can be self-supervised with a contrastive loss
Evaluated on multimodal tasks
Zorro-ViT based on ViT-B/16 architecture, adapted for multi-modal processing
Zorro-Swin based on Swin architecture, adapted for video data
Zorro-HiP based on Hierarchical Perceiver architecture
Inputs are video frames and audio spectrograms, patched using 3D-convolutions
Each modality and fusion tokens processed through independent HiP blocks
Information blocked from flowing towards unimodal hidden representation
Queries cross-attend to unimodal and multimodal representation
Masking at decoding stage to produce unimodal and multimodal outputs
Zorro-ViT, Zorro-Swin and Zorro-HiP compared in terms of training speed
Zorro-ViT performs best when fine-tuning
Augment audio with SpecAugment and frequency jittering
Evaluate using 8 frame clips with 3 crops and average predictions
Optimizers and hyperparameters for models in Table 5
Train all models for 50 epochs, except ACAV-100M models which are trained for 10 epochs
Contrastive learning with 3 projectors and temperature τ = 0.08

Link to paper#

Abstract#

Paper Content#

Introduction#

Related work#

Architecture#

Experiments#

Experimental details#

State-of-the-art comparison#

Architecture comparison#

Zorro model flexibility#

Masking configurations#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Related work

Architecture

Experiments

Experimental details

State-of-the-art comparison

Architecture comparison

Zorro model flexibility

Masking configurations

Conclusion