Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Attention-based models are useful for multimodal processing.
- Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
- Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
- The resulting models are able to perform unimodal inference on both video and audio benchmarks.
Paper Content
Introduction
- Humans and other animals integrate multiple modalities to build their view of the world
- Humans can process information and perform tasks with only one modality
- Most multimodal models need all modalities to operate
- Zorro is a multimodal Transformer architecture that can operate with single or multiple modalities
- Zorro has separate unimodal and multimodal representation streams
- Zorro can be pre-trained with audio-visual contrastive loss
- Zorro achieves state-of-the-art performance on relevant benchmarks
Related work
- Multimodal perception is challenging
- Convolutional neural networks have been used to fuse activations from different modalities
- Self-supervised audio-visual learning has been used to employ cross-modality similarity
- Transformer architectures have been used to process different modalities
- Zorro masking produces unimodal outputs without running the model multiple times
- Zorro regulates cross-modality communication by masking latent connections
- Zorro is used for supervised and self-supervised training
Architecture
- Zorro architecture consists of three main blocks
- Masking binary tensor used to specify which vectors are connected
- Mask applied to self-attention and decoding cross-attention
- Four outputs: audio-specific, video-specific, fusion-specific, global
- Zorro-Swin and Zorro-HiP variants proposed
- Contrastive audio-visual methods learn representations by aligning audio and video into common embedding space
- Self-supervised loss contrasting unimodal representations with multimodal one
Experiments
- Evaluated Zorro architecture on multiple settings
- Training and evaluation procedures, as well as main datasets used
- Evaluated against state-of-the-art models on 3 audiovisual and 1 vision benchmark
- Ablated main design decisions and compared different architectures
Experimental details
- Zorro is pre-trained using self-supervision and standard supervision
- Four datasets are used for pre-training: AudioSet, YouTube-8M, ACAV-100M, and ImageNet-21k
- Zorro is evaluated on AudioSet, VGGSound, and Kinetics-400
- Zorro is also evaluated on unimodal fine-tuning tasks: Kinetics-400 for vision and ESC-50 for audio
- Inputs to the model are video and audio
State-of-the-art comparison
- Evaluated Zorro against state-of-the-art methods
- Trained Zorro from scratch on AudioSet-2M using audio and visual modalities
- Zorro matches or outperforms other methods trained on AudioSet-2M
- Zorro performs similarly to supervised state-of-the-art when pre-trained only with self-supervision
- Zorro performs comparably with SOTA models when initialized with ViT pre-trained on ImageNet-21k
- Zorro performs similarly to MBT when pre-trained on YouTube-8M
- Zorro can be trained using unimodal self-supervised methods
- Zorro performs only 2.4% worse than when using audio-visual input when fine-tuned on Kinetics-400 (video only)
- Compared Zorro with state-of-the-art in two settings: when labels are not used in pre-training or when labels are used
Architecture comparison
- Zorro-Swin performs best when trained from scratch
- Swin trains 25% faster than ViT
- HiP is the fastest of the three
- ViT is the best when finetuned after contrastive pre-training
- Swin and HiP are faster and retain most of the performance
Zorro model flexibility
- Zorro can produce meaningful unimodal outputs when fed with unimodal data
- Models without unimodal output suffer when one modality is missing
- Zorro and using two separate modality streams perform well when only one modality is provided
- Zorro can adapt to unimodal training and provide useful initialization for multi-modal fine-tuning
Masking configurations
- Four different types of attention masking are studied
- Data independent streams are evaluated, where both models share weights but modalities are not connected
- Input level fusion is evaluated, which consists of no masking in the model
- Bottleneck masking is evaluated, where the fusion tokens can attend to each modality’s tokens
- Zorro masking is compared to the other masking strategies
- Modality independent streams are crucial for self-supervised training
- Separate modality streams are useful for supervised learning
- Zorro performs better when modality streams are independently treated
- Zorro is trained using equation 3, which produces a slight decrease in performance
Conclusion
- Introduced Zorro, a novel Transformer masking configuration for simultaneous unimodal and multimodal training and inference
- Generates both unimodal and multimodal outputs
- Improves performance when trained with supervised loss
- Can be self-supervised with a contrastive loss
- Evaluated on multimodal tasks
- Zorro-ViT based on ViT-B/16 architecture, adapted for multi-modal processing
- Zorro-Swin based on Swin architecture, adapted for video data
- Zorro-HiP based on Hierarchical Perceiver architecture
- Inputs are video frames and audio spectrograms, patched using 3D-convolutions
- Each modality and fusion tokens processed through independent HiP blocks
- Information blocked from flowing towards unimodal hidden representation
- Queries cross-attend to unimodal and multimodal representation
- Masking at decoding stage to produce unimodal and multimodal outputs
- Zorro-ViT, Zorro-Swin and Zorro-HiP compared in terms of training speed
- Zorro-ViT performs best when fine-tuning
- Augment audio with SpecAugment and frequency jittering
- Evaluate using 8 frame clips with 3 crops and average predictions
- Optimizers and hyperparameters for models in Table 5
- Train all models for 50 epochs, except ACAV-100M models which are trained for 10 epochs
- Contrastive learning with 3 projectors and temperature τ = 0.08