Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Attention-based models are useful for multimodal processing.
  • Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
  • Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
  • The resulting models are able to perform unimodal inference on both video and audio benchmarks.

Paper Content

Introduction

  • Humans and other animals integrate multiple modalities to build their view of the world
  • Humans can process information and perform tasks with only one modality
  • Most multimodal models need all modalities to operate
  • Zorro is a multimodal Transformer architecture that can operate with single or multiple modalities
  • Zorro has separate unimodal and multimodal representation streams
  • Zorro can be pre-trained with audio-visual contrastive loss
  • Zorro achieves state-of-the-art performance on relevant benchmarks
  • Multimodal perception is challenging
  • Convolutional neural networks have been used to fuse activations from different modalities
  • Self-supervised audio-visual learning has been used to employ cross-modality similarity
  • Transformer architectures have been used to process different modalities
  • Zorro masking produces unimodal outputs without running the model multiple times
  • Zorro regulates cross-modality communication by masking latent connections
  • Zorro is used for supervised and self-supervised training

Architecture

  • Zorro architecture consists of three main blocks
  • Masking binary tensor used to specify which vectors are connected
  • Mask applied to self-attention and decoding cross-attention
  • Four outputs: audio-specific, video-specific, fusion-specific, global
  • Zorro-Swin and Zorro-HiP variants proposed
  • Contrastive audio-visual methods learn representations by aligning audio and video into common embedding space
  • Self-supervised loss contrasting unimodal representations with multimodal one

Experiments

  • Evaluated Zorro architecture on multiple settings
  • Training and evaluation procedures, as well as main datasets used
  • Evaluated against state-of-the-art models on 3 audiovisual and 1 vision benchmark
  • Ablated main design decisions and compared different architectures

Experimental details

  • Zorro is pre-trained using self-supervision and standard supervision
  • Four datasets are used for pre-training: AudioSet, YouTube-8M, ACAV-100M, and ImageNet-21k
  • Zorro is evaluated on AudioSet, VGGSound, and Kinetics-400
  • Zorro is also evaluated on unimodal fine-tuning tasks: Kinetics-400 for vision and ESC-50 for audio
  • Inputs to the model are video and audio

State-of-the-art comparison

  • Evaluated Zorro against state-of-the-art methods
  • Trained Zorro from scratch on AudioSet-2M using audio and visual modalities
  • Zorro matches or outperforms other methods trained on AudioSet-2M
  • Zorro performs similarly to supervised state-of-the-art when pre-trained only with self-supervision
  • Zorro performs comparably with SOTA models when initialized with ViT pre-trained on ImageNet-21k
  • Zorro performs similarly to MBT when pre-trained on YouTube-8M
  • Zorro can be trained using unimodal self-supervised methods
  • Zorro performs only 2.4% worse than when using audio-visual input when fine-tuned on Kinetics-400 (video only)
  • Compared Zorro with state-of-the-art in two settings: when labels are not used in pre-training or when labels are used

Architecture comparison

  • Zorro-Swin performs best when trained from scratch
  • Swin trains 25% faster than ViT
  • HiP is the fastest of the three
  • ViT is the best when finetuned after contrastive pre-training
  • Swin and HiP are faster and retain most of the performance

Zorro model flexibility

  • Zorro can produce meaningful unimodal outputs when fed with unimodal data
  • Models without unimodal output suffer when one modality is missing
  • Zorro and using two separate modality streams perform well when only one modality is provided
  • Zorro can adapt to unimodal training and provide useful initialization for multi-modal fine-tuning

Masking configurations

  • Four different types of attention masking are studied
  • Data independent streams are evaluated, where both models share weights but modalities are not connected
  • Input level fusion is evaluated, which consists of no masking in the model
  • Bottleneck masking is evaluated, where the fusion tokens can attend to each modality’s tokens
  • Zorro masking is compared to the other masking strategies
  • Modality independent streams are crucial for self-supervised training
  • Separate modality streams are useful for supervised learning
  • Zorro performs better when modality streams are independently treated
  • Zorro is trained using equation 3, which produces a slight decrease in performance

Conclusion

  • Introduced Zorro, a novel Transformer masking configuration for simultaneous unimodal and multimodal training and inference
  • Generates both unimodal and multimodal outputs
  • Improves performance when trained with supervised loss
  • Can be self-supervised with a contrastive loss
  • Evaluated on multimodal tasks
  • Zorro-ViT based on ViT-B/16 architecture, adapted for multi-modal processing
  • Zorro-Swin based on Swin architecture, adapted for video data
  • Zorro-HiP based on Hierarchical Perceiver architecture
  • Inputs are video frames and audio spectrograms, patched using 3D-convolutions
  • Each modality and fusion tokens processed through independent HiP blocks
  • Information blocked from flowing towards unimodal hidden representation
  • Queries cross-attend to unimodal and multimodal representation
  • Masking at decoding stage to produce unimodal and multimodal outputs
  • Zorro-ViT, Zorro-Swin and Zorro-HiP compared in terms of training speed
  • Zorro-ViT performs best when fine-tuning
  • Augment audio with SpecAugment and frequency jittering
  • Evaluate using 8 frame clips with 3 crops and average predictions
  • Optimizers and hyperparameters for models in Table 5
  • Train all models for 50 epochs, except ACAV-100M models which are trained for 10 epochs
  • Contrastive learning with 3 projectors and temperature τ = 0.08