Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Multimodal models are becoming more effective due to unified components
  • CLIPPO uses a single encoder to process both regular images and text rendered as images
  • CLIPPO performs image-based tasks with half the number of parameters and no text-specific tower or embedding
  • CLIPPO can perform well on natural language understanding tasks without any word-level loss
  • CLIPPO can achieve strong performance on multilingual multimodal retrieval without a tokenizer

Paper Content

Introduction

  • Large-scale multimodal training of Transformer-based models has improved performance in different domains.
  • A single large pre-trained model can outperform task-specific expert models.
  • Large multimodal models often use modality or dataset-specific encoders and decoders.
  • Transformer architecture works well on text, vision, audio, and other domains.
  • Mapping different modalities into a single shared embedding space simplifies the input/output interface.
  • Alternative representations of modalities allow harnessing in one domain neural architectures or training procedures designed for another domain.
  • CLIPPO performs similarly to CLIP on image classification and text/image retrieval.
  • CLIPPO can perform complex language understanding tasks without any left-to-right language modelling.
  • CLIPPO can obtain good performance on VQA when simply rendering the image and text together.
  • CLIPPO is related to CLIP and ALIGN which use contrastive training on noisy web data
  • Follow-ups have scaled further and used image representation learning
  • Model unification via weight-sharing has been explored
  • Co-training distinct tasks is a popular strategy
  • Self-supervised learning algorithms are used to unify task training
  • Discriminative tasks are used to learn representations for downstream modalities
  • Generative approaches to multimodal modelling have been scaled to billions of parameters
  • Document and user interface understanding models are trained on multimodal data sets
  • Contrastive pretraining on sentence pairs is explored as an auxiliary objective
  • Augmentations to generate text pairs involve word deletion, reordering, etc.
  • PIXEL is a closely related method from the NLP domain
  • Tokenizer models include WordPiece, Byte-Pair Encoding, and SentencePiece

Contrastive language-image pretraining with pixels

  • Contrastive language-image pretraining is a powerful, scalable paradigm to train versatile vision models on web-scale data sets
  • Image/alt-text pairs are automatically collected from the web and are usually noisy
  • Two encoders are jointly trained, a text encoder and an image encoder, to embed images and alt-texts into a shared latent space
  • Encoders are trained with a contrastive loss to make corresponding image and alt-text embeddings similar and dissimilar from all other embeddings
  • Once trained, encoder pair can be used for zero-shot classification, image/text retrieval, and supervised transfer learning
  • CLIPPO is a single vision transformer model that can understand both images and text and provides a single representation for image, image-language, and language understanding tasks
  • CLIPPO alleviates common hurdles with text processing, like tokenizer and vocabulary development
  • CLIPPO is competitive with strong baseline language models on the GLUE benchmark

Experiments

Training details and models

  • We use a single training setup for all our baselines and visual text models.
  • We use ViT-B/16 and ViT-L/16 architectures with a MAP head.
  • Representation dimension is 768.
  • Batch size is 10,240 and training steps are 250k.
  • Adafactor optimizer with learning rate of 10-3 and decoupled weight decay of 10-4.
  • CLIP-style models use T5en SentencePiece tokenizer.
  • Sequence length is 196.
  • We use WebLI data set with 10 billion images and 12 billion alt-texts in 109 languages.
  • Text/text data is from Colossal Clean Crawled Corpus (C4).
  • We also experiment with WMT19 data set and back-translated English sentences.

Evaluations and metrics

  • Evaluated vision and vision/language understanding capabilities of models using standard metrics
  • Reported classification accuracy on ImageNet-1k and recall@1 for cross-modal retrieval on MS-COCO and Flickr30k
  • Tested low-data transfer performance with 10shot accuracy on ImageNet-1k
  • Evaluated on VQA benchmark VQAv2
  • Evaluated multilingual capabilities with zero-shot retrieval on CrossModal3600
  • Evaluated language understanding capabilities on GLUE benchmark

Vision and vision-language understanding

  • CLIPPO and 1T-CLIP incur a drop of 2-3 percentage points compared to CLIP*
  • CLIPPO has 25-10% fewer parameters than 1T-CLIP
  • Multilingual CLIPPO performs worse than 1T-CLIP
  • Performance decreases when adding sentence pairs to training mix
  • CLIPPO outperforms CLIP*, 1T-CLIP and ViT-B/16
  • CLIPPO outperforms Pythia and MCAN models from [61]
  • CLIPPO performs competitively with ME-TER from [16]
  • CLIPPO performs comparably to ViLT, VisualBERT, PixelBERT
  • Separate patch embeddings and heads do not improve performance

Multilingual vision-language understanding

  • Tokenizer choice can be challenging for language models.
  • Tokenizers used for English don’t work well for non-Latin scripts.
  • CLIPPO removes language-related bias from tokenizers.
  • CLIPPO compared to other tokenizers on Crossmodal3600.
  • CLIPPO is more efficient than other tokenizers for most languages.
  • CLIPPO outperforms other models on GLUE score.

Language understanding

  • Training CLIPPO without image/alt-text data on pairs of parallel translated sentences
  • Evaluating the resulting text representations on GLUE

Modality gap analysis

  • Liang et al. found that text and image embeddings of CLIP-style models form two distinct clusters.
  • Training with sentence pairs in addition to image/alt-text pairs causes the clustering structure to disappear and the modality gap to decrease significantly.

Discussion and limitations

  • Proposed and evaluated CLIPPO which uses images as sole input modality
  • Matches performance of 1T-CLIP baseline
  • Has less than half the parameters of comparable CLIP*
  • Needs contrastive co-training with text pairs to achieve competitive language understanding performance
  • Adding 25% C4 data to batch strikes good balance across tasks, but induces drop in zero-shot image classification and image/text retrieval
  • Relies on cleanly rendered text as input
  • Uses encoder-only design and lacks ability to generate text outputs
  • Obtains strong multilingual image/text retrieval performance without requiring tokenizer

Conclusion

  • Introduced CLIPPO, a joint model for processing image and text through the lens of vision
  • Reduces design choices and parameter count
  • Can improve language understanding and increase generality across multiple languages
  • Explored methods of enhancing language understanding
  • CLIPPO models outperform strong NLP baselines while maintaining solid image understanding capabilities
  • Unified contrastive training algorithm
  • CLIPPO suffers somewhat when co-training on multiple tasks
  • Future work to further harmonize the co-training setup to ameliorate the trade-off
  • Deeper understanding of design choices made in rendering text as images and impact on performance
  • Examples of consecutive sentences from C4 corpus rendered using Unifont renderer
  • Example images from VQAv2 training set with rendered text
  • Single training setup for all baselines and visual text models
  • Adafactor optimizer with learning rate of 10-3
  • Batch size of 10,240 and train main models for 250k steps
  • Contrastive loss computed across full batch
  • Initialize learned temperature parameter in contrastive loss with value of 10
  • Reciprocal square root schedule with 10k steps linear warmup and 10k steps linear cooldown
  • Fine-tuning protocol inspired by [16, Sec. 4.1.1]
  • Results on vision and vision-language benchmarks and GLUE benchmark
  • Image classification and retrieval results
  • Visualize top 30 principal components of patch embedding kernel
  • CLIPPO produces smaller sequences for majority of languages compared to 1T-CLIP with alternative tokenizers
  • Visualization of modality gap for CLIP* and CLIPPO
  • CLIPPO has slightly smaller modality gap than CLIP*