Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Multimodal models are becoming more effective due to unified components
CLIPPO uses a single encoder to process both regular images and text rendered as images
CLIPPO performs image-based tasks with half the number of parameters and no text-specific tower or embedding
CLIPPO can perform well on natural language understanding tasks without any word-level loss
CLIPPO can achieve strong performance on multilingual multimodal retrieval without a tokenizer

Paper Content

Introduction

Large-scale multimodal training of Transformer-based models has improved performance in different domains.
A single large pre-trained model can outperform task-specific expert models.
Large multimodal models often use modality or dataset-specific encoders and decoders.
Transformer architecture works well on text, vision, audio, and other domains.
Mapping different modalities into a single shared embedding space simplifies the input/output interface.
Alternative representations of a modality make it possible to reuse, in one domain, neural architectures or training procedures designed for another domain.
CLIPPO performs similarly to CLIP on image classification and text/image retrieval.
CLIPPO can perform complex language understanding tasks without any left-to-right language modelling.
CLIPPO can obtain good performance on VQA when simply rendering the image and text together.

Related work

CLIPPO is related to CLIP and ALIGN which use contrastive training on noisy web data
Follow-ups have scaled further and used image representation learning
Model unification via weight-sharing has been explored
Co-training distinct tasks is a popular strategy
Self-supervised learning algorithms are used to unify task training
Discriminative tasks are used to learn representations for downstream modalities
Generative approaches to multimodal modelling have been scaled to billions of parameters
Document and user interface understanding models are trained on multimodal data sets
Contrastive pretraining on sentence pairs is explored as an auxiliary objective
Augmentations to generate text pairs involve word deletion, reordering, etc.
PIXEL is a closely related method from the NLP domain
Tokenizer models include WordPiece, Byte-Pair Encoding, and SentencePiece

Contrastive language-image pretraining with pixels

Contrastive language-image pretraining is a powerful, scalable paradigm to train versatile vision models on web-scale data sets
Image/alt-text pairs are automatically collected from the web and are usually noisy
Two encoders are jointly trained, a text encoder and an image encoder, to embed images and alt-texts into a shared latent space
Encoders are trained with a contrastive loss so that corresponding image and alt-text embeddings are similar to each other and dissimilar from all other embeddings
Once trained, the encoder pair can be used for zero-shot classification, image/text retrieval, and supervised transfer learning
CLIPPO is a single vision transformer model that can understand both images and text and provides a single representation for image, image-language, and language understanding tasks
CLIPPO alleviates common hurdles with text processing, like tokenizer and vocabulary development
CLIPPO is competitive with strong baseline language models on the GLUE benchmark
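
The contrastive objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a symmetric softmax cross-entropy (InfoNCE-style) loss over cosine similarities, and it treats the temperature as a fixed multiplicative scale of 10 rather than a learned parameter. The function names (`l2_normalize`, `log_softmax`, `contrastive_loss`) are hypothetical. In CLIPPO both embedding lists would come from the same encoder, one from regular images and one from rendered text.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(emb_a, emb_b, temperature=10.0):
    """Symmetric contrastive loss; emb_a[i] and emb_b[i] form a matching pair.

    Assumption: temperature multiplies the similarities and stays fixed here.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # Pairwise similarities between every item in a and every item in b.
    logits = [
        [temperature * sum(x * y for x, y in zip(a[i], b[j])) for j in range(n)]
        for i in range(n)
    ]
    # Direction a -> b: item i in a should select item i in b.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Direction b -> a: same computation on the transposed matrix.
    logits_t = [list(column) for column in zip(*logits)]
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)
```

Matching pairs yield a loss close to zero, while swapping the pairing makes the loss large, which is the signal that pulls corresponding embeddings together and pushes all others apart.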

Experiments

Training details and models

We use a single training setup for all our baselines and visual text models.
We use ViT-B/16 and ViT-L/16 architectures with a MAP head.
Representation dimension is 768.
Batch size is 10,240 and training steps are 250k.
Adafactor optimizer with learning rate of 10^-3 and decoupled weight decay of 10^-4.
CLIP-style models use T5en SentencePiece tokenizer.
Sequence length is 196.
We use WebLI data set with 10 billion images and 12 billion alt-texts in 109 languages.
Text/text data is from Colossal Clean Crawled Corpus (C4).
We also experiment with WMT19 data set and back-translated English sentences.
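
The sequence length of 196 listed above follows from the patch grid of a /16 ViT. The sketch below shows the arithmetic, assuming a 224x224 input resolution (the resolution is not stated in this summary, so treat it as an assumption); `num_patches` is a hypothetical helper name.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumed 224x224 input with 16x16 patches: a 14x14 grid of patches.
sequence_length = num_patches(224, 16)
```

Because rendered text goes through the same patch embedding as regular images, text and image inputs share this sequence length under the stated assumption.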

Evaluations and metrics

Evaluated vision and vision/language understanding capabilities of models using standard metrics
Reported classification accuracy on ImageNet-1k and recall@1 for cross-modal retrieval on MS-COCO and Flickr30k
Tested low-data transfer performance with 10-shot accuracy on ImageNet-1k
Evaluated on VQA benchmark VQAv2
Evaluated multilingual capabilities with zero-shot retrieval on Crossmodal3600
Evaluated language understanding capabilities on GLUE benchmark
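
Recall@1 for cross-modal retrieval, mentioned above, can be computed from a query-by-candidate similarity matrix. The sketch below is a simplification: it assumes exactly one correct candidate per query, located at the same index, whereas benchmarks such as MS-COCO pair each image with several captions. `recall_at_1` is a hypothetical helper name.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for query_index, row in enumerate(similarity):
        best_candidate = max(range(len(row)), key=row.__getitem__)
        if best_candidate == query_index:
            hits += 1
    return hits / len(similarity)
```

For image-to-text retrieval the rows would be image embeddings scored against all caption embeddings; text-to-image retrieval uses the transposed matrix.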

Vision and vision-language understanding

CLIPPO and 1T-CLIP incur a drop of 2-3 percentage points compared to CLIP*
CLIPPO has 10-25% fewer parameters than 1T-CLIP
Multilingual CLIPPO performs worse than 1T-CLIP
Performance decreases when adding sentence pairs to training mix
CLIPPO outperforms CLIP*, 1T-CLIP and ViT-B/16 on VQAv2
CLIPPO outperforms Pythia and MCAN models from [61]
CLIPPO performs competitively with METER from [16]
CLIPPO performs comparably to ViLT, VisualBERT, PixelBERT
Separate patch embeddings and heads do not improve performance

Multilingual vision-language understanding

Tokenizer choice can be challenging for language models.
Tokenizers used for English don’t work well for non-Latin scripts.
CLIPPO removes the language-related bias introduced by tokenizers.
CLIPPO is compared to 1T-CLIP with alternative tokenizers on Crossmodal3600.
CLIPPO produces shorter sequences than these tokenizers for most languages.
CLIPPO outperforms other models on GLUE score.

Language understanding

Training CLIPPO without image/alt-text data on pairs of parallel translated sentences
Evaluating the resulting text representations on GLUE

Modality gap analysis

Liang et al. found that text and image embeddings of CLIP-style models form two distinct clusters.
Training with sentence pairs in addition to image/alt-text pairs causes the clustering structure to disappear and the modality gap to decrease significantly.
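
One simple way to put a number on the modality gap discussed above is the distance between the centroids of the image embeddings and the text embeddings. The sketch below assumes this centroid-distance definition; this summary does not state which statistic the paper reports, and the function names are hypothetical.

```python
import math


def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def modality_gap(image_embeddings, text_embeddings):
    """Euclidean distance between the two modality centroids (assumed metric)."""
    image_center = centroid(image_embeddings)
    text_center = centroid(text_embeddings)
    return math.sqrt(
        sum((x - y) ** 2 for x, y in zip(image_center, text_center))
    )
```

Under this measure, two well-separated clusters give a large value, and the value shrinks toward zero as the image and text embeddings come to occupy the same region of the space.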

Discussion and limitations

Proposed and evaluated CLIPPO which uses images as sole input modality
Matches performance of 1T-CLIP baseline
Has less than half the parameters of comparable CLIP*
Needs contrastive co-training with text pairs to achieve competitive language understanding performance
Adding 25% C4 data to batch strikes good balance across tasks, but induces drop in zero-shot image classification and image/text retrieval
Relies on cleanly rendered text as input
Uses encoder-only design and lacks ability to generate text outputs
Obtains strong multilingual image/text retrieval performance without requiring tokenizer

Conclusion

Introduced CLIPPO, a joint model for processing image and text through the lens of vision
Reduces design choices and parameter count
Can improve language understanding and increase generality across multiple languages
Explored methods of enhancing language understanding
CLIPPO models outperform strong NLP baselines while maintaining solid image understanding capabilities
Unified contrastive training algorithm
CLIPPO suffers somewhat when co-training on multiple tasks
Future work to further harmonize the co-training setup to ameliorate the trade-off
Deeper understanding of design choices made in rendering text as images and impact on performance
Examples of consecutive sentences from C4 corpus rendered using Unifont renderer
Example images from VQAv2 training set with rendered text
Single training setup for all baselines and visual text models
Adafactor optimizer with learning rate of 10^-3
Batch size of 10,240 and train main models for 250k steps
Contrastive loss computed across full batch
Initialize learned temperature parameter in contrastive loss with value of 10
Reciprocal square root schedule with 10k steps linear warmup and 10k steps linear cooldown
Fine-tuning protocol inspired by [16, Sec. 4.1.1]
Results on vision and vision-language benchmarks and GLUE benchmark
Image classification and retrieval results
Visualize top 30 principal components of patch embedding kernel
CLIPPO produces smaller sequences for majority of languages compared to 1T-CLIP with alternative tokenizers
Visualization of modality gap for CLIP* and CLIPPO
CLIPPO has slightly smaller modality gap than CLIP*
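
The learning rate schedule listed above (reciprocal square root with 10k steps of linear warmup and 10k steps of linear cooldown, over 250k training steps with a base rate of 10^-3) can be sketched as follows. The exact functional form is not given in this summary, so the decay `sqrt(warmup_steps / step)` and the way the cooldown is multiplied in are assumptions, and `learning_rate` is a hypothetical helper name.

```python
import math


def learning_rate(step, base_lr=1e-3, total_steps=250_000,
                  warmup_steps=10_000, cooldown_steps=10_000):
    """Reciprocal square root schedule with linear warmup and cooldown.

    Assumes warmup_steps > 0. The decay shape is one plausible choice,
    not necessarily the one used in the paper.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Reciprocal square root decay, equal to base_lr at the end of warmup.
    rate = base_lr * math.sqrt(warmup_steps / step)
    steps_left = total_steps - step
    if steps_left < cooldown_steps:
        # Linear cooldown to 0 over the final cooldown_steps steps.
        rate *= max(steps_left, 0) / cooldown_steps
    return rate
```

With these defaults the rate peaks at the base value at step 10,000, halves by step 40,000, and reaches zero at step 250,000.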
