Abstract
- Multimodal models are becoming more effective due to unified components
- CLIPPO uses a single encoder to process both regular images and text rendered as images
- CLIPPO performs image-based tasks with half the number of parameters and no text-specific tower or embedding
- CLIPPO can perform well on natural language understanding tasks without any word-level loss
- CLIPPO can achieve strong performance on multilingual multimodal retrieval without a tokenizer
Paper Content
Introduction
- Large-scale multimodal training of Transformer-based models has improved performance in different domains.
- A single large pre-trained model can outperform task-specific expert models.
- Large multimodal models often use modality or dataset-specific encoders and decoders.
- Transformer architecture works well on text, vision, audio, and other domains.
- Mapping different modalities into a single shared embedding space simplifies the input/output interface.
- Representing one modality in the form of another makes it possible to reuse neural architectures or training procedures designed for a different domain.
- CLIPPO performs similarly to CLIP on image classification and text/image retrieval.
- CLIPPO can perform complex language understanding tasks without any left-to-right language modelling.
- CLIPPO can obtain good performance on VQA when simply rendering the image and text together.
Related work
- CLIPPO is related to CLIP and ALIGN which use contrastive training on noisy web data
- Follow-ups have scaled further and used image representation learning
- Model unification via weight-sharing has been explored
- Co-training distinct tasks is a popular strategy
- Self-supervised learning algorithms are used to unify task training
- Discriminative tasks are used to learn representations for downstream modalities
- Generative approaches to multimodal modelling have been scaled to billions of parameters
- Document and user interface understanding models are trained on multimodal data sets
- Contrastive pretraining on sentence pairs is explored as an auxiliary objective
- Augmentations to generate text pairs involve word deletion, reordering, etc.
- PIXEL is a closely related method from the NLP domain
- Tokenizer models include WordPiece, Byte-Pair Encoding, and SentencePiece
Contrastive language-image pretraining with pixels
- Contrastive language-image pretraining is a powerful, scalable paradigm to train versatile vision models on web-scale data sets
- Image/alt-text pairs are automatically collected from the web and are usually noisy
- Two encoders are jointly trained, a text encoder and an image encoder, to embed images and alt-texts into a shared latent space
- Encoders are trained with a contrastive loss that makes corresponding image and alt-text embeddings similar to each other and dissimilar from all other embeddings
- Once trained, the encoder pair can be used for zero-shot classification, image/text retrieval, and supervised transfer learning
- CLIPPO is a single vision transformer model that can understand both images and text and provides a single representation for image, image-language, and language understanding tasks
- CLIPPO alleviates common hurdles with text processing, like tokenizer and vocabulary development
- CLIPPO is competitive with strong baseline language models on the GLUE benchmark
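The contrastive objective described in this section can be sketched in a few lines. This is a minimal illustration in plain Python, not the paper's implementation: the function names are hypothetical, the symmetric cross-entropy form is the usual CLIP-style formulation, and treating the temperature as a multiplier on cosine similarities is an assumption. For a CLIPPO-style model, both embedding lists would come from the same encoder (one from images, one from rendered text).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(emb_a, emb_b, temperature=10.0):
    """Symmetric contrastive loss over a batch of paired embeddings.

    emb_a[i] and emb_b[i] form a matching pair; every other
    combination in the batch acts as a negative.
    Assumption: `temperature` multiplies the cosine similarities.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between a[i] and b[j]
    logits = [
        [temperature * sum(x * y for x, y in zip(a[i], b[j])) for j in range(n)]
        for i in range(n)
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_a2b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_b2a = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a2b + loss_b2a)
```

With correctly paired embeddings the loss is close to zero; swapping the pairs makes it large, which is the signal the encoder is trained on.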
Experiments
Training details and models
- We use a single training setup for all our baselines and visual text models.
- We use ViT-B/16 and ViT-L/16 architectures with a MAP head.
- Representation dimension is 768.
- Batch size is 10,240 and training steps are 250k.
- Adafactor optimizer with learning rate of 10^-3 and decoupled weight decay of 10^-4.
- CLIP-style models use T5en SentencePiece tokenizer.
- Sequence length is 196.
- We use WebLI data set with 10 billion images and 12 billion alt-texts in 109 languages.
- Text/text data is from Colossal Clean Crawled Corpus (C4).
- We also experiment with WMT19 data set and back-translated English sentences.
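A couple of the numbers above can be checked with simple arithmetic. The sketch below assumes a 224x224 input resolution, which these notes do not state; under that assumption a ViT with 16x16 patches yields exactly the sequence length of 196 listed above. The helper name is hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Assumption: 224x224 inputs (not stated in the notes).
SEQ_LEN = num_patches(image_size=224, patch_size=16)

# Pairs seen during training: batch size times number of steps.
BATCH_SIZE = 10_240
TRAIN_STEPS = 250_000
PAIRS_SEEN = BATCH_SIZE * TRAIN_STEPS

print(SEQ_LEN)     # 196
print(PAIRS_SEEN)  # 2560000000
```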
Evaluations and metrics
- Evaluated vision and vision/language understanding capabilities of models using standard metrics
- Reported classification accuracy on ImageNet-1k and recall@1 for cross-modal retrieval on MS-COCO and Flickr30k
- Tested low-data transfer performance with 10-shot accuracy on ImageNet-1k
- Evaluated on VQA benchmark VQAv2
- Evaluated multilingual capabilities with zero-shot retrieval on CrossModal3600
- Evaluated language understanding capabilities on GLUE benchmark
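The recall@1 metric used for retrieval can be illustrated as follows. This is a simplified sketch that assumes exactly one correct candidate per query, located at the same index as the query; real retrieval benchmarks can have several captions per image, which this version does not handle. The function name is hypothetical.

```python
def recall_at_1(similarity):
    """Fraction of queries whose top-scoring candidate is the correct one.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)


scores = [
    [0.9, 0.1],  # query 0 ranks candidate 0 first: hit
    [0.8, 0.2],  # query 1 ranks candidate 0 first: miss
]
print(recall_at_1(scores))  # 0.5
```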
Vision and vision-language understanding
- CLIPPO and 1T-CLIP incur a drop of 2-3 percentage points compared to CLIP*
- CLIPPO has 10-25% fewer parameters than 1T-CLIP
- Multilingual CLIPPO performs worse than 1T-CLIP
- Performance decreases when adding sentence pairs to training mix
- On VQAv2, CLIPPO outperforms CLIP*, 1T-CLIP and ViT-B/16
- CLIPPO outperforms Pythia and MCAN models from [61]
- CLIPPO performs competitively with METER from [16]
- CLIPPO performs comparably to ViLT, VisualBERT, PixelBERT
- Separate patch embeddings and heads do not improve performance
Multilingual vision-language understanding
- Tokenizer choice can be challenging for language models.
- Tokenizers used for English don’t work well for non-Latin scripts.
- CLIPPO removes language-related bias from tokenizers.
- CLIPPO is compared with 1T-CLIP using alternative tokenizers on Crossmodal3600.
- CLIPPO produces shorter sequences than these tokenizers for most languages.
- CLIPPO outperforms other models on GLUE score.
Language understanding
- Training CLIPPO without image/alt-text data on pairs of parallel translated sentences
- Evaluating the resulting text representations on GLUE
Modality gap analysis
- Liang et al. found that text and image embeddings of CLIP-style models form two distinct clusters.
- Training with sentence pairs in addition to image/alt-text pairs causes the clustering structure to disappear and the modality gap to decrease significantly.
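One simple way to quantify such a gap is the distance between the centroids of the image embeddings and the text embeddings. The sketch below assumes that definition; these notes do not say which measure the paper uses, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the two modalities.

    Embeddings are L2-normalized first, as is common for
    contrastively trained models (an assumption here).
    """
    c_img = centroid([l2_normalize(v) for v in image_embs])
    c_txt = centroid([l2_normalize(v) for v in text_embs])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c_img, c_txt)))
```

A value near zero means the two sets of embeddings share a common center; two well-separated clusters give a clearly positive value.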
Discussion and limitations
- Proposed and evaluated CLIPPO which uses images as sole input modality
- Matches performance of 1T-CLIP baseline
- Has less than half the parameters of comparable CLIP*
- Needs contrastive co-training with text pairs to achieve competitive language understanding performance
- Adding 25% C4 data to batch strikes good balance across tasks, but induces drop in zero-shot image classification and image/text retrieval
- Relies on cleanly rendered text as input
- Uses encoder-only design and lacks ability to generate text outputs
- Obtains strong multilingual image/text retrieval performance without requiring tokenizer
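To make the co-training mix concrete, the sketch below splits one batch between image/alt-text pairs and C4 text pairs. It assumes the 25% share applies per batch at the batch size of 10,240 given in the training details; the helper name and the rounding rule are hypothetical.

```python
def split_batch(batch_size: int, text_pair_fraction: float):
    """Return (image/alt-text pairs, text/text pairs) for one batch."""
    if not 0.0 <= text_pair_fraction <= 1.0:
        raise ValueError("text_pair_fraction must be in [0, 1]")
    n_text = round(batch_size * text_pair_fraction)
    return batch_size - n_text, n_text


n_image_pairs, n_text_pairs = split_batch(10_240, 0.25)
print(n_image_pairs, n_text_pairs)  # 7680 2560
```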
Conclusion
- Introduced CLIPPO, a joint model for processing image and text through the lens of vision
- Reduces design choices and parameter count
- Can improve language understanding and increase generality across multiple languages
- Explored methods of enhancing language understanding
- CLIPPO models outperform strong NLP baselines while maintaining solid image understanding capabilities
- Unified contrastive training algorithm
- CLIPPO suffers somewhat when co-training on multiple tasks
- Future work to further harmonize the co-training setup to ameliorate the trade-off
- Deeper understanding of design choices made in rendering text as images and impact on performance
- Examples of consecutive sentences from C4 corpus rendered using Unifont renderer
- Example images from VQAv2 training set with rendered text
- Single training setup for all baselines and visual text models
- Adafactor optimizer with learning rate of 10^-3
- Batch size of 10,240 and train main models for 250k steps
- Contrastive loss computed across full batch
- Initialize learned temperature parameter in contrastive loss with value of 10
- Reciprocal square root schedule with 10k steps linear warmup and 10k steps linear cooldown
- Fine-tuning protocol inspired by [16, Sec. 4.1.1]
- Results on vision and vision-language benchmarks and GLUE benchmark
- Image classification and retrieval results
- Visualize top 30 principal components of patch embedding kernel
- CLIPPO produces smaller sequences for majority of languages compared to 1T-CLIP with alternative tokenizers
- Visualization of modality gap for CLIP* and CLIPPO
- CLIPPO has slightly smaller modality gap than CLIP*
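The learning rate schedule listed above (reciprocal square root with 10k steps of linear warmup and 10k steps of linear cooldown, over 250k steps at a peak rate of 10^-3) might look roughly like the sketch below. The exact functional form is not given in these notes: the `timescale` parameter and the way decay and cooldown are combined are assumptions, and the function name is hypothetical.

```python
import math


def rsqrt_lr(step, base_lr=1e-3, total_steps=250_000,
             warmup_steps=10_000, cooldown_steps=10_000,
             timescale=10_000):
    """Reciprocal-square-root schedule with linear warmup and cooldown.

    Assumption: after warmup the rate decays as
    base_lr / sqrt(1 + (step - warmup_steps) / timescale),
    and is scaled linearly to zero over the last `cooldown_steps`.
    """
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr.
        return base_lr * step / warmup_steps
    lr = base_lr / math.sqrt(1.0 + (step - warmup_steps) / timescale)
    steps_left = total_steps - step
    if steps_left < cooldown_steps:
        # Linear ramp down to 0 at the final step.
        lr *= max(steps_left, 0) / cooldown_steps
    return lr


for s in (0, 5_000, 10_000, 100_000, 245_000, 250_000):
    print(s, rsqrt_lr(s))
```

Under these assumptions the rate peaks at the end of warmup, decays smoothly through most of training, and reaches zero at the last step.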