Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Present Muse, a text-to-image Transformer model that is more efficient than diffusion or autoregressive models
- Trained on a masked modeling task in discrete token space
- Uses pre-trained large language model (LLM) to predict randomly masked image tokens
- More efficient than diffusion and autoregressive models
- Achieves high-fidelity image generation and understanding of visual concepts
- New SOTA on CC3M with FID score of 6.06
- Enables image editing applications without the need to fine-tune or invert the model
Paper Content
Introduction
- Generative image models have improved in quality and flexibility in the last few years
- Deep learning architecture innovations and novel training paradigms have enabled this
- Large scale image-text paired datasets are available
- A new model for text-to-image synthesis is presented
- Model is conditioned on embeddings from a pre-trained and frozen T5-XXL language model
- Model is built on the Transformer architecture
- Model consists of several sub-models
- Model is more efficient than other models due to use of discrete tokens
- Model achieves excellent FID and CLIP scores
- Model is faster than comparable models
- Model enables zero-shot editing capabilities
Model
- Model is built on multiple components
- Overview of components provided in order of training
- Architecture and parameters discussed in Appendix
- Figure 3 provides overview of model architecture
Pre-trained text encoders
- Leveraging a pre-trained large language model (LLM) is beneficial to high-quality image generation.
- LLM embeddings carry rich information about objects, actions, visual properties, spatial relationships, cardinality and composition.
- LLM embeddings are linearly mappable to those learned by models trained on vision tasks.
Semantic tokenization using vqgan
- Model consists of an encoder and decoder with a quantization layer
- Encoder and decoder are built with convolutional layers
- Encoder has downsampling blocks, decoder has upsampling blocks
- Two VQGAN models used, one with downsampling ratio of 16 and one with 8
- Tokens capture higher-level semantics of the image, ignoring low level noise
Base model
- Base model is a masked transformer
- Inputs are projected T5 embeddings and image tokens
- Randomly mask a varying fraction of image tokens
- Replace masked tokens with a special [MASK] token
- Linearly map image tokens into image input embeddings
- Use several transformer layers to extract features
- Output layer uses MLP to convert each masked image embedding to a set of logits
- Cross-entropy loss is applied with the ground truth token label as the target
- Mask prediction is performed in an iterative manner for inference
Super-resolution model
- Trained two VQGAN models with different resolutions
- Latent map translation model is trained with text conditioning and cross-attention
Decoder finetuning
- Increase capacity of VQGAN decoder
- Finetune new decoder layers while keeping VQGAN encoder weights, codebook and transformers frozen
Variable masking rate
- Trains model with variable masking rate based on Cosine scheduling
- Expected masking rate of 0.64, bias towards higher masking rates makes prediction problem harder
- Learns P (x i |x ฮ ) for arbitrary subsets of tokens ฮ
Classifier free guidance
- Employs classifier-free guidance to improve generation quality and text-image alignment
- Removes text conditioning on 10% of samples randomly
- Computes conditional and unconditional logits for each masked token
- Moves away from unconditional logits by an amount (guidance scale)
- Increases guidance scale through sampling procedure
- Exploits mechanism to enable negative prompting
Iterative parallel decoding at inference
- Uses parallel decoding to predict multiple output tokens in a single forward pass
- Assumes tokens are conditionally independent given other tokens
- Uses cosine schedule to choose fixed fraction of highest confidence masked tokens
- Inference of 256 tokens using 24 decoding steps in base model, 4096 tokens using 8 decoding steps in super-resolution model
- Autoregressive models require 256 or 4096 steps, diffusion models require hundreds of steps
Results
- Trained a number of base Transformer models with different parameter sizes
- Used a pre-trained and frozen T5-XXL model with 4.6B parameters
- Largest base model had 3B parameters and 48 Transformer layers
- Used a CNN model with 19 ResNet blocks and a quantized codebook of size 8192
- Super-resolution model had 32 multi-axis Transformer layers
- Trained on Imagen dataset with 460M text-image pairs
- Training took 1 week on 512-core TPU-v4 chips
- Used Adafactor optimizer to save on memory consumption
- Avoided EMA of model weights during training, performed EMA offline on checkpointed weights
Qualitative performance
- Muse can generate images with different properties
- Muse can generate images with contextual variations
- Muse can generate images with prepositional object relations
- Muse can generate images in different styles
- Muse can render words and phrases
- Muse has difficulty with certain types of prompts
Quantitative performance
- Muse models achieve SOTA results on CC3M and COCO datasets
- FID score and CLIP score are used to measure performance
- 632M model achieves SOTA results on CC3M
- 3B model achieves FID score of 7.88 and CLIP score of 0.32
- Sampling algorithm has hyperparameters that can be adjusted to improve FID without hurting CLIP
- Human raters compare Muse and Stable Diffusion models for prompt-image alignment
- Muse chosen as better aligned than Stable Diffusion for 70.6% of prompts
- Muse is significantly faster than competing models
- Speed advantage due to use of discrete tokens and parallel decoding
Image editing
- Model can condition on arbitrary subsets of image tokens
- Sampling procedure gives text-guided inpainting and outpainting
- Multi-scale approach used for superresolution
- Zero-shot image editing of real input images
- Iteratively mask and resample tokens conditioned on text prompts
- Top-k sampling used to prevent process from diverging too much from input
Related work
Image generation models
- Variational autoencoders and GANs have been used to generate images with many variants.
- GANs were previously considered the best approach for image generation.
- Diffusion models, based on progressive denoising, can now generate images and video with equal or higher fidelity.
- Hybrid approaches combining multiple approaches have also shown excellent performance.
Image tokenizers
- Image tokenizers are useful for generative models.
- Tokenization approaches such as Discrete VAE’s, VQVAE and VQGAN have been developed.
- VQGAN is the highest-performing tokenization model.
- ViT-VQGAN extends VQGAN to the Transformer architecture.
- VQGAN performs better than ViT-VQGAN for the text-to-image model.
Large language models
- Leverages T5, a pre-trained large language model
- LLMs can learn powerful embeddings for few-shot transfer learning
- Modern LLMs are trained on token prediction tasks
Text-image models
- Leveraging paired text-image data is a powerful learning paradigm
- CLIP and ALIGN train models to align text and image embeddings
- Imagen and Parti use large scale text-image datasets to predict images from text
- Classifier-free guidance is used to trade off diversity and quality
Image editing with generative models
- GANs have been studied for image editing and manipulation
- A number of techniques have been developed on diffusion models to enable editing, personalization and inversion
- Dreambooth and Imagic involve fine-tuning of generative models
- ImagenEditor frames the editing task as text-guided image inpainting
Discussion and social impact
- Muse model confirms findings of Saharia et al. (2022) that frozen large pretrained language models serve as powerful text encoders for text-to-image generation
- Initial experiments showed that learning a language model from scratch was worse than using a pre-trained LLM
- Non-diffusion, non-autoregressive models based on the Transformer architecture can perform at par with diffusion models while being more efficient
- SOTA CLIP scores, showing excellent alignment between image and text
- Flexibility of approach with a number of image editing applications
- Generative models have potential to augment human creativity, but can also be leveraged for misinformation, harassment and biases
- Dataset biases can reflect negative social stereotypes and viewpoints
- Do not recommend use of text-to-image generation models without attention to use cases and potential for harm
- Caution against using such models for generation of people, humans and faces
- Muse achieves a CLIP score of 0.32, higher than the score of 0.27 reported in Imagen