Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Present Muse, a text-to-image Transformer model that is more efficient than diffusion or autoregressive models
  • Trained on a masked modeling task in discrete token space
  • Uses pre-trained large language model (LLM) to predict randomly masked image tokens
  • More efficient than diffusion and autoregressive models
  • Achieves high-fidelity image generation and understanding of visual concepts
  • New SOTA on CC3M with FID score of 6.06
  • Enables image editing applications without the need to fine-tune or invert the model

Paper Content

Introduction

  • Generative image models have improved in quality and flexibility in the last few years
  • Deep learning architecture innovations and novel training paradigms have enabled this
  • Large scale image-text paired datasets are available
  • A new model for text-to-image synthesis is presented
  • Model is conditioned on embeddings from a pre-trained and frozen T5-XXL language model
  • Model is built on the Transformer architecture
  • Model consists of several sub-models
  • Model is more efficient than other models due to use of discrete tokens
  • Model achieves excellent FID and CLIP scores
  • Model is faster than comparable models
  • Model enables zero-shot editing capabilities

Model

  • Model is built on multiple components
  • Overview of components provided in order of training
  • Architecture and parameters discussed in Appendix
  • Figure 3 provides overview of model architecture

Pre-trained text encoders

  • Leveraging a pre-trained large language model (LLM) is beneficial to high-quality image generation.
  • LLM embeddings carry rich information about objects, actions, visual properties, spatial relationships, cardinality and composition.
  • LLM embeddings are linearly mappable to those learned by models trained on vision tasks.

Semantic tokenization using vqgan

  • Model consists of an encoder and decoder with a quantization layer
  • Encoder and decoder are built with convolutional layers
  • Encoder has downsampling blocks, decoder has upsampling blocks
  • Two VQGAN models used, one with downsampling ratio of 16 and one with 8
  • Tokens capture higher-level semantics of the image, ignoring low level noise

Base model

  • Base model is a masked transformer
  • Inputs are projected T5 embeddings and image tokens
  • Randomly mask a varying fraction of image tokens
  • Replace masked tokens with a special [MASK] token
  • Linearly map image tokens into image input embeddings
  • Use several transformer layers to extract features
  • Output layer uses MLP to convert each masked image embedding to a set of logits
  • Cross-entropy loss is applied with the ground truth token label as the target
  • Mask prediction is performed in an iterative manner for inference

Super-resolution model

  • Trained two VQGAN models with different resolutions
  • Latent map translation model is trained with text conditioning and cross-attention

Decoder finetuning

  • Increase capacity of VQGAN decoder
  • Finetune new decoder layers while keeping VQGAN encoder weights, codebook and transformers frozen

Variable masking rate

  • Trains model with variable masking rate based on Cosine scheduling
  • Expected masking rate of 0.64, bias towards higher masking rates makes prediction problem harder
  • Learns P (x i |x ฮ› ) for arbitrary subsets of tokens ฮ›

Classifier free guidance

  • Employs classifier-free guidance to improve generation quality and text-image alignment
  • Removes text conditioning on 10% of samples randomly
  • Computes conditional and unconditional logits for each masked token
  • Moves away from unconditional logits by an amount (guidance scale)
  • Increases guidance scale through sampling procedure
  • Exploits mechanism to enable negative prompting

Iterative parallel decoding at inference

  • Uses parallel decoding to predict multiple output tokens in a single forward pass
  • Assumes tokens are conditionally independent given other tokens
  • Uses cosine schedule to choose fixed fraction of highest confidence masked tokens
  • Inference of 256 tokens using 24 decoding steps in base model, 4096 tokens using 8 decoding steps in super-resolution model
  • Autoregressive models require 256 or 4096 steps, diffusion models require hundreds of steps

Results

  • Trained a number of base Transformer models with different parameter sizes
  • Used a pre-trained and frozen T5-XXL model with 4.6B parameters
  • Largest base model had 3B parameters and 48 Transformer layers
  • Used a CNN model with 19 ResNet blocks and a quantized codebook of size 8192
  • Super-resolution model had 32 multi-axis Transformer layers
  • Trained on Imagen dataset with 460M text-image pairs
  • Training took 1 week on 512-core TPU-v4 chips
  • Used Adafactor optimizer to save on memory consumption
  • Avoided EMA of model weights during training, performed EMA offline on checkpointed weights

Qualitative performance

  • Muse can generate images with different properties
  • Muse can generate images with contextual variations
  • Muse can generate images with prepositional object relations
  • Muse can generate images in different styles
  • Muse can render words and phrases
  • Muse has difficulty with certain types of prompts

Quantitative performance

  • Muse models achieve SOTA results on CC3M and COCO datasets
  • FID score and CLIP score are used to measure performance
  • 632M model achieves SOTA results on CC3M
  • 3B model achieves FID score of 7.88 and CLIP score of 0.32
  • Sampling algorithm has hyperparameters that can be adjusted to improve FID without hurting CLIP
  • Human raters compare Muse and Stable Diffusion models for prompt-image alignment
  • Muse chosen as better aligned than Stable Diffusion for 70.6% of prompts
  • Muse is significantly faster than competing models
  • Speed advantage due to use of discrete tokens and parallel decoding

Image editing

  • Model can condition on arbitrary subsets of image tokens
  • Sampling procedure gives text-guided inpainting and outpainting
  • Multi-scale approach used for superresolution
  • Zero-shot image editing of real input images
  • Iteratively mask and resample tokens conditioned on text prompts
  • Top-k sampling used to prevent process from diverging too much from input

Image generation models

  • Variational autoencoders and GANs have been used to generate images with many variants.
  • GANs were previously considered the best approach for image generation.
  • Diffusion models, based on progressive denoising, can now generate images and video with equal or higher fidelity.
  • Hybrid approaches combining multiple approaches have also shown excellent performance.

Image tokenizers

  • Image tokenizers are useful for generative models.
  • Tokenization approaches such as Discrete VAE’s, VQVAE and VQGAN have been developed.
  • VQGAN is the highest-performing tokenization model.
  • ViT-VQGAN extends VQGAN to the Transformer architecture.
  • VQGAN performs better than ViT-VQGAN for the text-to-image model.

Large language models

  • Leverages T5, a pre-trained large language model
  • LLMs can learn powerful embeddings for few-shot transfer learning
  • Modern LLMs are trained on token prediction tasks

Text-image models

  • Leveraging paired text-image data is a powerful learning paradigm
  • CLIP and ALIGN train models to align text and image embeddings
  • Imagen and Parti use large scale text-image datasets to predict images from text
  • Classifier-free guidance is used to trade off diversity and quality

Image editing with generative models

  • GANs have been studied for image editing and manipulation
  • A number of techniques have been developed on diffusion models to enable editing, personalization and inversion
  • Dreambooth and Imagic involve fine-tuning of generative models
  • ImagenEditor frames the editing task as text-guided image inpainting

Discussion and social impact

  • Muse model confirms findings of Saharia et al. (2022) that frozen large pretrained language models serve as powerful text encoders for text-to-image generation
  • Initial experiments showed that learning a language model from scratch was worse than using a pre-trained LLM
  • Non-diffusion, non-autoregressive models based on the Transformer architecture can perform at par with diffusion models while being more efficient
  • SOTA CLIP scores, showing excellent alignment between image and text
  • Flexibility of approach with a number of image editing applications
  • Generative models have potential to augment human creativity, but can also be leveraged for misinformation, harassment and biases
  • Dataset biases can reflect negative social stereotypes and viewpoints
  • Do not recommend use of text-to-image generation models without attention to use cases and potential for harm
  • Caution against using such models for generation of people, humans and faces
  • Muse achieves a CLIP score of 0.32, higher than the score of 0.27 reported in Imagen