Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Denoising Diffusion models have been used to generate high quality images from text.
  • DALLE-2’s two step process uses a Diffusion Prior to generate a CLIP image embedding from text and a Diffusion Decoder to generate an image from the embedding.
  • Diffusion Prior can be used to constrain the generation to a specific domain without altering the Diffusion Decoder.
  • Diffusion Prior can be trained with additional conditional information to control the generation.
  • Proposed approaches perform better than existing baselines for color conditioned generation.

Paper Content

Introduction

  • Diffusion Models have been used to generate high quality and diverse images from text
  • Numerous efforts have been made to improve quality, speed and apply models for editing
  • Most works focus on architectures that condition the diffusion decoder directly on text embedding and encodings
  • In [52], authors propose a two-step model to generate CLIP image embedding conditioned on CLIP text embedding
  • We explore the capabilities and advantages of having a common intermediate representation
  • We leverage the Diffusion Prior model trained by LAION as the baseline for our experiments
  • We modify the cross-attention layers to condition on CLIP-L/14 image embedding instead of text embedding and encodings
  • We explore the possibilities with Diffusion Prior model to control image generations
  • We explore four applications: Text to Texture, Text to Rasterized Vectors, Text to Isolated Objects, Color Conditioned Text to Image
  • We train a new Diffusion Prior model for each application while keeping the larger Diffusion Decoder model intact
  • We evaluate our proposed setup on three domains and on color for conditional generation
  • We perform a comprehensive quantitative and qualitative performance evaluation with existing baselines
  • To the best of our knowledge, there is no existing work that shows effective semantic aware color conditional generation and domain specific generations using the DALLE-2 HDM architecture

Diffusion models

  • DMs are more popular than GANs for image synthesis
  • DMs use a Gaussian denoising process to predict noises
  • DMs are easier to train and scale than GANs
  • DMs can be conditioned on texts, images, or both
  • DMs can be applied to other computer vision tasks
  • Recent techniques have been introduced to improve DM’s training and sampling speed

Diffusion prior

  • Diffusion Prior is a classifier-free guidance DM used in OpenAI’s DALLE-2
  • Diffusion Prior maps input text embedding vector to image embedding vector in CLIP latent space
  • Diffusion Prior outperforms autoregression prior in model size and training time
  • Recent works using Diffusion Prior include Make-A-Video, Dream3D and Shifted Diffusion

Color conditioned generation

  • Color is an important attribute of an image that provides contextual information and sets the mood of the viewer.
  • Research has explored generating images with specific styles, but not using color palette.
  • There are works on image colorization and color transfer, but they require an image as input.
  • We propose to condition the Diffusion Prior model with color palette and generate a valid CLIP image embedding.
  • The proposed setup is applicable for other conditional inputs such as style, sketch, semantic map etc.

Domain specific generation

  • Our method outperforms existing large pretrained Stable Diffusion across all metrics.
  • Adding domain specific modifiers to the prompts helps a little for vector and textures in improving domain relevance.
  • FID scores show that the generated images from the proposed technique are of higher quality and are more relevant to the specific domain’s real distribution.
  • Irrespective of the complexity of the prompt, the generated image using our method does not generate out of domain images.

Proposed method

  • HDM is a text-to-image generation model
  • HDM follows DALLE-2 architecture
  • HDM has a Diffusion Prior and Diffusion Decoder model
  • y is the text prompt and x is the generated image

Diffusion prior model

  • The Prior model is a denoising diffusion model that generates a normalized CLIP L/14 image embedding.
  • The Diffusion Prior is parameterized by θ and takes as input a random noise and a CLIP text embedding.

Diffusion decoder model

  • We train a custom latent space model for memory and compute efficiency.
  • The model takes as input random sample from N (0, I) and the CLIP image embedding.
  • The generated latent is passed through a frozen decoder VAE dec to generate the final image.
  • We use the same two step hybrid architecture for text-to-image generation as [52].

Domain specific prior

  • Train separate Diffusion Prior model for each domain
  • Obtain curated internal dataset of images with texture, suitable for vectors and isolated objects
  • Train domain specific Diffusion Prior model on curated dataset
  • Domain specific models generate image embeddings within specific sub-space in CLIP embedding space
  • Embeddings from each prior can be visualized by LDM to generate domain specific images

Conditional prior

  • Studies have been done on style, shape and semantic map conditioned generations, but not on color conditioned text-to-image generation.
  • Color information can be used to help with creative workflows.
  • Color information is represented by a 3D histogram in the LAB space, which is perceptually closer to human color vision.

Experimental setup

Dataset

  • Used internal Adobe Stock dataset to train prior and decoder models
  • Removed images with humans or text using classifiers
  • Manually annotated 180K images to detect human presence/absence
  • Manually annotated 40K images to detect text presence
  • Removed NSFW images from training corpus using pretrained classifier
  • Manually annotated 30K images for texture prior
  • Used stock metadata to gather 1M positive and negative samples for vectors prior
  • Manually annotated 28K images for isolated objects prior
  • Used 61M only English subset of image-text pairs for color prior

Training and inference details

  • LDM has same number of parameters as Stable Diffusion
  • LDM trained on smaller dataset for fewer GPU hours
  • Prior models have lower number of parameters and training time
  • Prior models trained from scratch in 8-GPU A100-40GB instances
  • Training prior model from scratch takes less time and compute than finetuning larger decoder or stable diffusion
  • All experiments use 100 DDIM steps for sampling CLIP embedding and 50 DDIM steps to generate image
  • 10 embeddings generated per prompt and highest CLIP score chosen to input text prompt

Baselines

  • Represented method as ‘ours’
  • Consists of prior and LDM
  • Existing methods use prompt engineering with stable diffusion
  • Added ’texture background’, ‘vector illustration’ and ‘isolated on a plain white background’ as suffix to prompts
  • Used LAION prior as baseline for domain Diffusion Prior models
  • Used suffixes with LAION prior to support baseline

Color conditional generation

  • Table 3 provides quantitative results for the color prior model compared to baselines
  • Trade-off between color transfer and quality of images generated in existing baselines
  • SD+WCTRGB has best performance in color transfer but least quality in generation
  • SD+ReHistoGAN has least performance in color transfer but better FID score than proposed model
  • Proposed model strikes hue that corresponds to color palette applied over existing image

Metrics

  • 5000 random prompts used to generate images for comparison
  • FID used to measure quality of generated image and alignment with training distribution

Results and analysis

Conclusion

  • Common CLIP embedding space used in text to image generation pipeline
  • Diffusion Prior model trained on CLIP embedding space
  • Prior model smaller in memory and requires less time to be trained
  • Can be combined with existing decoder model to generate domain specific images
  • Robust to complex prompts
  • Color transfer techniques lack semantic awareness
  • Color conditioned text to image generation
  • Diffusion Prior model trained to accept additional conditioning input
  • On par or better performance across all metrics and domains
  • Limitation: vector/illustration as exemplar results in vector image output
  • Color histogram of ground truth image fed to Diffusion Prior model during training
  • Color words in text prompt given priority over color histogram
  • Diffusion Prior could be reduced in capacity further
  • Works for most other domains and other conditional inputs