Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Denoising Diffusion models have been used to generate high quality images from text.
- DALLE-2’s two step process uses a Diffusion Prior to generate a CLIP image embedding from text and a Diffusion Decoder to generate an image from the embedding.
- Diffusion Prior can be used to constrain the generation to a specific domain without altering the Diffusion Decoder.
- Diffusion Prior can be trained with additional conditional information to control the generation.
- Proposed approaches perform better than existing baselines for color conditioned generation.
Paper Content
Introduction
- Diffusion Models have been used to generate high quality and diverse images from text
- Numerous efforts have been made to improve quality, speed and apply models for editing
- Most works focus on architectures that condition the diffusion decoder directly on text embedding and encodings
- In [52], authors propose a two-step model to generate CLIP image embedding conditioned on CLIP text embedding
- We explore the capabilities and advantages of having a common intermediate representation
- We leverage the Diffusion Prior model trained by LAION as the baseline for our experiments
- We modify the cross-attention layers to condition on CLIP-L/14 image embedding instead of text embedding and encodings
- We explore the possibilities with Diffusion Prior model to control image generations
- We explore four applications: Text to Texture, Text to Rasterized Vectors, Text to Isolated Objects, Color Conditioned Text to Image
- We train a new Diffusion Prior model for each application while keeping the larger Diffusion Decoder model intact
- We evaluate our proposed setup on three domains and on color for conditional generation
- We perform a comprehensive quantitative and qualitative performance evaluation with existing baselines
- To the best of our knowledge, there is no existing work that shows effective semantic aware color conditional generation and domain specific generations using the DALLE-2 HDM architecture
Related work
Diffusion models
- DMs are more popular than GANs for image synthesis
- DMs use a Gaussian denoising process to predict noises
- DMs are easier to train and scale than GANs
- DMs can be conditioned on texts, images, or both
- DMs can be applied to other computer vision tasks
- Recent techniques have been introduced to improve DM’s training and sampling speed
Diffusion prior
- Diffusion Prior is a classifier-free guidance DM used in OpenAI’s DALLE-2
- Diffusion Prior maps input text embedding vector to image embedding vector in CLIP latent space
- Diffusion Prior outperforms autoregression prior in model size and training time
- Recent works using Diffusion Prior include Make-A-Video, Dream3D and Shifted Diffusion
Color conditioned generation
- Color is an important attribute of an image that provides contextual information and sets the mood of the viewer.
- Research has explored generating images with specific styles, but not using color palette.
- There are works on image colorization and color transfer, but they require an image as input.
- We propose to condition the Diffusion Prior model with color palette and generate a valid CLIP image embedding.
- The proposed setup is applicable for other conditional inputs such as style, sketch, semantic map etc.
Domain specific generation
- Our method outperforms existing large pretrained Stable Diffusion across all metrics.
- Adding domain specific modifiers to the prompts helps a little for vector and textures in improving domain relevance.
- FID scores show that the generated images from the proposed technique are of higher quality and are more relevant to the specific domain’s real distribution.
- Irrespective of the complexity of the prompt, the generated image using our method does not generate out of domain images.
Proposed method
- HDM is a text-to-image generation model
- HDM follows DALLE-2 architecture
- HDM has a Diffusion Prior and Diffusion Decoder model
- y is the text prompt and x is the generated image
Diffusion prior model
- The Prior model is a denoising diffusion model that generates a normalized CLIP L/14 image embedding.
- The Diffusion Prior is parameterized by θ and takes as input a random noise and a CLIP text embedding.
Diffusion decoder model
- We train a custom latent space model for memory and compute efficiency.
- The model takes as input random sample from N (0, I) and the CLIP image embedding.
- The generated latent is passed through a frozen decoder VAE dec to generate the final image.
- We use the same two step hybrid architecture for text-to-image generation as [52].
Domain specific prior
- Train separate Diffusion Prior model for each domain
- Obtain curated internal dataset of images with texture, suitable for vectors and isolated objects
- Train domain specific Diffusion Prior model on curated dataset
- Domain specific models generate image embeddings within specific sub-space in CLIP embedding space
- Embeddings from each prior can be visualized by LDM to generate domain specific images
Conditional prior
- Studies have been done on style, shape and semantic map conditioned generations, but not on color conditioned text-to-image generation.
- Color information can be used to help with creative workflows.
- Color information is represented by a 3D histogram in the LAB space, which is perceptually closer to human color vision.
Experimental setup
Dataset
- Used internal Adobe Stock dataset to train prior and decoder models
- Removed images with humans or text using classifiers
- Manually annotated 180K images to detect human presence/absence
- Manually annotated 40K images to detect text presence
- Removed NSFW images from training corpus using pretrained classifier
- Manually annotated 30K images for texture prior
- Used stock metadata to gather 1M positive and negative samples for vectors prior
- Manually annotated 28K images for isolated objects prior
- Used 61M only English subset of image-text pairs for color prior
Training and inference details
- LDM has same number of parameters as Stable Diffusion
- LDM trained on smaller dataset for fewer GPU hours
- Prior models have lower number of parameters and training time
- Prior models trained from scratch in 8-GPU A100-40GB instances
- Training prior model from scratch takes less time and compute than finetuning larger decoder or stable diffusion
- All experiments use 100 DDIM steps for sampling CLIP embedding and 50 DDIM steps to generate image
- 10 embeddings generated per prompt and highest CLIP score chosen to input text prompt
Baselines
- Represented method as ‘ours’
- Consists of prior and LDM
- Existing methods use prompt engineering with stable diffusion
- Added ’texture background’, ‘vector illustration’ and ‘isolated on a plain white background’ as suffix to prompts
- Used LAION prior as baseline for domain Diffusion Prior models
- Used suffixes with LAION prior to support baseline
Color conditional generation
- Table 3 provides quantitative results for the color prior model compared to baselines
- Trade-off between color transfer and quality of images generated in existing baselines
- SD+WCTRGB has best performance in color transfer but least quality in generation
- SD+ReHistoGAN has least performance in color transfer but better FID score than proposed model
- Proposed model strikes hue that corresponds to color palette applied over existing image
Metrics
- 5000 random prompts used to generate images for comparison
- FID used to measure quality of generated image and alignment with training distribution
Results and analysis
Conclusion
- Common CLIP embedding space used in text to image generation pipeline
- Diffusion Prior model trained on CLIP embedding space
- Prior model smaller in memory and requires less time to be trained
- Can be combined with existing decoder model to generate domain specific images
- Robust to complex prompts
- Color transfer techniques lack semantic awareness
- Color conditioned text to image generation
- Diffusion Prior model trained to accept additional conditioning input
- On par or better performance across all metrics and domains
- Limitation: vector/illustration as exemplar results in vector image output
- Color histogram of ground truth image fed to Diffusion Prior model during training
- Color words in text prompt given priority over color histogram
- Diffusion Prior could be reduced in capacity further
- Works for most other domains and other conditional inputs