Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Denoising Diffusion models have been used to generate high quality images from text.
DALLE-2’s two step process uses a Diffusion Prior to generate a CLIP image embedding from text and a Diffusion Decoder to generate an image from the embedding.
Diffusion Prior can be used to constrain the generation to a specific domain without altering the Diffusion Decoder.
Diffusion Prior can be trained with additional conditional information to control the generation.
Proposed approaches perform better than existing baselines for color conditioned generation.

Paper Content

Introduction

Diffusion Models have been used to generate high quality and diverse images from text
Numerous efforts have been made to improve quality, speed and apply models for editing
Most works focus on architectures that condition the diffusion decoder directly on text embedding and encodings
In [52], authors propose a two-step model to generate CLIP image embedding conditioned on CLIP text embedding
We explore the capabilities and advantages of having a common intermediate representation
We leverage the Diffusion Prior model trained by LAION as the baseline for our experiments
We modify the cross-attention layers to condition on CLIP-L/14 image embedding instead of text embedding and encodings
We explore the possibilities with Diffusion Prior model to control image generations
We explore four applications: Text to Texture, Text to Rasterized Vectors, Text to Isolated Objects, Color Conditioned Text to Image
We train a new Diffusion Prior model for each application while keeping the larger Diffusion Decoder model intact
We evaluate our proposed setup on three domains and on color for conditional generation
We perform a comprehensive quantitative and qualitative performance evaluation with existing baselines
To the best of our knowledge, there is no existing work that shows effective semantic aware color conditional generation and domain specific generations using the DALLE-2 HDM architecture

Related work

Diffusion models

DMs are more popular than GANs for image synthesis
DMs learn to reverse a gradual Gaussian noising process by predicting the noise added at each step
DMs are easier to train and scale than GANs
DMs can be conditioned on texts, images, or both
DMs can be applied to other computer vision tasks
Recent techniques have been introduced to improve DM’s training and sampling speed

Diffusion prior

Diffusion Prior is a DM trained with classifier-free guidance and used in OpenAI’s DALLE-2
Diffusion Prior maps input text embedding vector to image embedding vector in CLIP latent space
Diffusion Prior outperforms the autoregressive prior in model size and training time
Recent works using Diffusion Prior include Make-A-Video, Dream3D and Shifted Diffusion
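
As a rough illustration of the classifier-free guidance mentioned above, the sketch below blends a conditional and an unconditional model prediction at sampling time. The function name, argument names, and the use of plain NumPy arrays are assumptions for illustration; the source does not describe the guidance scale or the exact prediction target used by the prior.

```python
import numpy as np

def classifier_free_guidance(pred_cond, pred_uncond, guidance_scale):
    """Blend conditional and unconditional predictions (hypothetical helper).

    guidance_scale = 1.0 returns the conditional prediction unchanged;
    larger values push the result further toward the condition.
    """
    pred_cond = np.asarray(pred_cond, dtype=np.float64)
    pred_uncond = np.asarray(pred_uncond, dtype=np.float64)
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

In a real sampler, both predictions would come from the same network, evaluated once with the text embedding and once with the conditioning dropped.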

Color conditioned generation

Color is an important attribute of an image that provides contextual information and sets the mood for the viewer.
Research has explored generating images with specific styles, but not using color palette.
There are works on image colorization and color transfer, but they require an image as input.
We propose to condition the Diffusion Prior model with color palette and generate a valid CLIP image embedding.
The proposed setup is applicable for other conditional inputs such as style, sketch, semantic map etc.

Domain specific generation

Our method outperforms existing large pretrained Stable Diffusion across all metrics.
Adding domain specific modifiers to the prompts helps a little for vector and textures in improving domain relevance.
FID scores show that the generated images from the proposed technique are of higher quality and are more relevant to the specific domain’s real distribution.
Irrespective of the complexity of the prompt, our method does not generate out-of-domain images.

Proposed method

HDM is a text-to-image generation model
HDM follows DALLE-2 architecture
HDM has a Diffusion Prior and Diffusion Decoder model
y is the text prompt and x is the generated image

Diffusion prior model

The Prior model is a denoising diffusion model that generates a normalized CLIP L/14 image embedding.
The Diffusion Prior is parameterized by θ and takes as input a random noise and a CLIP text embedding.

Diffusion decoder model

We train a custom latent space model for memory and compute efficiency.
The model takes as input a random sample from N(0, I) and the CLIP image embedding.
The generated latent is passed through a frozen VAE decoder (VAE_dec) to generate the final image.
We use the same two step hybrid architecture for text-to-image generation as [52].
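
The two-step generation described in this section can be summarized as a sampling chain. This is a sketch only: y, x, and θ follow the text, while z_y, the hatted variables, the noise samples ε, and the decoder parameters φ are notation assumed here for illustration.

```latex
\begin{aligned}
z_y &= \mathrm{CLIP}_{\text{text}}(y) \\
\hat{z}_x &= \mathrm{Prior}_{\theta}(\epsilon_1, z_y), \qquad \epsilon_1 \sim \mathcal{N}(0, I) \\
\hat{\ell} &= \mathrm{LDM}_{\phi}(\epsilon_2, \hat{z}_x), \qquad \epsilon_2 \sim \mathcal{N}(0, I) \\
x &= \mathrm{VAE}_{\mathrm{dec}}(\hat{\ell})
\end{aligned}
```

Because the decoder only sees the CLIP image embedding, swapping in a different prior changes what is generated without retraining the decoder.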

Domain specific prior

Train separate Diffusion Prior model for each domain
Obtain a curated internal dataset of images for each domain: textures, vectors, and isolated objects
Train domain specific Diffusion Prior model on curated dataset
Domain specific models generate image embeddings within specific sub-space in CLIP embedding space
Embeddings from each prior can be visualized by LDM to generate domain specific images

Conditional prior

Studies have been done on style, shape and semantic map conditioned generations, but not on color conditioned text-to-image generation.
Color information can be used to help with creative workflows.
Color information is represented by a 3D histogram in the LAB space, which is perceptually closer to human color vision.
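
A minimal sketch of a 3D LAB histogram follows, using only NumPy. The bin count (10 per axis), the value ranges, the D65 white point, and the function names are assumptions made for illustration; the source does not state the histogram resolution or the conversion used.

```python
import numpy as np

# sRGB (linear) to CIE XYZ matrix and D65 reference white (standard values).
_RGB_TO_XYZ = np.array([
    [0.4124564, 0.3575761, 0.1804375],
    [0.2126729, 0.7151522, 0.0721750],
    [0.0193339, 0.1191920, 0.9503041],
])
_D65_WHITE = np.array([0.95047, 1.0, 1.08883])

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1] with shape (..., 3) to CIELAB."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma curve.
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ, normalized by the reference white.
    xyz = linear @ _RGB_TO_XYZ.T / _D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def lab_histogram(rgb_image, bins=(10, 10, 10)):
    """Return a normalized 3D histogram over (L, a, b) for an sRGB image."""
    lab = srgb_to_lab(rgb_image).reshape(-1, 3)
    ranges = [(0.0, 100.0), (-128.0, 127.0), (-128.0, 127.0)]
    lows = np.array([r[0] for r in ranges])
    highs = np.array([r[1] for r in ranges])
    # Clip so rounding error cannot push a pixel outside the histogram range.
    lab = np.clip(lab, lows, highs)
    hist, _ = np.histogramdd(lab, bins=bins, range=ranges)
    return hist / hist.sum()
```

The flattened histogram (here 1,000 values) could then serve as the extra conditioning vector for the prior, alongside the text embedding.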

Experimental setup

Dataset

Used internal Adobe Stock dataset to train prior and decoder models
Removed images with humans or text using classifiers
Manually annotated 180K images to detect human presence/absence
Manually annotated 40K images to detect text presence
Removed NSFW images from training corpus using pretrained classifier
Manually annotated 30K images for texture prior
Used stock metadata to gather 1M positive and negative samples for vectors prior
Manually annotated 28K images for isolated objects prior
Used an English-only subset of 61M image-text pairs for color prior

Training and inference details

LDM has same number of parameters as Stable Diffusion
LDM trained on smaller dataset for fewer GPU hours
Prior models have lower number of parameters and training time
Prior models trained from scratch on 8-GPU A100-40GB instances
Training prior model from scratch takes less time and compute than finetuning larger decoder or stable diffusion
All experiments use 100 DDIM steps for sampling CLIP embedding and 50 DDIM steps to generate image
10 embeddings generated per prompt; the one with the highest CLIP score against the input text prompt is chosen
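
The candidate selection step can be sketched as below, assuming the CLIP score is the cosine similarity between each generated image embedding and the text embedding. The function and argument names are hypothetical, and the embeddings are plain NumPy arrays rather than outputs of a real CLIP model.

```python
import numpy as np

def select_best_embedding(candidate_image_embs, text_emb):
    """Pick the candidate with the highest cosine similarity to the text.

    candidate_image_embs: array of shape (num_candidates, dim)
    text_emb: array of shape (dim,)
    Returns (index of best candidate, array of all scores).
    """
    cands = np.asarray(candidate_image_embs, dtype=np.float64)
    text = np.asarray(text_emb, dtype=np.float64)
    # Normalize so the dot product equals cosine similarity.
    cands_n = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    text_n = text / np.linalg.norm(text)
    scores = cands_n @ text_n
    return int(np.argmax(scores)), scores
```

With 10 candidates per prompt, `candidate_image_embs` would have 10 rows, and only the selected row would be passed to the decoder.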

Baselines

Represented method as ‘ours’
Consists of prior and LDM
Existing methods use prompt engineering with stable diffusion
Added ‘texture background’, ‘vector illustration’ and ‘isolated on a plain white background’ as suffix to prompts
Used LAION prior as baseline for domain Diffusion Prior models
Used suffixes with LAION prior to support baseline

Color conditional generation

Table 3 provides quantitative results for the color prior model compared to baselines
Trade-off between color transfer and quality of images generated in existing baselines
SD+WCTRGB has the best performance in color transfer but the lowest generation quality
SD+ReHistoGAN has the worst performance in color transfer but a better FID score than the proposed model
Proposed model strikes a hue that corresponds to the color palette applied over an existing image

Metrics

5000 random prompts used to generate images for comparison
FID used to measure quality of generated image and alignment with training distribution
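
For reference, FID is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated images. The sketch below computes that distance for arbitrary feature arrays; in practice the features are Inception network activations, which are not computed here. The function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    feats_real = np.asarray(feats_real, dtype=np.float64)
    feats_gen = np.asarray(feats_gen, dtype=np.float64)
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)
```

Lower values indicate that the generated feature distribution is closer to the real one; identical feature sets give a distance of approximately zero.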

Results and analysis

Conclusion

Common CLIP embedding space used in text to image generation pipeline
Diffusion Prior model trained on CLIP embedding space
Prior model smaller in memory and requires less time to be trained
Can be combined with existing decoder model to generate domain specific images
Robust to complex prompts
Color transfer techniques lack semantic awareness
Color conditioned text to image generation
Diffusion Prior model trained to accept additional conditioning input
On par or better performance across all metrics and domains
Limitation: vector/illustration as exemplar results in vector image output
Color histogram of ground truth image fed to Diffusion Prior model during training
Color words in text prompt given priority over color histogram
Diffusion Prior could be reduced in capacity further
Works for most other domains and other conditional inputs
