Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Generative models produce high-quality images from a large-scale database Propose Custom Diffusion to augment existing text-to-image models Optimizing a few parameters in the text-to-image conditioning mechanism is sufficient to represent new concepts Can jointly train for multiple concepts or combine multiple fine-tuned models Generates variations of multiple, new concepts and composes them with existing concepts Outperforms several baselines and concurrent works Paper Content Introduction Text-to-image models can generate images of unprecedented quality Users often want to generate images of personal concepts, such as family, friends, pets, or personal objects and places Describing these concepts through text is difficult and unable to produce the personal concept with sufficient fidelity Model customization is needed to generate these personal concepts Challenges include model forgetting, overfitting, and concept mixing and omission Custom Diffusion is a fine-tuning technique to overcome these challenges It is computationally and memory efficient It can compose multiple new concepts efficiently Related work Generative models aim to synthesize samples from a data distribution Types of generative models include GANs, VAEs, autoregressive, flow-based, and diffusion models Models can be conditioned on a class, image, or text prompt Text-to-image models have demonstrated remarkable generalization ability Aim to adapt models to become specialists in new concepts Leverage generative models for image and model editing Transfer learning used to produce a whole distribution of images Aim to acquire multiple new concepts without catastrophic forgetting Method Updates a small subset of weights in the cross-attention layers of the model Uses a regularization set of real images to prevent overfitting Single-concept fine-tuning Stable Diffusion model is built on Latent Diffusion Model Images are encoded into a latent representation using VAE, Patch-GAN and LPIPS Diffusion models aim to approximate the original data distribution Model is trained to learn the reverse process of a fixed-length Markov chain Model is conditioned on timestep and can be further conditioned on any other modality, e....