Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Large-scale text-to-image diffusion models have made advances Existing models use text input alone, which can impede controllability GLIGEN is a novel approach that builds upon and extends existing models GLIGEN preserves concept knowledge of pre-trained model and injects grounding information into new trainable layers GLIGEN achieves open-world grounded text2img generation with caption and bounding box condition inputs GLIGEN outperforms existing supervised layout-to-image baselines by a large margin Paper Content Introduction Image generation research has seen advances in recent years GANs and text conditional autoregressive and diffusion models have been used These models have practical use cases and can generate high quality images Existing models cannot be conditioned on other input modalities apart from text Propose a method to provide new grounding conditional inputs to pretrained text-to-image diffusion models Model can generalize to unseen objects Model’s zero-shot performance on layout2img tasks outperforms prior state-of-the-art Propose a method to build upon large pretrained generative models for downstream tasks Related work Autoregressive and diffusion models are state-of-the-art for text-to-image generation DALL-E and Parti demonstrate zero-shot and scaling up abilities Diffusion models have shown promising results Masked modeling can achieve SoTA-level generation performance Make-A-Scene incorporates semantic maps into text-to-image generation Layout2Im generates images from bounding boxes Existing layout2image methods are closed-set GANs and diffusion models have been explored for various conditioning information Our work investigates how to build upon existing models to enable open-set grounded image generation Preliminaries on latent diffusion models Diffusion-based methods are effective for text2image tasks Latent Diffusion Model (LDM) and Stable Diffusion are powerful models LDM has two stages: mapping network to obtain latent representation and diffusion model on latent Training objective is to denoise latent representations of image LDM can generate impressive language-to-image results with pretraining on internet-scale data Open-set grounded image generation Grounding instruction input Grounding: entities described through text or example image, spatial configuration described with bounding box or keypoints Caption and grounding entities are processed as input tokens to the diffusion model Existing lay-out2img works only deal with closed-set setting Training data requires both text and grounding entity Three types of data: grounding, detection, detection and caption Image prompt: entity described using an image instead of language Keypoints: richer spatial configurations than bounding boxes Continual learning for grounded generation Goal is to endow new spatial grounding capabilities to existing large language-to-image generation models Models pre-trained on web-scale imagetext to gain knowledge for synthesizing realistic images Original model weights retained while expanding new capability New gated self-attention layer added to enable spatial grounding ability Attention performed over concatenation of visual and grounding tokens Original denoising objective used for model continual learning Model learns to use additional localization information while retaining pre-trained concept knowledge Versatile interface allows user to ground entities that exist in caption input or add objects freely Scheduled sampling scheme used in inference to improve visual quality and extend model to other domains Experiments Evaluated model’s grounded text2img generation in closed-set and open-set settings Ablated components of model Showed extensions to image prompt and keypoint grounded generation Conducted quantitative experiments using pretrained LDM on LAION Closed-set grounded text2img generation Evaluated generation quality and grounding accuracy of model in closed-set setting Trained and evaluated on COCO2014 dataset Used 3 types of grounding instructions Compared to baseline models Used FID and YOLO score to evaluate Model trained with detection annotation instructions had best performance Combining data from all grounding instructions can lead to complementary benefits Used gated self-attention to absorb grounding instruction Ablated on null caption and gated cross-attention Achieved state-of-the-art performance for image quality and grounding accuracy Pretrained model on larger dataset and evaluated zero-shot and finetuned results Open-set grounded text2img generation GLIGEN can generate grounded entities beyond the COCO categories GLIGEN learns to re-position the visual features corresponding to the grounding entities Model is evaluated on LVIS and outperforms supervised baseline Performance increases as training data is scaled up Model gains grounding ability compared to vanilla Stable Diffusion Inpainting comparison GLIGEN can be used for inpainting tasks An experiment was conducted on the COCO dataset to inpaint randomly masked objects of different sizes Results show that GLIGEN inpainted objects more tightly occupy the missing region compared to baselines Keypoints grounding Model uses bounding boxes and human keypoints as grounding conditions for generation Model compared to pix2pixHD Model trained with and without captions Model generates better image quality than pix2pixHD Model can be used to specify scene and person’s gender for image creation Image grounding Image grounded generation uses a reference image to represent a grounded entity....