Aligning Text-to-Image Models using Human Feedback
Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Deep generative models have been used for text-to-image synthesis. Current models often generate images that are not well-aligned with text prompts. A fine-tuning method is proposed to improve alignment using human feedback. Paper Content Introduction Deep generative models have been successful in generating high-quality images from text prompts Scaling of deep generative models to large-scale datasets has been a factor in this success Challenges remain in domains where large-scale text-to-image models fail to generate images that are well-aligned with text prompts Learning from human feedback has emerged as a powerful solution for aligning model behavior with human intent Proposed fine-tuning method for aligning text-to-image models using human feedback Fine-tuning with human feedback significantly improves the image-text alignment of a text-to-image model Learned reward function predicts human assessments of the quality more accurately than the CLIP score Careful investigations on several design choices are important in balancing alignment-fidelity tradeoffs Related work Variational auto-encoders, generative adversarial networks, auto-regressive models, and diffusion models have been proposed for image distributions Combined with language encoders, these models have shown impressive results in text-to-image generation Text-to-image models struggle to generate images that are well-aligned with text prompts Techniques such as character-aware text encoders and structured representations of language inputs have been investigated to address these issues Human feedback has been used to improve various AI systems We propose a fine-tuning method with human feedback for improving text-to-image models Various evaluation protocols have been proposed to measure image-text alignment We train a reward function that is better aligned with human evaluations by exploiting pre-trained representations and human feedback data Main method Generate a set of diverse images from text prompts Human raters provide binary feedback on images Train a reward model to predict human feedback Fine-tune text-to-image model using reward-weighted log likelihood Human data collection Generated image-text dataset with prompts combining words or phrases from three categories (count, color, background) Collected binary feedback from human labelers on the image-text dataset Reward learning Measure image-text alignment by learning a reward function Data augmentation to improve data-efficiency and performance Generate N-1 text prompts with different semantics Use reward function to classify original prompt Auxiliary loss to encourage low values for prompts with different semantics Combined loss to combine penalty parameter Updating the text-to-image model Update text-to-image model with parameters θ by minimizing loss Minimize reward-weighted negative log-likelihood on model-generated dataset Minimize pre-training loss to reduce NLL on pre-training dataset Regularization in loss function enables model to generate more natural images Experiments Conducted experiments to test efficacy of fine-tuning approach Used human feedback in experiments Experimental setup Stable diffusion v1....