Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Misalignment between model predictions and intended usage can be damaging
- Reinforcement learning techniques can be used to align models with a task reward
- This approach is effective for multiple computer vision tasks
- This approach has potential to be widely useful for better aligning models with computer vision tasks
Paper Content
Introduction
- Complex outputs in computer vision require alignment with task risk
- Researchers use postprocessing, global loss, and altered input data to improve behavior
- NLP and RL fields have studied this problem and use imitation and reinforcement learning
- Reward optimization has not been explored for computer vision tasks
- Reward optimization works out-of-the-box for a wide range of computer vision tasks
- Reward optimization can be used with evaluation metrics, human feedback, or holistic system performance
Related work
- Optimizing computer vision metrics by computing pseudo-gradients and approximations
- CRF loss used to ensure segmentation mask consistency
- Optimizing text generation with MLE and REINFORCE
- Generalization of sampled outputs is an underlying issue
- Reinforcement learning used for vision tasks to attend to parts of the image and iterative refinement
Tuning models with rewards
- Formulate computer vision task as learning a function that maps an input to an output
- Maximum-likelihood training to maximize likelihood of ground-truth annotations
- Goal is to learn a conditional distribution that maximizes a reward function
- Two step framework: pretraining with maximum-likelihood estimation and tuning with REINFORCE algorithm
- Maximum-likelihood pretraining captures distribution of training data
- REINFORCE algorithm tunes model to optimize an arbitrary reward function
- Pretrained MLE model provides good initial sampling strategy
Practical applications
- Use encoder-decoder architecture with ViT encoder and Transformer decoder
- Pretrain model with maximum-likelihood estimation and tune with task reward
- Use Adafactor variant as optimizer and sample greedily at inference time
- Validation metrics may differ from task risk in real scenario, requiring further validation or reward design
Panoptic segmentation
- Panoptic segmentation combines instance and semantic segmentation
- Panoptic Quality (PQ) is used to measure the completeness and detail of predictions
- Pretrained encoder-decoder Transformer model on COCO captions
- Tuning for CIDEr to optimize with batch size 256 and 10k steps
Object detection
- The goal of object detection is to predict a tight bounding box for objects in an image
- Many approaches have been proposed, but they don’t offer an explicit way to obtain a model aligned with the task risk
- We use detection-specific rewards to optimize a vanilla detection data likelihood model
- We represent a set of bounding boxes as a discrete sequence
- We use the standard ViT-B/16 as image encoder and 6-layer auto-regressive Transformer decoder
- We pretrain a MLE model and then tune it with rewards for recall and mAP
- We tune our MLE model to optimize the recall reward
- We use a supervised loss to learn the expected IoU scores of sampled outputs plus the recall reward
- We improve the reward by computing its value at various IoU ranges and by weighting each class
- Our strong ViT-B result demonstrates the promise of the proposed task reward tuning
Colorization
- Colorization task is adding color to grayscale images
- Standard image colorization models use MLE to generate plausible image coloring
- Tuning MLE model to produce vivid images with “colorfulness” reward
- Reward discourages gray colors and promotes color diversity
- Tuning step increases vividness and diversity of predicted colors
Image captioning
- Image captioning is the task of generating text descriptions for images.
- CIDEr is a metric that measures caption quality based on how similar it is to a set of human-written reference captions.
- CIDEr takes into account the frequency of words across all captions.
- REINFORCE is an established technique used to optimize CIDEr reward in image captioning.
Analysis
Reward distribution
- We compared two models in an image captioning example
- The reward tuned model had higher expected rewards than the MLE model
- The MLE model had higher rewards in the top 1%-tile, but we can’t benefit from them
Reward-risk progression
- Decomposing a metric into per-example reward can lead to divergence between the reward and metric.
- Empirically, no significant divergence was observed between reward and goal metrics.
- Object detection mAP score quickly increased from 40.2% to 53.2% over 60k steps.
- MLE model includes high-quality outputs in a large enough pool.
- Low reward when using the most likely out of N samples even when using 10000 samples.
Discussion and limitations
- Reward hacking is possible
- Reward design can be simple or complex
- Advanced RL techniques are not necessary
- Data for imitation learning can be hard to predict
- Training cost is proportional to inference usage
Conclusion
- Reward optimization is a viable option to optimize computer vision tasks
- Pretraining followed by reward optimization can improve models for object detection and panoptic segmentation
- Reward optimization can qualitatively affect the results of colorization models
- Reward optimization is competitive with recent works in captioning