Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Misalignment between model predictions and intended usage can be damaging
Reinforcement learning techniques can be used to align models with a task reward
This approach is effective for multiple computer vision tasks
This approach has potential to be widely useful for better aligning models with computer vision tasks

Paper Content

Introduction

Complex outputs in computer vision require alignment with task risk
Researchers use postprocessing, global loss, and altered input data to improve behavior
NLP and RL fields have studied this problem and use imitation and reinforcement learning
Reward optimization has not been explored for computer vision tasks
Reward optimization works out-of-the-box for a wide range of computer vision tasks
Reward optimization can be used with evaluation metrics, human feedback, or holistic system performance

Related work

Optimizing computer vision metrics by computing pseudo-gradients and approximations
CRF loss used to ensure segmentation mask consistency
Optimizing text generation with MLE and REINFORCE
Generalization of sampled outputs is an underlying issue
Reinforcement learning used for vision tasks to attend to parts of the image and iterative refinement

Tuning models with rewards

Formulate computer vision task as learning a function that maps an input to an output
Maximum-likelihood training to maximize likelihood of ground-truth annotations
Goal is to learn a conditional distribution that maximizes a reward function
Two step framework: pretraining with maximum-likelihood estimation and tuning with REINFORCE algorithm
Maximum-likelihood pretraining captures distribution of training data
REINFORCE algorithm tunes model to optimize an arbitrary reward function
Pretrained MLE model provides good initial sampling strategy
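
The two-step recipe above can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's implementation: the "model" is a context-free categorical policy over four hypothetical outputs, the reward table `REWARDS` is made up, the starting logits stand in for an MLE-pretrained model, and the batch-mean reward is used as a baseline. All function names are illustrative.

```python
import math
import random

# Hypothetical per-output rewards for a toy task with four possible outputs.
REWARDS = [0.1, 0.2, 1.0, 0.3]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits):
    return sum(p * r for p, r in zip(softmax(logits), REWARDS))


def reinforce_step(logits, rng, num_samples=64, lr=0.5):
    """One REINFORCE update on the logits of a categorical policy."""
    probs = softmax(logits)
    samples = rng.choices(range(len(probs)), weights=probs, k=num_samples)
    rewards = [REWARDS[y] for y in samples]
    # Assumption: the batch-mean reward serves as a variance-reducing baseline.
    baseline = sum(rewards) / num_samples
    grad = [0.0] * len(logits)
    for y, r in zip(samples, rewards):
        for k in range(len(logits)):
            # d log p(y) / d logit_k = 1[k == y] - p_k for a softmax policy.
            grad[k] += (r - baseline) * ((1.0 if k == y else 0.0) - probs[k])
    # Gradient ascent on the sampled estimate of the expected reward.
    return [x + lr * g / num_samples for x, g in zip(logits, grad)]


def tune(logits, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return logits


# Stand-in for an MLE-pretrained model: it prefers output 0, not the best output 2.
pretrained = [1.0, 0.5, 0.0, 0.5]
tuned = tune(pretrained)
```

Comparing `expected_reward(pretrained)` with `expected_reward(tuned)` shows the effect of tuning. The starting point matters: the pretrained policy already puts non-negligible probability on the high-reward output, so sampling finds it quickly.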

Practical applications

Use encoder-decoder architecture with ViT encoder and Transformer decoder
Pretrain model with maximum-likelihood estimation and tune with task reward
Use Adafactor variant as optimizer and sample greedily at inference time
Validation metrics may differ from task risk in real scenario, requiring further validation or reward design

Panoptic segmentation

Panoptic segmentation combines instance and semantic segmentation
Panoptic Quality (PQ) is used to measure the completeness and detail of predictions
Pretrained encoder-decoder Transformer model on COCO captions
Tuned to optimize CIDEr with batch size 256 for 10k steps
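
For reference, Panoptic Quality is commonly defined as the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment match if their IoU exceeds 0.5. The sketch below computes it for a single class, assuming segments are given as sets of pixel indices. How the paper turns PQ into a per-example reward is not covered here; the function names are illustrative.

```python
def iou(a, b):
    """IoU of two segments given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(pred_segments, gt_segments):
    """PQ for one class; a pair counts as a match when its IoU exceeds 0.5."""
    matched_ious = []
    matched_pred = set()
    for gt in gt_segments:
        for i, pred in enumerate(pred_segments):
            if i in matched_pred:
                continue
            score = iou(pred, gt)
            if score > 0.5:
                matched_ious.append(score)
                matched_pred.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp  # predictions without a matching ground truth
    fn = len(gt_segments) - tp    # ground-truth segments that were missed
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


# Toy example: one good match (IoU 0.75), one spurious prediction, one miss.
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {6, 7}]
pq = panoptic_quality(pred, gt)
```

In the toy example there is one true positive, one false positive, and one false negative, so the score is 0.75 / (1 + 0.5 + 0.5).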

Object detection

The goal of object detection is to predict a tight bounding box for objects in an image
Many approaches have been proposed, but they don’t offer an explicit way to obtain a model aligned with the task risk
We use detection-specific rewards to tune a vanilla detection model trained by maximizing data likelihood
We represent a set of bounding boxes as a discrete sequence
We use the standard ViT-B/16 as image encoder and 6-layer auto-regressive Transformer decoder
We pretrain an MLE model and then tune it with rewards for recall and mAP
We tune our MLE model to optimize the recall reward
We use a supervised loss to learn the expected IoU scores of sampled outputs plus the recall reward
We improve the reward by computing its value at various IoU ranges and by weighting each class
Our strong ViT-B result demonstrates the promise of the proposed task reward tuning
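
A recall-style reward of the kind mentioned above might look as follows. This is a hedged sketch, not the paper's definition: it assumes boxes in `(x_min, y_min, x_max, y_max)` format, a single IoU threshold of 0.5, greedy one-to-one matching, and a reward equal to the fraction of ground-truth boxes that are matched. The refinements described above (several IoU ranges, per-class weighting) are left out.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_reward(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by a distinct predicted box."""
    if not gt_boxes:
        return 0.0
    used = set()
    matched = 0
    for gt in gt_boxes:
        best_iou, best_idx = 0.0, None
        for i, pred in enumerate(pred_boxes):
            if i in used:
                continue  # each prediction may match at most one ground truth
            score = box_iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, i
        if best_idx is not None and best_iou >= iou_threshold:
            used.add(best_idx)
            matched += 1
    return matched / len(gt_boxes)


# Toy example: the first ground-truth box is found, the second is missed.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 10, 10), (50, 50, 60, 60)]
reward = recall_reward(pred_boxes, gt_boxes)
```

Because the reward is computed per image from sampled box sets, it can be plugged directly into a REINFORCE-style update without being differentiable.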

Colorization

The colorization task is to add color to grayscale images
Standard image colorization models use MLE to generate plausible image coloring
Tuning MLE model to produce vivid images with “colorfulness” reward
Reward discourages gray colors and promotes color diversity
Tuning step increases vividness and diversity of predicted colors
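
To make the two ingredients of such a reward concrete, here is a hypothetical "colorfulness" score. It is an assumption-laden stand-in rather than the paper's reward: it works in HSV via the standard-library `colorsys` module, treats a pixel as vivid when saturation times value is at least `sat_threshold`, and measures diversity as the normalized entropy of a coarse hue histogram. The thresholds and bin count are arbitrary choices.

```python
import colorsys
import math


def colorfulness_reward(pixels, sat_threshold=0.2, hue_bins=8):
    """Hypothetical reward in [0, 1]: vivid-pixel fraction times hue diversity.

    `pixels` is a list of (r, g, b) tuples with components in [0, 1].
    """
    if not pixels:
        return 0.0
    counts = [0] * hue_bins
    vivid = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        if s * v >= sat_threshold:  # gray or very dark pixels do not count
            vivid += 1
            counts[min(int(h * hue_bins), hue_bins - 1)] += 1
    if vivid == 0:
        return 0.0
    # Entropy of the hue histogram over vivid pixels, normalized to [0, 1].
    entropy = -sum((c / vivid) * math.log(c / vivid) for c in counts if c)
    return (vivid / len(pixels)) * (entropy / math.log(hue_bins))


gray_image = [(0.5, 0.5, 0.5)] * 4
single_color_image = [(1.0, 0.0, 0.0)] * 4
varied_image = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]
```

With this definition a gray image and a single-color image both score zero, while an image with several distinct vivid hues scores higher. A product of the two terms means a model cannot earn reward by satisfying only one of them.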

Image captioning

Image captioning is the task of generating text descriptions for images.
CIDEr is a metric that measures caption quality based on how similar it is to a set of human-written reference captions.
CIDEr takes into account the frequency of words across all captions.
REINFORCE is an established technique used to optimize CIDEr reward in image captioning.
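
The idea behind CIDEr can be sketched with a deliberately simplified score. The code below is not the official metric: it uses only unigrams (CIDEr uses n-grams up to length 4, and the CIDEr-D variant adds further adjustments), whitespace tokenization, and hypothetical helper names. It keeps the core idea of TF-IDF weighting, where words that appear in the references of many images contribute little, followed by cosine similarity averaged over the references.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """For each word, count the images whose reference captions contain it."""
    doc_freq = Counter()
    for references in corpus:
        words = set()
        for caption in references:
            words.update(caption.split())
        doc_freq.update(words)
    return doc_freq


def tfidf_vector(tokens, doc_freq, num_docs):
    counts = Counter(tokens)
    return {
        w: (c / len(tokens)) * math.log(num_docs / max(doc_freq.get(w, 0), 1))
        for w, c in counts.items()
    }


def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def simple_cider(candidate, references, doc_freq, num_docs):
    """Average TF-IDF cosine similarity between a caption and its references."""
    cand = tfidf_vector(candidate.split(), doc_freq, num_docs)
    scores = [
        cosine(cand, tfidf_vector(ref.split(), doc_freq, num_docs))
        for ref in references
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus: one list of reference captions per image.
corpus = [
    ["a dog runs", "a dog plays"],
    ["a cat sleeps", "the cat naps"],
    ["a bird flies"],
]
df = document_frequencies(corpus)
good = simple_cider("a dog runs", corpus[0], df, len(corpus))
bad = simple_cider("a cat sleeps", corpus[0], df, len(corpus))
```

In the toy corpus the word "a" appears in the references of every image, so its weight is zero and a caption cannot score well merely by sharing it with the references.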

Analysis

Reward distribution

We compared two models in an image captioning example
The reward tuned model had higher expected rewards than the MLE model
The MLE model reached higher rewards in the top 1% of its samples, but these high-reward samples cannot be exploited in practice

Reward-risk progression

Decomposing a metric into per-example reward can lead to divergence between the reward and metric.
Empirically, no significant divergence was observed between reward and goal metrics.
Object detection mAP score quickly increased from 40.2% to 53.2% over 60k steps.
MLE model includes high-quality outputs in a large enough pool.
Picking the most likely of N samples yields low reward, even with N = 10000.
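
The difference between having good outputs somewhere in a sample pool and being able to select them can be shown with a small sketch. The sample records below are invented for illustration, and the function names are hypothetical. `most_likely_of_n` uses only the model's own log-probability, which is available at inference time, whereas `best_of_n` needs the reward, which typically depends on ground truth.

```python
def most_likely_of_n(samples):
    """Select the sample to which the model assigns the highest log-probability."""
    return max(samples, key=lambda s: s["log_prob"])


def best_of_n(samples):
    """Oracle selection by reward; normally not available at inference time."""
    return max(samples, key=lambda s: s["reward"])


# Invented toy pool: the highest-reward sample is not the most likely one.
samples = [
    {"output": "A", "log_prob": -1.0, "reward": 0.2},
    {"output": "B", "log_prob": -2.5, "reward": 0.9},
    {"output": "C", "log_prob": -4.0, "reward": 0.5},
]
picked = most_likely_of_n(samples)
oracle = best_of_n(samples)
```

When likelihood and reward rank samples differently, as in this toy pool, enlarging the pool does not help likelihood-based selection. Reward tuning instead shifts probability mass toward the high-reward outputs.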

Discussion and limitations

Reward hacking is possible
Reward design can be simple or complex
Advanced RL techniques are not necessary
Data for imitation learning can be hard to predict
Training cost is proportional to inference usage

Conclusion

Reward optimization is a viable option to optimize computer vision tasks
Pretraining followed by reward optimization can improve models for object detection and panoptic segmentation
Reward optimization can qualitatively affect the results of colorization models
Reward optimization is competitive with recent works in captioning
