  • Misalignment between model predictions and intended usage can be damaging
  • Reinforcement learning techniques can be used to align models with a task reward
  • This approach is effective for multiple computer vision tasks
  • This approach has the potential to be widely useful for aligning models with their intended use across computer vision tasks

  • Complex outputs in computer vision require alignment with task risk
  • Researchers use postprocessing, global loss, and altered input data to improve behavior
  • NLP and RL fields have studied this problem and use imitation and reinforcement learning
  • Reward optimization remains largely unexplored for computer vision tasks
  • Reward optimization works out-of-the-box for a wide range of computer vision tasks
  • Reward optimization can be used with evaluation metrics, human feedback, or holistic system performance
  • Optimizing computer vision metrics by computing pseudo-gradients and approximations
  • CRF loss used to ensure segmentation mask consistency
  • Optimizing text generation with MLE and REINFORCE
  • Generalization of sampled outputs is an underlying issue
  • Reinforcement learning used for vision tasks to attend to parts of the image and iterative refinement

Tuning models with rewards

  • Formulate computer vision task as learning a function that maps an input to an output
  • Maximum-likelihood training to maximize likelihood of ground-truth annotations
  • Goal is to learn a conditional distribution that maximizes a reward function
  • Two-step framework: pretraining with maximum-likelihood estimation, then tuning with the REINFORCE algorithm
  • Maximum-likelihood pretraining captures distribution of training data
  • REINFORCE algorithm tunes model to optimize an arbitrary reward function
  • Pretrained MLE model provides good initial sampling strategy
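
The two-step recipe above can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a categorical "model" over four outputs whose starting logits stand in for an MLE-pretrained model, a hypothetical reward that prefers output 1, and a batch-mean baseline for variance reduction (the baseline choice is an assumption of this sketch).

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(logits, reward_fn, rng, num_samples=64, lr=0.5):
    """One REINFORCE update on the logits of a categorical policy."""
    probs = softmax(logits)
    samples = rng.choice(len(logits), size=num_samples, p=probs)
    rewards = np.array([reward_fn(y) for y in samples], dtype=float)
    baseline = rewards.mean()  # assumption: batch-mean baseline
    grad = np.zeros_like(logits)
    for y, r in zip(samples, rewards):
        # Gradient of log softmax(logits)[y] w.r.t. logits: onehot(y) - probs
        score = -probs.copy()
        score[y] += 1.0
        grad += (r - baseline) * score
    grad /= num_samples
    return logits + lr * grad  # gradient ascent on the expected reward

rng = np.random.default_rng(0)
# Stand-in for an MLE-pretrained model: most mass on outputs 0 and 1.
logits = np.log(np.array([0.45, 0.45, 0.05, 0.05]))
reward_fn = lambda y: 1.0 if y == 1 else 0.0  # hypothetical task reward

initial_expected = float(softmax(logits)[1])
for _ in range(200):
    logits = reinforce_step(logits, reward_fn, rng)
tuned_expected = float(softmax(logits)[1])
```

Because the starting distribution already places substantial mass on the rewarded output, sampled batches contain useful reward signal from the first step, which is the role the pretrained MLE model plays in the framework.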

Practical applications

  • Use encoder-decoder architecture with ViT encoder and Transformer decoder
  • Pretrain model with maximum-likelihood estimation and tune with task reward
  • Use an Adafactor variant as the optimizer and decode greedily at inference time
  • Validation metrics may differ from task risk in real scenario, requiring further validation or reward design

Panoptic segmentation

  • Panoptic segmentation combines instance and semantic segmentation
  • Panoptic Quality (PQ) is used to measure the completeness and detail of predictions
  • Pretrained encoder-decoder Transformer model on COCO captions
  • Tune for the CIDEr reward with batch size 256 for 10k steps
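
For reference, Panoptic Quality for a single class is commonly defined as the sum of IoUs over matched segment pairs divided by |TP| + 0.5|FP| + 0.5|FN|, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch below assumes the candidate IoUs have already been computed; the function name and argument layout are illustrative only.

```python
def panoptic_quality(candidate_ious, num_pred, num_gt):
    """PQ for one class, given IoUs of candidate (prediction, ground truth) pairs."""
    # A pair counts as a true positive only if its IoU exceeds 0.5.
    tp_ious = [iou for iou in candidate_ious if iou > 0.5]
    tp = len(tp_ious)
    fp = num_pred - tp  # predicted segments left unmatched
    fn = num_gt - tp    # ground-truth segments left unmatched
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(tp_ious) / denom if denom > 0 else 0.0

# Three predicted segments, two ground-truth segments, two matches.
pq = panoptic_quality([0.8, 0.6, 0.3], num_pred=3, num_gt=2)
```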

Object detection

  • The goal of object detection is to predict a tight bounding box for each object in an image
  • Many approaches have been proposed, but they don’t offer an explicit way to obtain a model aligned with the task risk
  • We use detection-specific rewards to optimize a vanilla detection data likelihood model
  • We represent a set of bounding boxes as a discrete sequence
  • We use the standard ViT-B/16 as image encoder and 6-layer auto-regressive Transformer decoder
  • We pretrain an MLE model and then tune it with rewards for recall and mAP
  • We tune our MLE model to optimize the recall reward
  • We use a supervised loss to learn the expected IoU scores of sampled outputs plus the recall reward
  • We improve the reward by computing its value at various IoU ranges and by weighting each class
  • Our strong ViT-B result demonstrates the promise of the proposed task reward tuning
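
To make the idea of a recall-style reward concrete, here is a minimal sketch that counts how many ground-truth boxes are matched by a distinct predicted box at a given IoU threshold. It is a simplification under stated assumptions: class labels are ignored, matching is greedy, and the box format `(ymin, xmin, ymax, xmax)` and the function names are hypothetical rather than taken from the source.

```python
def box_iou(a, b):
    """IoU of two boxes given as (ymin, xmin, ymax, xmax)."""
    inter_h = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_w = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_h * inter_w
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_reward(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Number of ground-truth boxes greedily matched to a distinct prediction."""
    used = set()
    matched = 0
    for gt in gt_boxes:
        best_j, best_iou = None, iou_threshold
        for j, pred in enumerate(pred_boxes):
            if j in used:
                continue  # each prediction may match at most one ground truth
            iou = box_iou(pred, gt)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            used.add(best_j)
            matched += 1
    return matched

gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(0, 0, 10, 9), (50, 50, 60, 60)]
reward = recall_reward(pred, gt)
```

A per-example reward of this kind can be evaluated on every sampled box sequence, which is what makes it usable inside a REINFORCE-style tuning step.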


Colorization

  • Colorization is the task of adding color to grayscale images
  • Standard image colorization models use MLE to generate plausible image coloring
  • Tuning MLE model to produce vivid images with “colorfulness” reward
  • Reward discourages gray colors and promotes color diversity
  • Tuning step increases vividness and diversity of predicted colors
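
One way to picture a reward that discourages gray and promotes color diversity is sketched below. This is a hypothetical reward written for illustration, not the one used by the authors: it multiplies the fraction of pixels whose saturation exceeds a threshold by the normalized entropy of a hue histogram. The threshold and bin count are arbitrary assumptions.

```python
import colorsys
import numpy as np

def colorfulness_reward(rgb, sat_threshold=0.1, hue_bins=7):
    """Hypothetical reward in [0, 1] for an RGB image with values in [0, 1]."""
    pixels = rgb.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    hue, sat = hsv[:, 0], hsv[:, 1]
    vivid = sat > sat_threshold      # pixels that are not (nearly) gray
    if not vivid.any():
        return 0.0
    vivid_fraction = vivid.mean()    # term that discourages gray pixels
    hist, _ = np.histogram(hue[vivid], bins=hue_bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log(p)).sum()
    diversity = entropy / np.log(hue_bins)  # term that promotes hue diversity
    return float(vivid_fraction * diversity)

gray = np.full((4, 4, 3), 0.5)
two_tone = np.zeros((4, 4, 3))
two_tone[:, :2] = [1.0, 0.0, 0.0]  # left half red
two_tone[:, 2:] = [0.0, 1.0, 0.0]  # right half green
gray_reward = colorfulness_reward(gray)
two_tone_reward = colorfulness_reward(two_tone)
```

The per-pixel loop is slow and only meant for small illustrative inputs; a practical version would vectorize the color conversion.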

Image captioning

  • Image captioning is the task of generating text descriptions for images.
  • CIDEr is a metric that measures caption quality based on how similar it is to a set of human-written reference captions.
  • CIDEr takes into account how frequently words occur across all reference captions, so common words carry less weight.
  • REINFORCE is an established technique used to optimize CIDEr reward in image captioning.
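
The sketch below illustrates the idea behind a CIDEr-style score in a deliberately simplified form. It is not the real CIDEr metric: it assumes unigrams only, whitespace tokenization, and a tiny hypothetical reference corpus, whereas CIDEr itself averages over several n-gram lengths. What it keeps is the core mechanism of TF-IDF weighting followed by cosine similarity against each reference.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, num_docs):
    """TF-IDF weights for one caption; unseen words are treated as df = 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * math.log(num_docs / max(1, doc_freq.get(w, 0)))
            for w, c in counts.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def simple_cider(candidate, references, doc_freq, num_docs):
    """Mean TF-IDF cosine similarity between a candidate and its references."""
    cand = tfidf_vector(candidate.split(), doc_freq, num_docs)
    scores = [cosine(cand, tfidf_vector(r.split(), doc_freq, num_docs))
              for r in references]
    return sum(scores) / len(scores)

# Hypothetical reference corpus: image id -> human-written captions.
corpus = {
    "img1": ["a dog runs on grass", "a dog playing outside"],
    "img2": ["a cat sits on a sofa"],
    "img3": ["a man rides a bike"],
}
# Document frequency: number of images whose references contain the word.
doc_freq = Counter()
for refs in corpus.values():
    doc_freq.update({w for r in refs for w in r.split()})

good = simple_cider("a dog runs", corpus["img1"], doc_freq, len(corpus))
bad = simple_cider("a cat sits", corpus["img1"], doc_freq, len(corpus))
```

Note that the word "a" appears in the references of every image, so its weight is zero and it contributes nothing to the score.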


Reward distribution

  • We compared two models in an image captioning example
  • The reward tuned model had higher expected rewards than the MLE model
  • The MLE model had higher rewards in the top 1% of its samples, but we can’t benefit from them because the model’s likelihood does not identify which samples they are

Reward-risk progression

  • Decomposing a metric into per-example reward can lead to divergence between the reward and metric.
  • Empirically, no significant divergence was observed between reward and goal metrics.
  • Object detection mAP score quickly increased from 40.2% to 53.2% over 60k steps.
  • A large enough pool of samples from the MLE model includes high-quality outputs.
  • Selecting the most likely of N samples still yields low reward, even with N = 10,000.
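
The gap described in the last two points can be made concrete with a small sketch. The numbers below are made-up placeholders for N = 4 samples, not results from the source; the point is only that ranking samples by model likelihood and ranking them by reward can select different outputs, and the reward-based (oracle) choice is unavailable at inference time because it needs the ground truth.

```python
import numpy as np

def pick_most_likely(log_probs):
    """Index of the sample the model itself considers most likely."""
    return int(np.argmax(log_probs))

def pick_oracle_best(rewards):
    """Index of the sample with the highest reward (requires ground truth)."""
    return int(np.argmax(rewards))

# Placeholder values for four samples drawn from one model.
log_probs = np.array([-2.0, -5.5, -3.1, -7.0])
rewards = np.array([0.3, 0.9, 0.4, 0.6])

chosen = pick_most_likely(log_probs)
oracle = pick_oracle_best(rewards)
gap = float(rewards[oracle] - rewards[chosen])
```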

Discussion and limitations

  • Reward hacking is possible
  • Reward design can be simple or complex
  • Advanced RL techniques are not necessary
  • Data for imitation learning can be hard to predict
  • Training cost is proportional to inference usage


Conclusion

  • Reward optimization is a viable option to optimize computer vision tasks
  • Pretraining followed by reward optimization can improve models for object detection and panoptic segmentation
  • Reward optimization can qualitatively affect the results of colorization models
  • Reward optimization is competitive with recent works in captioning