Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Proposed a new framework for object detection called DiffusionDet
DiffusionDet formulates object detection as a denoising diffusion process from noisy boxes to object boxes
During training, object boxes diffuse from ground-truth boxes to random distribution
Model learns to reverse this noising process during training; at inference it progressively refines a set of randomly generated boxes into the output results
Evaluations on MS-COCO and LVIS show favorable performance compared to previous detectors
Random boxes are effective object candidates
Object detection can be solved in a generative way

Paper Content

Introduction

We propose DiffusionDet, a novel noise-to-box object detection framework, which decouples the training and evaluation and enables progressive refinement.
We evaluate DiffusionDet on MS-COCO and LVIS datasets, and it achieves competitive performance compared to existing approaches.
We demonstrate the effectiveness of DiffusionDet in various settings, such as different backbones, different sampling steps, and different numbers of random boxes.
We release the source code of DiffusionDet for reproducibility.
Object detection aims to predict bounding boxes and category labels
Used in many related recognition scenarios
Modern approaches use empirically designed object candidates
Learnable object queries have been proposed
Question: simpler approach without learnable queries?
Noise-to-box approach proposed: requires neither heuristic object priors nor learnable queries
DiffusionDet proposed - generative denoising process
Decouples training and evaluation, enables progressive refinement
Evaluated on MS-COCO and LVIS datasets
Source code released

Related work

Most object detectors perform box regression and category classification on empirical object priors, such as proposals, anchors, or points.
DETR [10] proposed a query-based detection paradigm.
Diffusion models are a class of deep generative models.
Diffusion models have been used for image generation, segmentation, and other tasks.
To the authors' knowledge, this is the first work to use a diffusion model for object detection.

Preliminaries

Object detection learns from input-target pairs: the input is an image, and the target is a set of bounding boxes with their category labels.
Diffusion models are a class of likelihood-based models inspired by nonequilibrium thermodynamics.
A neural network is trained to predict the bounding boxes from noisy boxes, conditioned on the corresponding image.
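
The two ideas above can be written compactly in standard diffusion-model notation. The symbols here are assumptions, not taken from these notes: z_0 stands for the clean boxes, z_t for the boxes after t noising steps, \bar{\alpha}_t for the cumulative signal coefficient, x for the image, and f_\theta for the network.

```latex
% Forward (noising) process: clean boxes z_0 are scaled down and Gaussian noise is added
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\, I\right)

% Training objective: the network predicts the clean boxes from the noisy ones,
% conditioned on the time step t and the image x
\mathcal{L}_{\text{train}} = \tfrac{1}{2}\, \left\lVert f_\theta(z_t, t, x) - z_0 \right\rVert^2
```

When \bar{\alpha}_t is close to 1 the boxes are almost clean; when it is close to 0 they are almost pure noise.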

Architecture

Diffusion model generates data samples iteratively
Computationally intractable to apply the full model to the raw image at every iterative step
Model separated into two parts: image encoder and detection decoder
Image encoder takes raw image and extracts high-level features
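Detection decoder takes a set of proposal boxes, crops RoI features from the encoder's feature map, and sends them to the detection head for box regression and classification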
DiffusionDet begins from random boxes, re-uses detector head in iterative steps

Training

Construct diffusion process from ground-truth boxes to noisy boxes
Pad ground truth boxes to fixed number
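Explore padding strategies (e.g. repeating existing ground-truth boxes, concatenating random boxes or image-size boxes)
Add Gaussian noise to padded boxes
Noise scale is controlled by a signal coefficient that follows a monotonically decreasing cosine schedule over time steps, so later steps are noisier
Use set prediction loss on set of predictions
Assign multiple predictions to each ground truth by selecting the top-k lowest-cost predictions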
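
The padding and noising steps above can be sketched as follows. This is a minimal NumPy illustration and not the authors' implementation: the function names, the 300-box and 1000-step defaults, the (cx, cy, w, h) box format, the distribution of the padded random boxes, and the mapping of coordinates to [-scale, scale] are all assumptions made for the example. The signal scaling factor of 2.0 is the value reported in the ablation.

```python
import numpy as np

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level under a cosine schedule; decreases from 1 toward 0."""
    f = lambda u: np.cos((u / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def pad_boxes(gt_boxes, num_boxes, rng):
    """Pad (or subsample) ground-truth boxes to a fixed count.

    Boxes are (cx, cy, w, h), normalized to [0, 1]. Padding concatenates
    random boxes; the distribution used here is an assumption.
    """
    n = len(gt_boxes)
    if n >= num_boxes:
        keep = rng.choice(n, size=num_boxes, replace=False)
        return gt_boxes[keep]
    extra = np.clip(rng.normal(0.5, 1 / 6, size=(num_boxes - n, 4)), 0.0, 1.0)
    return np.concatenate([gt_boxes, extra], axis=0)

def corrupt_boxes(gt_boxes, t, num_boxes=300, num_steps=1000, scale=2.0, rng=None):
    """Return noisy boxes at time step t, built from padded ground-truth boxes."""
    if rng is None:
        rng = np.random.default_rng()
    x0 = pad_boxes(gt_boxes, num_boxes, rng)
    x0 = (x0 * 2.0 - 1.0) * scale              # map [0, 1] to [-scale, scale]
    a_bar = cosine_alpha_bar(t, num_steps)
    noise = rng.standard_normal(x0.shape)
    # Larger t means smaller a_bar, so less signal and more noise.
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
```

For example, `corrupt_boxes(gt, t=500, num_boxes=10)` returns a (10, 4) array in which the ground-truth boxes and the padded boxes have been noised to the same level.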

Inference

DiffusionDet is a denoising sampling process from noise to object boxes.
In each sampling step, boxes are sent into the detection decoder to predict category classification and box coordinates.
Box renewal strategy is used to filter out undesired boxes and replace them with random boxes.
DiffusionDet can be evaluated with an arbitrary number of random boxes and sampling steps.
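
A rough sketch of the sampling loop described above, under stated assumptions. `toy_decoder` is a hypothetical stand-in for the real detection decoder, the update between steps is a deterministic DDIM-style step, and the threshold, step counts, and function names are illustrative only. For brevity the sketch returns the last step's predictions and leaves out the NMS ensembling across steps.

```python
import numpy as np

def alpha_bar(t, num_steps, s=0.008):
    """Cosine schedule for the cumulative signal level (assumed form)."""
    f = lambda u: np.cos((u / num_steps + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def toy_decoder(boxes, t):
    """Hypothetical decoder: returns predicted clean boxes and one score per box."""
    pred_x0 = np.clip(boxes, -2.0, 2.0)
    scores = 1.0 / (1.0 + np.abs(boxes).mean(axis=1))
    return pred_x0, scores

def sample(decoder, num_boxes=300, steps=4, num_steps=1000, renew_thresh=0.5, rng=None):
    """Start from random boxes and refine them over `steps` decoder calls."""
    if rng is None:
        rng = np.random.default_rng()
    times = np.linspace(num_steps, 0, steps + 1)   # e.g. 1000, 750, 500, 250, 0
    xt = rng.standard_normal((num_boxes, 4))       # pure-noise boxes
    for i in range(steps):
        t_now, t_next = times[i], times[i + 1]
        pred_x0, scores = decoder(xt, t_now)
        if i == steps - 1:
            break                                   # last step: keep the predictions
        a_now, a_next = alpha_bar(t_now, num_steps), alpha_bar(t_next, num_steps)
        # DDIM-style step: recover the implied noise, then re-noise to the next level.
        eps = (xt - np.sqrt(a_now) * pred_x0) / np.sqrt(1.0 - a_now)
        xt = np.sqrt(a_next) * pred_x0 + np.sqrt(1.0 - a_next) * eps
        # Box renewal: replace low-scoring boxes with fresh random boxes.
        weak = scores < renew_thresh
        xt[weak] = rng.standard_normal((int(weak.sum()), 4))
    return pred_x0, scores
```

Because `num_boxes` and `steps` are plain arguments of the loop rather than properties of the trained weights, the same decoder can be run with any combination of the two.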

Experiments

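DiffusionDet has a property called “once-for-all”: the model is trained once and can then be evaluated with different numbers of boxes and sampling steps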
DiffusionDet is compared to other detectors on MS-COCO and LVIS datasets
MS-COCO dataset has 118K training images and 5K validation images, with 80 object categories
Evaluation metrics for MS-COCO are box average precision averaged over multiple IoU thresholds (AP), and at IoU thresholds 0.5 (AP50) and 0.75 (AP75)
LVIS dataset has 100K training images and 20K validation images, with 1203 categories
Evaluation metrics for LVIS are the MS-COCO style box metrics AP, AP50 and AP75
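
The metrics above all rest on the intersection-over-union (IoU) between a predicted box and a ground-truth box. The following is a small illustrative sketch, not an evaluation toolkit: the function names and the (x1, y1, x2, y2) box format are assumptions. The threshold list follows the MS-COCO convention of averaging AP over IoU thresholds from 0.50 to 0.95 in steps of 0.05.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# IoU thresholds averaged by the AP metric: 0.50, 0.55, ..., 0.95
COCO_IOU_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

def is_match(pred, gt, threshold):
    """A prediction counts as a true-positive candidate if its IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold
```

A prediction of (0, 0, 10, 10) against a ground truth of (2, 0, 10, 10) has an IoU of 0.8, so it counts as a match for AP50 and AP75 but not at a threshold of 0.9.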

Implementation details.

ResNet and Swin backbones are initialized with weights pre-trained on ImageNet-1K and ImageNet-21K, respectively
Detection decoder is initialized with Xavier init
AdamW optimizer with initial learning rate of 2.5 × 10^-5 and weight decay of 10^-4
Training schedule for MS-COCO is 450K iterations, learning rate divided by 10 at 350K and 420K iterations
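Training schedule for LVIS uses 210K, 250K and 270K iterations as the corresponding milestones (two learning-rate drops, then the end of training)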
Data augmentation includes random horizontal flip, scale jitter, and random crop
At inference, the top-100 (MS-COCO) and top-300 (LVIS) scoring predictions are selected, and predictions from the sampling steps are ensembled together by NMS
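
The MS-COCO learning-rate schedule listed above is a simple step decay, which can be sketched as below. The function name is hypothetical, and whether the drop takes effect exactly at the milestone iteration is an assumption.

```python
def learning_rate(iteration, base_lr=2.5e-5, milestones=(350_000, 420_000), gamma=0.1):
    """Step decay: multiply the learning rate by gamma at each milestone passed."""
    drops = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** drops
```

Under these settings the learning rate is 2.5 × 10^-5 until iteration 350K, ten times smaller until 420K, and ten times smaller again until training ends at 450K.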

Main properties

The number of boxes and the number of sampling steps can both be changed at inference without retraining.
DiffusionDet achieves better accuracy with more boxes and/or more refinement steps.
DiffusionDet is compared with DETR to show the advantage of dynamic boxes.
Performance of DiffusionDet increases steadily with the number of boxes used for evaluation.
DETR has a clear performance drop when the number of boxes is different from the number of queries.
DiffusionDet can be improved by increasing the number of random boxes or the iterative steps.

Benchmarking on detection datasets

DiffusionDet is compared to 6 previous detectors
Results are also compared on the challenging LVIS dataset
Faster R-CNN and Cascade R-CNN are reproduced using the default settings of detectron2
Sparse R-CNN is reproduced using its original code
Federated loss used to boost performance
DiffusionDet attains remarkable gains with more refinement steps
Refinement brings more gains on LVIS than MS-COCO

Ablation study

Experiments conducted on MS-COCO to study DiffusionDet
ResNet-50 with FPN used as backbone
Signal scaling factor of 2.0 achieves optimal AP performance
GT box padding strategies are compared (e.g. repeating ground-truth boxes vs. concatenating random boxes)
Sampling strategies are compared
Different box renewal thresholds are compared
Matching between N_train and N_eval (the number of boxes used in training and in evaluation) is studied
Accuracy vs. speed studied
Random seed studied for stability

Conclusion and future work

Proposed a novel detection paradigm, DiffusionDet, by viewing object detection as a denoising diffusion process from noisy boxes to object boxes
Dynamic boxes and progressive refinement enable the same network parameters to reach a desired speed-accuracy trade-off without re-training
Experiments on standard detection benchmarks show favorable performance compared to well-established detectors
Future work includes applying DiffusionDet to video-level tasks, such as object tracking and action recognition, and extending DiffusionDet from closed-world to open-world or open-vocabulary object detection
DiffusionDet has dynamic box property, progressive refinement property, and is able to benefit from more proposal boxes and iterative refinements using the same network parameters
DiffusionDet does not degrade as refinement steps are added; it can even gain performance with more refinement steps
DiffusionDet is optimized with a multi-task loss function, which includes focal loss, L1 loss, and GIoU loss
Sampling steps of DiffusionDet are visualized with 300 boxes, of which 50 are drawn in the image
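
The multi-task loss mentioned above combines a classification term (focal loss) with two box terms (L1 and GIoU). The sketch below shows one way the three terms fit together for a single matched prediction. It is illustrative only: the function names, the (x1, y1, x2, y2) box format, the focal-loss hyperparameters, and the unit loss weights are assumptions, not values taken from the paper.

```python
import math

def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest box enclosing both inputs
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose

def l1_loss(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def detection_loss(pred_box, gt_box, p, y, w_cls=1.0, w_l1=1.0, w_giou=1.0):
    """Weighted sum of the three terms; the weights here are placeholders."""
    return (w_cls * focal_loss(p, y)
            + w_l1 * l1_loss(pred_box, gt_box)
            + w_giou * (1.0 - giou(pred_box, gt_box)))
```

GIoU equals 1 for identical boxes and becomes negative for boxes that do not overlap, so the `1 - giou` term still gives a useful signal when a predicted box misses its target entirely.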
