Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Proposed a new framework for object detection called DiffusionDet
- DiffusionDet formulates object detection as a denoising diffusion process from noisy boxes to object boxes
- During training, object boxes diffuse from ground-truth boxes to random distribution
- Model learns to reverse this noising process during inference
- Evaluations on MS-COCO and LVIS show favorable performance compared to previous detectors
- Random boxes are effective object candidates
- Object detection can be solved by a generative way
Paper Content
Introduction
- We propose DiffusionDet, a novel noise-to-box object detection framework, which decouples the training and evaluation and enables progressive refinement.
- We evaluate DiffusionDet on MS-COCO and LVIS datasets, and it achieves competitive performance compared to existing approaches.
- We demonstrate the effectiveness of DiffusionDet in various settings, such as different backbones, different sampling steps, and different numbers of random boxes.
- We release the source code of DiffusionDet for reproducibility.
- Object detection aims to predict bounding boxes and category labels
- Used in many related recognition scenarios
- Modern approaches use empirically designed object candidates
- Learnable object queries have been proposed
- Question: simpler approach without learnable queries?
- Noise-to-box approach proposed - no hand-designed components
- DiffusionDet proposed - generative denoising process
- Decouples training and evaluation, enables progressive refinement
- Evaluated on MS-COCO and LVIS datasets
- Source code released
Related work
- Object detection is usually done using box regression and category classification on empirical object priors.
- DETR [10] proposed a query-based detection paradigm.
- Diffusion models are a class of deep generative models.
- Diffusion models have been used for image generation, segmentation, and other tasks.
- This is the first work to use a diffusion model for object detection.
Preliminaries
- Object detection is a learning objective that takes an input image and produces a set of bounding boxes and category labels.
- Diffusion models are a class of likelihood-based models inspired by nonequilibrium thermodynamics.
- A neural network is trained to predict the bounding boxes from noisy boxes, conditioned on the corresponding image.
Architecture
- Diffusion model generates data samples iteratively
- Computationally intractable to directly apply model to raw image
- Model separated into two parts: image encoder and detection decoder
- Image encoder takes raw image and extracts high-level features
- Detection decoder takes set of proposal boxes and sends to detection head
- DiffusionDet begins from random boxes, re-uses detector head in iterative steps
Training
- Construct diffusion process from ground-truth boxes to noisy boxes
- Pad ground truth boxes to fixed number
- Explore padding strategies (e.g. repeating, concatenating)
- Add Gaussian noises to padded boxes
- Use monotonically decreasing cosine schedule for noise scale
- Use set prediction loss on set of predictions
- Assign multiple predictions to each ground truth
Inference
- DiffusionDet is a denoising sampling process from noise to object boxes.
- In each sampling step, boxes are sent into the detection decoder to predict category classification and box coordinates.
- Box renewal strategy is used to filter out undesired boxes and replace them with random boxes.
- DiffusionDet can be evaluated with an arbitrary number of random boxes and sampling steps.
Experiments
- DiffusionDet has a property called “once-for-all”
- DiffusionDet is compared to other detectors on MS-COCO and LVIS datasets
- MS-COCO dataset has 118K training images and 5K validation images, with 80 object categories
- Evaluation metrics for MS-COCO are box average precision over multiple IoU thresholds (AP), threshold 0.5 (AP 50 ) and 0.75 (AP 75 )
- LVIS dataset has 100K training images and 20K validation images, with 1203 categories
- Evaluation metrics for LVIS are MS-COCO style box metric AP, AP 50 and AP 75
Implementation details.
- ResNet and Swin backbone are initialized with pre-trained weights on ImageNet-1K and ImageNet-21K
- Detection decoder is initialized with Xavier init
- AdamW optimizer with initial learning rate of 2.5 × 10 −5 and weight decay of 10 −4
- Training schedule for MS-COCO is 450K iterations, learning rate divided by 10 at 350K and 420K iterations
- Training schedule for LVIS is 210K, 250K, 270K
- Data augmentation includes random horizontal flip, scale jitter, and random crop
- At inference stage, top-100 and top-300 scoring predictions for MS-COCO and LVIS, respectively, are selected and ensembled together by NMS
Main properties
- DiffusionDet can be used with changing the number of boxes and number of sample steps in inference.
- DiffusionDet can achieve better accuracy with more boxes or/and more refining steps.
- DiffusionDet is compared with DETR to show the advantage of dynamic boxes.
- Performance of DiffusionDet increases steadily with the number of boxes used for evaluation.
- DETR has a clear performance drop when the number of boxes is different from the number of queries.
- DiffusionDet can be improved by increasing the number of random boxes or the iterative steps.
Benchmarking on detection datasets
- DiffusionDet is compared to 6 previous detectors
- Results are compared on a challenging LVIS dataset
- Reproduced Faster R-CNN and Cascade R-CNN using default settings of detectron2
- Sparse R-CNN on its original code
- Federated loss used to boost performance
- DiffusionDet attains remarkable gains with more refinement steps
- Refinement brings more gains on LVIS than MS-COCO
Ablation study
- Experiments conducted on MS-COCO to study DiffusionDet
- ResNet-50 with FPN used as backbone
- Signal scaling factor of 2.0 achieves optimal AP performance
- GT boxes padding strategy studied using different methods
- Sampling strategy studied using different methods
- Box renewal threshold studied using different methods
- Matching between N train and N eval studied
- Accuracy vs. speed studied
- Random seed studied for stability
Conclusion and future work
- Proposed a novel detection paradigm, DiffusionDet, by viewing object detection as a denoising diffusion process from noisy boxes to object boxes
- Dynamic box and progressive refinement, enabling same network parameters to obtain desired speed-accuracy trade-off without re-training
- Experiments on standard detection benchmarks show favorable performance compared to well-established detectors
- Future works include applying DiffusionDet to video-level tasks, such as object tracking and action recognition, and extending DiffusionDet from close-world to open-world or open-vocabulary object detection
- DiffusionDet has dynamic box property, progressive refinement property, and is able to benefit from more proposal boxes and iterative refinements using the same network parameters
- DiffusionDet has negligible performance drop with more refinement steps, and even has performance gains with more refinement steps
- DiffusionDet is optimized with a multi-task loss function, which includes focal loss, L1 loss, and GIoU loss
- Visualize sampling step of DiffusionDet with 300 boxes, 50 boxes are drawn in the image