Universal Instance Perception as Object Discovery and Retrieval
Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
UNINEXT is a universal instance perception model for object discovery and retrieval. Benefits of UNINEXT include exploiting data from different tasks and label vocabularies for joint training of general instance-level representations, and being parameter-efficient when handling multiple tasks. UNINEXT has shown superior performance on 20 challenging benchmarks from 10 instance-level tasks.

Paper Content

Introduction
Object-centric understanding is a challenging problem in computer vision.
10 sub-tasks are discussed, distributed on the vertices of a cube.
Object detection and instance segmentation require finding objects of specific categories.
Multiple Object Tracking (MOT), Multi-Object Tracking and Segmentation (MOTS), and Video Instance Segmentation (VIS) require finding object trajectories of specific categories in videos.
Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Referring Video Object Segmentation (R-VOS) aim to find objects matched with language expressions.
Single Object Tracking (SOT) and Video Object Segmentation (VOS) take the target annotations given in the first frame as the reference.
Fragmented task definitions split the field into pieces, causing redundant parameters and overlooking the possibility of mutual collaboration.
UNINEXT is proposed as a universal instance perception model of the next generation.
UNINEXT can flexibly perceive different instances by changing the input prompts.
UNINEXT achieves superior performance on 20 challenging benchmarks.

Related work
Retrieval by Category Names: Object detection and instance segmentation.
Retrieval by Language Expressions: REC, RES, and R-VOS.
Retrieval by Reference Annotations: SOT and VOS.
Unified Vision Models: Unified learning paradigms and unified model architectures.
Object detection and instance segmentation are foundations for other instance perception tasks.
REC methods are divided into two-stage, one-stage, and Transformer-based approaches.
RES approaches focus on designing diverse attention mechanisms.
R-VOS is an extension of RES from images to videos.
SOT and VOS extract target features and fuse target information with representations of the current frame.
Unified vision models attempt to solve multiple vision or multi-modal tasks with a single model.
Unified learning paradigms cover many tasks and modalities.
Unified model architectures are designed for a group of closely related tasks.

Approach
Existing instance perception tasks are categorized into three classes according to their prompts.
Object detection, instance segmentation, MOT, MOTS, and VIS use category names as prompts.
REC, RES, and R-VOS use an expression as the prompt.
SOT and VOS use the annotation given in the first frame as the prompt.
All instance perception tasks are reformulated as a prompt-guided object discovery and retrieval problem.
UNINEXT consists of three components: prompt generation, image-prompt feature fusion, and object discovery and retrieval.

Prompt generation
A prompt generation module is used to transform the original diverse prompt inputs into a unified form....
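
The grouping of the 10 tasks by prompt type, and the idea of wrapping each task's input in one common form, can be sketched in a few lines of Python. This is only an illustrative sketch, not the UNINEXT implementation: the names `PromptKind`, `TASK_TO_PROMPT`, `Prompt`, and `make_prompt` are hypothetical, and the "unified form" here is a plain container rather than the learned representation the paper's module produces.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class PromptKind(Enum):
    CATEGORY_NAMES = "category names"
    LANGUAGE_EXPRESSION = "language expression"
    REFERENCE_ANNOTATION = "reference annotation"


# Task grouping as listed in the Approach summary above.
TASK_TO_PROMPT = {
    "object detection": PromptKind.CATEGORY_NAMES,
    "instance segmentation": PromptKind.CATEGORY_NAMES,
    "MOT": PromptKind.CATEGORY_NAMES,
    "MOTS": PromptKind.CATEGORY_NAMES,
    "VIS": PromptKind.CATEGORY_NAMES,
    "REC": PromptKind.LANGUAGE_EXPRESSION,
    "RES": PromptKind.LANGUAGE_EXPRESSION,
    "R-VOS": PromptKind.LANGUAGE_EXPRESSION,
    "SOT": PromptKind.REFERENCE_ANNOTATION,
    "VOS": PromptKind.REFERENCE_ANNOTATION,
}


@dataclass(frozen=True)
class Prompt:
    """Hypothetical common container for any task's prompt."""
    kind: PromptKind
    payload: Any


def make_prompt(task: str, raw_input: Any) -> Prompt:
    """Wrap a task-specific input in the common container (illustrative only)."""
    kind = TASK_TO_PROMPT[task]
    if kind is PromptKind.CATEGORY_NAMES:
        # Assumption: the category list is kept as a tuple of strings.
        payload = tuple(raw_input)
    elif kind is PromptKind.LANGUAGE_EXPRESSION:
        # Assumption: the expression is kept as whitespace-trimmed text.
        payload = str(raw_input).strip()
    else:
        # Reference annotation (e.g. a first-frame box) is passed through unchanged.
        payload = raw_input
    return Prompt(kind, payload)
```

For example, `make_prompt("REC", "the dog on the left")` and `make_prompt("SOT", (10, 20, 50, 80))` both return a `Prompt`, so downstream code can branch on `kind` instead of on the task name. The box format in the second call is an assumed placeholder.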