arxiv-summary: AI-summarized AI papers

Learning Context-aware Classifier for Semantic Segmentation

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Semantic segmentation is a challenging task for parsing different contexts. A context-aware classifier is used to adapt to different latent distributions. The method is model-agnostic and can be applied to generic segmentation models. With negligible additional parameters and +2% inference time, decent performance gain is achieved.

Paper Content

Introduction: Semantic segmentation has been used in a wide range of applications. Recent advances in model structure focus on strong backbones and decoder heads. The classifier in recent literature is composed of parameters shared across all images. This can make diverse contexts difficult to handle. Enriching the classifier with contextual information can improve performance. An entropy-aware KL loss is designed to mitigate information imbalance. The method can be plugged into existing segmentation models at little cost in efficiency.

Related work: Semantic segmentation is a challenging task that requires precise pixel-wise predictions....

Novel Class Discovery for 3D Point Cloud Semantic Segmentation

Abstract: Novel class discovery (NCD) is the task of learning a model that segments unlabelled classes using only supervision from labelled classes. No prior work addresses NCD for 3D point cloud data. This paper advances the state of the art on point cloud data analysis in four directions. It presents a new method for NCD based on online clustering. It introduces a new evaluation protocol to assess the performance of NCD for point cloud semantic segmentation.

Paper Content

Introduction: Humans can organize new visual knowledge into groups. Machines cannot do this without supervision. Novel Class Discovery (NCD) is the task of classifying unlabelled samples into different classes. NCD has been explored in the 2D image domain for classification and semantic segmentation. NCD for 3D data is different because one point cloud can contain more than one novel class. This paper explores NCD for 3D semantic segmentation. A new method for NCD, called NOPS (NOvel Point Segmentation), is presented. A new evaluation protocol is introduced to assess the performance of NCD for 3D semantic segmentation.

Related work: Point cloud semantic segmentation can be performed at the point level, on range view maps, or by voxelising the input points. Point-level networks process the input without intermediate representations; examples include PointNet, PointNet++, RandLA-Net, and KPConv. Range view architectures and voxel-based approaches are more computationally efficient than point-level networks. Novel class discovery has been explored for 2D classification and 2D segmentation. NCD is more complex than standard semi-supervised learning. NOPS tackles the problem of NCD in 3D point cloud semantic segmentation. NOPS produces two augmented views that are processed with the same deep neural network. The Sinkhorn-Knopp algorithm is used to obtain pseudo-labels. The network is trained by minimising the optimisation objective through a swapped prediction task based on the computed pseudo-labels.

Problem formulation: X is a dataset of 3D point clouds captured in different scenes....
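
The summary above says that pseudo-labels are obtained with the Sinkhorn-Knopp algorithm. The sketch below shows the generic form of that algorithm on a small score matrix; it is not the NOPS implementation. The function name, the `epsilon` temperature, the iteration count, and the equal-sized-classes target are assumptions made for illustration.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.5, n_iters=50):
    """Turn an N x K score matrix into soft pseudo-labels with balanced class usage.

    Assumption for this sketch: every class should receive the same total mass.
    """
    n, k = len(scores), len(scores[0])
    # Exponentiate the scaled scores (shifted by the global max for stability).
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: rescale so that each class holds total mass 1/K.
        col = [sum(q[i][j] for i in range(n)) for j in range(k)]
        q = [[q[i][j] / (col[j] * k) for j in range(k)] for i in range(n)]
        # Row step: rescale so that each point holds total mass 1/N.
        row = [sum(r) for r in q]
        q = [[q[i][j] / (row[i] * n) for j in range(k)] for i in range(n)]
    # Multiply by N so that each row is a distribution over the K classes.
    return [[v * n for v in r] for r in q]

# Hypothetical scores for 4 points and 2 classes; argmax alone would put
# three points in class 0, while the balanced assignment spreads the mass evenly.
example_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
pseudo_labels = sinkhorn_knopp(example_scores)
```

Each row of `pseudo_labels` sums to one, and the column sums approach N/K, which is what keeps a clustering objective from collapsing all points into one class.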

Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration

Abstract: InDI is a new formulation for supervised image restoration that avoids the “regression to the mean” effect. InDI gradually improves image quality in small steps, similar to generative denoising diffusion models. InDI does not require knowledge of any analytic form of the degradation process. InDI can be applied to virtually any image degradation, given paired training data....

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Abstract: The proposed system paradigm integrates ChatGPT with a pool of vision experts. The paper defines and explores a comprehensive list of advanced vision tasks. Textual prompt design allows language models to accept, associate, and process multimodal information. Zero-shot experiments demonstrate effectiveness in addressing the specified capabilities. The system paradigm is discussed and compared with an alternative approach.

Paper Content

Introduction: Recent years have seen significant advancement in computer vision. Different vision problems require different models. One research direction is to combine vision and language modules. Large language models have shown impressive dialogue capability. NLP research has demonstrated the effectiveness of integrating external NLP tools with LLMs. MM-REACT combines vision experts with ChatGPT for multimodal reasoning and action. MM-REACT provides extra flexibility in module upgrades.

Related work: LLMs have strong chain-of-thought capabilities. LLMs can use external NLP tools to solve problems. LLMs can reason and take action independently, but not together. Recent studies have attempted to merge reasoning and action for LLMs. MM-REACT uses vision tools as executable actions. MM-REACT uses ChatGPT to determine which vision expert to invoke.

User input: ChatGPT only accepts text as input. File paths are used to indicate non-text inputs. Vision experts are used to understand image content from different perspectives.

ChatGPT response: ChatGPT is expected to provide two kinds of responses. The key challenge is to set up a protocol for knowing when to invoke a vision expert. The keyword “Assistant” signals that a vision expert is required. ChatGPT is encouraged to show its thought process to highlight why an external tool is required.

Vision experts: Regular expression matching is used to parse the expert name and file path. Outputs are standardized into text format. The output of a detection model is represented as <object name, x1, y1, x2, y2>. Text descriptions are added to explain the numerical values. Knowledge of the vision experts’ usage is injected into the prompt prefix.

Extensibility: The design is motivated by ReAct, which uses NLP tools. It is extended to the vision domain by replacing non-text modalities with path strings. It can be extended to other modalities, such as speech and audio. More tools can be incorporated by formatting their outputs as text. Performance can be enhanced by upgrading to a more powerful LLM.

Experiments

Experiment setup: MM-REACT is implemented on top of the LangChain codebase and ReAct. ChatGPT is accessed via the Azure API with a token length limit of 4096. Vision experts come from Azure Cognitive Services APIs. The toolset is expanded with customized tools for spatial understanding and image editing. Examples of capabilities and application scenarios are shown in Figures 4-14. Unfolded steps are shown in Figure 18. The LLM is upgraded from ChatGPT to GPT-4 in Figures 23 and 24. An image editing tool from X-Decoder is plugged in for Figure 25.

Limitations: Recognition capability in the wild is hard to evaluate with accuracy numbers due to a lack of annotated benchmarks. Vision capability is limited by the integrated vision experts. Knowledge is injected in the prefix and is therefore limited by the context window. Visual signals are converted to text for ChatGPT to understand. Manual prompt engineering is required for MM-REACT.

Conclusion: MM-REACT is a system paradigm that combines multimodal reasoning and action to solve visual understanding problems....
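
The summary above describes three mechanics: an “Assistant” keyword that marks a request for a vision expert, regular expression matching to recover the expert name and file path, and detection output serialized as <object name, x1, y1, x2, y2>. The sketch below illustrates these ideas only. The exact reply format (`Assistant, <expert>: <path>`), the function names, and the wording of the coordinate description are assumptions made for this example, not the format used by MM-REACT.

```python
import re

# Hypothetical reply format (an assumption for this sketch):
#   "Assistant, <expert name>: <file path>"
INVOKE_PATTERN = re.compile(
    r"^Assistant,\s*(?P<expert>[\w ]+?):\s*(?P<path>\S+)\s*$",
    re.MULTILINE,
)

def parse_invocation(reply):
    """Return (expert_name, file_path) if the reply requests an expert, else None."""
    match = INVOKE_PATTERN.search(reply)
    if match is None:
        return None  # plain answer for the user, no tool call
    return match.group("expert").strip(), match.group("path")

def format_detections(detections):
    """Serialize (name, x1, y1, x2, y2) tuples as text a language model can read."""
    lines = [
        # The meaning given to the coordinates here is an illustrative assumption.
        "Detected objects, each as <object name, x1, y1, x2, y2>, "
        "where (x1, y1) and (x2, y2) are box corners in pixels:"
    ]
    for name, x1, y1, x2, y2 in detections:
        lines.append(f"<{name}, {x1}, {y1}, {x2}, {y2}>")
    return "\n".join(lines)

# Example with made-up values.
reply = "Thought: I need to look at the image.\nAssistant, object detection: /tmp/dog.png"
request = parse_invocation(reply)
observation = format_detections([("dog", 10, 20, 110, 220)])
```

Keeping the tool call on its own line with a fixed keyword makes the parse unambiguous, and returning `None` for any other reply lets the caller pass that text straight to the user.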

Reflexion: an autonomous agent with dynamic memory and self-reflection

Abstract: Recent advancements in LLM agents have shown impressive performance. Implementing these methods can be challenging due to a lack of data or of a well-defined state space. Reflexion is an approach that endows an agent with dynamic memory and self-reflection capabilities. A heuristic is introduced that enables the agent to pinpoint hallucination instances, avoid repetition, and construct a memory map. The approach is evaluated in the AlfWorld and HotPotQA environments, with success rates of 97% and 51% respectively.

Paper Content

Introduction: Mastering decision-making and knowledge-intensive search tasks is important for natural language agents. LLMs have achieved impressive results on various benchmarks. Grounding complex tasks in natural language helps agents avoid false-negative errors. Learning optimal policies for natural language RL agents is challenging due to vast and mostly unbounded state spaces. Several decision-making approaches have been proposed to enable natural language agents to select their next action. Chain-of-thought reasoning leverages emergent properties to solve tasks in a single action. ReAct utilizes emergent properties to solve problems. Several recent works have aimed to give natural language agents reflective qualities. DEPS uses multi-step reasoning and sub-task error correction to solve long-range tasks. Huang et al....

Zero-1-to-3: Zero-shot One Image to 3D Object

Abstract: The paper introduces Zero-1-to-3, a framework for changing the camera viewpoint of an object from a single RGB image. It capitalizes on geometric priors learned by large-scale diffusion models. A synthetic dataset is used to learn controls of the relative camera viewpoint. The model has strong zero-shot generalization ability to out-of-distribution datasets and in-the-wild images. It can be used for 3D reconstruction from a single image. It outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models.

Paper Content

Introduction: Humans can imagine 3D shape and appearance from a single camera view. This ability is important for everyday tasks and visual creativity. Humans rely on prior knowledge accumulated through a lifetime of visual exploration. Existing approaches for 3D image reconstruction rely on expensive 3D annotations or category-specific priors. Recent methods have made strides in open-world 3D reconstruction. The paper demonstrates that large diffusion models have learned rich 3D priors from 2D images. The paper presents experiments to evaluate zero-shot view synthesis and 3D reconstruction from a single image.

Related work: Recent advancements in generative image architectures have made it possible to synthesize high-fidelity diverse scenes and objects....

Context-faithful Prompting for Large Language Models

Abstract: LLMs encode parametric knowledge about world facts and have shown good performance in knowledge-driven NLP tasks. LLMs may overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks. This paper seeks to assess and enhance LLMs’ contextual faithfulness. Opinion-based prompts and counterfactual demonstrations are the most effective methods for improving faithfulness....
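
To make the idea of an opinion-based prompt concrete, here is a minimal sketch that reframes a context passage as something a narrator said and asks the question about that narrator's opinion. The function name, the default narrator, and the exact template wording are assumptions made for illustration; the summary above does not spell out the template.

```python
def opinion_based_prompt(context, question, narrator="Bob"):
    """Wrap a context passage and a question as a narrator's stated opinion."""
    # Drop a trailing question mark so the opinion clause can be appended.
    stem = question.strip().rstrip("?").rstrip()
    return (
        f'{narrator} said, "{context.strip()}"\n'
        f"Q: {stem} in {narrator}'s opinion?\n"
        "A:"
    )

# Example with a deliberately counterfactual context: a context-faithful
# answer follows the passage rather than the model's memorized facts.
prompt = opinion_based_prompt(
    "The capital of France is Lyon.",
    "What is the capital of France?",
)
```

Asking what the narrator believes, rather than what is true, is meant to steer the model toward reading the passage instead of recalling parametric knowledge.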

SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

Abstract: Diffusion models have been successful in text-to-image generation. Existing methods for customizing these models have limitations. The proposed approach addresses these limitations. The method involves fine-tuning the singular values of weight matrices. A Cut-Mix-Unmix data-augmentation technique enhances the quality of multi-subject image generation. The proposed SVDiff method has a significantly smaller model size.

Paper Content

Introduction: Recent years have seen rapid advancement of text-to-image generative models. These models can generate high-quality images from text prompts. Researchers have investigated ways to use these models for image editing. Some methods allow the diffusion models to be adapted to specific tasks or user preferences. Limitations include a large parameter space and difficulty in learning multiple personalized concepts.

Related work: Text-to-image diffusion models have been used for image synthesis and various applications....

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Abstract: 3D object detectors usually rely on hand-crafted proxies. The paper proposes VoxelNeXt for fully sparse 3D object detection. It predicts objects directly from sparse voxel features. The framework is elegant and efficient, with no need for sparse-to-dense conversion or NMS post-processing. It achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset.

Paper Content

Introduction: 3D perception is a fundamental component in autonomous driving systems. 3D detection networks take sparse point clouds or voxels as input. Recent 3D object detectors use sparse convolutional networks for feature extraction. Anchors and centers are used for prediction. Mainstream detectors convert 3D sparse features to 2D dense features. VoxelNeXt is a simple, efficient, and post-processing-free 3D object detector. VoxelNeXt predicts 3D objects from voxel features with a fully sparse convolutional network. VoxelNeXt is evaluated on three large-scale benchmarks and achieves leading performance with high efficiency.

Related work: 3D detectors work similarly to their 2D counterparts. Many approaches still use 2D dense convolutional heads. VoxelNet uses PointNet for voxel feature encoding. SECOND improves VoxelNet with a dense anchor-based head. Other state-of-the-art methods use a sparse-to-dense scheme. CenterPoint predicts a dense heatmap of center locations. Sparse detectors avoid dense detection heads. Sparse CNNs are used for 3D deep learning. Sparse CNNs have limited representation ability. 3D object tracking models tracklets of multiple objects.

Fully sparse voxel-based network: Point clouds or voxels are scattered on the surface of 3D objects....

A Survey on Oversmoothing in Graph Neural Networks

Abstract: Node features of graph neural networks become more similar with increased network depth, known as over-smoothing. Definition of over-smoothing is unified and new quantitative measures are introduced. Over-smoothing is demonstrated empirically on different graphs. Approaches for mitigating over-smoothing are reviewed and tested on real-world datasets. Mitigating over-smoothing is necessary but not sufficient for building deep GNNs....
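
The effect described above, node features becoming more similar as depth grows, can be reproduced on a toy graph. The sketch below stacks parameter-free mean-aggregation layers on a 4-node path graph and tracks the Dirichlet energy, a commonly used measure of how much features differ across edges. The graph, the features, the layer type, and the choice of measure are all assumptions made for illustration; the survey's own definitions may differ.

```python
def propagate(features, neighbors):
    """One parameter-free layer: replace each node by the mean of itself and its neighbors."""
    out = []
    for i, nbrs in enumerate(neighbors):
        group = [i] + nbrs
        dim = len(features[i])
        out.append([sum(features[j][d] for j in group) / len(group) for d in range(dim)])
    return out

def dirichlet_energy(features, neighbors):
    """Sum of squared feature differences over undirected edges (each edge counted once)."""
    total = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            if i < j:
                total += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return total

# Toy path graph 0 - 1 - 2 - 3 with one scalar feature per node (made-up values).
neighbors = [[1], [0, 2], [1, 3], [2]]
features = [[1.0], [0.0], [0.0], [-1.0]]

energies = [dirichlet_energy(features, neighbors)]
for _ in range(30):
    features = propagate(features, neighbors)
    energies.append(dirichlet_energy(features, neighbors))
# energies[0] is the energy of the input; energies[-1] is the energy after 30 layers.
```

On this connected graph the energy shrinks toward zero as layers are stacked, which means the node features converge to nearly the same value and carry little information for telling nodes apart.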