arxiv-summary: AI-summarized AI papers

Towards a Foundation Model for Neural Network Wavefunctions

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here. Abstract: Deep neural networks are accurate and powerful wavefunction ansätze for solving the electronic Schrödinger equation. Optimizing the wavefunction from scratch for each new system is computationally costly. A novel neural network ansatz is proposed which maps uncorrelated, computationally cheap Hartree-Fock orbitals to correlated, high-accuracy neural network orbitals. This ansatz is capable of learning a single wavefunction across multiple compounds and geometries....

Adversarial Counterfactual Visual Explanations

Abstract: Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations. However, adversarial attacks cannot be used directly as counterfactual explanations. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model to polish them....

A Dynamic Multi-Scale Voxel Flow Network for Video Prediction

Abstract: The performance of video prediction has been improved by deep neural networks. Current methods require extra inputs for better performance. DMVFN is proposed to achieve better video prediction performance with only RGB images. DMVFN has a differentiable routing module to perceive the motion scales of video frames. DMVFN is faster than Deep Voxel Flow and surpasses OPT on generated image quality.

Paper Content, Introduction: The aim is to predict future video frames from current ones. This benefits representation learning and downstream forecasting tasks. Video prediction is studied in both academia and industry. It is challenging due to diverse and complex motion patterns. Early methods use recurrent neural networks. Semantic/instance maps are used for semantically coherent motion estimation. OPT uses only RGB images to estimate optical flow. A single model for multi-scale motion estimation still needs to be developed. DMVFN is proposed to model complex motion cues of diverse scales. Its Routing Module adaptively generates a routing vector. Experiments on four benchmarks show state-of-the-art results.

Related work, Video prediction: Early video prediction methods used only RGB frames as inputs. Later methods used extra information for better performance. This paper develops a light-weight and efficient video prediction network that requires only sRGB images as inputs.

Related work, Optical flow: Optical flow estimation is used to measure motion between frames. Deep learning-based optical flow models have been improved since FlowNet. FlowNet2....

Trained on 100 million words and still in shape: BERT meets British National Corpus

Abstract: Modern masked language models are trained on large corpora. We explore the effects of training on a smaller, well-balanced corpus. Pre-training on this corpus can reach better performance than the original BERT model. Smaller corpora have potential as a language modeling benchmark. We present comparative studies of LMs to evaluate training objectives and model architectures....

CoLT5: Faster Long-Range Transformers with Conditional Computation

Abstract: Natural language processing tasks benefit from long inputs. Processing long documents with Transformers is expensive. CoLT5 is a long-input Transformer model that uses conditional computation. CoLT5 achieves stronger performance than LongT5 with faster training and inference. CoLT5 achieves SOTA on the long-input SCROLLS benchmark. CoLT5 can effectively and tractably make use of extremely long inputs....

Efficient Diffusion Training via Min-SNR Weighting Strategy

Abstract: Denoising diffusion models are used for image generation. Training these models can be slow. Conflicting optimization directions between timesteps can cause slow convergence. Min-SNR-$\gamma$ is a method to address this issue. Min-SNR-$\gamma$ improves convergence speed and achieves a new record FID score.

Paper Content, Introduction: We proposed a Min-SNR-$\gamma$ weighting strategy to tackle the conflicting gradients issue....
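The summary above does not spell out the weighting formula, so the following is only a minimal sketch. It assumes the commonly stated form in which the per-timestep loss weight is min(SNR(t), γ) when the network predicts the clean image and min(SNR(t), γ) / SNR(t) when it predicts the noise, with SNR(t) = ᾱ_t / (1 − ᾱ_t). The function names, the `predict_noise` flag, and the default γ = 5 are illustrative assumptions, not the paper's code.

```python
def snr(alpha_bar: float) -> float:
    """Signal-to-noise ratio for a cumulative signal level alpha_bar in (0, 1)."""
    if not 0.0 < alpha_bar < 1.0:
        raise ValueError("alpha_bar must lie strictly between 0 and 1")
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(alpha_bar: float, gamma: float = 5.0,
                   predict_noise: bool = True) -> float:
    """Loss weight for one timestep (assumed form, see the lead-in).

    The SNR is clipped at gamma so that low-noise timesteps, whose SNR
    is very large, cannot dominate the total loss.
    """
    s = snr(alpha_bar)
    clipped = min(s, gamma)
    # Noise-prediction losses are rescaled by 1 / SNR relative to
    # clean-image-prediction losses (an assumption of this sketch).
    return clipped / s if predict_noise else clipped


if __name__ == "__main__":
    # Hypothetical signal levels, from almost clean to almost pure noise.
    for alpha_bar in (0.99, 0.9, 0.5, 0.1):
        print(alpha_bar, min_snr_weight(alpha_bar))
```

In a training loop, a weight of this kind would multiply the per-sample loss at the sampled timestep before averaging over the batch.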

LERF: Language Embedded Radiance Fields

Abstract: Humans use natural language to refer to 3D locations. Language Embedded Radiance Fields (LERFs) is a method for grounding language embeddings into NeRF. LERF learns a dense, multi-scale language field inside NeRF. LERF can extract 3D relevancy maps for language prompts in real-time. LERF enables zero-shot queries on 3D CLIP embeddings without relying on region proposals or masks.

Paper Content, Introduction: Neural Radiance Fields (NeRFs) can capture photorealistic digital representations of 3D scenes. Natural language is an intuitive interface for interacting with a 3D scene. Language Embedded Radiance Fields (LERF) grounds language within NeRF by optimizing embeddings from a vision-language model. LERF preserves the integrity of CLIP embeddings at multiple scales, allowing it to handle a broad range of language queries. LERF utilizes self-supervised DINO features to regularize the optimized language field. LERF can localize both fine-grained and abstract queries across in-the-wild scenes. LERF has potential use cases in robotics, analyzing vision-language models, and interacting with 3D scenes.

Related work: Open-Vocabulary Object Detection approaches lie on a spectrum from zero-shot to fully trained on segmentation datasets. LSeg trains a 2D image encoder on labeled segmentation datasets. CRIS and CLIPSeg train a 2D image decoder to output a relevancy map. A common approach for 2D images is a two-stage framework with class-agnostic region or mask proposals. OpenSeg and ViLD use CLIP to classify 2D regions from class-agnostic mask proposal networks. Detic builds on existing two-stage object detector approaches. OWL-ViT attaches lightweight object classification and localization heads after a pre-trained 2D image encoder. LERF avoids region proposals by incorporating language embeddings in a dense, 3D, multi-scale field. Grad-CAM and attention-based methods provide a relevancy mapping between 2D images and text. NeRF has the attractive property of averaging information across multiple views. Semantic NeRF and Panoptic Lifting embed semantic information from semantic segmentation networks into 3D. Distilled Feature Fields and Neural Feature Fusion Fields explore embedding pixel-aligned feature vectors into NeRF. LERF embeds feature vectors into NeRF without fine-tuning. 3D Language Grounding has been explored in a wide range of contexts. VL-Maps and Open-Scene build a 3D volume of language features which can be queried. CLIP-Fields and NLMaps-SayCan fuse CLIP embeddings of crops into pointclouds. ConceptFusion fuses CLIP features more densely in RGBD pointclouds. LERF provides a new dense, volumetric interface for 3D text queries.

Multi-scale supervision: Supervising language field outputs requires querying language embeddings over image patches, not pixels....

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Abstract: Progress in machine learning is driven by large datasets. LAION datasets are largely uncurated. SemDeDup uses embeddings from pre-trained models to identify and remove semantic duplicates. SemDeDup can remove 50% of the data with minimal performance loss, halving training time. Performance increases out of distribution. SemDeDup improves over prior approaches and provides efficiency gains.

Paper Content, Related work: Work in language and vision has focused on removing exact duplicates. The C4 text corpus was deduplicated by discarding repeated 3-sentence spans. The MinHash technique was used to further deduplicate the dataset without loss of performance. Deduplication prevents memorization in LLMs and mitigates privacy concerns. Model-based feature extraction has been used to improve the similarity metric for deduplication.

SemDeDup: Identifying semantic duplicates is more difficult than perceptual duplicates....
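The idea of using embeddings to find and remove semantic duplicates can be illustrated with a deliberately simplified sketch. It greedily keeps an item only if its cosine similarity to every already-kept item is at most 1 − ε. This is an assumption-laden toy version: the summary does not describe the paper's actual procedure, and the function names, the threshold `epsilon`, and the example vectors are all hypothetical. A web-scale system would also need some way to avoid comparing all pairs, which this sketch ignores.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings, epsilon=0.05):
    """Return indices of items to keep (greedy, quadratic-time toy version).

    An item is treated as a semantic duplicate when its similarity to an
    already-kept item exceeds 1 - epsilon. Embeddings are assumed to come
    from some pre-trained model and to be non-zero.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        is_duplicate = any(
            cosine_similarity(emb, embeddings[j]) > 1.0 - epsilon
            for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Hypothetical 2-D embeddings: items 0 and 1 point in nearly the same direction.
    toy_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
    print(semantic_dedup(toy_embeddings))
```

A larger `epsilon` removes more data, so it acts as the knob that trades dataset size against the risk of discarding items that are merely similar rather than redundant.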

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Abstract: Diffusion-based generative models have been successful in text-based image generation. It is challenging to apply these models for real-world visual content editing, especially in videos. FateZero is a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask. FateZero captures intermediate attention maps during inversion, which retain both structural and motion information....

$P+$: Extended Textual Conditioning in Text-to-Image Generation

Abstract: The paper introduces an Extended Textual Conditioning space in text-to-image models. It shows that the extended space provides greater disentangling and control over image synthesis. It introduces Extended Textual Inversion (XTI). XTI is more expressive and precise, and converges faster than the original Textual Inversion (TI) space. A series of experiments is conducted to analyze and understand the properties of the new space.

Paper Content, Introduction: Neural generative models have advanced the field of image synthesis....