arxiv-summary: AI-summarized AI papers

Allegro-Legato: Scalable, Fast, and Robust Neural-Network Quantum Molecular Dynamics via Sharpness-Aware Minimization

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Neural-network quantum molecular dynamics (NNQMD) simulations based on machine learning are revolutionizing atomistic simulations of materials. The Allegro model combines group theory, rotational equivariance, and local descriptors for higher accuracy and speed. The Allegro-Legato model combines Allegro with sharpness-aware minimization for improved smoothness and robustness. Allegro-Legato exhibits a weaker dependence of time-to-failure on problem size and excellent computational scalability....
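
As a rough illustration of the sharpness-aware minimization (SAM) step mentioned in the Allegro-Legato abstract, the sketch below applies one common formulation of SAM to a toy quadratic loss. The loss, the `sam_step` name, and the `lr` and `rho` values are assumptions made for illustration; they are not taken from the paper, whose model trains a neural force field rather than this toy objective.

```python
import numpy as np

def loss(w):
    # Toy quadratic loss standing in for a training loss (assumption).
    return 0.5 * float(np.sum(w ** 2))

def grad(w):
    # Analytic gradient of the toy loss above.
    return w

def sam_step(w, lr=0.1, rho=0.05):
    # 1) Gradient at the current weights.
    g = grad(w)
    # 2) Move to the approximate worst-case point within an L2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # 3) Gradient at the perturbed weights, applied to the original weights.
    g_sharp = grad(w + eps)
    return w - lr * g_sharp

w0 = np.array([1.0, -2.0])
w = w0.copy()
for _ in range(50):
    w = sam_step(w)
print(loss(w0), loss(w))
```

The only difference from plain gradient descent is that the update uses the gradient taken at the perturbed point `w + eps`, which tends to favor flatter regions of the loss surface.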

Blind Video Deflickering by Neural Filtering with a Flawed Atlas

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Videos often contain flickering artifacts. Prior work requires specific guidance to remove flicker. This paper proposes a general flicker removal framework that only requires a single flickering video as input. The approach uses a neural atlas and a neural filtering strategy. Experiments show that the method achieves satisfying deflickering performance.
Paper Content
Introduction: Many videos suffer from flickering artifacts due to low-quality hardware, high-speed cameras, or video processing algorithms....

Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Training data attribution (TDA) methods trace a model’s prediction back to specific influential training examples. Existing approaches assign a scalar influence score to each training example, assuming influence is additive. Simfluence is a new paradigm for TDA which produces a training run simulator to study non-additive interactions. Simfluence-Linear captures non-additive interactions and predicts the spiky trajectory of individual example losses....

A Theory of Emergent In-Context Learning as Implicit Structure Induction

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: LLMs have the capacity to learn in-context from example demonstrations. In-context learning relies on recombination of compositional operations found in natural language data. Theoretical predictions are validated by introducing a controlled setup for inducing in-context learning. In-context learning emerges when scaling parameters and data. Models perform better when prompted to output intermediate steps....

I$^2$-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract:
- Presents I$^2$-SDF, a method for intrinsic indoor scene reconstruction and editing
- Uses differentiable Monte Carlo raytracing on neural signed distance fields
- Jointly recovers shapes, incident radiance and materials from multi-view images
- Introduces a novel bubble loss and error-guided adaptive sampling scheme
- Decomposes neural radiance field into spatially-varying material of the scene
- Demonstrates superior quality on indoor scene reconstruction, novel view synthesis, and scene editing
Paper Content
Introduction:
- Reconstructing 3D scenes from multi-view images is a fundamental task in computer science
- Neural Radiance Field (NeRF) uses MLPs to approximate the underlying geometry and appearance of a 3D scene
- Novel view synthesis is insufficient for scene editing applications
- Inverse rendering or intrinsic decomposition reconstructs and decomposes the scene into shape, shading and surface reflectance
- Complex indoor scenes are difficult to reconstruct
- I$^2$-SDF is a new method to decompose a 3D scene into its underlying shape, material, and incident radiance components
- Bubble loss and error-guided adaptive sampling improve reconstruction quality on small objects
- I$^2$-SDF enables photorealistic indoor scene relighting and editing
- High-quality synthetic indoor scene multi-view dataset provided
Related work:
- Neural implicit scene representations are used to represent 3D geometry and radiance information
- Neural radiance field (NeRF) uses a single MLP to encode a scene as a continuous volumetric field
- Follow-up works accelerate reconstruction speed using voxels, hashgrids or deep image features
- Neural fields can also be applied to represent 3D geometric functions
- Difficulties in handling shape-radiance ambiguity on texture-less surfaces
- Traditional multi-view stereo methods struggle with texture-less regions
- Learning-based MVS methods divided into two categories: depth-based and TSDF-based
- Neural implicit SDF methods used to tackle texture-less regions
- Inverse rendering attempts to reconstruct and factorize the scene with geometry, material and lighting
- Neural implicit representations used to estimate BRDF and lighting from image collections
- Recent methods mainly focus on single object reconstruction and do not handle spatially-varying lighting conditions
Overview:
- Goal is to decompose shape, radiance and material of indoor scene according to multi-view input images
- Implicit representations used to model geometry, radiance and material
- Pipeline consists of neural SDF field, neural radiance field, neural material fields and emission field
- Two-stage training scheme used to avoid training ambiguities
Implicit neural surface representation and volume rendering:
- Represent scene geometry as an implicit signed distance function (SDF)
- SDF maps 3D point to closest distance to surface
- Parameterize SDF and scene appearance as MLP
- Use differentiable volume rendering to learn scene implicit representation from images
- Color, depth, and normal of surface can be accumulated
Intrinsics decomposition:
- Indoor scenes contain objects of different scales and visibility levels
- Existing indoor reconstruction methods often fail to recognize and reconstruct thin or suspended objects
- Neural networks tend to converge faster on low-frequency information than high-frequency information
- Gradients for small objects can vanish due to the nature of neural networks
- To address this problem, “bubbles” are inserted to create gradients for SDF near small or thin objects
- Bubble loss is used to minimize the absolute SDF value of surface points
- Importance sampling algorithm is used to filter out large planar areas and preserve small-object areas
- Geometry loss is used to approximate the geometry field
- Depth and normal priors are used to handle shape-radiance ambiguity
- Smoothness loss is used to encourage smooth surface reconstruction
Emitter semantic field:
- Radiance field $F_c$ is trained from LDR images, causing under-estimation of light intensity from emitters....
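
The I$^2$-SDF summary above describes a signed distance function that maps a 3D point to its distance from the surface, and a bubble loss that minimizes the absolute SDF value at surface points. A minimal sketch of those two ideas follows. It assumes an analytic sphere in place of the paper's neural SDF, and `bubble_loss` is a literal reading of the one-line description, not the paper's exact formulation (which may add weighting or sampling details).

```python
import numpy as np

def sphere_sdf(points, center, radius):
    # Signed distance to a sphere: negative inside, zero on the surface, positive outside.
    return np.linalg.norm(points - center, axis=-1) - radius

def bubble_loss(sdf_fn, surface_points):
    # Mean absolute SDF value at points assumed to lie on the true surface.
    return float(np.mean(np.abs(sdf_fn(surface_points))))

rng = np.random.default_rng(0)
center = np.zeros(3)

# Sample points on a sphere of radius 0.5 (hypothetical "small object" surface).
directions = rng.normal(size=(256, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
surface = center + 0.5 * directions

# An SDF that matches the surface gives a loss near zero;
# one with the wrong radius is penalized by the size of the mismatch.
good = bubble_loss(lambda p: sphere_sdf(p, center, 0.5), surface)
bad = bubble_loss(lambda p: sphere_sdf(p, center, 0.8), surface)
print(good, bad)
```

In a learned setting the SDF would be an MLP, and this term would give it a nonzero gradient near small or thin objects that the other losses tend to miss.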

FreeNeRF: Improving Few-shot Neural Rendering with Free Frequency Regularization

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Novel view synthesis with sparse inputs is a challenging problem for neural radiance fields (NeRF). Frequency plays an important role in NeRF’s training. Two regularization terms are proposed to address the challenge. FreeNeRF achieves state-of-the-art performance across diverse datasets.
Paper Content
Introduction: Neural Radiance Field (NeRF) can render high-fidelity novel views but struggles with few inputs....

Erasing Concepts from Diffusion Models

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Recent advancements in text-to-image diffusion have motivated the study of erasing specific concepts from model weights. We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher. We benchmark our method against previous approaches that remove sexually explicit content and demonstrate its effectiveness....

Meet in the Middle: A New Pre-training Paradigm

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Most language models are trained and applied in a left-to-right fashion. This paper proposes a new pre-training paradigm to improve training data efficiency and capabilities of language models. The proposed pre-training paradigm includes a training objective and a bidirectional inference procedure. Experiments show the effectiveness of the pre-training paradigm, outperforming strong baselines....

Scaling Vision-Language Models with Sparse Mixture of Experts

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: NLP has made progress in developing large-scale vision-language models (VLMs). VLMs aim to bridge the gap between text and visual information. MoE techniques can divide the model into smaller, specialized sub-models. This paper explores the effectiveness of MoE in scaling VLMs. Research offers insights into stabilizing the training of MoE models.
Paper Content
Introduction:
- Ability to understand and generate natural language from visual information is important for applications
- Deep learning has led to development of large-scale vision-language models
- MoE can improve efficiency and effectiveness of VLMs
- Contributions: proposed VL-MoE, explored scaling strategies, presented ablations
Related work:
- Vision-language pretraining involves developing model architecture and pretraining objectives to learn effective multimodal representations from image-text pairs
- Two main approaches for model architecture: separate encoders and complex fusion module
- MOME Transformer unifies dual-encoder and fusion-encoder models
- Increasing interest to grow VL model capacity with affordable compute budget
- Pretraining objectives can be categorized into discriminative and generative modeling
- Sparse Mixture of Experts models studied for conditional computation, multitask learning, and multimodal learning
Method:
- Describe masked data modeling pretraining objectives
- Discuss MoEs, sparse MoEs
- Apply sparse MoEs methodology to vision-language models
- Explain design choices for routing algorithm and implementation of VL-MoE
Vision-language masked data modeling:
- Utilized a unified masked data modeling objective to pretrain VL-MoE on monomodal and multimodal data
- Used masked language modeling to learn language representations from text-only data
- Used masked image modeling to learn vision representations from image data
- Used masked vision-language modeling to recover masked image patches and text tokens
- Input text is tokenized and projected onto word embeddings
- Input image is split and reshaped into patches and flattened into vectors
- Image and text input vectors are concatenated
- Used mixture-of-modality-experts Transformer to encode different modalities
- Used mixture-of-experts model to selectively activate different parts of a neural network
- Replaced a subset of V-FFN and T-FFN with V-MoE and T-MoE layers
- Used Batch Priority Routing for stable training of VL-MoE
- Pretrained on 4 million images and 10 million image-text pairs
- Used Adam optimizer with linear warmup and cosine learning rate decay
- Used 32 expert parallelism and TUTEL for fast routing and computation
- Results show cost-performance tradeoff of VL-MoE dominates dense models
- Maximum wall-clock overhead of VL-MoE compared to dense counterparts is 13%
Vision-and-language downstream tasks:
- Explored performance of VL-MoE on vision-and-language downstream tasks
- Used 480x480 image resolution for VQA and 384x384 for other tasks
- Used VQA 2....
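
To make the idea of selectively activating parts of a network concrete, the sketch below shows plain top-k routing for a sparse mixture-of-experts layer. It is a generic illustration, not VL-MoE's implementation: the summary says the paper uses Batch Priority Routing, which this sketch does not reproduce, and each expert here is a single linear map rather than a full feed-forward block. The function name, shapes, and random weights are all assumptions.

```python
import numpy as np

def top_k_route(x, gate_w, expert_ws, k=1):
    # Gate scores for every (token, expert) pair.
    logits = x @ gate_w
    # Indices of the k highest-scoring experts per token.
    top = np.argsort(-logits, axis=1)[:, :k]
    out = np.zeros((x.shape[0], expert_ws[0].shape[1]))
    for t in range(x.shape[0]):
        # Softmax over the selected experts only.
        sel = logits[t, top[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()
        # Only the selected experts run for this token.
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ expert_ws[e])
    return out, top

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                               # 4 tokens, hidden size 8
gate_w = rng.normal(size=(8, 3))                          # router for 3 experts
expert_ws = [rng.normal(size=(8, 8)) for _ in range(3)]   # toy linear experts

out, top = top_k_route(x, gate_w, expert_ws, k=1)
print(out.shape, top.ravel())
```

With `k=1` each token passes through exactly one expert, so the per-token compute stays close to that of a dense layer while the total parameter count grows with the number of experts.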

High-throughput Generative Inference of Large Language Models with a Single GPU

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: LLM inference traditionally requires multiple high-end accelerators. This paper studies LLM inference using limited resources, such as a single commodity GPU. FlexGen is a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints. FlexGen compresses weights and the attention cache to 4 bits with negligible accuracy loss....
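
For a sense of what compressing tensors to 4 bits can look like, here is a minimal sketch of group-wise min-max quantization. It is one common scheme, shown under assumptions: the group size, the function names, and the choice to keep each 4-bit code in its own `uint8` (rather than packing two per byte) are illustrative and not taken from FlexGen's code.

```python
import numpy as np

def quantize_4bit(x, group_size=64):
    # Split the tensor into fixed-size groups; each group gets its own range.
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    # 4 bits give 16 levels (codes 0..15) between the group min and max.
    scale = (hi - lo) / 15.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant groups
    q = np.round((g - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_4bit(q, lo, scale, shape):
    # Map codes back to floating-point values in the original shape.
    return (q.astype(np.float32) * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 64)).astype(np.float32)  # stand-in weight matrix

q, lo, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, lo, scale, w.shape)
print(q.max(), float(np.abs(w - w_hat).max()))
```

Each value's reconstruction error is bounded by half a quantization step for its group, which is why smaller groups (at the cost of storing more `lo` and `scale` values) usually reduce the error.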