arxiv-summary: AI-summarized AI papers

Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- Diffusion models can be used for music generation
- Music generation requires handling multiple aspects
- Developed a cascading latent diffusion approach to generate high-quality stereo music
- Targets real-time generation on a single consumer GPU
- Open-sourced code and music samples for all models
Paper Content
Introduction
- Music generation is a challenging problem
- Recently, deep learning models have been used to explore audio generation
- Existing models explore the use of recurrent neural networks, generative adversarial networks, autoencoders, and transformers
- Diffusion models have been used in speech synthesis, but are still under-explored for music generation
- Long-term structure, sound quality, diversity of music, and control of generation are the main challenges in music generation
- Moûsai is a text-conditional cascading diffusion model that tries to address all of these challenges
- Moûsai uses a custom two-stage cascading diffusion method
- Moûsai can generate long-context 48kHz stereo music exceeding the minute mark
- Moûsai uses an efficient 1D U-Net architecture for both stages of the cascade
- Moûsai uses a diffusion magnitude autoencoder to compress the audio signal by 64x
Related work
- A common trend in the generative space is to first train a compression model on the input domain and then learn a generative model on top of the reduced representation
- Auto-encoding and quantized auto-encoding are popular compression methods for images
- Two popular directions in the generative space are to learn a quantized representation or to use a compressed/downsampled representation
- A cascading diffusion approach has not been attempted for audio generation
- Our work follows ideas from the cascading diffusion approach, using a two-stage method to compress audio and generate the reduced representation while conditioning on a textual description
Preliminaries
- Diffusion: a class of generative models that learn to reverse a gradual noising process
- Latent diffusion: diffusion applied in a compressed latent space rather than on the raw signal
- U-Net: a convolutional encoder-decoder network with skip connections
Audio generation
- Audio generation is a challenging task
- Waveforms can be represented in different resolutions
- Higher sample rates allow for more temporal resolution
- Qualitative properties such as texture and pitch can be observed
- Audio can be represented with mono, stereo, or surround sound
- Models can be trained on single or multiple modalities
Diffusion
- Employed v-objective diffusion as proposed by Salimans & Ho (2022)....
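The v-objective mentioned above can be illustrated with a minimal sketch. It assumes the trigonometric schedule alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2) with t in [0, 1]; the function names are hypothetical and this is not the authors' code. The network would be trained to regress the target v from the noised input x_t.

```python
import numpy as np

def v_diffusion_targets(x0, noise, t):
    """Return the noised input x_t and the v-objective regression target.

    Assumed schedule (an illustration, not taken from the summary):
    alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2), with t in [0, 1].
    """
    phi = 0.5 * np.pi * t
    alpha, sigma = np.cos(phi), np.sin(phi)
    x_t = alpha * x0 + sigma * noise   # forward (noising) process
    v = alpha * noise - sigma * x0     # velocity target the model predicts
    return x_t, v

def recover_x0(x_t, v, t):
    """Invert the parameterisation: x0 = alpha_t * x_t - sigma_t * v."""
    phi = 0.5 * np.pi * t
    return np.cos(phi) * x_t - np.sin(phi) * v
```

Because alpha_t^2 + sigma_t^2 = 1 under this schedule, the clean signal can be recovered exactly from x_t and a perfect prediction of v.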

Unsupervised Volumetric Animation

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- Proposes a novel approach for unsupervised 3D animation of non-rigid deformable objects
- Learns 3D structure and dynamics from single-view RGB videos
- Decomposes objects into semantically meaningful parts
- Uses a 3D autodecoder framework and a keypoint estimator
- Evaluated on two video datasets and one image dataset
- Can obtain animatable 3D objects from a single image or a few images
Paper Content
Introduction
- The ability to animate a dynamic object from a single image enables creative tasks
- Applications range from visual effects to consumer applications
- Two approaches: outsourcing understanding to existing models or learning from raw data
- Outsourcing requires prior knowledge of the object; learning from raw data is unsupervised
- There has been recent progress in unsupervised image animation
- Methods typically learn a motion model based on object parts and their transformations
- Prior works offer means to perform 2D animation only
- Our work explores unsupervised image animation in 3D
- Challenges include identifying and controlling object parts from 2D videos, modeling the camera in 3D, and the lack of suitable inductive bias in 2D CNNs
- Our framework maps the object to a canonical volumetric representation, parameterized with a voxel grid
- Rigid parts are softly assigned to points in the canonical volume
- Linear blend skinning produces the deformed volume according to the pose of each part
- A differentiable Perspective-n-Point (PnP) algorithm estimates pose, linking 2D observations to the 3D representation
- Parts are learned in an unsupervised manner, allowing for 3D reconstruction and novel view synthesis
- Evaluated on three diverse datasets
Related work
- 3D-aware image and video synthesis has seen significant progress in the last two years
- Early works used Neural Radiance Fields (NeRFs) to synthesize simple objects
- Later works scaled the generator and increased its efficiency to attain high-resolution 3D synthesis
- Different types of volumetric representations have been used
- Implicit video synthesis techniques have been combined with volumetric rendering to generate 3D-aware videos
- Unsupervised 3D reconstruction has been attempted
- Supervised image animation requires an off-the-shelf keypoint predictor or a 3D morphable model estimator
- Unsupervised image animation does not require supervision beyond a photometric reconstruction loss
- Improved motion representations have been proposed for animation
- Latent Image Animator learned a latent space for possible motions
Method
Canonical voxel generator
- Uses a voxel grid to parametrize the volume
- Generates a volume cube with density and RGB fields
- Models the object as a set of rigid moving parts
- Optimizes identity embeddings directly during training
- Learns 3D keypoints and uses a 2D keypoint predictor to predict 2D keypoints
- Computes deformed density and radiance via volumetric skinning
- Volumetrically renders the deformed radiance to produce the rendered image
- Supervised using a reconstruction loss
Unsupervised pose estimation
- An object’s movement can be factorized into a set of rigid movements of its individual parts....
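The linear blend skinning step described above can be sketched generically as follows. This is a textbook forward formulation under assumed array shapes, not the paper's implementation (which may warp in the opposite direction); the function name and arguments are hypothetical.

```python
import numpy as np

def linear_blend_skinning(points, weights, rotations, translations):
    """Deform canonical points by blending K rigid part transforms.

    points:       (N, 3) canonical 3D points
    weights:      (N, K) soft part assignments, each row summing to 1
    rotations:    (K, 3, 3) rotation matrix per part
    translations: (K, 3) translation per part
    """
    # Apply every part's rigid transform to every point: shape (K, N, 3)
    per_part = np.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Blend the K candidate positions with the soft assignment weights: (N, 3)
    return np.einsum("nk,kni->ni", weights, per_part)
```

A point assigned entirely to one part simply follows that part's rigid motion, while a point shared between parts lands at the weighted average of the candidate positions.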

MusicLM: Generating Music From Text

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- Introduce MusicLM, a model that generates high-fidelity music from text descriptions
- MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task
- MusicLM generates music at 24 kHz that remains consistent over several minutes
- MusicLM outperforms previous systems in audio quality and adherence to the text description
- MusicLM can be conditioned on both text and a melody
- Release MusicCaps, a dataset of 5....

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Cut-and-LEaRn (CutLER) is an approach for training unsupervised object detection and segmentation models. CutLER uses a MaskCut approach to generate coarse masks for multiple objects in an image and then learns a detector on these masks. CutLER is simple, compatible with different detection architectures, and detects multiple objects. CutLER improves detection performance AP50 by over 2....

Open Problems in Applied Deep Learning

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- Formulates the machine learning mechanism as a bi-level optimization problem
- The inner-level optimization loop minimizes a loss function on training data
- The outer-level optimization loop maximizes a performance metric on validation data
- The outer loop entails model engineering, experiment tracking, dataset versioning, etc.
- It is automated via AutoML or left to the intuition of ML students, engineers, and researchers
- There is a need to reduce the computational cost and carbon footprint of developing AI algorithms
- Considers supervised, semi-supervised, self-supervised, unsupervised, etc....
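The bi-level structure described above can be made concrete with a toy sketch. Ridge regression is used here purely as an assumed stand-in model, and all function names are hypothetical: the inner level fits weights on training data for a fixed regularization strength, and the outer level picks the strength that maximizes a validation metric.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Inner level: minimise ||Xw - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def validation_score(w, X_val, y_val):
    """Outer-level metric: negative mean squared error (higher is better)."""
    return -np.mean((X_val @ w - y_val) ** 2)

def bilevel_search(X_tr, y_tr, X_val, y_val, lambdas):
    """Outer level: choose the hyperparameter whose inner solution scores best."""
    best = max(
        lambdas,
        key=lambda lam: validation_score(fit_ridge(X_tr, y_tr, lam), X_val, y_val),
    )
    return best, fit_ridge(X_tr, y_tr, best)
```

Here the outer loop is a plain grid search over positive values of lam; AutoML systems replace it with more sample-efficient search strategies.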

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract LLMs can be used to complete written assignments, making it difficult for instructors to assess student learning. Text sampled from an LLM tends to occupy negative curvature regions of the model’s log probability function. DetectGPT is a new curvature-based criterion for judging if a passage is generated from a given LLM. DetectGPT is more discriminative than existing zero-shot methods for model sample detection....
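The curvature idea can be sketched as a perturbation discrepancy: the log-probability of a passage minus the average log-probability of slightly perturbed versions of it. In the sketch below, `log_prob` and `perturb` are hypothetical callables supplied by the caller (a scoring model and a rewriting routine); they are assumptions for illustration, not an existing API.

```python
import random
from statistics import mean

def perturbation_discrepancy(text, log_prob, perturb, n_perturbations=20, seed=0):
    """Estimate how much lower perturbed versions score than the original.

    log_prob(text) -> float        hypothetical: log-probability under the model
    perturb(text, rng) -> str      hypothetical: small meaning-preserving rewrite
    A large positive value suggests the text sits near a local maximum of the
    model's log-probability, as expected for model-generated text.
    """
    rng = random.Random(seed)
    original = log_prob(text)
    perturbed = [log_prob(perturb(text, rng)) for _ in range(n_perturbations)]
    return original - mean(perturbed)
```

A detector would then threshold this value, with the threshold chosen on held-out data.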

Text-To-4D Dynamic Scene Generation

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- MAV3D is a method for generating 3D dynamic scenes from text descriptions
- MAV3D uses a 4D dynamic Neural Radiance Field (NeRF)
- MAV3D does not require 3D or 4D data
- MAV3D is trained on text-image pairs and unlabeled videos
- MAV3D is the first to generate 3D dynamic scenes given a text description
Paper Content
Introduction
- Generative models can now generate realistic images from natural language prompts
- Generative models have been extended to videos and 3D shapes
- MAV3D combines the benefits of video and 3D generative models
- MAV3D takes a natural language description and outputs a dynamic 3D scene
- There is no readily available collection of 4D models with textual annotations
- MAV3D uses a video generator as a ‘statistical’ multi-camera setup
- MAV3D uses a Neural Radiance Field (NeRF) to represent dynamic 3D scenes
- MAV3D uses a multi-stage training pipeline for dynamic scene rendering
- MAV3D uses a temporal-aware SDS loss and motion regularizers
- MAV3D uses temporal-aware super-resolution fine-tuning for higher-resolution outputs
Related work
- Neural rendering uses neural networks to represent 3D scenes
- Recent work has improved efficiency by incorporating 3D data structures
- The aim is to generate dynamic scenes which can be viewed from any angle
- Generating 3D scenes from text dates back decades
- Recent improvements in diffusion models have led to advanced image synthesis
- The video generator is based on Make-A-Video (MAV)
Method
- The goal is to develop a method to produce a dynamic 3D scene from a natural-language description
- Use a pretrained text-to-video (T2V) diffusion model as a scene prior
- Given a text prompt, fit a 4D scene representation
- Render a sequence of images from the 4D scene representation
- Pass the text prompt and the video to a pretrained T2V diffusion model
- Use Score Distillation Sampling (SDS) to compute an update direction for the scene parameters
4D scene representation
- Neural rendering is used to represent a dynamic 3D scene implicitly
- Rays are cast through the camera plane into the scene and points are sampled along each ray
- Volume density and color are computed for each point
- An MLP is used to output the color
- HexPlane is used to represent the 4D scene
- An MLP is used to predict volume density and color
- A background model simulates a large static sphere surrounding the dynamic foreground
Dynamic scene optimization
- The HexPlane model is optimized to match the textual prompt
- Temporal Score Distillation Sampling (SDS-T) is introduced as an extension of SDS
- The loss is computed and applied to MAV3D
- The pretrained conditional video generator is based on diffusion
- The update direction for the scene parameters θ is computed using SDS
- A multi-stage static-to-dynamic optimization scheme is used
- Gaussian annealing and a total variation loss are used as regularizers
Super-resolution fine-tuning
- The 4D scene representation is supervised via low-resolution 64x64 renderings
- Rendering higher-resolution videos from the learned model can lack detail and exhibit artifacts
- Super-resolution fine-tuning (SRFT) uses a pretrained and frozen video super-resolution (SR) module
- The SR module takes as input a noisy high-resolution 256x256 video and a clean low-resolution 64x64 video
- The SR module is used to improve high-resolution renderings from the 4D scene model
- SRFT trains jointly using SDS from the SR module and SDS-T
Experiments
- MAV3D is evaluated on generating dynamic scenes from text descriptions
- Three alternative methods are developed as baselines
- Simplified versions of the model are evaluated on the sub-tasks of T2V and Text-To-3D
- A comprehensive ablation study justifies the method’s design
- Dynamic NeRFs are converted into dynamic meshes
Results
- Text-to-4D comparison
- Text-to-3D comparison
- Text-to-Video comparison
Ablation study
- Human raters prefer the model trained with SR for quality, text alignment, and motion
- SR fine-tuning enhances the quality of rendered videos
- The model trained without static scene pre-training has lower scene quality and poor convergence
- The dynamic camera variant has less motion and suffers from multi-face objects
- Gaussian annealing leads to renderings with larger and more realistic motion
- HexPlane is slightly preferred in terms of overall quality and realistic motion
- Instant-NGP is significantly less preferred
Real-time rendering
- The HexPlane model can be converted to animated meshes
- The marching cubes algorithm is used to extract a simplicial mesh
- Mesh decimation and removal of small noisy connected components follow
- The XATLAS algorithm is used to map mesh vertices to a texture atlas
- The texture is initialized using HexPlane colors
- The texture is further optimized to better match example frames
- The collection of textured meshes can be played back in a 3D engine
Image to 4D
- An input image can be used to generate a 4D asset
- The 4D asset shares the same semantics as the input image
- Images provided by Nichol et al....
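The ray-casting step in the 4D scene representation (sample points along a ray, compute density and color per point, composite) follows the standard NeRF volume rendering quadrature. A minimal sketch for a single ray is given below; the function name and array shapes are assumptions, and in MAV3D the densities and colors would come from the HexPlane features and MLP rather than being passed in directly.

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """Composite per-sample densities and colors along one ray.

    densities: (S,) non-negative volume density at each sample
    colors:    (S, 3) RGB color at each sample
    deltas:    (S,) distance between consecutive samples
    Returns the rendered RGB color and the per-sample weights.
    """
    # Opacity contributed by each segment of the ray
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: probability the ray reaches sample i without being absorbed
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    return weights @ colors, weights
```

Empty space (zero density) contributes nothing, and an opaque sample hides everything behind it.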

Deep Laplacian-based Options for Temporally-Extended Exploration

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Selecting exploratory actions to generate experience is a challenge in RL. Options-based exploration builds on graph Laplacian eigenfunctions. Previous methods limited to tabular domains, separate option discovery phase, and exact value function learning. This paper introduces a deep RL algorithm to discover Laplacian-based options. Evaluated on pixel-based tasks, compared to state-of-the-art exploration methods....

Finding Regions of Counterfactual Explanations via Robust Optimization

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Counterfactual explanations are important for detecting bias and improving explainability of data-driven classification models. Counterfactual explanations are minimal perturbed data points that cause the model’s decision to change. Existing methods can only provide one CE, which may not be achievable for the user. This work provides an iterative method to calculate robust CEs that remain valid even after features are slightly perturbed....
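For intuition only, the notion of a robust counterfactual can be sketched for the simplest case, a linear classifier sign(w·x + b) with L2 perturbations of radius rho. This closed-form toy is an assumption-laden illustration and not the iterative method of the paper; the function name and the `eps` safety margin are hypothetical.

```python
import numpy as np

def robust_counterfactual(x, w, b, rho=0.0, eps=1e-9):
    """Closest point to x (in L2) that stays on the positive side of the
    linear decision boundary w.x + b = 0 under any L2 perturbation of
    size at most rho. Illustrative only: linear model, L2 norm.
    """
    norm = np.linalg.norm(w)
    margin = (w @ x + b) / norm    # signed distance of x to the boundary
    needed = rho + eps - margin    # how far x must move along w
    if needed <= 0:
        return x.copy()            # x already satisfies the robust condition
    return x + needed * w / norm
```

With rho = 0 this reduces to the usual minimal counterfactual just across the boundary; a larger rho pushes the point deeper into the positive region so that small feature changes do not flip the decision back.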

simple diffusion: End-to-end diffusion for high resolution images

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Diffusion models are difficult to apply to high resolution images. Existing approaches focus on lower dimensional spaces or multiple super-resolution levels. This paper aims to improve denoising diffusion for high resolution images while keeping the model simple. Four main findings: the noise schedule should be adjusted, only a particular part of the architecture needs to be scaled, dropout should be added at specific locations, and downsampling is an effective strategy....