arxiv-summary: AI-summarized AI papers

How different are self and nonself?

Abstract: Biological and artificial neural networks can make reliable distinctions between similar inputs. The immune system is similar: it also makes reliable distinctions, and its discrimination is partly learned. Self and nonself peptides have nearly identical distributions, but both are strongly inhomogeneous. The immune system targets the spaces in between the samples, which would count as overfitting in a conventional learning problem....

Deep learning for size-agnostic inverse design of random-network 3D printed mechanical metamaterials

Abstract: Designing mechanical metamaterials often involves solving inverse problems, which means finding microarchitectures that give the desired properties. Additive manufacturing techniques have limited resolution. The multi-objective inverse design problem is difficult to solve. Deep-DRAM combines four decoupled models to solve it and finds many solutions to the multi-objective inverse design problem. A filtering step is used to identify the designs with minimum peak stresses....

Scalable Adaptive Computation for Iterative Generation

Abstract: The paper presents the Recurrent Interface Network (RIN), a neural net architecture that allocates computation adaptively to the input. The hidden units of a RIN are partitioned into the interface and the latents. A RIN block selectively reads from the interface into the latents for high-capacity processing. Stacking multiple blocks enables effective routing across local and global levels. A latent self-conditioning technique “warm-starts” the latents at each iteration of the generation process. RINs yield state-of-the-art image and video generation without cascades or guidance, and are up to 10× more efficient than specialized 2D and 3D U-Nets.

Paper Content:

Introduction: The design of effective neural network architectures is important for deep learning; convolutional neural networks and Transformers are examples of such architectures. Computation is usually allocated in a fixed, uniform manner. Allocating computation adaptively is important for improving scalability. Prior work has explored dynamic and input-decoupled computation. Generating images and videos, which are high-dimensional data, requires adaptive computation. Recurrent Interface Networks (RINs) are a new architecture that allocates computation more effectively. RINs outperform U-Net architectures for image and video generation. Latent self-conditioning is proposed to reduce the cost of routing. RINs lead to significant performance and efficiency gains in diffusion models.

Method Overview: RINs use tokenization to connect the interface to the input space and learnable embeddings to initialize the latents....
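
The read, process, and route pattern described in the RIN summary above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses single-head attention with identity projections, leaves out layer norms, MLPs, and all learned weights, and the write-back step from latents to interface is an assumption that the summary does not spell out. The names `attend` and `rin_block` and the toy sizes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head dot-product attention with identity projections.

    Assumption: a real block would use learned query/key/value weights.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        out.append([sum(w * c[d] for w, c in zip(weights, context)) for d in range(dim)])
    return out


def add(xs, ys):
    """Element-wise residual connection between two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def rin_block(interface, latents):
    # Read: the few latents gather information from the many interface tokens.
    latents = add(latents, attend(latents, interface))
    # Process: the bulk of the computation happens among the latents only.
    latents = add(latents, attend(latents, latents))
    # Write (assumed): interface tokens are updated from the latents.
    interface = add(interface, attend(interface, latents))
    return interface, latents


# Toy sizes: 8 interface tokens (for example, image patches) and 2 latents.
interface = [[0.1 * (i + d) for d in range(4)] for i in range(8)]
latents = [[0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]  # stand-in for learnable embeddings

for _ in range(2):  # stacking blocks
    interface, latents = rin_block(interface, latents)
```

Because the heavy step runs over the latents rather than the interface, its cost in this sketch depends on the number of latents, not on the number of input tokens.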

Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography

Abstract: Capturing and merging frames to enhance an image disregards the 3D nature of the scene. A sequence of 42 12-megapixel RAW frames captured over 2 seconds can recover high-quality scene depth. A test-time optimization approach fits a neural RGB-D representation to long-burst data. The plane plus depth model is trained end-to-end and performs coarse-to-fine refinement. The result is geometrically accurate depth reconstructions with no additional hardware or separate steps.

Paper Content:

Introduction: The introduction touches on the rise and fall of film and DSLR photography, the high-megapixel image streams offered by cellphones, on-board motion measurement devices, integrated active depth sensors, and depth imaging and 3D reconstruction. Depth can help with object understanding, can help compensate for non-ideal camera hardware, and can be used for augmented reality and interactive experiences. There are many ways to produce depth. Apple iPhone 12-14 Pro devices use depth derived from RGB. Existing approaches include passive monocular depth estimation methods, multiview depth estimation methods, and neural radiance field approaches. The paper works with millimeter-scale view variation from natural hand tremor. It proposes an unsupervised end-to-end approach to estimate depth and camera motion, jointly distilling relative depth and pose estimates. Evaluations demonstrate that the approach outperforms existing methods.

Related work: Depth estimation can be divided into active and passive methods. Active methods use controlled illumination to infer object shape or improve stereo feature matching. Time-of-flight (ToF) depth sensors use the round-trip time of photons to infer depth. Passive methods use the correlation between visual and geometric features to estimate 3D structure. Multiview and structure-from-motion works leverage epipolar geometry to extract 3D information from multiple images. Neural scene representation works learn an implicit representation of a 3D scene by fitting a multi-layer perceptron to a set of input images. This work uses a neural representation of RGB to distill high-quality continuous representations of both depth and camera poses.

Long-burst photography: Burst photography is a type of imaging where multiple frames are taken in rapid succession. Parameters such as ISO and exposure time can be varied during capture. Burst imaging pipelines investigate how these frames can be merged into a single higher-fidelity image. The video processing literature operates on sequences hundreds of frames in length and/or with large camera motion. Long-burst photography is several seconds of continuous capture with small view variation. A data collection tool was designed to record two-second, 42-frame long-bursts. RAW images are preserved with 14-bit color depth and minimal processing.

Unsupervised depth estimation: The paper proposes a method for depth estimation from long-burst data.

Assessment: The approach is compared to BARF, SfSM, iPhone 14 Pro, MiDaS, RCVD, and HNDR. All baselines except HNDR run on processed RGB data. Absolute performance and geometric consistency are assessed with 3D objects scanned by a commercial high-precision turntable structured light scanner. The approach outperforms existing learned, mixed, and multi-view only methods. It reconstructs small features such as Dragon’s tail, Harold’s scarf, and the ear of the Tiger statue. The plane plus depth offset approach avoids spurious depth solutions in low-parallax regions. The method produces high-quality object reconstructions. Quantitative depth metrics outperform all comparison methods. RAW long-burst data can improve depth reconstruction compared to 8-bit RGB.

Discussion and future work: It is possible to recover high-quality, geometrically-accurate object depth from a stack of images acquired during long-burst photography....
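
The time-of-flight principle mentioned in the related-work notes above reduces to one line of arithmetic: light travels to the object and back, so depth is half the round-trip distance. The sketch below only illustrates that relation; the function name `tof_depth_m` and the example timing are hypothetical, and real sensors must also deal with noise, ambient light, and multi-path effects.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def tof_depth_m(round_trip_time_s):
    """Depth in metres from a photon round-trip time in seconds.

    The photon covers the camera-to-object distance twice, hence the division by 2.
    """
    if round_trip_time_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


# A hypothetical 10-nanosecond round trip corresponds to roughly 1.5 m of depth.
depth = tof_depth_m(10e-9)
```

The same relation shows why timing precision matters: a timing error of 1 nanosecond shifts the estimated depth by about 15 cm.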

Beyond SOT: It's Time to Track Multiple Generic Objects at Once

Abstract: Generic Object Tracking (GOT) is the problem of tracking target objects in a video. Previous research has focused on single-object tracking, but multi-object tracking has wider applicability. A new large-scale GOT benchmark, LaGOT, is introduced to tackle key remaining challenges in GOT. A Transformer-based GOT tracker, TaMOS, is proposed to process multiple objects simultaneously....

Impossibility Theorems for Feature Attribution

Abstract: Interpretability methods can produce plausible explanations, but they also have known failure cases. It is unclear how to use these methods or how to choose between them. Feature attribution methods can fail to improve on random guessing for inferring model behaviour. End-tasks should be defined, and a simple approach of repeated model evaluations can then outperform complex feature attribution methods....
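
To make "repeated model evaluations" concrete, here is a minimal sketch for one possible end-task: deciding whether a model's output changes when a single feature is perturbed near a given input. The end-task, the function name `is_locally_sensitive`, the toy model, and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def is_locally_sensitive(model, x, feature, radius=0.1, n_queries=50, tol=1e-6, seed=0):
    """Query the model repeatedly around x and report whether `feature` matters there.

    model:   callable mapping a list of floats to a float
    feature: index of the feature to perturb
    radius:  half-width of the perturbation interval (hypothetical default)
    """
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    baseline = model(x)
    for _ in range(n_queries):
        probe = list(x)
        probe[feature] += rng.uniform(-radius, radius)
        if abs(model(probe) - baseline) > tol:
            return True  # a single changed output settles the question
    return False


def toy_model(x):
    """Hypothetical model that depends on feature 0 and ignores feature 1."""
    return 3.0 * x[0] + 0.0 * x[1]


sensitive_to_first = is_locally_sensitive(toy_model, [1.0, 2.0], feature=0)
sensitive_to_second = is_locally_sensitive(toy_model, [1.0, 2.0], feature=1)
```

The check answers the end-task directly by querying the model, without computing any attribution scores. A negative answer only means that none of the sampled probes changed the output.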

GOOD: Exploring Geometric Cues for Detecting Objects in an Open World

Abstract: The paper addresses the task of open-world class-agnostic object detection. RGB-based models suffer from overfitting and fail to detect novel-looking objects. Geometric cues such as depth and normals are incorporated to train an object proposal network. GOOD significantly improves detection recall for novel object categories and performs well with few training classes. Using a single “person” class for training on the COCO dataset, GOOD surpasses SOTA methods by 5....

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Abstract: Text-to-Video (T2V) generation requires large-scale text-video datasets for fine-tuning. Humans can learn new visual concepts from a single exemplar. One-Shot Video Generation uses a single text-video pair for training. Text-to-Image (T2I) diffusion models are adapted for T2V generation. Tune-A-Video uses Sparse-Causal Attention to generate videos from text prompts.

Paper Content:

Introduction: A large-scale multimodal dataset has enabled breakthroughs in open-domain T2I generation. Recent works have extended spatial-only T2I generation models to the spatiotemporal domain. Humans are capable of one-shot learning. Can pre-trained T2I models infer other novel videos from a single video example?...
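
The summary above names Sparse-Causal Attention without describing it. A common reading, taken here as an assumption, is that each frame attends only to the first frame and to the frame immediately before it, instead of to every frame in the clip. The sketch below shows only that frame-selection pattern; the function names are hypothetical and the attention computation itself is left out.

```python
def sparse_causal_context(frame_index):
    """Frames that a given frame draws keys and values from (assumed pattern).

    Returns the first frame and the previous frame; frame 0 falls back to itself.
    """
    if frame_index < 0:
        raise ValueError("frame_index must be non-negative")
    return [0, max(frame_index - 1, 0)]


def dense_context(num_frames):
    """Full spatiotemporal attention for comparison: every frame sees every frame."""
    return list(range(num_frames))


num_frames = 6
sparse_pattern = [sparse_causal_context(i) for i in range(num_frames)]
dense_size = len(dense_context(num_frames))
```

Under this pattern frame 5 draws on frames 0 and 4 only, so the number of attended frames per query stays at two however long the clip is, whereas the dense alternative grows with the clip length.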

Local Policy Improvement for Recommender Systems

Abstract: Recommender systems aim to predict what items a user will interact with next. Historically, this problem has been solved using supervised learning. Recently, policy optimization has been used to maximize user engagement. When training a new policy, data from a previously-deployed policy is used. An alternative approach is local policy improvement without off-policy correction....
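
For readers unfamiliar with the term, the off-policy correction that the summary above says the paper avoids is typically done with importance weights: each logged reward is rescaled by how much more (or less) likely the new policy is to take the logged action than the old one was. The sketch below shows that standard estimator in a context-free toy setting. It illustrates the correction itself, not the paper's local policy improvement method, and all names and numbers are hypothetical.

```python
def ips_value_estimate(logged, new_policy_prob):
    """Inverse-propensity-scored estimate of a new policy's average reward.

    logged:          list of (action, reward, logging_prob) tuples collected
                     under the previously deployed policy
    new_policy_prob: callable returning the new policy's probability of an action
    """
    if not logged:
        raise ValueError("need at least one logged interaction")
    total = 0.0
    for action, reward, logging_prob in logged:
        weight = new_policy_prob(action) / logging_prob  # importance weight
        total += weight * reward
    return total / len(logged)


# Hypothetical log: the old policy recommended items "a" and "b" equally often,
# and only "a" led to engagement (reward 1.0).
logged = [("a", 1.0, 0.5), ("b", 0.0, 0.5), ("a", 1.0, 0.5), ("b", 0.0, 0.5)]
new_policy = {"a": 0.8, "b": 0.2}

estimate = ips_value_estimate(logged, new_policy.get)
```

The weights can become very large when the new policy favours actions the old policy rarely took, which makes the estimate noisy; that variance is one common motivation for looking at alternatives to importance weighting.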

ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

Abstract: The paper proposes to unify the subjects of speech enhancement and to study Generalized Speech Enhancement. The goal is to improve certain aspects of speech, such as intelligibility, quality, and video synchronization. The model is composed of two steps: pseudo audio-visual speech recognition and pseudo text-to-speech synthesis. The model is called ReVISE and is evaluated on EasyCom, an audio-visual benchmark.

Paper Content:

Introduction: Speech in the wild is often corrupted with natural and non-natural sounds....