arxiv-summary: AI-summarized AI papers

A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Automatic Music Transcription (AMT) is a key technology with many applications. Instrument-specific systems tend to be more accurate than instrument-agnostic methods. Estimating frame-wise $f_0$ values is easier than note event detection. This paper proposes a lightweight neural network for musical instrument transcription. The model is trained to predict onsets, multipitch and note activations....

BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis

Abstract: Building a dataset, BEAT, with 76 hours of multi-modal data from 30 speakers. 32 million frame-level emotion and semantic relevance annotations. Correlation of conversational gestures with facial expressions, emotions, and semantics. Proposing a baseline model, Cascaded Motion Network (CaMN). Introducing a metric, Semantic Relevance Gesture Recall (SRGR). BEAT is the largest motion capture dataset for investigating human gestures.

Paper Content

Related work: Mo-cap and pseudo-label conversational gesture datasets exist. Most common mo-cap dataset is 4-hour Trinity dataset. Datasets for talking-face generation exist, but cannot be used for gesture synthesis. Semantic or emotion-aware motion synthesis studied in action recognition and sign-language analysis/synthesis. Baseline models for conversational gesture synthesis exist. Efforts to improve performance of baseline models by input/output representation selection, adversarial training, and generative modeling techniques. Probabilistic gesture generation enables generating diversity based on noise.

BEAT: Body-Expression-Audio-Text dataset: Dataset acquisition process described. Text, emotion, and semantic relevance information annotation introduced. Correlation between conversational gestures and emotions analyzed using BEAT. Distribution of semantic relevance shown. Motion capture system based on 16 synchronized cameras recording motion at 120 Hz. Facial capture system uses ARKit with a depth camera on iPhone 12 Pro. Audio recorded in 48 kHz stereo.

Data acquisition: BEAT is divided into conversation and self-talk sessions. Speaker’s gestures are divided into four categories. Topics are selected from 20 predefined topics. Self-talk sessions consist of 120 1-minute recordings. 8 emotions are covered in the dataset. Proportion of languages and accents is strictly controlled. Mainly English data, with some Chinese, Spanish and Japanese. 30 speakers from different ethnicities. Speakers asked to read answers proficiently and show natural, personal, daily style of conversational gestures. Professional speaker instructs them to elicit corresponding emotion correctly.

Data annotation: Used an Automatic Speech Recognizer (ASR) to obtain initial text for conversation session. Used Montreal Forced Aligner (MFA) for temporal alignment of text with audio. Confirmed 8-class emotion label of self-talk. Annotators watched video with audio and gestures to perform frame-level annotation. 600 annotators from Amazon Mechanical Turk (AMT) scored semantic relevance on a scale of 0-10.

Data analysis: BEAT collection and annotation enables analysis of correlations between conversational gestures and other modalities....

Training language models to follow instructions with human feedback

Abstract: Making language models bigger does not necessarily make them better at following user intent. This paper shows an avenue for aligning language models with user intent by fine-tuning with human feedback. Human evaluations show that outputs from a 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3. InstructGPT models show improvements in truthfulness and reductions in toxic output generation....

Pseudo Numerical Methods for Diffusion Models on Manifolds

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as images and audio. DDPMs require hundreds to thousands of iterations to produce final samples. Prior works have attempted to accelerate DDPMs, but have not been able to maintain sample quality. The paper proposes pseudo numerical methods for diffusion models (PNDMs). PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs....

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Abstract: Vision-Language Pre-training (VLP) has improved performance on many vision-language tasks, but most existing pre-trained models excel at either understanding-based tasks or generation-based tasks, not both. Much of the performance improvement has come from scaling up the dataset with noisy image-text pairs collected from the web. BLIP is a new VLP framework which transfers flexibly to both understanding and generation tasks. BLIP utilizes noisy web data by bootstrapping captions and achieves state-of-the-art results on a range of vision-language tasks....

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Abstract: Alpa automates model-parallel training of large deep learning models. Existing model-parallel training systems either require manual creation of parallelization plans or search only a limited space of model parallelism configurations. Alpa views parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Alpa designs compilation passes to automatically derive efficient parallel execution plans. Alpa implements an efficient runtime to orchestrate two-level parallel execution on distributed compute devices. Evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems.

Paper Content

Introduction: Deep learning advances are due to increases in model size. Training large models on distributed clusters requires engineering effort. Tuning parallelization strategy can improve training performance. Automating parallelization of large-scale models would accelerate ML research and production. Complex space of plans grows exponentially with model and cluster size. Recent efforts to automatically parallelize model training are limited. Hierarchical space of plans can be used to optimize. Alpa is a compiler system for distributed DL on GPU clusters. Alpa features compilation passes, runtime architecture, and system optimizations. Alpa is evaluated on large models with billions of parameters. Alpa can match specialized systems and achieve speedups compared to hand-tuned systems.

Background: distributed deep learning: DL computation is represented by ML frameworks as a dataflow graph....
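To make the two levels of parallelism named in the Alpa summary more concrete, here is a minimal toy sketch in plain Python. It is not Alpa's API or its compilation passes; the function names (`intra_operator`, `inter_operator`, `split_columns`) are hypothetical, the "devices" are simulated by ordinary list operations, and the column count is assumed to divide evenly by the number of devices. Intra-operator parallelism shards a single matrix multiply across devices, while inter-operator parallelism assigns whole operators (layers) to different stages.

```python
def matmul(a, b):
    """Plain-Python matrix multiply for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, parts):
    """Split weight matrix w into `parts` equal column shards (assumes even division)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def intra_operator(x, w, devices=2):
    """One operator, sharded: each simulated device multiplies x by its column shard."""
    partials = [matmul(x, shard) for shard in split_columns(w, devices)]
    # Concatenate each device's output columns to rebuild the full result.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def inter_operator(x, stages):
    """Several operators, one per simulated stage: activations flow stage to stage."""
    act = x
    for w in stages:
        act = matmul(act, w)
    return act


x = [[1, 2]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 0], [0, 1], [1, 1], [2, 0]]

sharded = intra_operator(x, w1)          # same values as matmul(x, w1)
pipelined = inter_operator(x, [w1, w2])  # same values as matmul(matmul(x, w1), w2)
```

Both partitionings reproduce the unpartitioned result; a real system would differ in where the data lives and how much communication each choice requires.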

Instant Neural Graphics Primitives with a Multiresolution Hash Encoding

Abstract: Neural graphics primitives can be costly to train and evaluate. A new input encoding is used to reduce cost without sacrificing quality. The encoding uses a small neural network and a multiresolution hash table. The system is implemented using fully-fused CUDA kernels. Training is done in a matter of seconds and rendering in tens of milliseconds....
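As a rough illustration of what a multiresolution hash table lookup can look like, the sketch below encodes a 2D point by interpolating features fetched from one small hash table per resolution level. This is a simplified sketch under stated assumptions, not the paper's fused CUDA implementation: all constants (`NUM_LEVELS`, `TABLE_SIZE`, `FEATURES`, `BASE_RES`, `GROWTH`) are illustrative, the XOR-of-multiples hash is one common spatial hash chosen here for demonstration, and the table entries are random placeholders where a real system would train them together with the small network.

```python
import random

# Illustrative hyperparameters (assumptions, kept small for readability).
NUM_LEVELS = 4             # number of resolution levels
TABLE_SIZE = 2 ** 10       # entries in each level's hash table
FEATURES = 2               # feature values stored per entry
BASE_RES = 4               # coarsest grid resolution
GROWTH = 2.0               # resolution multiplier between levels
PRIMES = (1, 2654435761)   # per-axis multipliers for the spatial hash


def make_tables(seed=0):
    """Create one table of small random feature vectors per level."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1e-4, 1e-4) for _ in range(FEATURES)] for _ in range(TABLE_SIZE)]
        for _ in range(NUM_LEVELS)
    ]


def hash_vertex(ix, iy):
    """Map integer grid coordinates to a table index."""
    return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % TABLE_SIZE


def encode(x, y, tables):
    """Encode a point in [0, 1]^2 as the concatenation of per-level features."""
    out = []
    for level, table in enumerate(tables):
        res = int(BASE_RES * GROWTH ** level)
        fx, fy = x * res, y * res
        ix, iy = int(fx), int(fy)
        tx, ty = fx - ix, fy - iy
        feat = [0.0] * FEATURES
        # Bilinear interpolation over the four surrounding grid vertices.
        for dx, wx in ((0, 1.0 - tx), (1, tx)):
            for dy, wy in ((0, 1.0 - ty), (1, ty)):
                entry = table[hash_vertex(ix + dx, iy + dy)]
                for k in range(FEATURES):
                    feat[k] += wx * wy * entry[k]
        out.extend(feat)
    return out


tables = make_tables()
encoding = encode(0.3, 0.7, tables)  # list of NUM_LEVELS * FEATURES floats
```

The resulting vector would be the input to the small neural network; because each level's table has a fixed size, memory stays bounded even at fine resolutions, at the cost of occasional hash collisions.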

PromptBERT: Improving BERT Sentence Embeddings with Prompts

Abstract: Proposed PromptBERT, a novel contrastive learning method for learning better sentence representation. Analyzed drawback of current sentence embedding from original BERT. Proposed first prompt-based sentence embeddings method. Discussed two prompt representing methods and three prompt searching methods. Proposed novel unsupervised training objective by technology of template denoising. Experiments show effectiveness of method. Compared to SimCSE, PromptBERT achieved 2....

DM-VIO: Delayed Marginalization Visual-Inertial Odometry

Abstract: Proposed DM-VIO system is a monocular visual-inertial odometry system. Uses two novel techniques called delayed marginalization and pose graph bundle adjustment. Photometric bundle adjustment with dynamic weight for visual residuals. Delayed marginalization allows for injection of IMU information into already marginalized states. IMU initialization captures full photometric uncertainty and improves scale estimation. System evaluated on EuRoC, TUM-VI, and 4Seasons datasets. Outperforms stereo-inertial methods while using only a single camera and IMU.

Paper Content

I....

High-Resolution Image Synthesis with Latent Diffusion Models

Abstract: Diffusion models decompose the image formation process into a sequential application of denoising autoencoders and achieve state-of-the-art synthesis results. Diffusion models are powerful, but training them often takes hundreds of GPU days and inference is expensive. Applying diffusion models in latent space of pretrained autoencoders reduces complexity and preserves detail. Cross-attention layers turn diffusion models into powerful and flexible generators for general conditioning inputs....