A Survey on Transformers in Reinforcement Learning

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Transformers are widely used in NLP and CV, mostly in supervised settings. Transformers are being used in reinforcement learning, but face unique design choices and challenges. This paper reviews motivations and progress on using Transformers in RL, provides a taxonomy, and discusses future prospects.

Paper Content

Introduction: Reinforcement learning (RL) is a mathematical formalism for sequential decision-making. RL can be used to acquire intelligent behaviors automatically. Deep neural networks can be used to approximate functions with high capacity. Deep reinforcement learning (DRL) has achieved tremendous developments in recent years. Sample efficiency is an issue for DRL in real-world applications. Inductive bias can be introduced into the DRL framework. Choosing function approximator architectures is an important inductive bias. Supervised learning (SL) has been used to motivate architecture for RL. Convolutional neural networks (CNN) and recurrent neural networks (RNN) are common practices for DRL. The Transformer architecture has revolutionized the learning paradigm across SL tasks. Transformers have been applied to RL to extract relations between entities and capture multi-step temporal dependencies. Offline RL has attracted attention due to its ability to leverage large-scale offline datasets. Transformers can serve directly as a model for sequential decisions. Transformer-based architectures often suffer from high computational and memory costs.

Problem scope

Reinforcement learning: Reinforcement Learning (RL) is a type of learning in a Markov Decision Process (MDP). RL aims to learn a policy to maximize the expected discounted return. Topics in RL include meta RL, multi-task RL, and multi-agent RL. Offline RL does not allow interaction with the environment during training. Goal-conditioned RL extends the standard RL problem to a goal-augmented setting. Model-based RL learns an auxiliary dynamic model of the environment.

Transformers: The Transformer is a neural network for modeling sequential data. The self-attention mechanism captures dependencies within long sequences. Inputs are mapped to queries, keys, and values by linear transformations. The output of a self-attention layer is a weighted sum of all values. Multi-head attention and residual connections help Transformers learn expressive representations and model long-term interactions.

Combination of Transformers and RL: Transformers can be used as a component for RL algorithms. Transformers can also be used as a whole sequential decision-maker.

Network architecture in RL: Early progress of network architecture design in RL has challenges. Techniques of neural networks (e....
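The summary's description of self-attention (linear maps producing queries, keys, and values, and an output that is a weighted sum of values) can be illustrated with a minimal sketch. It assumes the standard scaled dot-product form with a single head; the function names and the plain-list matrix representation are illustrative choices, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (assumed standard form).

    x is a sequence of token vectors; w_q, w_k, w_v are the linear maps
    that produce queries, keys, and values from the inputs.
    Returns (outputs, attention_weights).
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    outputs, weights = [], []
    for q_i in q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted sum of all value vectors.
        outputs.append([sum(w_j * v_j[c] for w_j, v_j in zip(w, v))
                        for c in range(len(v[0]))])
    return outputs, weights
```

With identity projection matrices, each token attends most strongly to itself, which is an easy sanity check on the weights.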

January 8, 2023 · 1010 words · Wenzhe Li, Hao Luo, Zichuan Lin, Chongjie Zhang, Zongqing Lu and 1 other
Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Time series forecasting is an important task in many applications. Real-world time series data is often limited and noisy. A bidirectional variational auto-encoder (BVAE) is proposed to address the time series forecasting problem. The BVAE is equipped with diffusion, denoise, and disentanglement. Experiments show that the BVAE outperforms competitive algorithms.

Paper Content

Introduction: Time series forecasting is important for decision-making. Traditional RNN-based methods capture temporal dependencies. LSTMs and GRUs use gate functions to handle long-term dependencies. CNNs capture complex inner patterns of the time series. Transformer-based models have shown great performance. Neural networks have uncertainty issues. VAR models try to model the distribution of time series. Interpretable representation learning is another merit. VAEs have superiority in modeling latent distributions. Disentangled representation can improve performance and robustness. Real-world time series are often noisy and short. D3VAE is proposed to address the time series forecasting problem.

Coupled diffusion probabilistic model: The diffusion probabilistic model is a family of latent variable models that generate high-quality samples. A coupled forward process is developed to augment input and target series synchronously. A bidirectional variational auto-encoder (BVAE) is proposed to take the place of the reverse process in the diffusion model. A Markov chain adds Gaussian noise to the data. The coupled diffusion process diffuses input and output series. A variance schedule and scale parameter are used to reduce aleatoric uncertainty. The BVAE opens an interface to integrate disentanglement for model interpretability.

Scaled denoising score matching for diffused time series cleaning: Time series data is augmented with the coupled diffusion probabilistic model. The generative distribution moves toward the diffused target series. Denoising Score Matching (DSM) is employed to accelerate the de-uncertainty process. A monotonically decreasing series of fixed σ values is used to scale noise of different levels.

Disentangling latent variables for interpretation: Interpretability of a time series forecasting model is important. Disentangling latent variables can enhance the reliability of prediction. Total Correlation (TC) is used to measure dependencies among multiple random variables. The bidirectional structure of the BVAE aggregates rich semantics into latent variables. Algorithms 1 and 2 are used to train and forecast.

Training and forecasting: Coupled diffusion with a denoising network is proposed to reduce the effect of uncertainty. The TC of latent variables is minimized to disentangle them. The reconstruction loss uses trade-off parameters. The objective is minimized to learn the generative model.

Experiment settings: Two synthetic datasets and six real-world datasets were used. Datasets were sliced to contain at most 1000 time points. D3VAE was compared to one GP-based method, two auto-regressive methods, and four VAE-based methods. The Adam optimizer was used with an initial learning rate of 5e-4, a batch size of 16, and training set to 20 epochs. The number of disentanglement factors was chosen from {4, 8}. Evaluation metrics: CRPS and MSE. Experiments were conducted on a Linux machine with a single NVIDIA P40 GPU and repeated five times.

Main results: Two prediction lengths (8 and 16) are evaluated. Results of longer prediction lengths are in Appendix D. Noise of the outcome series can be estimated to assess uncertainty. The scale parameter ω can be adjusted to generate the distribution space. Uncertainty estimation can quantify uncertainty effectively. Disentanglement quality can be assessed by evaluating classification performance. The MIG metric is used to evaluate disentanglement. The diffusion process can effectively augment the input or target.

Model analysis: The variance schedule β and the number of diffusion steps T should be configured properly to reduce the effect of uncertainty....
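The forward process described in the summary (a Markov chain that adds Gaussian noise under a variance schedule β, applied to input and target series together) can be sketched as follows. This is a simplified illustration that assumes the standard closed-form diffusion step and a linear schedule; it omits the paper's scale parameter ω, and all function names and default values are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.1):
    """Linearly spaced variance schedule (an assumed choice of schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) over the first t steps."""
    product = 1.0
    for beta in betas[:t]:
        product *= 1.0 - beta
    return product


def diffuse(series, betas, t, rng):
    """Sample the series after t noising steps using the closed form
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = alpha_bar(betas, t)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in series]


def diffuse_pair(inputs, targets, betas, t, rng):
    """Diffuse input and target series at the same step t (simplified
    stand-in for the coupled process; the paper's scaling is not modeled)."""
    return diffuse(inputs, betas, t, rng), diffuse(targets, betas, t, rng)
```

Larger t or larger β values push the diffused series closer to pure noise, which is why the summary notes that β and T need to be configured carefully.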

January 8, 2023 · 886 words · Yan Li, Xinjiang Lu, Yaqing Wang, Dejing Dou
DeepMatcher: A Deep Transformer-based Network for Robust and Accurate Local Feature Matching

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Local feature matching between images is difficult, especially when there are significant appearance variations. DeepMatcher is a deep Transformer-based network that captures more human-intuitive and simpler-to-match features. SlimFormer leverages vector-based attention to model relevance among all keypoints, and relative position encoding is applied to each SlimFormer. The Feature Transition Module (FTM) and Fine Matches Module are used to generate robust and accurate matches....

January 8, 2023 · 1154 words · Tao Xie, Kun Dai, Ke Wang, Ruifeng Li, Lijun Zhao
Perceptual-Neural-Physical Sound Matching

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Sound matching algorithms use parametric audio synthesis to approximate a target waveform. Deep neural networks have achieved good results in matching sustained harmonic tones. Matching nonstationary and inharmonic targets (e.g. percussion) is more challenging. Mean square error in the parametric domain (P-loss) is simple and fast, but doesn’t take into account the differing perceptual significance of each parameter....

January 7, 2023 · 913 words · Han Han, Vincent Lostanlen, Mathieu Lagrange
Why do Nearest Neighbor Language Models Work?

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Language models (LMs) calculate representations of an already-seen context to predict the next word. Retrieval-augmented LMs have been shown to improve over standard neural LMs. This paper investigates why k-nearest neighbor language models (kNN-LMs) perform better than standard parametric LMs. Three main reasons are identified: different input representation, approximate kNN search, and softmax temperature....
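To make the role of the softmax temperature concrete, here is a minimal sketch of how a kNN-LM is commonly described: retrieved neighbors are turned into a distribution with a temperature-scaled softmax over negative distances, then interpolated with the parametric LM. The function names, the `(distance, token)` neighbor format, and the interpolation weight `lam` are assumptions for illustration, not details from the summary.

```python
import math
from collections import defaultdict


def knn_distribution(neighbors, temperature=1.0):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Uses a softmax over negative distances divided by the temperature;
    probability mass of neighbors sharing a token is summed.
    """
    logits = [-distance / temperature for distance, _ in neighbors]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = defaultdict(float)
    for e, (_, token) in zip(exps, neighbors):
        probs[token] += e / total
    return dict(probs)


def interpolate(p_lm, p_knn, lam):
    """Mix the two distributions: lam * p_kNN + (1 - lam) * p_LM."""
    vocab = set(p_lm) | set(p_knn)
    return {tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p_lm.get(tok, 0.0)
            for tok in vocab}
```

A higher temperature flattens the neighbor distribution, so distant neighbors contribute more; a lower one concentrates mass on the closest matches.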

January 7, 2023 · 846 words · Frank F. Xu, Uri Alon, Graham Neubig
Modeling Scattering Coefficients using Self-Attentive Complex Polynomials with Image-based Representation

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Finding antenna designs that meet frequency requirements and are optimal is a critical component in designing next generation hardware. The process is non-trivial because the objective function is nonlinear and sensitive to design changes. EM simulations are slow and expensive with commercial software. CZP is a sample-efficient and accurate surrogate model to estimate scattering coefficients in the frequency domain of a given 2D planar antenna design....

January 6, 2023 · 1175 words · Andrew Cohen, Weiping Dou, Jiang Zhu, Slawomir Koziel, Peter Renner and 5 others
'No, to the Right' -- Online Language Corrections for Robotic Manipulation via Shared Autonomy

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Systems for language-guided human-robot interaction must be adaptive and efficient. Existing instruction-following agents cannot adapt and require many demonstrations to learn. LILAC is a framework for incorporating and adapting to natural language corrections. LILAC splits agency between the human and robot. Real-time corrections refine the human’s control space. A user study shows higher task completion rates, and users prefer LILAC....

January 6, 2023 · 873 words · Yuchen Cui, Siddharth Karamcheti, Raj Palleti, Nidhya Shivakumar, Percy Liang and 1 other
Automatic segmentation of clear cell renal cell tumors, kidney, and cysts in patients with von Hippel-Lindau syndrome using U-net architecture on magnetic resonance images

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Demonstrated automated segmentation of ccRCC, cysts, and normal kidney parenchyma in VHL patients using CNN on MRI. Queried 115 VHL patients and 117 scans with 504 ccRCCs and 1171 cysts from 2015 to 2021. Evaluated U-Net performance on 10 randomized splits of the cohort using DSC. 2D U-Net achieved an average ccRCC lesion detection AUC of 0....
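The DSC used for evaluation is the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, not taken from the paper.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient for two flattened binary masks.

    pred and target are equal-length sequences of 0/1 (or False/True).
    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total
```

For example, a prediction covering two pixels that overlaps a one-pixel reference in one pixel scores 2 * 1 / (2 + 1), or about 0.67.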

January 6, 2023 · 800 words · Pouria Yazdian Anari, Nathan Lay, Aditi Chaurasia, Nikhil Gopal, Safa Samimi and 11 others
Better Differentially Private Approximate Histograms and Heavy Hitters using the Misra-Gries Sketch

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Problem of computing differentially private approximate histograms and heavy hitters in a stream of elements. Misra and Gries [Science of Computer Programming, 1982] used in non-private setting. Chan, Li, Shi, and Xu [PETS 2012] describe a differentially private version of the Misra-Gries sketch. Amount of noise added scales linearly with size of sketch. We present a better mechanism for releasing Misra-Gries sketch under $(\varepsilon,\delta)$-differential privacy. Noise magnitude independent of sketch size. Maximum error same as best known in private non-streaming setting. Simple and likely to be practical. Post-processing step of Misra-Gries sketch does not increase worst-case error guarantee. Noise magnitude less than twice the magnitude of the non-streaming setting.

Paper Content

Introduction: Computing the histogram of a dataset is a fundamental task in data analysis. Differentially private algorithms exist to compute the histogram. These algorithms are not practical when the amount of data is large. Non-private approximate histograms are often computed using the Misra-Gries (MG) sketch. The MG sketch returns approximate frequencies with an optimal error. This paper develops a way of releasing a MG sketch in a differentially private way while adding only a small amount of noise. This allows for efficient and accurate approximate histograms while not violating users’ privacy. This improves upon the work of Chan et al....
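For readers unfamiliar with the underlying data structure, the classic non-private Misra-Gries sketch can be written in a few lines. This sketch covers only the textbook streaming algorithm; it does not implement the paper's differentially private release mechanism, and the function name and parameter `k` (number of counters) are illustrative.

```python
def misra_gries(stream, k):
    """Classic (non-private) Misra-Gries sketch with at most k counters.

    Returns a dict of approximate counts. Each estimate never exceeds the
    true count and undercounts by at most n / (k + 1), where n is the
    stream length.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

On the stream a, a, a, a, a, b, c, d with k = 2, the heavy hitter a survives with an estimate of 4 (true count 5), which is within the n / (k + 1) error bound.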

January 6, 2023 · 888 words · Christian Janos Lebeda, Jakub Tětek
Myths and Legends in High-Performance Computing

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Common myths and legends exist among members of the high-performance computing community. These myths are based on evidence or argumentation, not scientific facts. These myths can be used to discuss possible new directions for research and industry investment.

Paper Content

Introduction: Human society has myths and legends, including in the HPC community. HPC drives powerful computers and technologies forward. AI language models cannot create or share fictional content. 12 myths about HPC are discussed.

Myth 1: Quantum Computing will take over HPC

Myth 2: Everything will be Deep Learning

Myth 3: Extreme specialization as seen in smartphones will push supercomputers beyond Moore’s Law. AI is now in the palm of everyone’s hand. GPUs have been successful in HPC. Multitudes of hardware customization per each facet of the workload is not likely to be successful. Amdahl’s Law limits potential speedup. Weak scaling is important for modern supercomputers. Load balancing is difficult with heterogeneous accelerators. Dark silicon is becoming increasingly difficult to utilize. Plethora of accelerators is only beneficial for small programs on small machines.

Conclusions: Debates myths in HPC community. Many myths form core of thinking. Some points may settle in the future, others may not. Serious treatment needed to guide future directions for research, industry and government investment.
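The remark that Amdahl's Law limits potential speedup can be made concrete with the usual formula: if a fraction p of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The short sketch below uses hypothetical names and is not drawn from the paper.

```python
def amdahl_speedup(accelerated_fraction, acceleration_factor):
    """Overall speedup under Amdahl's Law.

    accelerated_fraction: share of the original runtime that benefits (0..1).
    acceleration_factor: how much faster that share runs (> 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if acceleration_factor <= 0:
        raise ValueError("acceleration_factor must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / acceleration_factor)
```

Even with 90% of a workload accelerated tenfold, the overall gain is only about 5.3x, and no accelerator can push it past 10x because the remaining 10% still runs at the original speed.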

January 6, 2023 · 227 words · Satoshi Matsuoka, Jens Domke, Mohamed Wahib, Aleksandr Drozd, Torsten Hoefler