Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Introduce a real-time, high-fidelity audio codec leveraging neural networks
  • Use a streaming encoder-decoder architecture with quantized latent space
  • Simplify and speed-up training with a single multiscale spectrogram adversary
  • Introduce a novel loss balancer mechanism to stabilize training
  • Use lightweight Transformer models to compress representation by up to 40%

Paper Content

Introduction

  • 82% of internet traffic is streaming audio and video in 2021
  • Audio compression is an important problem
  • Audio codecs use an encoder and decoder to remove redundancies in audio content
  • Neural networks have been used to compress audio
  • Problems with lossy neural compression include overfitting and efficient compression
  • Human perception is used to evaluate audio codecs
  • Recent advancements in neural audio generation enabled computers to generate natural sounding audio
  • Autoregressive models such as WaveNet produce convincing results but are slow
  • Generative Adversarial Networks (GANs) can match the quality of autoregressive models
  • Low bitrate parametric speech and audio codecs have been studied for a long time
  • Neural based audio codecs have been proposed and show promising results
  • Audio and speech can be represented using discrete values for various tasks

Model

  • Audio signal of duration d can be represented by a sequence of audio channels and samples
  • EnCodec model composed of 3 components: encoder, quantization layer, decoder
  • System trained end-to-end to minimize reconstruction and perceptual loss
  • Visual description of method in Figure 1

Encoder & decoder architecture

  • EnCodec model is a streaming, convolutional-based encoder-decoder architecture
  • Used for various audio-related tasks, e.g. source separation, enhancement, neural vocoders, audio codec, and artificial bandwidth extension
  • Encoder consists of 1D convolution, convolution blocks, two-layer LSTM, and 1D convolution layer
  • Decoder mirrors encoder, using transposed convolutions instead of strided convolutions
  • Two variants of model, depending on whether targeting low-latency streamable setup or high fidelity non-streamable usage

Residual vector quantization

  • RVQ is a method of vector quantization
  • Training procedure follows Dhariwal et al. (2020) and Zeghidour et al. (2021)
  • Codebook entries are updated using exponential moving average
  • Straight-through-estimator is used to compute gradient of encoder
  • Commitment loss is added to overall training loss
  • Variable number of residual steps used to support multiple bandwidth targets

Language modeling and entropy coding

  • Trained a Transformer-based language model to compress/decompress faster than real time
  • Model consists of 5 layers, 8 heads, 200 channels, 800 dimension feed-forward blocks, no dropout
  • Select bandwidth and corresponding number of codebooks at train time
  • Discrete representation from time t-1 transformed into continuous representation using learnt embedding tables
  • Output of Transformer fed into Nq linear layers with output channels equal to cardinality of each codebook
  • Attention layer has causal receptive field of 3.5 seconds
  • Input to network is complex-valued STFT
  • Discriminator composed of 2D convolutional layer, followed by 2D convolutions with increasing dilation rates
  • Train model on sequences of 5 seconds
  • Use range based arithmetic coder to leverage estimated probabilities given by language model
  • Round estimated probabilities with precision of 10-6

Training objective

  • Reconstruction loss term is comprised of time and frequency domain losses
  • Frequency domain loss is a linear combination of L1 and L2 losses over mel-spectrogram
  • Perceptual loss term is based on multi-scale STFT-based discriminator
  • Adversarial loss for generator is constructed from multiple discriminators
  • Relative feature matching loss for generator is included
  • Multi-bandwidth training is used
  • Commitment loss is added between output of encoder and its quantized value
  • Generator is trained to optimize a combination of all the losses
  • Loss balancer is introduced to stabilize training

Experiments and results

Dataset

  • Trained EnCodec on 24 kHz monophonic audio from speech, noisy speech, music and general audio
  • Trained fullband stereo EnCodec on 48 kHz music
  • Used Lyra-v2, EVS and Opus as competitive standard codecs
  • Audio samples from speech and music, ground truth 16bits 24kHz wave
  • Data splits detailed in Appendix A.1
  • Training and validation used four strategies: single source from Jamendo, single source from other datasets, two sources from all datasets, three sources from all datasets except music
  • Audio normalized by file, random gain between -10 and 6 dB, reverberation added with probability 0.2
  • Tested four categories: clean speech from DNS, clean speech mixed with FSDK50K, Jamendo sample, proprietary music sample

Baselines

  • Opus is a speech and audio codec standardized by IETF in 2012
  • EVS is a codec standardized in 2014 by 3GPP
  • Both codecs are used as traditional digital signal processing baselines
  • MP3 compression is used as an additional baseline for stereophonic signal compression
  • EnCodec is compared to SoundStream model from official implementation in Lyra 21 at 3.2 and 6 kbps

Evaluation methods

  • Used subjective and objective evaluation metrics
  • Followed MUSHRA protocol with hidden reference and low anchor
  • Recruited annotators using crowd-sourcing platform
  • Asked annotators to rate perceptual quality of samples on a scale of 1-100
  • Randomly selected 50 samples of 5 seconds from each category of test set
  • Forced at least 10 annotations per sample
  • Removed noisy annotations and outliers
  • Used ViSQOL and SI-SNR for objective metrics

Training

  • Trained models for 300 epochs
  • Each epoch is 2000 updates with Adam optimizer, batch size of 64 examples, 1 second each, learning rate of 3 x 10-4
  • Used 8 A100 GPUs
  • Balancer introduced in Section 3.4 with weights λ t = 0.1, λ f = 1, λ g = 3, λ feat = 3 for 24 kHz models
  • For 48 kHz model, λ g = 4, λ feat = 4

Results

  • EnCodec outperforms all evaluated baselines when considering the same bandwidth
  • EnCodec at 3kbps reaches better performance than Lyra-v2 using 6kbps and Opus at 12kbps
  • Language model and entropy coding can reduce bandwidth by 25-40%
  • MS-STFTD discriminator is enough to generate high quality audio
  • Streamable setup causes small degradation in performance
  • Balancer significantly stabilizes training process
  • EnCodec outperforms Opus at 6kbps and is comparable to MP3 at 64kbps

Latency and computation time

  • Initial latency of 13.3 ms for 24 kHz streaming EnCodec model
  • Initial latency of 1 second for 48 kHz non-streaming version
  • Processing 10 times faster than real time
  • Processing slower than real time at 48 kHz

Conclusion

  • Presented EnCodec: a state-of-the-art real-time neural audio compression model
  • Produces high-fidelity audio samples across a range of sample rates and bandwidths
  • Improved sample quality with a spectrogram-only adversarial loss
  • Stabilized training and improved interpretability of weights with a novel gradient balancer
  • Demonstrated a small Transformer model can reduce bandwidth by up to 40%
  • Internet traffic is mostly audio and video streams
  • Different audio codecs developed and widely adopted
  • Addressing very low bitrate compression with high fidelity is an essential challenge
  • Figure 2: MS-STFT Discriminator architecture
  • Figure 3: Human evaluations of standard codecs and neural codecs
  • Table A.2: Comparing EnCodec to SoundStream
  • Table A.3: Model architecture Analysis
  • Table A.4: Jamando music dataset
  • λ t λ f λ g λ f eat Balancer SI-SNR ViSQOL