Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Introduce a real-time, high-fidelity audio codec leveraging neural networks
- Use a streaming encoder-decoder architecture with quantized latent space
- Simplify and speed-up training with a single multiscale spectrogram adversary
- Introduce a novel loss balancer mechanism to stabilize training
- Use lightweight Transformer models to compress representation by up to 40%
Paper Content
Introduction
- 82% of internet traffic is streaming audio and video in 2021
- Audio compression is an important problem
- Audio codecs use an encoder and decoder to remove redundancies in audio content
- Neural networks have been used to compress audio
- Problems with lossy neural compression include overfitting and efficient compression
- Human perception is used to evaluate audio codecs
Related work
- Recent advancements in neural audio generation enabled computers to generate natural sounding audio
- Autoregressive models such as WaveNet produce convincing results but are slow
- Generative Adversarial Networks (GANs) can match the quality of autoregressive models
- Low bitrate parametric speech and audio codecs have been studied for a long time
- Neural based audio codecs have been proposed and show promising results
- Audio and speech can be represented using discrete values for various tasks
Model
- Audio signal of duration d can be represented by a sequence of audio channels and samples
- EnCodec model composed of 3 components: encoder, quantization layer, decoder
- System trained end-to-end to minimize reconstruction and perceptual loss
- Visual description of method in Figure 1
Encoder & decoder architecture
- EnCodec model is a streaming, convolutional-based encoder-decoder architecture
- Used for various audio-related tasks, e.g. source separation, enhancement, neural vocoders, audio codec, and artificial bandwidth extension
- Encoder consists of 1D convolution, convolution blocks, two-layer LSTM, and 1D convolution layer
- Decoder mirrors encoder, using transposed convolutions instead of strided convolutions
- Two variants of model, depending on whether targeting low-latency streamable setup or high fidelity non-streamable usage
Residual vector quantization
- RVQ is a method of vector quantization
- Training procedure follows Dhariwal et al. (2020) and Zeghidour et al. (2021)
- Codebook entries are updated using exponential moving average
- Straight-through-estimator is used to compute gradient of encoder
- Commitment loss is added to overall training loss
- Variable number of residual steps used to support multiple bandwidth targets
Language modeling and entropy coding
- Trained a Transformer-based language model to compress/decompress faster than real time
- Model consists of 5 layers, 8 heads, 200 channels, 800 dimension feed-forward blocks, no dropout
- Select bandwidth and corresponding number of codebooks at train time
- Discrete representation from time t-1 transformed into continuous representation using learnt embedding tables
- Output of Transformer fed into Nq linear layers with output channels equal to cardinality of each codebook
- Attention layer has causal receptive field of 3.5 seconds
- Input to network is complex-valued STFT
- Discriminator composed of 2D convolutional layer, followed by 2D convolutions with increasing dilation rates
- Train model on sequences of 5 seconds
- Use range based arithmetic coder to leverage estimated probabilities given by language model
- Round estimated probabilities with precision of 10-6
Training objective
- Reconstruction loss term is comprised of time and frequency domain losses
- Frequency domain loss is a linear combination of L1 and L2 losses over mel-spectrogram
- Perceptual loss term is based on multi-scale STFT-based discriminator
- Adversarial loss for generator is constructed from multiple discriminators
- Relative feature matching loss for generator is included
- Multi-bandwidth training is used
- Commitment loss is added between output of encoder and its quantized value
- Generator is trained to optimize a combination of all the losses
- Loss balancer is introduced to stabilize training
Experiments and results
Dataset
- Trained EnCodec on 24 kHz monophonic audio from speech, noisy speech, music and general audio
- Trained fullband stereo EnCodec on 48 kHz music
- Used Lyra-v2, EVS and Opus as competitive standard codecs
- Audio samples from speech and music, ground truth 16bits 24kHz wave
- Data splits detailed in Appendix A.1
- Training and validation used four strategies: single source from Jamendo, single source from other datasets, two sources from all datasets, three sources from all datasets except music
- Audio normalized by file, random gain between -10 and 6 dB, reverberation added with probability 0.2
- Tested four categories: clean speech from DNS, clean speech mixed with FSDK50K, Jamendo sample, proprietary music sample
Baselines
- Opus is a speech and audio codec standardized by IETF in 2012
- EVS is a codec standardized in 2014 by 3GPP
- Both codecs are used as traditional digital signal processing baselines
- MP3 compression is used as an additional baseline for stereophonic signal compression
- EnCodec is compared to SoundStream model from official implementation in Lyra 21 at 3.2 and 6 kbps
Evaluation methods
- Used subjective and objective evaluation metrics
- Followed MUSHRA protocol with hidden reference and low anchor
- Recruited annotators using crowd-sourcing platform
- Asked annotators to rate perceptual quality of samples on a scale of 1-100
- Randomly selected 50 samples of 5 seconds from each category of test set
- Forced at least 10 annotations per sample
- Removed noisy annotations and outliers
- Used ViSQOL and SI-SNR for objective metrics
Training
- Trained models for 300 epochs
- Each epoch is 2000 updates with Adam optimizer, batch size of 64 examples, 1 second each, learning rate of 3 x 10-4
- Used 8 A100 GPUs
- Balancer introduced in Section 3.4 with weights λ t = 0.1, λ f = 1, λ g = 3, λ feat = 3 for 24 kHz models
- For 48 kHz model, λ g = 4, λ feat = 4
Results
- EnCodec outperforms all evaluated baselines when considering the same bandwidth
- EnCodec at 3kbps reaches better performance than Lyra-v2 using 6kbps and Opus at 12kbps
- Language model and entropy coding can reduce bandwidth by 25-40%
- MS-STFTD discriminator is enough to generate high quality audio
- Streamable setup causes small degradation in performance
- Balancer significantly stabilizes training process
- EnCodec outperforms Opus at 6kbps and is comparable to MP3 at 64kbps
Latency and computation time
- Initial latency of 13.3 ms for 24 kHz streaming EnCodec model
- Initial latency of 1 second for 48 kHz non-streaming version
- Processing 10 times faster than real time
- Processing slower than real time at 48 kHz
Conclusion
- Presented EnCodec: a state-of-the-art real-time neural audio compression model
- Produces high-fidelity audio samples across a range of sample rates and bandwidths
- Improved sample quality with a spectrogram-only adversarial loss
- Stabilized training and improved interpretability of weights with a novel gradient balancer
- Demonstrated a small Transformer model can reduce bandwidth by up to 40%
- Internet traffic is mostly audio and video streams
- Different audio codecs developed and widely adopted
- Addressing very low bitrate compression with high fidelity is an essential challenge
- Figure 2: MS-STFT Discriminator architecture
- Figure 3: Human evaluations of standard codecs and neural codecs
- Table A.2: Comparing EnCodec to SoundStream
- Table A.3: Model architecture Analysis
- Table A.4: Jamando music dataset
- λ t λ f λ g λ f eat Balancer SI-SNR ViSQOL