Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion
Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract Diffusion models can be used for music generation Music generation requires handling multiple aspects Developed a cascading latent diffusion approach to generate high-quality stereo music Targeting real-time on a single consumer GPU Open-sourced music samples, codes and all music samples for all models Paper Content Introduction Music generation is a challenging problem Recently, deep learning models have been used to explore audio generation Existing models explore the use of recursive neural networks, adversarial generative networks, autoencoders, and transformers Diffusion models have been used in speech synthesis, but are still under-explored for music generation Long-term structure, sound quality, diversity of music, and control of generation are challenges in the area of music generation Moûsai is a text-conditional cascading diffusion model that tries to address all the challenges Moûsai uses a custom two-stage cascading diffusion method Moûsai can generate long-context 48kHz stereo music exceeding the minute mark Moûsai uses an efficient 1D U-Net architecture for both stages of the cascade Moûsai uses a diffusion magnitude autoencoder to compress the audio signal 64x Related work Common trend in generative space is to train a model on input domain and learn a generative model on top of reduced representation Auto-encoding and quantized auto-encoding are popular compression methods for images Two popular directions in generative space are to learn a quantized representation or use a compressed/downsampled representation Cascading diffusion approach has not been attempted for audio generation Our work follows ideas from cascading diffusion approach, using a two-stage method to compress audio and generate reduced representation while conditioning on a textual description Preliminaries Diffusion: process of spreading information or resources Latent Diffusion: process of spreading information or resources in a hidden way U-Net: a type of convolutional neural network Audio generation Audio generation is a challenging task Waveforms can be represented in different resolutions Higher sample rates allow for more temporal resolution Qualitative properties such as texture and pitch can be observed Audio can be represented with mono, stereo, or surround sound Models can be trained on single or multiple modalities Diffusion Employed v v v-objective diffusion as proposed by Salimans & Ho (2022)....