While the generated audio samples are of high quality, running inference with the original implementation is very slow: a 10-second audio sample takes upwards of 30 seconds to generate. This is due to a combination of factors, including a deep
multi-stage modelling approach, large checkpoint sizes, and un-optimised code.
In this blog post, we showcase how to use AudioLDM 2 in the Hugging Face 🧨 Diffusers library, exploring a range of code
optimisations such as half-precision, flash attention, and compilation, and model optimisations such as scheduler choice
and negative prompting, to reduce the inference time by over 10 times, with minimal degradation in quality of the
output audio. The blog post is also accompanied by a more streamlined Colab notebook that contains all the code but fewer explanations.
Read to the end to find out how to generate a 10-second audio sample in just 1 second!
Model overview
Inspired by Stable Diffusion, AudioLDM 2
is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from text embeddings.
The overall generation process is summarised as follows:
Given a text input \(\boldsymbol{x}\), two text encoder models are used to compute the text embeddings: the text branch of CLAP, and the text encoder of Flan-T5.
The CLAP text embeddings are trained to be aligned with the embeddings of the corresponding audio sample, whereas the Flan-T5 embeddings give a better representation of the semantics of the text.
These text embeddings are projected to a shared embedding space through individual linear projections.
A GPT2 language model (LM) is then used to auto-regressively generate a sequence of \(N\) new embedding vectors, conditioned on the projected CLAP and Flan-T5 embeddings.
The generated embedding vectors \(\tilde{\boldsymbol{E}}_{1:N}\) and the Flan-T5 text embeddings \(\boldsymbol{E}_{2}\) are used as cross-attention conditioning in the LDM, which de-noises
a random latent via a reverse diffusion process run for a total of \(T\) inference steps:
$$
\boldsymbol{z}_{t} = \text{LDM}\left(\boldsymbol{z}_{t-1} | \tilde{\boldsymbol{E}}_{1:N}, \boldsymbol{E}_{2}\right) \qquad \text{for } t = 1, \dots, T
$$
where the initial latent variable \(\boldsymbol{z}_{0}\) is drawn from a normal distribution \(\mathcal{N} \left(\boldsymbol{0}, \boldsymbol{I} \right)\).
The UNet of the LDM is unique in
the sense that it takes two sets of cross-attention embeddings, \(\tilde{\boldsymbol{E}}_{1:N}\) from the GPT2 language model and \(\boldsymbol{E}_{2}\)
from Flan-T5, as opposed to one cross-attention conditioning as in most other LDMs.
The final de-noised latents \(\boldsymbol{z}_{T}\) are passed to the VAE decoder to recover the Mel spectrogram \(\boldsymbol{s}\):
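$$
\boldsymbol{s} = \text{VAE}\left(\boldsymbol{z}_{T}\right)
$$
The Mel spectrogram is then passed to a vocoder to obtain the output audio waveform.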
The diagram below demonstrates how a text input is passed through the text conditioning models, with the two prompt embeddings used as cross-conditioning in the LDM:
For full details on how the AudioLDM 2 model is trained, the reader is referred to the AudioLDM 2 paper.
Hugging Face 🧨 Diffusers provides an end-to-end inference pipeline class AudioLDM2Pipeline that wraps this multi-stage generation process into a single callable object, enabling you to generate audio samples from text in just a few lines of code.
AudioLDM 2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. See the table below for details on the three official checkpoints, which can all be found on the Hugging Face Hub:
Now that we've covered a high-level overview of how the AudioLDM 2 generation process works, let's put this theory into practice!
Load the pipeline
For the purposes of this tutorial, we'll initialise the pipeline with the pre-trained weights from the base checkpoint,
cvssp/audioldm2. We can load the entirety of the pipeline using the
.from_pretrained
method, which will instantiate the pipeline and load the pre-trained weights:
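```python
from diffusers import AudioLDM2Pipeline

# Download the base checkpoint from the Hub and instantiate the full
# multi-stage pipeline (text encoders, GPT2 LM, UNet, VAE and vocoder)
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")
```

Each of the sub-models described in the overview above is exposed as an attribute of the pipeline (at the time of writing, pipe.text_encoder holds the CLAP model, pipe.text_encoder_2 the Flan-T5 encoder, and pipe.language_model the GPT2 LM), so the individual stages can be inspected if required.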
The pipeline can be moved to the GPU in much the same way as a standard PyTorch nn.Module:
```python
pipe.to("cuda");
```
Great! We'll define a Generator and set a seed for reproducibility. This will allow us to tweak our prompts and observe
the effect that they have on the generations by fixing the starting latents in the LDM:
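```python
import torch

# Fix the seed of the generator so that the starting latents are identical
# across generations (the value 0 is arbitrary and used here for illustration)
generator = torch.Generator("cuda").manual_seed(0)
```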
Now we're ready to perform our first generation! We'll use the same running example throughout this notebook, where we'll
condition the audio generations on a fixed text prompt and use the same seed throughout. The audio_length_in_s
argument controls the length of the generated audio. It defaults to the audio length that the LDM was trained on
(10.24 seconds):
prompt="The sound of Brazilian samba drums with waves gently crashing in the background"audio=pipe(prompt, audio_length_in_s=10.24, generator=generator).audios[0]