Diffusion models power the vast majority of text-to-audio generation methods. Unfortunately, they suffer from slow inference because they iteratively query the underlying denoising network, making them unsuitable for applications with time or computational constraints. This work proposes text-to-audio models that require only a single non-autoregressive neural network query, accelerating generation by hundreds of times and enabling on-device audio generation.
To achieve this, we propose the "CFG-aware latent consistency model", which moves consistency generation into a latent space and incorporates classifier-free guidance (CFG) into the training process. As a result, our models retain diffusion models' impressive generation quality and diversity. Unlike diffusion models, ConsistencyTTA's single-step generation makes the generated audio available during training. We leverage this advantage to fine-tune ConsistencyTTA end-to-end with audio-space, text-aware metrics such as the CLAP score. Using the CLAP loss as an example, we confirm that end-to-end fine-tuning further boosts generation quality.
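To make the recipe concrete, below is a minimal PyTorch-style sketch of one CFG-aware latent consistency distillation step. This is an illustration under our own assumptions rather than the released training code: `student`, `ema_student`, `teacher`, and `solver_step` are placeholder names, the guidance weight `w` is passed to the student only because the variable-weight variant (ablated below) conditions on it, and details such as timestep sampling, loss weighting, and EMA updates are omitted.

```python
import torch
import torch.nn.functional as F

def cfg_aware_distillation_step(student, ema_student, teacher, solver_step,
                                z_n, t_n, t_prev, text_emb, null_emb, w):
    """One CFG-aware latent consistency distillation step (illustrative sketch).

    z_n:      noisy audio latent at timestep t_n
    text_emb: text-conditioning embedding; null_emb: unconditional embedding
    w:        classifier-free guidance weight folded into the teacher target
    """
    with torch.no_grad():
        # The diffusion teacher makes conditional and unconditional predictions,
        # which are combined with the CFG weight during distillation.
        eps_cond = teacher(z_n, t_n, text_emb)
        eps_uncond = teacher(z_n, t_n, null_emb)
        eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

        # One ODE solver step (e.g., DDIM or Heun) toward the earlier timestep.
        z_prev = solver_step(z_n, t_n, t_prev, eps_guided)

        # Consistency target: the EMA student's output at the earlier point.
        target = ema_student(z_prev, t_prev, text_emb, w)

    # The student must map the later noisy latent to the same clean-latent estimate.
    pred = student(z_n, t_n, text_emb, w)
    return F.mse_loss(pred, target)
```

Because the guided teacher prediction is baked into the distillation target, the single-step student can reproduce CFG-quality outputs without querying the denoiser twice per step, or for hundreds of steps, at inference time.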
Please check out our poster at INTERSPEECH 2024 on Kos Island, Greece!
Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400 and enables on-device generation, with minimal performance degradation in Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence, and CLAP scores.
Generation Time is the time in minutes to generate the entire validation set (882 samples).
↑: higher is better; ↓: lower is better.
| Model | # Queries ↓ | Generation Time (min) ↓ | Subjective Quality ↑ | Subjective Text Align ↑ | CLAP_T ↑ | CLAP_A ↑ | FAD ↓ | FD ↓ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| AudioLDM-L (Baseline) | 400 | - | - | - | - | - | 2.08 | 27.12 | 1.86 |
| TANGO (Baseline) | 400 | 168 | 4.136 | 4.064 | 24.10 | 72.85 | 1.631 | 20.11 | 1.362 |
| ConsistencyTTA + CLAP-FT | 1 | 2.3 | 3.830 | 4.064 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358 |
| ConsistencyTTA | 1 | 2.3 | 3.902 | 4.010 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354 |
| Ground Truth | - | - | - | - | 26.71 | 100 | - | - | - |
This benchmark demonstrates how our single-step models stack up against previous methods, most of which require hundreds of generation steps.
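The "+ CLAP-FT" row reflects the end-to-end CLAP fine-tuning described above. The sketch below illustrates the general idea under our own assumptions, not the actual implementation: the single-step latent estimate is decoded into a waveform, embedded with a frozen CLAP audio encoder, and pulled toward the prompt's CLAP text embedding. `vae_decoder`, `vocoder`, `clap_audio_enc`, and `clap_text_enc` are placeholder modules.

```python
import torch
import torch.nn.functional as F

def clap_finetune_loss(student, vae_decoder, vocoder,
                       clap_audio_enc, clap_text_enc,
                       z_noisy, t, text_emb, w, prompts):
    """Audio-space CLAP loss for end-to-end fine-tuning (illustrative sketch).

    Single-step generation keeps the waveform differentiable with respect to
    the student's parameters, so the text-audio CLAP similarity can be
    maximized directly by gradient descent.
    """
    z_hat = student(z_noisy, t, text_emb, w)   # one-step clean-latent estimate
    mel = vae_decoder(z_hat)                   # latent -> mel spectrogram
    audio = vocoder(mel)                       # mel -> waveform

    audio_emb = F.normalize(clap_audio_enc(audio), dim=-1)
    prompt_emb = F.normalize(clap_text_enc(prompts), dim=-1)

    # Cosine similarity between each clip and its text prompt;
    # a higher CLAP score means a lower loss.
    clap_score = (audio_emb * prompt_emb).sum(dim=-1).mean()
    return -clap_score
```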
| Guidance Method | CFG Weight | Teacher Solver | Noise Schedule | FAD ↓ | FD ↓ | KLD ↓ |
|---|---|---|---|---|---|---|
| Unguided | 1 | DDIM | Uniform | 13.48 | 45.75 | 2.409 |
| External CFG | 3 | DDIM | Uniform | 8.565 | 38.67 | 2.015 |
| External CFG | 3 | Heun | Karras | 7.421 | 39.36 | 1.976 |
| CFG Distillation with Fixed Weight | 3 | Heun | Karras | 5.702 | 33.18 | 1.494 |
| CFG Distillation with Fixed Weight | 3 | Heun | Uniform | 3.859 | 27.79 | 1.421 |
| CFG Distillation with Random Weight | 4 | Heun | Uniform | 3.180 | 27.92 | 1.394 |
| CFG Distillation with Random Weight | 6 | Heun | Uniform | 2.975 | 28.63 | 1.378 |
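In the "Random Weight" rows, the guidance weight varies during distillation, so a single weight-conditioned student can accept different weights at test time (the w = 4 and w = 6 rows above). A minimal sketch of this variant, reusing the hypothetical `cfg_aware_distillation_step` from the earlier sketch; the sampling range below is a placeholder assumption:

```python
import torch

def random_weight_distillation_step(student, ema_student, teacher, solver_step,
                                    z_n, t_n, t_prev, text_emb, null_emb,
                                    w_min=0.0, w_max=6.0):
    """Variable-weight CFG distillation (illustrative sketch).

    A guidance weight is drawn at random for each training step and passed to
    both the teacher's CFG combination and the weight-conditioned student.
    """
    w = torch.empty(1).uniform_(w_min, w_max).item()
    return cfg_aware_distillation_step(student, ema_student, teacher, solver_step,
                                       z_n, t_n, t_prev, text_emb, null_emb, w)
```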
Like diffusion models, consistency models demonstrate non-trivial generation diversity. On this page, we present 50 groups of generations from four different random seeds to demonstrate this diversity, showing that our method combines the diversity of diffusion models with the efficiency of single-step generation.
ConsistencyTTA's performance is verified via extensive human evaluation. Audio clips generated by ConsistencyTTA and baseline methods are mixed and presented to the evaluators, who are asked to rate each clip on its quality and its correspondence with the text prompt. A sample of the evaluation form is shown on this page.
@inproceedings{bai2024consistencytta,
  author    = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
  title     = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
  booktitle = {INTERSPEECH},
  year      = {2024}
}
For any questions or suggestions regarding our work, please email yatong_bai@berkeley.edu and dung.tran@microsoft.com.