ConsistencyTTA: Accelerating Diffusion-Based
Text-to-Audio Generation with Consistency Distillation

Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Microsoft Applied Science Group, University of California, Berkeley

In INTERSPEECH 2024

Description

Diffusion models power the vast majority of text-to-audio generation methods. Unfortunately, they suffer from slow inference because the underlying denoising network must be queried iteratively, making them unsuitable for applications with time or computational constraints. This work proposes text-to-audio models that require only a single non-autoregressive neural network query, accelerating generation by hundreds of times and enabling on-device audio generation.

To achieve this, we propose the "CFG-aware latent consistency model", which moves consistency generation into a latent space and incorporates classifier-free guidance (CFG) into the distillation process. By doing so, our models retain diffusion models' impressive generation quality and diversity. Unlike with diffusion models, ConsistencyTTA's single-step generation makes the generated audio available during training. We leverage this advantage to fine-tune ConsistencyTTA end-to-end with audio-space, text-aware metrics. Using the CLAP score as an example, we confirm that this end-to-end fine-tuning further boosts generation quality.
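To make the training recipe concrete, below is a minimal PyTorch sketch of one CFG-aware latent consistency distillation step. Everything in it is a toy stand-in: the tiny Denoiser module, the crude Euler teacher step, the tensor sizes, the CFG-weight range, and the omitted boundary-condition parameterization are illustrative assumptions, not the actual ConsistencyTTA implementation (which distills a latent-diffusion U-Net teacher with a Heun solver).

import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, text_dim = 8, 16

class Denoiser(nn.Module):
    """Toy stand-in for the latent denoising network.
    The consistency student additionally consumes the CFG weight w ("CFG-aware")."""
    def __init__(self, cfg_aware: bool):
        super().__init__()
        extra = 1 if cfg_aware else 0
        self.net = nn.Linear(latent_dim + text_dim + 1 + extra, latent_dim)

    def forward(self, z, t, text, w=None):
        feats = [z, text, t.unsqueeze(-1)]
        if w is not None:
            feats.append(w.unsqueeze(-1))
        return self.net(torch.cat(feats, dim=-1))

teacher = Denoiser(cfg_aware=False)       # frozen diffusion teacher
student = Denoiser(cfg_aware=True)        # single-step consistency student
ema_student = Denoiser(cfg_aware=True)    # EMA copy of the student, used as the target network
ema_student.load_state_dict(student.state_dict())

def teacher_cfg_step(z, t_hi, t_lo, text, w):
    """One CFG-guided teacher solver step (a crude Euler update stands in for DDIM/Heun)."""
    with torch.no_grad():
        eps_uncond = teacher(z, t_hi, torch.zeros_like(text))
        eps_cond = teacher(z, t_hi, text)
        eps = eps_uncond + w.unsqueeze(-1) * (eps_cond - eps_uncond)   # classifier-free guidance
        return z + (t_lo - t_hi).unsqueeze(-1) * eps

# One distillation step on a dummy batch.
z0 = torch.randn(4, latent_dim)           # clean audio latents (from a VAE encoder)
text = torch.randn(4, text_dim)           # text-encoder embeddings of the prompts
t_hi = torch.full((4,), 0.8)              # adjacent timesteps on the discretized schedule
t_lo = torch.full((4,), 0.7)
w = torch.empty(4).uniform_(2.0, 6.0)     # random CFG weight, also fed to the student
z_noisy = z0 + t_hi.unsqueeze(-1) * torch.randn_like(z0)

z_prev = teacher_cfg_step(z_noisy, t_hi, t_lo, text, w)    # guided teacher moves one step toward the data
pred = student(z_noisy, t_hi, text, w)                     # student output at the later timestep
with torch.no_grad():
    target = ema_student(z_prev, t_lo, text, w)            # EMA student output at the earlier timestep
loss = F.mse_loss(pred, target)                            # consistency loss
loss.backward()

At inference time, a single student query maps pure noise (plus the desired CFG weight) directly to an audio latent, which a VAE decoder and vocoder then turn into a waveform.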

Please check out our poster at INTERSPEECH 2024 on Kos Island, Greece!

Main Experiment Results

ConsistencyTTA Results                  

Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400 and enables on-device generation, with minimal degradation in Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence (KLD), and CLAP scores.
Generation Time is the time in minutes to generate the entire validation set (882 samples).
CLAP_T scores the generated audio against the text prompt; CLAP_A scores it against the ground-truth audio.
↑: higher is better; ↓: lower is better.

Model                     # Queries ↓  Generation Time ↓  Subjective Quality ↑  Subjective Text Align. ↑  CLAP_T ↑  CLAP_A ↑  FAD ↓  FD ↓   KLD ↓
AudioLDM-L (Baseline)     400          -                  -                     -                         -         -         2.08   27.12  1.86
TANGO (Baseline)          400          168                4.136                 4.064                     24.10     72.85     1.631  20.11  1.362
ConsistencyTTA + CLAP-FT  1            2.3                3.830                 4.064                     24.69     72.54     2.406  20.97  1.358
ConsistencyTTA            1            2.3                3.902                 4.010                     22.50     72.30     2.575  22.08  1.354
Ground Truth              -            -                  -                     -                         26.71     100       -      -      -

This benchmark demonstrates how our single-step models stack up against previous methods, most of which require hundreds of generation steps.
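As an aside on the metrics, the following is a hedged sketch of how CLAP-based scores such as CLAP_T and CLAP_A can be computed with the open-source laion_clap package. The checkpoint, preprocessing, and file names below are illustrative assumptions and may differ from the exact evaluation pipeline behind the numbers above.

import numpy as np
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()    # downloads and loads a default pretrained CLAP checkpoint

def clap_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two CLAP embeddings, reported as a percentage."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return 100.0 * float(a @ b)

# Hypothetical file names for one generated clip and its ground-truth reference.
gen_emb = model.get_audio_embedding_from_filelist(x=["generated.wav"], use_tensor=False)[0]
ref_emb = model.get_audio_embedding_from_filelist(x=["reference.wav"], use_tensor=False)[0]
txt_emb = model.get_text_embedding(["A dog barks while birds chirp nearby"], use_tensor=False)[0]

print("CLAP_T (generation vs. prompt):   ", clap_score(gen_emb, txt_emb))
print("CLAP_A (generation vs. reference):", clap_score(gen_emb, ref_emb))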

Ablation Studies on Distillation Settings

Guidance Method                   CFG Weight  Teacher Solver  Noise Schedule  FAD ↓   FD ↓    KLD ↓
Unguided                          1           DDIM            Uniform         13.48   45.75   2.409
External CFG                      3           DDIM            Uniform         8.565   38.67   2.015
External CFG                      3           Heun            Karras          7.421   39.36   1.976
CFG Distillation (Fixed Weight)   3           Heun            Karras          5.702   33.18   1.494
CFG Distillation (Fixed Weight)   3           Heun            Uniform         3.859   27.79   1.421
CFG Distillation (Random Weight)  4           Heun            Uniform         3.180   27.92   1.394
CFG Distillation (Random Weight)  6           Heun            Uniform         2.975   28.63   1.378
Based on these results, we can conclude that classifier-free guidance is essential (the unguided model lags far behind), that distilling CFG into the student clearly outperforms applying external CFG at inference time, and that the best FAD and KLD come from random-weight CFG distillation with a Heun teacher solver and the uniform noise schedule.
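To make the difference between external CFG and CFG distillation concrete, here is a toy PyTorch sketch. The linear "networks", the Euler update, and the step count are placeholder assumptions rather than the paper's actual sampler; they only illustrate where the query counts come from.

import torch
import torch.nn as nn

latent_dim, text_dim, w = 8, 16, 3.0
teacher = nn.Linear(latent_dim + text_dim + 1, latent_dim)    # stands in for the diffusion teacher
student = nn.Linear(latent_dim + text_dim + 2, latent_dim)    # CFG-aware student: also takes w as input

text = torch.randn(1, text_dim)    # text embedding of the prompt
z = torch.randn(1, latent_dim)     # initial latent noise

def query(model, z, t, cond, guid=None):
    """One denoiser query: concatenate latent, condition, timestep (and optionally the CFG weight)."""
    feats = [z, cond, torch.full((1, 1), float(t))]
    if guid is not None:
        feats.append(torch.full((1, 1), float(guid)))
    return model(torch.cat(feats, dim=-1))

# (a) External CFG: every one of the N solver steps needs two network queries.
N = 200
ts = torch.linspace(1.0, 0.0, N + 1)
z_ext = z.clone()
for t_hi, t_lo in zip(ts[:-1], ts[1:]):
    e_un = query(teacher, z_ext, t_hi, torch.zeros_like(text))      # unconditional query
    e_co = query(teacher, z_ext, t_hi, text)                        # conditional query
    z_ext = z_ext + (t_lo - t_hi) * (e_un + w * (e_co - e_un))      # guided Euler update
# Total: 2 * N = 400 queries, matching the baseline query count in the first table.

# (b) CFG distillation: the guidance weight is a network input, so a single query suffices.
z_single = query(student, z, 1.0, text, guid=w)                     # Total: 1 query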

Generation Diversity

Like diffusion models, our consistency models demonstrate non-trivial generation diversity. On this page, we present 50 groups of generations produced from four different random seeds, showing that our method combines the diversity of diffusion models with the efficiency of single-step generation.

Human Evaluation

ConsistencyTTA's performance is verified via extensive human evaluation. Audio clips generated by ConsistencyTTA and by baseline methods are shuffled and presented to evaluators, who are asked to rate each clip for its quality and its correspondence with the text prompt. A sample of the evaluation form is shown on this page.

Citing Our Work (BibTeX)

@inproceedings{bai2024accelerating,
  author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
  title = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
  booktitle = {INTERSPEECH},
  year = {2024}
}

Contact

For any questions or suggestions regarding our work, please email yatong_bai@berkeley.edu and dung.tran@microsoft.com.