ConsistencyTTA: Accelerating Diffusion-Based
Text-to-Audio Generation with Consistency Distillation

Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi

Microsoft Applied Science Group,      University of California, Berkeley

Description

Diffusion models power the vast majority of text-to-audio generation methods. Unfortunately, they suffer from slow inference because they iteratively query the underlying denoising network, making them unsuitable for applications with time or computational constraints. This work trains text-to-audio models that require only a single non-autoregressive neural network query, accelerating generation by hundreds of times.
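For intuition, the sketch below contrasts the two inference regimes. It is a minimal sketch rather than the ConsistencyTTA implementation: the denoiser, consistency model, latent shape, and the diffusers-style scheduler interface are all assumed placeholders.

import torch

@torch.no_grad()
def diffusion_generate(denoiser, scheduler, text_emb, latent_shape, num_steps=200):
    """Iterative diffusion sampling: one denoiser query per step (two with CFG)."""
    scheduler.set_timesteps(num_steps)               # diffusers-style scheduler (assumed)
    z = torch.randn(latent_shape)                    # start from pure latent noise
    for t in scheduler.timesteps:
        noise_pred = denoiser(z, t, text_emb)        # repeated network queries
        z = scheduler.step(noise_pred, t, z).prev_sample
    return z

@torch.no_grad()
def consistency_generate(consistency_model, text_emb, latent_shape, t_max):
    """Single-step consistency sampling: one non-autoregressive network query."""
    z = torch.randn(latent_shape)
    return consistency_model(z, t_max, text_emb)     # noise -> clean latent in one query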

To achieve this, we propose the "CFG-aware latent consistency model", which moves consistency generation into a latent space and incorporates classifier-free guidance (CFG) into the training process. As a result, our models retain the impressive generation quality and diversity of diffusion models. Moreover, unlike diffusion models, ConsistencyTTA's single-step generation makes the generated audio available during training. We leverage this advantage to fine-tune ConsistencyTTA end-to-end with audio-space, text-aware metrics; using the CLAP score as an example, we confirm that such end-to-end fine-tuning further boosts generation quality.
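As a rough illustration of CFG-aware consistency distillation, the sketch below shows one training step under stated assumptions: the teacher's noise prediction is mixed with classifier-free guidance before a single solver step, and the student, conditioned on the guidance weight, is trained to match an EMA copy of itself at the adjacent timestep. All names and signatures (student, ema_student, teacher, solver, the embeddings) are illustrative placeholders, not the actual training code.

import torch
import torch.nn.functional as F

def add_noise(z0, noise, t, alphas_cumprod):
    """DDPM-style forward noising of a clean latent z0 to timestep t."""
    a = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return a * z0 + s * noise

def distillation_step(student, ema_student, teacher, solver, alphas_cumprod,
                      z0, text_emb, null_emb, t_n, t_np1, guidance_w):
    """One CFG-aware latent consistency distillation step (illustrative signatures)."""
    noise = torch.randn_like(z0)
    z_np1 = add_noise(z0, noise, t_np1, alphas_cumprod)   # noisy latent at t_{n+1}

    with torch.no_grad():
        # The teacher denoises with classifier-free guidance (conditional vs. unconditional).
        eps_cond = teacher(z_np1, t_np1, text_emb)
        eps_uncond = teacher(z_np1, t_np1, null_emb)
        eps_cfg = eps_uncond + guidance_w * (eps_cond - eps_uncond)
        z_n = solver(z_np1, eps_cfg, t_np1, t_n)          # one ODE-solver step toward t_n
        # Consistency target from the EMA copy of the student at the earlier timestep.
        target = ema_student(z_n, t_n, text_emb, guidance_w)

    # Conditioning the student on the guidance weight is what makes it CFG-aware.
    pred = student(z_np1, t_np1, text_emb, guidance_w)
    return F.mse_loss(pred, target)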

Main Experiment Results

ConsistencyTTA Results

Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400, with minimal performance degradation in terms of Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence (KLD), and CLAP scores.

Model                         # queries (↓)  CLAP_T (↑)  CLAP_A (↑)  FAD (↓)  FD (↓)  KLD (↓)
Diffusion (Baseline)          400            24.57       72.79       1.908    19.57   1.350
Consistency + CLAP FT (Ours)  1              24.69       72.54       2.406    20.97   1.358
Consistency (Ours)            1              22.50       72.30       2.575    22.08   1.354
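As a reference point for the CLAP columns, the sketch below shows one way to compute CLAP_T (generated audio vs. text prompt) and CLAP_A (generated vs. reference audio) as cosine similarities between CLAP embeddings; the CLAP loss used for end-to-end fine-tuning is closely related (e.g., one minus this similarity). It assumes the Hugging Face transformers CLAP implementation and the laion/clap-htsat-unfused checkpoint, which may differ from the exact CLAP model used in our evaluation.

import torch
import torch.nn.functional as F
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

@torch.no_grad()
def clap_scores(generated_audio, reference_audio, prompt, sr=48_000):
    """generated_audio, reference_audio: 1-D numpy waveforms at sampling rate sr."""
    audio_in = processor(audios=[generated_audio, reference_audio],
                         sampling_rate=sr, return_tensors="pt")
    text_in = processor(text=[prompt], return_tensors="pt", padding=True)
    audio_emb = F.normalize(model.get_audio_features(**audio_in), dim=-1)
    text_emb = F.normalize(model.get_text_features(**text_in), dim=-1)
    clap_t = 100 * (audio_emb[0] @ text_emb[0])    # generated audio vs. prompt
    clap_a = 100 * (audio_emb[0] @ audio_emb[1])   # generated vs. reference audio
    return clap_t.item(), clap_a.item()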

This benchmark demonstrates how our single-step models stack up against previous methods, most of which require hundreds of generation steps.

Generation Diversity

Consistency models demonstrate non-trivial generation diversity, just as diffusion models do. On this page, we present 50 groups of generations, each produced with four different random seeds, to demonstrate this diversity, showing that our method combines the diversity of diffusion models with the efficiency of single-step models.
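As a small illustration, the snippet below (reusing the hypothetical consistency_generate sketch from the Description section) produces several variants of the same prompt by changing only the random seed, which changes the initial latent noise and therefore the generated clip.

import torch

def generate_variants(consistency_model, text_emb, latent_shape, t_max, seeds=(0, 1, 2, 3)):
    """Same prompt, different seeds -> different initial noise -> diverse single-step outputs."""
    variants = []
    for seed in seeds:
        torch.manual_seed(seed)   # only the seed (and hence the initial latent) changes
        variants.append(consistency_generate(consistency_model, text_emb, latent_shape, t_max))
    return variants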

Human Evaluation

ConsistencyTTA's performance is further verified via extensive human evaluation. Audio clips generated by ConsistencyTTA and by baseline methods are mixed and presented to evaluators, who are asked to rate each clip on its quality and its correspondence with the textual prompt. A sample of the evaluation form is shown on this page.

Citing Our Work (BibTeX)

@article{bai2023accelerating,
  author = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
  title = {Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
  journal = {arXiv preprint arXiv:2309.10740},
  year = {2023}
}

Contact

For any questions or suggestions regarding our work, please email yatong_bai@berkeley.edu and dung.tran@microsoft.com.