Diffusion models power the vast majority of text-to-audio generation methods. Unfortunately, they suffer from slow inference because the underlying denoising network must be queried iteratively, making them unsuitable for applications with time or computational constraints. This work proposes text-to-audio models that require only a single non-autoregressive neural network query, accelerating generation by hundreds of times and enabling on-device audio generation.
To achieve this, we propose the "CFG-aware latent consistency model", which moves consistency generation into a latent space and incorporates classifier-free guidance (CFG) into the training process. As a result, our models retain diffusion models' impressive generation quality and diversity. Unlike with diffusion models, ConsistencyTTA's single-step generation makes the generated audio available during training. We leverage this advantage to fine-tune ConsistencyTTA end-to-end with audio-space, text-aware metrics such as the CLAP score. Using the CLAP loss as an example, we confirm that end-to-end fine-tuning further boosts generation quality.
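Below is a minimal PyTorch-style sketch of what a CFG-aware consistency distillation step could look like. All components (`teacher_unet`, `student_unet`, `ema_student`, `ode_solver_step`, the sigma schedule, and the loss choice) are hypothetical placeholders for illustration, not the released ConsistencyTTA implementation.

```python
import torch
import torch.nn.functional as F

def distillation_step(z, text_emb, uncond_emb, t_idx, sigmas,
                      teacher_unet, student_unet, ema_student,
                      ode_solver_step, w_range=(0.0, 6.0)):
    """One hypothetical CFG-aware consistency distillation step on a batch
    of clean audio latents z (shape: batch x latent dims)."""
    # Sample a guidance weight; the random-weight variant draws it per example,
    # while the fixed-weight variant would use a constant.
    w = torch.empty(z.shape[0], device=z.device).uniform_(*w_range)
    w_b = w.view(-1, *([1] * (z.dim() - 1)))  # broadcastable over latent dims

    # Adjacent noise levels from the (placeholder) sigma schedule.
    sigma_hi, sigma_lo = sigmas[t_idx], sigmas[t_idx - 1]
    z_hi = z + sigma_hi * torch.randn_like(z)  # noised latent at the higher level

    # Teacher's CFG-guided prediction: combine conditional and unconditional
    # passes, so guidance is baked into the distillation target.
    with torch.no_grad():
        eps_cond = teacher_unet(z_hi, sigma_hi, text_emb)
        eps_uncond = teacher_unet(z_hi, sigma_hi, uncond_emb)
        eps_cfg = eps_uncond + w_b * (eps_cond - eps_uncond)

        # One ODE solver step (e.g., Heun) toward the lower noise level.
        z_lo = ode_solver_step(z_hi, eps_cfg, sigma_hi, sigma_lo)

        # Consistency target: the EMA student evaluated at the lower noise
        # level. The student is conditioned on the guidance weight w.
        target = ema_student(z_lo, sigma_lo, text_emb, w)

    # The student at the higher noise level should match the target
    # (MSE here; the actual distance metric may differ).
    pred = student_unet(z_hi, sigma_hi, text_emb, w)
    return F.mse_loss(pred, target)
```

Because the CFG combination happens inside the distillation target and the student takes the guidance weight as an input, no separate unconditional pass is needed at inference time.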
Please check out our poster at INTERSPEECH 2024 on Kos Island, Greece!
Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400 and enables on-device generation, with minimal performance degradation in Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL Divergence (KLD), and CLAP scores.
Generation Time is the time in minutes to generate the entire validation set (882 samples).
↑: higher is better; ↓: lower is better.
| Model | # Model Queries ↓ | Generation Time ↓ | Subjective Quality ↑ | Subjective Text Align. ↑ | CLAP_T ↑ | CLAP_A ↑ | FAD ↓ | FD ↓ | KLD ↓ |
|---|---|---|---|---|---|---|---|---|---|
| AudioLDM-L (Baseline) | 400 | - | - | - | - | - | 2.08 | 27.12 | 1.86 |
| TANGO (Baseline) | 400 | 168 | 4.136 | 4.064 | 24.10 | 72.85 | 1.631 | 20.11 | 1.362 |
| ConsistencyTTA + CLAP-FT | 1 | 2.3 | 3.830 | 4.064 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358 |
| ConsistencyTTA | 1 | 2.3 | 3.902 | 4.010 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354 |
| Ground Truth | - | - | - | - | 26.71 | 100 | - | - | - |
This benchmark demonstrates how our single-step models stack up against previous methods, most of which require hundreds of generation steps.
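For reference, the CLAP scores reported above are cosine similarities between CLAP embeddings. The sketch below assumes the embeddings have already been computed with some CLAP audio/text encoder (not shown here); it illustrates the metric's form rather than our evaluation code. A differentiable text-audio similarity of this form can also serve as the audio-space CLAP loss mentioned for end-to-end fine-tuning.

```python
import torch.nn.functional as F

def clap_scores(gen_audio_emb, text_emb, ref_audio_emb):
    """Cosine similarities (in percent) between CLAP embeddings.

    gen_audio_emb: CLAP embeddings of the generated clips, shape (N, D)
    text_emb:      CLAP embeddings of the text prompts,    shape (N, D)
    ref_audio_emb: CLAP embeddings of the reference clips, shape (N, D)
    """
    clap_t = 100 * F.cosine_similarity(gen_audio_emb, text_emb, dim=-1).mean()
    clap_a = 100 * F.cosine_similarity(gen_audio_emb, ref_audio_emb, dim=-1).mean()
    return clap_t.item(), clap_a.item()
```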
| Guidance Method | CFG Weight | Teacher Solver | Noise Schedule | FAD ↓ | FD ↓ | KLD ↓ |
|---|---|---|---|---|---|---|
| Unguided | 1 | DDIM | Uniform | 13.48 | 45.75 | 2.409 |
| External CFG | 3 | DDIM | Uniform | 8.565 | 38.67 | 2.015 |
| External CFG | 3 | Heun | Karras | 7.421 | 39.36 | 1.976 |
| CFG Distillation with Fixed Weight | 3 | Heun | Karras | 5.702 | 33.18 | 1.494 |
| CFG Distillation with Fixed Weight | 3 | Heun | Uniform | 3.859 | 27.79 | 1.421 |
| CFG Distillation with Random Weight | 4 | Heun | Uniform | 3.180 | 27.92 | 1.394 |
| CFG Distillation with Random Weight | 6 | Heun | Uniform | 2.975 | 28.63 | 1.378 |
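To illustrate why a single network query suffices even with guidance, here is a hedged single-step inference sketch. `text_encoder`, `consistency_unet`, `vae_decoder`, `vocoder`, the latent shape, and `sigma_max` are hypothetical placeholders rather than the actual ConsistencyTTA interfaces.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, consistency_unet, vae_decoder, vocoder,
             latent_shape=(1, 8, 256, 16), sigma_max=80.0, cfg_weight=4.0,
             seed=0, device="cuda"):
    """Single-query text-to-audio generation with a CFG-weight-conditioned
    consistency model (all components are placeholders)."""
    torch.manual_seed(seed)                                   # different seeds -> different clips
    text_emb = text_encoder(prompt)                           # text conditioning
    z = sigma_max * torch.randn(latent_shape, device=device)  # pure-noise latent
    w = torch.full((latent_shape[0],), cfg_weight, device=device)

    # One network query: no iterative denoising and no separate unconditional
    # pass, since the guidance weight w is an input to the distilled model.
    z0 = consistency_unet(z, torch.full_like(w, sigma_max), text_emb, w)

    mel = vae_decoder(z0)   # latent -> mel spectrogram
    return vocoder(mel)     # mel spectrogram -> waveform
```

Because the distilled model is conditioned on the guidance weight, the test-time CFG weights of 4 and 6 in the table above select the guidance strength without any additional network queries.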
Like diffusion models, consistency models demonstrate non-trivial generation diversity. On this page, we present 50 groups of generations produced with four different random seeds, demonstrating that our method combines the diversity of diffusion models with the efficiency of single-step generation.
ConsistencyTTA's performance is verified via extensive human evaluation. Audio clips generated by ConsistencyTTA and baseline methods are mixed and presented to evaluators, who rate each clip's quality and its correspondence with the textual prompt. A sample of the evaluation form is shown on this page.
@inproceedings{bai2024accelerating,
  author    = {Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
  title     = {ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
  booktitle = {INTERSPEECH},
  year      = {2024}
}
For any questions or suggestions regarding our work, please email yatong_bai@berkeley.edu and dung.tran@microsoft.com.