
What is: WaveVAE?

Source: Non-Autoregressive Neural Text-to-Speech
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

WaveVAE is a generative audio model that can be used as a vocoder in text-to-speech systems. It is a VAE-based model that can be trained from scratch by jointly optimizing the encoder $q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{c})$ and the decoder $p_{\theta}(\mathbf{x} \mid \mathbf{z}, \mathbf{c})$, where $\mathbf{z}$ denotes the latent variables and $\mathbf{c}$ is the mel spectrogram conditioner.

The encoder of WaveVAE, $q_{\phi}(\mathbf{z} \mid \mathbf{x})$ (with the conditioner $\mathbf{c}$ omitted for brevity), is parameterized by a Gaussian autoregressive WaveNet that maps the ground-truth audio $\mathbf{x}$ into a latent representation $\mathbf{z}$ of the same length. The decoder $p_{\theta}(\mathbf{x} \mid \mathbf{z})$ is parameterized by the one-step-ahead predictions from an inverse autoregressive flow (IAF).
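
To make the "one-step-ahead prediction" idea concrete, here is a minimal NumPy sketch of a single IAF-style transformation, $x_t = z_t \cdot \sigma_t(\mathbf{z}_{<t}) + \mu_t(\mathbf{z}_{<t})$. It is an illustration under stated assumptions, not the WaveVAE implementation: `causal_stats` is a hypothetical toy stand-in for the WaveNet that would predict the shift and scale, its weights `w_mu` and `w_log_sigma` are made up, and the mel conditioner $\mathbf{c}$ is left out.

```python
import numpy as np

def causal_stats(z, w_mu=0.5, w_log_sigma=0.1):
    """Toy stand-in (assumption) for a WaveNet: the shift and scale at
    step t depend only on the strictly earlier latents z[:t]."""
    # Shift z right by one so that position t only sees z[0..t-1].
    prev = np.concatenate(([0.0], z[:-1]))
    # Damped running sum of past values, used as a causal context summary.
    context = np.cumsum(prev) / np.arange(1, len(z) + 1)
    mu = w_mu * context
    sigma = np.exp(w_log_sigma * context)  # exp keeps the scale positive
    return mu, sigma

def iaf_step(z):
    """One IAF-style transformation: x_t = z_t * sigma_t(z_<t) + mu_t(z_<t)."""
    mu, sigma = causal_stats(z)
    # All of z is known up front, so every x_t is computed in parallel.
    return z * sigma + mu

rng = np.random.default_rng(0)
z = rng.standard_normal(8)  # latent sequence, same length as the audio
x = iaf_step(z)             # output keeps the input length
```

Because the whole latent sequence is available at once, the output is produced in a single parallel pass rather than sample by sample, which is what makes this kind of decoder attractive for non-autoregressive synthesis.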

The training objective is the evidence lower bound (ELBO) on the log-likelihood of the observed audio $\mathbf{x}$ in the VAE.
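
For reference, the ELBO of a conditional VAE written with the symbols used above takes the standard form below. This is the textbook bound rather than a formula quoted from the source; the prior $p(\mathbf{z})$ is assumed here to be independent of the conditioner $\mathbf{c}$, and any auxiliary loss terms used in practice are not shown.

$$
\log p_{\theta}(\mathbf{x} \mid \mathbf{c}) \;\ge\; \mathbb{E}_{q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{c})}\left[\log p_{\theta}(\mathbf{x} \mid \mathbf{z}, \mathbf{c})\right] \;-\; \mathrm{KL}\left(q_{\phi}(\mathbf{z} \mid \mathbf{x}, \mathbf{c}) \,\|\, p(\mathbf{z})\right)
$$

The first term rewards the decoder for reconstructing the audio from the latent variables, while the KL term keeps the encoder's posterior close to the prior. Maximizing the right-hand side with respect to both $\phi$ and $\theta$ is what "jointly optimizing the encoder and decoder" amounts to.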