**GAN-TTS** is a generative adversarial network for text-to-speech synthesis. The architecture consists of a conditional feed-forward generator that produces raw speech audio and an ensemble of discriminators that operate on random windows of different sizes. The discriminators assess the audio both for general realism and for how well it corresponds to the utterance that should be pronounced.

The generator architecture consists of several GBlocks, which are residual-based (dilated) [convolution](https://paperswithcode.com/method/convolution) blocks. GBlocks 3–7 gradually upsample the temporal dimension of hidden representations by factors of 2, 2, 2, 3, 5, while the number of channels is halved by each of GBlocks 3, 6 and 7. The final convolutional layer with [Tanh activation](https://paperswithcode.com/method/tanh-activation) produces a single-channel audio waveform.
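The upsampling schedule above can be checked with a little arithmetic. The following is a minimal sketch that multiplies the per-block factors and tracks the channel count through the generator. The function name, the assumption that GBlocks 1 and 2 leave the length unchanged, and the starting width of 768 channels are illustrative choices, not values stated above.

```python
from math import prod

# Per-block temporal upsampling factors.
# GBlocks 3-7 use 2, 2, 2, 3, 5 as described above; GBlocks 1-2 are
# assumed here to leave the temporal length unchanged.
UPSAMPLE = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 3, 7: 5}

# GBlocks that halve the number of channels.
HALVE_CHANNELS = {3, 6, 7}


def trace_generator_shapes(input_len, input_channels):
    """Return (block, length, channels) after each GBlock."""
    length, channels = input_len, input_channels
    trace = []
    for block in sorted(UPSAMPLE):
        length *= UPSAMPLE[block]
        if block in HALVE_CHANNELS:
            channels //= 2
        trace.append((block, length, channels))
    return trace


total_factor = prod(UPSAMPLE.values())  # 2 * 2 * 2 * 3 * 5

# Hypothetical input: 10 conditioning frames with 768 channels (assumed width).
for block, length, channels in trace_generator_shapes(10, 768):
    print(f"GBlock {block}: length={length}, channels={channels}")
```

With these factors the total temporal upsampling is 120×, and the channel count falls by a factor of 8 overall.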

Instead of a single discriminator, GAN-TTS uses an ensemble of Random Window Discriminators (RWDs), which operate on randomly sub-sampled fragments of the real or generated samples. Because each discriminator sees windows of a different size, the ensemble evaluates the audio in different, complementary ways.
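The random sub-sampling step can be sketched as below. This covers only how each discriminator in the ensemble might receive its own randomly positioned fragment; the discriminator networks and the linguistic conditioning are omitted. The window sizes, function names, and toy waveform are assumptions for illustration.

```python
import random

# Hypothetical window sizes in samples; the text only says "different sizes".
WINDOW_SIZES = (240, 480, 960, 1920, 3600)


def sample_random_window(waveform, window_size, rng):
    """Return a contiguous fragment of `window_size` samples at a random offset."""
    if window_size > len(waveform):
        raise ValueError("window is longer than the waveform")
    start = rng.randint(0, len(waveform) - window_size)  # bounds are inclusive
    return waveform[start:start + window_size]


def ensemble_inputs(waveform, window_sizes=WINDOW_SIZES, seed=0):
    """Draw one independent random window per discriminator in the ensemble."""
    rng = random.Random(seed)
    return [sample_random_window(waveform, w, rng) for w in window_sizes]


# Toy "waveform": a 4800-sample ramp standing in for real or generated audio.
waveform = [i / 4800 for i in range(4800)]
windows = ensemble_inputs(waveform)
print([len(w) for w in windows])
```

Each discriminator only ever sees a short fragment, so the same utterance yields many different training examples across iterations.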

**CBHG** is a building block used in the [Tacotron](https://paperswithcode.com/method/tacotron) text-to-speech model. It consists of a bank of 1-D convolutional filters, followed by highway networks and a bidirectional gated recurrent unit ([BiGRU](https://paperswithcode.com/method/bigru)). 

The module is used to extract representations from sequences. The input sequence is first convolved with $K$ sets of 1-D convolutional filters, where the $k$-th set contains $C_{k}$ filters of width $k$ (i.e. $k = 1, 2, \dots , K$). These filters explicitly model local and contextual information (akin to modeling unigrams, bigrams, up to $K$-grams). The [convolution](https://paperswithcode.com/method/convolution) outputs are stacked together and then max pooled along time to increase local invariances. A stride of 1 is used to preserve the original time resolution. The processed sequence is then passed to a few fixed-width 1-D convolutions, whose outputs are added to the original input sequence via residual connections. [Batch normalization](https://paperswithcode.com/method/batch-normalization) is used for all convolutional layers. The convolution outputs are fed into a multi-layer [highway network](https://paperswithcode.com/method/highway-network) to extract high-level features. Finally, a bidirectional [GRU](https://paperswithcode.com/method/gru) RNN is stacked on top to extract sequential features from both forward and backward context.
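The first stage of the module, the convolution bank followed by stride-1 max pooling, can be illustrated in plain Python. This is a minimal sketch under simplifying assumptions: a single input channel, one filter per width ($C_{k} = 1$), fixed averaging weights in place of learned ones, and a pooling width of 2. Batch normalization, the fixed-width projection convolutions, the highway network, and the bidirectional GRU are omitted.

```python
def conv1d_same(seq, kernel):
    """1-D convolution (cross-correlation) with stride 1 and zero padding,
    so the output has the same length as the input."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(seq) + [0.0] * right
    return [
        sum(w * padded[t + j] for j, w in enumerate(kernel))
        for t in range(len(seq))
    ]


def max_pool_same(seq, width=2):
    """Max pooling along time with stride 1. The last frames see a shorter
    window, so the time resolution is preserved."""
    return [max(seq[t:t + width]) for t in range(len(seq))]


def conv_bank(seq, K):
    """Apply filters of width 1..K and stack the pooled outputs (K rows, T columns)."""
    stacked = []
    for k in range(1, K + 1):
        kernel = [1.0 / k] * k  # placeholder weights; real filters are learned
        stacked.append(max_pool_same(conv1d_same(seq, kernel)))
    return stacked


sequence = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0]
features = conv_bank(sequence, K=3)
print(len(features), [len(row) for row in features])
```

Every row of the stacked output has the same length as the input, which is what allows the later residual connection to add the result back to the original sequence.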

CBHG

Tacotron: Towards End-to-End Speech Synthesis

GAN-TTS

High Fidelity Speech Synthesis with Adversarial Networks

The NA method can be divided into two steps: (i) training a neural network approximation of $f$, and (ii) inference of $\hat{x}$. Step (i) is conventional and involves training a generic neural network on a dataset of input/output pairs from the simulator, denoted $D$, resulting in $\hat{f}$, an approximation of the forward model. This is illustrated in the left inset of Fig 1. In step (ii), our goal is to use $\partial \hat{f}/\partial x$ to help us gradually adjust $x$ so that we achieve a desired output of the forward model, $y$. This is similar to many classical inverse modeling approaches, such as the popular Adjoint method [8, 9]. For many practical inverse problems, however, obtaining $\partial f/\partial x$ requires significant expertise and/or effort, making these approaches challenging. Crucially, $\hat{f}$ from step (i) provides us with a closed-form differentiable expression for the simulator, from which it is trivial to compute $\partial \hat{f}/\partial x$, and furthermore, we can use modern deep learning software packages to efficiently estimate gradients, given a loss function $L$.

More formally, let $y$ be our target output, and let $\hat{x}^{i}$ be our current estimate of the solution, where $i$ indexes each solution we obtain in an iterative gradient-based estimation procedure. Then we compute $\hat{x}^{i+1}$ from $\hat{x}^{i}$ with a gradient step on $L$ with respect to $x$.
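The iterative inference in step (ii) can be sketched as plain gradient descent on the input of a differentiable surrogate. The example below is a toy illustration, not the method's actual implementation: the surrogate is a hand-written closed-form function standing in for a trained network, the loss is assumed to be squared error, and the step size, iteration count, and function names are hypothetical. In practice $\hat{f}$ would be a neural network and an automatic differentiation package would supply the gradient.

```python
def f_hat(x):
    """Toy stand-in for the trained surrogate f_hat (not a real network)."""
    return x ** 3 + x


def df_hat_dx(x):
    """Closed-form derivative of the toy surrogate."""
    return 3 * x ** 2 + 1


def invert(y_target, x_init, step_size=0.001, num_steps=2000):
    """Iteratively adjust x so that f_hat(x) approaches y_target."""
    x = x_init
    for _ in range(num_steps):
        residual = f_hat(x) - y_target
        # Chain rule for the assumed loss L = (f_hat(x) - y)^2:
        # dL/dx = 2 * residual * d f_hat / dx
        grad = 2.0 * residual * df_hat_dx(x)
        x = x - step_size * grad
    return x


x_est = invert(y_target=10.0, x_init=0.5)
print(x_est, f_hat(x_est))
```

The toy surrogate is monotonic, so the target $y = 10$ has a single solution at $x = 2$. A real surrogate may have many local minima, and the result then depends on the initial estimate.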

Source	High Fidelity Speech Synthesis with Adversarial Networks
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com