**Tacotron 2** is a neural network architecture for speech synthesis directly from text. It consists of two components:

- a recurrent sequence-to-sequence feature prediction network with
attention which predicts a sequence of mel spectrogram frames from
an input character sequence
- a modified version of [WaveNet](https://paperswithcode.com/method/wavenet) which generates time-domain waveform samples conditioned on the
predicted mel spectrogram frames

In contrast to the original [Tacotron](https://paperswithcode.com/method/tacotron), Tacotron 2 uses simpler building blocks: vanilla [LSTM](https://paperswithcode.com/method/lstm) and convolutional layers in the encoder and decoder instead of [CBHG](https://paperswithcode.com/method/cbhg) stacks and [GRU](https://paperswithcode.com/method/gru) recurrent layers. Tacotron 2 does not use a “reduction factor”; that is, each decoder step corresponds to a single spectrogram frame. Location-sensitive attention is used instead of [additive attention](https://paperswithcode.com/method/additive-attention).
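
To make the difference between the two attention mechanisms concrete, the following is a sketch in commonly used notation. None of these symbols appear in the text above and all are assumptions: $s_{i-1}$ is the previous decoder state, $h_{j}$ the $j$-th encoder output, $\alpha_{i-1}$ the previous attention weights, $F$ a learned convolution filter bank, and $W$, $V$, $U$, $w$, $b$ learned parameters. Additive attention scores each encoder position from content alone:

$$ e_{i,j} = w^{\top} \tanh\left(W s_{i-1} + V h_{j} + b\right) $$

Location-sensitive attention adds features computed by convolving the previous alignment, so the score also depends on where the model attended before:

$$ f_{i} = F * \alpha_{i-1}, \qquad e_{i,j} = w^{\top} \tanh\left(W s_{i-1} + V h_{j} + U f_{i,j} + b\right) $$

In both cases the attention weights are obtained by a softmax over $j$, $\alpha_{i,j} = \exp\left(e_{i,j}\right) / \sum_{k} \exp\left(e_{i,k}\right)$. Tacotron 2 is usually described as feeding cumulative attention weights from earlier decoder steps into the location features, which would replace $\alpha_{i-1}$ above with a running sum.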

**Metropolis-Hastings** (M-H) is a Markov Chain Monte Carlo (MCMC) algorithm for approximate inference. It allows sampling from a probability distribution when direct sampling is difficult, usually owing to the presence of an intractable integral.

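M-H uses a proposal distribution $q\left(\theta^{'}\mid\theta\right)$ to draw a candidate parameter value $\theta^{'}$ given the current value $\theta$. To decide whether $\theta^{'}$ is accepted or rejected, we then calculate the ratio of posterior densities:

$$ \frac{p\left(\theta^{'}\mid{D}\right)}{p\left(\theta\mid{D}\right)} $$

The intractable normalizing constant cancels in this ratio, so only the unnormalized posterior is needed. This form assumes a symmetric proposal; for an asymmetric proposal the ratio is additionally multiplied by $q\left(\theta\mid\theta^{'}\right) / q\left(\theta^{'}\mid\theta\right)$.

We then draw a random number $r$ uniformly from $\left[0, 1\right]$, accept if $r$ is below the ratio, and reject otherwise. If we accept, we set $\theta_{i} = \theta^{'}$; if we reject, we keep the current value, $\theta_{i} = \theta_{i-1}$. In either case we repeat from the new current value.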

By the end we have a sample of $\theta$ values that we can use to estimate quantities of the approximate posterior, such as the expectation and uncertainty bounds. In practice, we typically have a period of tuning to achieve an acceptable acceptance rate for the algorithm, as well as a warmup period to reduce bias towards initialization values.
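
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the target `log_target` is a standard normal standing in for the unnormalized log posterior, the proposal is a symmetric Gaussian random walk (so the proposal terms cancel in the ratio), and the names `metropolis_hastings`, `step`, and `warmup` are hypothetical.

```python
import math
import random


def log_target(theta):
    """Unnormalized log density of a standard normal.

    Assumption: a stand-in for log p(theta | D) up to an additive constant.
    """
    return -0.5 * theta * theta


def metropolis_hastings(log_p, theta0, n_samples, step=1.0, warmup=1000, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal.

    Returns the post-warmup samples and the post-warmup acceptance rate.
    """
    rng = random.Random(seed)
    theta = theta0
    samples = []
    accepted = 0
    for i in range(warmup + n_samples):
        # Draw a candidate from q(theta' | theta) = Normal(theta, step).
        proposal = rng.gauss(theta, step)
        # Log of the posterior ratio; the normalizing constant cancels.
        log_ratio = log_p(proposal) - log_p(theta)
        # Accept if a uniform draw falls below the ratio.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta = proposal
            if i >= warmup:
                accepted += 1
        # On rejection the current value is kept and recorded again.
        if i >= warmup:
            samples.append(theta)
    return samples, accepted / n_samples


samples, rate = metropolis_hastings(log_target, theta0=5.0, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.3f} var={var:.3f} acceptance rate={rate:.2f}")
```

Working in log space avoids underflow when the densities are very small. The `step` parameter plays the role of the tuning knob mentioned above: larger steps lower the acceptance rate, smaller steps raise it but explore the distribution more slowly.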

Image: [Samuel Hudec](https://static1.squarespace.com/static/52e69d46e4b05a145935f24d/t/5a7dbadcf9619a745c5b2513/1518189289690/Stan.pdf)


**Probabilistically Masked Language Model**, or **PMLM**, is a masked language model that uses a probabilistic masking scheme, aiming to bridge the gap between masked and autoregressive language models. The basic idea behind connecting the two categories of models is similar to MADE by Germain et al. (2015). The probabilistic masking scheme defines how sequences are masked by drawing the masking ratio from a probability distribution. The authors employ a simple uniform distribution over the masking ratio and name the resulting model u-PMLM.
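
The masking scheme can be illustrated with a short Python sketch. It assumes that, for each sequence, a masking ratio is drawn uniformly from $[0, 1)$ and each position is then masked independently with that probability; the independence assumption, the `[MASK]` placeholder string, and the function name `probabilistic_mask` are illustrative choices that the description above does not specify.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token


def probabilistic_mask(tokens, rng):
    """Mask a token sequence with a uniformly sampled masking ratio.

    Assumption: each position is masked independently with probability
    equal to the sampled ratio.
    """
    ratio = rng.random()  # masking ratio drawn from a uniform distribution
    masked = [MASK if rng.random() < ratio else tok for tok in tokens]
    return masked, ratio


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
for _ in range(3):
    masked, ratio = probabilistic_mask(tokens, rng)
    print(f"ratio={ratio:.2f}: {' '.join(masked)}")
```

Because the ratio varies from sequence to sequence, training examples range from almost fully visible to almost fully masked, in contrast to a fixed masking ratio.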

Source	Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
Year	2017
Data Source	CC BY-SA - https://paperswithcode.com