
What is: WaveRNN?

Source: Efficient Neural Audio Synthesis
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

WaveRNN is a single-layer recurrent neural network for audio generation that is designed to efficiently predict 16-bit raw audio samples.

The overall computation in the WaveRNN is as follows (biases omitted for brevity):

$$\mathbf{x}_{t} = \left[\mathbf{c}_{t-1}, \mathbf{f}_{t-1}, \mathbf{c}_{t}\right]$$

$$\mathbf{u}_{t} = \sigma\left(\mathbf{R}_{u}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{u}\mathbf{x}_{t}\right)$$

$$\mathbf{r}_{t} = \sigma\left(\mathbf{R}_{r}\mathbf{h}_{t-1} + \mathbf{I}^{*}_{r}\mathbf{x}_{t}\right)$$

$$\mathbf{e}_{t} = \tau\left(\mathbf{r}_{t} \odot \left(\mathbf{R}_{e}\mathbf{h}_{t-1}\right) + \mathbf{I}^{*}_{e}\mathbf{x}_{t}\right)$$

$$\mathbf{h}_{t} = \mathbf{u}_{t} \odot \mathbf{h}_{t-1} + \left(1 - \mathbf{u}_{t}\right) \odot \mathbf{e}_{t}$$

$$\mathbf{y}_{c}, \mathbf{y}_{f} = \text{split}\left(\mathbf{h}_{t}\right)$$

$$P\left(\mathbf{c}_{t}\right) = \text{softmax}\left(\mathbf{O}_{2}\,\text{relu}\left(\mathbf{O}_{1}\mathbf{y}_{c}\right)\right)$$

$$P\left(\mathbf{f}_{t}\right) = \text{softmax}\left(\mathbf{O}_{4}\,\text{relu}\left(\mathbf{O}_{3}\mathbf{y}_{f}\right)\right)$$

where the $*$ indicates a masked matrix whereby the last coarse input $\mathbf{c}_{t}$ is only connected to the fine part of the states $\mathbf{u}_{t}$, $\mathbf{r}_{t}$, $\mathbf{e}_{t}$ and $\mathbf{h}_{t}$, and thus only affects the fine output $\mathbf{y}_{f}$. The coarse and fine parts $\mathbf{c}_{t}$ and $\mathbf{f}_{t}$ are encoded as scalars in $[0, 255]$ and scaled to the interval $[-1, 1]$. The matrix $\mathbf{R}$ formed from the matrices $\mathbf{R}_{u}$, $\mathbf{R}_{r}$ and $\mathbf{R}_{e}$ is computed as a single matrix-vector product to produce the contributions to all three gates $\mathbf{u}_{t}$, $\mathbf{r}_{t}$ and $\mathbf{e}_{t}$ (a variant of the GRU cell). $\sigma$ and $\tau$ are the standard sigmoid and tanh non-linearities.
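To make the fused gating and the input mask concrete, here is a minimal NumPy sketch of one WaveRNN step. The hidden size of 896 matches the paper's WaveRNN-896 configuration, but the initialization, the exact mask layout, and all variable names are illustrative assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H = 896            # hidden state size; coarse half = fine half = 448 (illustrative)
half = H // 2

rng = np.random.default_rng(0)
R  = rng.standard_normal((3 * H, H)) * 0.01   # fused [R_u; R_r; R_e]
I  = rng.standard_normal((3 * H, 3)) * 0.01   # input weights for [c_{t-1}, f_{t-1}, c_t]
O1 = rng.standard_normal((half, half)) * 0.01
O2 = rng.standard_normal((256, half)) * 0.01
O3 = rng.standard_normal((half, half)) * 0.01
O4 = rng.standard_normal((256, half)) * 0.01

# Mask: the current coarse sample c_t (third input column) may only reach the
# fine half of each gate, so its rows into the coarse halves are zeroed out.
mask = np.ones_like(I)
for g in range(3):                            # gates u, r, e
    mask[g * H : g * H + half, 2] = 0.0
I_masked = I * mask

def wavernn_step(h_prev, c_prev, f_prev, c_cur):
    """One step of the GRU-variant WaveRNN cell; scalar inputs scaled to [-1, 1]."""
    x = np.array([c_prev, f_prev, c_cur])
    gh = R @ h_prev                           # one fused matrix-vector product
    gx = I_masked @ x
    u = sigmoid(gh[0:H] + gx[0:H])
    r = sigmoid(gh[H:2 * H] + gx[H:2 * H])
    e = np.tanh(r * gh[2 * H:3 * H] + gx[2 * H:3 * H])
    h = u * h_prev + (1.0 - u) * e
    y_c, y_f = h[:half], h[half:]
    P_c = softmax(O2 @ np.maximum(O1 @ y_c, 0.0))  # coarse 8-bit distribution
    P_f = softmax(O4 @ np.maximum(O3 @ y_f, 0.0))  # fine 8-bit distribution
    return h, P_c, P_f
```

Note that, because of the mask, $P(\mathbf{c}_t)$ does not depend on the `c_cur` argument at all; this is what lets sampling proceed coarse-first, as sketched after the next paragraph.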

Each part feeds into a softmax layer over the corresponding 8 bits, and the prediction of the 8 fine bits is conditioned on the 8 coarse bits. The resulting Dual Softmax layer allows for efficient prediction of 16-bit samples using two small output spaces ($2^8$ values each) instead of a single large output space (with $2^{16}$ values).
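As a sketch of how the two softmaxes yield one 16-bit sample, the loop below reuses `wavernn_step` from above: it evaluates the cell once with a dummy `c_cur` to obtain the coarse distribution (valid because the mask removes any dependence on it), samples the coarse byte, feeds it back to obtain the fine distribution, and recombines the two bytes. The scaling constants are assumptions matching the $[0, 255] \to [-1, 1]$ encoding described above.

```python
def sample_one(h_prev, c_prev, f_prev, rng):
    """Draw one 16-bit sample via the dual softmax (builds on wavernn_step)."""
    # 1) Coarse pass: P(c_t) ignores c_cur due to the mask, so pass a dummy 0.0.
    _, P_c, _ = wavernn_step(h_prev, c_prev, f_prev, 0.0)
    c = rng.choice(256, p=P_c)                # high (coarse) 8 bits
    c_scaled = c / 127.5 - 1.0                # byte -> [-1, 1]
    # 2) Fine pass: condition the fine bits on the sampled coarse byte.
    h, _, P_f = wavernn_step(h_prev, c_prev, f_prev, c_scaled)
    f = rng.choice(256, p=P_f)                # low (fine) 8 bits
    audio = (c * 256 + f) / 32767.5 - 1.0     # 16-bit sample scaled to [-1, 1]
    return h, c_scaled, f / 127.5 - 1.0, audio

# Usage: generate a few samples autoregressively from a zero state.
h, c_in, f_in = np.zeros(H), 0.0, 0.0
for _ in range(5):
    h, c_in, f_in, audio = sample_one(h, c_in, f_in, rng)
```

The two-pass evaluation is what the split state buys: the coarse half of $\mathbf{h}_t$ is identical in both passes, so only the fine half needs the sampled coarse value before it can be computed.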