**MelGAN** is a non-autoregressive, fully convolutional feed-forward architecture for audio waveform generation in a [GAN](https://paperswithcode.com/method/gan) setup. The generator takes a mel-spectrogram $s$ as input and produces a raw waveform $x$ as output. Since the mel-spectrogram is at a 256× lower temporal resolution than the waveform, the authors use a stack of transposed convolutional layers to upsample the input sequence. Each transposed convolutional layer is followed by a stack of residual blocks with dilated convolutions. Unlike traditional GANs, the MelGAN generator does not use a global noise vector as input.
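
The 256× upsampling can be checked with the standard output-length formula for a 1-D transposed convolution. The following is a minimal sketch, not the authors' code: the per-layer strides `(8, 8, 2, 2)`, the choice `kernel_size = 2 * stride`, and `padding = stride // 2` are assumptions made for illustration (the text above only states the overall 256× factor), and the function names are hypothetical.

```python
def transposed_conv1d_length(length, stride, kernel_size, padding, output_padding=0):
    """Output length of a 1-D transposed convolution with dilation 1."""
    return (length - 1) * stride - 2 * padding + kernel_size + output_padding


def upsampled_length(n_frames, strides=(8, 8, 2, 2)):
    """Push a frame count through a stack of transposed convolutions.

    Assumption: each layer uses kernel_size = 2 * stride and
    padding = stride // 2, so an even stride s maps length L to L * s.
    """
    length = n_frames
    for stride in strides:
        length = transposed_conv1d_length(
            length, stride, kernel_size=2 * stride, padding=stride // 2
        )
    return length


n_frames = 40  # hypothetical number of mel-spectrogram frames
print(upsampled_length(n_frames), n_frames * 256)
```

Under these assumptions each layer multiplies the sequence length exactly by its stride, so the product of the strides (8 · 8 · 2 · 2 = 256) gives the overall upsampling factor.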

To deal with 'checkerboard artifacts' in audio, MelGAN chooses the kernel size of each transposed convolution to be a multiple of its stride, instead of using [PhaseShuffle](https://paperswithcode.com/method/phase-shuffle).
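
One common explanation for checkerboard artifacts is uneven overlap: in a transposed convolution, some output positions receive contributions from more kernel taps than others. The sketch below (an illustration under that assumption, with hypothetical function names) counts how many taps land on each output sample and shows that the interior counts become uniform when the kernel size is a multiple of the stride.

```python
def overlap_counts(n_inputs, stride, kernel_size):
    """Count how many (input, kernel tap) pairs write to each output sample
    of an unpadded 1-D transposed convolution."""
    out_len = (n_inputs - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(n_inputs):
        for j in range(kernel_size):
            counts[i * stride + j] += 1
    return counts


def interior_counts(n_inputs, stride, kernel_size):
    """Distinct overlap counts away from the borders, where edge effects
    would otherwise reduce the number of contributions."""
    counts = overlap_counts(n_inputs, stride, kernel_size)
    return set(counts[kernel_size:-kernel_size])


# kernel size is a multiple of the stride: uniform overlap
print(interior_counts(n_inputs=8, stride=2, kernel_size=4))
# kernel size is not a multiple of the stride: alternating overlap
print(interior_counts(n_inputs=8, stride=2, kernel_size=3))
```

With stride 2 and kernel size 3, even-indexed output samples receive two contributions and odd-indexed samples receive one, which is the periodic pattern that a kernel size of 4 avoids.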

[Weight normalization](https://paperswithcode.com/method/weight-normalization) is used for normalization. The discriminator is a [window-based discriminator](https://paperswithcode.com/method/window-based-discriminator), similar to a [PatchGAN](https://paperswithcode.com/method/patchgan).
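
For reference, weight normalization reparameterizes a weight vector as $\mathbf{w} = \frac{g}{\lVert \mathbf{v} \rVert} \mathbf{v}$, separating its direction $\mathbf{v}$ from its learned magnitude $g$. The following is a minimal sketch of that reparameterization for a single weight vector; it does not reflect how MelGAN applies it per layer, and the function name is hypothetical.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g regardless of ||v||."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    scale = g / norm
    return [scale * vi for vi in v]


w = weight_norm([3.0, 4.0], g=10.0)
print(w, math.sqrt(sum(wi * wi for wi in w)))
```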

**Reformer** is a [Transformer](https://paperswithcode.com/method/transformer)-based architecture designed to improve efficiency. [Dot-product attention](https://paperswithcode.com/method/dot-product-attention) is replaced by an attention mechanism based on locality-sensitive hashing, which reduces its complexity from $O(L^2)$ to $O(L\log L)$, where $L$ is the length of the sequence. Furthermore, Reformer uses reversible residual layers instead of standard residual connections, which allows activations to be stored only once during training instead of $N$ times, where $N$ is the number of layers.
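
Both ideas can be sketched in a few lines of plain Python. The code below is an illustration under stated assumptions, not the Reformer implementation: `lsh_bucket` uses an angular hashing scheme (project onto random directions, then take the argmax over the projections and their negations) so that vectors pointing in similar directions tend to share a bucket, and `rev_forward` / `rev_inverse` show how a reversible residual block lets the inputs be recomputed from the outputs instead of being stored. The functions `f` and `g` stand in for the attention and feed-forward sublayers; all names are hypothetical.

```python
import random


def make_rotations(dim, n_buckets, seed=0):
    """Random projection matrix of shape [dim, n_buckets // 2].
    n_buckets is assumed to be even."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(dim)]


def lsh_bucket(x, rotations):
    """Angular LSH: bucket id is the argmax over [xR, -xR]."""
    half = len(rotations[0])
    proj = [sum(x[d] * rotations[d][j] for d in range(len(x))) for j in range(half)]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)


def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recover the block inputs from its outputs by undoing each step."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2


rotations = make_rotations(dim=4, n_buckets=8)
print(lsh_bucket([0.3, -1.2, 0.7, 2.0], rotations))

f = lambda v: 0.5 * v + 1.0  # stand-in for an attention sublayer
g = lambda v: v * v          # stand-in for a feed-forward sublayer
y1, y2 = rev_forward(1.5, -2.0, f, g)
print(rev_inverse(y1, y2, f, g))
```

Because attention only needs to be computed among positions that fall into the same bucket, the quadratic all-pairs comparison is avoided; and because each block is invertible, activations of earlier layers can be reconstructed during the backward pass rather than kept in memory.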

Reformer: The Efficient Transformer

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

**Spatial Feature Transform**, or **SFT**, is a layer that generates affine transformation parameters for spatial-wise feature modulation, and was originally proposed within the context of image super-resolution (SR). An SFT layer learns a mapping function $\mathcal{M}$ that outputs a modulation parameter pair $(\mathbf{\gamma}, \mathbf{\beta})$ based on some prior condition $\Psi$. The learned parameter pair adaptively influences the outputs by applying an affine transformation spatially to each intermediate feature map in an SR network. During testing, only a single forward pass is needed to generate the high-resolution (HR) image given the low-resolution (LR) input and segmentation probability maps.

More precisely, the prior $\Psi$ is modeled by a pair of affine transformation parameters $(\mathbf{\gamma}, \mathbf{\beta})$ through a mapping function $\mathcal{M}: \Psi \mapsto(\mathbf{\gamma}, \mathbf{\beta})$. Consequently,

$$
\hat{\mathbf{y}}=G_{\mathbf{\theta}}(\mathbf{x} \mid \mathbf{\gamma}, \mathbf{\beta}), \quad(\mathbf{\gamma}, \mathbf{\beta})=\mathcal{M}(\Psi)
$$

After obtaining $(\mathbf{\gamma}, \mathbf{\beta})$ from conditions, the transformation is carried out by scaling and shifting feature maps of a specific layer:

$$
\operatorname{SFT}(\mathbf{F} \mid \mathbf{\gamma}, \mathbf{\beta})=\mathbf{\gamma} \odot \mathbf{F}+\mathbf{\beta}
$$

where $\mathbf{F}$ denotes the feature maps, whose dimensions are the same as those of $\mathbf{\gamma}$ and $\mathbf{\beta}$, and $\odot$ denotes element-wise multiplication, i.e., the Hadamard product. Since the spatial dimensions are preserved, the SFT layer performs not only feature-wise manipulation but also spatial-wise transformation.
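
The modulation $\mathbf{\gamma} \odot \mathbf{F}+\mathbf{\beta}$ can be illustrated on a single-channel feature map stored as nested lists. This is a minimal sketch: `toy_mapping` is a hypothetical per-pixel linear stand-in for the learned mapping $\mathcal{M}$ (which in practice would be a small network), and its coefficients are arbitrary illustrative values.

```python
def sft(features, gamma, beta):
    """Apply SFT(F | gamma, beta) = gamma * F + beta element-wise.
    All three arguments are H x W nested lists of the same shape."""
    return [
        [g * f + b for f, g, b in zip(f_row, g_row, b_row)]
        for f_row, g_row, b_row in zip(features, gamma, beta)
    ]


def toy_mapping(psi, scale=0.5, shift=0.25):
    """Hypothetical stand-in for M: maps a condition map psi to (gamma, beta)
    with a fixed per-pixel linear rule instead of a learned network."""
    gamma = [[1.0 + scale * p for p in row] for row in psi]
    beta = [[shift * p for p in row] for row in psi]
    return gamma, beta


features = [[1.0, 2.0], [3.0, 4.0]]
psi = [[0.0, 1.0], [1.0, 0.0]]  # e.g. a segmentation probability map
gamma, beta = toy_mapping(psi)
print(sft(features, gamma, beta))
```

Because $\mathbf{\gamma}$ and $\mathbf{\beta}$ vary per spatial location, pixels where the condition is zero pass through unchanged in this toy example, while the others are scaled and shifted.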

Source	MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis
Year	2019
Data Source	CC BY-SA - https://paperswithcode.com