
What is: WaveGAN?

Source: Adversarial Audio Synthesis
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

WaveGAN is a generative adversarial network for unsupervised synthesis of raw-waveform audio (as opposed to image-like spectrograms).

The WaveGAN architecture is based on DCGAN. The DCGAN generator uses the transposed convolution operation to iteratively upsample low-resolution feature maps into a high-resolution image. WaveGAN modifies this transposed convolution operation to widen its receptive field, using longer one-dimensional filters of length 25 instead of two-dimensional filters of size 5x5, and upsampling by a factor of 4 instead of 2 at each layer. The discriminator is modified in a similar way, using length-25 filters in one dimension and increasing the stride from 2 to 4. These changes result in WaveGAN having the same number of parameters, numerical operations, and output dimensionality as DCGAN. One additional layer is appended so that the model outputs 16384 audio samples, slightly more than one second of audio at 16 kHz. In summary, the modifications to DCGAN are (a minimal code sketch follows the list):

  1. Flattening 2D convolutions into 1D (e.g. 5x5 2D conv becomes length-25 1D).
  2. Increasing the stride factor for all convolutions (e.g. stride 2x2 becomes stride 4).
  3. Removing batch normalization from the generator and discriminator.
  4. Training using the WGAN-GP strategy.
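
To make these modifications concrete, below is a minimal PyTorch sketch of a WaveGAN-style generator and discriminator, not the authors' reference implementation. The channel width (`model_dim`), latent dimension, and padding choices are illustrative assumptions, picked so that five stride-4, length-25 layers map a 16-sample feature map to 16384 output samples; the critic head returns an unbounded score because training follows the WGAN-GP objective.

```python
# Minimal sketch of WaveGAN-style networks; hyperparameters are assumptions.
import torch
import torch.nn as nn


class WaveGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, model_dim=64):
        super().__init__()
        self.model_dim = model_dim
        # Project and reshape the latent vector into a short, wide feature map.
        self.fc = nn.Linear(latent_dim, 16 * 16 * model_dim)

        def up(in_ch, out_ch):
            # Length-25 1D filters with stride 4 (vs. 5x5 filters / stride 2 in DCGAN).
            return nn.ConvTranspose1d(in_ch, out_ch, kernel_size=25, stride=4,
                                      padding=11, output_padding=1)

        # Five upsampling layers (one more than DCGAN's four) take the signal
        # from 16 to 16384 samples; no batch normalization is used.
        self.net = nn.Sequential(
            up(16 * model_dim, 8 * model_dim), nn.ReLU(),
            up(8 * model_dim, 4 * model_dim), nn.ReLU(),
            up(4 * model_dim, 2 * model_dim), nn.ReLU(),
            up(2 * model_dim, model_dim), nn.ReLU(),
            up(model_dim, 1), nn.Tanh(),  # raw waveform in [-1, 1]
        )

    def forward(self, z):
        x = torch.relu(self.fc(z)).view(-1, 16 * self.model_dim, 16)
        return self.net(x)  # (batch, 1, 16384)


class WaveGANDiscriminator(nn.Module):
    def __init__(self, model_dim=64):
        super().__init__()

        def down(in_ch, out_ch):
            # Mirrors the generator: length-25 filters, stride increased from 2 to 4.
            return nn.Conv1d(in_ch, out_ch, kernel_size=25, stride=4, padding=11)

        self.net = nn.Sequential(
            down(1, model_dim), nn.LeakyReLU(0.2),
            down(model_dim, 2 * model_dim), nn.LeakyReLU(0.2),
            down(2 * model_dim, 4 * model_dim), nn.LeakyReLU(0.2),
            down(4 * model_dim, 8 * model_dim), nn.LeakyReLU(0.2),
            down(8 * model_dim, 16 * model_dim), nn.LeakyReLU(0.2),
        )
        # WGAN-GP critic head: a single unbounded score, no sigmoid.
        self.fc = nn.Linear(16 * model_dim * 16, 1)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))


if __name__ == "__main__":
    z = torch.rand(8, 100) * 2 - 1         # latent vectors in [-1, 1]
    audio = WaveGANGenerator()(z)          # ~1 second of 16 kHz audio per sample
    score = WaveGANDiscriminator()(audio)  # critic scores for WGAN-GP training
    print(audio.shape, score.shape)        # (8, 1, 16384) (8, 1)
```

The flattened 1D convolutions and larger stride keep the parameter count and operation count in line with DCGAN while covering a much wider receptive field per layer, which suits the long-range structure of raw audio.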