What is: DVD-GAN?

Source: Adversarial Video Generation on Complex Datasets
Year: 2019
Data Source: CC BY-SA - https://paperswithcode.com

DVD-GAN is a generative adversarial network for video generation built upon the BigGAN architecture.

DVD-GAN uses two discriminators: a Spatial Discriminator $\mathcal{D}_S$ and a Temporal Discriminator $\mathcal{D}_T$. $\mathcal{D}_S$ critiques single-frame content and structure by randomly sampling $k$ full-resolution frames and judging them individually. The temporal discriminator $\mathcal{D}_T$ must provide $G$ with the learning signal to generate movement (not evaluated by $\mathcal{D}_S$).
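
The frame selection performed for the spatial discriminator can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `sample_frames`, the `(T, H, W, C)` layout, the clip size, and the choice to draw frames without replacement are all assumptions made for the example.

```python
import numpy as np


def sample_frames(video, k, rng):
    """Pick k frames from a video of shape (T, H, W, C).

    Assumption: frames are drawn uniformly and without replacement.
    """
    num_frames = video.shape[0]
    idx = np.sort(rng.choice(num_frames, size=k, replace=False))
    return video[idx], idx


rng = np.random.default_rng(0)
video = rng.standard_normal((48, 64, 64, 3))  # hypothetical 48-frame clip
frames, idx = sample_frames(video, k=8, rng=rng)
print(frames.shape)  # k full-resolution frames, each judged on its own
```

Because only the frame index is subsampled, each selected frame keeps its full spatial resolution.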

The input to $G$ consists of Gaussian latent noise $z \sim \mathcal{N}(0, I)$ and a learned linear embedding $e(y)$ of the desired class $y$. Both inputs are 120-dimensional vectors. $G$ starts by computing an affine transformation of $[z; e(y)]$ to a $[4, 4, ch_0]$-shaped tensor. $[z; e(y)]$ is also used as the input to all class-conditional Batch Normalization layers throughout $G$. The $[4, 4, ch_0]$-shaped tensor is then treated as the input (at each frame we would like to generate) to a Convolutional GRU.
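
A rough NumPy sketch of how the generator input described above might be assembled is given below. Only the 120-dimensional sizes and the 4 x 4 x ch_0 target shape come from the text; `NUM_CLASSES`, `CH0`, `NUM_FRAMES`, the random stand-ins for learned parameters, and the function name `generator_input` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 120   # from the text: z and e(y) are both 120-dimensional
NUM_CLASSES = 10   # hypothetical
CH0 = 16           # hypothetical value for ch_0
NUM_FRAMES = 12    # hypothetical

# These would be learned in a real model; random values stand in here.
embedding = rng.standard_normal((NUM_CLASSES, LATENT_DIM))
weight = rng.standard_normal((2 * LATENT_DIM, 4 * 4 * CH0)) * 0.01
bias = np.zeros(4 * 4 * CH0)


def generator_input(z, y):
    """Return the conditioning vector [z; e(y)] and the 4 x 4 x ch_0 seed."""
    cond = np.concatenate([z, embedding[y]])           # shape (240,)
    seed = (cond @ weight + bias).reshape(4, 4, CH0)   # affine map, then reshape
    return cond, seed


z = rng.standard_normal(LATENT_DIM)
cond, seed = generator_input(z, y=3)

# The same seed tensor is presented at every frame to be generated.
per_frame_input = np.broadcast_to(seed, (NUM_FRAMES,) + seed.shape)
print(cond.shape, seed.shape, per_frame_input.shape)
```

In this sketch `cond` is the vector that would also feed the class-conditional Batch Normalization layers, while `per_frame_input` plays the role of the Convolutional GRU input.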

The Convolutional GRU is unrolled once per frame, and its output is processed by two residual blocks. The time dimension is combined with the batch dimension here, so each frame proceeds through the blocks independently. The output of these blocks has doubled width and height (upsampling is skipped in the first block). This RNN-plus-residual group is repeated a number of times, with the output of one group fed as the input to the next, until the output tensors have the desired spatial dimensions.
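
The step of combining the time and batch dimensions can be illustrated with plain reshapes. The sketch below is not the original implementation: it assumes a `(B, T, H, W, C)` layout, and a nearest-neighbour `upsample_2x` stands in for the learned residual blocks purely to show the change in spatial size. All function names are hypothetical.

```python
import numpy as np


def fold_time_into_batch(x):
    """(B, T, H, W, C) -> (B*T, H, W, C): 2-D blocks then see frames independently."""
    b, t, h, w, c = x.shape
    return x.reshape(b * t, h, w, c)


def unfold_time_from_batch(x, batch_size):
    """(B*T, H, W, C) -> (B, T, H, W, C), restoring the time axis for the next RNN."""
    bt, h, w, c = x.shape
    return x.reshape(batch_size, bt // batch_size, h, w, c)


def upsample_2x(x):
    """Nearest-neighbour 2x upsampling over H and W (stand-in for the residual blocks)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


# Toy RNN output: batch of 2 videos, 3 frames each, 4 x 4 spatial, 1 channel.
rnn_out = np.arange(2 * 3 * 4 * 4 * 1, dtype=float).reshape(2, 3, 4, 4, 1)

frames = fold_time_into_batch(rnn_out)                  # (6, 4, 4, 1)
frames = upsample_2x(frames)                            # (6, 8, 8, 1)
next_in = unfold_time_from_batch(frames, batch_size=2)  # (2, 3, 8, 8, 1)
print(next_in.shape)
```

Repeating this fold, process, unfold pattern once per group doubles the resolution each time, for example 4, 8, 16, 32, 64 after four groups under these assumptions.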

The spatial discriminator $\mathcal{D}_S$ functions almost identically to BigGAN’s discriminator. A score is calculated for each of the uniformly sampled $k$ frames (default $k = 8$) and the $\mathcal{D}_S$ output is the sum over per-frame scores. The temporal discriminator $\mathcal{D}_T$ has a similar architecture, but pre-processes the real or generated video with a $2 \times 2$ average-pooling downsampling function $\phi$. Furthermore, the first two residual blocks of $\mathcal{D}_T$ are 3-D, where every convolution is replaced with a 3-D convolution with a kernel size of $3 \times 3 \times 3$. The rest of the architecture follows BigGAN.
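
Two of the operations described above lend themselves to a short NumPy sketch: the 2 x 2 average-pooling function applied before the temporal discriminator, and the summation of per-frame scores in the spatial discriminator. The names `avg_pool_2x2`, `spatial_discriminator_output`, and `toy_score` are hypothetical, and `toy_score` is only a placeholder for the BigGAN-style scoring network.

```python
import numpy as np


def avg_pool_2x2(video):
    """2 x 2 average pooling over H and W of a (T, H, W, C) video."""
    t, h, w, c = video.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split H and W into (blocks, 2) and average inside each 2 x 2 block.
    return video.reshape(t, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))


def spatial_discriminator_output(frames, score_frame):
    """Sum of the scores assigned to each sampled frame."""
    return sum(score_frame(frame) for frame in frames)


def toy_score(frame):
    """Placeholder scorer; a real model would use a discriminator network."""
    return float(frame.mean())


video = np.arange(2 * 4 * 4 * 1, dtype=float).reshape(2, 4, 4, 1)
pooled = avg_pool_2x2(video)
print(pooled.shape)  # spatial size is halved, frame count is unchanged

sampled = np.ones((8, 4, 4, 3))  # k = 8 hypothetical sampled frames
total = spatial_discriminator_output(sampled, toy_score)
print(total)
```

Pooling reduces only the spatial resolution, so the temporal discriminator still sees every frame of the clip, while the spatial discriminator sees a few frames at full resolution.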