
What is: Dilated Sliding Window Attention?

Source: Longformer: The Long-Document Transformer
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

Dilated Sliding Window Attention is an attention pattern for attention-based models. It was proposed as part of the Longformer architecture. It is motivated by the fact that non-sparse attention in the original Transformer formulation has a self-attention component with $O(n^2)$ time and memory complexity, where $n$ is the input sequence length, and is therefore not efficient to scale to long inputs.

Compared to a plain Sliding Window Attention pattern, we can further increase the receptive field without increasing computation by making the sliding window "dilated". This is analogous to dilated CNNs, where the window has gaps of size dilation $d$. Assuming a fixed dilation $d$ and window size $w$ for all $l$ layers, the receptive field is $l \times d \times w$, which can reach tens of thousands of tokens even for small values of $d$.
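
The sketch below illustrates the pattern as a boolean attention mask, assuming a symmetric window of $w$ positions centred on each query and a fixed dilation $d$. The function name and the dense mask construction are illustrative only; the actual Longformer implementation relies on custom banded-matrix kernels rather than a full $n \times n$ mask.

```python
import torch

def dilated_sliding_window_mask(seq_len: int, window: int, dilation: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len): True where attention is allowed.

    Each query position attends to `window` key positions spaced `dilation`
    apart (including itself), i.e. a sliding window with gaps of size d.
    """
    idx = torch.arange(seq_len)
    # Relative offset between every query (rows) and key (columns) position.
    offset = idx[None, :] - idx[:, None]
    half = window // 2
    # Allowed keys: within +/- half*dilation of the query and aligned to the dilation grid.
    in_window = offset.abs() <= half * dilation
    on_grid = (offset % dilation) == 0
    return in_window & on_grid

# Example: 16 tokens, window of 5 (2 on each side), dilation of 2.
mask = dilated_sliding_window_mask(seq_len=16, window=5, dilation=2)
print(mask.int())
```

Stacking $l$ such layers grows the receptive field multiplicatively: with, say, $l = 12$ layers, $w = 512$, and $d = 2$, a token can indirectly attend to $12 \times 2 \times 512 = 12{,}288$ positions.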