
What is: SortCut Sinkhorn Attention?

Source: Sparse Sinkhorn Attention
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

SortCut Sinkhorn Attention is a variant of Sparse Sinkhorn Attention in which the input sequence is truncated after sorting, essentially performing a hard top-k operation on the input sequence blocks within the computational graph. Whereas most attention models only re-weight tokens or assign them near-zero weights during training, this allows the input sequence to be explicitly and dynamically truncated. Specifically:

$$Y = \text{Softmax}\left(Q\,\psi_{S}\left(K\right)^{T}_{\left[:n\right]}\right)\psi_{S}\left(V\right)_{\left[:n\right]}$$

where $n$ is the SortCut budget hyperparameter and $\psi_{S}$ is the sorting network applied to the keys and values.
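The truncation can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the learned sorting network $\psi_{S}$ is replaced here by an assumed stand-in score (the norm of each block's mean key), and the `block_size` and `n_blocks` parameters are hypothetical names for the blocking and budget settings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sortcut_attention(Q, K, V, block_size, n_blocks):
    """Sketch of SortCut-style attention: rank key/value blocks,
    keep only the top-n blocks (hard truncation), then attend."""
    L, d = K.shape
    num_blocks = L // block_size
    Kb = K.reshape(num_blocks, block_size, d)
    Vb = V.reshape(num_blocks, block_size, d)
    # Stand-in relevance score per block; the paper uses a learned
    # sorting network psi_S instead of this heuristic.
    scores = np.linalg.norm(Kb.mean(axis=1), axis=-1)
    keep = np.argsort(-scores)[:n_blocks]   # hard top-n truncation [:n]
    K_kept = Kb[keep].reshape(-1, d)
    V_kept = Vb[keep].reshape(-1, d)
    # Standard scaled dot-product attention over the kept blocks only.
    A = softmax(Q @ K_kept.T / np.sqrt(d))
    return A @ V_kept

rng = np.random.default_rng(0)
Q = rng.standard_normal((8, 16))
K = rng.standard_normal((32, 16))
V = rng.standard_normal((32, 16))
Y = sortcut_attention(Q, K, V, block_size=8, n_blocks=2)
```

Because only `n_blocks * block_size` keys and values survive the truncation, the attention matrix shrinks from `8 x 32` to `8 x 16` in this example, which is the source of the method's compute savings.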