
What is: Temporal Adaptive Module?

Source: TAM: Temporal Adaptive Module for Video Recognition
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR.

TAM has two branches: a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C\times T\times H\times W}$, global spatial average pooling $\text{GAP}$ is first applied to the feature map to keep TAM's computational cost low. The local branch then employs several 1D convolutions with ReLU nonlinearities across the temporal domain to produce location-sensitive importance maps that enhance the frame-wise features. The local branch can be written as

\begin{align} s &= \sigma(\text{Conv1D}(\delta(\text{Conv1D}(\text{GAP}(X))))) \\ X^1 &= s \cdot X \end{align}

where $\delta$ denotes the ReLU function and $\sigma$ the sigmoid function. Unlike the local branch, the global branch is location-invariant and focuses on generating a channel-wise adaptive kernel based on the global temporal information in each channel. For the $c$-th channel, the kernel can be written as

\begin{align} \Theta_c = \text{Softmax}(\text{FC}_2(\delta(\text{FC}_1(\text{GAP}(X)_c)))) \end{align}

where $\Theta_c \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^1$ in the temporal domain:

\begin{align} Y = \Theta \otimes X^1 \end{align}
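To make the two branches concrete, here is a minimal PyTorch sketch of TAM under a few stated assumptions: the input arrives as an $(N, C, T, H, W)$ tensor, the local branch uses an illustrative channel-reduction ratio of 4, the global branch's hidden width is $2T$, and the adaptive convolution $Y = \Theta \otimes X^1$ is realized as a grouped convolution with one kernel per (sample, channel) pair. These hyperparameters are for illustration, not the authors' reference configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TAM(nn.Module):
    """Minimal sketch of the Temporal Adaptive Module.

    Assumes input of shape (N, C, T, H, W); the reduction ratio and the
    global branch's hidden width are illustrative choices.
    """

    def __init__(self, channels: int, n_frames: int,
                 kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        assert kernel_size % 2 == 1, "odd K keeps the temporal length"
        self.kernel_size = kernel_size
        # Local branch: Conv1D -> ReLU -> Conv1D -> sigmoid over the
        # temporal axis, yielding the importance map s in (0, 1).
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Global branch: FC_1 -> ReLU -> FC_2 -> softmax, mapping each
        # channel's temporal profile (length T) to a size-K kernel.
        self.glob = nn.Sequential(
            nn.Linear(n_frames, 2 * n_frames),
            nn.ReLU(inplace=True),
            nn.Linear(2 * n_frames, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # GAP over the spatial dimensions: (N, C, T, H, W) -> (N, C, T).
        g = x.mean(dim=(3, 4))

        # Local branch: X^1 = s * X, with s broadcast over H and W.
        s = self.local(g)
        x1 = x * s.view(n, c, t, 1, 1)

        # Global branch: one adaptive kernel Theta_c per channel of
        # every sample, normalized by the softmax.
        theta = self.glob(g.reshape(n * c, t))        # (N*C, K)

        # Y = Theta (x) X^1: depthwise temporal convolution in which
        # each (sample, channel) pair uses its own kernel, realized as
        # a grouped 2D convolution over the (T, H*W) plane.
        x1 = x1.reshape(1, n * c, t, h * w)
        kernel = theta.view(n * c, 1, self.kernel_size, 1)
        y = F.conv2d(x1, kernel,
                     padding=(self.kernel_size // 2, 0), groups=n * c)
        return y.view(n, c, t, h, w)
```

For example, `TAM(channels=64, n_frames=8)` maps a `(2, 64, 8, 14, 14)` clip tensor to an output of the same shape, so it can sit between existing layers without changing their interfaces.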

With the help of the local and global branches, TAM can capture complex temporal structures in videos and enhance per-frame features at low computational cost. Thanks to its flexible and lightweight design, TAM can be plugged into any existing 2D CNN.
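As a sketch of how that insertion might look, the hypothetical wrapper below applies the `TAM` module from the previous snippet after an arbitrary 2D block. It assumes the common convention of flattening the clip into the batch dimension, so the surrounding 2D layers see `(N*T, C, H, W)` tensors and are left untouched.

```python
import torch.nn as nn


class BlockWithTAM(nn.Module):
    """Hypothetical wrapper: runs an existing 2D block, then TAM.

    Assumes frames are flattened into the batch as (N*T, C, H, W),
    the usual layout when a 2D CNN processes videos frame by frame.
    """

    def __init__(self, block: nn.Module, channels: int, n_frames: int):
        super().__init__()
        self.block = block
        self.n_frames = n_frames
        self.tam = TAM(channels, n_frames)  # sketched above

    def forward(self, x):
        x = self.block(x)                              # (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // self.n_frames
        # Regroup the frames into a temporal axis for TAM...
        x = x.view(n, self.n_frames, c, h, w).permute(0, 2, 1, 3, 4)
        x = self.tam(x)                                # (N, C, T, H, W)
        # ...then flatten back so downstream 2D layers are unchanged.
        return x.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)


# Usage: wrap any 2D block, e.g. a stand-in convolution.
block = BlockWithTAM(nn.Conv2d(64, 64, 3, padding=1),
                     channels=64, n_frames=8)
```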