
What is: Temporal Adaptive Module?

Source: TAM: Temporal Adaptive Module for Video Recognition
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

TAM is designed to capture complex temporal relationships both efficiently and flexibly. It adopts an adaptive kernel instead of self-attention to capture global contextual information, with lower time complexity than GLTR.

TAM has two branches: a local branch and a global branch. Given the input feature map $X \in \mathbb{R}^{C\times T\times H\times W}$, global spatial average pooling $\text{GAP}$ is first applied to the feature map to keep TAM's computational cost low. The local branch then employs several 1D convolutions with ReLU nonlinearities across the temporal domain to produce location-sensitive importance maps that enhance the frame-wise features. The local branch can be written as

\begin{align} s &= \sigma(\text{Conv1D}(\delta(\text{Conv1D}(\text{GAP}(X))))) \\ X^1 &= s \cdot X \end{align}

where $\delta$ denotes the ReLU function and $\sigma$ the sigmoid function. Unlike the local branch, the global branch is location-invariant and focuses on generating a channel-wise adaptive kernel based on the global temporal information in each channel. For the $c$-th channel, the kernel can be written as

\begin{align} \Theta_c = \text{Softmax}(\text{FC}_2(\delta(\text{FC}_1(\text{GAP}(X)_c)))) \end{align}

where $\Theta_c \in \mathbb{R}^{K}$ and $K$ is the adaptive kernel size. Finally, TAM convolves the adaptive kernel $\Theta$ with $X^1$ in the temporal domain:

\begin{align} Y = \Theta \otimes X^1 \end{align}
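To make the two branches concrete, here is a minimal PyTorch sketch of TAM under a few stated assumptions: the input arrives as an $(N, C, T, H, W)$ tensor, the local branch uses an illustrative channel-reduction ratio of 4, the global branch's hidden width is $2T$, and the adaptive convolution $Y = \Theta \otimes X^1$ is realized as a grouped convolution with one kernel per (sample, channel) pair. These hyperparameters are for illustration, not the authors' reference configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TAM(nn.Module):
    """Minimal sketch of the Temporal Adaptive Module.

    Assumes input of shape (N, C, T, H, W); the reduction ratio and the
    global branch's hidden width are illustrative choices.
    """

    def __init__(self, channels: int, n_frames: int,
                 kernel_size: int = 3, reduction: int = 4):
        super().__init__()
        assert kernel_size % 2 == 1, "odd K keeps the temporal length"
        self.kernel_size = kernel_size
        # Local branch: Conv1D -> ReLU -> Conv1D -> sigmoid over the
        # temporal axis, yielding the importance map s in (0, 1).
        self.local = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 3, padding=1),
            nn.Sigmoid(),
        )
        # Global branch: FC_1 -> ReLU -> FC_2 -> softmax, mapping each
        # channel's temporal profile (length T) to a size-K kernel.
        self.glob = nn.Sequential(
            nn.Linear(n_frames, 2 * n_frames),
            nn.ReLU(inplace=True),
            nn.Linear(2 * n_frames, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # GAP over the spatial dimensions: (N, C, T, H, W) -> (N, C, T).
        g = x.mean(dim=(3, 4))

        # Local branch: X^1 = s * X, with s broadcast over H and W.
        s = self.local(g)
        x1 = x * s.view(n, c, t, 1, 1)

        # Global branch: one adaptive kernel Theta_c per channel of
        # every sample, normalized by the softmax.
        theta = self.glob(g.reshape(n * c, t))        # (N*C, K)

        # Y = Theta (x) X^1: depthwise temporal convolution in which
        # each (sample, channel) pair uses its own kernel, realized as
        # a grouped 2D convolution over the (T, H*W) plane.
        x1 = x1.reshape(1, n * c, t, h * w)
        kernel = theta.view(n * c, 1, self.kernel_size, 1)
        y = F.conv2d(x1, kernel,
                     padding=(self.kernel_size // 2, 0), groups=n * c)
        return y.view(n, c, t, h, w)
```

For example, `TAM(channels=64, n_frames=8)` maps a `(2, 64, 8, 14, 14)` clip tensor to an output of the same shape, so it can sit between existing layers without changing their interfaces.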

With the help of the local and global branches, TAM can capture complex temporal structures in videos and enhance per-frame features at low computational cost. Thanks to its flexible and lightweight design, TAM can be plugged into any existing 2D CNN.
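As a sketch of how that insertion might look, the hypothetical wrapper below applies the `TAM` module from the previous snippet after an arbitrary 2D block. It assumes the common convention of flattening the clip into the batch dimension, so the surrounding 2D layers see `(N*T, C, H, W)` tensors and are left untouched.

```python
import torch.nn as nn


class BlockWithTAM(nn.Module):
    """Hypothetical wrapper: runs an existing 2D block, then TAM.

    Assumes frames are flattened into the batch as (N*T, C, H, W),
    the usual layout when a 2D CNN processes videos frame by frame.
    """

    def __init__(self, block: nn.Module, channels: int, n_frames: int):
        super().__init__()
        self.block = block
        self.n_frames = n_frames
        self.tam = TAM(channels, n_frames)  # sketched above

    def forward(self, x):
        x = self.block(x)                              # (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // self.n_frames
        # Regroup the frames into a temporal axis for TAM...
        x = x.view(n, self.n_frames, c, h, w).permute(0, 2, 1, 3, 4)
        x = self.tam(x)                                # (N, C, T, H, W)
        # ...then flatten back so downstream 2D layers are unchanged.
        return x.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)


# Usage: wrap any 2D block, e.g. a stand-in convolution.
block = BlockWithTAM(nn.Conv2d(64, 64, 3, padding=1),
                     channels=64, n_frames=8)
```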