
What is: Spatio-Temporal Attention LSTM?

Source: An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

In human action recognition, each type of action generally depends on only a few specific kinematic joints. Furthermore, multiple actions may be performed over time. Motivated by these observations, Song et al. proposed a joint spatial and temporal attention network based on LSTM to adaptively find discriminative features and keyframes. Its main attention-related components are a spatial attention sub-network, which selects important joints, and a temporal attention sub-network, which selects key frames. The spatial attention sub-network can be written as:

\begin{align}
s_{t} &= U_{s}\tanh(W_{xs}X_{t} + W_{hs}h_{t-1}^{s} + b_{si}) + b_{so} \\
\alpha_{t} &= \text{Softmax}(s_{t}) \\
Y_{t} &= \alpha_{t} X_{t}
\end{align}

where $X_{t}$ is the input feature at time $t$; $U_{s}$, $W_{xs}$, $W_{hs}$, $b_{si}$, and $b_{so}$ are learnable parameters; and $h_{t-1}^{s}$ is the hidden state of the attention LSTM at step $t-1$. Note that the use of the hidden state $h$ means the attention process takes temporal relationships into consideration.
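As a concrete illustration, below is a minimal PyTorch-style sketch of this spatial attention step. The tensor shapes, layer sizes, and the folding of the bias terms $b_{si}$ and $b_{so}$ into the linear layers are assumptions made for illustration; the companion attention LSTM that produces $h_{t-1}^{s}$ is not shown.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Sketch of the spatial attention sub-network:
    s_t = U_s * tanh(W_xs x_t + W_hs h_prev + b_si) + b_so
    alpha_t = softmax(s_t);  y_t = alpha_t * x_t (per-joint reweighting).
    Shapes and sizes are illustrative assumptions, not the paper's exact configuration.
    """

    def __init__(self, num_joints: int, joint_dim: int, hidden_dim: int):
        super().__init__()
        self.w_xs = nn.Linear(num_joints * joint_dim, hidden_dim)   # W_xs (.) + b_si
        self.w_hs = nn.Linear(hidden_dim, hidden_dim, bias=False)   # W_hs h_{t-1}^s
        self.u_s = nn.Linear(hidden_dim, num_joints)                # U_s (.) + b_so

    def forward(self, x_t, h_prev):
        # x_t: (batch, num_joints, joint_dim) joint features at time t
        # h_prev: (batch, hidden_dim) hidden state of the attention LSTM at t-1
        s_t = self.u_s(torch.tanh(self.w_xs(x_t.flatten(1)) + self.w_hs(h_prev)))
        alpha_t = torch.softmax(s_t, dim=-1)        # one attention weight per joint
        y_t = alpha_t.unsqueeze(-1) * x_t           # reweighted joint features Y_t
        return y_t, alpha_t


# Example usage with assumed sizes (15 joints, 3-D coordinates, 128-d hidden state).
x_t = torch.randn(4, 15, 3)
h_prev = torch.zeros(4, 128)
y_t, alpha_t = SpatialAttention(15, 3, 128)(x_t, h_prev)
```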

The temporal attention sub-network is similar to the spatial branch and produces its attention map using:

\begin{align}
\beta_{t} = \delta(W_{xp}X_{t} + W_{hp}h_{t-1}^{p} + b_{p}),
\end{align}

where $\delta$ is the activation function, $W_{xp}$, $W_{hp}$, and $b_{p}$ are learnable parameters, and $h_{t-1}^{p}$ is the hidden state of the temporal attention LSTM at step $t-1$. This branch adopts a ReLU activation instead of a normalization function for ease of optimization, and it uses a regularized objective function to improve convergence.
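A corresponding sketch of the temporal gate is given below, again with assumed dimensions; the non-negative scalar $\beta_{t}$ produced here would scale how much frame $t$ contributes to the final prediction.

```python
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Sketch of the temporal attention sub-network:
    beta_t = ReLU(W_xp x_t + W_hp h_prev + b_p)
    Dimensions are illustrative assumptions; the temporal attention LSTM
    producing h_prev is not shown.
    """

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.w_xp = nn.Linear(in_dim, 1)                  # W_xp (.) + b_p
        self.w_hp = nn.Linear(hidden_dim, 1, bias=False)  # W_hp h_{t-1}^p

    def forward(self, x_t, h_prev):
        # x_t: (batch, in_dim) frame feature; h_prev: (batch, hidden_dim)
        beta_t = torch.relu(self.w_xp(x_t) + self.w_hp(h_prev))  # unnormalized gate >= 0
        return beta_t
```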

Overall, this paper presents a joint spatio-temporal attention method that focuses on the most informative joints and keyframes, achieving strong results on skeleton-based action recognition benchmarks.