
What is: ELMo?

Source: Deep contextualized word representations
Year: 2018
Data Source: CC BY-SA - https://paperswithcode.com

Embeddings from Language Models, or ELMo, is a type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.
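Concretely, the paper computes each ELMo vector as a task-specific weighted combination of the biLM's internal layers:

$$\mathbf{ELMo}^{task}_k = \gamma^{task} \sum_{j=0}^{L} s^{task}_j \, \mathbf{h}^{LM}_{k,j}$$

where $\mathbf{h}^{LM}_{k,0}$ is the context-independent token representation $\mathbf{x}_k$, $\mathbf{h}^{LM}_{k,j}$ (for $j \geq 1$) concatenates the forward and backward LSTM states at layer $j$, $s^{task}$ are softmax-normalized layer weights, and $\gamma^{task}$ is a scalar that scales the entire vector.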

A biLM combines a forward and a backward LM, and ELMo jointly maximizes the log likelihood of both directions. To add ELMo to a supervised model, we freeze the weights of the biLM, concatenate the ELMo vector $\mathbf{ELMo}^{task}_k$ with $\mathbf{x}_k$, and pass the ELMo-enhanced representation $[\mathbf{x}_k; \mathbf{ELMo}^{task}_k]$ into the task RNN. Here $\mathbf{x}_k$ is a context-independent token representation for each token position $k$.
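The joint biLM training objective from the paper maximizes

$$\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \ldots, t_{k-1};\, \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N;\, \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big)$$

where the token representation $\Theta_x$ and softmax parameters $\Theta_s$ are shared across directions, while each direction keeps its own LSTM parameters.

The snippet below is a minimal PyTorch sketch of the task-side combination step: the scalar mixture over frozen biLM layer outputs followed by concatenation with $\mathbf{x}_k$. All names, layer counts, and dimensions (`ELMoScalarMix`, `num_layers`, `dim`, etc.) are illustrative assumptions, not the paper's exact configuration or an official API.

```python
import torch
import torch.nn as nn

class ELMoScalarMix(nn.Module):
    """Task-specific weighted sum over frozen biLM layers (a sketch;
    sizes and layer count here are illustrative, not the paper's setup)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # s^task: softmax-normalized layer weights; gamma^task: global scale.
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq_len, dim) from the frozen biLM.
        s = torch.softmax(self.scalar_weights, dim=0)
        # Weighted sum over the layer axis -> (batch, seq_len, dim).
        return self.gamma * torch.einsum("l,lbsd->bsd", s, layer_states)

# Usage sketch: concatenate ELMo with the context-independent embeddings x_k
# and feed the enhanced representation [x_k; ELMo_k^task] into the task RNN.
num_layers, batch, seq_len, dim = 3, 2, 7, 1024     # illustrative sizes
layer_states = torch.randn(num_layers, batch, seq_len, dim)  # stand-in biLM output
x_k = torch.randn(batch, seq_len, 300)                       # stand-in token embeddings

mix = ELMoScalarMix(num_layers)
elmo_k = mix(layer_states)                       # (batch, seq_len, dim)
enhanced = torch.cat([x_k, elmo_k], dim=-1)      # [x_k; ELMo_k^task]
task_rnn = nn.LSTM(input_size=300 + dim, hidden_size=256, batch_first=True)
outputs, _ = task_rnn(enhanced)
```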
