**Co-Scale Conv-Attentional Image Transformer** (CoaT) is a [Transformer](https://paperswithcode.com/method/transformer)-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other. Second, the conv-attentional mechanism is designed by realizing a relative position embedding formulation in the factorized attention module with an efficient [convolution](https://paperswithcode.com/method/convolution)-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities.
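
The factorized attention mentioned above can be illustrated with a small NumPy sketch. The description does not give the formula, so this assumes the common "efficient attention" form, in which a softmax over the token axis of the keys is multiplied with the values first, producing a small channel-by-channel context matrix. The function names `softmax` and `factorized_attention` are hypothetical. The convolution-based relative position term is deliberately left out, so this is not the full conv-attentional module.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def factorized_attention(q, k, v):
    """Assumed factorized attention for N tokens with C channels.

    q, k, v: arrays of shape (N, C).
    Computing softmax(k)^T v first gives a (C, C) context matrix,
    so the cost grows linearly with N instead of quadratically.
    """
    n, c = q.shape
    context = softmax(k, axis=0).T @ v   # (C, C); softmax taken over tokens
    return (q / np.sqrt(c)) @ context    # (N, C)
```

The point of the ordering is that no N-by-N attention map is ever materialized, which matters for images where N is the number of spatial positions.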

A **Deep Boltzmann Machine (DBM)** is a three-layer generative model. It is similar to a [Deep Belief Network](https://paperswithcode.com/method/deep-belief-network), but instead allows bidirectional connections in the bottom layers. Its energy function is defined as an extension of the energy function of the RBM:

$$ E\left(v, h\right) = -\sum_{i} v_{i} b_{i} - \sum_{n=1}^{N}\sum_{k} h_{n,k} b_{n,k} - \sum_{i,k} v_{i} w_{ik} h_{1,k} - \sum_{n=1}^{N-1}\sum_{k,l} h_{n,k} w_{n,k,l} h_{n+1,l} $$

for a DBM with $N$ hidden layers.
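
As a rough illustration, the energy can be evaluated term by term for a single configuration. This is a minimal sketch, assuming the visible units connect only to the first hidden layer. The function name `dbm_energy` and its argument names are hypothetical and do not come from the source.

```python
import numpy as np


def dbm_energy(v, h, b_v, b_h, w_vh, w_hh):
    """Energy of one DBM configuration (illustrative sketch).

    v    : visible vector, shape (I,)
    h    : list of N hidden vectors, h[n] has shape (K_n,)
    b_v  : visible biases, shape (I,)
    b_h  : list of N hidden bias vectors
    w_vh : visible-to-first-hidden weights, shape (I, K_1)
    w_hh : list of N-1 matrices, w_hh[n] has shape (K_n, K_{n+1})
    """
    energy = -np.dot(v, b_v)                 # visible bias term
    for h_n, b_n in zip(h, b_h):             # hidden bias terms
        energy -= np.dot(h_n, b_n)
    energy -= v @ w_vh @ h[0]                # visible-hidden interaction
    for n in range(len(h) - 1):              # hidden-hidden interactions
        energy -= h[n] @ w_hh[n] @ h[n + 1]
    return float(energy)
```

With a single hidden layer the last loop does not run, and the expression reduces to the RBM energy.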

Source: [On the Origin of Deep Learning](https://arxiv.org/pdf/1702.07800.pdf)

The Style-based Recalibration Module (SRM) combines style transfer with an attention mechanism. Its main contribution is style pooling, which uses both the mean and the standard deviation of the input features to better capture global information. It also adopts a lightweight channel-wise fully-connected (CFC) layer in place of the original fully-connected layer to reduce computational cost.
Given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, SRM first collects global information using style pooling ($\text{SP}(\cdot)$), which combines global average pooling and global standard deviation pooling.
Then a channel-wise fully-connected ($\text{CFC}(\cdot)$) layer (i.e. fully connected per channel), batch normalization $\text{BN}$, and a sigmoid function $\sigma$ produce the attention vector. Finally, as in an SE block, the input features are multiplied by the attention vector. Overall, an SRM can be written as:
\begin{align}
    s = F_\text{srm}(X, \theta) & = \sigma (\text{BN}(\text{CFC}(\text{SP}(X))))
\end{align}
\begin{align}
    Y & = s  X
\end{align}
The SRM block improves on both the squeeze and excitation modules of an SE block, yet, like an SE block, it can be added after each residual unit.
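
The SRM computation described above can be sketched in NumPy for a single feature map. This is a minimal sketch under stated assumptions: batch normalization is applied in inference mode with given running statistics, the standard deviation is the population (biased) estimate, and all function and parameter names (`style_pooling`, `srm`, `cfc_weight`, `bn_mean`, `bn_var`, `bn_gamma`, `bn_beta`) are hypothetical.

```python
import numpy as np


def style_pooling(x):
    """SP: per-channel mean and standard deviation of x with shape (C, H, W)."""
    mean = x.mean(axis=(1, 2))
    std = x.std(axis=(1, 2))
    return np.stack([mean, std], axis=1)     # shape (C, 2)


def srm(x, cfc_weight, bn_mean, bn_var, bn_gamma, bn_beta, eps=1e-5):
    """Return (Y, s) for one feature map x of shape (C, H, W).

    cfc_weight has shape (C, 2): each channel mixes only its own
    two style statistics (channel-wise fully connected).
    """
    t = style_pooling(x)                                 # SP(X)
    z = (cfc_weight * t).sum(axis=1)                     # CFC, shape (C,)
    z = bn_gamma * (z - bn_mean) / np.sqrt(bn_var + eps) + bn_beta  # BN (inference mode)
    s = 1.0 / (1.0 + np.exp(-z))                         # sigmoid attention vector
    y = s[:, None, None] * x                             # channel-wise rescaling
    return y, s
```

The CFC layer here needs only 2C weights, compared with the dense channel-to-channel layers of an SE block, which is where the reduction in computation comes from.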

Source	Co-Scale Conv-Attentional Image Transformers
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com