**Mask R-CNN** extends [Faster R-CNN](http://paperswithcode.com/method/faster-r-cnn) to solve instance segmentation tasks. It achieves this by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. In principle, Mask R-CNN is an intuitive extension of Faster [R-CNN](https://paperswithcode.com/method/r-cnn), but constructing the mask branch properly is critical for good results. 

Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is evident in how [RoIPool](http://paperswithcode.com/method/roi-pooling), the *de facto* core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, Mask R-CNN utilises a simple, quantization-free layer, called [RoIAlign](http://paperswithcode.com/method/roi-align), that faithfully preserves exact spatial locations. 
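The difference between quantized and quantization-free sampling can be illustrated with a small NumPy sketch. It is a simplified illustration, not the reference implementation: it samples one point per output bin, and the function names `bilinear_sample`, `quantized_sample` and `roi_align` are hypothetical. The feature map is a linear ramp, so the exact value at any continuous location is known.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a continuous location (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

def quantized_sample(feature, y, x):
    """Mimic RoIPool-style quantization by truncating to integer coordinates."""
    return feature[int(y), int(x)]

def roi_align(feature, box, out_size=2):
    """Sample the centre of each output bin without rounding any coordinate.

    Assumption: one sample per bin, to keep the sketch short.
    """
    y_start, x_start, y_end, x_end = box
    bin_h = (y_end - y_start) / out_size
    bin_w = (x_end - x_start) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(
                feature,
                y_start + (i + 0.5) * bin_h,
                x_start + (j + 0.5) * bin_w,
            )
    return out

# Linear ramp: feature[y, x] == x, so the true value at x = 2.2 is 2.2.
feature = np.tile(np.arange(8.0), (8, 1))
box = (1.2, 1.2, 5.2, 5.2)  # (y_start, x_start, y_end, x_end), not grid-aligned

aligned = roi_align(feature, box)                 # bin centres at x = 2.2 and 4.2
truncated = quantized_sample(feature, 2.2, 2.2)   # falls back to the cell at x = 2
```

On this ramp the interpolated samples recover the continuous values, while the truncated lookup is off by the fractional part of the coordinate, which is the kind of misalignment the paragraph above describes.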

Secondly, Mask R-CNN *decouples* mask and class prediction: it predicts a binary mask for each class independently, without competition among classes, and relies on the network's RoI classification branch to predict the category. In contrast, an [FCN](http://paperswithcode.com/method/fcn) usually performs per-pixel multi-class categorization, which couples segmentation and classification.
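A minimal sketch of this decoupling, assuming the mask branch emits one logit map per class for each RoI: a per-pixel sigmoid and binary cross-entropy are applied only to the map of the ground-truth class, so the other classes' maps never compete. The function names, array shapes and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decoupled_mask_loss(mask_logits, gt_mask, gt_class):
    """Binary cross-entropy on the ground-truth class's mask only.

    mask_logits: (K, m, m) array, one logit map per class (assumed layout)
    gt_mask:     (m, m) binary array
    gt_class:    integer class index of the RoI
    """
    p = sigmoid(mask_logits[gt_class])
    eps = 1e-12  # avoids log(0)
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()

def predict_mask(mask_logits, predicted_class, threshold=0.5):
    """Pick the mask of the class chosen by the classification branch."""
    return sigmoid(mask_logits[predicted_class]) >= threshold

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))            # K = 3 classes, 4x4 masks
gt = (rng.random((4, 4)) > 0.5).astype(float)

loss_before = decoupled_mask_loss(logits, gt, gt_class=1)
perturbed = logits.copy()
perturbed[0] += 10.0                           # change a different class's map
loss_after = decoupled_mask_loss(perturbed, gt, gt_class=1)
```

Perturbing the logits of class 0 leaves the loss for class 1 unchanged, which would not hold under a per-pixel softmax across classes.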

In the field of scene segmentation, encoder-decoder structures cannot make use of the global relationships between objects, whereas RNN-based structures rely heavily on the output of long-term memorization. To address these problems, Fu et al. proposed a novel framework, the dual attention network (DANet), for natural scene image segmentation. Unlike CBAM and BAM, it adopts a self-attention mechanism instead of simply stacking convolutions to compute the spatial attention map, which enables the network to capture global information directly.

DANet uses a position attention module and a channel attention module in parallel to capture feature dependencies in the spatial and channel domains. Given the input feature map $X$, convolution layers are first applied in the position attention module to obtain new feature maps. The position attention module then selectively aggregates the features at each position using a weighted sum of the features at all positions, where the weights are determined by the feature similarity between the corresponding pairs of positions. The channel attention module has a similar form except for dimensional reduction to model cross-channel relations. Finally, the outputs from the two branches are fused to obtain the final feature representations. For simplicity, we reshape the feature map $X$ to $C\times (H \times W)$, whereupon the overall process can be written as
\begin{align}
    Q,\quad K,\quad V &= W_qX,\quad W_kX,\quad W_vX \\
    Y^\text{pos} &= X + V\,\text{Softmax}(Q^TK) \\
    Y^\text{chn} &= X + \text{Softmax}(XX^T)X \\
    Y &= Y^\text{pos} + Y^\text{chn}
\end{align}
where $W_q, W_k, W_v \in \mathbb{R}^{C\times C}$ are learned projections used to generate the new feature maps $Q$, $K$ and $V$.
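The equations above translate almost line for line into NumPy. The sketch below is illustrative only. The equations do not state the softmax axes, so two assumptions are made and marked in the comments: the position weights are normalized over the source positions (columns of the $N \times N$ map sum to one) and the channel weights over the source channels (rows of the $C \times C$ map sum to one). Random matrices stand in for the learned projections.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(X, W_q, W_k, W_v):
    """X: (C, N) with N = H * W; W_q, W_k, W_v: (C, C)."""
    Q, K, V = W_q @ X, W_k @ X, W_v @ X

    # Position branch: (N, N) similarity between every pair of positions.
    # Assumption: normalize over axis 0 so each output position receives
    # a convex combination of the columns of V.
    pos_weights = softmax(Q.T @ K, axis=0)
    Y_pos = X + V @ pos_weights

    # Channel branch: (C, C) similarity between every pair of channels.
    # Assumption: normalize over axis 1 so each output channel receives
    # a convex combination of the rows of X.
    chn_weights = softmax(X @ X.T, axis=1)
    Y_chn = X + chn_weights @ X

    return Y_pos + Y_chn

rng = np.random.default_rng(0)
channels, height, width = 4, 3, 5
feature_map = rng.normal(size=(channels, height, width))
X = feature_map.reshape(channels, height * width)   # C x (H * W)
W_q, W_k, W_v = (rng.normal(size=(channels, channels)) for _ in range(3))

Y = dual_attention(X, W_q, W_k, W_v).reshape(channels, height, width)
```

The output has the same shape as the input, so in this form the module can be dropped between two layers without changing the surrounding dimensions.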

The position attention module enables DANet to capture long-range contextual information and adaptively integrate similar features at any scale from a global viewpoint, while the channel attention module is responsible for enhancing useful channels as well as suppressing noise. Taking spatial and channel relationships into consideration explicitly improves the feature representation for scene segmentation. However, DANet is computationally costly, especially for large input feature maps, because the position attention map has $(H \times W)^2$ entries.
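To make the cost concrete, the following back-of-the-envelope calculation estimates the memory needed for a single position attention map, which stores one weight for every pair of positions. The 4 bytes per element (float32) and the square feature-map sizes are assumptions chosen for illustration.

```python
def position_attention_bytes(height, width, bytes_per_element=4):
    """Memory for one (H*W) x (H*W) attention map, assuming float32 entries."""
    n = height * width
    return n * n * bytes_per_element

for side in (32, 64, 128):
    mib = position_attention_bytes(side, side) / 2**20
    print(f"{side}x{side} feature map -> {mib:.0f} MiB attention matrix")
# 32x32 feature map -> 4 MiB attention matrix
# 64x64 feature map -> 64 MiB attention matrix
# 128x128 feature map -> 1024 MiB attention matrix
```

Doubling the side length of the feature map multiplies the attention map size by sixteen, since the number of positions grows fourfold and the map is quadratic in that number.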

DANet

Dual Attention Network for Scene Segmentation

Mask R-CNN

A **Graph Convolutional Network**, or **GCN**, is an approach for semi-supervised learning on graph-structured data. It is based on an efficient variant of [convolutional neural networks](https://paperswithcode.com/methods/category/convolutional-neural-networks) which operate directly on graphs. The choice of convolutional architecture is motivated via a localized first-order approximation of spectral graph convolutions. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes.
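A single GCN layer can be sketched with the commonly cited propagation rule $H' = \mathrm{ReLU}(\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2} H W)$, where $\tilde{D}$ is the degree matrix of $A + I$. The formula is not spelled out in the text above, so treat this as an illustrative sketch: it uses dense matrices for readability, whereas a sparse adjacency matrix would be needed to obtain cost linear in the number of edges. The toy graph and the random weights are assumptions.

```python
import numpy as np

def normalized_adjacency(A):
    """Add self-loops, then apply symmetric degree normalization."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """Average features over each node's neighbourhood, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy graph: a path of three nodes, 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
A_hat = normalized_adjacency(A)

rng = np.random.default_rng(0)
H0 = rng.normal(size=(3, 2))   # 3 nodes, 2 input features each
W0 = rng.normal(size=(2, 4))   # projects 2 features to 4 hidden units
H1 = gcn_layer(A_hat, H0, W0)  # (3, 4) hidden representation
```

Because `A_hat` is zero wherever two nodes are not connected, each row of `H1` depends only on a node and its direct neighbours, which is the localized first-order behaviour described above. Stacking layers widens that neighbourhood by one hop per layer.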

Source	Mask R-CNN
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com