What is: Deformable Convolutional Networks?

Deformable ConvNets do not learn a global affine transformation; instead, they learn a separate sampling offset for each kernel position at each output location. Standard convolution can be viewed as two steps: first sampling features on a regular grid $\mathcal{R}$ from the input feature map $X$, then aggregating the sampled features by weighted summation using a convolution kernel $w$. For an output location $p_{0}$, the process can be written as: \begin{align} Y(p_{0}) &= \sum_{p_i \in \mathcal{R}} w(p_{i}) X(p_{0} + p_{i}) \end{align} where, for a $3 \times 3$ kernel, \begin{align} \mathcal{R} &= \{(-1,-1), (-1, 0), \dots, (1, 1)\} \end{align}

Deformable convolution augments the sampling step with a set of learnable offsets $\Delta p_{i}$, one per grid point $p_{i}$, which can be generated by a lightweight CNN applied to the same input feature map. Using the offsets $\Delta p_{i}$, the deformable convolution can be formulated as: \begin{align} Y(p_{0}) &= \sum_{p_i \in \mathcal{R}} w(p_{i}) X(p_{0} + p_{i} + \Delta p_{i}). \end{align}

This makes the sampling adaptive. However, $\Delta p_{i}$ is generally fractional, so the sampling location $p_{0} + p_{i} + \Delta p_{i}$ usually does not fall on the integer grid of the feature map. To address this, the value of $X$ at that location is computed by bilinear interpolation. The same work also introduces deformable RoI pooling, which applies learned offsets to the bins of RoI pooling and improves object detection.
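The deformable sampling formula above can be illustrated with a minimal pure-Python sketch for a single-channel feature map and a single output location. The function names `bilinear` and `deform_conv_at` are hypothetical. The offsets are passed in as fixed numbers, whereas in a real network they would be predicted by an extra convolutional layer. Treating locations outside the feature map as zero is also an assumption of this sketch.

```python
import math

# Regular 3x3 sampling grid R, in (row, col) order.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(X, y, x):
    """Sample the 2-D list X at a fractional (y, x) location.

    Neighbours that fall outside X contribute 0 (assumed zero padding).
    """
    h, w = len(X), len(X[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                # Bilinear weight: product of 1-D terms (1 - |distance|).
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * X[yy][xx]
    return value


def deform_conv_at(X, w, offsets, p0):
    """Compute Y(p0) = sum_i w(p_i) * X(p0 + p_i + dp_i) for one location.

    w       : 9 kernel weights, ordered like GRID
    offsets : 9 (dy, dx) pairs, ordered like GRID (hypothetical fixed
              values here; a network would predict them)
    p0      : (row, col) of the output location
    """
    total = 0.0
    for (py, px), wi, (oy, ox) in zip(GRID, w, offsets):
        total += wi * bilinear(X, p0[0] + py + oy, p0[1] + px + ox)
    return total


# Toy 5x5 feature map with X[r][c] = 5*r + c, and an averaging kernel.
X = [[float(5 * r + c) for c in range(5)] for r in range(5)]
w = [1 / 9] * 9

# Zero offsets reduce to an ordinary 3x3 convolution (about 12.0 here).
y_regular = deform_conv_at(X, w, [(0.0, 0.0)] * 9, (2, 2))

# Shifting every sample half a pixel to the right gives about 12.5,
# because X increases by 1 per column.
y_shifted = deform_conv_at(X, w, [(0.0, 0.5)] * 9, (2, 2))
```

With all offsets set to zero the sketch matches ordinary convolution, which is a quick sanity check that the offsets only change where the input is sampled, not how the samples are weighted.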

By learning where to sample, Deformable ConvNets adaptively focus on the important regions of the input and enlarge the effective receptive field of convolutional neural networks, which is important for object detection and semantic segmentation.

Source	Deformable Convolutional Networks
Year	2017
Data Source	CC BY-SA - https://paperswithcode.com