**3D ResNet-RS** is an architecture and scaling strategy for 3D ResNets for video recognition. The key additions are:

- **3D ResNet-D stem**: The [ResNet-D](https://paperswithcode.com/method/resnet-d) stem is adapted to 3D inputs by using three consecutive [3D convolutional layers](https://paperswithcode.com/method/3d-convolution). The first convolutional layer employs a temporal kernel size of 5 while the remaining two convolutional layers employ a temporal kernel size of 1.

- **3D Squeeze-and-Excitation**: [Squeeze-and-Excite](https://paperswithcode.com/method/squeeze-and-excitation-block) is adapted to spatio-temporal inputs by using a 3D [global average pooling](https://paperswithcode.com/method/global-average-pooling) operation for the squeeze operation. An SE ratio of 0.25 is applied in each 3D bottleneck block for all experiments.

- **Self-gating**: A self-gating module is used in each 3D bottleneck block after the SE module.
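
The 3D squeeze-and-excitation step described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `squeeze_excite_3d`, the weight shapes, the omission of biases, and the reading of the SE ratio as the bottleneck width relative to the channel count are all assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_excite_3d(x, w_reduce, w_expand):
    """Apply squeeze-and-excitation to a (C, T, H, W) feature map.

    w_reduce has shape (C_mid, C) and w_expand has shape (C, C_mid).
    Biases are omitted to keep the sketch short (assumption).
    """
    # Squeeze: 3D global average pooling over time, height and width.
    squeezed = x.mean(axis=(1, 2, 3))                # shape (C,)
    # Excite: bottleneck of two linear maps, ReLU then sigmoid.
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # shape (C_mid,)
    scale = sigmoid(w_expand @ hidden)               # shape (C,)
    # Rescale every channel by its gate, broadcast over T, H, W.
    return x * scale[:, None, None, None]

rng = np.random.default_rng(0)
channels, se_ratio = 8, 0.25
mid = int(channels * se_ratio)  # bottleneck width under the assumed reading of the ratio
x = rng.standard_normal((channels, 4, 6, 6))
w_reduce = rng.standard_normal((mid, channels))
w_expand = rng.standard_normal((channels, mid))
y = squeeze_excite_3d(x, w_reduce, w_expand)
```

Because each gate lies strictly between 0 and 1, the output keeps the input shape and never exceeds the input in magnitude.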

**DEXTR**, or **Deep Extreme Cut**, obtains an object segmentation from its four extreme points: the left-most, right-most, top, and bottom pixels. The annotated extreme points are given as a guiding signal to the input of the network. To this end, we create a [heatmap](https://paperswithcode.com/method/heatmap) with activations in the regions of the extreme points. We center a 2D Gaussian around each of the points to create a single heatmap. The heatmap is concatenated with the RGB channels of the input image to form a 4-channel input for the CNN. In order to focus on the object of interest, the input is cropped to the bounding box formed from the extreme point annotations. To include context in the resulting crop, we relax the tight bounding box by several pixels. After this pre-processing step, which relies exclusively on the extreme clicks, the input consists of an RGB crop including an object, plus its extreme points.
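
The pre-processing described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the helper names, the Gaussian width `sigma`, the padding `pad`, and merging the four Gaussians with a pixel-wise maximum are hypothetical choices, not values or details taken from the text.

```python
import numpy as np

def extreme_point_heatmap(shape, points, sigma=10.0):
    """Single heatmap with a 2D Gaussian centered on each (x, y) point."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros((h, w), dtype=np.float64)
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # merge by max (assumption)
    return heatmap

def relaxed_bbox(points, image_shape, pad):
    """Bounding box of the points, enlarged by `pad` and clipped to the image."""
    h, w = image_shape
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0 = max(min(xs) - pad, 0)
    y0 = max(min(ys) - pad, 0)
    x1 = min(max(xs) + pad, w - 1)
    y1 = min(max(ys) + pad, h - 1)
    return x0, y0, x1, y1

def build_input(image, points, pad=20, sigma=10.0):
    """Crop the image to the relaxed box and append the heatmap as a 4th channel."""
    x0, y0, x1, y1 = relaxed_bbox(points, image.shape[:2], pad)
    crop = image[y0:y1 + 1, x0:x1 + 1].astype(np.float64)
    # Shift the points into crop coordinates before drawing the Gaussians.
    local = [(px - x0, py - y0) for px, py in points]
    heat = extreme_point_heatmap(crop.shape[:2], local, sigma)
    return np.concatenate([crop, heat[..., None]], axis=-1)

image = np.zeros((200, 300, 3), dtype=np.uint8)          # dummy RGB image
points = [(60, 100), (240, 90), (150, 40), (140, 160)]   # left, right, top, bottom
net_input = build_input(image, points, pad=20)
```

In this example the relaxed box spans columns 40 to 260 and rows 20 to 180, so `net_input` has shape `(161, 221, 4)`, with the heatmap peaking at each extreme point.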

[ResNet](https://paperswithcode.com/method/resnet)-101 is chosen as the backbone of the architecture. We remove the fully connected layers as well as the [max pooling](https://paperswithcode.com/method/max-pooling) layers in the last two stages to preserve acceptable output resolution for dense prediction, and we introduce atrous convolutions in the last two stages to maintain the same receptive field. After the last ResNet-101 stage, we introduce a pyramid scene parsing module to aggregate global context into the final feature map. The output of the CNN is a probability map representing whether or not each pixel belongs to the object that we want to segment. The CNN is trained to minimize the standard cross-entropy loss, which takes into account that different classes occur with different frequency in a dataset.
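
One way a cross-entropy loss can account for differing class frequencies is to weight each class by the share of pixels belonging to the other class. The sketch below shows that idea for a binary foreground/background map. The weighting scheme and the name `balanced_bce` are assumptions for illustration; the text does not specify the exact weights used.

```python
import numpy as np

def balanced_bce(prob, target, eps=1e-7):
    """Binary cross-entropy with class weights derived from pixel frequencies.

    prob:   predicted foreground probabilities, values in [0, 1]
    target: ground-truth mask with values 0.0 or 1.0, same shape as prob
    """
    prob = np.clip(prob, eps, 1.0 - eps)   # avoid log(0)
    n_total = target.size
    n_pos = target.sum()
    w_pos = (n_total - n_pos) / n_total    # the rarer class gets the larger weight
    w_neg = n_pos / n_total
    loss = -(w_pos * target * np.log(prob)
             + w_neg * (1.0 - target) * np.log(1.0 - prob))
    return loss.sum() / n_total

target = np.zeros((4, 4))
target[0, 0] = 1.0                 # one foreground pixel out of 16
prob = np.full((4, 4), 0.5)        # uninformative prediction
loss = balanced_bce(prob, target)
```

With this weighting, the single foreground pixel and the fifteen background pixels contribute equally to the total, so a small object is not drowned out by the background.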

DEXTR

Deep Extreme Cut: From Extreme Points to Object Segmentation

3D ResNet-RS

Revisiting 3D ResNets for Video Recognition

**LayerDrop** is a form of structured [dropout](https://paperswithcode.com/method/dropout) for [Transformer](https://paperswithcode.com/method/transformer) models which has a regularization effect during training and allows for efficient pruning at inference time. It randomly drops layers from the Transformer during training; at inference, layers can be pruned according to an "every other" strategy, where pruning with a rate $p$ means dropping the layers at depth $d$ such that $d \equiv 0 \pmod{\left\lfloor \frac{1}{p} \right\rfloor}$.
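
The depth rule for pruning can be made concrete with a short sketch. The function name `layers_to_prune` and the zero-based depth indexing are assumptions made for this example; with one-based indexing the selected layers shift accordingly.

```python
import math

def layers_to_prune(num_layers, p):
    """Return the depths d with d % floor(1/p) == 0 (zero-based depths, assumption)."""
    step = math.floor(1 / p)
    return [d for d in range(num_layers) if d % step == 0]

num_layers = 12
pruned_half = layers_to_prune(num_layers, 0.5)       # every second layer
pruned_quarter = layers_to_prune(num_layers, 0.25)   # every fourth layer
kept_half = [d for d in range(num_layers) if d not in pruned_half]
```

For a 12-layer model, a rate of 0.5 removes depths 0, 2, 4, 6, 8, 10 and keeps the six odd depths, while a rate of 0.25 removes depths 0, 4, 8.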

Source	Revisiting 3D ResNets for Video Recognition
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com