**MiVOS** is a video object segmentation model that decouples interaction-to-mask conversion from mask propagation, which makes it versatile and independent of the type of interaction. It uses three modules: Interaction-to-Mask, Propagation, and Difference-Aware Fusion. The interaction module, trained separately, converts user interactions into an object mask, which the propagation module then propagates temporally, using a top-k filtering strategy when reading the space-time memory. To take the user's intent into account, a difference-aware fusion module learns how to fuse the masks from before and after each interaction, which are aligned with the target frames by means of the space-time memory.
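The top-k filtered memory read mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `topk_memory_read`, the flattened `(locations, channels)` layout, and the plain dot-product affinity are assumptions made for illustration. The idea shown is that each query location reads only from its `k` best-matching memory entries, and the softmax is taken over those entries alone.

```python
import numpy as np

def topk_memory_read(mem_keys, mem_values, query_keys, k):
    """Read memory values using only the k best-matching entries per query.

    mem_keys:   (N, Ck) one key per memory location (hypothetical layout)
    mem_values: (N, Cv) one value per memory location
    query_keys: (M, Ck) one key per query location
    returns:    (M, Cv) read-out features
    """
    # Dot-product affinity between every query and every memory location.
    affinity = query_keys @ mem_keys.T                         # (M, N)
    # Indices of the k largest affinities in each row (order within k is arbitrary).
    top_idx = np.argpartition(affinity, -k, axis=1)[:, -k:]    # (M, k)
    top_aff = np.take_along_axis(affinity, top_idx, axis=1)    # (M, k)
    # Softmax restricted to the selected entries (shifted for stability).
    top_aff = top_aff - top_aff.max(axis=1, keepdims=True)
    weights = np.exp(top_aff)
    weights /= weights.sum(axis=1, keepdims=True)
    # Weighted sum of the selected memory values.
    top_vals = mem_values[top_idx]                             # (M, k, Cv)
    return np.einsum("mk,mkc->mc", weights, top_vals)
```

With `k` equal to the number of memory locations this reduces to an ordinary softmax attention read; with `k=1` each query simply copies the value of its best match.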

**Twins-SVT** is a [vision transformer](https://paperswithcode.com/methods/category/vision-transformer) built on a [spatially separable attention mechanism](https://paperswithcode.com/method/spatially-separable-self-attention) (SSAM), which combines two attention operations: (i) locally-grouped self-attention (LSA), which captures fine-grained, short-distance information, and (ii) global sub-sampled attention (GSA), which handles long-distance, global information. It also uses [conditional position encodings](https://paperswithcode.com/method/conditional-positional-encoding) and follows the architectural design of the [Pyramid Vision Transformer](https://paperswithcode.com/method/pvt).
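The difference between the two attention operations can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the Twins-SVT implementation: it uses a single head, omits the learned query/key/value projections, and sub-samples by averaging each window (the helper names `to_windows`, `from_windows`, `lsa`, and `gsa` are hypothetical). LSA restricts attention to non-overlapping `w x w` windows, while GSA lets every position attend to one summary token per window.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Scaled dot-product attention over the last two axes."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def to_windows(x, w):
    """(H, W, C) -> (num_windows, w*w, C); H and W must be divisible by w."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def from_windows(win, H, W, w):
    """Inverse of to_windows: (num_windows, w*w, C) -> (H, W, C)."""
    C = win.shape[-1]
    x = win.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def lsa(x, w):
    """Locally-grouped self-attention: attention inside each window only."""
    H, W, _ = x.shape
    win = to_windows(x, w)
    return from_windows(attend(win, win, win), H, W, w)

def gsa(x, w):
    """Global sub-sampled attention: all positions attend to window summaries."""
    H, W, C = x.shape
    summary = to_windows(x, w).mean(axis=1)      # (num_windows, C), assumed pooling
    tokens = x.reshape(H * W, C)
    return attend(tokens, summary, summary).reshape(H, W, C)
```

Applying `gsa(lsa(x, w), w)` gives one local-then-global pass over a feature map `x` of shape `(H, W, C)`: the first call mixes information only within windows, and the second exchanges information across windows at a cost that grows with the number of windows rather than the number of positions.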

Twins-SVT

Twins: Revisiting the Design of Spatial Attention in Vision Transformers

MiVOS

Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion

**Voxel R-CNN** is a voxel-based two-stage framework for 3D object detection. It consists of a 3D backbone network, a 2D bird's-eye-view (BEV) Region Proposal Network, and a detection head. Voxel RoI Pooling is devised to extract RoI features directly from raw features for further refinement.

End to end, the point clouds are first divided into regular voxels and fed into the 3D backbone network for feature extraction. The 3D feature volumes are then converted into a BEV representation, on which the 2D backbone and [RPN](https://paperswithcode.com/method/rpn) are applied to generate region proposals. Subsequently, [Voxel RoI Pooling](https://paperswithcode.com/method/voxel-roi-pooling) extracts RoI features directly from the 3D feature volumes. Finally, the RoI features are used in the detection head for further box refinement.
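The first two steps of this pipeline, voxelization and conversion to BEV, can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than the Voxel R-CNN code: the function names `voxelize` and `to_bev` are hypothetical, the per-voxel feature is just a point count standing in for learned backbone features, and the BEV conversion is assumed to stack the height axis into channels.

```python
import numpy as np

def voxelize(points, voxel_size, grid_min, grid_shape):
    """Count points per voxel on a regular grid.

    points:     (N, 3) xyz coordinates
    voxel_size: edge length of a voxel (scalar or length-3 array)
    grid_min:   (3,) coordinate of the grid's minimum corner
    grid_shape: (X, Y, Z) number of voxels along each axis
    returns:    (X, Y, Z) integer array of point counts
    """
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    # Drop points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.int64)
    # add.at accumulates correctly when several points share a voxel.
    np.add.at(grid, tuple(idx[inside].T), 1)
    return grid

def to_bev(volume):
    """Fold the height axis into channels: (X, Y, Z, C) -> (X, Y, Z*C)."""
    X, Y, Z, C = volume.shape
    return volume.reshape(X, Y, Z * C)
```

For example, `to_bev(voxelize(points, 0.1, grid_min, grid_shape)[..., None].astype(float))` yields a 2D feature map over the ground plane on which an ordinary 2D backbone could operate.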

Source	Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion
Year	2021
Data Source	CC BY-SA - https://paperswithcode.com