
What is: Residual Multi-Layer Perceptrons?

Source: ResMLP: Feedforward networks for image classification with data-efficient training
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Residual Multi-Layer Perceptrons (ResMLP) is an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. At the end of the network, the patch representations are average pooled and fed to a linear classifier.
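
A minimal PyTorch sketch of this structure is given below. The class and attribute names are illustrative, the dimensions (dim=384, 196 patches, depth 12) follow a ResMLP-S12-style configuration, and the 1e-4 post-normalization initialization is an assumption borrowed from LayerScale; this is a sketch, not the reference implementation. The Aff normalization it uses is described in the next paragraph.

```python
import torch
import torch.nn as nn


class Aff(nn.Module):
    """Per-channel affine rescaling used in place of LayerNorm (detailed below)."""
    def __init__(self, dim, init_alpha=1.0):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta


class ResMLPBlock(nn.Module):
    """One residual block: (i) cross-patch linear layer, (ii) per-patch two-layer MLP."""
    def __init__(self, dim, num_patches, mlp_ratio=4, init_scale=1e-4):
        super().__init__()
        self.aff1 = Aff(dim)
        self.patch_mix = nn.Linear(num_patches, num_patches)  # patches interact, identically across channels
        self.scale1 = Aff(dim, init_alpha=init_scale)          # post-normalization, LayerScale-like
        self.aff2 = Aff(dim)
        self.channel_mix = nn.Sequential(                      # channels interact, independently per patch
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )
        self.scale2 = Aff(dim, init_alpha=init_scale)

    def forward(self, x):                       # x: (batch, num_patches, dim)
        z = self.aff1(x).transpose(1, 2)        # (batch, dim, num_patches)
        z = self.patch_mix(z).transpose(1, 2)   # back to (batch, num_patches, dim)
        x = x + self.scale1(z)                  # residual 1: patch interaction
        x = x + self.scale2(self.channel_mix(self.aff2(x)))  # residual 2: channel MLP
        return x


class ResMLP(nn.Module):
    """Stack of blocks followed by average pooling and a linear classifier."""
    def __init__(self, dim=384, num_patches=196, depth=12, num_classes=1000):
        super().__init__()
        self.blocks = nn.Sequential(*[ResMLPBlock(dim, num_patches) for _ in range(depth)])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: patch embeddings, (batch, num_patches, dim)
        x = self.blocks(x)
        return self.head(x.mean(dim=1))         # average pool over patches


logits = ResMLP()(torch.randn(2, 196, 384))    # -> shape (2, 1000)
```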

Layer normalization is replaced with a simpler affine transformation; this choice is made possible by the absence of self-attention layers, which makes training more stable. The affine operator, $\operatorname{Aff}_{\mathbf{\alpha},\mathbf{\beta}}(x) = \operatorname{Diag}(\mathbf{\alpha})\,x + \mathbf{\beta}$ with learnable vectors $\mathbf{\alpha}$ and $\mathbf{\beta}$, is applied at the beginning ("pre-normalization") and end ("post-normalization") of each residual block. As a pre-normalization, Aff replaces LayerNorm without using any channel-wise statistics, and is initialized with $\mathbf{\alpha}=\mathbf{1}$ and $\mathbf{\beta}=\mathbf{0}$. As a post-normalization, Aff is similar to LayerScale, and $\mathbf{\alpha}$ is initialized with the same small value.
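
Concretely, the two uses of Aff differ only in how $\mathbf{\alpha}$ is initialized. A self-contained sketch follows; the 1e-4 value is an assumption taken from LayerScale's small-model setting, not stated in this excerpt.

```python
import torch
import torch.nn as nn


class Aff(nn.Module):
    """Aff(x) = alpha * x + beta, element-wise over channels.

    Unlike LayerNorm, no mean or variance statistics are computed.
    """
    def __init__(self, dim, init_alpha=1.0):
        super().__init__()
        self.alpha = nn.Parameter(init_alpha * torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        return self.alpha * x + self.beta


pre_norm = Aff(dim=384)                     # alpha = 1, beta = 0
post_norm = Aff(dim=384, init_alpha=1e-4)   # small alpha, as in LayerScale (assumed value)
```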