
What is: Squared ReLU?

Source: Primer: Searching for Efficient Transformers for Language Modeling
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

Squared ReLU is the activation function used in the feedforward block of the Transformer layer in the Primer architecture. It simply squares the ReLU activation: $y = \left(\max(0, x)\right)^2$.
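A minimal sketch in PyTorch of Squared ReLU inside a Transformer-style feedforward block (the framework choice, layer sizes, and variable names here are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: max(0, x)^2, applied element-wise."""
    return F.relu(x) ** 2

# Example: using it as the activation of a feedforward block
# (dimensions are arbitrary, chosen only for illustration).
d_model, d_ff = 512, 2048
w_in = torch.nn.Linear(d_model, d_ff)
w_out = torch.nn.Linear(d_ff, d_model)

x = torch.randn(4, d_model)        # a batch of token representations
hidden = squared_relu(w_in(x))     # Squared ReLU replaces ReLU/GELU here
y = w_out(hidden)
print(y.shape)                     # torch.Size([4, 512])
```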

The effectiveness of higher-order polynomials can also be observed in other effective Transformer nonlinearities, such as GLU variants like ReGLU and point-wise activations like approximate GELU. However, squared ReLU has drastically different asymptotics as $x \rightarrow \infty$ compared to the most commonly used activation functions: ReLU, GELU and Swish. Squared ReLU does have significant overlap with ReGLU, and in fact is equivalent when ReGLU's $U$ and $V$ weight matrices are the same and squared ReLU is immediately preceded by a linear transformation with weight matrix $U$. This leads the authors to believe that squared ReLU captures the benefits of these GLU variants while being simpler, requiring no additional parameters, and delivering better quality.
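To make the equivalence concrete, here is a small numerical check (a bias-free sketch with arbitrary shapes, not the paper's exact setup): when ReGLU's two weight matrices are the same matrix $U$, then $(xU) \odot \max(0, xU) = \max(0, xU)^2$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative dimensions and weights (not from the paper).
d_model, d_ff = 8, 16
U = torch.randn(d_model, d_ff)
x = torch.randn(4, d_model)

reglu_shared = (x @ U) * F.relu(x @ U)   # ReGLU with both weight matrices equal to U
sq_relu = F.relu(x @ U) ** 2             # Squared ReLU preceded by the linear map U

print(torch.allclose(reglu_shared, sq_relu))  # True
```

The check works because wherever $xU$ is negative the ReLU gate zeros both expressions, and wherever it is positive both reduce to $(xU)^2$.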