Unlike [batch normalization](https://paperswithcode.com/method/batch-normalization), **Layer Normalization** directly estimates the normalization statistics from the summed inputs to the neurons within a hidden layer so the normalization does not introduce any new dependencies between training cases. It works well for [RNNs](https://paperswithcode.com/methods/category/recurrent-neural-networks) and improves both the training time and the generalization performance of several existing RNN models. More recently, it has been used with [Transformer](https://paperswithcode.com/methods/category/transformers) models.

We compute the layer normalization statistics over all the hidden units in the same layer as follows:

$$ \mu^{l} = \frac{1}{H}\sum^{H}\_{i=1}a\_{i}^{l} $$

$$ \sigma^{l} = \sqrt{\frac{1}{H}\sum^{H}\_{i=1}\left(a\_{i}^{l}-\mu^{l}\right)^{2}}  $$

where $H$ denotes the number of hidden units in a layer. Under layer normalization, all the hidden units in a layer share the same normalization terms $\mu$ and $\sigma$, but different training cases have different normalization terms. Unlike batch normalization, layer normalization does not impose any constraint on the size of the mini-batch and it can be used in the pure online regime with batch size 1.
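The statistics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `layer_norm` and the small constant `eps` (added to avoid division by zero) are assumptions, and the learned gain and bias usually applied after normalization are omitted.

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    """Normalize each training case (row of `a`) over its H hidden units.

    `eps` is an assumed stabilizer; it is not part of the formulas above.
    """
    mu = a.mean(axis=-1, keepdims=True)                            # mu^l, one per case
    sigma = np.sqrt(((a - mu) ** 2).mean(axis=-1, keepdims=True))  # sigma^l, one per case
    return (a - mu) / (sigma + eps)

# Two training cases with H = 4 hidden units each.
a = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
y = layer_norm(a)

# The statistics never mix training cases, so a batch of size 1
# gives the same result for that case as the full batch does.
y_single = layer_norm(a[:1])
```

Because the mean and standard deviation are taken along the last axis only, each row is normalized independently of the others, which is why the mini-batch size places no constraint on the method.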

**ConvLSTM** is a type of recurrent neural network for spatio-temporal prediction that has convolutional structures in both the input-to-state and state-to-state transitions. The ConvLSTM determines the future state of a given cell in the grid from the inputs and past states of its local neighbors. This is achieved by using a [convolution](https://paperswithcode.com/method/convolution) operator in the state-to-state and input-to-state transitions (see Figure). The key equations of ConvLSTM are shown below, where $∗$ denotes the convolution operator and $\odot$ the Hadamard product:

$$ i\_{t} = \sigma\left(W\_{xi} ∗ X\_{t} + W\_{hi} ∗ \mathcal{H}\_{t−1} + W\_{ci} \odot \mathcal{C}\_{t−1} + b\_{i}\right) $$

$$ f\_{t} = \sigma\left(W\_{xf} ∗ X\_{t} + W\_{hf} ∗ \mathcal{H}\_{t−1} + W\_{cf} \odot \mathcal{C}\_{t−1} + b\_{f}\right) $$

$$ \mathcal{C}\_{t} = f\_{t} \odot \mathcal{C}\_{t−1} + i\_{t} \odot \text{tanh}\left(W\_{xc} ∗ X\_{t} + W\_{hc} ∗ \mathcal{H}\_{t−1} + b\_{c}\right) $$

$$ o\_{t} = \sigma\left(W\_{xo} ∗ X\_{t} + W\_{ho} ∗ \mathcal{H}\_{t−1} + W\_{co} \odot \mathcal{C}\_{t} + b\_{o}\right) $$

$$ \mathcal{H}\_{t} = o\_{t} \odot \text{tanh}\left(\mathcal{C}\_{t}\right) $$

If we view the states as the hidden representations of moving objects, a ConvLSTM with a larger transitional kernel should be able to capture faster motions while one with a smaller kernel can capture slower motions. 

To ensure that the states have the same number of rows and columns as the inputs, padding is needed before applying the convolution operation. Here, padding of the hidden states on the boundary points can be viewed as using the state of the outside world for calculation. Usually, before the first input arrives, we initialize all the states of the [LSTM](https://paperswithcode.com/method/lstm) to zero, which corresponds to "total ignorance" of the future.
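One step of the ConvLSTM equations can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single channel (real ConvLSTMs operate on multi-channel 3D tensors), scalar biases, and odd-sized kernels, and the helper names `conv2d_same` and `convlstm_step` and the parameter dictionary layout are hypothetical. As in most deep learning code, the "convolution" is implemented as cross-correlation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Zero-padded 2D cross-correlation; output has the same rows/cols as x."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding on the boundary
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += k[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def convlstm_step(x, h_prev, c_prev, p):
    """One time step; W_x*/W_h* are kernels, W_c* are grid-shaped (Hadamard)."""
    i = sigmoid(conv2d_same(x, p["W_xi"]) + conv2d_same(h_prev, p["W_hi"])
                + p["W_ci"] * c_prev + p["b_i"])
    f = sigmoid(conv2d_same(x, p["W_xf"]) + conv2d_same(h_prev, p["W_hf"])
                + p["W_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(conv2d_same(x, p["W_xc"])
                                 + conv2d_same(h_prev, p["W_hc"]) + p["b_c"])
    o = sigmoid(conv2d_same(x, p["W_xo"]) + conv2d_same(h_prev, p["W_ho"])
                + p["W_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

# Randomly initialized parameters for a 5x6 grid with 3x3 kernels (illustrative values).
rng = np.random.default_rng(0)
rows, cols, ksize = 5, 6, 3
params = {}
for gate in "ifco":
    params[f"W_x{gate}"] = rng.normal(scale=0.1, size=(ksize, ksize))
    params[f"W_h{gate}"] = rng.normal(scale=0.1, size=(ksize, ksize))
    params[f"b_{gate}"] = 0.0
for gate in "ifo":
    params[f"W_c{gate}"] = rng.normal(scale=0.1, size=(rows, cols))

# States start at zero before the first input, then are updated frame by frame.
h = np.zeros((rows, cols))
c = np.zeros((rows, cols))
for x in rng.normal(size=(4, rows, cols)):
    h, c = convlstm_step(x, h, c, params)
```

The zero padding in `conv2d_same` keeps the states the same size as the inputs, and a larger `ksize` lets each cell see a wider neighborhood of past states per step.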

ConvLSTM

Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting

Layer Normalization

**PP-YOLO** is an object detector based on [YOLOv3](https://paperswithcode.com/method/yolov3). It mainly combines existing tricks that barely increase the number of model parameters or FLOPs, with the goal of improving detector accuracy as much as possible while keeping the speed almost unchanged. These changes include:

- Replacing the [DarkNet-53](https://paperswithcode.com/method/darknet-53) backbone with ResNet50-vd. Some of the convolutional layers in ResNet50-vd are also replaced with [deformable convolutional layers](https://paperswithcode.com/method/deformable-convolution).
- A larger batch size is used, increased from 64 to 192.
- An exponential moving average of the parameters is used.
- [DropBlock](https://paperswithcode.com/method/dropblock) is applied to the [FPN](https://paperswithcode.com/method/fpn).
- An IoU loss is used.
- An IoU prediction branch is added to measure the accuracy of localization.
- [Grid Sensitive](https://paperswithcode.com/method/grid-sensitive) is used, similar to [YOLOv4](https://paperswithcode.com/method/yolov4).
- [Matrix NMS](https://paperswithcode.com/method/matrix-nms) is used.
- [CoordConv](https://paperswithcode.com/method/coordconv) is used for the [FPN](https://paperswithcode.com/method/fpn), replacing the 1x1 convolution layer, and also the first convolution layer in the detection head.
- [Spatial Pyramid Pooling](https://paperswithcode.com/method/spatial-pyramid-pooling) is used for the top feature map.
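Two of the simpler items in the list, the exponential moving average of parameters and the IoU loss, can be sketched in plain Python. This is only an illustration: the function names, the `decay` parameter, the `(x1, y1, x2, y2)` box format, and the `1 - IoU` form of the loss are assumptions, not details taken from the text above.

```python
def ema_update(ema_params, params, decay):
    """Blend current parameters into their running average.

    new_ema = decay * ema + (1 - decay) * current, per parameter.
    `decay` is an assumed hyperparameter in (0, 1).
    """
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred_box, target_box):
    """Assumed basic form of an IoU loss: lower when boxes overlap more."""
    return 1.0 - iou(pred_box, target_box)

# Usage: smooth a parameter list and score a predicted box.
ema = ema_update([0.0, 2.0], [1.0, 4.0], decay=0.9)
loss = iou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

In this sketch the averaged parameters would be the ones used at evaluation time, while training continues to update the raw parameters.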

Source	Layer Normalization
Year	2000
Data Source	CC BY-SA - https://paperswithcode.com