What is: ReZero?

Source: ReZero is All You Need: Fast Convergence at Large Depth
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

ReZero is a normalization approach that dynamically facilitates well-behaved gradients and arbitrarily deep signal propagation. The idea is simple: ReZero initializes each layer to perform the identity operation. For each layer, a residual connection is introduced for the input signal $\mathbf{x}$, along with one trainable parameter $\alpha$ that modulates the non-trivial transformation of the layer, $F(\mathbf{x})$:

$$\mathbf{x}_{i+1} = \mathbf{x}_{i} + \alpha_{i} F\left(\mathbf{x}_{i}\right)$$

where $\alpha_i = 0$ at the beginning of training. Initially the gradients for all parameters defining $F$ vanish, but they dynamically evolve to suitable values during the initial stages of training. The architecture is illustrated in the Figure.
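
The update rule can be made concrete with a small sketch. It is not the authors' implementation: the sublayer $F(\mathbf{x}) = \tanh(W\mathbf{x})$, the helper names (`make_layer`, `transform`, `rezero_block`, `rezero_forward`), the depth, and the input vector are all assumptions chosen for illustration, and only the Python standard library is used. The sketch checks two properties stated above: with every $\alpha_i = 0$ the stack is the identity at any depth, and at $\alpha_i = 0$ the output does not respond to changes in the parameters of $F$, while it does respond to changes in $\alpha_i$.

```python
import math
import random


def make_layer(dim, seed=0):
    """Build weights W for a hypothetical sublayer F(x) = tanh(W x)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def transform(weights, x):
    """F(x): the non-trivial transformation of one layer (assumed: tanh(W x))."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def rezero_block(weights, alpha, x):
    """One ReZero block: x_{i+1} = x_i + alpha_i * F(x_i)."""
    fx = transform(weights, x)
    return [xi + alpha * fi for xi, fi in zip(x, fx)]


def rezero_forward(layers, alphas, x):
    """Apply a stack of ReZero blocks in order."""
    for weights, alpha in zip(layers, alphas):
        x = rezero_block(weights, alpha, x)
    return x


depth, dim = 64, 4
layers = [make_layer(dim, seed=i) for i in range(depth)]
alphas = [0.0] * depth  # ReZero initialization: every alpha_i starts at 0
x0 = [0.5, -1.0, 2.0, 0.25]

# With all alpha_i = 0, the whole stack is the identity map, at any depth.
out_init = rezero_forward(layers, alphas, x0)

# Once the alphas move away from 0, the layers start to contribute.
out_trained = rezero_forward(layers, [0.1] * depth, x0)

# At alpha = 0, perturbing a weight of F leaves the block output unchanged
# (zero gradient for F's parameters), while perturbing alpha shifts the
# output by eps * F(x).
eps = 1e-3
perturbed = [row[:] for row in layers[0]]
perturbed[0][0] += eps
weight_effect = rezero_block(perturbed, 0.0, x0)
alpha_effect = rezero_block(layers[0], eps, x0)
```

In this sketch, `out_init` and `weight_effect` both equal `x0`, whereas `alpha_effect` differs from `x0` by `eps * F(x0)`. This suggests why training can start at all: the gradient with respect to each $\alpha_i$ is nonzero from the first step, and once $\alpha_i$ leaves zero the parameters of $F$ begin to receive gradient as well.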