
What is: T-Fixup?

Source: Improving Transformer Optimization Through Better Initialization
Year: 2020
Data Source: CC BY-SA - https://paperswithcode.com

T-Fixup is an initialization method for Transformers that aims to remove the need for layer normalization and learning-rate warmup. The initialization procedure is as follows (a code sketch is given after the list):

  • Apply Xavier initialization for all parameters excluding input embeddings. Use Gaussian initialization $\mathcal{N}\left(0, d^{-\frac{1}{2}}\right)$ for input embeddings, where $d$ is the embedding dimension.
  • Scale the $\mathbf{v}_d$ and $\mathbf{w}_d$ matrices in each decoder attention block, the weight matrices in each decoder MLP block, and the input embeddings $\mathbf{x}$ and $\mathbf{y}$ in the encoder and decoder by $(9N)^{-\frac{1}{4}}$.
  • Scale the $\mathbf{v}_e$ and $\mathbf{w}_e$ matrices in each encoder attention block and the weight matrices in each encoder MLP block by $0.67\,N^{-\frac{1}{4}}$.
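The following is a minimal PyTorch sketch of these three steps, not the authors' implementation. It assumes a standard `torch.nn.Transformer` with separate `nn.Embedding` tables for source and target, takes $N$ to be the layer count of the corresponding encoder or decoder stack, and reads $d^{-\frac{1}{2}}$ in $\mathcal{N}(0, d^{-\frac{1}{2}})$ as the variance (so the standard deviation is $d^{-\frac{1}{4}}$); the function name `t_fixup_init` is made up for illustration. Note that T-Fixup is meant for a Transformer with layer normalization removed, whereas `nn.Transformer` keeps its LayerNorm modules, so this sketch only illustrates the initialization and scaling arithmetic.

```python
import torch.nn as nn

def t_fixup_init(model: nn.Transformer, src_embed: nn.Embedding,
                 tgt_embed: nn.Embedding, d_model: int) -> None:
    """Sketch of T-Fixup initialization on a standard nn.Transformer."""
    # Step 1: Xavier initialization for every weight matrix in the model.
    # (The embedding tables live outside `model`, so they are skipped here.)
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    # Input embeddings ~ N(0, d^{-1/2}); reading d^{-1/2} as the variance
    # gives std = d^{-1/4} (an assumption, see the lead-in above).
    for emb in (src_embed, tgt_embed):
        nn.init.normal_(emb.weight, mean=0.0, std=d_model ** -0.25)

    # Assumption: N is the number of layers in the corresponding stack.
    dec_scale = (9 * len(model.decoder.layers)) ** -0.25   # (9N)^{-1/4}
    enc_scale = 0.67 * len(model.encoder.layers) ** -0.25  # 0.67 N^{-1/4}

    # Step 2: scale decoder value/output projections (v_d, w_d), decoder
    # MLP weights, and both embedding tables by (9N)^{-1/4}.
    for layer in model.decoder.layers:
        for attn in (layer.self_attn, layer.multihead_attn):
            # in_proj_weight stacks [q; k; v]; the last d_model rows are v.
            attn.in_proj_weight.data[2 * d_model:] *= dec_scale
            attn.out_proj.weight.data *= dec_scale
        layer.linear1.weight.data *= dec_scale
        layer.linear2.weight.data *= dec_scale
    for emb in (src_embed, tgt_embed):
        emb.weight.data *= dec_scale

    # Step 3: scale encoder value/output projections (v_e, w_e) and
    # encoder MLP weights by 0.67 * N^{-1/4}.
    for layer in model.encoder.layers:
        layer.self_attn.in_proj_weight.data[2 * d_model:] *= enc_scale
        layer.self_attn.out_proj.weight.data *= enc_scale
        layer.linear1.weight.data *= enc_scale
        layer.linear2.weight.data *= enc_scale

# Usage: initialize a 6-layer encoder-decoder model with d_model = 512.
d_model = 512
model = nn.Transformer(d_model=d_model, num_encoder_layers=6,
                       num_decoder_layers=6)
src_embed = nn.Embedding(10_000, d_model)
tgt_embed = nn.Embedding(10_000, d_model)
t_fixup_init(model, src_embed, tgt_embed, d_model)
```

In a faithful setup, a model initialized this way would then be trained without LayerNorm and without a warmup schedule, which is the point of the method; the sketch above leaves the architecture itself unchanged.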