
What is: Parallel Layers?

Source: PaLM: Scaling Language Modeling with Pathways
Year: 2022
Data Source: CC BY-SA - https://paperswithcode.com

• Parallel Layers – We use a “parallel” formulation in each Transformer block (Wang & Komatsuzaki, 2021), rather than the standard “serialized” formulation. Specifically, the standard formulation can be written as:
y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))

Whereas the parallel formulation can be written as:
y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
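For concreteness, the two formulations can be written out as short PyTorch modules. This is an illustrative sketch, not the PaLM implementation: the module and parameter names are ours, and nn.MultiheadAttention plus a two-layer GELU MLP stand in for the actual sub-layers.

import torch
import torch.nn as nn

class SerializedBlock(nn.Module):
    """Standard block: attention sub-layer, then MLP sub-layer, each with its own LayerNorm and residual."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        a = self.ln1(x)
        h = x + self.attn(a, a, a)[0]      # attention must finish before the MLP sees its input
        return h + self.mlp(self.ln2(h))

class ParallelBlock(nn.Module):
    """Parallel block: y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)    # one LayerNorm; its output feeds both branches in this sketch
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.ln(x)                     # shared normalized input
        return x + self.attn(h, h, h)[0] + self.mlp(h)

x = torch.randn(2, 16, 64)                 # (batch, sequence, d_model)
print(SerializedBlock(64, 4, 256)(x).shape, ParallelBlock(64, 4, 256)(x).shape)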

The parallel formulation results in roughly 15% faster training speed at large scales, since the MLP and Attention input matrix multiplications can be fused. Ablation experiments showed a small quality degradation at 8B scale but no quality degradation at 62B scale, so we extrapolated that the effect of parallel layers should be quality neutral at the 540B scale.
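The fusion is possible because, in the parallel form, the attention branch and the MLP branch read the same LayerNorm(x): the Q, K, V input projections and the MLP's first layer can be stacked into one weight matrix and applied with a single matrix multiplication. The sketch below checks that the fused and unfused computations agree; the shapes and weight names are illustrative, not the actual PaLM/Pathways kernels.

import torch

d_model, d_ff, batch, seq = 64, 256, 2, 16
h = torch.randn(batch, seq, d_model)       # stands in for LayerNorm(x)

# Separate input projections, as they would be launched one by one.
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
w_ff = torch.randn(d_model, d_ff)          # first MLP layer
q, k, v, ff_in = h @ w_q, h @ w_k, h @ w_v, h @ w_ff

# Fused: concatenate the weights, issue one larger matmul, then split the result.
w_fused = torch.cat([w_q, w_k, w_v, w_ff], dim=1)                    # (d_model, 3*d_model + d_ff)
q2, k2, v2, ff_in2 = (h @ w_fused).split([d_model, d_model, d_model, d_ff], dim=-1)

assert torch.allclose(q, q2, atol=1e-4) and torch.allclose(ff_in, ff_in2, atol=1e-4)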