**TernaryBERT** is a [Transformer](https://paperswithcode.com/methods/category/transformers)-based model that ternarizes the weights of a pretrained [BERT](https://paperswithcode.com/method/bert) model to $\{-1,0,+1\}$, using different granularities for the word embedding and for the weights in the Transformer layers. Rather than using knowledge distillation directly to compress the model, TernaryBERT uses it to improve the performance of a ternarized student model that has the same size as the teacher model. In this way, knowledge is transferred from the highly accurate full-precision teacher to the ternarized student, whose low-bit weights give it a smaller capacity.
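To make the idea of ternarizing at different granularities concrete, here is a minimal NumPy sketch. It assumes a threshold-based approximation in the style of Ternary Weight Networks (threshold `0.7 * mean(|w|)`, scale equal to the mean magnitude of the surviving weights). The function name `ternarize`, the threshold constant, and the choice of one scale per row for the embedding versus one scale per matrix for a Transformer weight are illustrative assumptions, not details taken from the description above.

```python
import numpy as np

def ternarize(w, axis=None):
    """Approximate w by alpha * t with t in {-1, 0, +1} (TWN-style heuristic, assumed).

    axis=None -> one scale for the whole matrix (coarse granularity)
    axis=1    -> one scale per row (finer granularity)
    """
    abs_w = np.abs(w)
    # Assumed threshold heuristic: weights below 0.7 * mean|w| are set to zero.
    delta = 0.7 * abs_w.mean(axis=axis, keepdims=True)
    mask = abs_w > delta
    t = np.where(mask, np.sign(w), 0.0)
    # Scale = mean magnitude of the weights that survived the threshold.
    count = np.maximum(mask.sum(axis=axis, keepdims=True), 1)
    alpha = (abs_w * mask).sum(axis=axis, keepdims=True) / count
    return alpha, t

rng = np.random.default_rng(0)
embedding = rng.normal(size=(5, 8))    # hypothetical tiny word embedding
linear_w = rng.normal(size=(8, 8))     # hypothetical Transformer weight matrix

alpha_emb, t_emb = ternarize(embedding, axis=1)   # one scale per embedding row
alpha_lin, t_lin = ternarize(linear_w)            # one scale for the whole matrix

# The dequantized weights used in the forward pass would be alpha * t.
approx_emb = alpha_emb * t_emb
```

In a distillation-aware setup, a student using `alpha * t` in its forward pass would be trained to match the full-precision teacher's outputs; that training loop is omitted here.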

**HetPipe** is a hybrid parallel method that integrates pipelined model parallelism (PMP) with data parallelism (DP). In HetPipe, a group of multiple GPUs, called a virtual worker, processes minibatches in a pipelined manner, and multiple such virtual workers employ data parallelism for higher performance.
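The combination of the two forms of parallelism can be illustrated with a toy schedule. The sketch below groups GPUs into virtual workers, splits a model's layers into contiguous pipeline stages within a worker, and lists which minibatch each stage handles at each clock tick. All function names are hypothetical, only forward passes are modelled, and parameter synchronization between virtual workers is left out, so this is an illustration of the structure rather than HetPipe's actual scheduler.

```python
def make_virtual_workers(gpus, gpus_per_worker):
    """Group a flat list of GPU ids into virtual workers (data-parallel replicas)."""
    return [gpus[i:i + gpus_per_worker] for i in range(0, len(gpus), gpus_per_worker)]

def partition_layers(n_layers, n_stages):
    """Split layer indices into contiguous pipeline stages, one per GPU in a worker."""
    base, extra = divmod(n_layers, n_stages)
    stages, start = [], 0
    for s in range(n_stages):
        size = base + (1 if s < extra else 0)
        stages.append(list(range(start, start + size)))
        start += size
    return stages

def pipeline_schedule(n_minibatches, n_stages):
    """Forward-only pipeline: at tick t, stage s works on minibatch t - s."""
    ticks = []
    for t in range(n_minibatches + n_stages - 1):
        ticks.append({s: t - s for s in range(n_stages) if 0 <= t - s < n_minibatches})
    return ticks

workers = make_virtual_workers(list(range(8)), gpus_per_worker=4)
stages = partition_layers(n_layers=10, n_stages=4)
schedule = pipeline_schedule(n_minibatches=3, n_stages=4)

# Data parallelism: minibatches are sharded round-robin across virtual workers.
all_minibatches = list(range(6))
shards = [all_minibatches[i::len(workers)] for i in range(len(workers))]
```

Each virtual worker would run its own copy of `schedule` on its shard of minibatches, which is where pipelined model parallelism (inside a worker) and data parallelism (across workers) meet.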

TernaryBERT: Distillation-aware Ultra-low Bit BERT

**NormFormer** is a type of [Pre-LN](https://paperswithcode.com/method/layer-normalization) transformer that adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first [fully connected layer](https://paperswithcode.com/method/position-wise-feed-forward-layer). The modifications introduce a small number of additional learnable parameters, which provide a cost-effective way for each layer to change the magnitude of its features, and therefore the magnitude of the gradients to subsequent components.
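The three added operations can be placed in a single-layer forward pass as follows. This is a minimal NumPy sketch under stated assumptions: the parameter names, the ReLU activation, the absence of attention biases and masking, and the exact position of each operation relative to the output projection and the activation are illustrative choices, not details confirmed by the description above.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d_model, d_ff, n_heads, seed=0):
    rng = np.random.default_rng(seed)
    p = {name: rng.normal(scale=0.02, size=(d_model, d_model))
         for name in ("wq", "wk", "wv", "wo")}
    p["w1"] = rng.normal(scale=0.02, size=(d_model, d_ff))
    p["b1"] = np.zeros(d_ff)
    p["w2"] = rng.normal(scale=0.02, size=(d_ff, d_model))
    p["b2"] = np.zeros(d_model)
    # ln1 / ln2 are the usual Pre-LN norms; post_attn and mid_ffn are the added ones.
    for name, dim in (("ln1", d_model), ("ln2", d_model),
                      ("post_attn", d_model), ("mid_ffn", d_ff)):
        p[name + "_g"] = np.ones(dim)
        p[name + "_b"] = np.zeros(dim)
    p["head_scale"] = np.ones(n_heads)  # added: one learnable scalar per head
    return p

def normformer_layer(x, p, n_heads):
    seq, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(m):
        return m.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    # Pre-LN self-attention sublayer
    h = layer_norm(x, p["ln1_g"], p["ln1_b"])
    q, k, v = (split_heads(h @ p[name]) for name in ("wq", "wk", "wv"))
    heads = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head)) @ v
    heads = heads * p["head_scale"][:, None, None]           # head-wise scaling
    attn = heads.transpose(1, 0, 2).reshape(seq, d_model) @ p["wo"]
    attn = layer_norm(attn, p["post_attn_g"], p["post_attn_b"])  # LN after attention
    x = x + attn

    # Pre-LN feed-forward sublayer
    h = layer_norm(x, p["ln2_g"], p["ln2_b"])
    h = np.maximum(h @ p["w1"] + p["b1"], 0.0)                # ReLU (assumed)
    h = layer_norm(h, p["mid_ffn_g"], p["mid_ffn_b"])         # LN after first FC layer
    return x + h @ p["w2"] + p["b2"]

params = init_params(d_model=16, d_ff=32, n_heads=4)
x = np.random.default_rng(1).normal(size=(5, 16))
y = normformer_layer(x, params, n_heads=4)

# Extra learnable parameters relative to a plain Pre-LN layer in this sketch.
extra_params = (params["head_scale"].size
                + params["post_attn_g"].size + params["post_attn_b"].size
                + params["mid_ffn_g"].size + params["mid_ffn_b"].size)
```

In this sketch the additions cost `n_heads + 2 * d_model + 2 * d_ff` parameters per layer, which is small next to the weight matrices and matches the claim that the modifications add few learnable parameters.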

Source	TernaryBERT: Distillation-aware Ultra-low Bit BERT
Year	2020
Data Source	CC BY-SA - https://paperswithcode.com