**OPT** is a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters. The models are trained with the AdamW optimizer and a weight decay of 0.1. Training follows a linear learning rate schedule: the rate warms up from 0 to its maximum over the first 2000 steps for OPT-175B, or over the first 375M tokens for the smaller models, and then decays to 10% of the maximum LR over 300B tokens. Batch sizes range from 0.5M to 4M tokens depending on model size and are kept constant throughout training.
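The schedule described above can be sketched as a small function of the number of tokens seen. This is a minimal illustration, not the authors' training code: the function name `opt_style_lr` and its parameters are hypothetical, the defaults use the 375M-token warmup quoted for the smaller models, and it assumes the decay ends at 300B total tokens (the text could also be read as a 300B-token decay after warmup).

```python
def opt_style_lr(tokens_seen, max_lr,
                 warmup_tokens=375_000_000,
                 total_tokens=300_000_000_000,
                 final_ratio=0.1):
    """Linear warmup from 0 to max_lr, then linear decay to final_ratio * max_lr.

    All names and defaults are illustrative assumptions, not the OPT codebase.
    """
    if tokens_seen < warmup_tokens:
        # Warmup phase: LR grows linearly from 0 to max_lr.
        return max_lr * tokens_seen / warmup_tokens
    # Decay phase: fraction of the way from end of warmup to total_tokens,
    # clamped at 1 so the LR stays at its floor afterwards.
    progress = min(1.0, (tokens_seen - warmup_tokens) / (total_tokens - warmup_tokens))
    return max_lr * (1.0 - (1.0 - final_ratio) * progress)


if __name__ == "__main__":
    peak = 1e-3  # hypothetical maximum learning rate
    for tokens in (0, 187_500_000, 375_000_000, 150_000_000_000, 300_000_000_000):
        print(tokens, opt_style_lr(tokens, peak))
```

Expressing the schedule in tokens rather than steps keeps it independent of the batch size, which differs between model sizes.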

**GAN Least Squares Loss** is a least squares loss function for generative adversarial networks. Minimizing this objective function is equivalent to minimizing the Pearson $\chi^{2}$ divergence. The objective function (here for [LSGAN](https://paperswithcode.com/method/lsgan)) can be defined as:

$$ \min\_{D}V\_{LS}\left(D\right) = \frac{1}{2}\mathbb{E}\_{\mathbf{x} \sim p\_{data}\left(\mathbf{x}\right)}\left[\left(D\left(\mathbf{x}\right) - b\right)^{2}\right] + \frac{1}{2}\mathbb{E}\_{\mathbf{z}\sim p\_{\mathbf{z}}\left(\mathbf{z}\right)}\left[\left(D\left(G\left(\mathbf{z}\right)\right) - a\right)^{2}\right] $$

$$ \min\_{G}V\_{LS}\left(G\right) = \frac{1}{2}\mathbb{E}\_{\mathbf{z} \sim p\_{\mathbf{z}}\left(\mathbf{z}\right)}\left[\left(D\left(G\left(\mathbf{z}\right)\right) - c\right)^{2}\right] $$

where $a$ and $b$ are the labels for fake data and real data, respectively, and $c$ denotes the value that $G$ wants $D$ to believe for fake data.
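The two objectives can be sketched numerically by replacing each expectation with a batch mean over discriminator outputs. This is a minimal NumPy illustration rather than a reference implementation: the function names `lsgan_d_loss` and `lsgan_g_loss` are hypothetical, and the defaults $a = 0$, $b = 1$, $c = 1$ are an assumed 0-1 label coding chosen for the example. Other label choices are possible, and the Pearson $\chi^{2}$ equivalence depends on how the labels are set.

```python
import numpy as np


def lsgan_d_loss(d_real, d_fake, a=0.0, b=1.0):
    """Discriminator loss: push D(x) toward b and D(G(z)) toward a.

    d_real, d_fake: arrays of discriminator outputs on real and generated samples.
    The label defaults are an assumption for illustration.
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_real - b) ** 2) + 0.5 * np.mean((d_fake - a) ** 2)


def lsgan_g_loss(d_fake, c=1.0):
    """Generator loss: push D(G(z)) toward c, the value G wants D to output."""
    d_fake = np.asarray(d_fake, dtype=float)
    return 0.5 * np.mean((d_fake - c) ** 2)


if __name__ == "__main__":
    # Hypothetical discriminator outputs for a batch of three samples each.
    real_scores = [0.9, 1.1, 1.0]
    fake_scores = [0.2, -0.1, 0.0]
    print("D loss:", lsgan_d_loss(real_scores, fake_scores))
    print("G loss:", lsgan_g_loss(fake_scores))
```

Because the penalty is quadratic in the distance from the target label, generated samples that the discriminator scores far from $c$ still contribute a gradient to the generator.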

Least Squares Generative Adversarial Networks

**BARThez** is a self-supervised transfer learning model for the French language based on [BART](https://paperswithcode.com/method/bart). Compared to existing [BERT](https://paperswithcode.com/method/bert)-based French language models such as [CamemBERT](https://paperswithcode.com/paper/camembert-a-tasty-french-language-model) and [FlauBERT](https://paperswithcode.com/paper/flaubert-unsupervised-language-model-pre), BARThez is well-suited for generative tasks, since not only its encoder but also its decoder is pretrained.

Source	OPT: Open Pre-trained Transformer Language Models
Year	2022
Data Source	CC BY-SA - https://paperswithcode.com