
What is: Chinchilla?

Source: Training Compute-Optimal Large Language Models
Year: 2022
Data Source: CC BY-SA - https://paperswithcode.com

Chinchilla is a 70B-parameter model trained compute-optimally on 1.4 trillion tokens. The paper's findings suggest that compute-optimal models should scale model size and the number of training tokens in equal proportion. Chinchilla uses the same compute budget as Gopher, i.e. the same number of training FLOPs, but with 4x more training data. It is trained on MassiveText with a slightly modified SentencePiece tokenizer. More architectural details can be found in the paper.
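To make the trade-off concrete, here is a minimal Python sketch (not from the paper) of how a compute budget splits into parameters and tokens, assuming the common C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio implied by scaling both in equal proportion; the example budget value is illustrative.

# Minimal sketch of the Chinchilla compute-optimal allocation.
# Assumptions (not taken verbatim from the paper): training FLOPs
# C ~ 6 * N * D for N parameters and D tokens, and a fixed ratio of
# roughly 20 training tokens per parameter.

def compute_optimal_allocation(flops_budget, tokens_per_param=20.0):
    """Split a FLOP budget into a parameter count and a token count."""
    # From C = 6 * N * D and D = tokens_per_param * N:
    #   N = sqrt(C / (6 * tokens_per_param)),  D = tokens_per_param * N
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a Gopher/Chinchilla-scale budget of roughly 5.76e23 FLOPs
params, tokens = compute_optimal_allocation(5.76e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
# -> roughly 70B parameters and 1.4T tokens, matching Chinchilla's setup

Under these assumptions the same budget that trained the 280B-parameter Gopher instead yields a smaller model trained on far more data, which is exactly the trade Chinchilla makes.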