
What is: AdamW?

Source: Decoupled Weight Decay Regularization
Year: 2017
Data Source: CC BY-SA - https://paperswithcode.com

AdamW is a stochastic optimization method that modifies the typical implementation of weight decay in Adam by decoupling weight decay from the gradient update. To see this, $L_{2}$ regularization in Adam is usually implemented with the modification below, where $w_{t}$ is the rate of the weight decay at time $t$:

$$g_{t} = \nabla f\left(\theta_{t}\right) + w_{t}\theta_{t}$$

while AdamW adjusts the weight decay term to appear in the gradient update:

$$\theta_{t+1, i} = \theta_{t, i} - \eta\left(\frac{1}{\sqrt{\hat{v}_{t} + \epsilon}}\cdot\hat{m}_{t} + w_{t, i}\theta_{t, i}\right),\ \forall t$$
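As a rough sketch of the difference (not the paper's reference implementation), the two updates can be written side by side in NumPy. The function names, hyperparameter defaults, and state layout here are illustrative assumptions, and $\epsilon$ is kept inside the square root to match the notation above:

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One Adam step with L2 regularization coupled into the gradient.

    t is the 1-based step count; m and v are the moment buffers,
    initialized to zeros by the caller.
    """
    # Decay term is folded into g_t before the moment estimates,
    # so it is also rescaled by the adaptive denominator below.
    g = grad + weight_decay * theta
    m = beta1 * m + (1.0 - beta1) * g           # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g       # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / np.sqrt(v_hat + eps)
    return theta, m, v

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step: weight decay decoupled from the gradient."""
    # Moment estimates see only the raw gradient...
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # ...and the decay term w * theta enters the update directly,
    # scaled by the learning rate but not by the adaptive denominator.
    theta = theta - lr * (m_hat / np.sqrt(v_hat + eps) + weight_decay * theta)
    return theta, m, v
```

The only difference is where the `weight_decay * theta` term enters: in the coupled version it passes through the moment estimates and is divided by the adaptive denominator, while in AdamW it is subtracted directly and scales only with the learning rate $\eta$.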