
What is: Natural Gradient Descent?

Year: 1998
Data Source: CC BY-SA - https://paperswithcode.com

Natural Gradient Descent is an approximate second-order optimization method. It has an interpretation as optimizing over a Riemannian manifold using an intrinsic distance metric, which implies that its updates are invariant to transformations such as whitening. By using the positive semi-definite (PSD) Gauss-Newton matrix to approximate the (possibly negative definite) Hessian, NGD can often work better than exact second-order methods.

Given the gradient with respect to $z$, $g = \frac{\partial f(z)}{\partial z}$, NGD computes the update as:

$$\Delta z = \alpha F^{-1} g$$

where the Fisher information matrix $F$ is defined as:

$$F = \mathbb{E}_{p(t\mid z)}\left[\nabla \ln p(t\mid z)\, \nabla \ln p(t\mid z)^{T}\right]$$

The log-likelihood function $\ln p(t\mid z)$ typically corresponds to commonly used error functions such as the cross-entropy loss.
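As a concrete illustration, here is a minimal NumPy sketch of one natural gradient step for a toy softmax-regression model. It approximates the expectation in $F$ with the empirical Fisher built from per-sample gradients of $\ln p(t\mid z)$; the step size, damping term, model, and data are illustrative assumptions, not part of the original text.

```python
# Minimal sketch: one natural gradient step for softmax regression.
# Model: p(t | z) = softmax(W x), with parameters z = vec(W).
# The Fisher F = E[grad ln p  grad ln p^T] is approximated here by the
# empirical Fisher computed from the observed labels (an assumption).
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, D features, C classes (illustrative values).
N, D, C = 64, 5, 3
X = rng.normal(size=(N, D))
t = rng.integers(0, C, size=N)
W = rng.normal(scale=0.1, size=(C, D))   # parameters z, reshaped below

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

# Per-sample gradients of ln p(t | z) w.r.t. the flattened parameters.
probs = softmax(X @ W.T)                 # (N, C)
onehot = np.eye(C)[t]                    # (N, C)
per_sample_grads = np.einsum("nc,nd->ncd", onehot - probs, X).reshape(N, -1)

# Average gradient g and empirical Fisher F = E[grad grad^T].
g = per_sample_grads.mean(axis=0)
F = per_sample_grads.T @ per_sample_grads / N

# Natural gradient update dz = alpha * F^{-1} g, with damping so the
# (possibly singular) Fisher estimate is safely invertible.
alpha, damping = 0.1, 1e-3
dz = alpha * np.linalg.solve(F + damping * np.eye(F.shape[0]), g)
W += dz.reshape(C, D)   # g is the log-likelihood gradient, so this ascends it;
                        # negate dz when minimizing a loss such as cross entropy
print("step norm:", np.linalg.norm(dz))
```

The damping term is a common practical addition, since the estimated Fisher can be singular or ill-conditioned when the number of samples is small relative to the number of parameters.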

Source: LOGAN

Image: Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks