
What is: Stochastic Gradient Descent?

Year: 1951
Data Source: CC BY-SA - https://paperswithcode.com

Stochastic Gradient Descent is an iterative optimization technique that uses minibatches of data to form an estimate of the gradient, rather than computing the full gradient over all available data. That is, for weights $w$ and a loss function $L$ we have:

$$w_{t+1} = w_{t} - \eta\hat{\nabla}_{w}L(w_{t})$$

Where $\eta$ is the learning rate. SGD reduces redundancy compared to batch gradient descent - which recomputes gradients for similar examples before each parameter update - so it is usually much faster.
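The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear-regression loss, the data, and all variable names are our own assumptions, chosen only to show the minibatch gradient estimate $\hat{\nabla}_{w}L(w_t)$ standing in for the full gradient.

```python
import numpy as np

# Illustrative minibatch SGD on a least-squares problem (assumed setup).
# Loss: L(w) = (1/2n) * ||Xw - y||^2; a minibatch gradient estimates the full gradient.

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth weights
y = X @ true_w

def minibatch_grad(w, X_b, y_b):
    # Gradient of the mean squared error on one minibatch:
    # grad_hat = X_b^T (X_b w - y_b) / batch_size
    return X_b.T @ (X_b @ w - y_b) / len(y_b)

w = np.zeros(d)
eta = 0.1          # learning rate (eta in the update rule)
batch_size = 16
for epoch in range(100):
    perm = rng.permutation(n)          # reshuffle so minibatches differ each epoch
    for start in range(0, n, batch_size):
        idx = perm[start:start + batch_size]
        # w_{t+1} = w_t - eta * grad_hat
        w -= eta * minibatch_grad(w, X[idx], y[idx])

print(np.round(w, 3))
```

Each update touches only `batch_size` examples, which is why SGD is cheaper per step than batch gradient descent; on this noise-free problem the iterates converge to `true_w`.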
