
What is: AdaGrad?

Year: 2011
Data Source: CC BY-SA - https://paperswithcode.com

AdaGrad is a stochastic optimization method that adapts the learning rate to the parameters. It performs smaller updates for parameters associated with frequently occurring features, and larger updates for parameters associated with infrequently occurring features. In its update rule, AdaGrad modifies the general learning rate $\eta$ at each time step $t$ for every parameter $\theta_i$ based on the past gradients for $\theta_i$:

$$\theta_{t+1, i} = \theta_{t, i} - \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \, g_{t, i}$$

where $g_{t, i}$ is the gradient of the objective with respect to $\theta_i$ at time step $t$, $G_t$ is a diagonal matrix whose $i$-th diagonal entry is the sum of the squares of the gradients with respect to $\theta_i$ up to time step $t$, and $\epsilon$ is a small smoothing term that avoids division by zero.
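As a concrete illustration, here is a minimal NumPy sketch of one AdaGrad step; the function name `adagrad_update` and the toy quadratic objective are illustrative choices, not part of any particular library.

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    """Apply one AdaGrad step.

    theta : parameter vector
    grad  : gradient of the objective w.r.t. theta at this step
    accum : running sum of squared gradients (the diagonal of G_t)
    """
    accum = accum + grad ** 2                         # G_{t,ii} grows by g_{t,i}^2
    theta = theta - lr / np.sqrt(accum + eps) * grad  # per-parameter effective step
    return theta, accum

# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, accum = adagrad_update(theta, grad, accum)
print(theta)  # both components have moved toward zero
```

Note that the accumulator gives each parameter its own effective learning rate: a coordinate that has seen large or frequent gradients is scaled down more aggressively than one that has rarely been updated.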

The benefit of AdaGrad is that it eliminates the need to manually tune the learning rate; most implementations leave it at the default value of 0.01. Its main weakness is the accumulation of the squared gradients in the denominator: since every added term is positive, the accumulated sum keeps growing during training, causing the learning rate to shrink until it eventually becomes infinitesimally small, at which point the algorithm can no longer make meaningful updates.
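To see the decay concretely, assume (as an idealized simplification) that the gradient for a given parameter keeps a roughly constant magnitude $g$ at every step. The accumulated sum then grows linearly with $t$, so the effective learning rate falls off like $1/\sqrt{t}$:

$$G_{t, ii} \approx t\,g^2 \quad\Rightarrow\quad \frac{\eta}{\sqrt{G_{t, ii} + \epsilon}} \approx \frac{\eta}{g\sqrt{t}}$$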

Image: Alec Radford