
What is: Generalized Mean Pooling?

Year: 2000
Data Source: CC BY-SA - https://paperswithcode.com

Generalized Mean Pooling (GeM) computes the generalized mean of each channel in a tensor. Formally:

\textbf{e} = \left[\left(\frac{1}{|\Omega|}\sum_{u\in\Omega} x^{p}_{cu}\right)^{\frac{1}{p}}\right]_{c=1,\cdots,C}

where p > 0 is a parameter, \Omega is the set of spatial locations, and x_{cu} is the activation of channel c at location u. Setting the exponent p > 1 increases the contrast of the pooled feature map and focuses on the salient features of the image. GeM is a generalization of the average pooling commonly used in classification networks (p = 1) and of the spatial max-pooling layer (p = \infty).
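As a concrete illustration, the formula above can be sketched in NumPy for a single feature map of shape (C, H, W). The function name `gem_pool`, the default p = 3, and the small epsilon clamp are assumptions for this sketch (the text does not specify them), though a positive clamp is needed in practice so that fractional powers are well defined:

```python
import numpy as np

def gem_pool(x, p=3.0, eps=1e-6):
    """Generalized Mean (GeM) pooling over the spatial locations Omega.

    x: feature map of shape (C, H, W).
    p: pooling exponent; p=1 gives average pooling, p -> inf approaches max pooling.
    eps: clamp to keep activations positive (assumption, not from the text).
    Returns a vector e of shape (C,), one pooled value per channel.
    """
    x = np.clip(x, eps, None)      # ensure x_{cu} > 0 before taking powers
    c = x.shape[0]
    flat = x.reshape(c, -1)        # flatten H*W spatial locations into Omega
    # ( (1/|Omega|) * sum_u x_{cu}^p )^(1/p), computed per channel
    return (flat ** p).mean(axis=1) ** (1.0 / p)

# Example: p=1 recovers the channel-wise spatial mean
x = np.random.default_rng(0).random((8, 4, 4)) + 0.1
e = gem_pool(x, p=1.0)
```

By the power-mean inequality, raising p never decreases the pooled value, which is why larger exponents emphasize the strongest (most salient) activations in each channel.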

Source: MultiGrain

Image Source: Eva Mohedano