
What is: DSelect-k?

Source: DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

DSelect-k is a continuously differentiable and sparse gate for Mixture of Experts (MoE), based on a novel binary encoding formulation. Given a user-specified parameter k, the gate selects at most k out of the n experts. The gate can be trained using first-order methods, such as stochastic gradient descent (SGD), and offers explicit control over the number of experts selected. This explicit control over sparsity leads to a cardinality-constrained optimization problem, which is computationally challenging. To circumvent this challenge, the authors use an unconstrained reformulation that is equivalent to the original problem. The reformulated problem uses a binary encoding scheme to implicitly enforce the cardinality constraint: each of the k expert choices is represented by roughly log2(n) binary variables that encode an expert's index. By carefully smoothing these binary encoding variables, the reformulated problem can be effectively optimized using first-order methods such as SGD.
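To make the construction concrete, below is a minimal PyTorch sketch of a static (input-independent) DSelect-k style gate. The class name DSelectKGate, the smooth-step width gamma, and the initialization are illustrative assumptions rather than the authors' code; the paper's per-example variant additionally makes the selector parameters functions of the input, and a regularizer is used in practice to push the smoothed variables toward binary values.

```python
import math

import torch
import torch.nn as nn


def smooth_step(t, gamma=1.0):
    """Cubic smooth-step: 0 for t <= -gamma/2, 1 for t >= gamma/2,
    and a continuously differentiable cubic in between."""
    x = torch.clamp(t / gamma + 0.5, 0.0, 1.0)
    return x * x * (3.0 - 2.0 * x)


class DSelectKGate(nn.Module):
    """Sketch of a static DSelect-k style gate over n experts (illustrative)."""

    def __init__(self, n_experts, k, gamma=1.0):
        super().__init__()
        m = max(1, math.ceil(math.log2(n_experts)))  # bits per expert index
        self.gamma = gamma
        # z: k single-expert selectors, each with m smoothed binary variables.
        self.z = nn.Parameter(torch.empty(k, m).uniform_(-0.5 * gamma, 0.5 * gamma))
        # w: logits for combining the k selectors.
        self.w = nn.Parameter(torch.zeros(k))
        # Binary codes b(i) in {0,1}^m for each expert index i.
        codes = [[(i >> j) & 1 for j in range(m)] for i in range(n_experts)]
        self.register_buffer("codes", torch.tensor(codes, dtype=torch.float32))

    def forward(self):
        s = smooth_step(self.z, self.gamma)  # (k, m), entries in [0, 1]
        b = self.codes                       # (n, m)
        # r[t, i] = prod_j [ s[t, j] if bit j of i is 1 else 1 - s[t, j] ];
        # once every s[t, j] saturates to 0 or 1, selector t picks exactly
        # one expert, so the gate output has at most k nonzero entries.
        r = (s.unsqueeze(1) * b + (1.0 - s.unsqueeze(1)) * (1.0 - b)).prod(dim=-1)
        # Convex combination of the k selectors over the n experts.
        return torch.softmax(self.w, dim=0) @ r  # (n,)


# Usage: weights are fully differentiable, so the gate trains with plain SGD.
# With n_experts a power of two, every bit pattern maps to an expert and the
# weights sum to 1; the paper handles the general case as well.
gate = DSelectKGate(n_experts=8, k=2)
weights = gate()
```

The binary encoding is what keeps the parameter count small: selecting one of n experts needs only about log2(n) smoothed variables per selector, and the cardinality constraint is satisfied by construction rather than enforced explicitly.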

The motivation for this method is that existing sparse gates, such as Top-k, are not smooth, and this lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods.