
What is: Voxel Transformer?

Source: Voxel Transformer for 3D Object Detection
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

VoTr is a Transformer-based 3D backbone for 3D object detection from point clouds. It contains a series of sparse and submanifold voxel modules. Submanifold voxel modules perform multi-head self-attention strictly on the non-empty voxels, while sparse voxel modules can extract voxel features at empty locations. Long-range relationships between voxels are captured via self-attention.
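To make the module structure concrete, below is a minimal sketch (not the official implementation) of a submanifold voxel attention layer in PyTorch: each non-empty voxel attends, via multi-head self-attention, to a gathered set of neighbouring non-empty voxels. The class and argument names (`SubmanifoldVoxelAttention`, `attend_idx`, `attend_mask`) are illustrative assumptions, not names from the paper or its codebase.

```python
# A sketch of submanifold voxel attention: every non-empty voxel is a query
# that attends only to a precomputed set of attending (non-empty) voxels.
import torch
import torch.nn as nn

class SubmanifoldVoxelAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats: torch.Tensor, attend_idx: torch.Tensor,
                attend_mask: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (N, C) features of the N non-empty voxels
        # attend_idx:  (N, K) indices of up to K attending voxels per query
        # attend_mask: (N, K) True where a slot is padding and must be ignored
        queries = voxel_feats.unsqueeze(1)        # (N, 1, C) one query per voxel
        keys = voxel_feats[attend_idx]            # (N, K, C) gathered neighbours
        out, _ = self.attn(queries, keys, keys, key_padding_mask=attend_mask)
        return self.norm(voxel_feats + out.squeeze(1))  # residual + layer norm

# Usage with random data: 100 non-empty voxels, 64 channels, 16 neighbours each.
feats = torch.randn(100, 64)
idx = torch.randint(0, 100, (100, 16))
mask = torch.zeros(100, 16, dtype=torch.bool)
out = SubmanifoldVoxelAttention(64)(feats, idx, mask)   # (100, 64)
```

A sparse voxel module would reuse the same attention core but place its queries at newly created voxel locations, which is how features can be extracted at positions that were empty in the input.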

Because non-empty voxels are naturally sparse yet numerous, directly applying a standard Transformer to voxels is non-trivial. To address this, VoTr uses a sparse voxel module and a submanifold voxel module, which operate effectively on empty and non-empty voxel positions, respectively. To further enlarge the attention range while keeping the computational overhead comparable to convolutional counterparts, two attention mechanisms are used for multi-head attention in these modules: Local Attention and Dilated Attention, as sketched below. Furthermore, Fast Voxel Query is used to accelerate the querying process in multi-head attention.
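The sketch below illustrates, for a single query voxel, how Local Attention, Dilated Attention, and Fast Voxel Query could fit together: local attention gathers a dense small neighbourhood, dilated attention samples offsets with growing strides so the range grows while the number of candidates stays modest, and a coordinate-to-index lookup keeps only the non-empty voxels. A Python dict stands in for the GPU hash table used by Fast Voxel Query, and all range parameters here are illustrative rather than the paper's exact settings.

```python
# Candidate attending-voxel generation for one query voxel (illustrative ranges).
import itertools
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]

def local_offsets(radius: int) -> List[Coord]:
    # Dense neighbourhood: every offset within `radius`, excluding the query itself.
    r = range(-radius, radius + 1)
    return [o for o in itertools.product(r, r, r) if o != (0, 0, 0)]

def dilated_offsets(ranges: List[Tuple[int, int, int]]) -> List[Coord]:
    # Each (start, end, stride) samples a shell with a coarser stride the
    # further it lies from the query voxel, enlarging the attention range cheaply.
    offsets = []
    for start, end, stride in ranges:
        axis = list(range(-end, end + 1, stride))
        for o in itertools.product(axis, axis, axis):
            if max(abs(v) for v in o) >= start:   # skip the inner, already-covered region
                offsets.append(o)
    return offsets

def fast_voxel_query(query: Coord, offsets: List[Coord],
                     voxel_hash: Dict[Coord, int]) -> List[int]:
    # Look up each candidate coordinate in the hash table; keep only non-empty voxels.
    found = []
    for dx, dy, dz in offsets:
        key = (query[0] + dx, query[1] + dy, query[2] + dz)
        if key in voxel_hash:
            found.append(voxel_hash[key])
    return found

# Example: three non-empty voxels indexed by their integer coordinates.
voxel_hash = {(0, 0, 0): 0, (1, 0, 0): 1, (4, 0, 0): 2}
near = fast_voxel_query((0, 0, 0), local_offsets(1), voxel_hash)             # -> [1]
far = fast_voxel_query((0, 0, 0), dilated_offsets([(2, 6, 2)]), voxel_hash)  # -> [2]
```

The indices returned by this kind of query are exactly what a layer such as the attention sketch above consumes as its attending set, so the query step determines how far each voxel can "see" at a given cost.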