
What is: Voxel Transformer?

Source: Voxel Transformer for 3D Object Detection
Year: 2021
Data Source: CC BY-SA - https://paperswithcode.com

VoTr is a Transformer-based 3D backbone for 3D object detection from point clouds. It contains a series of sparse and submanifold voxel modules. Submanifold voxel modules perform multi-head self-attention strictly on the non-empty voxels, while sparse voxel modules can extract voxel features at empty locations. Long-range relationships between voxels are captured via self-attention.
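To make the module structure concrete, below is a minimal sketch (not the official implementation) of a submanifold voxel attention layer in PyTorch: each non-empty voxel attends, via multi-head self-attention, to a gathered set of neighbouring non-empty voxels. The class and argument names (`SubmanifoldVoxelAttention`, `attend_idx`, `attend_mask`) are illustrative assumptions, not names from the paper or its codebase.

```python
# A sketch of submanifold voxel attention: every non-empty voxel is a query
# that attends only to a precomputed set of attending (non-empty) voxels.
import torch
import torch.nn as nn

class SubmanifoldVoxelAttention(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats: torch.Tensor, attend_idx: torch.Tensor,
                attend_mask: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (N, C) features of the N non-empty voxels
        # attend_idx:  (N, K) indices of up to K attending voxels per query
        # attend_mask: (N, K) True where a slot is padding and must be ignored
        queries = voxel_feats.unsqueeze(1)        # (N, 1, C) one query per voxel
        keys = voxel_feats[attend_idx]            # (N, K, C) gathered neighbours
        out, _ = self.attn(queries, keys, keys, key_padding_mask=attend_mask)
        return self.norm(voxel_feats + out.squeeze(1))  # residual + layer norm

# Usage with random data: 100 non-empty voxels, 64 channels, 16 neighbours each.
feats = torch.randn(100, 64)
idx = torch.randint(0, 100, (100, 16))
mask = torch.zeros(100, 16, dtype=torch.bool)
out = SubmanifoldVoxelAttention(64)(feats, idx, mask)   # (100, 64)
```

A sparse voxel module would reuse the same attention core but place its queries at newly created voxel locations, which is how features can be extracted at positions that were empty in the input.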

Because non-empty voxels are naturally sparse yet numerous, directly applying a standard Transformer to voxels is non-trivial. To address this, VoTr uses a sparse voxel module and a submanifold voxel module, which operate effectively on empty and non-empty voxel positions, respectively. To further enlarge the attention range while keeping the computational overhead comparable to convolutional counterparts, two attention mechanisms are used for multi-head attention in these modules: Local Attention and Dilated Attention, as sketched below. Furthermore, Fast Voxel Query is used to accelerate the querying process in multi-head attention.
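The sketch below illustrates, for a single query voxel, how Local Attention, Dilated Attention, and Fast Voxel Query could fit together: local attention gathers a dense small neighbourhood, dilated attention samples offsets with growing strides so the range grows while the number of candidates stays modest, and a coordinate-to-index lookup keeps only the non-empty voxels. A Python dict stands in for the GPU hash table used by Fast Voxel Query, and all range parameters here are illustrative rather than the paper's exact settings.

```python
# Candidate attending-voxel generation for one query voxel (illustrative ranges).
import itertools
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]

def local_offsets(radius: int) -> List[Coord]:
    # Dense neighbourhood: every offset within `radius`, excluding the query itself.
    r = range(-radius, radius + 1)
    return [o for o in itertools.product(r, r, r) if o != (0, 0, 0)]

def dilated_offsets(ranges: List[Tuple[int, int, int]]) -> List[Coord]:
    # Each (start, end, stride) samples a shell with a coarser stride the
    # further it lies from the query voxel, enlarging the attention range cheaply.
    offsets = []
    for start, end, stride in ranges:
        axis = list(range(-end, end + 1, stride))
        for o in itertools.product(axis, axis, axis):
            if max(abs(v) for v in o) >= start:   # skip the inner, already-covered region
                offsets.append(o)
    return offsets

def fast_voxel_query(query: Coord, offsets: List[Coord],
                     voxel_hash: Dict[Coord, int]) -> List[int]:
    # Look up each candidate coordinate in the hash table; keep only non-empty voxels.
    found = []
    for dx, dy, dz in offsets:
        key = (query[0] + dx, query[1] + dy, query[2] + dz)
        if key in voxel_hash:
            found.append(voxel_hash[key])
    return found

# Example: three non-empty voxels indexed by their integer coordinates.
voxel_hash = {(0, 0, 0): 0, (1, 0, 0): 1, (4, 0, 0): 2}
near = fast_voxel_query((0, 0, 0), local_offsets(1), voxel_hash)             # -> [1]
far = fast_voxel_query((0, 0, 0), dilated_offsets([(2, 6, 2)]), voxel_hash)  # -> [2]
```

The indices returned by this kind of query are exactly what a layer such as the attention sketch above consumes as its attending set, so the query step determines how far each voxel can "see" at a given cost.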