Paper review: SECOND - Sparsely Embedded Convolutional Detection

LiDAR-based or RGB-D-based object detection is used in numerous applications, ranging from autonomous driving to robot vision. In this note, we review SECOND: Sparsely Embedded Convolutional Detection, a SOTA 3D object detection network in 2018. This note only sums up the main points of the paper. If you want to know the details, please refer to the full paper and the official source code.

The structure of our proposed SECOND detector. Source: SECOND paper

1. Main contributions

SECOND makes three main contributions:

  • SECOND is a voxel-based 3D object detection network that applies sparse convolution and introduces an improved sparse convolution method, which significantly speeds up training and inference.
  • A new angle loss function to improve the orientation estimation.
  • A new data augmentation method for point cloud to enhance the convergence speed and performance.

2. The network architecture

The pipeline of SECOND detector

SECOND converts the raw point cloud into voxel features and coordinates (1), then feeds them through voxel feature encoding layers and sparse convolution layers (2). Finally, an RPN (Region Proposal Network) generates the detections (3). The sections below walk through each step of the pipeline.

2.1. Point cloud grouping

Point cloud grouping

Point cloud grouping is the first step of the SECOND pipeline. First, SECOND crops the point cloud to a range chosen from the distribution of objects in the dataset. It then preallocates buffers based on a specified limit on the number of voxels. Finally, it iterates over the points and assigns each one to its associated voxel.
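The grouping step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the official implementation: the function name `group_points`, its parameters, and the toy crop range and voxel size are all assumptions made for this example.

```python
import math


def group_points(points, voxel_size, pc_range, max_voxels, max_points_per_voxel):
    """Assign points to voxels with a hash map, honoring both capacity limits.

    points: iterable of (x, y, z, reflectance) tuples.
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max) crop box.
    Returns (coords, voxels): the grid index of each non-empty voxel and
    the list of points stored in it.
    """
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    index_of = {}  # voxel grid index -> row in the output lists
    coords = []
    voxels = []
    for p in points:
        x, y, z = p[0], p[1], p[2]
        # Step 1: crop, dropping points outside the region of interest.
        if not (x_min <= x < x_max and y_min <= y < y_max and z_min <= z < z_max):
            continue
        key = (
            math.floor((x - x_min) / voxel_size[0]),
            math.floor((y - y_min) / voxel_size[1]),
            math.floor((z - z_min) / voxel_size[2]),
        )
        row = index_of.get(key)
        if row is None:
            # Step 2: respect the voxel budget (the "preallocated" size).
            if len(voxels) >= max_voxels:
                continue
            row = len(voxels)
            index_of[key] = row
            coords.append(key)
            voxels.append([])
        # Step 3: keep at most max_points_per_voxel points in each voxel.
        if len(voxels[row]) < max_points_per_voxel:
            voxels[row].append(p)
    return coords, voxels


# Toy data (hypothetical): two points share a voxel, one point is cropped away.
pts = [(0.1, 0.1, 0.1, 0.5), (0.15, 0.12, 0.1, 0.4),
       (1.3, 0.1, 0.1, 0.9), (99.0, 0.0, 0.0, 0.1)]
coords, voxels = group_points(pts, (0.5, 0.5, 0.5), (0, 0, 0, 4, 4, 2),
                              max_voxels=10, max_points_per_voxel=35)
print(coords)                    # [(0, 0, 0), (2, 0, 0)]
print([len(v) for v in voxels])  # [2, 1]
```

A real implementation would write into fixed-size arrays instead of growing Python lists, which is what the preallocation in the text refers to.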

2.2. Feature extraction

Voxel feature encoding:

The feature extraction block consists of voxel feature encoding (VFE) layers followed by sparse convolution layers. The design of the VFE layers is taken from the VoxelNet paper.

Voxel feature encoding layer. Source: https://www.researchgate.net/figure/Structure-of-voxel-feature-extraction-network_fig2_338876233

A VFE layer takes all points in the same voxel as input. Each point is passed through a fully connected layer, BatchNorm and ReLU to extract pointwise features. Element-wise max-pooling across the points then yields a locally aggregated feature for the voxel. Finally, this aggregated feature is concatenated with each pointwise feature to form the voxel feature encoding.
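To make the data flow concrete, here is a minimal sketch of one VFE layer for a single voxel in plain Python. It is a simplification under stated assumptions: BatchNorm is omitted, the weights are passed in by hand, and the name `vfe_layer` is hypothetical.

```python
def vfe_layer(points, weights, bias):
    """One simplified VFE layer for a single voxel (BatchNorm omitted).

    points: list of per-point feature lists inside ONE voxel.
    weights: out_dim x in_dim matrix as nested lists; bias: list of out_dim.
    Returns one feature list of size 2 * out_dim per point.
    """
    # Fully connected layer followed by ReLU, applied to each point.
    pointwise = [
        [max(0.0, sum(w * x for w, x in zip(row, p)) + b)
         for row, b in zip(weights, bias)]
        for p in points
    ]
    # Element-wise max over all points: the locally aggregated feature.
    aggregated = [max(column) for column in zip(*pointwise)]
    # Concatenate each pointwise feature with the aggregated feature.
    return [feature + aggregated for feature in pointwise]


# Two points with two features each; identity weights keep the numbers readable.
out = vfe_layer([[1.0, 2.0], [3.0, -1.0]],
                weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
print(out)  # [[1.0, 2.0, 3.0, 2.0], [3.0, 0.0, 3.0, 2.0]]
```

Every point in the voxel ends up carrying the same aggregated half, which is how local context reaches the pointwise features.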

Sparse convolution layers:

According to the paper, the point clouds in KITTI generate 5k–8k voxels with a sparsity of nearly 0.005. By applying sparse convolution layers, SECOND avoids a huge number of calculations in empty space. It also uses submanifold convolution to prevent generating too many active locations. One of the main contributions of SECOND is a new rule generation algorithm that moves the rule generation step of sparse convolution to the GPU. Previous implementations of sparse convolution often used a hash table on the CPU, which is slow and requires a lot of CPU-GPU data transfer. The GPU implementation in SECOND reduces the time spent on rule generation.
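To show what a "rule" is, the sketch below builds the rule table for a 2D submanifold convolution with a Python dictionary and then applies it with gather and scatter-add. Note that this is the hash-table style of rule generation that the paper improves on, not the GPU algorithm from SECOND. The function names, the single-channel features and the 2D grid are assumptions chosen to keep the example short.

```python
def build_rules(active, kernel_offsets):
    """Submanifold rule generation with a hash table (CPU-style baseline).

    active: list of (y, x) coordinates of the non-empty input sites.
    Returns {kernel_index: [(input_row, output_row), ...]}.
    Outputs exist only at active input sites, so the active set never grows.
    """
    row_of = {coord: i for i, coord in enumerate(active)}
    rules = {k: [] for k in range(len(kernel_offsets))}
    for out_row, (y, x) in enumerate(active):
        for k, (dy, dx) in enumerate(kernel_offsets):
            in_row = row_of.get((y + dy, x + dx))
            if in_row is not None:
                rules[k].append((in_row, out_row))
    return rules


def sparse_conv(features, rules, kernel):
    """Apply single-channel kernel weights using the gather / scatter-add rules."""
    out = [0.0] * len(features)
    for k, pairs in rules.items():
        for in_row, out_row in pairs:
            out[out_row] += kernel[k] * features[in_row]
    return out


# A 3x3 kernel of ones over three active sites; two of them are neighbors.
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
active = [(0, 0), (0, 1), (5, 5)]
rules = build_rules(active, offsets)
print(sum(len(pairs) for pairs in rules.values()))    # 5
print(sparse_conv([1.0, 2.0, 4.0], rules, [1.0] * 9))  # [3.0, 3.0, 4.0]
```

Only 5 multiply-adds are performed here, however large the surrounding grid is. A dense 3x3 convolution would perform 9 per grid cell.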

Table from SECOND paper: Comparison of the execution speeds of various convolution implementations. SparseConvNet is the official implementation of submanifold convolution. All benchmarks were run on a GTX 1080 Ti GPU with the data from the KITTI dataset.

2.3. Detection generation: SSD-based Region Proposal Network with anchors

In the final step, SECOND generates detections from the feature map using a Region Proposal Network (RPN). The idea and architecture are based on SSD. The RPN applies several convolution, BatchNorm and ReLU layers to the extracted feature maps and then regresses object class, offsets and direction using three $1 \times 1$ convolutions. As in SSD, the object anchor boxes are carefully selected based on the dataset.

SSD-like Region Proposal Network

3. Loss functions

3.1. Angle loss

VoxelNet directly predicts the angle offset in radians. This causes a problem when a predicted box and its ground truth have angles that differ by $\pi$, for example 0 and $\pi$: the two boxes are almost the same, yet the loss function outputs a large value. SECOND proposes a new angle loss function that handles this situation:

$$ L_\theta = \text{SmoothL1}(\sin(\theta_p - \theta_t)) $$

However, because this loss treats boxes with opposite directions as being the same, they add a direction classifier, which uses softmax loss to distinguish the direction of the objects.
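A quick numerical check makes the difference visible. The sketch below compares a direct angle regression loss with the sine-error loss on a pair of boxes that face opposite ways. The function names and the `beta` threshold of the Smooth L1 function are assumptions for this illustration.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def direct_angle_loss(theta_p, theta_t):
    """Regress the raw angle difference, as in direct radian prediction."""
    return smooth_l1(theta_p - theta_t)


def sine_angle_loss(theta_p, theta_t):
    """Sine-error loss: SmoothL1(sin(theta_p - theta_t))."""
    return smooth_l1(math.sin(theta_p - theta_t))


# Prediction and ground truth differ by pi: nearly the same box, flipped heading.
print(direct_angle_loss(math.pi, 0.0))  # large, about 2.64
print(sine_angle_loss(math.pi, 0.0))    # practically zero
# A small genuine error is still penalized by both losses.
print(direct_angle_loss(0.1, 0.0), sine_angle_loss(0.1, 0.0))
```

Because the sine loss cannot tell a box from its flipped copy, the heading itself has to come from somewhere else, which is the job of the direction classifier.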

3.2. Focal loss for Classification

To handle the imbalance between the number of anchor boxes (~70k in KITTI) and the number of objects (~4-6 positives), SECOND uses focal loss for classification.
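The effect of focal loss on this imbalance is easy to see for a single anchor. The sketch below is a minimal binary version. The values `alpha=0.25` and `gamma=2.0` are the defaults from the focal loss paper and are assumed here, and the function name is hypothetical.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one anchor.

    p: predicted foreground probability, strictly between 0 and 1.
    target: 1 for a positive anchor, 0 for a negative (background) anchor.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified anchors.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_negative = focal_loss(0.01, target=0)  # background, confidently rejected
hard_positive = focal_loss(0.10, target=1)  # object, mostly missed
print(easy_negative, hard_positive)
print(-math.log(0.99))  # plain cross-entropy of the same easy negative
```

With tens of thousands of easy negatives per frame, plain cross-entropy lets their small losses add up and dominate the gradient. The modulating factor cuts each of them by several orders of magnitude, so the few positives still matter.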

3.3. Total training loss

Combining all the loss functions above, we have a total training loss:

$$ L_{total} = \beta_1 L_{cls} + \beta_2 (L_{reg-\theta} + L_{reg-other}) + \beta_3 L_{dir} $$

where $L_{cls}$ is the classification loss, $L_{reg-other}$ is the regression loss for location and dimension, $L_{reg-\theta}$ is the angle loss, and $L_{dir}$ is the direction classification loss.
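In code, the combination is a plain weighted sum. The sketch below uses default weights of 1.0, 2.0 and 0.2, which are our reading of the paper's settings and should be treated as an assumption to verify against the paper or the official configuration files. The function name and the sample loss values are hypothetical.

```python
def total_loss(l_cls, l_reg_theta, l_reg_other, l_dir,
               beta1=1.0, beta2=2.0, beta3=0.2):
    """Weighted sum of the four loss terms; the beta defaults are assumed values."""
    return beta1 * l_cls + beta2 * (l_reg_theta + l_reg_other) + beta3 * l_dir


# Made-up loss values, only to show how the terms are weighted.
print(total_loss(l_cls=1.0, l_reg_theta=0.5, l_reg_other=0.25, l_dir=2.0))
```

The small weight on the direction term reflects its supporting role: it only has to resolve the flip ambiguity left by the angle loss.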

4. Data augmentation

SECOND uses three main data augmentation methods:

  • (1) Sample Ground Truths from the Database: copy the points and labels of ground-truth objects from a database into the current training point cloud, with a collision test to reject physically impossible placements.
  • (2) Object Noise: perturb each object independently with random rotations and linear transformations.
  • (3) Global Rotation and Scaling: applied to the whole point cloud and all ground-truth boxes.
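The first method, ground-truth sampling with a collision test, can be sketched as follows. This is a simplified illustration: boxes are treated as axis-aligned rectangles in bird's-eye view, whereas a real implementation has to test rotated boxes. The function names and data layout are assumptions made for this example.

```python
def bev_overlap(a, b):
    """Axis-aligned overlap test in bird's-eye view.

    Boxes are (center_x, center_y, length, width); rotation is ignored here.
    """
    return (abs(a[0] - b[0]) * 2 < a[2] + b[2]
            and abs(a[1] - b[1]) * 2 < a[3] + b[3])


def paste_ground_truths(scene_boxes, scene_points, sampled):
    """Paste sampled database objects into a scene, skipping any that collide.

    sampled: list of (box, points) pairs drawn from the ground-truth database.
    Returns the augmented box list and point list.
    """
    boxes = list(scene_boxes)
    points = list(scene_points)
    for box, obj_points in sampled:
        if any(bev_overlap(box, other) for other in boxes):
            continue  # collision: pasting here would create an impossible scene
        boxes.append(box)
        points.extend(obj_points)
    return boxes, points


scene_boxes = [(10.0, 0.0, 4.0, 2.0)]
scene_points = [(10.0, 0.0, -1.0)]
sampled = [
    ((11.0, 0.5, 4.0, 2.0), [(11.0, 0.5, -1.0)]),                     # collides
    ((20.0, 5.0, 4.0, 2.0), [(20.0, 5.0, -1.0), (20.5, 5.0, -1.0)]),  # fits
]
boxes, points = paste_ground_truths(scene_boxes, scene_points, sampled)
print(len(boxes), len(points))  # 2 3
```

Each accepted object is added to the box list before the next candidate is checked, so pasted objects cannot overlap each other either.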
