Paper review: SECOND - Sparsely Embedded Convolutional Detection
- Author: Viet Anh (@vietanhdev)
LiDAR-based and RGB-D-based object detection are used in numerous applications, ranging from autonomous driving to robot vision. In this note, we review SECOND: Sparsely Embedded Convolutional Detection, a 3D object detection network that was state of the art when it was published in 2018. This note only sums up the main points of the paper. If you want to know the details, please refer to the full paper and the official source code.

1. Main contributions
SECOND makes three main contributions:
- SECOND is a voxel-based 3D object detection network. Unlike earlier voxel-based networks, it applies sparse convolution and proposes an improved sparse convolution method, which significantly speeds up training and inference.
- A new angle loss function that improves orientation estimation.
- A new data augmentation method for point clouds that improves convergence speed and detection performance.
2. The network architecture

SECOND converts the raw point cloud into voxel features and coordinates (1), then feeds them through voxel feature encoding layers and sparse convolution layers (2). Finally, an RPN (Region Proposal Network) generates the detections (3). Let's look at each step of the SECOND pipeline in more detail.
2.1. Point cloud grouping

Point cloud grouping is the first step of SECOND's pipeline. First, SECOND crops the point cloud to a range chosen from the distribution of objects in the dataset. After that, it preallocates buffers based on a specified limit on the number of voxels. Then, it iterates over the points and assigns each one to its associated voxel.
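The three grouping steps can be sketched in NumPy as follows. This is only an illustrative sketch, not the official implementation: the function name `group_points`, its parameters, the per-voxel point limit `max_points`, and the (x, y, z) coordinate order are all assumptions made for the example.

```python
import numpy as np

def group_points(points, pc_range, voxel_size, max_voxels, max_points):
    """Crop points to pc_range, then assign them to voxels using fixed-size buffers.

    points: (N, C) array whose first three columns are x, y, z.
    pc_range: (x_min, y_min, z_min, x_max, y_max, z_max).
    """
    pc_range = np.asarray(pc_range, dtype=np.float64)
    voxel_size = np.asarray(voxel_size, dtype=np.float64)
    lo, hi = pc_range[:3], pc_range[3:]

    # Step 1: crop the cloud to the region of interest.
    inside = np.all((points[:, :3] >= lo) & (points[:, :3] < hi), axis=1)
    points = points[inside]

    # Step 2: preallocate buffers from the voxel limit.
    voxels = np.zeros((max_voxels, max_points, points.shape[1]), dtype=np.float32)
    coords = np.zeros((max_voxels, 3), dtype=np.int32)
    counts = np.zeros(max_voxels, dtype=np.int32)
    index_of = {}  # voxel coordinate -> row in the buffers

    # Step 3: iterate over the points and fill the buffers.
    for p in points:
        key = tuple(np.floor((p[:3] - lo) / voxel_size).astype(np.int64).tolist())
        idx = index_of.get(key)
        if idx is None:
            if len(index_of) >= max_voxels:
                continue  # voxel budget exhausted: drop the point
            idx = len(index_of)
            index_of[key] = idx
            coords[idx] = key
        if counts[idx] < max_points:  # extra points in a full voxel are dropped
            voxels[idx, counts[idx]] = p
            counts[idx] += 1

    n = len(index_of)
    return voxels[:n], coords[:n], counts[:n]
```

The fixed-size buffers make the output shape predictable for batching; the price is that points are silently dropped once a voxel or the voxel budget is full.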
2.2. Feature extraction
Voxel feature encoding:
The feature extraction block consists of voxel feature encoding (VFE) layers followed by sparse convolution layers. The design of the VFE layers is taken from the VoxelNet paper.

A VFE layer takes all points in the same voxel as input. Each point's input features are passed through a fully connected layer, BatchNorm, and ReLU to produce pointwise features. Then, elementwise max-pooling across the points of the voxel yields a locally aggregated feature. Finally, this aggregated feature is concatenated with each pointwise feature to form the output encoding.
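A minimal NumPy sketch of one VFE layer is shown below. It assumes padded voxel buffers with a validity mask, and it omits BatchNorm to stay short; the function name `vfe_layer` and its argument layout are illustrative choices, not the paper's code.

```python
import numpy as np

def vfe_layer(voxel_points, mask, weight, bias):
    """One simplified VFE layer (BatchNorm omitted in this sketch).

    voxel_points: (V, T, C_in) padded points per voxel.
    mask:         (V, T) bool, True where a slot holds a real point.
    weight, bias: (C_in, C) and (C,) parameters of the fully connected layer.
    Returns:      (V, T, 2 * C) pointwise features concatenated with the
                  voxel-level max-pooled feature.
    """
    valid = mask[..., None]
    # Fully connected layer + ReLU applied to every point independently.
    pointwise = np.maximum(voxel_points @ weight + bias, 0.0) * valid
    # Elementwise max over the points of each voxel (padded slots are 0,
    # which cannot exceed the non-negative ReLU outputs).
    aggregated = pointwise.max(axis=1, keepdims=True)
    tiled = np.broadcast_to(aggregated, pointwise.shape)
    # Concatenate per-point and aggregated features, then re-mask the padding.
    return np.concatenate([pointwise, tiled], axis=-1) * valid
```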
Sparse convolution layers:
According to the paper, a point cloud in KITTI generates 5k–8k voxels with a sparsity of nearly 0.005. By applying sparse convolution layers, SECOND avoids a huge number of calculations in empty space. The authors also use submanifold convolution, which restricts outputs to locations that are already active, to prevent generating too many active locations. One of the main contributions of SECOND is a new rule generation algorithm that moves the rule generation step of sparse convolution (building the mapping from active input locations to output locations) to the GPU. Previous implementations of sparse convolution often use a hash table for this step, which is slow and requires a lot of CPU-GPU data transfer. SECOND's GPU implementation brings down the time spent on rule generation.
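To make "rules" and "active locations" concrete, here is a small 2D sketch of rule generation using a Python dict as the hash table. Note that this illustrates the older hash-table approach, not SECOND's GPU algorithm; the function name `build_rules` and the rule layout are assumptions for this example, and grid bounds are ignored.

```python
from itertools import product

def build_rules(active, kernel_size=3, submanifold=True):
    """Build convolution rules for a stride-1 sparse convolution in 2D.

    active: list of (y, x) active input sites.
    Returns (out_sites, rules), where rules[(dy, dx)] lists
    (input_index, output_index) pairs for that kernel offset.
    """
    r = kernel_size // 2
    offsets = list(product(range(-r, r + 1), repeat=2))
    in_index = {site: i for i, site in enumerate(active)}

    if submanifold:
        # Submanifold convolution: outputs exist only at active input sites.
        out_index = dict(in_index)
    else:
        # Regular sparse convolution: every site reached by the kernel is active.
        out_index = {}
        for y, x in active:
            for dy, dx in offsets:
                out_index.setdefault((y + dy, x + dx), len(out_index))

    rules = {offset: [] for offset in offsets}
    for (y, x), j in out_index.items():
        for dy, dx in offsets:
            i = in_index.get((y + dy, x + dx))
            if i is not None:
                rules[(dy, dx)].append((i, j))
    return list(out_index), rules
```

With three active sites `[(0, 0), (0, 1), (5, 5)]`, the submanifold variant keeps 3 output sites, while the regular variant dilates them to 21, which shows why stacking regular sparse convolutions quickly erodes sparsity.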

2.3. Detection generation: SSD-based Region Proposal Network with anchors
In the final step, SECOND generates detections from the feature map using a Region Proposal Network (RPN). The idea and architecture are based on SSD. The RPN applies several convolution, BatchNorm and ReLU layers to the extracted feature maps and then regresses object class, offsets and direction using three convolutions. As in SSD, the object anchor boxes are carefully selected based on the dataset.
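The sketch below shows how a dense anchor grid of this kind could be laid out. The concrete numbers in the usage note (a 200 x 176 feature map, a car-sized anchor of 1.6 x 3.9 x 1.56 m at z = -1.0, and two rotations) are assumptions recalled from the KITTI car setup, so check them against the paper; `make_anchors` is a hypothetical helper.

```python
import numpy as np

def make_anchors(pc_range, feature_map_hw, size, z_center, rotations):
    """Place one anchor per feature-map cell and rotation.

    Returns an (H * W * R, 7) array of (x, y, z, w, l, h, yaw).
    """
    x_min, y_min, _, x_max, y_max, _ = pc_range
    h, w = feature_map_hw
    # Cell centers in metric coordinates.
    xs = x_min + (np.arange(w) + 0.5) * (x_max - x_min) / w
    ys = y_min + (np.arange(h) + 0.5) * (y_max - y_min) / h
    yy, xx, rr = np.meshgrid(ys, xs, np.asarray(rotations), indexing="ij")

    anchors = np.zeros(yy.shape + (7,), dtype=np.float32)
    anchors[..., 0] = xx
    anchors[..., 1] = yy
    anchors[..., 2] = z_center
    anchors[..., 3:6] = size  # same box size for every anchor
    anchors[..., 6] = rr
    return anchors.reshape(-1, 7)
```

Under those assumed settings, `make_anchors((0, -40, -3, 70.4, 40, 1), (200, 176), (1.6, 3.9, 1.56), -1.0, (0.0, np.pi / 2))` produces 200 * 176 * 2 = 70,400 anchors, in line with the roughly 70k anchors mentioned for KITTI in the focal loss section.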

3. Loss functions
3.1. Angle loss
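VoxelNet directly regresses the angle offset in radians. This causes a problem when the predicted and ground-truth boxes have angles that differ by $\pi$ (for example, 0 and $\pi$): although the two boxes occupy almost the same space, the loss function outputs a large value. SECOND proposes a new sine-error angle loss that handles this situation:

$$L_\theta = \text{SmoothL1}(\sin(\theta_p - \theta_t))$$

where $\theta_p$ is the predicted angle and $\theta_t$ is the ground-truth angle.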
However, because this loss treats boxes with opposite directions as being the same, they add a direction classifier, which uses softmax loss to distinguish the direction of the objects.
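A short numeric sketch makes the difference visible. The sine-error form is the one from the paper; the helper names, the `beta` threshold of the smooth L1 function, and the "yaw greater than zero" direction rule are illustrative assumptions.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber-style) penalty applied elementwise."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def angle_loss(theta_pred, theta_gt):
    """Sine-error angle loss: SmoothL1(sin(theta_pred - theta_gt))."""
    return smooth_l1(np.sin(theta_pred - theta_gt))

def direction_target(theta_gt):
    """Binary label for the direction classifier (assumed rule: yaw > 0)."""
    return (np.asarray(theta_gt) > 0).astype(np.int64)

# A box flipped by pi: direct regression is heavily penalized,
# the sine-error loss is essentially zero.
direct = smooth_l1(np.pi - 0.0)
sine = angle_loss(np.pi, 0.0)
```

Here `direct` is about 2.64 while `sine` is numerically zero, which is exactly why the separate direction classifier is needed to tell the two headings apart.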
3.2. Focal loss for Classification
To handle the imbalance between the number of anchor boxes (~70k in KITTI) and the number of objects (~4-6 positives), SECOND uses focal loss for classification.
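For reference, a binary focal loss can be sketched as below. The defaults `alpha=0.25` and `gamma=2.0` are the commonly used values from the original focal loss paper and are assumed here rather than taken from this note.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Per-anchor binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs:   predicted probability of the positive class, shape (N,).
    targets: 0/1 labels, shape (N,).
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), 1e-7, 1.0 - 1e-7)
    targets = np.asarray(targets)
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of well-classified (easy) anchors.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

An easy negative (predicted 0.01, label 0) contributes almost nothing, while a badly missed positive (predicted 0.1, label 1) keeps a large loss, so the few positives are not drowned out by tens of thousands of background anchors.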
3.3. Total training loss
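Combining all the loss functions above, we have the total training loss:

$$L_{total} = \beta_1 L_{cls} + \beta_2 (L_{reg\text{-}\theta} + L_{reg\text{-}other}) + \beta_3 L_{dir}$$

where $L_{cls}$ is the classification loss,
$L_{reg\text{-}other}$ is the regression loss for location and dimension,
$L_{reg\text{-}\theta}$ is the angle loss,
and $L_{dir}$ is the direction classification loss. The coefficients $\beta_1$, $\beta_2$ and $\beta_3$ are constant weights that balance the terms; the paper uses $\beta_1 = 1.0$, $\beta_2 = 2.0$ and $\beta_3 = 0.2$.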
4. Data augmentation
Three main methods for data augmentation in SECOND are:
- (1) Sample Ground Truths from the Database: copy object points and labels from a ground-truth database into the training point clouds, with a collision check to reject physically impossible placements.
- (2) Object Noise: augment each object independently with random rotations and linear transformations.
- (3) Global Rotation and Scaling: apply a random rotation and scaling to the whole point cloud and all of its ground-truth boxes.
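As an illustration of method (3), the following sketch rotates and scales a whole scene together with its boxes. The box layout (x, y, z, w, l, h, yaw), the function names, and the default ranges (rotation within ±π/4, scale within 0.95 to 1.05) are assumptions for the example; consult the paper or the official code for the exact settings.

```python
import numpy as np

def rotate_z(xyz, angle):
    """Rotate (N, 3) coordinates around the z-axis by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return xyz @ rot.T

def global_rotation_scaling(points, boxes, rng,
                            max_angle=np.pi / 4, scale_range=(0.95, 1.05)):
    """Apply one random rotation and scale to the whole scene.

    points: (N, C) with x, y, z in the first three columns.
    boxes:  (M, 7) as (x, y, z, w, l, h, yaw).
    """
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    points, boxes = points.copy(), boxes.copy()  # leave the inputs untouched

    points[:, :3] = rotate_z(points[:, :3], angle) * scale
    boxes[:, :3] = rotate_z(boxes[:, :3], angle) * scale  # box centers
    boxes[:, 3:6] *= scale                                # box dimensions
    boxes[:, 6] += angle                                  # box heading
    return points, boxes, angle, scale
```

Points and boxes must receive the same transform, otherwise the labels no longer match the geometry. Extra point channels such as intensity are left unchanged.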
5. References
- SECOND https://www.researchgate.net/publication/328158485_SECOND_Sparsely_Embedded_Convolutional_Detection.
- VoxelNet https://ieeexplore.ieee.org/document/8578570.
- Apple's new self-driving car tech: Voxelnet is quite Awesome https://www.techexplorist.com/apples-new-self-driving-car-tech-voxelnet-quite-awesome/8925/.
- My slides: Second-3D-Object-Detection.pdf.