Advanced driver-assistance system on Jetson Nano Part 3

Recently, I have built a prototype of an advanced driver-assistance system (ADAS) using a Jetson Nano computer. In this project, I have successfully deployed 3 deep neural networks and some computer vision algorithms on a super cheap hardware of Jetson Nano. In the last two posts, I have introduced the system in hardware and software design. In this week, I write about two machine learning modules: Object Detection Module and Lane Detection Module. I will focus on three core deep neural networks: an object detection network based on CenterNet, a ResNet-18 based traffic sign classification network and a U-Net based lane line segmentation network. I also introduce some experimental results in training and optimizing these networks.

I. Object Detection Module

Based on hardware constraints described in the first post of this series, in this section, I will introduce one of the key modules in machine learning block - Object Detection Module. This module is responsible for detect front obstacle objects such as other vehicles or pedestrians, and traffic signs. These results can be used for forward collision warning and over-speed warning. To provide these functions, the module contains two main components: a CenterNet based object detection neural network and a ResNet-18 based traffic sign classification network.

1. CenterNet-based Object detection network

Traffic object detection is a key deep neural network which contributes in forward collision warning and overspeed warning functions in my system.

Background

Recently, a trend in object detection improvement is to treat object detection as key point estimation problem. CenterNet Objects as Points introduced in 2019 uses keypoint estimation to find object center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. The simplicity of this method allows CenterNet to run at a very high speed and outperform a range of state-of-the-art algorithms. In this object detection module, I choose CenterNet as the main object detection network because its simplicity and efficiency make it suitable for embedded hardware. I trained CenterNet using Berkeley DeepDrive (BDD) dataset with 10 classes: person, rider, car, bus, truck, bike, motor, traffic light, traffic sign and train. I have a blog post about CenterNet here (It's only available in Vietnamese).

Backbones

In this project, I use MobileNetV2 and ResNet-18 as the backbone of CenterNet.

Nowadays, state-of-the-art CNN architectures go deeper and deeper. While AlexNet had solely 5 convolutional layers, the VGG network and GoogleNet had 19 and 22 layers respectively. However, increasing network depth does not work by merely stacking layers along. Deep networks are hard to train because of the vanishing gradient problem - as the gradient is back-propagated to earlier layers, repeated multiplication may make the gradient become very small. As a result, once the network goes deeper, its performance gets saturated or begins to degrade quickly. The core idea of ResNet solution is introducing an “identity shortcut connection” that skips one or more layers. Following is the image of how to construct an “identity shortcut connection”. Instead of learning a direct mapping of $x \rightarrow y$ with a function $H(x)$ , let us define the residual function using $F(x) = H(x) - x$ , which can be reframed into $H(x) = F(x) + x$ , where $F(x)$ and $x$ represents the stacked non-linear layers and the identity function respectively. The author's hypothesis is that it is easy to optimize the residual mapping function $F(x)$ than to optimize the original, unreferenced mapping $H(x)$ . By this way, they can construct networks with much more layers. ResNet-18 is a lightweight 18-layer network with residual blocks.

Residual Block - Image from ResNet paper

MobileNetV2 is a lightweight architecture targeting in mobile and embedded devices. This network also uses residual architecture like ResNet. Moreover, depth-wise separable convolution is used which dramatically reduce the complexity cost and model size of the network.

Dataset

BDD100K is used as the main dataset for object detection. In the original download website, authors divided their dataset into three subsets: training, validation, and test, which I will call original training set, original validation set, and original test set, respectively. I could not download the original test set because it was not available during my project schedule, so I only use original training and original validation sets of this dataset. BDD100K original validation set is setup up as test set in my experiments. Besides, I randomly split original training set into training set and validation set with ratio 85:15 for my experiments. Below is the distribution of the datasets used in this project.

Table: Object detection test set: 10000 images (~12.5% number of images in the whole dataset)

Class	person	rider	car	bus	truck	bike	motor	traffic light	traffic sign	train
Number of images	13262	649	102506	1597	4245	1007	452	26885	34908	15

Table: Object detection training set: 59383 images (~74% number of images in the whole dataset)

Class	person	rider	car	bus	truck	bike	motor	traffic light	traffic sign	train
Number of images	77637	3853	605279	9950	25493	6157	2576	157491	203297	119

Table: Object detection validation set: 10480 images (~13.5% number of images in the whole dataset)

Class	person	rider	car	bus	truck	bike	motor	traffic light	traffic sign	train
Number of images	13712	664	107932	1722	25493	1053	426	28626	36389	17

All datasets are converted into COCO-like object detection format.

Experiments

ResNet-18 and MobileNetV2 are two lightweight backbones which are used in my experiments. For training, I use the source code from CenterNet's authors with some modifications:

ResNet-18 backbone was implemented in the original source code from CenterNet's authors. This source code was written with PyTorch framework. I trained this network with three image sizes 224x224, 384x384, 512x512 and batch size 32. Learning rate is set to $10^{-4}$ , reduced to $10^{-5}$ on epoch 90, $10^{-6}$ on epoch 120. These setups were trained in 140 epochs.
Besides ResNet-18, MobileNetV2 backbone was added into the original source code. I trained this network with three image sizes 224x224, 384x384, 512x512 and batch size 32. Learning rate is set to $5x10^{-4}$ , reduced to $5x10^{-5}$ on epoch 35. These setups were trained in 70 epochs.

Result

The mean average precision and the inference speed of trained models are described in below table (system configuration: Intel Core i5 8400 and NVIDIA RTX 2070). As shown in the table, the inference time of CenterNet – MobileNet V2 models are lower than CenterNet – ResNet-18 models. However, these models are also less accurate than CenterNet – ResNet-18 models.

Table: Mean average accuracy and inference time of CenterNet models. mAP: mean average precision, IoU: intersection over union.

Model optimization for embedded hardware

After training CenterNet using PyTorch framework, we obtain model files in PyTorch model format (.pth). In order to optimize inference speed on NVIDIA Jetson Nano, we need to convert these models to TensorRT engine file. The conversion is done via an intermediate format called ONNX (Open Neural Network Exchange). PyTorch model is converted to ONNX format first using PyTorch ONNX module (step 1). After that, we convert ONNX model to TensorRT engine for each inference platform (step 2). Because the conversion from ONNX to TensorRT engine takes a long time, in my implementation, I serialize TensorRT engine to hard disk after converting and load it every time the program starts. In this step, we have to notice that TensorRT engine is built differently on different computer hardware. Therefore, we need to rebuild the engine if we need to inference on other hardware configuration.

After the engine conversion, I test the result on the test set. The accuracy and inference time of the models are described in following table. The inference time of CenterNet – MobileNet V2 models are lower than CenterNet – ResNet-18 models. However, these models are also less accurate than CenterNet – ResNet-18 models. The accuracy difference between the original models and the converted models is negligible. The difference in accuracy between float 16 and float 32 precision is also small (mAP columns). However, the efficiency of the conversion process can be clearly seen (comparing the inference time of models for the original PyTorch framework and the converted models).

Table: Accuracy and inference time comparison after TensorRT engine conversion – Object detection model. The evaluation is done on the test set

Based on the evaluation result, we can see that CenterNet with two lightweight backbones ResNet-18 and MobileNetV2 can achieve an acceptable accuracy and relatively high inference speed. The speed of these networks can be pushed further by running in float 16 precision without a noticeable reduction in accuracy. ResNet-18 backbone can be optimized better than MobileNetV2 in term of speed, which can be recognized by a larger reduction in inference time after converting. As CenterNet model with ResNet-18 backbone and input image size 384x384 provides the best balance between speed and accuracy on Jetson Nano, it is chosen as the main architecture for this project.

2. Traffic sign classification network

Due to the limitation of BDD dataset - It's only contains 1 class for traffic signs (without specifying the sign type), I had to train another neural network to recognize sign types. Because of the high speed and accuracy, ResNet-18 was also chosen for this task. I trained the model using Tensorflow and Keras frameworks. However, to optimize the speed, I also convert final model to TensorRT format.