Object Detection for Autonomous Vehicles

Results and Discussion

The tables below report mAP on the validation set for the three models I tried (Faster R-CNN with a ResNet-50 backbone, Faster R-CNN with a MobileNetV3 backbone, and YOLOv4) while experimenting with batch size and initial learning rate.

Faster R-CNN with ResNet-50 Backbone
Batch Size    Initial Learning Rate    mAP on Val Set
9             0.001                    30.7%
9             0.0002                   38.0%
9             0.00005                  40.8%

Faster R-CNN with MobileNetV3 Backbone
Batch Size    Initial Learning Rate    mAP on Val Set
36            0.001                    25.4%
36            0.0002                   27.0%
36            0.00005                  27.3%
18            0.0002                   26.7%
18            0.00005                  26.7%
18            0.00001                  23.9%

YOLOv4
Batch Size    Initial Learning Rate    mAP on Val Set
64            0.001                    43.35%
64            0.0005                   45.77%
64            0.0001                   44.05%
32            0.0001                   45.09%
32            0.00005                  42.92%

For Faster R-CNN, a learning rate of 0.00005 yields the highest mAP on the validation set with either the ResNet-50 or the MobileNetV3 backbone. It is not surprising that a lower learning rate yielded a higher mAP: since I started from a pretrained Faster R-CNN model, too high an initial learning rate could change the pretrained weights too much in the first few epochs of finetuning. YOLOv4, however, yielded higher validation mAP with learning rates above 0.00005, specifically 0.0005 at a batch size of 64 and 0.0001 at a batch size of 32.
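
For the Faster R-CNN experiments, this kind of finetuning setup can be sketched roughly as follows, assuming the torchvision detection API; the class count and all optimizer settings other than the learning rate are illustrative placeholders, not necessarily the exact configuration behind the runs above:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Illustrative class count: 13 BDD object classes plus one background slot.
num_classes = 14

# Load a Faster R-CNN model with a ResNet-50 FPN backbone pretrained on COCO
# (older torchvision versions use pretrained=True instead of weights="DEFAULT").
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box-predictor head so it predicts the BDD classes instead of COCO's.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# A small initial learning rate (0.00005) so the first few epochs of finetuning
# do not move the pretrained weights too far.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=5e-5, momentum=0.9, weight_decay=5e-4)
```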

In terms of batch size, with the ResNet-50 backbone for Faster R-CNN I was restricted to a batch size of 9 by GPU memory limitations. With MobileNetV3, however, the larger batch size of 36 (instead of 18) produced a higher validation mAP at the lower learning rates. Likewise, the best YOLOv4 result came from a batch size of 64 (rather than 32) with a learning rate of 0.0005. It makes sense that a larger batch size, up to a point, yields better results: the gradient step is averaged over more samples, so each update is a less noisy estimate of the true gradient.
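
As a toy illustration of that averaging effect (made-up numbers, unrelated to the actual training data), the snippet below treats each sample's gradient as a noisy estimate of the true gradient and shows that the batch-averaged gradient becomes less noisy roughly in proportion to one over the square root of the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each sample's gradient is the true gradient (1.0) plus unit-variance noise.
true_grad, noise_std, trials = 1.0, 1.0, 10_000

for batch_size in (9, 18, 36, 64):
    # Average the simulated per-sample gradients within each batch, many times over.
    batch_grads = rng.normal(true_grad, noise_std, size=(trials, batch_size)).mean(axis=1)
    print(f"batch size {batch_size:>2}: std of averaged gradient ~= {batch_grads.std():.3f}")
```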

In addition to mAP, the time needed to finetune and evaluate the models is an important consideration. However, since I ran all experiments on Google Colaboratory, where the resources allocated to a session are unpredictable, it is difficult to compare the models on finetuning and evaluation time; even finetuning the same model multiple times took noticeably different amounts of time from run to run.

Based on the validation mAP scores in the tables above, the best performing Faster R-CNN model uses the ResNet-50 backbone with a batch size of 9 and a learning rate of 0.00005, and the best performing YOLOv4 model uses a batch size of 64 and a learning rate of 0.0005. The results of evaluating these two models on the test set are below.

mAP on Test Set for Best Performing Models
Model                          mAP on Test Set
Faster R-CNN with ResNet-50    37.6%
YOLOv4                         44.27%

AP on each Object Class on the Test Set for Best Performing Models
Object Class     Faster R-CNN with ResNet-50    YOLOv4
pedestrian       64.5%                          62.97%
rider            55.5%                          49.98%
car              80.4%                          80.93%
truck            65.6%                          63.04%
bus              68.2%                          64.83%
train            0%                             2.16%
motorcycle       50.1%                          49.69%
bicycle          54.6%                          58.12%
traffic light    63.3%                          60.23%
traffic sign     68.7%                          73.04%
other person     0%                             0.26%
other vehicle    17.6%                          9.14%
trailer          0%                             1.12%
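
For reference, overall mAP and per-class AP values like those in the tables above can be computed from a model's predicted boxes, scores, and labels. The sketch below uses the torchmetrics MeanAveragePrecision metric, which is not necessarily the evaluation code behind these numbers, and the single box, score, and class index are made up for illustration:

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One prediction/ground-truth pair per image; boxes are (x1, y1, x2, y2).
preds = [{
    "boxes": torch.tensor([[100.0, 120.0, 220.0, 260.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([2]),  # e.g., "car" under some class-index mapping
}]
targets = [{
    "boxes": torch.tensor([[98.0, 118.0, 225.0, 255.0]]),
    "labels": torch.tensor([2]),
}]

# class_metrics=True also reports AP broken down per object class.
metric = MeanAveragePrecision(class_metrics=True)
metric.update(preds, targets)
results = metric.compute()
print(results["map"], results["map_per_class"])
```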

As expected, YOLOv4 outperforms Faster R-CNN on the test set, achieving an mAP roughly 7 percentage points higher, which is in line with comparisons of these two models on datasets like COCO [10]. In terms of per-class AP, YOLOv4 and Faster R-CNN achieve quite similar scores, performing fairly well on most classes except train, other person, other vehicle, and trailer. For both models, per-class AP roughly correlates with how often objects of that class appear in the dataset. The classes with the fewest appearances, each occurring less than 1,000 times, are train, other person, other vehicle, and trailer, and these are exactly the classes both models performed worst on. Conversely, both models performed best on cars, which appear far more often than any other class, specifically 803,540 times. This pattern makes sense: the more examples of a class the model sees during finetuning, the better it can learn to detect that class. Augmenting the Berkeley DeepDrive dataset with more examples of rare classes such as trailer or train, either by adding new images or by applying transformations to the existing images that contain these objects, could therefore help improve both models' ability to detect them.
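
One simple way to act on this without collecting new images is to oversample the images that already contain rare classes during finetuning, a lighter-weight alternative to full augmentation. The sketch below does this with a PyTorch WeightedRandomSampler; the rare-class indices and the boost factor are made-up placeholders, and the dataset is assumed to yield (image, target) pairs with the per-image class labels precomputed from the annotation files:

```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# Made-up class indices for the rarest classes under some label mapping,
# e.g., train, other person, and trailer.
RARE_CLASSES = {5, 10, 12}

def collate_fn(batch):
    # Detection targets vary in size per image, so keep each batch as tuples of lists.
    return tuple(zip(*batch))

def make_oversampled_loader(dataset, per_image_labels, batch_size=8, rare_boost=4.0):
    """Build a loader that samples images containing rare classes more often.

    per_image_labels is a precomputed list (one entry per image) of the class
    indices present in that image, read from the annotation files.
    """
    weights = [rare_boost if set(labels) & RARE_CLASSES else 1.0
               for labels in per_image_labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      collate_fn=collate_fn)
```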

Overall, I learned a lot while doing this project. I gained a better understanding of how different object detection models work, how to finetune multiple object detection models, and how object detection is evaluated (e.g., mAP and IoU).