Object Detection for Autonomous Vehicles

Results and Discussion

The tables below report mAP on the validation set for the three models I tried (Faster R-CNN with a ResNet-50 backbone, Faster R-CNN with a MobileNetV3 backbone, and YOLOv4) while experimenting with batch size and initial learning rate.

Faster R-CNN with ResNet-50 Backbone
Batch Size    Initial Learning Rate    mAP on Val Set
9             0.001                    30.7%
9             0.0002                   38.0%
9             0.00005                  40.8%

Faster R-CNN with MobileNetV3 Backbone
Batch Size    Initial Learning Rate    mAP on Val Set
36            0.001                    25.4%
36            0.0002                   27.0%
36            0.00005                  27.3%
18            0.0002                   26.7%
18            0.00005                  26.7%
18            0.00001                  23.9%

YOLOv4
Batch Size    Initial Learning Rate    mAP on Val Set
64            0.001                    43.35%
64            0.0005                   45.77%
64            0.0001                   44.05%
32            0.0001                   45.09%
32            0.00005                  42.92%

For Faster R-CNN, a learning rate of 0.00005 yields the highest mAP on the validation set with either the ResNet-50 or the MobileNetV3 backbone. It is not surprising that a lower learning rate yielded a higher mAP: since I started from a pretrained Faster R-CNN model, too high an initial learning rate could change the pretrained weights too much in the first few epochs of finetuning. YOLOv4, however, yielded higher validation mAP with learning rates above 0.00005, specifically 0.0005 at a batch size of 64 and 0.0001 at a batch size of 32.
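
For the Faster R-CNN experiments, this kind of finetuning setup can be sketched roughly as follows, assuming the torchvision detection API; the class count and all optimizer settings other than the learning rate are illustrative placeholders, not necessarily the exact configuration behind the runs above:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Illustrative class count: 13 BDD object classes plus one background slot.
num_classes = 14

# Load a Faster R-CNN model with a ResNet-50 FPN backbone pretrained on COCO
# (older torchvision versions use pretrained=True instead of weights="DEFAULT").
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Swap the box-predictor head so it predicts the BDD classes instead of COCO's.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# A small initial learning rate (0.00005) so the first few epochs of finetuning
# do not move the pretrained weights too far.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=5e-5, momentum=0.9, weight_decay=5e-4)
```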

In terms of batch size, with the ResNet-50 backbone for Faster R-CNN I was restricted to a batch size of 9 by GPU memory limitations. With MobileNetV3, however, the larger batch size of 36 (instead of 18) produced a higher validation mAP at the lower learning rates. Likewise, the best YOLOv4 result came from a batch size of 64 (rather than 32) with a learning rate of 0.0005. It makes sense that a larger batch size, up to a point, yields better results: the gradient step is averaged over more samples, so each update is a less noisy estimate of the true gradient.
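
As a toy illustration of that averaging effect (made-up numbers, unrelated to the actual training data), the snippet below treats each sample's gradient as a noisy estimate of the true gradient and shows that the batch-averaged gradient becomes less noisy roughly in proportion to one over the square root of the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each sample's gradient is the true gradient (1.0) plus unit-variance noise.
true_grad, noise_std, trials = 1.0, 1.0, 10_000

for batch_size in (9, 18, 36, 64):
    # Average the simulated per-sample gradients within each batch, many times over.
    batch_grads = rng.normal(true_grad, noise_std, size=(trials, batch_size)).mean(axis=1)
    print(f"batch size {batch_size:>2}: std of averaged gradient ~= {batch_grads.std():.3f}")
```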

In addition to mAP, the time needed to finetune and evaluate the models is an important consideration. However, since I ran all experiments on Google Colaboratory, where the resources allocated to a session are unpredictable, it is difficult to compare the models on finetuning and evaluation time; even finetuning the same model multiple times took noticeably different amounts of time from run to run.

Based on the validation mAP scores in the tables above, the best performing Faster R-CNN model uses the ResNet-50 backbone with a batch size of 9 and a learning rate of 0.00005, and the best performing YOLOv4 model uses a batch size of 64 and a learning rate of 0.0005. The results of evaluating these two models on the test set are below.

mAP on Test Set for Best Performing Models
Model                          mAP on Test Set
Faster R-CNN with ResNet-50    37.6%
YOLOv4                         44.27%

AP on each Object Class on the Test Set for Best Performing Models
Object Class     Faster R-CNN with ResNet-50    YOLOv4
pedestrian       64.5%                          62.97%
rider            55.5%                          49.98%
car              80.4%                          80.93%
truck            65.6%                          63.04%
bus              68.2%                          64.83%
train            0%                             2.16%
motorcycle       50.1%                          49.69%
bicycle          54.6%                          58.12%
traffic light    63.3%                          60.23%
traffic sign     68.7%                          73.04%
other person     0%                             0.26%
other vehicle    17.6%                          9.14%
trailer          0%                             1.12%
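
For reference, overall mAP and per-class AP values like those in the tables above can be computed from a model's predicted boxes, scores, and labels. The sketch below uses the torchmetrics MeanAveragePrecision metric, which is not necessarily the evaluation code behind these numbers, and the single box, score, and class index are made up for illustration:

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One prediction/ground-truth pair per image; boxes are (x1, y1, x2, y2).
preds = [{
    "boxes": torch.tensor([[100.0, 120.0, 220.0, 260.0]]),
    "scores": torch.tensor([0.87]),
    "labels": torch.tensor([2]),  # e.g., "car" under some class-index mapping
}]
targets = [{
    "boxes": torch.tensor([[98.0, 118.0, 225.0, 255.0]]),
    "labels": torch.tensor([2]),
}]

# class_metrics=True also reports AP broken down per object class.
metric = MeanAveragePrecision(class_metrics=True)
metric.update(preds, targets)
results = metric.compute()
print(results["map"], results["map_per_class"])
```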

As expected, YOLOv4 outperforms Faster R-CNN on the test set, achieving an mAP roughly 7 percentage points higher, which is in line with comparisons of these two models on datasets like COCO [10]. In terms of per-class AP, YOLOv4 and Faster R-CNN achieve quite similar scores, performing fairly well on most classes except train, other person, other vehicle, and trailer. For both models, per-class AP roughly correlates with how often objects of that class appear in the dataset. The classes with the fewest appearances, each occurring less than 1,000 times, are train, other person, other vehicle, and trailer, and these are exactly the classes both models performed worst on. Conversely, both models performed best on cars, which appear far more often than any other class, specifically 803,540 times. This pattern makes sense: the more examples of a class the model sees during finetuning, the better it can learn to detect that class. Augmenting the Berkeley DeepDrive dataset with more examples of rare classes such as trailer or train, either by adding new images or by applying transformations to the existing images that contain these objects, could therefore help improve both models' ability to detect them.
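
One simple way to act on this without collecting new images is to oversample the images that already contain rare classes during finetuning, a lighter-weight alternative to full augmentation. The sketch below does this with a PyTorch WeightedRandomSampler; the rare-class indices and the boost factor are made-up placeholders, and the dataset is assumed to yield (image, target) pairs with the per-image class labels precomputed from the annotation files:

```python
from torch.utils.data import DataLoader, WeightedRandomSampler

# Made-up class indices for the rarest classes under some label mapping,
# e.g., train, other person, and trailer.
RARE_CLASSES = {5, 10, 12}

def collate_fn(batch):
    # Detection targets vary in size per image, so keep each batch as tuples of lists.
    return tuple(zip(*batch))

def make_oversampled_loader(dataset, per_image_labels, batch_size=8, rare_boost=4.0):
    """Build a loader that samples images containing rare classes more often.

    per_image_labels is a precomputed list (one entry per image) of the class
    indices present in that image, read from the annotation files.
    """
    weights = [rare_boost if set(labels) & RARE_CLASSES else 1.0
               for labels in per_image_labels]
    sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      collate_fn=collate_fn)
```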

Overall, I learned a lot while doing this project. I gained a better understanding of how different object detection models work, how to finetune multiple object detection models, and how object detection is evaluated (e.g., mAP and IoU).