Object Detection for Autonomous Vehicles
Results and Discussion
The tables below report mAP on the validation set for the three model configurations I tried (Faster R-CNN with a ResNet-50 backbone, Faster R-CNN with a MobileNetV3 backbone, and YOLOv4) while experimenting with batch size and initial learning rate.
Faster R-CNN with ResNet-50 Backbone
Batch Size | Initial Learning Rate | mAP on Val Set
9 | 0.001 | 30.7%
9 | 0.0002 | 38.0%
9 | 0.00005 | 40.8%
Faster R-CNN with MobileNetV3 Backbone
Batch Size | Initial Learning Rate | mAP on Val Set
36 | 0.001 | 25.4%
36 | 0.0002 | 27.0%
36 | 0.00005 | 27.3%
18 | 0.0002 | 26.7%
18 | 0.00005 | 26.7%
18 | 0.00001 | 23.9%
YOLOv4
Batch Size | Initial Learning Rate | mAP on Val Set
64 | 0.001 | 43.35%
64 | 0.0005 | 45.77%
64 | 0.0001 | 44.05%
32 | 0.0001 | 45.09%
32 | 0.00005 | 42.92%
For Faster R-CNN, a learning rate of 0.00005 yields the highest mAP score on the validation set with both the ResNet-50 and MobileNetV3 backbones. It comes as no surprise that a lower learning rate performed better: I started from a pretrained Faster R-CNN model, so too high an initial learning rate could disturb the pretrained weights too much in the first few epochs of finetuning. YOLOv4, however, achieved its best validation mAP at learning rates higher than 0.00005, specifically 0.0005 with a batch size of 64 and 0.0001 with a batch size of 32.
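To make this concrete, here is a minimal sketch of how a pretrained torchvision Faster R-CNN can be loaded and given a low initial learning rate for finetuning. The optimizer choice (SGD with momentum) and the momentum and weight-decay values are illustrative assumptions, not necessarily the exact configuration behind the results above.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Faster R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-predictor head for the 13 BDD object classes
# (torchvision reserves index 0 for background, hence the +1).
num_classes = 13 + 1
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# A low initial learning rate (0.00005, the best value found above) so the
# first few epochs of finetuning do not disturb the pretrained weights.
# Momentum and weight decay here are assumed values for illustration.
optimizer = torch.optim.SGD(
    model.parameters(), lr=5e-5, momentum=0.9, weight_decay=5e-4
)
```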
In terms of batch size, with the ResNet-50 backbone for Faster R-CNN I was restricted to a batch size of 9 by GPU memory limitations. With MobileNetV3, however, the larger batch size of 36 (versus 18) produced a higher validation mAP at the lower learning rates, and the best YOLOv4 result likewise came from the larger batch size of 64 (versus 32), at a learning rate of 0.0005. It makes sense that a larger batch size, up to a point, would yield better results, since each gradient step is averaged over more samples and is therefore a less noisy estimate of the true gradient.
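One way to work around such memory limits is gradient accumulation, which emulates a larger effective batch size by accumulating gradients over several small mini-batches before each optimizer step. The sketch below, which reuses the hypothetical `model` and `optimizer` from the previous sketch plus an assumed `data_loader`, is an illustration rather than something reflected in the results above.

```python
# Emulate an effective batch size of 36 using mini-batches of 9:
# accumulate gradients over 4 mini-batches, then take one optimizer step.
accum_steps = 4
model.train()  # torchvision detection models return a loss dict in train mode
optimizer.zero_grad()
for step, (images, targets) in enumerate(data_loader):
    loss_dict = model(images, targets)
    loss = sum(loss_dict.values()) / accum_steps  # scale for the accumulation
    loss.backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```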
In addition to mAP score, the time to finetune and evaluate the models is an important consideration. However, since I ran all experiments on Google Colaboratory, where the resources allocated to a session are unpredictable, finetuning and evaluation times are difficult to compare fairly: even repeated finetuning runs of the same model took quite different amounts of time.
Based on the validation mAP scores in the tables above, the best-performing Faster R-CNN model uses the ResNet-50 backbone with a batch size of 9 and a learning rate of 0.00005, and the best-performing YOLOv4 model uses a batch size of 64 and a learning rate of 0.0005. The results of evaluating these two models on the test set are below.
mAP on Test Set for Best Performing Models
Model | mAP on Test Set
Faster R-CNN with ResNet-50 | 37.6%
YOLOv4 | 44.27%
AP on Each Object Class for Best Performing Models
Object Class | Faster R-CNN with ResNet-50 | YOLOv4
pedestrian | 64.5% | 62.97%
rider | 55.5% | 49.98%
car | 80.4% | 80.93%
truck | 65.6% | 63.04%
bus | 68.2% | 64.83%
train | 0% | 2.16%
motorcycle | 50.1% | 49.69%
bicycle | 54.6% | 58.12%
traffic light | 63.3% | 60.23%
traffic sign | 68.7% | 73.04%
other person | 0% | 0.26%
other vehicle | 17.6% | 9.14%
trailer | 0% | 1.12%
As expected, YOLOv4 outperforms Faster R-CNN on the test set, achieving a mAP score about 7 percentage points higher (44.27% versus 37.6%), in line with comparisons of these two models on datasets like COCO [10]. In terms of per-class AP, YOLOv4 and Faster R-CNN achieve quite similar scores, performing fairly well on most classes except train, other person, other vehicle, and trailer. For both models, per-class AP roughly correlates with how often objects of that class appear in the dataset. The classes with the fewest appearances, each occurring fewer than 1,000 times, are train, other person, other vehicle, and trailer, exactly the classes both models performed worst on; conversely, both models performed best on cars, which appear far more often than any other class (803,540 times). This pattern makes sense, as a model can learn to detect a class better when training provides more examples of it. Augmenting the Berkeley DeepDrive dataset for rare classes like trailer and train, either by adding new images or by applying transformations to the existing images that contain these objects, could help improve both models' ability to detect them.
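As a rough sketch of one such strategy, the snippet below oversamples images containing rare classes using PyTorch's WeightedRandomSampler. It assumes a `dataset` whose targets carry a "labels" tensor of class indices, as torchvision detection datasets typically do, and is meant as a complement to, not a substitute for, genuinely new images.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

# Count how often each class appears, and record which classes each image
# contains. Assumes dataset[i] returns (image, target) with target["labels"]
# holding the class indices of that image's boxes.
class_counts, image_classes = Counter(), []
for _, target in dataset:
    labels = set(target["labels"].tolist())
    image_classes.append(labels)
    class_counts.update(labels)

# Weight each image by the rarity of its rarest class, so images containing
# trains or trailers are drawn far more often than car-only images.
# Images with no boxes fall back to the largest count (lowest weight).
max_count = max(class_counts.values())
weights = [
    1.0 / min((class_counts[c] for c in labels), default=max_count)
    for labels in image_classes
]

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=9, sampler=sampler,
                    collate_fn=lambda batch: tuple(zip(*batch)))
```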
Overall, I learned a lot while doing this project. I gained a better understanding of how different object detection models work, how to finetune multiple object detection models, and how object detection is evaluated (e.g., mAP, IoU).
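For reference, IoU, the box-overlap criterion underlying mAP, is simple to compute for a pair of axis-aligned boxes; a prediction typically counts as a true positive only if its IoU with a ground-truth box clears a threshold such as 0.5:

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) corner format."""
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```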