For the fifth and final project of the first term of Udacity’s Self-Driving Car Engineer program, I decided to take a more challenging route and implement a more modern approach. The task was to implement a HOG-based approach for object/vehicle detection in a video: slide a window over the image and feed the HOG features of each window into a support vector machine (SVM).
However, I wanted to try one of the recent deep learning approaches. After studying methods based on region proposals, such as Faster R-CNN and its predecessors, I went with the YOLO v2 paper by Redmon and Farhadi. I chose this method because it doesn’t rely on a separate region proposal network and is therefore much faster (real-time inference is possible). YOLO v2 divides the image into a grid of cells and uses anchor boxes, which allow several classifiers per cell to specialize in different aspect ratios and sizes. Another possible choice would have been SSD. I also used TensorFlow’s object detection API with Faster R-CNN (Inception ResNet v2 architecture) on a video I took, which can be watched here.
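To make the anchor idea concrete, here is a minimal NumPy sketch of how YOLO v2 decodes raw network outputs into boxes: the predicted x/y offsets are squashed with a sigmoid and added to the cell coordinates, while width and height scale the anchor priors exponentially. The function name, shapes, and anchor values are illustrative, not taken from my actual implementation.

```python
import numpy as np

def decode_yolo_v2(pred, anchors, grid_size):
    """Decode raw YOLO v2 outputs (tx, ty, tw, th) into boxes in grid units.

    pred:    (grid, grid, num_anchors, 4) raw network outputs
    anchors: (num_anchors, 2) prior widths/heights in grid units
    """
    g = grid_size
    # Per-cell offsets: cy varies along rows, cx along columns
    cy, cx = np.meshgrid(np.arange(g), np.arange(g), indexing="ij")
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    bx = sigmoid(pred[..., 0]) + cx[..., None]   # center x inside the grid
    by = sigmoid(pred[..., 1]) + cy[..., None]   # center y inside the grid
    bw = anchors[:, 0] * np.exp(pred[..., 2])    # width scales the anchor prior
    bh = anchors[:, 1] * np.exp(pred[..., 3])    # height scales the anchor prior
    return np.stack([bx, by, bw, bh], axis=-1)

# Example: 13x13 grid with two anchor shapes (illustrative values)
anchors = np.array([[1.0, 2.0], [3.0, 1.5]])
pred = np.zeros((13, 13, 2, 4))
boxes = decode_yolo_v2(pred, anchors, 13)
print(boxes[0, 0, 0])  # all-zero outputs -> center (0.5, 0.5), size = anchor prior
```

Because each anchor has its own output slot, the network can learn, for example, that one anchor handles wide vehicles while another handles tall, narrow ones.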
Since the paper leaves many implementation details unspecified, I had to improvise and guess them. The authors used their own framework, Darknet, for their implementation. I also relied on the original YOLO paper, on which the cost function I used is based:
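For reference, the multi-part squared-error loss from the original YOLO paper (which the cost function above is based on) has the following form, where $\mathbb{1}_{ij}^{\text{obj}}$ indicates that anchor $j$ of cell $i$ is responsible for an object:

$$
\begin{aligned}
\mathcal{L} = {} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (C_i - \hat{C}_i)^2
  + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

The square roots on width and height down-weight size errors on large boxes, and $\lambda_{\text{noobj}}$ keeps the many empty cells from dominating the confidence term.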
The TensorFlow-related implementations I found on GitHub were all essentially wrappers around Darknet. To understand all the details, I decided to implement everything from scratch in TensorFlow. To speed things up, I used multithreading with queues wherever possible, especially during preprocessing (although this couldn’t be used for real-time inference).
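The queue-based preprocessing follows the standard producer/consumer pattern: worker threads load and preprocess images while the training loop consumes finished tensors from a bounded queue. The sketch below uses Python’s `threading` and `queue` modules to show the pattern only; the `preprocess` function is a hypothetical stand-in, not my actual pipeline.

```python
import queue
import threading

def preprocess(path):
    # Hypothetical stand-in for real image loading/augmentation
    return ("tensor", path)

def worker(paths, q):
    # Each worker thread preprocesses its share of images and enqueues results
    for p in paths:
        q.put(preprocess(p))

paths = [f"img_{i:02d}.jpg" for i in range(8)]
q = queue.Queue(maxsize=4)  # bounded queue applies back-pressure to the workers
threads = [
    threading.Thread(target=worker, args=(paths[i::2], q))
    for i in range(2)  # two producer threads, interleaved shares of the paths
]
for t in threads:
    t.start()

# The training loop would block here only when preprocessing can't keep up
batch = [q.get() for _ in range(len(paths))]
for t in threads:
    t.join()
print(len(batch))  # 8
```

The bounded `maxsize` is the important detail: it caps memory use and throttles the producers so preprocessing stays just ahead of training.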
For training I used Udacity’s public datasets.
Videos with predictions can be watched below:
Predictions for the first dataset:
Predictions for the second dataset:
This is a video from another project, where I recorded a highway drive and used TensorFlow’s object detection API with a Faster R-CNN Inception ResNet model: