Wednesday 8 January 2020

Object Detection with Fast R-CNN


Fast R-CNN improves on the R-CNN method, whose problems were speed, memory footprint, and accuracy.


To eliminate processing each RoI in the input image separately (passing every RoI through the CNN), region proposals are not applied directly to the input image but rather to its convolutional feature map. Compared with R-CNN, Fast R-CNN effectively swaps the positions of the region proposal method and the CNN in the network architecture.

Network Architecture


Inputs:

  • image
  • set of predefined labels (categories/classes)

The entire image runs through some convolutional layers all at once to give a high-resolution convolutional feature map corresponding to the entire image. E.g. if we have 5 convolutional layers, the output of the 5th is denoted conv5, so we'd have a conv5 feature map.

Fast R-CNN Network Architecture
Image credit: Ross Girshick: "Fast R-CNN"

A fixed region proposal method (e.g. Selective Search) is still used. But rather than cropping out the pixels of the image corresponding to the region proposals, we project those proposals onto the convolutional feature map and take crops from the feature map corresponding to each proposal, instead of taking crops directly from the image.
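The projection step can be sketched as dividing the proposal's pixel coordinates by the network's total stride. This is a minimal illustration, not the exact scheme from the paper; the stride value of 16 (typical for a VGG-16-style backbone) and the rounding choices are assumptions:

```python
import numpy as np

def project_rois(rois, stride=16):
    """Project proposals [x1, y1, x2, y2] (image pixels) onto the feature map.

    stride: total downsampling factor of the conv layers (assumed 16 here).
    """
    rois = np.asarray(rois, dtype=np.float64)
    # Round outward so the projected box still covers the whole proposal.
    x1 = np.floor(rois[:, 0] / stride)
    y1 = np.floor(rois[:, 1] / stride)
    x2 = np.ceil(rois[:, 2] / stride)
    y2 = np.ceil(rois[:, 3] / stride)
    return np.stack([x1, y1, x2, y2], axis=1).astype(int)

# One proposal in image coordinates -> coordinates on the 16x-smaller map.
print(project_rois([[33, 65, 240, 430]]))  # → [[ 2  4 15 27]]
```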

This allows us to reuse a lot of this expensive convolutional computation across the entire image when we have many crops per image.

The fully connected layers downstream expect a fixed-size input, so the crops from the convolutional feature map need to be reshaped, which is done in a differentiable way using an RoI pooling layer. RoI pooling looks a lot like max pooling.
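A minimal RoI-pooling sketch, under some simplifying assumptions: a single feature channel, integer RoI coordinates already on the feature map, and a 2×2 output grid for brevity (the paper uses larger grids, e.g. 7×7). Each output cell is the max over its sub-window of the crop:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Max-pool an RoI crop of the feature map down to (out_size, out_size)."""
    x1, y1, x2, y2 = roi
    crop = feature_map[y1:y2 + 1, x1:x2 + 1]
    h, w = crop.shape
    out = np.zeros((out_size, out_size))
    # Split the crop into an out_size x out_size grid of sub-windows.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = crop[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36).reshape(6, 6)        # toy 6x6 feature map
print(roi_pool(fmap, (1, 1, 4, 4)))       # → [[14. 16.] [26. 28.]]
```

Whatever the RoI's size, the output is always the same fixed shape, which is exactly what the downstream fully connected layers need.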

Once you have these warped crops from the convolutional feature map, you can run them through some fully connected layers and predict (for each RoI):
  • classification scores - Softmax classifier (Linear + softmax)
  • offsets to the bounding boxes - Bounding box regressors (linear regression)
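The two sibling heads can be sketched as two linear layers on top of the pooled-and-flattened RoI feature. The shapes here (a 4096-d fc7-style feature, 21 classes as in 20 PASCAL VOC categories plus background, random weights) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes = 21                          # assumed: 20 VOC classes + background
feat = rng.standard_normal(4096)          # pooled RoI feature after the fc layers

W_cls = rng.standard_normal((num_classes, 4096)) * 0.01       # classifier head
W_box = rng.standard_normal((num_classes * 4, 4096)) * 0.01   # bbox regressor head

# Head 1: softmax over class scores.
logits = W_cls @ feat
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Head 2: four bounding-box offsets (dx, dy, dw, dh) per class.
box_deltas = (W_box @ feat).reshape(num_classes, 4)

print(probs.shape, box_deltas.shape)      # → (21,) (21, 4)
```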

Fast R-CNN Network Architecture
Image source: Fei-Fei Li, Justin Johnson, Serena Yeung (Stanford University School of Engineering): Convolutional Neural Networks for Visual Recognition, Lecture 11 - Detection and Segmentation 

Training & Loss Function

During the training phase, we have a multi-task loss that trades off between the two tasks listed above:

Total Loss = Log loss (Softmax classifier) + Smooth L1 loss (BBox regressor)

During backpropagation we can back-prop through this entire network and learn it all jointly.
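For a single RoI, the multi-task loss above can be sketched as follows. This is a hedged toy version: the loss weight `lam` and the 3-class probability vector are illustrative, and the bounding-box term is counted only for non-background RoIs, as in the paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear beyond |x| = 1."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5).sum()

def multitask_loss(class_probs, true_class, pred_deltas, target_deltas, lam=1.0):
    cls_loss = -np.log(class_probs[true_class])          # log loss (softmax)
    # Box loss only for foreground RoIs (class 0 = background by convention).
    box_loss = smooth_l1(pred_deltas - target_deltas) if true_class > 0 else 0.0
    return cls_loss + lam * box_loss

probs = np.array([0.1, 0.7, 0.2])                # toy: background + 2 classes
loss = multitask_loss(probs, 1,
                      np.array([0.5, 0.0, 0.0, 0.0]),   # predicted offsets
                      np.zeros(4))                       # ground-truth offsets
print(round(loss, 4))
```

Both terms are differentiable (smooth L1 keeps gradients bounded for large errors), which is what lets the whole network be trained jointly end to end.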

Benchmarks and Problems

Training times (hours):

  • R-CNN: 84
  • SPP-Net: 25.5
  • Fast R-CNN: 8.75

Test times (seconds, Including/Not Including Region Proposals):

  • R-CNN: 49/47
  • SPP-Net: 4.3/2.3
  • Fast R-CNN: 2.3/0.32

In terms of speed, comparing R-CNN, SPP-net (which sits somewhere between the two), and Fast R-CNN: at training time Fast R-CNN is ~10× faster to train because the expensive convolutional computation is shared across all the region proposals.

And at test time Fast R-CNN is so fast that its computation time is actually dominated by computing the region proposals.

Computing these ~2000 region proposals with Selective Search takes ~2 seconds, while the network itself, because it shares the expensive convolutions across the entire image, processes all of the proposals in under a second altogether. So Fast R-CNN ends up bottlenecked by the region proposal computation alone.


Lecture 11 | Detection and Segmentation - YouTube

(2015) Ross Girshick: Fast R-CNN
