Object Detection with R-CNN

Wednesday 8 January 2020

Objective

Object detection with sliding-window classification is computationally expensive because a huge number of crops must be passed through the classifier. We need an intelligent way to select only the crops that are likely to contain objects, and so reduce their number.
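To get a feel for the numbers, here is a minimal sketch that counts how many crops a sliding-window detector would have to classify. The image size, window sizes and stride are illustrative assumptions, not values from any particular detector.

```python
def count_sliding_windows(image_w, image_h, window_sizes, stride):
    """Count the crops a sliding-window detector would classify.

    window_sizes is a list of (width, height) pairs; stride is in pixels.
    """
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit in the image at all
        positions_x = (image_w - win_w) // stride + 1
        positions_y = (image_h - win_h) // stride + 1
        total += positions_x * positions_y
    return total


# Hypothetical setup: a 640x480 image, three square window sizes, stride 4.
windows = [(64, 64), (128, 128), (256, 256)]
n_crops = count_sliding_windows(640, 480, windows, stride=4)
print(n_crops)
```

Even with only three window sizes and no extra aspect ratios this gives tens of thousands of crops per image, and every added scale or aspect ratio increases the count further.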

Solution

Region proposals. Region proposal methods typically don't use deep learning; they rely on more traditional computer vision or image processing techniques. For a given input image, a region proposal algorithm returns a list of regions (possibly thousands) where an object might be present. Such an algorithm might look for edges and try to draw boxes around closed contours. In effect, these methods look for blobby regions in the input image and return a set of candidate regions where objects might be found.

These methods are relatively fast to run. One common example is Selective Search, which gives about 2000 region proposals where objects are likely to be found. This takes a couple of seconds on a CPU.

There will be a lot of noise among them: most proposals will not be true objects, but recall is pretty high.

If there is an object in the image, it tends to be covered by at least one of the region proposals from Selective Search.
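One way to make "covered" concrete is to measure recall with intersection over union (IoU). The sketch below is a minimal version under stated assumptions: the 0.5 IoU threshold, the (x1, y1, x2, y2) box format and the example boxes are all illustrative choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proposal_recall(ground_truth, proposals, threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal."""
    if not ground_truth:
        return 0.0
    covered = sum(
        1 for gt in ground_truth
        if any(iou(gt, p) >= threshold for p in proposals)
    )
    return covered / len(ground_truth)


# Hypothetical boxes: the first object is covered, the second is missed.
gt_boxes = [(10, 10, 50, 50), (100, 100, 160, 180)]
props = [(12, 8, 52, 48), (0, 0, 20, 20), (300, 300, 320, 320)]
recall = proposal_recall(gt_boxes, props)
print(recall)
```

Note that two of the three proposals here are noise. High recall only requires that each object is matched by some proposal, not that most proposals are objects.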

So now, rather than applying our classification network to every possible location and scale in the image, we can first apply one of these region proposal methods to get a set of regions where objects are likely to be located, and then apply a convolutional network for classification to each of these regions. This ends up being much more computationally tractable than trying all possible locations and scales.
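The two-stage idea can be sketched as a small loop. This is only an outline under stated assumptions: `propose_regions` and `classify_region` are hypothetical callables standing in for a proposal method such as Selective Search and for the classification network, and the stub implementations exist only to make the example run.

```python
def detect(image, propose_regions, classify_region, background="background"):
    """Classify only the proposed regions instead of every location and scale."""
    detections = []
    for box in propose_regions(image):
        label, score = classify_region(image, box)
        if label != background:  # drop proposals that are not true objects
            detections.append((box, label, score))
    return detections


# Hypothetical stand-ins for the two stages.
def stub_proposals(image):
    return [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 40, 40)]


def stub_classifier(image, box):
    if box == (5, 5, 20, 20):
        return ("cat", 0.9)
    return ("background", 0.8)


results = detect(None, stub_proposals, stub_classifier)
print(results)
```

The classifier runs once per proposal, so the cost scales with the number of proposals rather than with the number of possible windows.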

This idea came together in the 2014 paper by Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik: "Rich feature hierarchies for accurate object detection and semantic segmentation".

Network

The input image is passed through a region proposal method to get proposals (also called Regions of Interest, or RoIs). Selective Search gives about 2000 regions of interest, which are typically rectangular. Region proposals are not learned; they are the output of a fixed, handcrafted algorithm. Selective Search is one such algorithm, but R-CNN does not require it: any other proposal method that does the job could be used.

R-CNN network.
Image credit: Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: "Rich feature hierarchies for accurate object detection and semantic segmentation"

One of the problems here is that these regions in the input image can have different sizes, but convolutional networks for classification typically expect inputs of a single fixed size because of their fully connected layers. We need to take each of these region proposals and warp it to the fixed square size that the downstream network expects as input.

R-CNN Network Architecture
Image modified. Original image taken from: Fei-Fei Li, Justin Johnson, Serena Yeung (Stanford University School of Engineering): Convolutional Neural Networks for Visual Recognition, Lecture 11 - Detection and Segmentation

So we'll crop out those regions corresponding to the region proposals, we'll warp them to that fixed size, and then we'll run each of them through a convolutional network.
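The crop-and-warp step can be sketched in plain Python as follows. Nearest-neighbor sampling, the list-of-rows image representation and the exclusive (x1, y1, x2, y2) box convention are assumptions made to keep the example short; a real pipeline would use an image library and a better interpolation method.

```python
def crop_and_warp(image, box, out_size):
    """Crop box=(x1, y1, x2, y2) from image and resize it to out_size x out_size.

    image is a list of rows, each row a list of pixel values. The aspect
    ratio of the crop is not preserved, which is what "warping" means here.
    """
    x1, y1, x2, y2 = box
    crop_w, crop_h = x2 - x1, y2 - y1
    warped = []
    for out_y in range(out_size):
        src_y = y1 + out_y * crop_h // out_size  # nearest source row
        row = []
        for out_x in range(out_size):
            src_x = x1 + out_x * crop_w // out_size  # nearest source column
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped


# Hypothetical 6x8 image where each pixel encodes its position as 10*y + x.
image = [[10 * y + x for x in range(8)] for y in range(6)]
patch = crop_and_warp(image, (2, 1, 6, 4), out_size=2)
print(patch)
```

Whatever the shape of the proposal, the output always has the same square size, so every region can be fed to the same network.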

For each warped RoI, two predictions are made from the CNN features:

Category label

This is predicted by an SVM (Support Vector Machine) classifier. [BK: Why? Why not a Softmax layer (which is used later in Fast R-CNN)?]
The classifier outputs "background" for RoIs that don't correspond to true objects.

Correction to the bounding box

This is 4 numbers that form an offset, or correction, to the box predicted at the region proposal stage.
It is predicted by regression.
It is necessary because the input region proposals are generally in roughly the right position for an object but might not be perfect.

The corrected box is not always inside the region of interest. Suppose the region of interest put a box around a person but missed the head. The network could infer that this is a person, that people usually have heads, and therefore that the box should extend a little higher. So sometimes the final predicted box will reach outside the region of interest.
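The person-without-a-head case can be worked through numerically. The sketch below assumes the offsets are encoded as (dx, dy, dw, dh), with the center shift scaled by the proposal size and the size change applied in log space, which is a common parameterization for box regression; the post itself only says that there are 4 numbers. The boxes and offset values are made up for illustration.

```python
import math


def apply_offsets(proposal, offsets):
    """Apply (dx, dy, dw, dh) offsets to an (x1, y1, x2, y2) proposal box."""
    x1, y1, x2, y2 = proposal
    dx, dy, dw, dh = offsets
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center by a fraction of the proposal size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Scale width and height in log space so they stay positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )


# Hypothetical proposal that covers a person's body but cuts off the head.
proposal = (100, 120, 200, 320)
# Move the center up by 10% of the height and make the box 20% taller.
offsets = (0.0, -0.1, 0.0, math.log(1.2))
refined = apply_offsets(proposal, offsets)
print(refined)
```

With these values the top edge moves up from y = 120 to about y = 80 while the bottom edge stays at about y = 320, so the refined box extends above the original region of interest.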

Loss Function

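Because the system performs two tasks (classification and regression), two losses are involved: hinge loss for the SVM classifier and L2 loss for the bounding-box regression. Note that in the original R-CNN these components are trained in separate stages rather than jointly under a single multi-task objective.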
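The two loss terms named above can be written down for a single RoI as follows. This is a sketch under assumptions: it uses a binary SVM with labels +1 (object) and -1 (background), applies the regression term only to object RoIs, and adds the two terms with a hypothetical weight `reg_weight` purely for illustration.

```python
def hinge_loss(score, label):
    """Binary hinge loss for a linear SVM; label is +1 or -1."""
    return max(0.0, 1.0 - label * score)


def l2_loss(predicted, target):
    """Sum of squared differences between predicted and target offsets."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def roi_loss(svm_score, label, pred_offsets, target_offsets, reg_weight=1.0):
    """Hinge loss plus weighted L2 loss for one region of interest."""
    loss = hinge_loss(svm_score, label)
    if label == 1:
        # Assumption: background RoIs have no box target, so no regression term.
        loss += reg_weight * l2_loss(pred_offsets, target_offsets)
    return loss


# Hypothetical object RoI: SVM score inside the margin, offsets slightly off.
object_loss = roi_loss(0.4, 1, (0.1, -0.2, 0.0, 0.3), (0.0, -0.1, 0.0, 0.1))
# Hypothetical background RoI scored confidently negative.
background_loss = roi_loss(-2.0, -1, (0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
print(object_loss, background_loss)
```

The object RoI contributes a hinge term of about 0.6 and a regression term of about 0.06, while the confidently rejected background RoI contributes nothing.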

Training

Training is fully supervised: the training data consists of images in which every instance of each object category is marked with a bounding box.

Problems

R-CNN is still computationally expensive: with 2000 region proposals per image, each proposal is run through the network independently.

It relies on a fixed region proposal method; the region proposals are not learned.

The features extracted for every proposal are stored on disk for training, which takes hundreds of gigabytes of disk space.
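A back-of-the-envelope estimate shows how the storage adds up. Only the figure of 2000 proposals per image comes from the text; the number of training images, the feature dimension and the 4-byte float size are illustrative assumptions.

```python
def feature_cache_gigabytes(num_images, proposals_per_image, feature_dim,
                            bytes_per_value=4):
    """Estimate disk space (in GB) to store one feature vector per proposal."""
    total_bytes = (num_images * proposals_per_image
                   * feature_dim * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical dataset: 5000 training images, 4096-dimensional float32 features.
estimate = feature_cache_gigabytes(5000, 2000, 4096)
print(round(estimate, 2))
```

Under these assumptions a single feature layer already needs roughly 164 GB, and storing features from more than one layer or for a larger dataset grows the total proportionally.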

Training is very slow, since it requires forward and backward passes for all of these region crops; the reported training time was something like 84 hours.

Test time is also very slow, roughly 30 seconds per image, because the convolutional network has to run thousands of forward passes, one for each region proposal.