Object Detection with OverFeat

Thursday, 9 January 2020

Object Detection with OverFeat

Objective

Improve Sliding Window approach not by reducing number of crops (like in Region-proposal based solutions) but by passing image through CNN only once.

Output

Class predictions for each crop (position of the sliding window).

Method

Based on convolutional implementation of fully connected layers => NN = CNN + softmax layer.

Input image goes only once through single forward-propagation NN => computationally efficient.
This is why this method belongs to groups of detectors called Single-shot detectors.

Class probabilities for all locations are predicted at once.

Idea published by Pierre Sermanet et al.: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

ImageFeat single propagation through CNN

Image source: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"

CNN (like AlexNet) typically has the following structure (conv-pooling pair is usually repeated several times):

[conv --> pooling] --> FC --> FC --> Softmax

FC layers require fixed-sized input which, in turn, when back-propagated, adds requirement to first convolutional layer (1st layer in the network) to also accept input of the certain (fixed) size. This is for example, why in some Region-based detectors, we need to warp Regions of Interest before they go through CNN.

But what happens if we don't impose this restriction? The output of the last layer before the first FC layer will just be tensor of different (larger) dimensions.

If we then replace FC layers with convolution layers so for some NxN input we get 1x1 output (1x1xC actually where C is the name of classes), we'll get fully convolutional NN which can accept input image of any size and which will consequently output tensor of various dimensions. But it turns out that classification result (predictions) in the output layer spatially match the scaled "crop" of the input image. E.g. if input is (N+2)x(N+2) the output will be 2 x 2(x C) and in the upper left vector will contain C predictions for NxN upper left crop of the input image. Similarly, lower-right vector will contain C predictions for NxN lower right crop of the input image. So in one pass through CNN we essentially get predictions for Sliding Window crops! Sliding Window stride is defined here by the size of the kernel in the 1st convolutional layer.

In other words, we can read the upper image in the following way:

If we design fully convolutional NN so for input of size N x N x 3 (3 for RGB channels) it outputs tensor of size 1 x 1 x C (C is number of classes) then if we pass to it input image of size (N+2) x (N+2) x 3 we'll have at the output a tensor of size 2x2xC. Each element of that 2 x 2 output is a vector of C values - predictions of classes and each output spatially matches the N x N crop in the input image.

OverFeat is based on the idea to replace FC layers with Conv layers as this would then allow passing the image of any size into the ConvNet.

1st row shows ConvNet which outputs class prediction vector for an input image of certain size (14x14 in this case). The output is (1x1xC) a single vector containing C elements - class predictions (C is number of classes).

If we pass through this ConvNet an image of some other dimensions. we’ll get larger matrix at the output (2x2xC) which would contain class predictions for 4 positions of sliding window. And we’ve got these predictions in one go!