Thursday, 9 January 2020

Object Detection with OverFeat


OverFeat improves the sliding-window approach not by reducing the number of crops (as region-proposal-based solutions do) but by passing the image through the CNN only once.


Class predictions for each crop (position of the sliding window).


Based on convolutional implementation of fully connected layers => NN = CNN + softmax layer.

The input image goes through the network in a single forward pass => computationally efficient.
This is why the method belongs to the group of detectors called single-shot detectors.

Class probabilities for all locations are predicted at once.

Idea published by Pierre Sermanet et al.: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

[Figure: single propagation through the CNN]

Image source: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"

A CNN (like AlexNet) typically has the following structure (the conv-pooling pair is usually repeated several times):

[conv --> pooling] --> FC --> FC --> Softmax

FC layers require fixed-size input. Traced back through the layers, this requirement forces the first convolutional layer (the 1st layer in the network) to accept input of a fixed size as well. This is why, for example, some region-based detectors need to warp Regions of Interest before they go through the CNN.
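To make the fixed-size constraint concrete, here is a minimal sketch in plain Python. The layer sizes (a 5x5 convolution, 2x2 max pooling, 16 feature channels, 10 classes) are assumptions chosen for illustration, and `conv_out`, `feature_count` and `fc_forward` are hypothetical helpers, not part of any library.

```python
def conv_out(n, kernel, stride=1):
    """Spatial size after a 'valid' convolution or pooling layer."""
    return (n - kernel) // stride + 1


def feature_count(n, channels=16):
    """Length of the flattened feature vector fed to the first FC layer.

    Assumed toy backbone: 5x5 conv (stride 1) followed by 2x2 max pooling (stride 2).
    """
    side = conv_out(conv_out(n, 5, 1), 2, 2)
    return side * side * channels


def fc_forward(features, weights):
    """Fully connected layer: every weight row must be as long as the feature vector."""
    if any(len(row) != len(features) for row in weights):
        raise ValueError(
            f"FC layer expects {len(weights[0])} inputs, got {len(features)}")
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


# FC weights sized for a 14x14 input: 5 * 5 * 16 = 400 inputs, 10 outputs.
weights = [[0.0] * feature_count(14) for _ in range(10)]

print(feature_count(14))  # 400
print(feature_count(16))  # 576
fc_forward([1.0] * feature_count(14), weights)      # shapes match
try:
    fc_forward([1.0] * feature_count(16), weights)  # larger image: shape mismatch
except ValueError as err:
    print(err)
```

The convolution and pooling layers handle the 16x16 image without complaint; only the FC weight matrix, whose shape was fixed when the network was built, rejects the longer feature vector.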

But what happens if we don't impose this restriction? The output of the last layer before the first FC layer will just be a tensor of different spatial dimensions (larger ones for a larger input).

If we then replace the FC layers with convolutional layers so that for some NxN input we get a 1x1 output (actually 1x1xC, where C is the number of classes), we get a fully convolutional NN which accepts an input image of any size (at least NxN) and outputs a tensor whose dimensions depend on that size. It turns out that each prediction vector in the output spatially matches a "crop" of the input image. E.g. if the input is (N+2)x(N+2) and the network's overall stride is 2, the output will be 2x2(xC): the upper-left vector contains the C predictions for the NxN upper-left crop of the input image and, similarly, the lower-right vector contains the C predictions for the NxN lower-right crop. So in one pass through the CNN we essentially get predictions for all Sliding Window crops! The Sliding Window stride is fixed by the architecture: it equals the product of the strides of all layers in the network (e.g. 2 for a network with a single 2x2 max-pooling layer of stride 2).

In other words, we can read the image above in the following way:

If we design a fully convolutional NN so that for an input of size N x N x 3 (3 for the RGB channels) it outputs a tensor of size 1 x 1 x C (C is the number of classes), then for an input image of size (N+2) x (N+2) x 3 we get a tensor of size 2 x 2 x C at the output (as in the image, where the network's overall stride is 2). Each element of that 2 x 2 output is a vector of C values, the class predictions, and each one spatially matches an N x N crop of the input image.
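The size arithmetic above can be sketched in a few lines of Python. The layer list below is an assumption (a toy network with a 5x5 conv, one 2x2 max pool of stride 2, a 5x5 conv standing in for the first FC layer, and two 1x1 convs), and both helper functions are hypothetical names introduced only for this illustration.

```python
def output_size(n, layers):
    """Spatial output size for an n x n input; layers is a list of (kernel, stride)."""
    for kernel, stride in layers:
        n = (n - kernel) // stride + 1
    return n


def window_and_stride(layers):
    """Size and stride of the input window seen by one output cell."""
    window, jump = 1, 1
    for kernel, stride in layers:
        window += (kernel - 1) * jump   # each layer widens the receptive field
        jump *= stride                  # strides multiply through the stack
    return window, jump


# Assumed toy network: 5x5 conv, 2x2 max pool (stride 2),
# 5x5 conv (the former FC layer), then two 1x1 convs.
LAYERS = [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]

window, stride = window_and_stride(LAYERS)
print(window, stride)  # 14 2

for n in (14, 16, 18):
    m = output_size(n, LAYERS)
    # Upper-left corner (row, col) of the crop behind each output cell.
    corners = [(r * stride, c * stride) for r in range(m) for c in range(m)]
    print(f"input {n}x{n} -> output {m}x{m}, crops at {corners}")
```

Under these assumptions N = 14, so a 16x16 input yields a 2x2 grid whose cells correspond to 14x14 crops starting at offsets 0 and 2 along each axis.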

OverFeat is based on the idea of replacing FC layers with conv layers, as this allows an image of any size to be passed into the ConvNet.

The 1st row shows a ConvNet which outputs a class prediction vector for an input image of a certain size (14x14 in this case). The output (1x1xC) is a single vector containing C elements, the class predictions (C is the number of classes).

If we pass a larger image through this ConvNet, we'll get a larger tensor at the output (2x2xC), which contains class predictions for 4 positions of the sliding window. And we've got these predictions in one go!
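The claim that one pass yields the same predictions as running the classifier on every crop can be checked numerically. The sketch below is a deliberately tiny stand-in, not the OverFeat network: a single layer with an assumed 3x3 window, 2 channels, 4 classes, random weights and stride 1 (so a 4x4 input gives the 2x2 grid). All function names are hypothetical.

```python
import math
import random


def fc_forward(x_flat, weights, bias):
    """Fully connected layer: one dot product per output unit."""
    return [sum(w * x for w, x in zip(row, x_flat)) + b
            for row, b in zip(weights, bias)]


def conv_valid(image, kernels, bias):
    """Stride-1 'valid' convolution of an H x W x D image with C kernels of shape K x K x D."""
    k = len(kernels[0])
    depth = len(image[0][0])
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [[[b + sum(kern[i][j][ch] * image[r + i][c + j][ch]
                      for i in range(k) for j in range(k) for ch in range(depth))
              for kern, b in zip(kernels, bias)]
             for c in range(cols)]
            for r in range(rows)]


def flatten(block):
    """Flatten a K x K x D block in row, column, channel order."""
    return [v for row in block for pixel in row for v in pixel]


def crop(image, top, left, size):
    """Cut a size x size window out of the image."""
    return [row[left:left + size] for row in image[top:top + size]]


random.seed(0)
K, D, C = 3, 2, 4  # window size, channels, classes: illustrative values only
kernels = [[[[random.uniform(-1, 1) for _ in range(D)] for _ in range(K)]
            for _ in range(K)] for _ in range(C)]
bias = [random.uniform(-1, 1) for _ in range(C)]
fc_weights = [flatten(kern) for kern in kernels]  # same numbers, reshaped to C x (K*K*D)

image = [[[random.uniform(0, 1) for _ in range(D)] for _ in range(K + 1)]
         for _ in range(K + 1)]                   # (K+1) x (K+1) x D input

grid = conv_valid(image, kernels, bias)           # one pass over the whole image
for r in range(2):
    for c in range(2):
        per_crop = fc_forward(flatten(crop(image, r, c, K)), fc_weights, bias)
        assert all(math.isclose(a, b, abs_tol=1e-9)
                   for a, b in zip(grid[r][c], per_crop))
print(len(grid), len(grid[0]), len(grid[0][0]))   # 2 2 4
```

The FC weights and the conv kernels hold the same numbers in a different layout, which is why each cell of the convolutional output agrees with the FC result for the matching crop.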


Bounding boxes simply match the crops; they are not learned (not part of the output prediction) => low localization accuracy.
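As a rough sketch of what "bounding boxes match crops" means in practice, the hypothetical helper below turns a grid of class probabilities into boxes. The probabilities are made-up numbers, and the window size (14) and stride (2) are assumptions matching the 14x14 example; nothing here comes from a real model.

```python
def boxes_from_grid(probs, window, stride, threshold=0.5):
    """Turn an R x C grid of per-class probabilities into fixed-size boxes.

    Each detection is (class_index, probability, x0, y0, x1, y1) in input pixels.
    The box is simply the crop that produced the cell; nothing is regressed.
    """
    detections = []
    for r, row in enumerate(probs):
        for c, cell in enumerate(row):
            best = max(range(len(cell)), key=cell.__getitem__)
            if cell[best] >= threshold:
                x0, y0 = c * stride, r * stride
                detections.append(
                    (best, cell[best], x0, y0, x0 + window, y0 + window))
    return detections


# Made-up softmax outputs for a 2 x 2 grid and 3 classes (illustrative values only).
probs = [[[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]],
         [[0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]]

for det in boxes_from_grid(probs, window=14, stride=2):
    print(det)
# (0, 0.7, 0, 0, 14, 14)
# (2, 0.8, 0, 2, 14, 16)
```

Every box has the same size and its position is quantized to the stride, so an object that is smaller, larger or offset relative to a window can only be boxed approximately.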


[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

Object Localization in Overfeat - Towards Data Science

Convolutional Implementation of Sliding Windows - Object detection | Coursera


Fully Connected Layers in Convolutional Neural Networks: The Complete Guide

tensorflow - Is it possible to give variable sized images as input to a convolutional neural network? - Cross Validated

