
Thursday, 9 January 2020

Object Detection with OverFeat

Objective


Improve the Sliding Window approach not by reducing the number of crops (as Region-proposal based solutions do) but by passing the image through the CNN only once.

Output


Class predictions for each crop (position of the sliding window).

Method


Based on a convolutional implementation of fully connected layers, so the whole network (CNN + softmax layer) becomes fully convolutional.

The input image goes through a single forward pass of the network, which makes the method computationally efficient. This is why it belongs to the group of detectors called single-shot detectors.

Class probabilities for all locations are predicted at once.

The idea was published by Pierre Sermanet et al. in "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013).


OverFeat: a single propagation through the CNN

Image source: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"

A CNN (like AlexNet) typically has the following structure (the conv-pooling pair is usually repeated several times):

[conv --> pooling] --> FC --> FC --> Softmax
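The key observation is that an FC layer over a flattened N x N feature map computes the same number as a convolution whose kernel spans the whole map. A minimal pure-Python sketch (toy sizes and weights, single channel, no bias term):

```python
# An FC layer over a flattened N x N feature map vs. a conv layer whose
# kernel covers the whole N x N map: same weights, same result.
N = 3  # spatial size of the last feature map (toy value)

# Toy feature map (single channel) and FC weights over its N*N flattened values.
feature_map = [[0.1 * (r * N + c) for c in range(N)] for r in range(N)]
fc_weights = [0.5 - 0.05 * i for i in range(N * N)]

# FC view: flatten the map, then take a dot product with the weights.
flat = [v for row in feature_map for v in row]
fc_out = sum(w * v for w, v in zip(fc_weights, flat))

# Conv view: a single N x N kernel holding the same weights, applied once.
kernel = [[fc_weights[r * N + c] for c in range(N)] for r in range(N)]
conv_out = sum(kernel[r][c] * feature_map[r][c]
               for r in range(N) for c in range(N))

print(abs(fc_out - conv_out) < 1e-12)  # True: the two views are the same map
```

Because the conv view no longer flattens anything, the same kernel can slide over a larger feature map and produce one output per position.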


FC layers require a fixed-size input which, traced back through the network, requires the first convolutional layer (the 1st layer in the network) to also accept input of a certain (fixed) size. This is, for example, why in some Region-based detectors we need to warp Regions of Interest before they go through the CNN.

But what happens if we don't impose this restriction? The output of the last layer before the first FC layer will simply be a tensor of different (larger) dimensions.

If we then replace the FC layers with convolution layers so that for some NxN input we get a 1x1 output (1x1xC actually, where C is the number of classes), we get a fully convolutional network which can accept an input image of any size and which will consequently output a tensor of varying dimensions. It turns out that the class predictions in the output layer spatially match "crops" of the input image. E.g. if the input is (N+2)x(N+2), the output will be 2x2(xC); the upper-left vector will contain C predictions for the NxN upper-left crop of the input image, and similarly the lower-right vector will contain C predictions for the NxN lower-right crop. So in one pass through the CNN we essentially get predictions for all Sliding Window crops! The Sliding Window stride is determined by the network's cumulative subsampling (the product of the strides of its convolution and pooling layers).

In other words, we can read the figure above in the following way:

If we design a fully convolutional network so that for an input of size N x N x 3 (3 for the RGB channels) it outputs a tensor of size 1 x 1 x C (C is the number of classes), then passing it an input image of size (N+2) x (N+2) x 3 yields an output tensor of size 2 x 2 x C. Each element of that 2 x 2 output is a vector of C values (class predictions), and each output spatially matches an N x N crop of the input image.
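A minimal pure-Python sketch of this equivalence (toy values; the whole "network" is collapsed into a single N x N linear map, and the stride S stands in for the network's cumulative subsampling):

```python
# Dense convolutional evaluation vs. explicit sliding-window crops.
N, S = 4, 2    # classifier's designed input size and overall stride (toy values)
M = N + S      # larger input -> expect a 2 x 2 grid of predictions

image = [[((r * M + c) % 7) * 0.1 for c in range(M)] for r in range(M)]
kernel = [[0.01 * (r * N + c + 1) for c in range(N)] for r in range(N)]

def classify(patch):
    # the whole "network" collapsed into one N x N linear map -> a single score
    return sum(kernel[i][j] * patch[i][j] for i in range(N) for j in range(N))

def crop(img, top, left):
    return [row[left:left + N] for row in img[top:top + N]]

out_size = (M - N) // S + 1   # = 2 here

# Dense (convolutional) evaluation: one pass over the whole image.
dense = [[sum(kernel[i][j] * image[r * S + i][c * S + j]
              for i in range(N) for j in range(N))
          for c in range(out_size)]
         for r in range(out_size)]

# Sliding-window evaluation: crop first, then classify each crop separately.
windowed = [[classify(crop(image, r * S, c * S)) for c in range(out_size)]
            for r in range(out_size)]

print(dense == windowed)          # True: same predictions, obtained in one pass
print(len(dense), len(dense[0]))  # 2 2
```

In a real network the dense pass additionally shares the convolutional feature computation between overlapping crops, which is where the efficiency gain comes from.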


OverFeat is based on the idea of replacing FC layers with conv layers, as this allows passing an image of any size into the ConvNet.

The 1st row shows a ConvNet which outputs a class-prediction vector for an input image of a certain size (14x14 in this case). The output (1x1xC) is a single vector containing C elements, the class predictions (C is the number of classes).

If we pass an image of some other dimensions through this ConvNet, we get a larger matrix at the output (2x2xC) which contains class predictions for 4 positions of the sliding window. And we get all these predictions in one go!


Problem


Bounding boxes simply match the sliding-window crops; they are not learned (not part of the output prediction), which leads to low localization accuracy.


References:


[1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

Object Localization in Overfeat - Towards Data Science

Convolutional Implementation of Sliding Windows - Object detection | Coursera

overfeat_eric.pdf

Fully Connected Layers in Convolutional Neural Networks: The Complete Guide - MissingLink.ai

tensorflow - Is it possible to give variable sized images as input to a convolutional neural network? - Cross Validated



Wednesday, 8 January 2020

Object Detection

Objective


Classify and locate multiple objects in the image. This task is different from Classification + Localization because the number of outputs can vary.

Input


  • image
  • fixed set of labels (categories/classes)

Output


  • Bounding boxes around each object
  • for each bounding box, a confidence score describing how confident the model is that the box contains an object of a certain class

We don't know ahead of time how many objects the image will contain.


Object Detection.
(SSD method used)

Method


If there is one object, our system outputs 5 numbers (a class prediction plus 4 bounding-box coordinates). If there are N objects, it outputs N * 5 numbers. Because the output size varies from image to image, it is tricky to frame Object Detection as a plain regression problem.
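The varying output size can be illustrated with a trivial sketch (the tuple layout and the values below are purely illustrative):

```python
# Each detection encoded as (class_id, x, y, w, h), so a regressor would have
# to emit 5 * number_of_objects raw numbers -- a quantity that varies per image.
detections_one_object = [(3, 0.2, 0.3, 0.4, 0.5)]
detections_three_objects = [
    (3, 0.2, 0.3, 0.4, 0.5),
    (1, 0.6, 0.1, 0.2, 0.2),
    (7, 0.5, 0.5, 0.3, 0.4),
]

def output_numbers(detections):
    """Flatten detections into the raw numbers a regressor would have to emit."""
    return [v for det in detections for v in det]

print(len(output_numbers(detections_one_object)))    # 5
print(len(output_numbers(detections_three_objects))) # 15
```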

One of the early attempts to solve the problem of Object Detection was Haar Cascades, proposed by Viola and Jones in 2001. But high-quality results came only after deep learning was introduced.

Performance of Object Detection systems (measured in mAP, mean Average Precision) was increasing but stagnated (around 40%) up to 2012, after which deep CNNs started being used; mAP jumped to over 50% and has kept increasing, reaching over 90% nowadays.

Non-maximum suppression (NMS) is a post-processing step that removes redundant overlapping detections: among boxes whose overlap (IoU) with a higher-scoring box exceeds a threshold, only the highest-scoring one is kept. Boxes whose confidence score falls below a pre-set threshold are usually discarded first.
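A minimal greedy NMS sketch (the box coordinates, scores, and IoU threshold below are illustrative values, not tied to any particular detector):

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first and is suppressed
```

Production implementations work the same way conceptually but vectorize the IoU computation over all boxes at once.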

Object detection models can be grouped in the following way:

  • Traditional:
    • 3 stages:
      • Informative Region Selection (generation of candidate bounding boxes): sliding window
      • Feature extraction: SIFT (Scale-Invariant Feature Transform), HOG, Haar-like
      • Classification: SVM, AdaBoost, Deformable Part-based Model (DPM)
    • Examples:
      • Haar cascade classifier
      • Histogram of Oriented Gradient (HOG) features
    • Problems:
      • bounding boxes generated by sliding window are inefficient, redundant and inaccurate
      • manually engineered low-level feature descriptors
      • neither features nor bounding boxes are learned
  • Deep Learning (DNN)-based:
    • emerged with DNN/CNN (in 2012)
    • 2 types:
      • Two-stage (multi-shot) object detectors
        • 2 phases:
          • propose regions
          • for each region sequentially perform classification (find class probabilities) and regression (bounding box coordinates)
        • Sliding Window, Region-proposal based (R-CNN, Fast R-CNN, Faster R-CNN), SPP-Net
      • One-stage (single-shot) object detectors (e.g. OverFeat, SSD)

References:


Stanford University School of Engineering: Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition: Lecture 11 | Detection and Segmentation - YouTube

RCNN, FAST RCNN, FASTER RCNN : OBJECT DETECTION AND LOCALIZATION THROUGH DEEP NEURAL NETWORKS - YouTube

Computer Vision: Crash Course Computer Science #35 - YouTube

Paul Viola, Michael Jones: "Rapid Object Detection using a Boosted Cascade of Simple Features" (2001)

Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, Xindong Wu: "Object Detection with Deep Learning: A Review" (Apr 2019)