Object Detection

Wednesday, 8 January 2020

Objective

Classify and locate multiple objects in an image. This task differs from Classification + Localization because the number of outputs can vary.

Input

image
fixed set of labels (categories/classes)

Output

Bounding boxes around each object
for each bounding box, a confidence score that describes how confident the model is that the box contains an object of a given class

We don't know ahead of time how many objects an image will contain.

Figure: Object Detection (SSD method used)

Method

If there is one object, our system outputs 5 numbers (1 class prediction + 4 bounding box coordinates). If there are N objects, it outputs N * 5 numbers. Because the output size varies from image to image, it is very tricky to frame Object Detection as a plain regression problem.
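To make the varying output size concrete, here is a minimal sketch in Python. The `Detection` class and the `flatten` helper are hypothetical names introduced only for illustration, and the (class, x_min, y_min, x_max, y_max) layout is an assumption; real detectors may encode boxes differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One detected object: a class index plus 4 box coordinates (assumed layout)."""
    class_id: int
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def flatten(detections: List[Detection]) -> List[float]:
    """Concatenate all detections into one flat vector of length N * 5."""
    flat: List[float] = []
    for d in detections:
        flat.extend([float(d.class_id), d.x_min, d.y_min, d.x_max, d.y_max])
    return flat


one_object = [Detection(0, 10, 20, 110, 220)]
three_objects = one_object + [
    Detection(1, 50, 60, 90, 120),
    Detection(0, 200, 40, 260, 100),
]

print(len(flatten(one_object)))     # 5
print(len(flatten(three_objects)))  # 15
```

A regression model with a fixed-size output layer cannot produce both vectors, which is why detectors predict a large fixed set of candidate boxes and filter them afterwards.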

One of the early attempts to solve Object Detection was Haar Cascades, proposed by Viola and Jones in 2001. High-quality results, however, came only after deep learning was introduced.

The performance of Object Detection systems (measured in mAP, mean Average Precision) had been increasing but stagnated at around 40% by 2012. After that, deep CNNs started being used: mAP jumped to over 50% and has since risen to over 90%.
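For reference, here is a rough sketch of how Average Precision can be computed for one class, with mAP as the mean over classes. It assumes every detection has already been marked as a true or false positive (normally done by matching against ground-truth boxes) and uses all-point interpolation of the precision-recall curve. Benchmarks differ in the exact protocol, so treat the function names and details as illustrative only.

```python
from typing import Dict, List, Tuple

ScoredHit = Tuple[float, bool]  # (confidence, is_true_positive)


def average_precision(scored_hits: List[ScoredHit], num_ground_truth: int) -> float:
    """Area under the interpolated precision-recall curve for one class."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    true_pos = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_pos += 1
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_ground_truth)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the curve.
    ap = 0.0
    prev_recall = 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[ScoredHit], int]]) -> float:
    """Mean of the per-class AP values."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Toy data: 3 detections of one class, 2 ground-truth objects.
hits = [(0.9, True), (0.8, False), (0.7, True)]
print(round(average_precision(hits, 2), 4))  # 0.8333
```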

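Non-maximum suppression (NMS) is a post-processing step which removes duplicate detections: among bounding boxes that overlap heavily (IoU, Intersection over Union, above a pre-set threshold), only the one with the highest confidence score is kept. It is usually combined with discarding all bounding boxes for which the confidence score is below a pre-set threshold.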
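A minimal sketch of this post-processing in pure Python, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples. The function names and the default thresholds are illustrative choices, not values from any particular detector. The code first drops low-confidence boxes, then greedily keeps the most confident box among those that overlap heavily.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        score_threshold: float = 0.5, iou_threshold: float = 0.5) -> List[int]:
    """Return the indices of the boxes to keep, most confident first."""
    # Step 1: drop boxes whose confidence is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the remaining boxes from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
print(nms(boxes, scores))  # [0, 2]
```

Box 1 is suppressed because it almost coincides with the more confident box 0, and box 3 is dropped for low confidence. In practice this is typically run separately for each class, so that overlapping objects of different classes do not suppress each other.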

Object detection models can be grouped in the following way:

Traditional:

3 stages:

Informative Region Selection (generation of candidate bounding boxes): sliding window
Feature extraction: SIFT (Scale-Invariant Feature Transform), HOG, Haar-like
Classification: SVM, AdaBoost, Deformable Part-based Model (DPM)
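The first stage can be sketched as follows: a window of fixed size slides over the image with a given stride, and every position becomes a candidate bounding box. The image size, window size and stride below are arbitrary example values. In a full pipeline each window would then go through feature extraction and classification, which are not shown here.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def sliding_windows(image_w: int, image_h: int,
                    win_w: int, win_h: int, stride: int) -> Iterator[Window]:
    """Yield every win_w x win_h window that fits inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


windows = list(sliding_windows(image_w=640, image_h=480, win_w=64, win_h=128, stride=8))
print(len(windows))  # 3285
```

Even at a single scale and aspect ratio, a 640 x 480 image yields thousands of heavily overlapping candidates. Repeating the scan for several window sizes multiplies that number.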

Examples:

Haar cascade classifier
Histogram of Oriented Gradients (HOG) features

Problems:

generating bounding boxes with a sliding window is inefficient, and the resulting boxes are redundant and inaccurate
manually engineered low-level feature descriptors
neither features nor bounding boxes are learned

Deep Learning (DNN)-based:

emerged with the rise of DNNs/CNNs (in 2012)
2 types:

Two-stage (multi-shot) object detectors

2 phases:

propose regions
for each region in turn, perform classification (find class probabilities) and regression (bounding box coordinates)

Examples: Sliding Window, Region-proposal based (R-CNN, Fast R-CNN, Faster R-CNN), SPP-Net
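The two phases can be sketched as a simple loop. Here `propose` and `head` are hypothetical stand-ins for a region proposal method and a trained classification/regression network, and the toy versions at the bottom exist only to make the example runnable. The box refinement uses the center/size offset parameterization common in the R-CNN family; other detectors may encode offsets differently.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max)
Deltas = Tuple[float, float, float, float]  # (dx, dy, dw, dh)


def apply_deltas(proposal: Box, deltas: Deltas) -> Box:
    """Refine a proposal: shift its center and rescale its width and height."""
    w = proposal[2] - proposal[0]
    h = proposal[3] - proposal[1]
    cx = proposal[0] + 0.5 * w
    cy = proposal[1] + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


def two_stage_detect(
    image: Any,
    propose: Callable[[Any], List[Box]],
    head: Callable[[Any, Box], Tuple[Sequence[float], Deltas]],
) -> List[Tuple[Box, int, float]]:
    """Phase 1: propose regions. Phase 2: classify and refine each region."""
    detections = []
    for proposal in propose(image):
        probs, deltas = head(image, proposal)
        class_id = max(range(len(probs)), key=lambda c: probs[c])
        detections.append((apply_deltas(proposal, deltas), class_id, probs[class_id]))
    return detections


# Toy stand-ins (assumptions for illustration, not real models).
def toy_propose(image: Any) -> List[Box]:
    return [(0.0, 0.0, 100.0, 100.0)]


def toy_head(image: Any, proposal: Box) -> Tuple[Sequence[float], Deltas]:
    return [0.1, 0.9], (0.25, 0.0, 0.0, 0.0)


print(two_stage_detect(None, toy_propose, toy_head))
# [((25.0, 0.0, 125.0, 100.0), 1, 0.9)]
```

Because the second phase runs once per proposed region, the cost grows with the number of proposals, which is one reason this family of detectors is slower than detectors that need a single pass.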

One-stage (single-shot) object detectors

require only a single pass through the neural network and predict all the bounding boxes in one go
much faster and much more suitable for embedded/mobile devices
Examples: