Object Detection with SSD

Saturday, 11 January 2020

Object Detection with SSD

SSD (Single Shot Multibox Detector) is a method for object detection (object localization and classification) which uses a single Deep Neural Network (DNN). Single Shot means that object detection is performed in a single forward pass of the DNN.

This method was proposed by Wei Liu et al. in December 2015 and revised last time in December 2016: SSD: Single Shot MultiBox Detector.

Objective

Fast Object Detection.

Method

The SSD network, built on the VGG-16 network, performs the task of object detection and localization in a single forward pass of the network. This approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple features with different resolutions to naturally handle objects of various sizes. [source]

Here are some key points from the paper's abstraction:

SSD uses single deep neural network
SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location

BK: note here that different aspect ratios and scales are not applied to anchor boxes in the image but feature map

At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes.
Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For 300×300 input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for 500×500 input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model.

SSD Framework

Image source: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector

We know that deeper Conv layers in CNNs extract/learn more complex features.
Feature maps preserve spatial structure of the input image but at lower resolution.

Lecture 11: Detection and Localization

If we take some CNN (like ResNet) pretrained for image recognition (image classification) and remove its last FC layers we'll get as its output a feature map as described above.

Now we can do something which YOLO does on the image - divide feature map into a grid cells and apply equidistant detector which predicts anchor boxes.

--------------
Given our input image (3 * H * W) you imagine dividing that input image into some coarse S * S grid, and now within each of those grid cells you imagine some set of B base bounding boxes (e.g. B = 3 base bounding boxes like a tall one, a wide one, and a square one but in practice you would use more than three). These bounding boxes are centered at each grid cell.

Now for each of these grid cells (S x S) network has to predict two things:

for each of these base bounding boxes (B): an offset off the base bounding box to predict what is the true location of the object off this base bounding box.

This prediction has two components:

bounding box coordinates: dx, dy, dh , dw
confidence

So the final output has B * 5 values

classification scores for each of C classes (including background as a class)

At the end we end up predicting from our input image this giant tensor:

S * S * (B * 5 + C)

So that's just where we have B base bounding boxes, we have five numbers for each giving our offset and our confidence for that base bounding box and C classification scores for our C categories.

So then we kind of see object detection as this input of an image, output of this three dimensional tensor and you can imagine just training this whole thing with a giant convolutional network.

And that's kind of what these single shot methods do where they just, and again matching the ground truth objects into these potential base boxes becomes a little bit hairy but that's what these methods do.
--------------

Architecture

SSD has two components:

base (backbone) model
SSD head

Backbone model:

usually a pre-trained image classification network as a feature extractor from which the final fully connected classification layer has been removed; such NN is able to extract semantic meaning from the input image while preserving the spatial structure of the image albeit at a lower resolution
VGG-16 or ResNet trained on ImageNet

SSD head:

one or more convolutional layers added to the backbone
outputs are interpreted as the bounding boxes and classes of objects in the spatial location of the final layers activations

SSD vs YOLO Network Architecture
Image source: Wei Liu et al.: "SSD: Single Shot MultiBox Detector"

Examples

Tensorflow Object Detection API comes with pretrained models where ssd_inception_v2_coco_2017_11_17 is one of them.

TensorRT/samples/opensource/sampleUffSSD at master · NVIDIA/TensorRT · GitHub

TensorFlow implementation of SSD, which actually differs from the original paper, in that it has an inception_v2 backbone. For more information about the actual model, download ssd_inception_v2_coco. The TensorFlow SSD network was trained on the InceptionV2 architecture using the MSCOCO dataset which has 91 classes (including the background class). The config details of the network can be found here.

Logo detection in Images using SSD - Towards Data Science
TensorFlow Object Detection API with Single Shot MultiBox Detector (SSD) - YouTube

ssd_mobilenet_v1_coco_2017_11_17