Tuesday, 7 January 2020

Object Localization (Classification with Localization)

Goal


Predict the class of the main subject of an image and its location.

Input 

  • image
  • list of labels (categories/classes)

Output

  • prediction of the class of the main subject of the image
  • prediction of the position of that object in the image (its bounding box - a minimal rectangle that completely contains it)


Method


Traditional: feature detection (HOG, Haar-like, ...) + classification (SVM, ...)

CNN: feature extraction + classification and bounding box prediction (regression). The model learns both the class and the location.
 

Architecture


Typical architecture:

A CNN where the feature vector is fully connected both to a softmax layer (the classifier, which outputs class probabilities) and to a 4-node layer (the regressor, which outputs the bounding box coordinates and dimensions):

  • input layer
  • DNN (e.g. AlexNet)
  • feature vector (the output of the convolutional part of the network, which summarizes the content of the image); 4096 nodes
  • fully connected layer that outputs class scores; connects the 4096 feature vector nodes to e.g. 1000 nodes, one per class; a classification problem
  • another fully connected layer that outputs bounding box coordinates; connects the 4096 feature vector nodes to 4 nodes (height, width and the coordinates of the center) in the box coordinates layer; treats localization as a regression problem

Fully supervised setting: for each training image we have an annotated ground truth (correct) label and box coordinates.
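The two-headed architecture above can be sketched in plain NumPy. This is a toy stand-in: the random weight matrices and the precomputed 4096-d feature vector replace a real convolutional backbone such as AlexNet, and the sizes (4096 features, 1000 classes, 4 box parameters) are the ones quoted in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the notes: 4096-d feature vector, 1000 classes, 4 box parameters.
FEAT, CLASSES, BOX = 4096, 1000, 4

# Feature vector that would come out of the convolutional backbone.
features = rng.standard_normal(FEAT)

# Two independent fully connected heads on top of the same feature vector.
W_cls = rng.standard_normal((CLASSES, FEAT)) * 0.01
W_box = rng.standard_normal((BOX, FEAT)) * 0.01

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class_probs = softmax(W_cls @ features)   # classification head: class probabilities
box = W_box @ features                    # regression head: box parameters

print(class_probs.shape, box.shape)       # (1000,) (4,)
```

Both heads read the same features; only the loss applied to each output differs.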

Loss Function


During the training (backpropagation) phase, assuming a fully supervised setting, we have two losses:

  • one for the predicted category, which measures the difference between the correct label and the predicted class scores: Softmax Loss (actually a cross-entropy loss, the standard loss function for a softmax layer [Is the softmax loss the same as the cross-entropy loss? - Quora])
  • one for the predicted box coordinates: L2 (least squares) loss, which measures the dissimilarity between the predicted and the ground truth bounding box [What Are L1 and L2 Loss Functions?]
  • the total loss is a multi-task loss: a weighted sum of these two losses
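A minimal NumPy sketch of this multi-task loss (the function names and the balancing weight `alpha` are my own choices for illustration; deep-learning frameworks ship these losses built in):

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy between the true label and softmax(scores)."""
    z = scores - scores.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def l2_loss(pred_box, gt_box):
    """Sum of squared errors over the box parameters."""
    return ((pred_box - gt_box) ** 2).sum()

def multitask_loss(scores, label, pred_box, gt_box, alpha=1.0):
    """Weighted sum of the classification and localization losses."""
    return softmax_cross_entropy(scores, label) + alpha * l2_loss(pred_box, gt_box)

# Example: correct class is 0, predicted box is slightly off the ground truth.
scores = np.array([3.0, 1.0, 0.2])
pred_box = np.array([50.0, 60.0, 20.0, 30.0])
gt_box = np.array([48.0, 62.0, 20.0, 28.0])
loss = multitask_loss(scores, 0, pred_box, gt_box, alpha=0.1)
```

The weight `alpha` matters in practice because the two losses live on different scales: raw pixel-coordinate errors can dwarf the cross-entropy term.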

Human Pose Estimation


This idea of predicting a fixed number of positions in the image also applies to Human Pose Estimation:

  • input: image of a person
  • output: positions/coordinates of the joints (e.g. 14 joints: left/right foot, knee, hip, shoulder, elbow, hand; neck, head top)
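The same regression-head idea can be sketched in NumPy: instead of 4 box parameters, the head outputs one (x, y) pair per joint. The feature size and random weights are illustrative stand-ins, not values from the lecture:

```python
import numpy as np

JOINTS = 14  # left/right foot, knee, hip, shoulder, elbow, hand; neck, head top

rng = np.random.default_rng(0)
features = rng.standard_normal(4096)                     # backbone feature vector
W_pose = rng.standard_normal((JOINTS * 2, 4096)) * 0.01  # regression head weights

# Regression head: predicts one (x, y) coordinate pair per joint.
joints = (W_pose @ features).reshape(JOINTS, 2)

print(joints.shape)  # (14, 2)
```

Training again uses an L2-style regression loss between predicted and annotated joint coordinates.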

References:


Stanford University School of Engineering: Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition, Lecture 11: Detection and Segmentation. Link: Lecture 11 | Detection and Segmentation - YouTube
