Tuesday, 7 January 2020

Object Localization (Classification with Localization)

Goal


Predict the class of the main subject of an image and its location.

Input 

  • image
  • list of labels (categories/classes)

Output

  • prediction of the class of the main subject of the image
  • prediction of the position of that object in the image (its bounding box - a minimal rectangle that completely contains it)


Method


Traditional: feature detection (HOG, Haar-like, ...) + classification (SVM, ...)

CNN: feature extraction + classification and bounding box prediction (regression). The model learns both the class and the location.
 

Architecture


Typical architecture:

A CNN where the feature vector is fully connected both to a softmax layer (the classifier, which outputs class probabilities) and to a 4-node layer (the regressor, which outputs the bounding box coordinates and dimensions):

  • input layer
  • DNN (e.g. AlexNet)
  • feature vector (the output of the convolutional part of the network, which summarizes the content of the image); 4096 nodes
  • fully connected layer that outputs class scores; connects the 4096 feature vector nodes to e.g. 1000 nodes, one per class; a classification problem
  • another fully connected layer that outputs bounding box coordinates; connects the 4096 feature vector nodes to 4 nodes (height, width and the coordinates of the center) in the box coordinates layer; treats localization as a regression problem

Fully supervised setting: for each training image we have an annotated ground truth (correct) label and box coordinates.
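The two-headed architecture above can be sketched in plain NumPy. This is a toy stand-in: the random weight matrices and the precomputed 4096-d feature vector replace a real convolutional backbone such as AlexNet, and the sizes (4096 features, 1000 classes, 4 box parameters) are the ones quoted in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the notes: 4096-d feature vector, 1000 classes, 4 box parameters.
FEAT, CLASSES, BOX = 4096, 1000, 4

# Feature vector that would come out of the convolutional backbone.
features = rng.standard_normal(FEAT)

# Two independent fully connected heads on top of the same feature vector.
W_cls = rng.standard_normal((CLASSES, FEAT)) * 0.01
W_box = rng.standard_normal((BOX, FEAT)) * 0.01

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class_probs = softmax(W_cls @ features)   # classification head: class probabilities
box = W_box @ features                    # regression head: box parameters

print(class_probs.shape, box.shape)       # (1000,) (4,)
```

Both heads read the same features; only the loss applied to each output differs.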

Loss Function


During the training (backpropagation) phase, assuming a fully supervised setting, we have two losses:

  • one for the predicted category, which measures the difference between the correct label and the predicted class scores: Softmax Loss (actually a cross-entropy loss, the standard loss function for a softmax layer [Is the softmax loss the same as the cross-entropy loss? - Quora])
  • one for the predicted box coordinates: L2 (least squares) loss, which measures the dissimilarity between the predicted and the ground truth bounding box [What Are L1 and L2 Loss Functions?]
  • the total loss is a multi-task loss: a weighted sum of these two losses
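A minimal NumPy sketch of this multi-task loss (the function names and the balancing weight `alpha` are my own choices for illustration; deep-learning frameworks ship these losses built in):

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Cross-entropy between the true label and softmax(scores)."""
    z = scores - scores.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def l2_loss(pred_box, gt_box):
    """Sum of squared errors over the box parameters."""
    return ((pred_box - gt_box) ** 2).sum()

def multitask_loss(scores, label, pred_box, gt_box, alpha=1.0):
    """Weighted sum of the classification and localization losses."""
    return softmax_cross_entropy(scores, label) + alpha * l2_loss(pred_box, gt_box)

# Example: correct class is 0, predicted box is slightly off the ground truth.
scores = np.array([3.0, 1.0, 0.2])
pred_box = np.array([50.0, 60.0, 20.0, 30.0])
gt_box = np.array([48.0, 62.0, 20.0, 28.0])
loss = multitask_loss(scores, 0, pred_box, gt_box, alpha=0.1)
```

The weight `alpha` matters in practice because the two losses live on different scales: raw pixel-coordinate errors can dwarf the cross-entropy term.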

Human Pose Estimation


This idea of predicting a fixed number of positions in the image also applies to Human Pose Estimation:

  • input: image of a person
  • output: positions/coordinates of the joints (e.g. 14 joints: left/right foot, knee, hip, shoulder, elbow, hand; neck, head top)
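The same regression-head idea can be sketched in NumPy: instead of 4 box parameters, the head outputs one (x, y) pair per joint. The feature size and random weights are illustrative stand-ins, not values from the lecture:

```python
import numpy as np

JOINTS = 14  # left/right foot, knee, hip, shoulder, elbow, hand; neck, head top

rng = np.random.default_rng(0)
features = rng.standard_normal(4096)                     # backbone feature vector
W_pose = rng.standard_normal((JOINTS * 2, 4096)) * 0.01  # regression head weights

# Regression head: predicts one (x, y) coordinate pair per joint.
joints = (W_pose @ features).reshape(JOINTS, 2)

print(joints.shape)  # (14, 2)
```

Training again uses an L2-style regression loss between predicted and annotated joint coordinates.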

References:


Stanford University School of Engineering: Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition, Lecture 11: Detection and Segmentation. Link: Lecture 11 | Detection and Segmentation - YouTube
