Wednesday 8 January 2020

Object Detection


Classify and locate multiple objects in an image. This task differs from Classification + Localization in that the number of outputs can vary.


Input:
  • image
  • fixed set of labels (categories/classes)


Output:
  • a bounding box around each object
  • for each bounding box, a confidence score that describes how confident the model is that the box contains an object of a certain class

We don't know ahead of time how many objects an image will contain.

Figure: Object Detection (SSD method used)


If there is one object, the system outputs 5 numbers (1 class prediction + 4 bounding box coordinates). If there are N objects, it outputs N * 5 numbers. Because the output size varies from image to image, it is hard to frame Object Detection as a plain regression problem with a fixed-size output.
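
The variable output size can be illustrated with a minimal sketch. The `Detection` type, its field names and the (x, y, w, h) box format are assumptions made for this example only; they are not taken from any particular detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    # Hypothetical container: 1 class prediction + 4 box coordinates.
    class_id: int  # index into the fixed set of labels
    x: float       # left edge of the box (assumed format)
    y: float       # top edge of the box
    w: float       # box width
    h: float       # box height


def flatten(detections: List[Detection]) -> List[float]:
    """Concatenate all detections into one flat output vector."""
    out: List[float] = []
    for d in detections:
        out.extend([float(d.class_id), d.x, d.y, d.w, d.h])
    return out


one = [Detection(0, 10.0, 20.0, 50.0, 80.0)]
three = one + [
    Detection(1, 5.0, 5.0, 30.0, 30.0),
    Detection(2, 60.0, 40.0, 20.0, 25.0),
]

print(len(flatten(one)))    # 5
print(len(flatten(three)))  # 15
```

The length of the output depends on the image content, which is why a network with a fixed number of regression outputs does not fit the task directly.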

One of the early attempts to solve the problem of Object Detection was Haar Cascades, proposed by Viola and Jones in 2001. High-quality results, however, came only after deep learning was introduced.

The performance of Object Detection systems, measured in mAP (mean Average Precision), rose for several years and then stagnated at around 40% until 2012. Once deep CNNs came into use, mAP jumped to over 50% and has kept climbing since, to over 90% nowadays.

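Non-maximum suppression (NMS) is a post-processing step that removes duplicate detections of the same object: among bounding boxes that overlap heavily (intersection over union above a pre-set threshold), only the one with the highest confidence score is kept. It is usually combined with a simpler filter that discards all bounding boxes whose confidence score is below a pre-set threshold.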
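
A minimal sketch of this kind of post-processing in plain Python is given here. It first drops low-confidence boxes and then applies greedy non-maximum suppression based on intersection over union (IoU). The function names, the corner box format (x1, y1, x2, y2) and both default thresholds are assumptions for illustration. The sketch is class-agnostic, whereas real detectors typically run this step per class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float],
        score_thr: float = 0.5, iou_thr: float = 0.5) -> List[int]:
    """Return indices of the boxes that survive filtering."""
    # Step 1: discard boxes with a confidence score below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thr]
    # Step 2: greedy suppression, highest score first.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while candidates:
        best = candidates.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
print(nms(boxes, scores))  # [0, 2]
```

In the usage example, box 1 is suppressed because it overlaps box 0 with an IoU of about 0.82, and box 3 is dropped by the score threshold.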

Object detection models can be grouped in the following way:

  • Traditional:
    • 3 stages:
      • Informative Region Selection (generation of candidate bounding boxes): sliding window
      • Feature extraction: SIFT (Scale-Invariant Feature Transform), HOG, Haar-like
      • Classification: SVM, AdaBoost, Deformable Part-based Model (DPM)
    • Examples:
      • Haar cascade classifier
      • Histogram of Oriented Gradients (HOG) features
    • Problems:
      • generating bounding boxes with a sliding window is inefficient, and the resulting boxes are redundant and inaccurate
      • manually engineered low-level feature descriptors
      • neither features nor bounding boxes are learned
  • Deep Learning (DNN)-based:
    • emerged with deep CNNs (in 2012)
    • 2 types:
      • Two-stage (multi-shot) object detectors
        • 2 phases:
          • propose regions
          • for each region sequentially perform classification (find class probabilities) and regression (bounding box coordinates)
        • Examples: Sliding Window, Region-proposal based (R-CNN, Fast R-CNN, Faster R-CNN), SPP-Net
      • One-stage (single-shot) object detectors
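
The sliding-window region selection listed under the traditional approach can be sketched as follows, to show why it produces so many redundant candidates. The image size, window sizes and stride are arbitrary values chosen for this example, and `sliding_windows` is a hypothetical helper, not a library function.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def sliding_windows(img_w: int, img_h: int,
                    win_w: int, win_h: int, stride: int) -> Iterator[Box]:
    """Yield every window position that fits fully inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# Count candidate boxes for a 640x480 image at three square window sizes.
counts = {
    size: sum(1 for _ in sliding_windows(640, 480, size, size, stride=16))
    for size in (64, 128, 256)
}
print(counts)                # {64: 999, 128: 759, 256: 375}
print(sum(counts.values()))  # 2133
```

Even with only three window sizes, one aspect ratio and a coarse stride, each image yields over two thousand heavily overlapping candidates, and every one of them has to go through feature extraction and classification.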


    Stanford University School of Engineering: Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition: Lecture 11 | Detection and Segmentation (YouTube)


    Computer Vision: Crash Course Computer Science #35 - YouTube

    Paul Viola, Michael Jones: "Rapid Object Detection using a Boosted Cascade of Simple Features" (2001)

    Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, Xindong Wu: Object Detection with Deep Learning: A Review (Apr 2019)
