Monday 6 January 2020

Image Recognition (Object Classification)


Predict the main subject of an image. This is perhaps the most basic task in computer vision.


Input:
  • image
  • predefined set of labels/categories

Output: category of the main object in the image (or a probability distribution over the classes).

Image recognition


The process usually has two stages:
  • feature extraction
  • classification

Implementation #1: HOG + SVM

Use Histogram of Oriented Gradients (HOG) for feature extraction. The final HOG feature vector can be fed into a classifier such as an SVM.
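The idea behind a HOG cell can be sketched in plain Python. This is a deliberately simplified single-cell version for illustration; real implementations (e.g. scikit-image's `hog`) work on 8×8-pixel cells with block normalization:

```python
import math

def orientation_histogram(patch, bins=9):
    """Simplified HOG cell: histogram of gradient orientations (0-180 deg),
    weighted by gradient magnitude. `patch` is a 2D list of grayscale values."""
    hist = [0.0] * bins
    h, w = len(patch), len(patch[0])
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # horizontal gradient
            gy = patch[y + 1][x] - patch[y - 1][x]   # vertical gradient
            magnitude = math.hypot(gx, gy)
            # unsigned orientation in [0, 180)
            angle = math.degrees(math.atan2(gy, gx)) % 180
            hist[int(angle // (180 / bins)) % bins] += magnitude
    return hist

# A patch with a vertical edge: all gradients point horizontally (angle ~ 0),
# so the magnitude piles up in the first orientation bin.
patch = [[0, 0, 255, 255]] * 4
print(orientation_histogram(patch))
```

Concatenating many such cell histograms gives the final feature vector that is passed to the SVM.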

Implementation #2: CNN

The input image goes through a deep convolutional network, which produces a feature vector (4096-dimensional in the case of AlexNet). A final fully-connected layer then maps that feature vector to class scores, e.g. 1000 numbers, one per class, when the model is trained on the 1000 ImageNet classes.

E.g. if the model is trained on three labels (cat, dog, other), the classifier outputs a confidence for each class: three numbers like 0.6 (dog), 0.3 (cat), 0.1 (other). The probabilities sum to 1.
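These probabilities come from a softmax over the raw class scores. A minimal sketch in plain Python (the scores and labels are chosen purely for illustration):

```python
import math

def softmax(scores):
    """Turn raw class scores (logits) into probabilities that sum to 1."""
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["dog", "cat", "other"]
probs = softmax([2.0, 1.3, 0.2])
print(dict(zip(labels, probs)))                  # highest score -> highest probability
print(sum(probs))                                # 1.0 (up to floating-point error)
```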

So the network takes an input image and outputs a single category label describing the content of the entire image as a whole.

Feature Extraction - Historical Overview 

The computer sees a W × H grid of pixels, each represented with 3 RGB channel values, i.e. W × H × 3 numbers between 0 and 255.
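To get a feel for the scale, here is the arithmetic for a common CNN input resolution (224 × 224 is an illustrative choice, not something fixed by the task):

```python
# A W x H RGB image is just W * H * 3 integers in [0, 255].
W, H = 224, 224          # common CNN input size (illustrative assumption)
print(W * H * 3)         # 150528 raw numbers the algorithm has to interpret
```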

The problem: the Semantic Gap - the gap between the matrix of pixel values and the notion of the object (e.g. "cat"), which the algorithm has to bridge.

  • viewpoint variation - the same object looks different when observed from different angles
  • illumination - the light source might be in front of, to the side of, or behind the object; the scene can be bright or dark
  • deformation - the object can take many different shapes and poses
  • occlusion - only part of the object might be visible
  • background clutter - the object blends into the background
  • intraclass variation - objects of the same class come in various shapes, sizes, ages...
The algorithm has to be robust to all of these variations.

    Early expert systems used hand-coded rules (if..else statements) to process data and make decisions.

      Traditional machine perception relied on hand-tuned features: raw data is passed to a feature-extraction system, which extracts the features that the designer of the system believes are important for discriminating different types of objects in the image. The feature vector is then passed to a linear classifier, which attempts to classify the object.

      function classify(image: Image): Label {
         // hardcoded, handcrafted rules
         // only experts in this particular domain can write these rules (“expert system”)
         // the designer of the system defines the features that he believes are differentiating
         // must handle all natural variations in look, shape, rotation, scaling, viewpoint, occlusion, illumination…
         // lots of if..else, switch
         // this algorithm can’t be reused - it applies only to this particular domain/example
         return label;
      }

        Problems with this approach:
        • we have to think of all possible valid cases and exceptions
        • it’s very difficult to list all the features that reliably detect the objects we care about; handcrafted features are not robust to natural variations in the appearance of those objects (e.g. lighting, changes in perspective, translation, scaling, rotation)
        • such systems are very brittle and fail on unexpected examples because of these variations
        • the logic applies to a specific task, at a specific point in time, in a single domain
        • very often it is simply not possible to design the rules by hand

        Handcrafting, geometry-based rules, and hardcoding don’t work in computer vision: images are not perfect, and objects are distorted and blend into the background.

        Expert systems with rules defined by humans have failed.

        Attempts were made to find edges, corners, boundaries... and write an explicit set of rules to recognize one particular object.

        This approach does not scale.

        Humans learn from experience by observing a huge number of examples. This data-driven approach can be applied to computers:
        • collect a dataset of images and labels
        • use machine learning to train a classifier
        • evaluate the classifier on new images
        Learning from experience can be narrowed down to two steps: training a model and inference (prediction, test):

        function train(images: Image[], labels: Label[]): Model {
           // Machine Learning:
           //    define a model
           //    forward and backward propagation…
           //    minimize the cost function…
           return model;
        }

        function predict(model: Model, image: Image): Label {
           // Use the trained model to predict the output label:
           //    forward propagation
           return label;
        }

        Two technological advancements were necessary for this approach to become possible:
        • Big Data - access to millions of digital images (mostly created by social media users)
        • Fast CPUs and GPUs


        Training Data 

        Labeled images. Labels are either strings or numbers that correspond to strings.

        If we want to train a system to recognize only one particular class (so the output is 1 - image contains the object, or 0 - image does not contain the object), we'd use two types of training images:

        • closely cropped images of the chosen object (nothing else is in the image; the object takes up the whole image), labeled with 1 (object present)
        • images which don’t contain the object (“background”), labeled with 0 (object not present)

        Loss Function

        For CNNs the usual loss function is cross-entropy loss (the loss function of the Softmax layer).
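        Cross-entropy loss for a single example can be sketched in plain Python (assuming the classifier already outputs a probability distribution, e.g. from a softmax; the numbers are illustrative):

```python
import math

def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example: -log(probability of the true class)."""
    return -math.log(probs[true_index])

# Confident, correct prediction -> small loss
print(cross_entropy([0.6, 0.3, 0.1], 0))  # -log(0.6) ~ 0.51
# Same prediction, but the true class was the least likely one -> large loss
print(cross_entropy([0.6, 0.3, 0.1], 2))  # -log(0.1) ~ 2.30
```

        The loss over a batch is the average of these per-example values; training minimizes it.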

        neural networks - What is the loss function used for CNN? - Cross Validated


        Image recognition only gives a summary of what is in the image as a whole; it does not work well when the image contains multiple objects of interest.


        Stanford University School of Engineering: Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition: Lecture 11 | Detection and Segmentation (YouTube)

        Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS

        Outline of object recognition - Wikipedia

        What Is Image Recognition? - Towards Data Science
