Goal
Predict the main subject of an image.
This is perhaps the most basic task in computer vision.
Input
- image
- predefined set of labels/categories
Method
The process usually has two stages:
- feature extraction
- classification
Implementation #1: HOG + SVM
Use Histogram of Oriented Gradients (HOG) for feature extraction. The final HOG feature vector can then be fed into a classifier such as an SVM.
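A minimal sketch of this pipeline, assuming scikit-image and scikit-learn are available; train_images, train_labels, and test_image are hypothetical placeholders for a prepared dataset:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_features(images):
    # images: iterable of same-sized grayscale arrays, e.g. 128x64
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# train_images, train_labels, test_image are assumed to be prepared elsewhere
X_train = extract_features(train_images)
classifier = LinearSVC()
classifier.fit(X_train, train_labels)

predicted_label = classifier.predict(extract_features([test_image]))[0]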
Implementation #2: CNN
The input image goes through a deep convolutional network, which produces a feature vector (e.g. 4096-dimensional in the case of AlexNet). A final fully-connected layer then maps that feature vector to class scores - e.g. 1000 numbers, one per class, when the model is trained on the 1000 ImageNet categories.
E.g. if the model is trained on three labels (cat, dog, other), the classifier outputs a confidence for each class - three numbers such as 0.6 (dog), 0.3 (cat), and 0.1 (other); together they form a probability distribution, so they must sum to 1.
So the network takes an input image and outputs a single category label describing the content of the entire image as a whole.
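As an illustration, a minimal inference sketch using a pretrained AlexNet from torchvision (an assumed setup; any classification CNN works the same way); image is assumed to be a PIL image loaded elsewhere:

import torch
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

x = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)
with torch.no_grad():
    scores = model(x)                     # raw class scores, shape (1, 1000)
    probs = torch.softmax(scores, dim=1)  # probabilities that sum to 1
predicted_class = probs.argmax(dim=1).item()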
Feature Extraction - Historical Overview
The computer sees a W x H grid of pixels, each represented by 3 RGB channel values - that is, W x H x 3 numbers between 0 and 255.
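A quick way to inspect this raw representation, assuming Pillow and NumPy; "cat.jpg" is a hypothetical file name:

import numpy as np
from PIL import Image

img = np.asarray(Image.open("cat.jpg"))  # "cat.jpg" is a hypothetical file
print(img.shape)  # (H, W, 3): height x width x 3 RGB channels
print(img.dtype)  # uint8: each value is an integer between 0 and 255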
The problem: the semantic gap - the gap between this matrix of raw values and the semantic notion of the object (e.g. "cat"), which the algorithm has to bridge.
Challenges:
- viewpoint variation - the same object looks different if observed from different angles
- illumination - the light source might be in front of, to the side of, or behind the object; the scene can be bright or dark
- deformation - the object can take many different shapes and poses
- occlusion - only part of the object might be visible
- background clutter - the object blends into the background
- intraclass variation - instances of the same class come in various shapes, sizes, ages...
The algorithm has to be robust to all of these variations.
function classify(image: Image): Label {
  // hardcoded, handcrafted rules
  // only experts in this particular domain can write these rules ("expert system")
  // the designer of the system defines the features they believe are discriminative
  // must handle all natural variations in look, shape, rotation, scaling, viewpoint, occlusion, illumination...
  // lots of if..else, switch
  // this algorithm can't be reused - it applies only to this particular domain/example
  return label;
}
- we have to think of all possible valid cases and exceptions
- it’s very difficult to list all the features that reliably detect the objects we care about; handwritten rules are not robust to natural variations in the appearance of those objects (e.g. lighting, perspective changes, translation, scaling, rotation)
- such systems are very brittle and fail on unexpected examples because of these variations
- the logic applies to a specific task, at a specific point in time, in a single domain
- very often it is not possible to design rules by hand
Attempts have been made: find edges, corners, and boundaries, then write an explicit set of rules to recognize one particular object.
This approach does not scale.
Humans learn from experience by observing a huge number of examples. The same data-driven approach can be applied to computers:
- collect dataset of images and labels
- use machine learning to train a classifier
- evaluate the classifier on new images
Learning from experience can be broken down into two steps: training a model and inference (prediction, test):
function train(images: Image[], labels: Label[]): Model {
  // Machine Learning:
  // define a model
  // forward and backward propagation...
  // minimize the cost function...
  return model;
}
function predict(model: Model, image: Image): Label {
  // Use the trained model to predict the output label:
  // forward propagation
  return label;
}
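A rough sketch of what these two functions might look like in practice, using PyTorch (an assumption) and a toy CNN; data loading and batching are omitted for brevity:

import torch
import torch.nn as nn

def train(images, labels, num_classes, epochs=10):
    # define a model: a tiny CNN ending in a fully-connected layer
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_classes),
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(epochs):
        scores = model(images)        # forward propagation
        loss = loss_fn(scores, labels)
        optimizer.zero_grad()
        loss.backward()               # backward propagation
        optimizer.step()              # minimize the cost function
    return model

def predict(model, image):
    with torch.no_grad():
        scores = model(image.unsqueeze(0))  # forward propagation only
    return scores.argmax(dim=1).item()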
This data-driven approach became practical thanks to:
- Big Data - access to millions of digital images (mostly created by social media users)
- fast CPUs and GPUs
Training
Training Data
Labeled images. Labels are either strings or numbers that correspond to strings.
If we want to train the system to recognize only one particular class (the output is 1 if the image contains the object, 0 if it does not), we'd use two types of training images:
- closely cropped images of the chosen object (nothing else in the image; the object fills the whole frame), labeled with 1 (object present)
- images that don’t contain the object (“background”), labeled with 0 (object not present)
Loss Function
For CNNs, the standard choice is cross-entropy loss (the loss function of the softmax layer).
neural networks - What is the loss function used for CNN? - Cross Validated
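A small numeric sketch of softmax + cross-entropy, reusing the three-class (dog, cat, other) example from above:

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.3, 0.2])    # raw network scores for (dog, cat, other)
probs = softmax(scores)               # roughly [0.6, 0.3, 0.1], sums to 1
loss = -np.log(probs[0])              # cross-entropy when the true class is dog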
Problems
Image classification only gives a summary of what is in the image as a whole; it doesn’t work well when the image contains multiple objects of interest.
References:
- Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition, Stanford University School of Engineering - Lecture 11 | Detection and Segmentation (YouTube)
- Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
- Outline of object recognition - Wikipedia
- What Is Image Recognition? - Towards Data Science