Goal
Predict the main subject of an image.
This is perhaps the most basic task in computer vision.
Input
- image
- predefined set of labels/categories
Method
The process usually has two stages:
- feature extraction
- classification
Implementation #1: HOG + SVM
Use Histogram of Oriented Gradients (HOG) for feature extraction. The final HOG feature vector can then be fed into a classifier such as an SVM.
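A minimal sketch of this pipeline, assuming scikit-image and scikit-learn are available; train_images, train_labels, and test_image are hypothetical placeholders for a prepared dataset:

import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def extract_features(images):
    # images: iterable of same-sized grayscale arrays, e.g. 128x64
    return np.array([
        hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for img in images
    ])

# train_images, train_labels, test_image are assumed to be prepared elsewhere
X_train = extract_features(train_images)
classifier = LinearSVC()
classifier.fit(X_train, train_labels)

predicted_label = classifier.predict(extract_features([test_image]))[0]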
Implementation #2: CNN
The input image goes through a deep convolutional network, which produces a feature vector (e.g. 4096-dimensional in the case of AlexNet). A final fully-connected layer then maps that feature vector to class scores - e.g. 1000 numbers, one per class, when the model is trained on the 1000 ImageNet categories.
E.g. if the model is trained on three labels (cat, dog, other), the classifier outputs a confidence for each class - three numbers such as 0.6 (dog), 0.3 (cat), and 0.1 (other); together they form a probability distribution, so they must sum to 1.
So the network takes an input image and outputs a single category label describing the content of the entire image as a whole.
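As an illustration, a minimal inference sketch using a pretrained AlexNet from torchvision (an assumed setup; any classification CNN works the same way); image is assumed to be a PIL image loaded elsewhere:

import torch
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

x = preprocess(image).unsqueeze(0)        # shape: (1, 3, 224, 224)
with torch.no_grad():
    scores = model(x)                     # raw class scores, shape (1, 1000)
    probs = torch.softmax(scores, dim=1)  # probabilities that sum to 1
predicted_class = probs.argmax(dim=1).item()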
Feature Extraction - Historical Overview
The computer sees a W x H grid of pixels, each represented by 3 RGB channel values - that is, W x H x 3 numbers between 0 and 255.
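A quick way to inspect this raw representation, assuming Pillow and NumPy; "cat.jpg" is a hypothetical file name:

import numpy as np
from PIL import Image

img = np.asarray(Image.open("cat.jpg"))  # "cat.jpg" is a hypothetical file
print(img.shape)  # (H, W, 3): height x width x 3 RGB channels
print(img.dtype)  # uint8: each value is an integer between 0 and 255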
The problem: the semantic gap - the gap between this matrix of raw values and the semantic notion of the object (e.g. "cat"), which the algorithm has to bridge.
Challenges:
- viewpoint variation - the same object looks different if observed from different angles
- illumination - the light source might be in front of, to the side of, or behind the object; the scene can be bright or dark
- deformation - the object can take many different shapes and poses
- occlusion - only part of the object might be visible
- background clutter - the object blends into the background
- intraclass variation - instances of the same class come in various shapes, sizes, ages...
The algorithm has to be robust to all of these variations.
function classify(image: Image): Label {
  // hardcoded, handcrafted rules
  // only experts in this particular domain can write these rules ("expert system")
  // the designer of the system defines the features they believe are discriminative
  // must handle all natural variations in look, shape, rotation, scaling, viewpoint, occlusion, illumination...
  // lots of if..else, switch
  // this algorithm can't be reused - it applies only to this particular domain/example
  return label;
}
- we have to think of all possible valid cases and exceptions
- it’s very difficult to list all the features that reliably detect the objects we care about; handwritten rules are not robust to natural variations in the appearance of those objects (e.g. lighting, perspective changes, translation, scaling, rotation)
- such systems are very brittle and fail on unexpected examples because of these variations
- the logic applies to a specific task, at a specific point in time, in a single domain
- very often it is not possible to design rules by hand
Attempts have been made: find edges, corners, and boundaries, then write an explicit set of rules to recognize one particular object.
This approach does not scale.
Humans learn from experience by observing a huge number of examples. The same data-driven approach can be applied to computers:
- collect dataset of images and labels
- use machine learning to train a classifier
- evaluate the classifier on new images
Learning from experience can be broken down into two steps: training a model and inference (prediction, test):
function train(images: Image[], labels: Label[]): Model {
  // Machine Learning:
  // define a model
  // forward and backward propagation...
  // minimize the cost function...
  return model;
}
function predict(model: Model, image: Image): Label {
  // Use the trained model to predict the output label:
  // forward propagation
  return label;
}
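A rough sketch of what these two functions might look like in practice, using PyTorch (an assumption) and a toy CNN; data loading and batching are omitted for brevity:

import torch
import torch.nn as nn

def train(images, labels, num_classes, epochs=10):
    # define a model: a tiny CNN ending in a fully-connected layer
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_classes),
    )
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(epochs):
        scores = model(images)        # forward propagation
        loss = loss_fn(scores, labels)
        optimizer.zero_grad()
        loss.backward()               # backward propagation
        optimizer.step()              # minimize the cost function
    return model

def predict(model, image):
    with torch.no_grad():
        scores = model(image.unsqueeze(0))  # forward propagation only
    return scores.argmax(dim=1).item()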
This data-driven approach became practical thanks to:
- Big Data - access to millions of digital images (mostly created by social media users)
- fast CPUs and GPUs
Training
Training Data
Labeled images. Labels are either strings or numbers that correspond to strings.
If we want to train the system to recognize only one particular class (the output is 1 if the image contains the object, 0 if it does not), we'd use two types of training images:
- closely cropped images of the chosen object (nothing else in the image; the object fills the whole frame), labeled with 1 (object present)
- images that don’t contain the object (“background”), labeled with 0 (object not present)
Loss Function
For CNNs, the standard choice is cross-entropy loss (the loss function of the softmax layer).
neural networks - What is the loss function used for CNN? - Cross Validated
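A small numeric sketch of softmax + cross-entropy, reusing the three-class (dog, cat, other) example from above:

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.3, 0.2])    # raw network scores for (dog, cat, other)
probs = softmax(scores)               # roughly [0.6, 0.3, 0.1], sums to 1
loss = -np.log(probs[0])              # cross-entropy when the true class is dog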
Problems
Image classification only gives a summary of what is in the image as a whole; it doesn’t work well when the image contains multiple objects of interest.
References:
- Fei-Fei Li, Justin Johnson, Serena Yeung: Convolutional Neural Networks for Visual Recognition, Stanford University School of Engineering - Lecture 11 | Detection and Segmentation (YouTube)
- Object Detection for Dummies Part 1: Gradient Vector, HOG, and SS
- Outline of object recognition - Wikipedia
- What Is Image Recognition? - Towards Data Science