Input
- image (pixels)
- list of categories
Goal
Classify each pixel in the image, i.e. assign it a category label.
Instances (individual objects) are not differentiated; we only care about pixels.
Output
- Every input pixel is assigned a category
- Pixels of each category are painted with the same color e.g. grass, cat, tree, sky
- If two instances of the same category are next to each other, the entire area gets the same label and is painted with the same color
Method
Approach #1: Sliding Window
Approach #1 is a sliding window: move a small window across the image and apply CNN classification to each crop; the predicted class is then assigned to the crop's central pixel.
This is very computationally expensive, as a separate crop has to be classified (pushed through the CNN) for every pixel in the image.
It is also very inefficient because shared features between overlapping patches are not reused: overlapping patches pass through the same convolutional layers, so much of the computation could be shared instead of repeated for each patch.
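The sliding-window approach can be sketched as follows. This is a minimal PyTorch sketch, assuming a tiny stand-in `classifier`, a crop size `K`, and a category count `C`, all of which are hypothetical values not given in the lecture:

```python
import torch
import torch.nn as nn

K, C = 15, 4  # crop size and number of categories (assumed values)

# Hypothetical tiny classifier standing in for a real CNN.
classifier = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, C),
)

def sliding_window_segment(img):
    """Classify the K x K crop centred on each pixel: one forward pass per pixel."""
    _, H, W = img.shape
    pad = K // 2
    padded = nn.functional.pad(img, (pad, pad, pad, pad))
    labels = torch.empty(H, W, dtype=torch.long)
    with torch.no_grad():
        for i in range(H):
            for j in range(W):
                crop = padded[:, i:i + K, j:j + K].unsqueeze(0)  # 1 x 3 x K x K
                # The crop's predicted class becomes the centre pixel's label.
                labels[i, j] = classifier(crop).argmax(dim=1).item()
    return labels

img = torch.rand(3, 16, 16)
labels = sliding_window_segment(img)  # 16 * 16 = 256 separate forward passes
```

Even for this toy 16x16 image the loop runs 256 forward passes, which illustrates why the approach does not scale.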
Approach #2: CNN, layers keep spatial size
input image --> [conv] --> output image
All computations are done in one pass.
- input 3 x H x W
- convolutions D x H x W: conv --> conv --> conv --> conv
- scores: C x H x W (C is the number of categories/labels)
- argmax ==> Predictions H x W
Final convolutional layer outputs tensor C x H x W.
The output must have the same spatial size as the input, since we want a classification for every pixel; the output should be pixel-perfect, with sharp, clear borders between segments.
Convolutional layers that keep the full spatial size of the input are extremely expensive: every layer processes a high-resolution, multi-channel feature map, which requires a huge amount of memory and computation.
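A minimal PyTorch sketch of this approach, assuming a small channel width `D` and category count `C` (both hypothetical); every convolution is padded so the spatial size never changes:

```python
import torch
import torch.nn as nn

C = 4   # number of categories (assumed)
D = 64  # channel width of the intermediate feature maps (assumed)

# padding=1 with a 3x3 kernel keeps every feature map at the full H x W size.
net = nn.Sequential(
    nn.Conv2d(3, D, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(D, D, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(D, D, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(D, C, kernel_size=3, padding=1),  # scores: C x H x W
)

x = torch.rand(1, 3, 32, 32)   # one 3 x H x W input image
scores = net(x)                # 1 x C x H x W class scores
preds = scores.argmax(dim=1)   # 1 x H x W label map (argmax over categories)
```

Every intermediate activation here is a full-resolution D x H x W tensor, which is exactly the memory cost the lecture warns about.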
Approach #3: CNN, downsampling + upsampling
Design the network as a stack of convolutional layers, with downsampling and upsampling of the feature map inside the network:
input image --> [conv --> downsampling] --> [conv --> upsampling] --> output image
Downsampling:
- spatial information gets lost
- e.g. max pooling, strided convolution
Upsampling: max unpooling or strided transpose convolution.
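The downsample/upsample design can be sketched in PyTorch with strided convolutions for downsampling and strided transpose convolutions for upsampling; the layer sizes below are illustrative assumptions, not the lecture's architecture:

```python
import torch
import torch.nn as nn

C = 4  # number of categories (assumed)

net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    # Strided convolutions downsample: H x W -> H/2 x W/2 -> H/4 x W/4.
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # Strided transpose convolutions upsample back: H/4 -> H/2 -> H.
    nn.ConvTranspose2d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, C, kernel_size=1),  # per-pixel scores: C x H x W
)

x = torch.rand(1, 3, 32, 32)
scores = net(x)  # back at full resolution: 1 x C x 32 x 32
```

The cheap low-resolution middle layers carry most of the computation, which is what makes this design tractable compared to Approach #2.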
Training
Put a classification loss on every output pixel, average it over space, and train the network with normal backpropagation.
Data
Creating a training set is an expensive and lengthy manual process: each pixel has to be labelled. There are tools for drawing contours and filling in the regions.
Loss Function
Cross-entropy loss is computed between each output pixel and the corresponding ground-truth pixel; the sum or average is then taken over space and the mini-batch.
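As a sketch, PyTorch's `cross_entropy` applied to an N x C x H x W score tensor computes exactly this per-pixel loss and, by default, averages it over space and the mini-batch (the shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

C, H, W = 4, 8, 8
scores = torch.randn(2, C, H, W, requires_grad=True)  # network output, batch of 2
target = torch.randint(0, C, (2, H, W))               # ground-truth label per pixel

# cross_entropy on an N x C x H x W input treats the trailing H x W dims as
# spatial positions: a per-pixel cross-entropy averaged over pixels and batch.
loss = F.cross_entropy(scores, target)
loss.backward()  # gradients flow back to every output pixel
```

Passing `reduction='sum'` instead would give the summed variant mentioned above.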
Problem
Individual instances of the same category are not differentiated. This is addressed by Instance Segmentation.
References:
Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube