Sunday 12 January 2020

Semantic Segmentation

Input

  • image (pixels)
  • list of categories

Goal


Each pixel in the image is to be classified, i.e. assigned a category label.
Instances (individual objects) are not differentiated; only pixels matter.





Output

  • Every input pixel is assigned a category
  • Pixels of each category are painted with the same color e.g. grass, cat, tree, sky
  • If two instances of the same category are next to each other, the entire area gets the same label and is painted with the same color

Method


Approach #1: Sliding Window 


Approach #1 is to use a sliding window: a small window is moved across the image, and a CNN classifies each crop; the predicted class is assigned to the central pixel of the crop.

This would be very computationally expensive, as we'd need to classify (push through the CNN) a separate crop for each pixel in the image.

This would also be very inefficient because shared features between overlapping patches are not reused: overlapping patches pass through the same convolutional layers, so a lot of that computation could be shared instead of being repeated for every crop.
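
To make this concrete, here is a minimal sketch of the sliding-window approach in PyTorch (the patch size, the classifier network, and the padding scheme are illustrative assumptions, not from the lecture):

    import torch
    import torch.nn.functional as F

    def sliding_window_segment(image, classifier, patch=15):
        # image: 3 x H x W tensor; classifier: CNN mapping a 3 x patch x patch
        # crop to a 1 x C tensor of class scores. One forward pass per pixel.
        _, H, W = image.shape
        pad = patch // 2
        padded = F.pad(image, (pad, pad, pad, pad))        # full crop for every pixel
        labels = torch.empty(H, W, dtype=torch.long)
        for y in range(H):
            for x in range(W):
                crop = padded[:, y:y + patch, x:x + patch] # crop centered on (y, x)
                scores = classifier(crop.unsqueeze(0))     # 1 x C class scores
                labels[y, x] = scores.argmax()             # label for central pixel
        return labels                                      # H x W label map

The two nested loops make the cost obvious: one full CNN forward pass per pixel.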


Approach #2: CNN, layers keep spatial size


Use a fully convolutional network: the whole network is a giant stack of convolutional layers, with no fully connected layers, where each convolutional layer preserves the spatial size of the input:

input image --> [conv] --> output image
  • input: 3 x H x W
  • convolutions: D x H x W (conv --> conv --> conv --> conv --> ...)
  • scores: C x H x W (C is the number of categories/labels)
  • argmax over C ==> predictions: H x W
The final convolutional layer outputs a C x H x W tensor of class scores.

The output image has to be the same size as the input image, since we want a classification for each pixel; the output has to be pixel-perfect, with sharp, clear borders between segments.

All computations are done in one pass. 

Convolutional layers that keep the same spatial size as the input image are extremely expensive: with a high-resolution input and multi-channel feature maps at every layer, the amount of computation and memory required is huge.
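
A minimal sketch of such a network in PyTorch (the depth D, the number of layers, and the kernel size are illustrative assumptions; 3 x 3 convolutions with stride 1 and padding 1 keep H x W unchanged at every layer):

    import torch.nn as nn

    def make_fcn(num_classes, D=64):
        # every layer preserves the H x W spatial size of the input
        return nn.Sequential(
            nn.Conv2d(3, D, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(D, D, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(D, D, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(D, num_classes, kernel_size=3, padding=1),  # scores: C x H x W
        )

    # scores = make_fcn(num_classes=21)(images)  # N x C x H x W
    # preds = scores.argmax(dim=1)               # N x H x W predicted labels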

Approach #3: CNN, downsampling + upsampling



Design the network as a bunch of convolutional layers, with downsampling and upsampling of the feature map inside the network:

input image --> [conv --> downsampling] --> [conv --> upsampling] --> output image


Downsampling
  • spatial information gets lost
  • e.g. max pooling, strided convolution

Upsampling: max unpooling or strided transpose convolution.
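
A minimal encoder-decoder sketch along these lines, using strided convolutions to downsample and strided transpose convolutions to upsample (the layer count and channel widths are illustrative assumptions):

    import torch.nn as nn

    def make_encoder_decoder(num_classes, D=64):
        return nn.Sequential(
            # downsampling: each stride-2 convolution halves H and W
            nn.Conv2d(3, D, kernel_size=3, stride=2, padding=1), nn.ReLU(),      # H/2 x W/2
            nn.Conv2d(D, 2 * D, kernel_size=3, stride=2, padding=1), nn.ReLU(),  # H/4 x W/4
            # upsampling: each stride-2 transpose convolution doubles H and W
            nn.ConvTranspose2d(2 * D, D, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # H/2 x W/2
            nn.ConvTranspose2d(D, num_classes, kernel_size=4, stride=2, padding=1),       # C x H x W scores
        )

Working at reduced resolution in the middle of the network is what makes this approach much cheaper than Approach #2.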

Training


Put a classification loss at every pixel of the output, take the average over space, and train the network with normal backpropagation.

Data


Creating a training set is an expensive and long manual process: every pixel has to be labelled. There are tools for drawing contours and filling in the regions.

Loss Function


Loss function: cross-entropy loss is computed between each output pixel and the corresponding ground-truth pixel; the per-pixel losses are then summed or averaged over space and over the mini-batch.
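
A minimal sketch of this loss and one training step in PyTorch (the shapes and the dummy tensors are placeholders; nn.CrossEntropyLoss applied to N x C x H x W scores and N x H x W targets averages over all pixels and the mini-batch by default):

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # mean over pixels and mini-batch

    scores = torch.randn(8, 21, 224, 224, requires_grad=True)  # N x C x H x W scores
    targets = torch.randint(0, 21, (8, 224, 224))              # N x H x W labels in [0, C)

    loss = criterion(scores, targets)  # cross-entropy at every pixel, then mean
    loss.backward()                    # normal backpropagation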


Problem


Individual instances of the same category are not differentiated. This is addressed by Instance Segmentation.

References:


Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube
