Sunday 12 January 2020

Semantic Segmentation


  • Input: an image (pixels)
  • A fixed list of categories


Each pixel in the image is to be classified, i.e. assigned a category label.
Instances (objects) are not differentiated; we only care about pixels.


  • Every input pixel is assigned a category
  • Pixels of each category are painted with the same color, e.g. grass, cat, tree, sky
  • If two instances of the same object are next to each other, the entire area gets the same label and is painted with the same color


Approach #1: Sliding Window 

Approach #1 is to use a sliding window: a small window is moved across the image, and a CNN classifier is applied to each crop to determine its class, which is then assigned to the central pixel of the crop.

This would be very computationally expensive, as we'd need to classify a separate crop (push it through the CNN) for every pixel in the image.

It would also be very inefficient because shared features between overlapping patches are not reused. If two patches overlap, their features go through the same convolutional layers, so a lot of computation could be shared instead of being repeated for each patch separately.
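A minimal sketch of the sliding-window idea, assuming a hypothetical `classify_patch` callable standing in for a trained CNN classifier. The inner double loop makes the cost explicit: one full classifier forward pass per pixel.

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch=15):
    """Label every pixel by classifying the patch centered on it.

    `classify_patch` is a stand-in for a trained CNN classifier
    (patch -> class id); it is called once per pixel, which is
    exactly why this approach is so expensive.
    """
    h, w = image.shape[:2]
    pad = patch // 2
    # Pad so that border pixels also get a full-size patch.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    labels = np.empty((h, w), dtype=np.int64)
    for y in range(h):
        for x in range(w):
            crop = padded[y:y + patch, x:x + patch]
            labels[y, x] = classify_patch(crop)  # one forward pass per pixel
    return labels
```

Even on a tiny image this calls the classifier H*W times, and neighboring crops share almost all of their pixels.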

Approach #2: CNN, layers keep spatial size

Use a fully convolutional network: the whole network is a giant stack of convolutional layers with no fully connected layers, and each convolutional layer preserves the spatial size of the input:

input image --> [conv] --> output image
  • input: 3 x H x W
  • convolutions, each D x H x W: conv --> conv --> conv --> conv
  • scores: C x H x W (C is the number of categories/labels)
  • argmax ==> predictions: H x W
The final convolutional layer outputs a C x H x W tensor of class scores.
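The argmax step above can be sketched with NumPy; the score volume here is random noise standing in for the output of a (hypothetical) fully convolutional network, one score map per category.

```python
import numpy as np

# Toy C x H x W score volume: one score map per category.
C, H, W = 4, 6, 8
rng = np.random.default_rng(0)
scores = rng.standard_normal((C, H, W))

# Per-pixel prediction: at every (h, w) location, pick the category
# with the highest score.
predictions = scores.argmax(axis=0)

print(predictions.shape)  # (6, 8) -- one label per pixel
```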

The output has to be the same size as the input image because we want a classification for each pixel; the output must be pixel-perfect, with sharp, clear borders between segments.

All computations are done in one pass. 

Using convolutional layers that keep the same spatial size as the input image is extremely expensive: convolutions at full resolution, with many channels in every layer, require a huge amount of computation and memory for the activations.

Approach #3: CNN, downsampling + upsampling

Design the network as a stack of convolutional layers, with downsampling and then upsampling of the feature map inside the network:

input image --> [conv --> downsampling] --> [conv --> upsampling] --> output image

  • downsampling: e.g. max pooling, strided convolution
  • some spatial information gets lost during downsampling

Upsampling: max unpooling or strided transposed convolution.
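Two of the upsampling operations mentioned above can be sketched in NumPy. Nearest-neighbor upsampling just repeats values; max unpooling places each value back at the position of the original maximum, using the flat indices remembered by the matching max-pooling layer (both functions here are illustrative sketches, not a framework API).

```python
import numpy as np

def nearest_neighbor_upsample(x, factor=2):
    """Simplest upsampling: copy each value into a factor x factor block."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def max_unpool(pooled, flat_indices, out_shape):
    """Max unpooling: put each pooled value back at the flat index of the
    original maximum (recorded by the corresponding max-pooling layer);
    every other position stays zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype)
    out.flat[flat_indices.ravel()] = pooled.ravel()
    return out
```

Max unpooling pairs each upsampling layer with a downsampling layer, which helps recover the sharp spatial detail that pooling threw away.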


Put a classification loss at every pixel of the output, take the average over space, and train with normal backpropagation.


Creating the training set is an expensive and long manual process: each pixel has to be labelled. There are tools for drawing contours and filling in the regions.

Loss Function

Loss function: a cross-entropy loss is computed between each output pixel and its ground-truth label; the per-pixel losses are then summed or averaged over space and over the mini-batch.
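A minimal NumPy sketch of this per-pixel cross-entropy for a single image (no mini-batch dimension), assuming raw class scores of shape C x H x W and integer labels of shape H x W:

```python
import numpy as np

def pixelwise_cross_entropy(scores, targets):
    """Mean cross-entropy over all pixels.

    scores:  C x H x W raw class scores (logits)
    targets: H x W integer ground-truth labels
    """
    # Numerically stable log-softmax over the category axis.
    shifted = scores - scores.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h, w = targets.shape
    # Pick the log-probability of the correct class at every pixel.
    picked = log_probs[targets,
                       np.arange(h)[:, None],
                       np.arange(w)[None, :]]
    return -picked.mean()
```

With uniform scores over C classes, every pixel contributes log(C), so the loss starts near log(C) for an untrained network.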


Individual instances of the same category are not differentiated. This is addressed by instance segmentation.


Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube
