Object Detection with Sliding Window

Wednesday, 8 January 2020

Objective

Take an input image and output:

predictions of bounding boxes (each box contains an object)
class scores for objects within bounding boxes

Solution

Turn detection into a pure classification problem. A classifier outputs only class scores for the whole image it is given. So the idea is to take different crops from the input image, one by one, and feed each of them through a previously trained convolutional network, which makes a classification decision for that crop. The classifier is run at evenly spaced locations over the entire image.

In addition to the object labels, we add background as a classification category. The network can then predict background when a crop contains none of the categories we care about.

Sliding Windows.
Original image of animals taken from 1zoom.me

So we have a rectangular "window" that slides across the input image, and the classifier outputs a prediction only for the crop visible through that window. The window can take various sizes and aspect ratios, and it can move in smaller or larger steps (strides). Crops that frame an object well will get higher scores for that object's class than crops that miss it or cut it off.
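The window movement described above can be sketched in a few lines of Python. This is only an illustration: the function name sliding_windows and its parameters are made up for this post, and it handles a single window size (to cover several sizes and aspect ratios you would call it once per window shape).

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, w, h) boxes for one window size sliding over the image.

    Hypothetical helper for illustration. Windows that would stick out
    past the right or bottom edge are skipped.
    """
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


# An 8x6 image, a 4x4 window, moved 2 pixels at a time.
boxes = list(sliding_windows(8, 6, 4, 4, 2))
for box in boxes:
    print(box)
```

A smaller stride gives more (and more overlapping) crops, a larger stride gives fewer crops but a coarser search.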

Network

Image --> [ Sliding Window cropping --> crop --> Classifier --> class scores ]

The process within the square brackets has to be repeated once for every crop we use.
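Here is a minimal sketch of that loop in plain Python. Everything in it is an assumption made for illustration: the image is a small list of rows with brightness values, and toy_classifier is a stand-in for the trained convolutional network (it simply scores a crop by its mean brightness). Only the crop, classify, repeat structure matches the pipeline above.

```python
def crop(image, x, y, w, h):
    """Return the w-by-h region of a list-of-rows image starting at (x, y)."""
    return [row[x:x + w] for row in image[y:y + h]]


def toy_classifier(patch):
    """Stand-in for a trained CNN (assumption): scores by mean brightness."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return {"background": 1.0 - mean, "object": mean}


def detect(image, win, stride, classifier, threshold=0.5):
    """Run the classifier on every square crop and keep non-background hits."""
    img_h, img_w = len(image), len(image[0])
    detections = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            scores = classifier(crop(image, x, y, win, win))
            label = max(scores, key=scores.get)
            # Crops classified as background produce no bounding box.
            if label != "background" and scores[label] >= threshold:
                detections.append(((x, y, win, win), label, scores[label]))
    return detections


# 6x6 dark image with a bright 2x2 blob at columns 2-3, rows 2-3.
image = [[0.0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1.0

detections = detect(image, win=2, stride=2, classifier=toy_classifier)
print(detections)
```

The window position of each kept crop serves as the predicted bounding box, and the classifier output serves as its class score. The background category is what lets most crops be discarded.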

Shortcoming

An image can contain any number of objects, and each object can appear at any location, at any size and at any aspect ratio. A brute-force sliding window approach therefore has to test a very large number of different crops.

When every one of those crops has to be fed through a large convolutional network, the computational cost becomes prohibitive. So in practice people don't use this sort of brute-force sliding window approach for object detection with convolutional networks.
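To get a feel for the numbers, the crops can simply be counted. Along one axis a window of size w moved with stride s over an image of size W fits floor((W - w) / s) + 1 times. The sketch below applies that to both axes; the function name count_crops, the 640x480 image, the four window shapes and the stride of 8 are all arbitrary values chosen for illustration.

```python
def count_crops(image_w, image_h, window_sizes, stride):
    """Count the crops a sliding window search would classify.

    Hypothetical helper: for each (win_w, win_h) the window fits
    (image_w - win_w) // stride + 1 times horizontally, and likewise
    vertically.
    """
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window does not fit in the image at all
        nx = (image_w - win_w) // stride + 1
        ny = (image_h - win_h) // stride + 1
        total += nx * ny
    return total


sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]
n = count_crops(640, 480, sizes, stride=8)
print(n)  # 13524
```

So even with only four window shapes, a single modest image already needs more than thirteen thousand forward passes through the classifier, and a realistic search would use many more sizes and aspect ratios.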

There are two main approaches which try to improve on Sliding Window.

One family of detectors tries to reduce the number of crops by proposing Regions of Interest (region-proposal detectors). They still perform classification sequentially on each RoI.

The other approach passes the image through the CNN only once (single-shot detectors). OverFeat is an example of such a detector.