Friday, 31 January 2020

How to disable IPv6 on Ubuntu

Some web servers don't support IPv6 connections and might refuse them with an HTTP 403 (Forbidden) error.

In that case we need to disable IPv6 on the machine's network interfaces. This is how to do it.

To first check whether IPv6 traffic from your machine is enabled, go to an IP checker website and see whether it shows your IPv6 address.

You can also run

$ ifconfig | grep inet6

...and check if (local) IPv6 addresses are assigned to active interfaces.

To disable IPv6 do the following:

1) Open /etc/sysctl.conf

$ sudo nano /etc/sysctl.conf

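2) Append the following lines to the existing configuration and save the file (tun0 is typically a VPN tunnel interface; keep that line only if your machine has such an interface):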

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv6.conf.tun0.disable_ipv6 = 1

3) Instruct the OS to re-read this config file:

$ sudo sysctl -p

To validate the changes, repeat the IPv6 checks described above.
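As a complement to the manual checks, here is a minimal Python sketch that parses sysctl-style configuration text and reports the interfaces for which IPv6 is disabled. The function name `ipv6_disabled_interfaces` and the sample text are made up for illustration, and the parser only understands the simple `key = value` lines shown in step 2.

```python
def ipv6_disabled_interfaces(config_text):
    """Return the set of interface names whose disable_ipv6 flag is set to 1."""
    disabled = set()
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            continue
        parts = key.strip().split(".")
        # Expected key shape: net.ipv6.conf.<interface>.disable_ipv6
        if (len(parts) == 5 and parts[:3] == ["net", "ipv6", "conf"]
                and parts[4] == "disable_ipv6" and value.strip() == "1"):
            disabled.add(parts[3])
    return disabled


# Sample text mirroring the lines appended in step 2 (illustrative only).
sample = """
# IPv6 settings
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
"""
print(sorted(ipv6_disabled_interfaces(sample)))  # ['all', 'default', 'lo']
```

To inspect the real file, the same function could be given the contents of /etc/sysctl.conf. On Linux the live kernel value can also be read from /proc/sys/net/ipv6/conf/all/disable_ipv6.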

Sunday, 19 January 2020

Building the Machine Learning Model

I came across this nice diagram which explains how to build a machine learning model, so I am sharing it with you. All credits go to its author, Chanin Nantasenamat.

Monday, 13 January 2020

Viber on PC not syncing? Here is the solution.

I've noticed that Viber on my Ubuntu PC stopped syncing messages with the Viber app on my mobile phone. I didn't find a solution on the Viber Help pages so I had to find the fix myself. It's actually very simple: you just have to delete one file and restart the application. No Viber reinstall is needed!

Before doing anything else, exit the Viber application on the PC.

Let's find all Viber files and directories:

$ sudo find / -name "*viber*"

NOTE: 440123456789 is the number of the mobile device to which you've been syncing messages so far.

/home/bojan/.ViberPC/440123456789/viber.db is the file which contains the entire message history. Let's delete it:

$ rm ~/.ViberPC/440123456789/viber.db 

Open the Viber app on the mobile phone and relaunch Viber on the PC (from Applications or from Terminal, like here):

$ /opt/viber/Viber 
Attribute Qt::AA_ShareOpenGLContexts must be set before QCoreApplication is created.
qml: *** popupMode = 1920
qrc:/QML/DebugMenu.qml:262: TypeError: Cannot call method 'isWasabiEnabled' of undefined
qrc:/QML/DebugMenu.qml:289: TypeError: Cannot call method 'isSearchInCommunitiesForceEnabled' of undefined
qrc:/QML/DebugMenu.qml:296: TypeError: Cannot call method 'isOOABURISpamCheckerForceEnabled' of undefined
qrc:/QML/DebugMenu.qml:304: TypeError: Cannot call method 'isRateCallQualityForceEnabled' of undefined

We'll see prompts telling us to approve syncing on both PC and mobile applications:

Viber sync approval prompt on PC

Viber sync approval prompt on mobile phone
Viber sync start prompt on PC

After we approve syncing on both devices, syncing process will start:

Viber syncing message on mobile phone

Viber syncing message on PC
After the process completes, Viber on the PC will be synced with the Viber app on the mobile phone.

If removing viber.db does not help, also delete data.db:

$ rm ~/.ViberPC/440123456789/data.db

...and repeat the whole process.

Sunday, 12 January 2020

Instance Segmentation


  • image
  • predefined set of categories


Predict the locations and identities of objects in the image, similar to object detection. But rather than predicting just a bounding box for each object, we want to predict a whole segmentation mask for each object, i.e. which pixels in the input image correspond to each object instance.

Instance Segmentation is the full problem, a hybrid between semantic segmentation and object detection: as in object detection, we can handle multiple objects and we differentiate the identities of different instances.


(Differentiate instances)

In the example above Instance Segmentation distinguishes between the three sheep instances.

The output is like in semantic segmentation, with pixel-wise accuracy, but here we also want to say, for each object, which pixels belong to it.


The idea is to get region and classification predictions (for each object) and then apply semantic segmentation onto each of these regions.

Mask R-CNN

The architecture ends up looking a lot like Faster R-CNN.

It uses a multi-stage approach: the whole input image goes through a convolutional network and a learned region proposal network (RPN), exactly as in Faster R-CNN. Once we have the learned region proposals, we project them onto the convolutional feature map, just as in Fast and Faster R-CNN.

But now, rather than only making a classification and a bounding box regression decision for each of those boxes, we additionally want to predict a segmentation mask for each region proposal. So it looks like a semantic segmentation problem inside each of the region proposals that we get from the region proposal network.

Mask R-CNN Architecture
Kaiming He Georgia Gkioxari Piotr Dollar Ross Girshick: Mask R-CNN

After RoI Align warps the features corresponding to each region proposal into the right shape, we have two different branches.

The first branch, at the top, looks just like Faster R-CNN: it predicts classification scores telling us which category corresponds to that region proposal or, alternatively, whether it is background. It also predicts bounding box coordinates regressed from the region proposal coordinates.

Mask R-CNN Architecture in detail

Image source: Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation

In addition, the branch at the bottom looks basically like a semantic segmentation mini-network which classifies, for each pixel in the input region proposal, whether or not it belongs to an object. The Mask R-CNN architecture thus unifies the Faster R-CNN and Semantic Segmentation models into one jointly end-to-end trainable model.

It works really well: just look at the examples in the paper, which look almost indistinguishable from ground truth.

Pose Estimation

Mask R-CNN also does pose estimation, which can be done by predicting the coordinates of each joint of a person.

Mask R-CNN can therefore jointly do object detection, pose estimation, and instance segmentation. The only addition we need to make is that for each region proposal we add a small extra branch that predicts the joint coordinates for the instance in the current region proposal.

Addition for pose estimation

Image source: Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation

As another layer has been added (another head coming out of the network) we need to add another loss to our multi-task loss.

Because it is built on the Faster R-CNN framework and everything is done in a single forward pass of the network, it runs relatively close to real time: something like 5 fps on a GPU.


How much training data do you need?

All of these instance segmentation results were trained on the Microsoft COCO dataset. Microsoft COCO has roughly 200,000 training images and 80 categories, and in each training image all instances of those 80 categories are labeled, with an average of something like five or six instances per image. For all the people in Microsoft COCO the joints are annotated as well. So the model is trained with quite a lot of data and has quite a lot of supervision at training time.

Training: Future improvements

One really interesting topic to study going forward: we are by now relatively confident that, given a lot of data for some problem, we can stitch together a convolutional network that does a reasonable job on it. Figuring out how to get performance like this with less training data is a very interesting and active area of research, and something people will spend a lot of effort on in the next few years.


Semantic Segmentation


  • image (pixels)
  • list of categories


Each pixel in the image is classified (assigned a category label).
Instances (objects) are not differentiated; we only care about pixels.


  • Every input pixel is assigned a category
  • Pixels of each category are painted with the same color e.g. grass, cat, tree, sky
  • If two instances of the same category are next to each other, the entire area will have the same label and will be painted with the same color


Approach #1: Sliding Window 

Approach #1 is to use a sliding window: we move a small window across the image and apply DNN classification to determine the class of each crop, which is then assigned to the central pixel of the crop.

This would be computationally very expensive, as we would need to classify (push through a CNN) a separate crop for each pixel in the image.

It would also be very inefficient because it does not reuse shared features between overlapping patches. If two patches overlap, their pixels go through the same convolutional layers, so a lot of computation could be shared instead of being repeated for each patch.

Approach #2: CNN, layers keep spatial size

Use a fully convolutional network: the whole network is a giant stack of convolutional layers with no fully connected layers, where each convolutional layer preserves the spatial size of the input:

input image --> [conv] --> output image
  • input 3 x H x W
  • convolutions D x H x W: conv --> conv --> conv --> conv--> 
  • scores: C x H x W (C is the number of categories/labels)
  • argmax ==> Predictions H x W
Final convolutional layer outputs tensor C x H x W.
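
The last step above (scores C x H x W, then argmax to get predictions H x W) can be illustrated with a small pure-Python sketch. The nested-list layout `scores[c][y][x]` and the toy numbers are assumptions made for this example only.

```python
def argmax_predictions(scores):
    """Map scores[c][y][x] to predictions[y][x], the index of the best class."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Toy scores for C = 3 classes on a 2 x 2 image (made-up numbers).
scores = [
    [[0.1, 0.7], [0.2, 0.1]],  # class 0
    [[0.8, 0.2], [0.3, 0.1]],  # class 1
    [[0.1, 0.1], [0.5, 0.8]],  # class 2
]
predictions = argmax_predictions(scores)
print(predictions)  # [[1, 0], [2, 2]]
```

Each entry of `predictions` is the category label assigned to one pixel, so the result has the same H x W size as the input image.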

The output image has to be the same size as the input image because we want a classification for each pixel. The output has to be pixel-perfect, with sharp and clear borders between segments.

All computations are done in one pass. 

Using convolutional layers which all keep the spatial size of the input image is very expensive: with a high-resolution input image and many channels in each layer, the feature maps take a lot of memory and computation.

Approach #3: CNN, downsampling + upsampling

Design the network as a stack of convolutional layers, with downsampling and upsampling of the feature map inside the network:

input image --> [conv --> downsampling] --> [conv --> upsampling] --> output image

Downsampling: e.g. max pooling or strided convolution.

  • spatial information gets lost

Upsampling: max unpooling or strided transpose convolution.


Put a classification loss at every pixel of the output, take the average over space, and train the network with normal backpropagation.


Creating a training set is an expensive and long manual process: each pixel has to be labelled. There are some tools for drawing contours and filling in the regions.

Loss Function

Loss function: cross-entropy loss is computed between each pixel in the output and the corresponding ground truth pixel; the sum or average is then taken over space and over the mini-batch.
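
A minimal sketch of this per-pixel loss for a single image, in pure Python. It assumes raw class scores laid out as `scores[c][y][x]` and integer ground truth labels `labels[y][x]`; both layouts and the function names are choices made for this illustration, and the average is taken over space only.

```python
import math


def softmax(values):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_cross_entropy(scores, labels):
    """Average cross-entropy over all pixels of one image."""
    num_classes = len(scores)
    height, width = len(labels), len(labels[0])
    total = 0.0
    for y in range(height):
        for x in range(width):
            probs = softmax([scores[c][y][x] for c in range(num_classes)])
            # Loss for one pixel: negative log-probability of the true class.
            total += -math.log(probs[labels[y][x]])
    return total / (height * width)


# Two classes, 1 x 2 image. Equal scores mean a 50/50 guess at every pixel.
uniform_scores = [[[0.0, 0.0]], [[0.0, 0.0]]]
labels = [[0, 1]]
print(pixelwise_cross_entropy(uniform_scores, labels))  # ln(2), about 0.693
```

When the scores favour the correct class at every pixel, the loss drops towards zero.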


Individual instances of the same category are not differentiated. This is improved with Instance Segmentation.


Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube

Saturday, 11 January 2020

Object Detection with SSD

SSD (Single Shot Multibox Detector) is a method for object detection (object localization and classification) which uses a single Deep Neural Network (DNN). Single Shot means that object detection is performed in a single forward pass of the DNN.

This method was proposed by Wei Liu et al. in December 2015 and last revised in December 2016: SSD: Single Shot MultiBox Detector.


Fast Object Detection.


The SSD network, built on the VGG-16 network, performs the task of object detection and localization in a single forward pass of the network. This approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple features with different resolutions to naturally handle objects of various sizes. [source]

Here are some key points from the paper's abstract:
  • SSD uses a single deep neural network
  • SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location
BK: note here that the different aspect ratios and scales are applied to anchor boxes on the feature map, not in the image
  • At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. 
  • Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. 
  • Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For 300×300 input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for 500×500 input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model. 

SSD Framework

Image source: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector

We know that deeper Conv layers in CNNs extract/learn more complex features.
Feature maps preserve spatial structure of the input image but at lower resolution.

Lecture 11: Detection and Localization

If we take some CNN (like ResNet) pretrained for image recognition (image classification) and remove its last FC layers we'll get as its output a feature map as described above.

Now we can do on the feature map something similar to what YOLO does on the image: divide the feature map into grid cells and apply equidistant detectors which predict anchor boxes.

Given our input image (3 * H * W), imagine dividing it into some coarse S * S grid. Within each of those grid cells, imagine some set of B base bounding boxes (e.g. B = 3 base bounding boxes: a tall one, a wide one, and a square one, though in practice you would use more than three). These bounding boxes are centered at each grid cell.

Now for each of these grid cells (S x S) the network has to predict two things:

  • for each of these base bounding boxes (B): an offset off the base bounding box to predict what is the true location of the object off this base bounding box. 
    • This prediction has two components:
      • bounding box coordinates: dx, dy, dh, dw
      • confidence
    • So the final output has B * 5 values
  • classification scores for each of C classes (including background as a class)

At the end we end up predicting from our input image this giant tensor:
S * S * (B * 5 + C)

That is, we have B base bounding boxes with five numbers each (the offsets and the confidence for that base bounding box), plus C classification scores for our C categories.
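
To make the layout of one grid cell's B * 5 + C numbers concrete, here is a small sketch that splits such a flat vector into boxes and class scores. The ordering inside the vector (all boxes first, then the class scores) and the function name `split_cell_prediction` are assumptions for illustration; real implementations may order the values differently.

```python
def split_cell_prediction(cell_vector, num_boxes, num_classes):
    """Split a flat vector of B * 5 + C numbers into B boxes and C class scores."""
    expected = num_boxes * 5 + num_classes
    if len(cell_vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(cell_vector)}")
    boxes = []
    for b in range(num_boxes):
        # Five numbers per base box: four offsets and one confidence.
        dx, dy, dh, dw, confidence = cell_vector[b * 5:(b + 1) * 5]
        boxes.append({"dx": dx, "dy": dy, "dh": dh, "dw": dw,
                      "confidence": confidence})
    class_scores = list(cell_vector[num_boxes * 5:])
    return boxes, class_scores


# B = 2 base boxes and C = 3 classes give 2 * 5 + 3 = 13 numbers per cell.
boxes, class_scores = split_cell_prediction(list(range(13)), 2, 3)
print(boxes[1]["confidence"])  # 9
print(class_scores)            # [10, 11, 12]
```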

So we see object detection as taking an image as input and producing this three-dimensional tensor as output, and you can imagine training this whole thing as one giant convolutional network.

That is what these single-shot methods do. Matching the ground truth objects to these potential base boxes becomes a little bit hairy, but that is the approach.


SSD has two components:

  • base (backbone) model
  • SSD head

Backbone model:

  • usually a pre-trained image classification network as a feature extractor from which the final fully connected classification layer has been removed; such NN is able to extract semantic meaning from the input image while preserving the spatial structure of the image albeit at a lower resolution
  • VGG-16 or ResNet trained on ImageNet 

SSD head:

  • one or more convolutional layers added to the backbone
  • outputs are interpreted as the bounding boxes and classes of objects in the spatial location of the final layer's activations

SSD vs YOLO Network Architecture
Image source: Wei Liu et al.: "SSD: Single Shot MultiBox Detector"


The TensorFlow Object Detection API comes with pretrained models; ssd_inception_v2_coco_2017_11_17 is one of them.

TensorRT/samples/opensource/sampleUffSSD at master · NVIDIA/TensorRT · GitHub
TensorFlow implementation of SSD, which actually differs from the original paper, in that it has an inception_v2 backbone. For more information about the actual model, download ssd_inception_v2_coco. The TensorFlow SSD network was trained on the InceptionV2 architecture using the MSCOCO dataset which has 91 classes (including the background class). The config details of the network can be found here.
Logo detection in Images using SSD - Towards Data Science
TensorFlow Object Detection API with Single Shot MultiBox Detector (SSD) - YouTube



Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector


RattyDAVE/pi-object-detection: Raspberry Pi Object detection.

SSD : Single Shot Detector for object detection using MultiBox

13.7. Single Shot Multibox Detection (SSD) — Dive into Deep Learning 0.7.1 documentation

Understanding SSD MultiBox — Real-Time Object Detection In Deep Learning

How single-shot detector (SSD) works? | ArcGIS for Developers

Is SSD really better than YOLO? - Quora

Review: SSD — Single Shot Detector (Object Detection)

SSD object detection: Single Shot MultiBox Detector for real-time processing

What do we learn from single shot object detectors (SSD, YOLOv3), FPN & Focal loss (RetinaNet)?

Object Detection with YOLO

YOLO (You Only Look Once), together with SSD (Single Shot MultiBox Detector), OverFeat and some other methods, belongs to a family of object detection algorithms known as "single-shot" object detectors, as the entire image is taken ("looked at") and passed forward through the network only once.

The YOLO paper ([1506.02640] You Only Look Once: Unified, Real-Time Object Detection) was submitted in June 2015 and last revised in May 2016.


Fast Object Detection.

Evolution of the Idea

Let's look at two families that tried to improve on the basic Sliding Window detector and see their strengths and weaknesses.

Most advanced Region-proposal detectors (Faster R-CNN):

  • RoIs are learned => good bounding box accuracy
  • RoIs are further processed separately => low speed

Fully-convolutional approach (OverFeat):

  • RoIs are not learned => low bounding box accuracy
  • RoIs are processed in one go => high speed

To improve further, let’s take the best of both:

  • Learning RoIs
  • Processing them in one go

  • No regions, the image is taken as a whole.
  • Image passed once (in a “single shot”) through Fully Convolutional NN (FCNN).
  • FCNN simultaneously predicts:
    • all bounding boxes (regression)
    • class probabilities (classification) for each BBox
  • Training only a single neural network is required
  • Faster inference ⇒ NN can be used for object detection in real-time videos
  • Network architecture is lighter ⇒ NN can be deployed on embedded/mobile devices

Implementations: YOLO, SSD, DetectNet

Region-based detectors are all doing two things: proposing potential bounding boxes (RoIs) and then performing classification on them.
After classification, post-processing is used to refine the bounding box, eliminate duplicate detections, and rescore the box based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.[1]

The idea in YOLO/SSD is to treat detection purely as a regression problem. Rather than processing each potential region independently, we make all the predictions at once with a single convolutional network.

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

YOLO Detection System
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"

Use an equidistant grid of predictors which predict a set of bounding boxes and apply classification within them.

Create grid of fixed, equidistant detectors: divide image into equal static S*S grid cells 

For each cell neural network predicts: 

  • B anchor boxes (which can extend beyond their cells; B is the number of ground truth boxes used in labeling; they enable each cell to predict more than one object) and for each of them:
    • location and size ((x, y) - box center, w - width, h - height). All these variables are scaled to the [0, 1] range. In the original paper, x and y are relative to the bounds of the grid cell, while w and h are relative to the image's total width and height. When coordinates are instead expressed relative to the whole image, its upper left corner is (0, 0) and its lower right corner is (1, 1).
    • confidence score (a number between 0 and 1) made of:
      • probability that the box contains some object (objectness): P(object)
      • IOU (Intersection over Union) - how accurately the predicted box matches the ground truth
  • (conditional) probability for each class c that this cell contains an object of that class (if it contains any object at all): P(c|object)
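
Intersection over Union can be computed directly from two boxes. The sketch below assumes boxes given as (x, y, w, h) with (x, y) the box center, as in the list above, and with both boxes expressed in the same coordinate system; the function name `iou` is just a label for this example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    # Convert the center format to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap is zero when the boxes do not intersect along an axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


# Two 2 x 2 boxes shifted by 1 horizontally: intersection 2, union 6.
print(iou((1, 1, 2, 2), (2, 1, 2, 2)))    # 0.333...
print(iou((1, 1, 2, 2), (10, 10, 2, 2)))  # 0.0 (no overlap)
```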

The neural network output (prediction) is a tensor with dimensions S x S x (B * 5 + C).

YOLO turns object detection into regression problem (in contrast to classification used elsewhere).

Confidence score that given box contains a certain object (class) is:
P(c|object)*P(object)*IOU = P(c)* IOU
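
As a worked example of this product, the sketch below multiplies a cell's conditional class probabilities by one box's confidence (which already stands for P(object) * IOU). The function name and the numbers are made up for illustration.

```python
def class_confidences(conditional_class_probs, box_confidence):
    """Class-specific scores: P(c|object) * (P(object) * IOU) for each class c."""
    return [p * box_confidence for p in conditional_class_probs]


# Made-up values: the cell thinks "class 0" with 0.5 and "class 1" with 0.25,
# and the box confidence P(object) * IOU is 0.5.
print(class_confidences([0.5, 0.25], 0.5))  # [0.25, 0.125]
```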

Score thresholding (the step before non-max suppression): for each class, keep only boxes with confidence above the threshold

Object is detected by one cell only - the one which contains its centre point. But one cell can predict multiple objects.

Cell predicts B anchor boxes to address cases where multiple objects have centrepoints in one cell.

k-means clustering is used on the training set to determine anchor box sizes and their number (B); this was introduced in YOLOv2.

Non-maximum Suppression: each object should be detected by a single box. The box with the highest confidence score is kept, boxes whose Intersection over Union (IoU) with it exceeds a threshold are discarded, and the process is repeated iteratively on the remaining boxes.
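
A minimal sketch of greedy non-maximum suppression for the boxes of one class, including the score threshold applied first. Boxes are assumed to be (x, y, w, h) tuples with (x, y) the box center; the default thresholds of 0.5 and the function names are arbitrary choices for this example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, width, height)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    # Step 1: drop low-confidence boxes.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    # Step 2: repeatedly keep the best box and discard boxes overlapping it.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [i for i in candidates
                      if iou(boxes[best], boxes[i]) <= iou_threshold]
    return kept


boxes = [(5, 5, 10, 10), (6, 6, 10, 10), (25, 25, 10, 10), (5, 5, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.3]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

Box 3 is removed by the score threshold, and box 1 is suppressed because its IoU with box 0 is 81/119, which is above 0.5.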

YOLO Model
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"


7 * 7 grid, 2 anchor boxes (e.g. one horizontal and one vertical) per cell, 20 classes:
(7 * 7) * (2 * 5 + 20) = 1470 outputs (which is not much for a typical neural network)

In the original YOLO paper the output tensor has dimensions S x S x (B * 5 + C) but in YOLOv2 it was changed to S x S x (B * (5 + C)). This is also the output that Andrew Ng uses in his YOLO lecture.
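
The two output layouts can be compared with a few lines of arithmetic. The function names below are made up for this illustration; the formulas are the two given above.

```python
def yolo_v1_output_size(grid_size, num_boxes, num_classes):
    """S x S x (B * 5 + C): class scores are shared by all boxes of a cell."""
    return grid_size * grid_size * (num_boxes * 5 + num_classes)


def yolo_v2_output_size(grid_size, num_boxes, num_classes):
    """S x S x (B * (5 + C)): every box carries its own class scores."""
    return grid_size * grid_size * num_boxes * (5 + num_classes)


# Same setting as the example above: S = 7, B = 2, C = 20.
print(yolo_v1_output_size(7, 2, 20))  # 1470
print(yolo_v2_output_size(7, 2, 20))  # 2450
```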

Ground truth bounding box has the highest IoU with smaller horizontal anchor.
This means that when labeling this image, we'll set confidence score to 1 only for that anchor box while for others it will be 0.

That is, we have B base bounding boxes with five numbers each (the offsets and the confidence for that base bounding box), plus C classification scores for our C categories.

So we see object detection as taking an image as input and producing this three-dimensional tensor as output, and you can imagine training this whole thing as one giant convolutional network.

That is what these single-shot methods do. Matching the ground truth objects to these potential base boxes becomes a little bit hairy, but that is the approach.

In practice, anchor boxes are predetermined: k-means clustering is run on the training set to find the most common bounding box shapes, which are grouped into B clusters.
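
A minimal sketch of such a clustering on (width, height) pairs. It assigns each box to the centroid with the highest IoU (equivalent to the smallest 1 - IoU distance used in YOLOv2). Updating each centroid to the mean of its members, starting from fixed initial centroids, and the toy box sizes are simplifying assumptions for this example.

```python
def wh_iou(box, centroid):
    """IoU of two (width, height) pairs, as if both boxes shared one center."""
    intersection = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - intersection
    return intersection / union


def kmeans_anchors(boxes, initial_centroids, iterations=10):
    """Cluster (width, height) pairs; the final centroids are the anchor sizes."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for box in boxes:
            # Assign the box to the centroid it overlaps most.
            best = max(range(len(centroids)),
                       key=lambda k: wh_iou(box, centroids[k]))
            clusters[best].append(box)
        new_centroids = []
        for k, members in enumerate(clusters):
            if members:
                new_centroids.append((
                    sum(w for w, _ in members) / len(members),
                    sum(h for _, h in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster as is
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids


# Toy ground truth sizes in pixels: two tall boxes and two wide boxes.
gt_sizes = [(10, 40), (20, 40), (40, 10), (40, 20)]
anchors = kmeans_anchors(gt_sizes, initial_centroids=[(10, 40), (40, 10)])
print(anchors)  # [(15.0, 40.0), (40.0, 15.0)]
```

With B = 2 this yields one tall and one wide anchor, matching the "one vertical, one horizontal" intuition.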

BK: Naturally, objects come in shapes and positions that can be enclosed by horizontal or vertical rectangular bounding boxes of various sizes and aspect ratios. We use such "average" anchor boxes when deciding to which of them to assign a ground truth box. They are a kind of "reference" against which we label training sets. YOLO also predicts such bounding boxes, and that is why it is not super accurate.

For each anchor box, the system then predicts not its dimensions but its offset to the ground truth bounding box. The anchor box with the smallest offset is the one with the highest IoU, and it is promoted to the predicted bounding box.


Darknet: CNN for the real-time object detection with high accuracy.

Single feedforward ConvNet that predicts BBs and class probabilities in a single evaluation

Inspired by GoogLeNet, which has 22 layers; the YOLO network has 24 convolutional layers followed by 2 fully connected layers (which output the class probabilities and regress the bounding box center coordinates as well as the height and width, which can range over the whole image).

YOLO Architecture
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"


Fully supervised.

YOLO trains on full images.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall. [1]

Labeling training set:
  • k-Means clustering used on training sets to determine anchor box sizes & their number

BK: These B anchor boxes are used for two things:

1) To help place the ground truth bounding box coordinates at the right position (for a specific anchor box) in the label vector. This specific anchor box is the one whose IoU with the ground truth box is the highest.

2) To determine at the output which predicted coordinates (or predicted offsets) belong to which anchor box.

The YOLO network never uses fixed anchor box sizes directly. It only predicts a box for each anchor slot, and the anchor slots at the network output match those in the label: the (x, y, w, h) at a specific location in the output vector is a prediction learned from all (x, y, w, h) quadruplets observed at the same location in the training labels, just like any other regression (see the image below).

  • Each object in the image is assigned to the cell that contains the object’s midpoint and to the anchor box (belonging to that cell) with the highest IoU ⇒ the cell where the centre of the object falls is responsible for detecting it ⇒ P(class_i) = 1
  • for all cells which do not contain the midpoint of any object, P(object) == 0 and the label is [0, ?, ?, ?, ?] (? means don't care). So, for example, if we have two objects in the image, only two grid cells will have [1, x, y, w, h]; all others will have [0, ?, ?, ?, ?]
  • if anchor box size matches the ground truth box ⇔ P(object)=1 ∧ IOU=1
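
The cell assignment described in the list above can be sketched as follows. For brevity this example assumes a single box slot per cell (B = 1), omits the class part of the label, uses coordinates relative to the whole image in [0, 1], and represents "don't care" values with None; the function names are made up.

```python
def responsible_cell(x_center, y_center, grid_size):
    """Return (row, col) of the grid cell that contains the box midpoint."""
    # min() keeps a midpoint lying exactly on the right/bottom edge in the grid.
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col


def build_labels(objects, grid_size):
    """objects: list of (x, y, w, h). Returns a grid of [P(object), x, y, w, h]."""
    labels = [[[0, None, None, None, None] for _ in range(grid_size)]
              for _ in range(grid_size)]
    for x, y, w, h in objects:
        row, col = responsible_cell(x, y, grid_size)
        labels[row][col] = [1, x, y, w, h]
    return labels


# Two objects on a 7 x 7 grid: only two cells get P(object) = 1.
objects = [(0.5, 0.5, 0.2, 0.3), (0.95, 0.1, 0.1, 0.1)]
labels = build_labels(objects, grid_size=7)
print(responsible_cell(0.5, 0.5, 7))   # (3, 3)
print(responsible_cell(0.95, 0.1, 7))  # (0, 6)
```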

Training set labeling: label vector matches network output

Example of 2 potential anchor boxes.

The role of anchor boxes is to hold the information about the location of potential bounding boxes.

pretrained on ImageNet

extensive data augmentation


Runs once on entire image. Very fast.

To enable using video files and streams from network or web cameras: compile it with OpenCV
On Ubuntu install libhighgui-dev and libopencv-dev (sudo apt-get install).

    • It makes predictions with a single network evaluation ⇒ huge speed (~45 fps; Fast YOLO 155 fps); 10x faster than Faster R-CNN (a 2-step algorithm, region proposal + classification, developed at the same time as YOLO; 7 fps) (Pascal VOC 2007 dataset)
    • First CNN-based model which could be used for object detection in real-time videos
    • Looks at the whole image at test time ⇒ predictions are informed by global context in the image
    • far less likely to predict false detections where nothing exists

    • (slightly) less precise localization than Faster R-CNN 
    • Lower detection performance on smaller objects
    • Struggles with unusual aspect ratios
    • Performance improved in YOLOv2 (YOLO9000) and YOLOv3


    [1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"
    [2] Lecture 11 | Detection and Segmentation - YouTube

    Lei Mao's Log Book – Introduction to YOLOs

    Understanding YOLO - By

    machine learning - How to label training data for YOLO - Stack Overflow


    k-means clustering for anchor boxes - Lars’ Blog

    Non-maximum Suppression (NMS) - Towards Data Science

    neural network - How is the number of grid cells in YOLO determined? - Data Science Stack Exchange

    How can I label an image to train YOLO automatically? - Quora

    One-stage object detection

    A Practical Guide to Object Detection using the Popular YOLO Framework

    A Comprehensive Guide To Object Detection Using YOLO Framework — Part II (Implementing using Python)

    #029 CNN Yolo Algorithm | Master Data Science

    YOLO, YOLOv2 and YOLOv3: All You want to know - Amro Kamal - Medium

    Understanding YOLO and YOLOv2 | Manal El Aidouni

    Juan Du: Understanding of Object Detection Based on CNN Family and YOLO

    Lecture 11: Detection and Localization

    Study of Using Deep Learning Nets for Mark Detection in Space Docking Control Images

    Thursday, 9 January 2020

    Object Detection with OverFeat


    Improve Sliding Window approach not by reducing number of crops (like in Region-proposal based solutions) but by passing image through CNN only once.


    Class predictions for each crop (position of the sliding window).


    Based on convolutional implementation of fully connected layers => NN = CNN + softmax layer.

    Input image goes only once through single forward-propagation NN => computationally efficient.
    This is why this method belongs to the group of detectors called single-shot detectors.

    Class probabilities for all locations are predicted at once.

    Idea published by Pierre Sermanet et al.: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

    OverFeat single propagation through CNN

    Image source: Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks"

    CNN (like AlexNet) typically has the following structure (conv-pooling pair is usually repeated several times):

    [conv --> pooling] --> FC --> FC --> Softmax

    FC layers require fixed-size input. Working backwards through the network, this forces the first convolutional layer (the 1st layer in the network) to accept input of a certain (fixed) size as well. This is why, for example, some Region-based detectors need to warp Regions of Interest before passing them through the CNN.

    But what happens if we don't impose this restriction? For a larger input image, the output of the last layer before the first FC layer will just be a tensor of different (larger) dimensions.

    If we then replace the FC layers with convolutional layers so that for some NxN input we get a 1x1 output (1x1xC actually, where C is the number of classes), we get a fully convolutional NN which can accept an input image of any size and whose output tensor size varies accordingly. It turns out that the classification results (predictions) in the output layer spatially match crops of the input image. E.g. if the input is (N+2)x(N+2) and the network's overall stride is 2, the output will be 2x2(xC): the upper-left vector contains C predictions for the NxN upper-left crop of the input image and, similarly, the lower-right vector contains C predictions for the NxN lower-right crop. So in one pass through the CNN we essentially get predictions for Sliding Window crops! The Sliding Window stride is defined here by the overall stride of the network (the product of the strides of its convolutional and pooling layers), so it cannot be chosen freely.

    In other words, we can read the upper image in the following way:

    If we design a fully convolutional NN so that for an input of size N x N x 3 (3 for RGB channels) it outputs a tensor of size 1 x 1 x C (C is the number of classes), then for an input image of size (N+2) x (N+2) x 3 we'll get a tensor of size 2 x 2 x C at the output (assuming the network's overall stride is 2). Each element of that 2 x 2 output is a vector of C values (class predictions), and each element spatially matches one N x N crop of the input image.

    OverFeat is based on the idea of replacing FC layers with Conv layers, as this allows passing an image of any size into the ConvNet.

    The 1st row shows a ConvNet which outputs a class prediction vector for an input image of a certain size (14x14 in this case). The output (1x1xC) is a single vector containing C elements - class predictions (C is the number of classes).

    If we pass a larger image through this ConvNet, we’ll get a larger matrix at the output (2x2xC) which contains class predictions for 4 positions of the sliding window. And we’ve got these predictions in one go!
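
    The crop correspondence described above can be checked numerically. The following is a minimal NumPy sketch, not OverFeat itself: the layer sizes (a 5x5 conv, 2x2 max pooling, a 5x5 conv standing in for the FC layer, and a 1x1 conv producing class scores), the random weights and the helper names conv2d, max_pool2x2 and fully_conv_net are all assumptions chosen so that a 14x14 input gives a 1x1xC output.

```python
import numpy as np

def conv2d(x, w):
    """'Valid' convolution with stride 1. x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    h_out, w_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.empty((h_out, w_out, w.shape[3]))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def max_pool2x2(x):
    """2x2 max pooling with stride 2 (odd trailing rows/columns are dropped)."""
    h, w, c = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def fully_conv_net(x, params):
    """Toy fully convolutional classifier (hypothetical layer sizes)."""
    w1, w_fc, w_out = params
    x = np.maximum(conv2d(x, w1), 0)    # 5x5 conv + ReLU
    x = max_pool2x2(x)                  # overall network stride becomes 2
    x = np.maximum(conv2d(x, w_fc), 0)  # former FC layer written as a 5x5 conv
    return conv2d(x, w_out)             # 1x1 conv -> C class scores per position

rng = np.random.default_rng(0)
C = 4  # number of classes (arbitrary for this sketch)
params = (
    rng.normal(size=(5, 5, 3, 8)),
    rng.normal(size=(5, 5, 8, 16)),
    rng.normal(size=(1, 1, 16, C)),
)

small = rng.normal(size=(14, 14, 3))
large = rng.normal(size=(16, 16, 3))

out_small = fully_conv_net(small, params)
out_large = fully_conv_net(large, params)
print(out_small.shape)  # (1, 1, 4)
print(out_large.shape)  # (2, 2, 4)

# Each output cell equals the network applied to one 14x14 crop;
# crops are 2 pixels apart because the overall stride is 2.
for i in range(2):
    for j in range(2):
        crop = large[2 * i:2 * i + 14, 2 * j:2 * j + 14]
        assert np.allclose(out_large[i, j], fully_conv_net(crop, params)[0, 0])
```

    The final loop runs the network four times only to verify the result; the single pass over the 16x16 image already contains all four predictions.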


    Bounding boxes match crops; they are not learned (not part of the output prediction) => low accuracy.


    [1] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, Yann LeCun: "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (2013)

    Object Localization in Overfeat - Towards Data Science

    Convolutional Implementation of Sliding Windows - Object detection | Coursera


    Fully Connected Layers in Convolutional Neural Networks: The Complete Guide

    tensorflow - Is it possible to give variable sized images as input to a convolutional neural network? - Cross Validated

    Object Detection with Faster R-CNN


    Time benchmarks of Fast R-CNN during inference showed that a large portion of the time is taken by region proposals. E.g. test time is 2.3 seconds with region proposals (using a fixed function like Selective Search which computes 2000 region proposals) and 0.32 seconds without. SPPnet also exposed region proposal computation as a bottleneck. So the problem with Fast R-CNN and SPPnet is that runtime is dominated by computing region proposals.

    Faster R-CNN solves this issue by making the CNN itself predict its own region proposals. This eliminates the overhead of computing region proposals outside the network.

    Faster R-CNN was published in June 2015 (last revision in January 2016) in the paper [1506.01497] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

    Network Architecture

    The entire input image is run through some convolutional layers to get a convolutional feature map representing the entire high-resolution image. This is the output of the last convolutional layer and contains detections of high-level features (shapes and objects).

    Faster R-CNN network. 
    Image credit: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

    There's now a separate Region Proposal Network (RPN), a fully convolutional network that uses the convolutional feature map to simultaneously predict:

    • object bounds (region proposals)
    • objectness scores (is it an object or background?) at each position of that feature map

    Bounding boxes are no longer computed by a fixed function but hypothesized (predicted) by the network.

    RPN takes an image of arbitrary size and generates a set of rectangular object proposals. RPN operates on a specific conv layer, with the preceding layers shared with the object detection network => the Region Proposal Network acts in a nearly cost-free way by sharing full-image conv features with the detection network.

    RPN introduces the term anchor box. RPN slides a small window across the feature map and at each position it places k (e.g. 9) rectangles of various scales and aspect ratios, centred on the sliding window. These rectangles are anchor boxes. They are like potential bounding boxes: some of them will be promoted into a bounding box at the end of the process.
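
    As an illustration of how k = 9 anchor boxes per position can be produced, here is a minimal NumPy sketch. It assumes 3 scales (128, 256, 512 pixels) combined with 3 aspect ratios (1:2, 1:1, 2:1), which is the default setting reported in the Faster R-CNN paper, and a feature-map stride of 16 pixels; the function names make_anchors and anchors_for_feature_map are hypothetical.

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors centred at (cx, cy).

    Boxes are (x1, y1, x2, y2). A ratio is height / width, and every anchor
    of scale s keeps the same area s * s regardless of its ratio.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

def anchors_for_feature_map(feat_h, feat_w, stride=16):
    """Place the same k anchors at every feature-map position.

    `stride` (assumed value) maps a feature-map cell back to image pixels.
    """
    all_boxes = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = (j + 0.5) * stride
            cy = (i + 0.5) * stride
            all_boxes.append(make_anchors(cx, cy))
    return np.concatenate(all_boxes, axis=0)

anchors = make_anchors(cx=100.0, cy=100.0)
grid_anchors = anchors_for_feature_map(feat_h=3, feat_w=4)
print(anchors.shape)       # (9, 4)
print(grid_anchors.shape)  # (108, 4)
```

    A feature map of H x W positions therefore yields H * W * k anchors in total, which is why only the best-scoring ones are kept as proposals.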

    RPN has a classifier and a regressor.

    For each anchor box it predicts:

    • 2 objectness scores (object or background) - classifier
    • 4 coordinates - regressor
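
    To make these two outputs concrete, the sketch below turns the 2 scores into an objectness probability and the 4 regressed values into a box. It assumes the parameterization from the Faster R-CNN paper, where the regressor outputs offsets (tx, ty, tw, th) relative to the anchor's centre, width and height; the function names and the example numbers are hypothetical.

```python
import numpy as np

def objectness_probability(scores):
    """Softmax over (background, object) scores; returns P(object) per anchor."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp[:, 1] / exp.sum(axis=1)

def decode_boxes(anchors, deltas):
    """Apply (tx, ty, tw, th) deltas to anchors given as (x1, y1, x2, y2)."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + wa / 2
    ya = anchors[:, 1] + ha / 2

    x = xa + deltas[:, 0] * wa      # shift centre by a fraction of anchor width
    y = ya + deltas[:, 1] * ha      # shift centre by a fraction of anchor height
    w = wa * np.exp(deltas[:, 2])   # rescale width
    h = ha * np.exp(deltas[:, 3])   # rescale height
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)

anchor = np.array([[0.0, 0.0, 10.0, 20.0]])
delta = np.array([[0.1, 0.0, np.log(2.0), 0.0]])  # move right by 1, double the width
proposal = decode_boxes(anchor, delta)
print(proposal)  # [[-4.  0. 16. 20.]]

prob = objectness_probability(np.array([[0.0, 0.0]]))
print(prob)      # [0.5]
```

    With all-zero deltas the proposal is the anchor itself, so the regressor only has to learn corrections to the anchor.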

    Region Proposal Network used in Faster R-CNN.
    Image credit: Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks"

    We can say that:

    Faster R-CNN = RPN + Fast R-CNN

    Once we have those predicted region proposals, the network looks just like Fast R-CNN: we take crops of the convolutional features for those region proposals and pass them to the rest of the network.

    Loss Function

    The network is now doing four things at once, so we have a four-way multi-task loss.

    Region proposal network does two things:
    • for each potential proposal it does binary classification and tells if region contains (any) object or not => Binary Classification Loss
    • performs regression to find the bounding box coordinates for each of those proposals => Bounding box Regression Loss

    The final network at the end also does these two things again:
    • makes the final classification decision, i.e. computes class scores for each of these proposals => Classification Loss
    • predicts final box coordinates; this second round of bounding box regression corrects any errors that may have come from the region proposal stage => Bounding box Regression Loss

    So the total loss is a sum of two Classification Losses and two Bounding box Regression Losses.
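
    A rough NumPy sketch of such a four-way loss is shown below. It assumes softmax cross-entropy for both classification terms and smooth L1 for both regression terms, and it simply adds the four terms with equal weight; the paper additionally balances the terms with a weight and applies the regression loss only to positive samples, which is omitted here. All function names and example values are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy. logits: (n, num_classes), labels: (n,) ints."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def smooth_l1(pred, target):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    diff = np.abs(pred - target)
    per_elem = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return per_elem.sum(axis=1).mean()

def total_loss(rpn_cls, rpn_reg, det_cls, det_reg):
    """Unweighted sum of the four terms (a simplification)."""
    return (
        cross_entropy(*rpn_cls)   # RPN: object vs background
        + smooth_l1(*rpn_reg)     # RPN: proposal box regression
        + cross_entropy(*det_cls) # detector: final class scores
        + smooth_l1(*det_reg)     # detector: final box regression
    )

# Tiny made-up example: one anchor, one proposal, 3 object classes.
rpn_cls = (np.array([[0.0, 0.0]]), np.array([1]))
rpn_reg = (np.array([[0.5, 2.0, 0.0, 0.0]]), np.zeros((1, 4)))
det_cls = (np.array([[0.0, 0.0, 0.0]]), np.array([2]))
det_reg = (np.zeros((1, 4)), np.zeros((1, 4)))

loss = total_loss(rpn_cls, rpn_reg, det_cls, det_reg)
print(loss)  # log(2) + 1.625 + log(3) + 0
```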


    How is RPN trained? 

    The idea is that whenever a region proposal overlaps any of the ground truth objects by more than some threshold, it is treated as a positive region proposal and should be predicted as a region proposal, while any potential proposal which has very low overlap with all ground truth objects should be predicted as a negative.
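
    This labelling rule can be sketched with an Intersection over Union (IoU) computation. The thresholds below (positive at IoU of 0.7 or more, negative below 0.3, everything in between ignored during training) follow the values given in the Faster R-CNN paper; the paper's extra rule that the anchor with the highest IoU for each ground truth box is also positive is left out for brevity. The function names and example boxes are hypothetical.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU. anchors: (n, 4), gt_boxes: (m, 4), boxes as (x1, y1, x2, y2)."""
    x1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for each anchor."""
    best_iou = iou_matrix(anchors, gt_boxes).max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[best_iou < neg_thresh] = 0
    labels[best_iou >= pos_thresh] = 1
    return labels

gt_boxes = np.array([[0.0, 0.0, 10.0, 10.0]])
anchors = np.array([
    [0.0, 0.0, 10.0, 10.0],    # IoU 1.0 -> positive
    [0.0, 0.0, 10.0, 5.0],     # IoU 0.5 -> ignored
    [20.0, 20.0, 30.0, 30.0],  # IoU 0.0 -> negative
])
labels = label_anchors(anchors, gt_boxes)
print(labels)  # [ 1 -1  0]
```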


    Faster R-CNN belongs to Region-based methods for object detection. In this family of methods there's some kind of region proposal and then we're doing some independent processing (pooling and then classification) for each of those potential regions, sequentially.

    That works well but it’s also quite slow (~7 fps) as it requires running the detection and classification portion of the model multiple times.

    While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications.

    Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS).

    Significantly increased speed comes only at the cost of significantly decreased detection accuracy.


    [1506.01497] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

    Lecture 11 | Detection and Segmentation - YouTube

    Object Detection for Dummies Part 3: R-CNN Family

    computer science - faster-RCNN,why don't we just use only RPN for detection? - Mathematics Stack Exchange

    Faster R-CNN for object detection - Towards Data Science

    A Step-by-Step Introduction to the Basic Object Detection Algorithms (Part 1)

    Understanding Object Detection - Towards Data Science

    Region Proposal Network (RPN) — Backbone of Faster R-CNN

    Faster R-CNN Explained - Hao Gao - Medium

    deep learning - Anchoring Faster RCNN - Cross Validated