Friday 31 January 2020

How to disable IPv6 on Ubuntu

Some web servers don't support IPv6 connections and may refuse such connections with an HTTP 403 (Forbidden) error.

Linux prioritises IPv6 over IPv4, so we need to either deprioritise IPv6 or disable it completely on the machine's network interfaces.

To lower the IPv6 priority we can edit /etc/gai.conf and uncomment the following line:

# precedence ::ffff:0:0/96  100

If this does not give the desired results, we can disable IPv6 completely.



First, to check whether IPv6 traffic from your machine is enabled, go to an IP checker website (e.g. https://test-ipv6.com/ or https://whatismyipaddress.com/) and see whether it detects your IPv6 address.

You can also run

$ ifconfig | grep inet6

...and check if (local) IPv6 addresses are assigned to active interfaces.


To disable IPv6 do the following:


1) Open /etc/sysctl.conf

$ sudo nano /etc/sysctl.conf


2) Append the following lines to the existing configuration and save the file:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv6.conf.tun0.disable_ipv6 = 1

If there is a network adapter with an inet6 address assigned, like this one:

$ ifconfig 
...
wlp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.15  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fd04:3f2a:df62:0:a9b4:d5a4:e119:be04  prefixlen 64  scopeid 0x0<global>
        inet6 fd04:3f2a:df62:0:f796:34ff:fef7:fd19  prefixlen 64  scopeid 0x0<global>
        inet6 2a02:c7f:ac1b:bf01:4d82:4779:3eb4:2f46  prefixlen 64  scopeid 0x0<global>
        inet6 fd04:3f2a:df62:0:b8b4:35ee:28ed:6ee6  prefixlen 64  scopeid 0x0<global>
        inet6 fd04:3f2a:df62:0:4a82:4779:3eb4:2f46  prefixlen 64  scopeid 0x0<global>
        inet6 2a02:c7f:ac1b:bf00:f656:34ff:fef7:fd19  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::f696:34ff:fef7:fd29  prefixlen 64  scopeid 0x20<link>
        ether f4:96:34:f8:fd:19  txqueuelen 1000  (Ethernet)
        ...

...then also add a line which disables IPv6 on it:

net.ipv6.conf.wlp2s0.disable_ipv6 = 1


3) Instruct the OS to re-read this config file:

$ sudo sysctl -p


The changes are applied immediately, even if you are on an active VPN connection.

To validate changes, repeat IPv6 validation steps described above.

$ ifconfig 
...
wlp2s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.0.15  netmask 255.255.255.0  broadcast 192.168.0.255
        ether f4:96:34:f8:fd:19  txqueuelen 1000  (Ethernet)
        ...


If we now go to a public IP checker, e.g. https://www.whatismyip.com/, we'll see that our public IPv6 address is no longer detected:



Sunday 19 January 2020

Building the Machine Learning Model

I came across this nice diagram which explains how to build a machine learning model, so I am sharing it with you. All credit goes to its author, Chanin Nantasenamat.


Monday 13 January 2020

Viber on PC not syncing? Here is the solution.

I've noticed that Viber on my Ubuntu PC stopped syncing messages with the Viber app on my mobile phone. I didn't find a solution on the Viber Help pages, so I had to find the fix myself. It's actually very simple: you just have to delete one file and restart the application; no Viber reinstall is needed!

Before anything else, exit the Viber application on the PC.

Let's find all Viber files and directories:

$ sudo find / -name "*viber*"
/var/lib/dpkg/info/viber.md5sums
/var/lib/dpkg/info/viber.list
/var/lib/dpkg/info/viber.postinst
/var/lib/dpkg/info/viber.prerm
/var/lib/dpkg/info/viber.preinst
/var/lib/dpkg/info/viber.0
/var/lib/dpkg/info/viber.copyright
/home/bojan/Downloads/viber.deb
/home/bojan/.cache/gnome-software/icons/4759200235b7bb401072b357b6bcd10db4e6c4a1-viber-icon-logo-4E5ED1327A-seeklogo.com.png
/home/bojan/.ViberPC/440123456789/viber.db-shm
/home/bojan/.ViberPC/440123456789/viber.db-wal
/home/bojan/.ViberPC/440123456789/viber.db
/usr/share/applications/viber.desktop
/usr/share/pixmaps/viber.png
/usr/share/viber
/opt/viber


NOTE: 440123456789 is the number of the mobile device to which you've been syncing messages so far.

/home/bojan/.ViberPC/440123456789/viber.db is the file which contains all message history. Let's delete it:

$ rm ~/.ViberPC/440123456789/viber.db 

Open the Viber app on the mobile phone and re-launch Viber on the PC (from Applications or from the terminal, as here):

$ /opt/viber/Viber 
Attribute Qt::AA_ShareOpenGLContexts must be set before QCoreApplication is created.
qml: *** popupMode = 1920
qrc:/QML/DebugMenu.qml:262: TypeError: Cannot call method 'isWasabiEnabled' of undefined
qrc:/QML/DebugMenu.qml:289: TypeError: Cannot call method 'isSearchInCommunitiesForceEnabled' of undefined
qrc:/QML/DebugMenu.qml:296: TypeError: Cannot call method 'isOOABURISpamCheckerForceEnabled' of undefined
qrc:/QML/DebugMenu.qml:304: TypeError: Cannot call method 'isRateCallQualityForceEnabled' of undefined

We'll see prompts telling us to approve syncing on both PC and mobile applications:

Viber sync approval prompt on PC


Viber sync approval prompt on mobile phone
Viber sync start prompt on PC


After we approve syncing on both devices, the syncing process will start:

Viber syncing message on mobile phone


Viber syncing message on PC
After the process completes, Viber on the PC will be synced with the mobile Viber app.

If removing viber.db does not help, also delete data.db:

 $ rm ~/.ViberPC/440123456789/data.db 

...and repeat the whole process.

Sunday 12 January 2020

Instance Segmentation

Input


  • image
  • predefined set of categories

Goal


Predict the locations and identities of objects in the image, similar to object detection, but rather than just predicting a bounding box for each object, we want to predict a whole segmentation mask for each object, i.e. which pixels in the input image correspond to each object instance.

Instance Segmentation is the full problem, a hybrid between semantic segmentation and object detection: as in object detection we can handle multiple objects, and we differentiate the identities of the different instances.


ROAD, SHEEP, SHEEP, SHEEP, GRASS

(Differentiate instances)


In the example above Instance Segmentation distinguishes between the three sheep instances.

The output is like in semantic segmentation, where we have pixel-wise labels, but here for each object we also want to say which pixels belong to that object.

Method

The idea is to get region and classification predictions (for each object) and then apply semantic segmentation onto each of these regions.

Mask R-CNN



And this ends up looking a lot like Faster R-CNN.

It has a multi-stage processing approach: the whole input image goes into a convolutional network and a learned region proposal network (RPN), exactly as in Faster R-CNN. Once we have the learned region proposals (input image goes through CNN + RPN), we project those proposals onto the convolutional feature map, just like we did in Fast and Faster R-CNN.

But now, rather than just making a classification and a bounding box regression decision for each of those boxes, we additionally want to predict a segmentation mask for each region proposal. So it looks like a semantic segmentation problem inside each of the region proposals that we get from the region proposal network.


Mask R-CNN Architecture
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick: Mask R-CNN

After RoI Align warps the features corresponding to each region proposal into the right shape, we have two different branches.

The first branch, at the top, looks just like Faster R-CNN: it predicts classification scores telling us the category corresponding to that region proposal (or whether it's background), as well as bounding box coordinates regressed off the region proposal coordinates.


Mask R-CNN Architecture in detail

Image source: Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation

In addition, there is a branch at the bottom which looks basically like a semantic segmentation mini-network: it classifies, for each pixel in the input region proposal, whether or not it belongs to the object. The Mask R-CNN architecture unifies Faster R-CNN and semantic segmentation models into one nice, jointly end-to-end trainable model.

It works really well; just look at the examples in the paper. They look almost indistinguishable from the ground truth.
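
As an aside (my addition, not from the lecture or the paper): torchvision ships a pretrained Mask R-CNN, so a minimal inference sketch looks roughly like the following, assuming torch and torchvision are installed; the input file name and the 0.5 thresholds are arbitrary choices.

import torch
import torchvision
from PIL import Image
from torchvision.transforms.functional import to_tensor

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("sheep.jpg").convert("RGB"))   # hypothetical input image

with torch.no_grad():
    pred = model([image])[0]        # dict with 'boxes', 'labels', 'scores', 'masks'

keep = pred["scores"] > 0.5         # confidence threshold (illustrative)
boxes = pred["boxes"][keep]         # one bounding box per detected instance
masks = pred["masks"][keep] > 0.5   # one soft mask per instance, thresholded to binary
labels = pred["labels"][keep]       # COCO class ID per instance
print(boxes.shape, masks.shape, labels)

The masks output is exactly the per-instance pixel labeling described above: one mask per detected object, on top of the usual boxes and class scores.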

Pose Estimation


Mask R-CNN also does pose estimation: you can estimate pose by predicting the coordinates of each of the joints of the person.

Mask R-CNN can do joint object detection, pose estimation, and instance segmentation. The only addition we need to make is that, for each region proposal, we add another small branch that predicts the coordinates of the joints for the instance in that region proposal.
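
To illustrate this (my own example, not the lecture's code): torchvision's Keypoint R-CNN is such a model - the Faster R-CNN framework plus a small branch that predicts 17 COCO joints per detected person. A rough sketch, assuming a recent torchvision:

import torch
import torchvision

# Faster R-CNN framework + a keypoint (joint) prediction branch, pretrained on COCO
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)     # stand-in for a real image tensor with values in [0, 1]
with torch.no_grad():
    out = model([image])[0]
print(out["keypoints"].shape)       # (num_detected_people, 17, 3): x, y, visibility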



Addition for pose estimation

Image source: Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation



As another layer has been added (another head coming out of the network), we need to add another loss term to our multi-task loss.

Because it's built on the Faster R-CNN framework, it runs relatively close to real time, something like 5 fps on a GPU, since all of this is done in a single forward pass of the network.


Training



How much training data do you need?

All of these instance segmentation results were trained on the Microsoft COCO dataset. COCO has roughly 200,000 training images and 80 categories of interest; in each of those training images all instances of those 80 categories are labeled, with an average of five or six instances per image, so it is quite a lot of data. For all the people in COCO, the joints are annotated as well, so there is quite a lot of supervision at training time. These models are trained with a lot of data.

Training: Future improvements


One really interesting topic to study moving forward: if you have a lot of data for some problem, at this point we're relatively confident that you can put together a convolutional network that will probably do a reasonable job at it, but figuring out how to get this kind of performance with less training data is a very interesting and active area of research. That's something people will be spending a lot of effort on in the next few years.


References:

Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick: Mask R-CNN

Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube

Semantic Segmentation

Input

  • image (pixels)
  • list of categories

Goal


Classify each pixel in the image (assign it a category label).
Don't differentiate instances (objects); only care about pixels.





Output

  • Every input pixel is assigned a category
  • Pixels of each category are painted with the same color e.g. grass, cat, tree, sky
  • If two instances of the same category are next to each other, the entire area will have the same label and will be painted with the same color

Method


Approach #1: Sliding Window 


Approach #1 is to use a sliding window: we move a small window across the image and apply CNN classification to determine the class of each crop, which is then assigned to the central pixel of the crop.

This would be very computationally expensive, as we'd need to classify (push through the CNN) a separate crop for each pixel in the image.

It would also be very inefficient because it does not reuse shared features between overlapping patches: if two patches overlap, their convolutional features end up going through the same convolutional layers, so a lot of computation could be shared instead of being recomputed for each patch.
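
A minimal sketch of the sliding-window idea (my illustration; the patch size, zero padding and the classifier interface are assumptions) - the nested loops make the one-forward-pass-per-pixel cost explicit:

import torch

def sliding_window_segmentation(image, classifier, patch=32):
    """Naive per-pixel classification: one classifier forward pass per pixel (illustration only).
    image: (3, H, W) tensor; classifier: any callable mapping (1, 3, patch, patch) -> (1, C) scores."""
    _, H, W = image.shape
    pad = patch // 2
    padded = torch.nn.functional.pad(image, (pad, pad, pad, pad))    # zero-pad the borders
    labels = torch.zeros(H, W, dtype=torch.long)
    for y in range(H):
        for x in range(W):
            crop = padded[:, y:y + patch, x:x + patch].unsqueeze(0)  # crop centred on pixel (y, x)
            labels[y, x] = classifier(crop).argmax(dim=1).item()     # separate forward pass per pixel
    return labels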


Approach #2: CNN, layers keep spatial size


Use a fully convolutional network: the whole network is a giant stack of convolutional layers, with no fully connected layers, where each convolutional layer preserves the spatial size of the input:

input image --> [conv] --> output image
  • input 3 x H x W
  • convolutions D x H x W: conv --> conv --> conv --> conv
  • scores: C x H x W (C is the number of categories/labels)
  • argmax ==> Predictions H x W
Final convolutional layer outputs tensor C x H x W.

The size of the output has to be the same as the input image, since we want a classification for each pixel; the output has to be pixel-perfect, with sharp and clear borders between segments.

All computations are done in one pass. 

Using convolutional layers which keep the same spatial size as the input image is very expensive and takes a lot of memory and computation, since every layer works on full-resolution, multi-channel feature maps.
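
A minimal PyTorch sketch of such a network (my illustration; the depth, channel counts and number of categories are arbitrary):

import torch
import torch.nn as nn

C = 21  # number of categories (assumption)

# Every layer uses 3x3 kernels with padding=1, so the spatial size H x W is preserved throughout.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, C, kernel_size=3, padding=1),      # scores: C x H x W
)

x = torch.randn(1, 3, 224, 224)       # input image 3 x H x W
scores = model(x)                     # 1 x C x H x W
predictions = scores.argmax(dim=1)    # 1 x H x W, a category per pixel
print(scores.shape, predictions.shape)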

Approach #3: CNN, downsampling + upsampling



Design the network as a stack of convolutional layers, with downsampling and then upsampling of the feature map inside the network:

input image --> [conv --> downsampling] --> [conv --> upsampling] --> output image


Downsampling
  • spatial information gets lost
  • e.g. max pooling, strided convolution

Upsampling: max unpooling or strided transpose convolution (see the sketch below).
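
A rough PyTorch sketch of this downsample-then-upsample design, using strided convolutions for downsampling and transpose convolutions for upsampling (my illustration; layer sizes are arbitrary):

import torch
import torch.nn as nn

C = 21  # number of categories (assumption)

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),             # H/2 x W/2
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),           # H/4 x W/4
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # back to H/2 x W/2
    nn.ConvTranspose2d(64, C, 4, stride=2, padding=1),               # H x W, C score maps
)

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)   # torch.Size([1, 21, 224, 224])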

Training


Put a classification loss at every pixel of the output, take the average over space, and train the network with normal backpropagation.

Data


Creating a training set is an expensive and long manual process: each pixel has to be labelled. There are some tools for drawing contours and filling in the regions.

Loss Function


Loss function: cross-entropy loss is computed between each pixel in the output and the corresponding ground-truth pixel; then the sum or average is taken over space and over the mini-batch.
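
In PyTorch terms this is just the standard cross-entropy loss applied to an N x C x H x W score map against an N x H x W label map (a small sketch; the batch size and class count are assumptions):

import torch
import torch.nn as nn

N, C, H, W = 4, 21, 224, 224                            # assumed batch size and class count
logits = torch.randn(N, C, H, W, requires_grad=True)    # network scores for every pixel
target = torch.randint(0, C, (N, H, W))                 # ground-truth class index for every pixel

loss = nn.CrossEntropyLoss()(logits, target)            # per-pixel cross-entropy, averaged over space and batch
loss.backward()                                         # normal backpropagation
print(loss.item())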


Problem


Individual instances of the same category are not differentiated. This is improved with Instance Segmentation.

References:


Stanford University School of Engineering - Convolutional Neural Networks for Visual Recognition - Lecture 11 | Detection and Segmentation - YouTube

Saturday 11 January 2020

Object Detection with SSD


SSD (Single Shot MultiBox Detector) is a method for object detection (object localization and classification) which uses a single deep neural network (DNN). "Single shot" means that object detection is performed in a single forward pass of the DNN.

This method was proposed by Wei Liu et al. in December 2015 and last revised in December 2016: SSD: Single Shot MultiBox Detector.

Objective


Fast Object Detection.

Method


The SSD network, built on the VGG-16 network, performs the task of object detection and localization in a single forward pass of the network. This approach discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. [source]


Here are some key points from the paper's abstract:
  • SSD uses single deep neural network
  • SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location
BK: note here that the different aspect ratios and scales are applied to default boxes per feature map location, not to anchor boxes in the image (a small default-box generation sketch follows after this list)
  • At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. 
  • Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. 
  • Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For 300×300 input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for 500×500 input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model. 
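
Here is the small default-box generation sketch mentioned above (my own simplification; the scale and aspect ratios are assumptions, not the paper's exact values):

import itertools
import math

def default_boxes(fm_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) default boxes, normalized to [0, 1], one set per feature map cell."""
    boxes = []
    for i, j in itertools.product(range(fm_size), repeat=2):
        cx = (j + 0.5) / fm_size            # box centre = centre of the feature map cell
        cy = (i + 0.5) / fm_size
        for ar in aspect_ratios:
            boxes.append((cx, cy, scale * math.sqrt(ar), scale / math.sqrt(ar)))
    return boxes

boxes = default_boxes(fm_size=8, scale=0.3)
print(len(boxes))   # 8 * 8 * 3 = 192 default boxes for this feature map

In the real SSD, several feature maps of decreasing resolution are used and each gets its own scale, so small boxes come from the early, high-resolution maps and large boxes from the deeper, low-resolution ones.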

SSD Framework

Image source: Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector

We know that deeper Conv layers in CNNs extract/learn more complex features.
Feature maps preserve spatial structure of the input image but at lower resolution.

Lecture 11: Detection and Localization


If we take some CNN (like ResNet) pretrained for image recognition (image classification) and remove its last FC layers, its output will be a feature map as described above.

Now we can do something similar to what YOLO does on the image: divide the feature map into grid cells and apply an equidistant detector which predicts anchor boxes.



--------------
Given our input image (3 * H * W) you imagine dividing that input image into some coarse S * S grid, and now within each of those grid cells you imagine some set of B base bounding boxes (e.g. B = 3 base bounding boxes like a tall one, a wide one, and a square one but in practice you would use more than three). These bounding boxes are centered at each grid cell.

Now for each of these grid cells (S x S) network has to predict two things:

  • for each of these base bounding boxes (B): an offset off the base bounding box to predict what is the true location of the object off this base bounding box. 
    • This prediction has two components:
      • bounding box coordinates: dx, dy, dh, dw
      • confidence
    • So the final output has B * 5 values
  • classification scores for each of C classes (including background as a class)

At the end we end up predicting from our input image this giant tensor:
S * S * (B * 5 + C)

So that's just where we have B base bounding boxes, with five numbers for each (the offsets and the confidence for that base bounding box) and C classification scores for our C categories.

So we can view object detection as: input an image, output this three-dimensional tensor, and you can imagine training the whole thing with one giant convolutional network.

That's roughly what these single-shot methods do; matching the ground-truth objects to these potential base boxes becomes a little bit hairy, but that's what these methods do.
--------------

Architecture


SSD has two components:

  • base (backbone) model
  • SSD head


Backbone model:

  • usually a pre-trained image classification network as a feature extractor from which the final fully connected classification layer has been removed; such NN is able to extract semantic meaning from the input image while preserving the spatial structure of the image albeit at a lower resolution
  • VGG-16 or ResNet trained on ImageNet 


SSD head:

  • one or more convolutional layers added to the backbone
  • outputs are interpreted as the bounding boxes and classes of objects at the spatial locations of the final layers' activations (see the hedged example below)
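
As a hedged example (my addition, not the paper's code): torchvision 0.10+ ships an SSD300 with a VGG-16 backbone plus SSD detection heads, matching this backbone + head split:

import torch
import torchvision

model = torchvision.models.detection.ssd300_vgg16(pretrained=True)   # VGG-16 backbone + SSD heads, COCO weights
model.eval()

image = torch.rand(3, 300, 300)      # SSD300 natively works on 300x300 inputs
with torch.no_grad():
    det = model([image])[0]          # dict with 'boxes', 'scores', 'labels'
print(det["boxes"].shape, det["labels"][:5])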

SSD vs YOLO Network Architecture
Image source: Wei Liu et al.: "SSD: Single Shot MultiBox Detector"




Examples


The TensorFlow Object Detection API comes with pretrained models; ssd_inception_v2_coco_2017_11_17 is one of them.

TensorRT/samples/opensource/sampleUffSSD at master · NVIDIA/TensorRT · GitHub
TensorFlow implementation of SSD, which actually differs from the original paper, in that it has an inception_v2 backbone. For more information about the actual model, download ssd_inception_v2_coco. The TensorFlow SSD network was trained on the InceptionV2 architecture using the MSCOCO dataset which has 91 classes (including the background class). The config details of the network can be found here.
Logo detection in Images using SSD - Towards Data Science
TensorFlow Object Detection API with Single Shot MultiBox Detector (SSD) - YouTube


ssd_mobilenet_v1_coco_2017_11_17


References:


Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector

TensorRT UFF SSD

RattyDAVE/pi-object-detection: Raspberry Pi Object detection.

https://github.com/weiliu89/caffe/tree/ssd

https://machinethink.net/blog/object-detection/

SSD : Single Shot Detector for object detection using MultiBox

13.7. Single Shot Multibox Detection (SSD) — Dive into Deep Learning 0.7.1 documentation

Understanding SSD MultiBox — Real-Time Object Detection In Deep Learning

How single-shot detector (SSD) works? | ArcGIS for Developers

Is SSD really better than YOLO? - Quora

Review: SSD — Single Shot Detector (Object Detection)

SSD object detection: Single Shot MultiBox Detector for real-time processing

What do we learn from single shot object detectors (SSD, YOLOv3), FPN & Focal loss (RetinaNet)?


Object Detection with YOLO

YOLO (You Only Look Once), together with SSD (Single Shot MultiBox Detector), OverFeat and some other methods, belongs to a family of object detection algorithms known as "single-shot" object detectors: the entire image is taken ("looked at") and passed forward through the network only once.

The YOLO paper ([1506.02640] You Only Look Once: Unified, Real-Time Object Detection) was submitted in June 2015 and last revised in May 2016.

Objective


Fast Object Detection.

Evolution of the Idea


Let's look at two families that tried to improve on the basic sliding window detector and see their strengths and weaknesses.

Most advanced Region-proposal detectors (Faster R-CNN):
  • RoIs are learned => good bounding box accuracy
  • RoIs are further processed separately => low speed


Fully-convolutional approach (OverFeat):
  • RoIs are not learned => low bounding box accuracy
  • RoIs are processed in one go => high speed


To improve further, let's take the best of both:
  • Learning RoIs
  • Processing them in one go

Idea: 
  • No regions; the image is taken as a whole.
  • Image passed once (in a “single shot”) through Fully Convolutional NN (FCNN).
  • FCNN simultaneously predicts:
    • all bounding boxes (regression)
    • class probabilities (classification) for each BBox
Benefits:
  • Training only a single neural network is required
  • Faster inference ⇒ NN can be used for object detection in real-time videos
  • Network architecture is lighter ⇒ NN can be deployed on embedded/mobile devices

Implementations: YOLO, SSD, DetectNet



Region-based detectors are all doing two things: proposing potential bounding boxes (RoIs) and then performing classification on them.
After classification, post-processing is used to refine the bounding box, eliminate duplicate detections, and rescore the box based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.[1]

The idea in YOLO/SSD is to treat detection entirely as a regression problem. Rather than doing independent processing for each of these potential regions, we treat this as a regression problem and make all the predictions at once with a single convolutional network.

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. [1]

YOLO Detection System
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"

Use an equidistant grid of predictors which predict a set of bounding boxes and apply classification within them.

Create a grid of fixed, equidistant detectors: divide the image into an S*S grid of equal, static cells.

For each cell the neural network predicts:
  • B anchor boxes (which can extend beyond their cells; B is the number of anchor boxes per cell, also used in labeling; they enable each cell to predict more than one object) and for each of them:
    • location and size ((x, y) - box center, w - width, h - height). All these variables are scaled to [0, 1] range. x and y are relative to (0, 0) point of the image and w and h are relative to image's total width and height. We can say that image's upper left corner is (0, 0) and lower right corner is (1, 1).
    • confidence score (a number between 0 and 1) made of:
      • probability that box contains some object (objectness):  P(object)
      • IOU (Intersection over Union) - how accurate predicted box matches the ground truth
  • (conditional) probability for each class c that this cell contains object of that class (if it contains any object at all): P(c|object)

The neural network output (prediction) is a tensor with these dimensions:
(S*S)*(B*5+C)

YOLO turns object detection into regression problem (in contrast to classification used elsewhere).

The confidence score that a given box contains an object of class c is:
P(c|object) * P(object) * IoU = P(c) * IoU

Non-max suppression (see the sketch below):
  • for each class, keep only boxes with confidence above the threshold
  • each object should be detected by only one anchor box: the box with the highest confidence score is kept, and nearby boxes with high Intersection over Union (IoU) with it are discarded (iteratively)
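
Here is the sketch: a plain IoU computation and greedy non-max suppression (an illustration, not YOLO's exact implementation; the thresholds are arbitrary):

def iou(a, b):
    """a, b: boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    candidates = sorted(
        (bs for bs in zip(boxes, scores) if bs[1] >= score_thr),
        key=lambda bs: bs[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, k) < iou_thr for k, _ in kept):
            kept.append((box, score))
    return kept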

Object is detected by one cell only - the one which contains its centre point. But one cell can predict multiple objects.

Cell predicts B anchor boxes to address cases where multiple objects have centrepoints in one cell.

k-Means clustering used on training sets to determine anchor boxes sizes and number (B).




YOLO Model
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"


Example:

7 * 7 grid, 2 anchor boxes (e.g. one horizontal and one vertical) per cell, 20 classes:
(7 * 7) * (2 * 5 + 20) = 1470 outputs (which is not much for typical neural network)


In the original YOLO paper the output tensor has dimensions S x S x (B * 5 + C), but in YOLOv2 it was changed to S x S x (B * (5 + C)) (https://leimao.github.io/blog/YOLOs/, https://medium.com/@amrokamal_47691/yolo-yolov2-and-yolov3-all-you-want-to-know-7e3e92dc4899). This is also the output format that Andrew Ng uses in his YOLO lecture.

The ground truth bounding box has the highest IoU with the smaller horizontal anchor. This means that when labeling this image, we set the confidence score to 1 only for that anchor box, while for the others it will be 0.


In practice, anchor boxes are predetermined: k-means clustering is run on the training set to find the most common bounding box shapes, which are then grouped into B groups. In the example presented in the image above, we have 5 anchor boxes and they have fixed sizes. Their sizes are used for calculating IoU with ground truth bounding boxes, to determine to which anchor an object belongs. This information is then used in:
  • labelling - we then know where in the output y vector we need to place the data for a particular object
  • inference - non-max suppression (see its description above)

BK: Naturally, objects come in shapes/positions that can be surrounded by either horizontal or vertical rectangular bounding boxes of various sizes and aspect ratios. We use such "average" anchor boxes when deciding to which of them to assign a ground truth box; they are a kind of "reference" against which we label training sets. YOLO also predicts such bounding boxes, and that's why it's not super accurate.

For each anchor box, the system will then predict not its dimensions but its offset from the ground truth bounding box. The anchor box with the smallest offset is the one with the highest IoU, and it will be promoted to the predicted bounding box.

Network

Darknet: a CNN framework for real-time object detection with high accuracy.

A single feedforward ConvNet predicts bounding boxes and class probabilities in a single evaluation.

Built on GoogLeNet (22 layers); 2 convolutional layers and 2 fully connected layers were added (for inference and regression on the bounding box centre coordinates as well as the size (width and height), which can range over the whole image).

YOLO Architecture
Image source: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"

Training

Fully supervised.

YOLO trains on full images.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall. [1]


Labeling training set:
  • k-Means clustering used on training sets to determine anchor box sizes & their number

BK: These B anchor boxes are used for two things:

1) To help placing ground truth bounding box coordinates at the right position (for a specific anchor box) in the label vector. This specific anchor box is the one for which IoU with ground truth box is the highest.

2) To determine at the output which predicted coordinates (or predicted offsets) belong to which anchor box.

The YOLO network itself never uses fixed anchor box sizes. It only predicts each anchor box, and the anchor boxes at the network output correspond to those in the input: the (x, y, w, h) values at a specific location in the output vector are predictions learned after observing all the (x, y, w, h) quadruplets at the same location in the input label vectors - as simple as that, just like any other regression (see the image below).

  • Each object in the image is assigned to the cell that contains the object's midpoint, and to the anchor box (belonging to that cell) with the highest IoU ⇒ the cell where the centre of the object falls is responsible for detecting it ⇒ P(class_i) = 1 (see the labeling sketch below)
  • For all cells which do not contain the midpoint of any object, P(object) = 0 and the label is [0, ?, ?, ?, ?] (? means "don't care"). So, for example, if we have two objects in the image, only two grid cells will have [1, x, y, w, h]; all the others will have [0, ?, ?, ?, ?]
  • If the anchor box size matches the ground truth box ⇔ P(object) = 1 ∧ IoU = 1
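
A small sketch of this labeling scheme, using the S x S x (B * (5 + C)) layout mentioned above (my illustration; the grid size, class count and especially the anchor-choice rule are simplified assumptions - real labeling picks the anchor by IoU, not by a wide/tall heuristic):

import numpy as np

S, B, C = 7, 2, 20   # grid size, anchors per cell, number of classes (assumptions)

def encode_labels(objects):
    """objects: list of (x, y, w, h, class_id) with coordinates normalized to [0, 1]."""
    y = np.zeros((S, S, B, 5 + C), dtype=np.float32)
    for (x, yc, w, h, cls) in objects:
        col = min(int(x * S), S - 1)        # cell containing the object's midpoint
        row = min(int(yc * S), S - 1)
        b = 0 if w >= h else 1              # crude anchor choice: wide vs tall (stand-in for IoU matching)
        y[row, col, b, 0] = 1.0             # P(object) / confidence target
        y[row, col, b, 1:5] = (x, yc, w, h) # box centre and size
        y[row, col, b, 5 + cls] = 1.0       # one-hot class probabilities
    return y.reshape(S, S, B * (5 + C))

labels = encode_labels([(0.5, 0.5, 0.4, 0.2, 3)])   # one object, class 3, centred in the image
print(labels.shape)   # (7, 7, 50)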

Training set labeling: label vector matches network output


Example of 2 potential anchor boxes.


The role of anchor boxes is to hold the information about the location of potential bounding boxes. Image source: datahacker.rs


Each of these y vectors belongs to a single cell so the complete y vector used in labeling is made of 3 x 3 such small y vectors.

Cells and anchors give spatial information to the regression output. If we didn't use anchors, our neural network would predict classes for each cell, and the bounding boxes would be inaccurate as they would match exactly the borders of the cells.

Training


For training we use convolutional weights that are pre-trained on ImageNet.
Extensive data augmentation is used.

Inference


Inference runs once on the entire image. Very fast.

To enable using video files and streams from network or web cameras, compile Darknet with OpenCV support.
On Ubuntu, install libhighgui-dev and libopencv-dev (sudo apt-get install).


    Advantages:
    • It makes predictions with a single network evaluation ⇒ huge speed (~45 fps; Fast YOLO 155 fps); about 10x faster than Faster R-CNN (which had been developed at about the same time as YOLO and is a 2-step algorithm: region proposal + classification; 7 fps) (PASCAL VOC 2007 dataset)
    • First CNN-based model which could be used for object detection in real-time videos
    • Looks at the whole image at test time ⇒ predictions are informed by global context in the image
    • far less likely to predict false detections where nothing exists

    Disadvantages:
    • (slightly) less precise localization than Faster R-CNN 
    • Lower detection performance on smaller objects
    • Struggles with unusual aspect ratios
    • Performance improved in YOLOv2 (YOLO9000) and YOLOv3


    Versions of YOLO


    YOLO

    YOLOv2 

    YOLOv3

    YOLOv4

    YOLOv5


    References:

    [1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi: "You Only Look Once: Unified, Real-Time Object Detection"
    [2] Lecture 11 | Detection and Segmentation - YouTube

    Lei Mao's Log Book – Introduction to YOLOs

    Understanding YOLO

    machine learning - How to label training data for YOLO - Stack Overflow

    yolo.pdf
    YOLO9000.pdf

    k-means clustering for anchor boxes - Lars’ Blog

    Non-maximum Suppression (NMS) - Towards Data Science

    neural network - How is the number of grid cells in YOLO determined? - Data Science Stack Exchange

    How can I label an image to train YOLO automatically? - Quora

    One-stage object detection

    A Practical Guide to Object Detection using the Popular YOLO Framework

    A Comprehensive Guide To Object Detection Using YOLO Framework — Part II (Implementing using Python)

    #029 CNN Yolo Algorithm | Master Data Science

    YOLO, YOLOv2 and YOLOv3: All You want to know - Amro Kamal - Medium

    Understanding YOLO and YOLOv2 | Manal El Aidouni

    Juan Du: Understanding of Object Detection Based on CNN Family and YOLO

    Lecture 11: Detection and Localization

    Study of Using Deep Learning Nets for Mark Detection in Space Docking Control Images