Artificial Intelligence

Convolutional neural networks

Computer vision

Image classification

Cat? (0/1)

Object detection

https://commons.wikimedia.org/wiki/File:Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg

Style transfer

How to handle larger images?

  • $64 \times 64 \times 3$ means 12,288 inputs
  • $1000 \times 1000 \times 3$ means 3,000,000 inputs
    • With 1000 neurons in the first layer, $W^{[1]}$ has 3 billion entries

Edge detection

https://commons.wikimedia.org/wiki/File:%C3%84%C3%A4retuvastuse_n%C3%A4ide.png

Convolution operation

\[ \begin{pmatrix} 3 & 0 & 1 & 2 & 7 & 4 \\ 1 & 5 & 8 & 9 & 3 & 1 \\ 2 & 7 & 2 & 5 & 1 & 3 \\ 0 & 1 & 3 & 1 & 7 & 8 \\ 4 & 2 & 1 & 6 & 2 & 8 \\ 2 & 4 & 5 & 2 & 3 & 9 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} -5 & -4 & 0 & 8 \\ -10 & -2 & 2 & 3 \\ 0 & -2 & -4 & -7 \\ -3 & -2 & -3 & -16 \end{pmatrix} \]

$6 \times 6$ input image, $3 \times 3$ filter (kernel), $4 \times 4$ output

In TensorFlow: tf.nn.conv2d and tf.keras.layers.Conv2D
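As a minimal sketch, the example above can be reproduced with tf.nn.conv2d; the input and filter are exactly the matrices shown:

    import numpy as np
    import tensorflow as tf

    # 6x6 input image from the example above; conv2d expects (batch, height, width, channels)
    image = np.array([[3, 0, 1, 2, 7, 4],
                      [1, 5, 8, 9, 3, 1],
                      [2, 7, 2, 5, 1, 3],
                      [0, 1, 3, 1, 7, 8],
                      [4, 2, 1, 6, 2, 8],
                      [2, 4, 5, 2, 3, 9]], dtype=np.float32).reshape(1, 6, 6, 1)

    # 3x3 vertical edge filter; filters are shaped (height, width, in_channels, out_channels)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=np.float32).reshape(3, 3, 1, 1)

    # "VALID" means no padding, so the output is 4x4
    output = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")
    print(output[0, :, :, 0])  # first row: [-5, -4, 0, 8]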

Vertical edge detection?

\[ \begin{pmatrix} 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \end{pmatrix} \]

Flipped image

\[ \begin{pmatrix} 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \end{pmatrix} \]

Horizontal edge detection

\[ \begin{pmatrix} 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \end{pmatrix} * \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 30 & 10 & -10 & -30 \\ 30 & 10 & -10 & -30 \\ 0 & 0 & 0 & 0 \end{pmatrix} \]

Alternative edge detection filters

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \qquad \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} \qquad \begin{pmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{pmatrix} \]

Basic vertical filter, Sobel filter, Scharr filter

Idea: Instead of handpicking the numbers in the filter, we learn them

\[ \begin{pmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{pmatrix} \]

The convolution operation $X * W$ can be applied to inputs of any size while keeping a fixed number of parameters in $W$.

Our output is smaller than the input

\[ \underbrace{X}_{\in\, \mathbb{R}^{n \times n}} \; * \; \underbrace{W}_{\in\, \mathbb{R}^{f \times f}} \; = \; \underbrace{Z}_{\in\, \mathbb{R}^{(n-f+1) \times (n-f+1)}} \]

Padding the input

\[ \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 1 & 2 & 7 & 4 & 0 \\ 0 & 1 & 5 & 8 & 9 & 3 & 1 & 0 \\ 0 & 2 & 7 & 2 & 5 & 1 & 3 & 0 \\ 0 & 0 & 1 & 3 & 1 & 7 & 8 & 0 \\ 0 & 4 & 2 & 1 & 6 & 2 & 8 & 0 \\ 0 & 2 & 4 & 5 & 2 & 3 & 9 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} -5 & -5 & -6 & -1 & 6 & 10 \\ -12 & -5 & -4 & 0 & 8 & 11 \\ -13 & -10 & -2 & 2 & 3 & 11 \\ -10 & 0 & -2 & -4 & -7 & 10 \\ -7 & -3 & -2 & -3 & -16 & 12 \\ -6 & 0 & -2 & 1 & -9 & 5 \end{pmatrix} \]

$6 \times 6 \to 8 \times 8$ input, $3 \times 3$ filter, $4 \times 4 \to 6 \times 6$ output

Padding $p=1$

Other padding strategy: replication

\[ \begin{pmatrix} 3 & 3 & 0 & 1 & 2 & 7 & 4 & 4 \\ 3 & 3 & 0 & 1 & 2 & 7 & 4 & 4 \\ 1 & 1 & 5 & 8 & 9 & 3 & 1 & 1 \\ 2 & 2 & 7 & 2 & 5 & 1 & 3 & 3 \\ 0 & 0 & 1 & 3 & 1 & 7 & 8 & 8 \\ 4 & 4 & 2 & 1 & 6 & 2 & 8 & 8 \\ 2 & 2 & 4 & 5 & 2 & 3 & 9 & 9 \\ 2 & 2 & 4 & 5 & 2 & 3 & 9 & 9 \end{pmatrix} \]

How much to pad

  • "valid": no padding
    $n \times n \quad * \quad f \times f \quad \to \quad (n-f+1) \times (n-f+1)$
  • "same": pad so that output has the same size as the input
    $p = \frac{f-1}{2}$
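Both strategies map directly to the padding argument in Keras; a quick shape check as a minimal sketch:

    import tensorflow as tf

    x = tf.random.normal((1, 6, 6, 1))
    valid = tf.keras.layers.Conv2D(1, 3, padding="valid")(x)  # no padding: 6 - 3 + 1 = 4
    same = tf.keras.layers.Conv2D(1, 3, padding="same")(x)    # padded so output stays 6x6
    print(valid.shape, same.shape)  # (1, 4, 4, 1) (1, 6, 6, 1)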

https://commons.wikimedia.org/wiki/File:2D_Convolution_Animation.gif

Strided convolutions

\[ 7 \times 7 \quad * \quad 3 \times 3 \quad = \quad 3 \times 3 \]

stride $s = 2$

  • Input: $n_H \times n_W$
  • Filter: $f \times f$
  • Padding: $p$
  • Stride: $s$

Output: $(\lfloor \frac{n_H + 2p - f}{s} \rfloor + 1) \times (\lfloor \frac{n_W + 2p - f}{s} \rfloor + 1)$
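The formula is easy to check in code; a small helper (hypothetical, for illustration):

    def conv_output_size(n, f, p=0, s=1):
        """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
        return (n + 2 * p - f) // s + 1

    print(conv_output_size(7, 3, p=0, s=2))  # 3, matching the 7x7 strided example above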

Now we can use convolutions on grayscale images.

How about RGB images?

E.g. go from $1024 \times 1024$ to $1024 \times 1024 \times 3$.

\[ 6 \times 6 \times 3 \quad * \quad 3 \times 3 \times 3 \quad = \quad 4 \times 4 \, (\times 1) \]

Find vertical edges in red channel

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \quad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \]

(red, green, and blue channel of the filter)

Find vertical edges in all channels

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \]

(the same filter in each of the three channels)

Applying multiple filters

Dimensions:

  • Input: $n_H \times n_W \times n_c$
  • Filters: $f \times f \times n_c$
  • Number of filters: $n_c'$

Output: $(n_H - f + 1) \times (n_W -f + 1) \times n_c'$
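A shape check with Keras (a minimal sketch; the filter count of 2 is an arbitrary choice):

    import tensorflow as tf

    x = tf.random.normal((1, 6, 6, 3))                       # n_H x n_W x n_c = 6 x 6 x 3
    conv = tf.keras.layers.Conv2D(filters=2, kernel_size=3)  # n_c' = 2 filters of size 3x3x3
    print(conv(x).shape)  # (1, 4, 4, 2) = (n_H - f + 1) x (n_W - f + 1) x n_c'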

We now almost have one convolutional neural network layer.

    What else do we need?

  • Add bias term $b$
  • Add non-linearity

If one layer has $10$ filters of size $3 \times 3 \times 3$, how many parameters does the layer have? (Each filter has $3 \cdot 3 \cdot 3 = 27$ weights plus one bias, so the layer has $10 \cdot 28 = 280$ parameters.)
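The count can be verified directly in Keras:

    import tensorflow as tf

    layer = tf.keras.layers.Conv2D(filters=10, kernel_size=3)
    layer.build((None, 32, 32, 3))  # 3 input channels -> filters of size 3x3x3
    print(layer.count_params())     # 280 = 10 * (3*3*3 weights + 1 bias)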

    Notation for layer $l$:

  • $f^{[l]}$: filter size
  • $p^{[l]}$: padding
  • $s^{[l]}$: stride
  • $n_c^{[l]}$: number of filters

    Dimensions for layer $l$:

  • Input size: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
  • Output size: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
    • $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
    • $n_W^{[l]} = \lfloor \frac{n_W^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
  • Each filter: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
  • $n_c^{[l]}$ filters
  • All weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$
  • Bias: $n_c^{[l]}$ (or $1 \times 1 \times 1 \times n_c^{[l]})$

    Dimensions for layer $l$:

  • Activations $a^{[l]}$: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
  • Batch of activations $A^{[l]}$: $m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$

Let's build a neural network

Input $\Rightarrow$ Layer 1 ($a^{[1]}$) $\Rightarrow$ Layer 2 ($a^{[2]}$) $\Rightarrow$ Layer 3 ($a^{[3]}$):

  • Input: $39 \times 39 \times 3$ ($n_H^{[0]} = 39$, $n_W^{[0]} = 39$, $n_c^{[0]} = 3$)
  • Layer 1: $f^{[1]} = 3$, $s^{[1]} = 1$, $p^{[1]} = 0$, 10 filters $\Rightarrow$ $a^{[1]}$: $37 \times 37 \times 10$
  • Layer 2: $f^{[2]} = 5$, $s^{[2]} = 2$, $p^{[2]} = 0$, 20 filters $\Rightarrow$ $a^{[2]}$: $17 \times 17 \times 20$
  • Layer 3: $f^{[3]} = 5$, $s^{[3]} = 2$, $p^{[3]} = 0$, 40 filters $\Rightarrow$ $a^{[3]}$: $7 \times 7 \times 40$

    Finally (Layer 4):

  • Reshape $a^{[3]}$ to a 1960-element vector ($7 \cdot 7 \cdot 40 = 1960$)
  • Fully connected layer with logistic or softmax output (sketched in code below)
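A Keras sketch of this network; the ReLU activations and the 10-way softmax head are assumptions, since the slide leaves the activation and output size open:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(39, 39, 3)),
        tf.keras.layers.Conv2D(10, 3, strides=1, activation="relu"),  # -> 37x37x10
        tf.keras.layers.Conv2D(20, 5, strides=2, activation="relu"),  # -> 17x17x20
        tf.keras.layers.Conv2D(40, 5, strides=2, activation="relu"),  # -> 7x7x40
        tf.keras.layers.Flatten(),                                    # -> 1960
        tf.keras.layers.Dense(10, activation="softmax"),              # layer 4
    ])
    model.summary()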

Max pooling

\[ \begin{pmatrix} 1 & 3 & 2 & 1 \\ 2 & 9 & 1 & 1 \\ 1 & 3 & 2 & 3 \\ 5 & 6 & 1 & 2 \end{pmatrix} \Rightarrow \begin{pmatrix} 9 & 2 \\ 6 & 3 \end{pmatrix} \]

$4 \times 4$ input, $2 \times 2$ output

  • $f = 2$
  • $s = 2$

If we have multiple channels, pooling is applied to each channel independently.

Pooling can reduce dimensions significantly

Average pooling

\[ \begin{pmatrix} 1 & 3 & 2 & 1 \\ 2 & 9 & 1 & 1 \\ 1 & 3 & 2 & 3 \\ 5 & 6 & 1 & 2 \end{pmatrix} \Rightarrow \begin{pmatrix} 3.75 & 1.25 \\ 3.75 & 2 \end{pmatrix} \]

$4 \times 4$ input, $2 \times 2$ output
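Both pooling variants on the $4 \times 4$ example above, as a minimal sketch:

    import tensorflow as tf

    x = tf.reshape(tf.constant([[1., 3., 2., 1.],
                                [2., 9., 1., 1.],
                                [1., 3., 2., 3.],
                                [5., 6., 1., 2.]]), (1, 4, 4, 1))
    print(tf.nn.max_pool2d(x, ksize=2, strides=2, padding="VALID")[0, :, :, 0])
    # [[9. 2.] [6. 3.]]
    print(tf.nn.avg_pool2d(x, ksize=2, strides=2, padding="VALID")[0, :, :, 0])
    # [[3.75 1.25] [3.75 2.]]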

Adaptive pooling

Idea: Fix target dimensions and set stride and filter size accordingly

\[ s^{[l]} = \lfloor \frac{n^{[l-1]}}{n^{[l]}} \rfloor \\[3mm] f^{[l]} = n^{[l-1]} - (n^{[l]} - 1) s^{[l]} \]

With one adaptive pooling layer we ensure that images of arbitrary size can be used.
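The two formulas in code (a hypothetical helper for illustration):

    def adaptive_pool_params(n_in, n_out):
        """Stride and filter size so that pooling maps n_in pixels to exactly n_out."""
        s = n_in // n_out
        f = n_in - (n_out - 1) * s
        return f, s

    print(adaptive_pool_params(17, 4))  # (5, 4): floor((17 - 5) / 4) + 1 = 4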

$1 \times 1$ filters

A $1 \times 1$ filter can be used to operate only across the channels.

    Layer types for convolutional networks

  • Convolution
  • Pooling
  • Fully connected

Example (close to LeNet-5)

                        Activation shape   Activation size   # parameters
    Input               (32,32,3)          3072              0
    CONV1 (f=5, s=1)    (28,28,6)          4704              456
    POOL1 (f=2, s=2)    (14,14,6)          1176              0
    CONV2 (f=5, s=1)    (10,10,16)         1600              2416
    POOL2 (f=2, s=2)    (5,5,16)           400               0
    FC3                 (120,1)            120               48120
    FC4                 (84,1)             84                10164
    Softmax             (10,1)             10                850

    Common patterns

  • Decreasing height and width with depth
  • Increasing channels with depth
  • Number of activations decreasing with depth
  • Convolutions followed by pooling layers
  • Fully connected layers at the end

Parameter sharing

A feature detector that is useful in one part of the image should also be useful in another part.

Translation invariance

Convolutions produce similar output when the image is shifted.

Sparsity / Locality

Each output is affected by only a few inputs

How to decide on neural network architecture?

Check other models for similar tasks

LeNet-5 (1998)

Classify single digit images

  • Input: $32 \times 32 \times 1$
  • 6 Convolutions ($f=5, s=1$, tanh) : $28 \times 28 \times 6$
  • Avg. pooling ($f=2, s=2$) : $14 \times 14 \times 6$
  • 16 Convolutions ($f=5, s=1$, tanh) : $10 \times 10 \times 16$
  • Avg. pooling ($f=2, s=2$) : $5 \times 5 \times 16$
  • Flatten : $400$
  • 120 Fully conn. (tanh): $120$
  • 84 Fully conn. (tanh): $84$
  • 10 Fully conn. (rbf): $10$

About 60k parameters

AlexNet (2012)

Image recognition

  • Input: $227 \times 227 \times 3$
  • 96 Convolutions ($f=11, s=4$, ReLU) : $55 \times 55 \times 96$
  • Max Pooling ($f=3, s=2$) : $27 \times 27 \times 96$
  • 256 Convolutions ($f=5, p=2$, ReLU) : $27 \times 27 \times 256$
  • Max Pooling ($f=3, s=2$) : $13 \times 13 \times 256$
  • 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
  • 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
  • 256 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 256$
  • Max Pooling ($f=3, s=2$) : $6 \times 6 \times 256$
  • Flatten : $9216$
  • 4096 Fully conn. (ReLU) : $4096$
  • 4096 Fully conn. (ReLU) : $4096$
  • 1000 Fully conn. (SoftMax) : $1000$

About 60 million parameters

VGG-16 (2015)

Image recognition

CONV: $f=3, p=1$; POOL: $f=2, s=2$

  • Input: $224 \times 224 \times 3$
  • 2x CONV 64 : $224 \times 224 \times 64$
  • POOL : $112 \times 112 \times 64$
  • 2x CONV 128 : $112 \times 112 \times 128$
  • POOL : $56 \times 56 \times 128$
  • 3x CONV 256 : $56 \times 56 \times 256$
  • POOL : $28 \times 28 \times 256$
  • 3x CONV 512 : $28 \times 28 \times 512$
  • POOL : $14 \times 14 \times 512$
  • 3x CONV 512 : $14 \times 14 \times 512$
  • POOL : $7 \times 7 \times 512$
  • 2 x FC $4096$
  • Softmax $1000$

About 138 million parameters

With very deep architectures, vanishing / exploding gradients become an issue

Idea: Add shortcuts to network (residual networks)

Residual block: \[ a^{[l+2]} = g(z^{[l+2]} {\color{red} + a^{[l]}}) \]


With ResNets we can keep increasing depth more than with "plain" networks

Intuition why this works:

\[ a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) \\[3mm] = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \]

Setting $W^{[l+2]} = 0, b^{[l+2]} = 0$: \[ a^{[l+2]} = g(a^{[l]}) \] which for ReLU implies $a^{[l+2]} = a^{[l]}$

For any residual block it is easy to not make things worse.

When using skip connections the dimensions of $a^{[l]}$ and $a^{[l+2]}$ must match.

Need to use convolutions with "same" padding strategy to keep dimensions.
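A minimal Keras sketch of a residual block under these constraints (the filter count and kernel size are arbitrary choices; the input must already have `filters` channels so that $a^{[l]}$ and $z^{[l+2]}$ match):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(a, filters, kernel_size=3):
        """a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) with two same-padded convolutions."""
        shortcut = a                                                            # a^{[l]}
        z = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(a)
        z = layers.Conv2D(filters, kernel_size, padding="same")(z)              # z^{[l+2]}
        return layers.Activation("relu")(layers.Add()([z, shortcut]))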

What is a good filter size?

Idea: Use multiple sizes in same layer.

Inception network

All units have "same" padding strategy and stride 1, so that output dimensions match

Number of multiplications for applying $32$ filters of size $5 \times 5$ with "same" padding to an input with dimensions $28 \times 28 \times 192$:

  • A single output needs $5 \cdot 5 \cdot 192 = 4800$ multiplications
  • We have $28 \cdot 28 \cdot 32 = 25088$ outputs
  • That makes $4800 \cdot 25088 = 120422400$ multiplications.

Bottleneck layer

    Multiplications:

  • $1 \cdot 1 \cdot 192 \cdot 28 \cdot 28 \cdot 16 = 2408448$
  • $5 \cdot 5 \cdot 16 \cdot 28 \cdot 28 \cdot 32 = 10035200$
  • Total: $12443648$
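The same computation as a Keras sketch, with shapes taken from the example above:

    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.keras.Input(shape=(28, 28, 192))
    # 1x1 bottleneck: shrink 192 channels to 16 before the expensive 5x5 convolution
    b = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(32, 5, padding="same", activation="relu")(b)  # 28x28x32 output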

Inception module

Inception network

https://knowyourmeme.com/photos/531557-we-need-to-go-deeper

MobileNets (2017)

Goal: Low computation cost for inference

Depthwise Separable Convolution

Idea: Apply a filter to each channel independently and use $1 \times 1$ filters afterwards.

Normal convolution: each $f \times f \times n_c$ filter spans all input channels. Depthwise convolution: one $f \times f$ filter per input channel, applied independently.

\[ n_W \times n_H \times n_c \\[3mm] \xrightarrow{\text{depthw. conv.}} n_W' \times n_H' \times n_c \\[3mm] \xrightarrow{n_c' \text{ conv. } 1\times 1} n_W' \times n_H' \times n_c' \]

Computation cost

Reduced by factor \[ \frac{1}{n_c'} + \frac{1}{f^2} \]
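In Keras, the two steps can be written explicitly (a minimal sketch; tf.keras.layers.SeparableConv2D fuses them into one layer):

    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.keras.Input(shape=(112, 112, 32))
    # Step 1 (depthwise): one 3x3 filter per input channel, channels stay independent
    d = layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)
    # Step 2 (pointwise): 1x1 convolution mixes the channels and sets the output depth n_c'
    y = layers.Conv2D(64, 1, activation="relu")(d)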

Original MobileNet v1: 13 depthwise separable convolution layers + POOL + FC + SoftMax

MobileNet v2 (2019)

"Bottleneck Block"

EfficientNet (2020)

How to select model for a given computation budget?

    Model            Accuracy   # Params   # FLOPs
    EfficientNet-B0  77.1%      5.3M       0.39B
    EfficientNet-B1  79.1%      7.8M       0.70B
    EfficientNet-B2  80.1%      9.2M       1.0B
    EfficientNet-B3  81.6%      12M        1.8B
    EfficientNet-B4  82.9%      19M        4.2B
    EfficientNet-B5  83.6%      30M        9.9B
    EfficientNet-B6  84.0%      43M        19B
    EfficientNet-B7  84.3%      66M        37B

Do not reinvent the wheel

Use open source implementations of public models

Many computer vision models need huge amounts of data and are expensive to train

Idea: Reuse existing model and weights for new task

Transfer Learning

Adapt model for own final classes

Use new softmax layer, keep rest constant

With transfer learning we turn the first $L-1$ layers into a function $f(x) = a^{[L-1]}$.

Then we train a linear classifier on $a^{[L-1]}$: \[ \hat y = g(W^{[L]} a^{[L-1]} + b^{[L]}) \]

If we have more data, we can retrain more layers

    Mixed strategy:

  • Replace the softmax layer, freeze the rest, and train
  • Unfreeze some layers from the back and train with a smaller learning rate (sketched below)
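A sketch of the mixed strategy in Keras; the MobileNetV2 backbone, the 5 target classes, the learning rates, and the number of unfrozen layers are all arbitrary choices for illustration:

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                             input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # freeze: use the first L-1 layers as a fixed feature extractor

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(5, activation="softmax"),  # new softmax layer for our classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(...)  # step 1: train only the new head

    # Step 2: unfreeze some layers from the back and train with a smaller learning rate
    base.trainable = True
    for layer in base.layers[:-20]:  # keep earlier layers frozen (20 is an arbitrary choice)
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])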

Using a pre-trained model as a start almost always beats training a new model from scratch.