Artificial Intelligence

Convolutional neural networks

Computer vision

Image classification

Cat? (0/1)

Object detection

https://commons.wikimedia.org/wiki/File:Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg

Style transfer

How to handle larger images?

  • $64 \times 64 \times 3$ means 12,288 inputs
  • $1000 \times 1000 \times 3$ means 3,000,000 inputs
    • With 1000 neurons in the first layer, $W^{[1]}$ has 3 billion entries

Edge detection

https://commons.wikimedia.org/wiki/File:%C3%84%C3%A4retuvastuse_n%C3%A4ide.png

Convolution operation

\[ \begin{pmatrix} 3 & 0 & 1 & 2 & 7 & 4 \\ 1 & 5 & 8 & 9 & 3 & 1 \\ 2 & 7 & 2 & 5 & 1 & 3 \\ 0 & 1 & 3 & 1 & 7 & 8 \\ 4 & 2 & 1 & 6 & 2 & 8 \\ 2 & 4 & 5 & 2 & 3 & 9 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} -5 & -4 & 0 & 8 \\ -10 & -2 & 2 & 3 \\ 0 & -2 & -4 & -7 \\ -3 & -2 & -3 & -16 \end{pmatrix} \]

$6 \times 6$ input image, $3 \times 3$ filter (kernel), $4 \times 4$ output

In TensorFlow: tf.nn.conv2d and tf.keras.layers.Conv2D
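As a minimal sketch, the example above can be reproduced with tf.nn.conv2d; the input and filter are exactly the matrices shown:

    import numpy as np
    import tensorflow as tf

    # 6x6 input image from the example above; conv2d expects (batch, height, width, channels)
    image = np.array([[3, 0, 1, 2, 7, 4],
                      [1, 5, 8, 9, 3, 1],
                      [2, 7, 2, 5, 1, 3],
                      [0, 1, 3, 1, 7, 8],
                      [4, 2, 1, 6, 2, 8],
                      [2, 4, 5, 2, 3, 9]], dtype=np.float32).reshape(1, 6, 6, 1)

    # 3x3 vertical edge filter; filters are shaped (height, width, in_channels, out_channels)
    kernel = np.array([[1, 0, -1],
                       [1, 0, -1],
                       [1, 0, -1]], dtype=np.float32).reshape(3, 3, 1, 1)

    # "VALID" means no padding, so the output is 4x4
    output = tf.nn.conv2d(image, kernel, strides=1, padding="VALID")
    print(output[0, :, :, 0])  # first row: [-5, -4, 0, 8]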

Vertical edge detection?

\[ \begin{pmatrix} 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \\ 0 & 30 & 30 & 0 \end{pmatrix} \]

Flipped image

\[ \begin{pmatrix} 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \\ 0 & -30 & -30 & 0 \end{pmatrix} \]

Horizontal edge detection

\[ \begin{pmatrix} 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 10 & 10 & 10 & 0 & 0 & 0 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \\ 0 & 0 & 0 & 10 & 10 & 10 \end{pmatrix} * \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 0 \\ -1 & -1 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 30 & 10 & -10 & -30 \\ 30 & 10 & -10 & -30 \\ 0 & 0 & 0 & 0 \end{pmatrix} \]

Alternative edge detection filters

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \qquad \begin{pmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{pmatrix} \qquad \begin{pmatrix} 3 & 0 & -3 \\ 10 & 0 & -10 \\ 3 & 0 & -3 \end{pmatrix} \]

Basic vertical filter, Sobel filter, Scharr filter

Idea: Instead of handpicking the numbers in the filter, we learn them

\[ \begin{pmatrix} w_1 & w_2 & w_3 \\ w_4 & w_5 & w_6 \\ w_7 & w_8 & w_9 \end{pmatrix} \]

The convolution operation $X * W$ can be applied to inputs of any size while keeping a fixed number of parameters in $W$.

Our output is smaller than the input

\[ \underbrace{X}_{\in\, \mathbb{R}^{n \times n}} \; * \; \underbrace{W}_{\in\, \mathbb{R}^{f \times f}} \; = \; \underbrace{Z}_{\in\, \mathbb{R}^{(n-f+1) \times (n-f+1)}} \]

Padding the input

\[ \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 1 & 2 & 7 & 4 & 0 \\ 0 & 1 & 5 & 8 & 9 & 3 & 1 & 0 \\ 0 & 2 & 7 & 2 & 5 & 1 & 3 & 0 \\ 0 & 0 & 1 & 3 & 1 & 7 & 8 & 0 \\ 0 & 4 & 2 & 1 & 6 & 2 & 8 & 0 \\ 0 & 2 & 4 & 5 & 2 & 3 & 9 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} * \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} = \begin{pmatrix} -5 & -5 & -6 & -1 & 6 & 10 \\ -12 & -5 & -4 & 0 & 8 & 11 \\ -13 & -10 & -2 & 2 & 3 & 11 \\ -10 & 0 & -2 & -4 & -7 & 10 \\ -7 & -3 & -2 & -3 & -16 & 12 \\ -6 & 0 & -2 & 1 & -9 & 5 \end{pmatrix} \]

$6 \times 6 \to 8 \times 8$ input, $3 \times 3$ filter, $4 \times 4 \to 6 \times 6$ output

Padding $p=1$

Other padding strategy: replication

\[ \begin{pmatrix} 3 & 3 & 0 & 1 & 2 & 7 & 4 & 4 \\ 3 & 3 & 0 & 1 & 2 & 7 & 4 & 4 \\ 1 & 1 & 5 & 8 & 9 & 3 & 1 & 1 \\ 2 & 2 & 7 & 2 & 5 & 1 & 3 & 3 \\ 0 & 0 & 1 & 3 & 1 & 7 & 8 & 8 \\ 4 & 4 & 2 & 1 & 6 & 2 & 8 & 8 \\ 2 & 2 & 4 & 5 & 2 & 3 & 9 & 9 \\ 2 & 2 & 4 & 5 & 2 & 3 & 9 & 9 \end{pmatrix} \]

How much to pad

  • "valid": no padding
    $n \times n \quad * \quad f \times f \quad \to \quad (n-f+1) \times (n-f+1)$
  • "same": pad so that output has the same size as the input
    $p = \frac{f-1}{2}$
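Both strategies map directly to the padding argument in Keras; a quick shape check as a minimal sketch:

    import tensorflow as tf

    x = tf.random.normal((1, 6, 6, 1))
    valid = tf.keras.layers.Conv2D(1, 3, padding="valid")(x)  # no padding: 6 - 3 + 1 = 4
    same = tf.keras.layers.Conv2D(1, 3, padding="same")(x)    # padded so output stays 6x6
    print(valid.shape, same.shape)  # (1, 4, 4, 1) (1, 6, 6, 1)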

https://commons.wikimedia.org/wiki/File:2D_Convolution_Animation.gif

Strided convolutions

\[ 7 \times 7 \quad * \quad 3 \times 3 \quad = \quad 3 \times 3 \]

stride $s = 2$

  • Input: $n_H \times n_W$
  • Filter: $f \times f$
  • Padding: $p$
  • Stride: $s$

Output: $(\lfloor \frac{n_H + 2p - f}{s} \rfloor + 1) \times (\lfloor \frac{n_W + 2p - f}{s} \rfloor + 1)$
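The formula is easy to check in code; a small helper (hypothetical, for illustration):

    def conv_output_size(n, f, p=0, s=1):
        """Output size along one dimension: floor((n + 2p - f) / s) + 1."""
        return (n + 2 * p - f) // s + 1

    print(conv_output_size(7, 3, p=0, s=2))  # 3, matching the 7x7 strided example above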

Now we can use convolutions on grayscale images.

How about RGB images?

E.g. go from $1024 \times 1024$ to $1024 \times 1024 \times 3$.

\[ 6 \times 6 \times 3 \quad * \quad 3 \times 3 \times 3 \quad = \quad 4 \times 4 \, (\times 1) \]

Find vertical edges in red channel

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \quad \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \]

(red, green, and blue channel of the filter)

Find vertical edges in all channels

\[ \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \quad \begin{pmatrix} 1 & 0 & -1 \\ 1 & 0 & -1 \\ 1 & 0 & -1 \end{pmatrix} \]

(the same filter in each of the three channels)

Applying multiple filters

Dimensions:

  • Input: $n_H \times n_W \times n_c$
  • Filters: $f \times f \times n_c$
  • Number of filters: $n_c'$

Output: $(n_H - f + 1) \times (n_W -f + 1) \times n_c'$
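A shape check with Keras (a minimal sketch; the filter count of 2 is an arbitrary choice):

    import tensorflow as tf

    x = tf.random.normal((1, 6, 6, 3))                       # n_H x n_W x n_c = 6 x 6 x 3
    conv = tf.keras.layers.Conv2D(filters=2, kernel_size=3)  # n_c' = 2 filters of size 3x3x3
    print(conv(x).shape)  # (1, 4, 4, 2) = (n_H - f + 1) x (n_W - f + 1) x n_c'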

We now almost have one convolutional neural network layer.

    What else do we need?

  • Add bias term $b$
  • Add non-linearity

If one layer has $10$ filters of size $3 \times 3 \times 3$, how many parameters does the layer have? (Each filter has $3 \cdot 3 \cdot 3 = 27$ weights plus one bias, so the layer has $10 \cdot 28 = 280$ parameters.)
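The count can be verified directly in Keras:

    import tensorflow as tf

    layer = tf.keras.layers.Conv2D(filters=10, kernel_size=3)
    layer.build((None, 32, 32, 3))  # 3 input channels -> filters of size 3x3x3
    print(layer.count_params())     # 280 = 10 * (3*3*3 weights + 1 bias)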

    Notation for layer $l$:

  • $f^{[l]}$: filter size
  • $p^{[l]}$: padding
  • $s^{[l]}$: stride
  • $n_c^{[l]}$: number of filters

    Dimensions for layer $l$:

  • Input size: $n_H^{[l-1]} \times n_W^{[l-1]} \times n_c^{[l-1]}$
  • Output size: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
    • $n_H^{[l]} = \lfloor \frac{n_H^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
    • $n_W^{[l]} = \lfloor \frac{n_W^{[l-1]} + 2 p^{[l]} - f^{[l]}}{s^{[l]}}\rfloor + 1$
  • Each filter: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]}$
  • $n_c^{[l]}$ filters
  • All weights: $f^{[l]} \times f^{[l]} \times n_c^{[l-1]} \times n_c^{[l]}$
  • Bias: $n_c^{[l]}$ (or $1 \times 1 \times 1 \times n_c^{[l]})$

    Dimensions for layer $l$:

  • Activations $a^{[l]}$: $n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$
  • Batch of activations $A^{[l]}$: $m \times n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]}$

Let's build a neural network

Input $\Rightarrow$ Layer 1 ($a^{[1]}$) $\Rightarrow$ Layer 2 ($a^{[2]}$) $\Rightarrow$ Layer 3 ($a^{[3]}$):

  • Input: $39 \times 39 \times 3$ ($n_H^{[0]} = 39$, $n_W^{[0]} = 39$, $n_c^{[0]} = 3$)
  • Layer 1: $f^{[1]} = 3$, $s^{[1]} = 1$, $p^{[1]} = 0$, 10 filters $\Rightarrow$ $a^{[1]}$: $37 \times 37 \times 10$
  • Layer 2: $f^{[2]} = 5$, $s^{[2]} = 2$, $p^{[2]} = 0$, 20 filters $\Rightarrow$ $a^{[2]}$: $17 \times 17 \times 20$
  • Layer 3: $f^{[3]} = 5$, $s^{[3]} = 2$, $p^{[3]} = 0$, 40 filters $\Rightarrow$ $a^{[3]}$: $7 \times 7 \times 40$

    Finally (Layer 4):

  • Reshape $a^{[3]}$ to a 1960-element vector ($7 \cdot 7 \cdot 40 = 1960$)
  • Fully connected layer with logistic or softmax output (sketched in code below)
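A Keras sketch of this network; the ReLU activations and the 10-way softmax head are assumptions, since the slide leaves the activation and output size open:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(39, 39, 3)),
        tf.keras.layers.Conv2D(10, 3, strides=1, activation="relu"),  # -> 37x37x10
        tf.keras.layers.Conv2D(20, 5, strides=2, activation="relu"),  # -> 17x17x20
        tf.keras.layers.Conv2D(40, 5, strides=2, activation="relu"),  # -> 7x7x40
        tf.keras.layers.Flatten(),                                    # -> 1960
        tf.keras.layers.Dense(10, activation="softmax"),              # layer 4
    ])
    model.summary()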

Max pooling

\[ \begin{pmatrix} 1 & 3 & 2 & 1 \\ 2 & 9 & 1 & 1 \\ 1 & 3 & 2 & 3 \\ 5 & 6 & 1 & 2 \end{pmatrix} \Rightarrow \begin{pmatrix} 9 & 2 \\ 6 & 3 \end{pmatrix} \]

$4 \times 4$ input, $2 \times 2$ output

  • $f = 2$
  • $s = 2$

If we have multiple channels, pooling is applied to each channel independently.

Pooling can reduce dimensions significantly

Average pooling

\[ \begin{pmatrix} 1 & 3 & 2 & 1 \\ 2 & 9 & 1 & 1 \\ 1 & 3 & 2 & 3 \\ 5 & 6 & 1 & 2 \end{pmatrix} \Rightarrow \begin{pmatrix} 3.75 & 1.25 \\ 3.75 & 2 \end{pmatrix} \]

$4 \times 4$ input, $2 \times 2$ output
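Both pooling variants on the $4 \times 4$ example above, as a minimal sketch:

    import tensorflow as tf

    x = tf.reshape(tf.constant([[1., 3., 2., 1.],
                                [2., 9., 1., 1.],
                                [1., 3., 2., 3.],
                                [5., 6., 1., 2.]]), (1, 4, 4, 1))
    print(tf.nn.max_pool2d(x, ksize=2, strides=2, padding="VALID")[0, :, :, 0])
    # [[9. 2.] [6. 3.]]
    print(tf.nn.avg_pool2d(x, ksize=2, strides=2, padding="VALID")[0, :, :, 0])
    # [[3.75 1.25] [3.75 2.]]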

Adaptive pooling

Idea: Fix target dimensions and set stride and filter size accordingly

\[ s^{[l]} = \lfloor \frac{n^{[l-1]}}{n^{[l]}} \rfloor \\[3mm] f^{[l]} = n^{[l-1]} - (n^{[l]} - 1) s^{[l]} \]

With one adaptive pooling layer we ensure that images of arbitrary size can be used.
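The two formulas in code (a hypothetical helper for illustration):

    def adaptive_pool_params(n_in, n_out):
        """Stride and filter size so that pooling maps n_in pixels to exactly n_out."""
        s = n_in // n_out
        f = n_in - (n_out - 1) * s
        return f, s

    print(adaptive_pool_params(17, 4))  # (5, 4): floor((17 - 5) / 4) + 1 = 4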

$1 \times 1$ filters

A $1 \times 1$ filter can be used to operate only across the channels.

    Layer types for convolutional networks

  • Convolution
  • Pooling
  • Fully connected

Example (close to LeNet-5)

                        Activation shape   Activation size   # parameters
    Input               (32,32,3)          3072              0
    CONV1 (f=5, s=1)    (28,28,6)          4704              456
    POOL1 (f=2, s=2)    (14,14,6)          1176              0
    CONV2 (f=5, s=1)    (10,10,16)         1600              2416
    POOL2 (f=2, s=2)    (5,5,16)           400               0
    FC3                 (120,1)            120               48120
    FC4                 (84,1)             84                10164
    Softmax             (10,1)             10                850

    Common patterns

  • Decreasing height and width with depth
  • Increasing channels with depth
  • Number of activations decreasing with depth
  • Convolutions followed by pooling layers
  • Fully connected layers at the end

Parameter sharing

A feature detector that is useful in one part of the image should also be useful in another part.

Translation invariance

Convolutions produce similar output when the image is shifted.

Sparsity / Locality

Each output is affected by only a few inputs

How to decide on neural network architecture?

Check other models for similar tasks

LeNet-5 (1998)

Classify single digit images

  • Input: $32 \times 32 \times 1$
  • 6 Convolutions ($f=5, s=1$, tanh) : $28 \times 28 \times 6$
  • Avg. pooling ($f=2, s=2$) : $14 \times 14 \times 6$
  • 16 Convolutions ($f=5, s=1$, tanh) : $10 \times 10 \times 16$
  • Avg. pooling ($f=2, s=2$) : $5 \times 5 \times 16$
  • Flatten : $400$
  • 120 Fully conn. (tanh): $120$
  • 84 Fully conn. (tanh): $84$
  • 10 Fully conn. (rbf): $10$

About 60k parameters

AlexNet (2012)

Image recognition

  • Input: $227 \times 227 \times 3$
  • 96 Convolutions ($f=11, s=4$, ReLU) : $55 \times 55 \times 96$
  • Max Pooling ($f=3, s=2$) : $27 \times 27 \times 96$
  • 256 Convolutions ($f=5, p=2$, ReLU) : $27 \times 27 \times 256$
  • Max Pooling ($f=3, s=2$) : $13 \times 13 \times 256$
  • 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
  • 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
  • 256 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 256$
  • Max Pooling ($f=3, s=2$) : $6 \times 6 \times 256$
  • Flatten : $9216$
  • 4096 Fully conn. (ReLU) : $4096$
  • 4096 Fully conn. (ReLU) : $4096$
  • 1000 Fully conn. (SoftMax) : $1000$

About 60 million parameters

VGG-16 (2015)

Image recognition

CONV: $f=3, p=1$; POOL: $f=2, s=2$

  • Input: $224 \times 224 \times 3$
  • 2x CONV 64 : $224 \times 224 \times 64$
  • POOL : $112 \times 112 \times 64$
  • 2x CONV 128 : $112 \times 112 \times 128$
  • POOL : $56 \times 56 \times 128$
  • 3x CONV 256 : $56 \times 56 \times 256$
  • POOL : $28 \times 28 \times 256$
  • 3x CONV 512 : $28 \times 28 \times 512$
  • POOL : $14 \times 14 \times 512$
  • 3x CONV 512 : $14 \times 14 \times 512$
  • POOL : $7 \times 7 \times 512$
  • 2 x FC $4096$
  • Softmax $1000$

About 138 million parameters

With very deep architectures, vanishing / exploding gradients become an issue

Idea: Add shortcuts to network (residual networks)

Residual block: \[ a^{[l+2]} = g(z^{[l+2]} {\color{red} + a^{[l]}}) \]


With ResNets we can keep increasing depth more than with "plain" networks

Intuition why this works:

\[ a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) \\[3mm] = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \]

Setting $W^{[l+2]} = 0, b^{[l+2]} = 0$: \[ a^{[l+2]} = g(a^{[l]}) \] which for ReLU implies $a^{[l+2]} = a^{[l]}$

For any residual block it is easy to not make things worse.

When using skip connections the dimensions of $a^{[l]}$ and $a^{[l+2]}$ must match.

Need to use convolutions with "same" padding strategy to keep dimensions.
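A minimal Keras sketch of a residual block under these constraints (the filter count and kernel size are arbitrary choices; the input must already have `filters` channels so that $a^{[l]}$ and $z^{[l+2]}$ match):

    import tensorflow as tf
    from tensorflow.keras import layers

    def residual_block(a, filters, kernel_size=3):
        """a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) with two same-padded convolutions."""
        shortcut = a                                                            # a^{[l]}
        z = layers.Conv2D(filters, kernel_size, padding="same", activation="relu")(a)
        z = layers.Conv2D(filters, kernel_size, padding="same")(z)              # z^{[l+2]}
        return layers.Activation("relu")(layers.Add()([z, shortcut]))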

What is a good filter size?

Idea: Use multiple sizes in same layer.

Inception network

All units have "same" padding strategy and stride 1, so that output dimensions match

Number of multiplications for applying $32$ filters of size $5 \times 5$ with "same" padding to an input with dimensions $28 \times 28 \times 192$:

  • A single output needs $5 \cdot 5 \cdot 192 = 4800$ multiplications
  • We have $28 \cdot 28 \cdot 32 = 25088$ outputs
  • That makes $4800 \cdot 25088 = 120422400$ multiplications.

Bottleneck layer

    Multiplications:

  • $1 \cdot 1 \cdot 192 \cdot 28 \cdot 28 \cdot 16 = 2408448$
  • $5 \cdot 5 \cdot 16 \cdot 28 \cdot 28 \cdot 32 = 10035200$
  • Total: $12443648$
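The same computation as a Keras sketch, with shapes taken from the example above:

    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.keras.Input(shape=(28, 28, 192))
    # 1x1 bottleneck: shrink 192 channels to 16 before the expensive 5x5 convolution
    b = layers.Conv2D(16, 1, padding="same", activation="relu")(x)
    y = layers.Conv2D(32, 5, padding="same", activation="relu")(b)  # 28x28x32 output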

Inception module

Inception network

https://knowyourmeme.com/photos/531557-we-need-to-go-deeper

MobileNets (2017)

Goal: Low computation cost for inference

Depthwise Separable Convolution

Idea: Apply a filter to each channel independently and use $1 \times 1$ filters afterwards.

Normal convolution: each $f \times f \times n_c$ filter spans all input channels. Depthwise convolution: one $f \times f$ filter per input channel, applied independently.

\[ n_W \times n_H \times n_c \\[3mm] \xrightarrow{\text{depthw. conv.}} n_W' \times n_H' \times n_c \\[3mm] \xrightarrow{n_c' \text{ conv. } 1\times 1} n_W' \times n_H' \times n_c' \]

Computation cost

Reduced by factor \[ \frac{1}{n_c'} + \frac{1}{f^2} \]
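In Keras, the two steps can be written explicitly (a minimal sketch; tf.keras.layers.SeparableConv2D fuses them into one layer):

    import tensorflow as tf
    from tensorflow.keras import layers

    x = tf.keras.Input(shape=(112, 112, 32))
    # Step 1 (depthwise): one 3x3 filter per input channel, channels stay independent
    d = layers.DepthwiseConv2D(3, padding="same", activation="relu")(x)
    # Step 2 (pointwise): 1x1 convolution mixes the channels and sets the output depth n_c'
    y = layers.Conv2D(64, 1, activation="relu")(d)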

Original MobileNet v1: 13 depthwise separable convolution layers + POOL + FC + SoftMax

MobileNet v2 (2019)

"Bottleneck Block"

EfficientNet (2020)

How to select model for a given computation budget?

    Model            Accuracy   # Params   # FLOPs
    EfficientNet-B0  77.1%      5.3M       0.39B
    EfficientNet-B1  79.1%      7.8M       0.70B
    EfficientNet-B2  80.1%      9.2M       1.0B
    EfficientNet-B3  81.6%      12M        1.8B
    EfficientNet-B4  82.9%      19M        4.2B
    EfficientNet-B5  83.6%      30M        9.9B
    EfficientNet-B6  84.0%      43M        19B
    EfficientNet-B7  84.3%      66M        37B

Do not reinvent the wheel

Use open source implementations of public models

Many computer vision models need huge amounts of data and are expensive to train

Idea: Reuse existing model and weights for new task

Transfer Learning

Adapt model for own final classes

Use new softmax layer, keep rest constant

With transfer learning we turn the first $L-1$ layers into a function $f(x) = a^{[L-1]}$.

Then we train a linear classifier on $a^{[L-1]}$: \[ \hat y = g(W^{[L]} a^{[L-1]} + b^{[L]}) \]

If we have more data, we can retrain more layers

    Mixed strategy:

  • Replace the softmax layer, freeze the rest, and train
  • Unfreeze some layers from the back and train with a smaller learning rate (sketched below)
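A sketch of the mixed strategy in Keras; the MobileNetV2 backbone, the 5 target classes, the learning rates, and the number of unfrozen layers are all arbitrary choices for illustration:

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                             input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # freeze: use the first L-1 layers as a fixed feature extractor

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(5, activation="softmax"),  # new softmax layer for our classes
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(...)  # step 1: train only the new head

    # Step 2: unfreeze some layers from the back and train with a smaller learning rate
    base.trainable = True
    for layer in base.layers[:-20]:  # keep earlier layers frozen (20 is an arbitrary choice)
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])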

Using a pre-trained model as a start almost always beats training a new model from scratch.