Artificial Intelligence
Convolutional neural networks
Computer vision
Image classification
Cat? (0/1)
Object detection
Style transfer
How to handle larger images?
- 64x64x3 means 12,288 inputs
- 1000x1000x3 means 3,000,000 inputs
- With 1000 neurons in first layer, $W^{[1]}$ has 3 billion entries
Edge detection
Convolution operation
| 3 | 0 | 1 | 2 | 7 | 4 |
| 1 | 5 | 8 | 9 | 3 | 1 |
| 2 | 7 | 2 | 5 | 1 | 3 |
| 0 | 1 | 3 | 1 | 7 | 8 |
| 4 | 2 | 1 | 6 | 2 | 8 |
| 2 | 4 | 5 | 2 | 3 | 9 |
$*$
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
$=$
-5 | -4 | 0 | 8 |
-10 | -2 | 2 | 3 |
0 | -2 | -4 | -7 |
-3 | -2 | -3 | -16 |
$6 \times 6$ input image $\quad * \quad$ $3 \times 3$ filter (kernel) $\quad = \quad$ $4 \times 4$ output
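The example above can be reproduced in a few lines of NumPy. The sketch assumes the $3 \times 3$ filter is the vertical edge detector with columns $1, 0, -1$, which matches the $4 \times 4$ output shown:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Cross-correlation (the deep-learning 'convolution'), no padding, stride 1."""
    n, f = image.shape[0], kernel.shape[0]
    out = np.zeros((n - f + 1, n - f + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+f, j:j+f] * kernel)
    return out

image = np.array([[3, 0, 1, 2, 7, 4],
                  [1, 5, 8, 9, 3, 1],
                  [2, 7, 2, 5, 1, 3],
                  [0, 1, 3, 1, 7, 8],
                  [4, 2, 1, 6, 2, 8],
                  [2, 4, 5, 2, 3, 9]])
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])  # vertical edge detector

print(conv2d_valid(image, kernel))  # first row: [-5, -4, 0, 8]
```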
Vertical edge detection?
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
$*$
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
$=$
| 0 | 30 | 30 | 0 |
| 0 | 30 | 30 | 0 |
| 0 | 30 | 30 | 0 |
| 0 | 30 | 30 | 0 |
Flipped image
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
$*$
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
$=$
| 0 | -30 | -30 | 0 |
| 0 | -30 | -30 | 0 |
| 0 | -30 | -30 | 0 |
| 0 | -30 | -30 | 0 |
Horizontal edge detection
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 10 | 10 | 10 | 0 | 0 | 0 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
| 0 | 0 | 0 | 10 | 10 | 10 |
$*$
| 1 | 1 | 1 |
| 0 | 0 | 0 |
| -1 | -1 | -1 |
$=$
| 0 | 0 | 0 | 0 |
| 30 | 10 | -10 | -30 |
| 30 | 10 | -10 | -30 |
| 0 | 0 | 0 | 0 |
Alternative edge detection filters
Sobel filter:
| 1 | 0 | -1 |
| 2 | 0 | -2 |
| 1 | 0 | -1 |

Scharr filter:
| 3 | 0 | -3 |
| 10 | 0 | -10 |
| 3 | 0 | -3 |
Idea: instead of handpicking numbers for the filter, we learn them
| $w_1$ | $w_2$ | $w_3$ |
| $w_4$ | $w_5$ | $w_6$ |
| $w_7$ | $w_8$ | $w_9$ |
The convolution operation $X * W$ can be applied to inputs of any size while keeping a
fixed number of parameters in $W$.
The output is smaller than the input:
$X * W = Z$, where $X \in \mathbb{R}^{n \times n}$, $W \in \mathbb{R}^{f \times f}$, $Z \in \mathbb{R}^{(n-f+1) \times (n-f+1)}$
Padding the input
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 3 | 0 | 1 | 2 | 7 | 4 | 0 |
| 0 | 1 | 5 | 8 | 9 | 3 | 1 | 0 |
| 0 | 2 | 7 | 2 | 5 | 1 | 3 | 0 |
| 0 | 0 | 1 | 3 | 1 | 7 | 8 | 0 |
| 0 | 4 | 2 | 1 | 6 | 2 | 8 | 0 |
| 0 | 2 | 4 | 5 | 2 | 3 | 9 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
$*$
| 1 | 0 | -1 |
| 1 | 0 | -1 |
| 1 | 0 | -1 |
$=$
| -5 | -5 | -6 | -1 | 6 | 10 |
| -12 | -5 | -4 | 0 | 8 | 11 |
| -13 | -10 | -2 | 2 | 3 | 11 |
| -10 | 0 | -2 | -4 | -7 | 10 |
| -7 | -3 | -2 | -3 | -16 | 12 |
| -6 | 0 | -2 | 1 | -9 | 5 |
$6 \times 6 \to 8 \times 8$ input $\quad * \quad$ $3 \times 3$ filter $\quad = \quad$ $4 \times 4 \to 6 \times 6$ output
Padding $p=1$
Another padding strategy: replication
| 3 | 3 | 0 | 1 | 2 | 7 | 4 | 4 |
| 3 | 3 | 0 | 1 | 2 | 7 | 4 | 4 |
| 1 | 1 | 5 | 8 | 9 | 3 | 1 | 1 |
| 2 | 2 | 7 | 2 | 5 | 1 | 3 | 3 |
| 0 | 0 | 1 | 3 | 1 | 7 | 8 | 8 |
| 4 | 4 | 2 | 1 | 6 | 2 | 8 | 8 |
| 2 | 2 | 4 | 5 | 2 | 3 | 9 | 9 |
| 2 | 2 | 4 | 5 | 2 | 3 | 9 | 9 |
How much to pad
- "valid": no padding
$n \times n \quad * \quad f \times f \quad \to \quad (n-f+1) \times (n-f+1)$
- "same": pad so that output has the same size as the input
$p = \frac{f-1}{2}$
Strided convolutions
$7 \times 7 \quad * \quad 3 \times 3 \quad = \quad 3 \times 3$

stride $s = 2$
- Input: $n_H \times n_W$
- Filter: $f \times f$
- Padding: $p$
- Stride: $s$
Output:
$(\lfloor \frac{n_H + 2p - f}{s} \rfloor + 1) \times (\lfloor \frac{n_W + 2p - f}{s} \rfloor + 1)$
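The output-size formula can be checked with a small helper; the example values below are taken from the slides above:

```python
def conv_output_size(n, f, p=0, s=1):
    """Spatial output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))        # "valid": 6x6 * 3x3 -> 4
print(conv_output_size(6, 3, p=1))   # "same" with p=(f-1)/2=1 -> 6
print(conv_output_size(7, 3, s=2))   # strided: 7x7, s=2 -> 3
```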
Now we can use convolutions on grayscale images.
How about RGB images?
E.g. go from $1024 \times 1024$ to $1024 \times 1024 \times 3$.
$6 \times 6 \times 3 \quad * \quad 3 \times 3 \times 3 \quad = \quad 4 \times 4 \ (\times 1)$
Find vertical edges in the red channel
Find vertical edges in all channels
Applying multiple filters
Dimensions:
- Input: $n_H \times n_W \times n_c$
- Filters: $f \times f \times n_c$
- Number of filters: $n_c'$
Output: $(n_H - f + 1) \times (n_W -f + 1) \times n_c'$
We now almost have one convolutional neural network layer.
What else do we need?
- Add bias term $b$
- Add non-linearity
If one layer has $10$ filters of size $3 \times 3 \times 3$, how many parameters does the layer have? Each filter has $3 \cdot 3 \cdot 3 = 27$ weights plus one bias, so the layer has $10 \cdot 28 = 280$ parameters.
Notation for layer $l$:
- $f^{[l]}$: filter size
- $p^{[l]}$: padding
- $s^{[l]}$: stride
- $n_c^{[l]}$: number of filters
Let's build a neural network
- Input: $39 \times 39 \times 3$ ($n_H^{[0]} = n_W^{[0]} = 39$, $n_c^{[0]} = 3$)
- Layer 1: $f^{[1]} = 3$, $s^{[1]} = 1$, $p^{[1]} = 0$, $10$ filters $\Rightarrow a^{[1]}$: $37 \times 37 \times 10$
- Layer 2: $f^{[2]} = 5$, $s^{[2]} = 2$, $p^{[2]} = 0$, $20$ filters $\Rightarrow a^{[2]}$: $17 \times 17 \times 20$
- Layer 3: $f^{[3]} = 5$, $s^{[3]} = 2$, $p^{[3]} = 0$, $40$ filters $\Rightarrow a^{[3]}$: $7 \times 7 \times 40$
Finally (Layer 4):
- Reshape to 1960 element vector
- Fully connected layer with logistic or softmax output
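The shape sequence of this network follows mechanically from the output-size formula; a small sketch (helper name is illustrative):

```python
def conv_layer_shape(shape, f, s, p, n_filters):
    """(nH, nW, nc) -> output shape after one conv layer."""
    nH, nW, _ = shape
    size = lambda n: (n + 2 * p - f) // s + 1
    return (size(nH), size(nW), n_filters)

shape = (39, 39, 3)
for f, s, p, k in [(3, 1, 0, 10), (5, 2, 0, 20), (5, 2, 0, 40)]:
    shape = conv_layer_shape(shape, f, s, p, k)
    print(shape)
# (37, 37, 10), then (17, 17, 20), then (7, 7, 40); flattened: 7*7*40 = 1960
```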
Max pooling
$4 \times 4 \quad \Rightarrow \quad 2 \times 2$
If we have multiple channels, pooling is applied to each channel independently.
Pooling can reduce dimensions significantly
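A minimal sketch of $2 \times 2$ max pooling with stride $2$; the input values are made up for illustration:

```python
import numpy as np

def max_pool(x, f=2, s=2):
    """Max pooling on a single channel."""
    n = x.shape[0]
    out_n = (n - f) // s + 1
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            # keep only the largest value in each f x f window
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.array([[1, 3, 2, 1],
              [2, 9, 1, 1],
              [1, 3, 2, 3],
              [5, 6, 1, 2]])
print(max_pool(x))  # [[9, 2], [6, 3]]
```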
Average pooling
$4 \times 4 \quad \Rightarrow \quad 2 \times 2$
Adaptive pooling
Idea: Fix target dimensions and set stride and filter size accordingly
\[
s^{[l]} = \lfloor \frac{n^{[l-1]}}{n^{[l]}} \rfloor \\[3mm]
f^{[l]} = n^{[l-1]} - (n^{[l]} - 1) s^{[l]}
\]
With one adaptive pooling layer we ensure that images of arbitrary size can be used.
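The two formulas can be checked directly; for example, pooling $7 \to 3$ gives $s = 2$, $f = 3$:

```python
def adaptive_pool_params(n_in, n_out):
    """Stride and filter size so that pooling maps n_in -> n_out exactly."""
    s = n_in // n_out
    f = n_in - (n_out - 1) * s
    return f, s

f, s = adaptive_pool_params(7, 3)
print(f, s)                 # 3 2
# sanity check with the output-size formula
print((7 - f) // s + 1)     # 3
```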
$1 \times 1$ filters
A $1 \times 1$ filter operates only across channels: at each spatial position it linearly combines the $n_c$ channel values.
Example (close to LeNet-5)
| Layer | Activation shape | Activation size | # parameters |
| Input | (32,32,3) | 3072 | 0 |
| CONV1 (f=5, s=1) | (28,28,6) | 4704 | 456 |
| POOL1 (f=2, s=2) | (14,14,6) | 1176 | 0 |
| CONV2 (f=5, s=1) | (10,10,16) | 1600 | 2416 |
| POOL2 (f=2, s=2) | (5,5,16) | 400 | 0 |
| FC3 | (120,1) | 120 | 48120 |
| FC4 | (84,1) | 84 | 10164 |
| Softmax | (10,1) | 10 | 850 |
Common patterns
- Decreasing height and width with depth
- Increasing channels with depth
- Number of activations decreasing with depth
- Convolutions followed by pooling layers
- Fully connected layers at the end
Parameter sharing
A feature detector that is useful in one part of the image should be useful in another part.
Translation invariance
Convolutions produce similar output when the image is shifted.
Sparsity / Locality
Each output is affected by only a few inputs
How to decide on neural network architecture?
Check other models for similar tasks
LeNet-5 (1998)
Classify single digit images
- Input: $32 \times 32 \times 1$ (grayscale)
- 6 Convolutions ($f=5, s=1$, tanh) : $28 \times 28 \times 6$
- Avg. pooling ($f=2, s=2$) : $14 \times 14 \times 6$
- 16 Convolutions ($f=5, s=1$, tanh) : $10 \times 10 \times 16$
- Avg. pooling ($f=2, s=2$) : $5 \times 5 \times 16$
- Flatten : $400$
- 120 Fully conn. (tanh): $120$
- 84 Fully conn. (tanh): $84$
- 10 Fully conn. (rbf): $10$
About 60k parameters
AlexNet (2012)
Image recognition
- Input: $227 \times 227 \times 3$
- 96 Convolutions ($f=11, s=4$, ReLU) : $55 \times 55 \times 96$
- Max Pooling ($f=3, s=2$) : $27 \times 27 \times 96$
- 256 Convolutions ($f=5, p=2$, ReLU) : $27 \times 27 \times 256$
- Max Pooling ($f=3, s=2$) : $13 \times 13 \times 256$
- 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
- 384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
- 256 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 256$
- Max Pooling ($f=3, s=2$) : $6 \times 6 \times 256$
- Flatten : $9216$
- 4096 Fully conn. (ReLU) : $4096$
- 4096 Fully conn. (ReLU) : $4096$
- 1000 Fully conn. (SoftMax) : $1000$
About 60 million parameters
VGG-16 (2015)
Image recognition
CONV: $f=3, p=1$; POOL: $f=2, s=2$
- Input: $224 \times 224 \times 3$
- 2x CONV 64 : $224 \times 224 \times 64$
- POOL : $112 \times 112 \times 64$
- 2x CONV 128 : $112 \times 112 \times 128$
- POOL : $56 \times 56 \times 128$
- 3x CONV 256 : $56 \times 56 \times 256$
- POOL : $28 \times 28 \times 256$
- 3x CONV 512 : $28 \times 28 \times 512$
- POOL : $14 \times 14 \times 512$
- 3x CONV 512 : $14 \times 14 \times 512$
- POOL : $7 \times 7 \times 512$
- 2 x FC $4096$
- Softmax $1000$
About 138 million parameters
With very deep architectures, vanishing / exploding gradients become an issue
Idea: Add shortcuts to network (residual networks)
Residual block:
\[
a^{[l+2]} = g(z^{[l+2]} {\color{red} + a^{[l]}})
\]
For very deep networks, the training becomes fragile (vanishing / exploding gradients problem)
With ResNets we can keep increasing depth more than with "plain" networks
Intuition why this works:
\[
a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) \\[3mm]
= g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]})
\]
Setting $W^{[l+2]} = 0, b^{[l+2]} = 0$:
\[
a^{[l+2]} = g(a^{[l]})
\]
which for ReLU implies $a^{[l+2]} = a^{[l]}$
A residual block can easily learn the identity, so adding it does not easily make things worse.
When using skip connections the dimensions of $a^{[l]}$ and $a^{[l+2]}$ must match.
Need to use convolutions with "same" padding strategy to keep dimensions.
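The identity argument can be checked numerically with a toy residual block (ReLU activations; shapes and names are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = g(W[l+2] a[l+1] + b[l+2] + a[l]) with g = ReLU."""
    a_mid = relu(W1 @ a_l + b1)
    return relu(W2 @ a_mid + b2 + a_l)   # skip connection adds a[l]

a_l = relu(np.random.randn(4))           # activations are non-negative (post-ReLU)
W1, b1 = np.random.randn(4, 4), np.random.randn(4)
W2, b2 = np.zeros((4, 4)), np.zeros(4)   # W[l+2] = 0, b[l+2] = 0
print(np.allclose(residual_block(a_l, W1, b1, W2, b2), a_l))  # True
```

With $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$ the block outputs $g(a^{[l]}) = a^{[l]}$, exactly as derived above.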
What is a good filter size?
Idea: Use multiple sizes in same layer.
Inception network
All units have "same" padding strategy and stride 1, so that output dimensions match
Number of multiplications for applying $32$ filters of size $5 \times 5$ with "same" padding to an input with dimensions $28 \times 28 \times 192$:
- A single output needs $5 \cdot 5 \cdot 192 = 4800$ multiplications
- We have $28 \cdot 28 \cdot 32 = 25088$ outputs
- That makes $4800 \cdot 25088 = 120422400$ multiplications.
Bottleneck layer
Multiplications:
- $1 \cdot 1 \cdot 192 \cdot 28 \cdot 28 \cdot 16 = 2408448$
- $5 \cdot 5 \cdot 16 \cdot 28 \cdot 28 \cdot 32 = 10035200$
Total: $12443648$
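The counts above can be verified with a one-line cost function:

```python
def conv_mults(nH, nW, nc, f, n_filters):
    """Multiplications for n_filters 'same' convolutions of size f x f x nc."""
    return f * f * nc * nH * nW * n_filters

direct = conv_mults(28, 28, 192, 5, 32)
bottleneck = conv_mults(28, 28, 192, 1, 16) + conv_mults(28, 28, 16, 5, 32)
print(direct)       # 120422400
print(bottleneck)   # 12443648
```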
Inception module
Inception network
MobileNets (2017)
Goal: Low computation cost for inference
Depthwise Separable Convolution
Idea: Apply a filter to each channel independently and follow with a $1 \times 1$ convolution.
Normal convolution: $n_W \times n_H \times n_c \;\; * \;\; f \times f \times n_c \;(n_c' \text{ filters}) \;\; = \;\; n_W' \times n_H' \times n_c'$

Depthwise convolution: $n_W \times n_H \times n_c \;\; * \;\; f \times f \;(\text{one filter per channel}) \;\; = \;\; n_W' \times n_H' \times n_c$
\[
n_W \times n_H \times n_c \\[3mm]
\xrightarrow{\text{depthw. conv.}}
n_W' \times n_H' \times n_c \\[3mm]
\xrightarrow{n_c' \text{ conv. } 1\times 1}
n_W' \times n_H' \times n_c'
\]
Computation cost
Reduced by the factor
\[
\frac{1}{n_c'} + \frac{1}{f^2}
\]
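The reduction factor can be checked against explicit multiplication counts; the dimensions below are toy values chosen for illustration:

```python
def cost_ratio(f, nc_out):
    """Depthwise separable vs. normal convolution cost: 1/nc' + 1/f^2."""
    return 1 / nc_out + 1 / f ** 2

# explicit counts for f=3, nc=3, nc'=64, and a 4x4 output
nH, nW, nc, f, nc_out = 4, 4, 3, 3, 64
normal = f * f * nc * nH * nW * nc_out                    # standard convolution
separable = f * f * nc * nH * nW + nc * nH * nW * nc_out  # depthwise + 1x1 pointwise
print(separable / normal, cost_ratio(f, nc_out))          # both ~0.1267
```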
Original MobileNet v1: 13 layers depthwise separable convolution + POOL + FC + SoftMax
MobileNet v2 (2019)
"Bottleneck Block"
EfficientNet (2020)
How to select model for a given computation budget?
| Model | Accuracy | # Params | # FLOPs |
| EfficientNet-B0 | 77.1% | 5.3M | 0.39B |
| EfficientNet-B1 | 79.1% | 7.8M | 0.70B |
| EfficientNet-B2 | 80.1% | 9.2M | 1.0B |
| EfficientNet-B3 | 81.6% | 12M | 1.8B |
| EfficientNet-B4 | 82.9% | 19M | 4.2B |
| EfficientNet-B5 | 83.6% | 30M | 9.9B |
| EfficientNet-B6 | 84.0% | 43M | 19B |
| EfficientNet-B7 | 84.3% | 66M | 37B |
Do not reinvent the wheel
Use open source implementations of public models
Many computer vision models need huge amounts of data and are expensive to train
Idea: Reuse existing model and weights for new task
Transfer Learning
Adapt model for own final classes
Use new softmax layer, keep rest constant
With transfer learning we turn the first $L-1$ layers into a function $f(x) = a^{[L-1]}$.
Then we train a linear classifier on $a^{[L-1]}$:
\[
\hat y = g(W^{[L]} a^{[L-1]} + b^{[L]})
\]
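A sketch of this last step: training only the new output layer on precomputed features $a^{[L-1]}$. All data here is synthetic, purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# synthetic stand-in for features a[L-1] produced by the frozen network
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 16))                    # 200 examples, 16 features
y = (A @ rng.standard_normal(16) > 0).astype(float)   # synthetic binary labels

# train only W[L], b[L] with gradient descent on the logistic loss
w, b = np.zeros(16), 0.0
for _ in range(1000):
    p = sigmoid(A @ w + b)
    w -= 0.1 * A.T @ (p - y) / len(y)
    b -= 0.1 * np.mean(p - y)

acc = np.mean((sigmoid(A @ w + b) > 0.5) == (y == 1))
print(acc)  # training accuracy on these separable features
```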
If we have more data, we can retrain more layers
Mixed strategy:
- Replace soft-max layer, freeze rest and train
- Unfreeze some layers from the back and train with smaller learning rate
Using a pre-trained model as a start almost always beats training a new model from scratch.