Machine Learning

Convolutional neural networks

Computer vision

Image classification

Cat? (0/1)

Object detection

https://commons.wikimedia.org/wiki/File:Detected-with-YOLO--Schreibtisch-mit-Objekten.jpg

Style transfer

How can we process large images?

64x64x3 means 12,288 input features
1000x1000x3 means 3,000,000 input features
- With 1000 neurons in the first layer, $W^{[1]}$ has 3 billion entries
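
The numbers above can be checked with a few lines of Python. This is only a minimal sketch; the function names are illustrative and not taken from any library.

```python
def input_features(height, width, channels):
    """Number of input features when an image is flattened into a vector."""
    return height * width * channels


def dense_weight_count(n_inputs, n_neurons):
    """Entries of the weight matrix of a fully connected layer (bias not counted)."""
    return n_inputs * n_neurons


small = input_features(64, 64, 3)
large = input_features(1000, 1000, 3)
weights = dense_weight_count(large, 1000)
print(small, large, weights)
```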

Edge detection

https://commons.wikimedia.org/wiki/File:%C3%84%C3%A4retuvastuse_n%C3%A4ide.png

Convolution operation

3	0	1	2	7	4
1	5	8	9	3	1
2	7	2	5	1	3
0	1	3	1	7	8
4	2	1	6	2	8
2	4	5	2	3	9

$*$

1	0	-1
1	0	-1
1	0	-1

$=$

-5	-4	0	8
-10	-2	2	3
0	-2	-4	-7
-3	-2	-3	-16

6x6 input image

3x3 filter (kernel)

4x4 output

In TensorFlow: tf.nn.conv2d and tf.keras.layers.Conv2D
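
To make the operation concrete, here is a minimal pure-Python sketch that reproduces the 6x6 example above. The helper name `conv2d_valid` is an assumption made for illustration. As is usual in deep learning, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    n_h, n_w = len(image), len(image[0])
    f_h, f_w = len(kernel), len(kernel[0])
    output = []
    for i in range(n_h - f_h + 1):
        row = []
        for j in range(n_w - f_w + 1):
            # Element-wise product of the kernel with the image patch at (i, j).
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(f_h)
                for b in range(f_w)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

result = conv2d_valid(image, kernel)
for row in result:
    print(row)
```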

Vertical edge detection?

10	10	10	0	0	0
10	10	10	0	0	0
10	10	10	0	0	0
10	10	10	0	0	0
10	10	10	0	0	0
10	10	10	0	0	0

$*$

1	0	-1
1	0	-1
1	0	-1

$=$

0	30	30	0
0	30	30	0
0	30	30	0
0	30	30	0

Mirrored image

0	0	0	10	10	10
0	0	0	10	10	10
0	0	0	10	10	10
0	0	0	10	10	10
0	0	0	10	10	10
0	0	0	10	10	10

$*$

1	0	-1
1	0	-1
1	0	-1

$=$

0	-30	-30	0
0	-30	-30	0
0	-30	-30	0
0	-30	-30	0
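
The sign flip for the mirrored image can be reproduced with a short sketch. It assumes a 6x6 input whose left half is bright (10) and whose right half is dark (0), which is what the 4x4 outputs above imply. The helper `conv2d_valid` is an illustrative name.

```python
def conv2d_valid(image, kernel):
    """Cross-correlation without padding, stride 1, for a square kernel."""
    f = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
            for j in range(len(image[0]) - f + 1)
        ]
        for i in range(len(image) - f + 1)
    ]


vertical_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

# Assumed input: bright left half, dark right half.
bright_left = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
# Mirrored image: dark left half, bright right half.
bright_right = [row[::-1] for row in bright_left]

edges = conv2d_valid(bright_left, vertical_filter)
edges_mirrored = conv2d_valid(bright_right, vertical_filter)
print(edges[0], edges_mirrored[0])
```

A bright-to-dark transition gives a positive response and a dark-to-bright transition a negative one, so the sign encodes the direction of the edge.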

Horizontal edge detection

10	10	10	0	0	0
10	10	10	0	0	0
10	10	10	0	0	0
0	0	0	10	10	10
0	0	0	10	10	10
0	0	0	10	10	10

$*$

1	1	1
0	0	0
-1	-1	-1

$=$

0	0	0	0
30	10	-10	-30
30	10	-10	-30
0	0	0	0

Alternative edge detection filters

Filter used so far

1	0	-1
1	0	-1
1	0	-1

Sobel filter

1	0	-1
2	0	-2
1	0	-1

Scharr filter

3	0	-3
10	0	-10
3	0	-3

Idea: we learn the filter instead of building it by hand

$w_1$	$w_2$	$w_3$
$w_4$	$w_5$	$w_6$
$w_7$	$w_8$	$w_9$

The convolution $X * W$ can be applied to inputs of any size. The number of parameters in $W$ always stays the same.

Our output is smaller than the input:

$X$	$*$	$W$	$=$	$Z$
$\in \mathbb{R}^{n \times n}$		$\in \mathbb{R}^{f \times f}$		$\in \mathbb{R}^{(n-f+1) \times (n-f+1)}$

Input Padding

0	0	0	0	0	0	0	0
0	3	0	1	2	7	4	0
0	1	5	8	9	3	1	0
0	2	7	2	5	1	3	0
0	0	1	3	1	7	8	0
0	4	2	1	6	2	8	0
0	2	4	5	2	3	9	0
0	0	0	0	0	0	0	0

$*$

1	0	-1
1	0	-1
1	0	-1

$=$

-5	-5	-6	-1	6	10
-12	-5	-4	0	8	11
-13	-10	-2	2	3	11
-10	0	-2	-4	-7	10
-7	-3	-2	-3	-16	12
-6	0	-2	1	-9	5

$6 \times 6 \to 8 \times 8$

$3 \times 3$

$4 \times 4 \to 6 \times 6$

Padding $p=1$
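
A minimal sketch of zero padding with $p=1$ followed by the same convolution as before. The helper names `zero_pad` and `conv2d_valid` are illustrative assumptions. The padded 8x8 input yields a 6x6 output, the same size as the original image.

```python
def zero_pad(image, p):
    """Surround a 2D image with p rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    padded = [[0] * width for _ in range(p)]
    for row in image:
        padded.append([0] * p + list(row) + [0] * p)
    padded.extend([0] * width for _ in range(p))
    return padded


def conv2d_valid(image, kernel):
    """Cross-correlation without padding, stride 1, for a square kernel."""
    f = len(kernel)
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(f) for b in range(f))
            for j in range(len(image[0]) - f + 1)
        ]
        for i in range(len(image) - f + 1)
    ]


image = [
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

padded = zero_pad(image, 1)
result = conv2d_valid(padded, kernel)
for row in result:
    print(row)
```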

Another padding strategy: replication

3	3	0	1	2	7	4	4
3	3	0	1	2	7	4	4
1	1	5	8	9	3	1	1
2	2	7	2	5	1	3	3
0	0	1	3	1	7	8	8
4	4	2	1	6	2	8	8
2	2	4	5	2	3	9	9
2	2	4	5	2	3	9	9
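
Replication padding can be sketched in the same style. The function name `replicate_pad` is an illustrative assumption. Each border value is simply repeated outward instead of filling with zeros.

```python
def replicate_pad(image, p):
    """Pad a 2D image by repeating its border values p times on every side."""
    # First extend every row to the left and right.
    rows = [[row[0]] * p + list(row) + [row[-1]] * p for row in image]
    # Then repeat the (already extended) first and last rows.
    top = [list(rows[0]) for _ in range(p)]
    bottom = [list(rows[-1]) for _ in range(p)]
    return top + rows + bottom


image = [
    [3, 0, 1, 2, 7, 4],
    [1, 5, 8, 9, 3, 1],
    [2, 7, 2, 5, 1, 3],
    [0, 1, 3, 1, 7, 8],
    [4, 2, 1, 6, 2, 8],
    [2, 4, 5, 2, 3, 9],
]
padded = replicate_pad(image, 1)
for row in padded:
    print(row)
```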

How much padding?

"valid": no padding
$n \times n \quad * \quad f \times f \quad \to \quad (n-f+1) \times (n-f+1)$
"same": enough padding that the output is the same size as the input (for stride 1 and odd $f$).
$p = \frac{f-1}{2}$

https://commons.wikimedia.org/wiki/File:2D_Convolution_Animation.gif

Strided convolutions

$*$

$=$

$7 \times 7$

$3 \times 3$

stride $s = 2$

Input: $n_H \times n_W$
Filter: $f \times f$
Padding: $p$
Stride: $s$

Output: $(\lfloor \frac{n_H + 2p - f}{s} \rfloor + 1) \times (\lfloor \frac{n_W + 2p - f}{s} \rfloor + 1)$
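
The output size formula translates directly into code. The sketch below uses illustrative function names and checks the formula against the examples in this section: the 7x7 input with a 3x3 filter and stride 2, and the 6x6 input with and without "same" padding.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding that keeps the size unchanged for stride 1 and odd filter size f."""
    return (f - 1) // 2


strided = conv_output_size(7, 3, p=0, s=2)
valid = conv_output_size(6, 3)
same = conv_output_size(6, 3, p=same_padding(3))
print(strided, valid, same)
```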

We can now use convolutions on grayscale images.

What about RGB images?

E.g. instead of $1024 \times 1024$ we now have $1024 \times 1024 \times 3$.

	$*$		$=$
$6 \times 6 \times 3$		$3 \times 3 \times 3$		$4 \times 4 (\times 1)$

Find vertical edges in the red channel.

1	0	-1
1	0	-1
1	0	-1

0	0	0
0	0	0
0	0	0

0	0	0
0	0	0
0	0	0

Find vertical edges in all channels.

1	0	-1
1	0	-1
1	0	-1

1	0	-1
1	0	-1
1	0	-1

1	0	-1
1	0	-1
1	0	-1

Instead of one output channel, we can also produce several:

Dimensions:

Input: $n_H \times n_W \times n_c$
Filter: $f \times f \times n_c$
Number of filters: $n_c'$

Output: $(n_H - f + 1) \times (n_W -f + 1) \times n_c'$
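
A minimal sketch of a convolution over volumes with several filters. The function name `conv_layer`, the channel-last layout `volume[h][w][c]`, and the toy input are assumptions for illustration. The two filters correspond to the red-channel-only and all-channel vertical edge detectors shown earlier.

```python
def conv_layer(volume, filters):
    """volume[h][w][c] and filters[k][a][b][c] give output[h'][w'][k] (valid, stride 1)."""
    n_h, n_w, n_c = len(volume), len(volume[0]), len(volume[0][0])
    f = len(filters[0])
    output = []
    for i in range(n_h - f + 1):
        out_row = []
        for j in range(n_w - f + 1):
            # One output value per filter: sum over height, width and all channels.
            out_row.append([
                sum(
                    volume[i + a][j + b][c] * filt[a][b][c]
                    for a in range(f)
                    for b in range(f)
                    for c in range(n_c)
                )
                for filt in filters
            ])
        output.append(out_row)
    return output


edge = [1, 0, -1]
red_only = [[[edge[b], 0, 0] for b in range(3)] for _ in range(3)]
all_channels = [[[edge[b]] * 3 for b in range(3)] for _ in range(3)]

# Toy 6x6x3 image: red and green have a vertical edge, blue is constant.
volume = [[[10 if j < 3 else 0, 10 if j < 3 else 0, 5] for j in range(6)] for _ in range(6)]

out = conv_layer(volume, [red_only, all_channels])
print(len(out), len(out[0]), len(out[0][0]))
print(out[0][1])
```

The output has shape $4 \times 4 \times 2$: one channel per filter.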

We now have almost everything we need for one layer of a convolutional neural network (CNN).

What else do we need?

A bias $b$
A non-linearity

If a layer has $10$ filters of size $3 \times 3 \times 3$, how many parameters does the layer have?
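
One way to work this out, assuming each filter has exactly one bias term (the function name is illustrative):

```python
def conv_layer_params(f, n_c_in, n_filters, use_bias=True):
    """Weights of n_filters filters of size f x f x n_c_in, plus one bias per filter."""
    per_filter = f * f * n_c_in + (1 if use_bias else 0)
    return per_filter * n_filters


params = conv_layer_params(f=3, n_c_in=3, n_filters=10)
print(params)
```

Note that the count does not depend on the height or width of the input image.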

We now construct a CNN:

Input	Layer 1	$a^{[1]}$	Layer 2	$a^{[2]}$	Layer 3	$a^{[3]}$
$39 \times 39 \times 3$	$\Rightarrow$	$37 \times 37 \times 10$	$\Rightarrow$	$17 \times 17 \times 20$	$\Rightarrow$	$7 \times 7 \times 40$
$n_H^{[0]} = 39\\ n_W^{[0]} = 39\\ n_c^{[0]} = 3$	$f^{[1]} = 3 \\ s^{[1]} = 1 \\ p^{[1]} = 0 \\ 10 \text{ filters}$	$n_H^{[1]} = 37\\ n_W^{[1]} = 37\\ n_c^{[1]} = 10$	$f^{[2]} = 5 \\ s^{[2]} = 2 \\ p^{[2]} = 0 \\ 20 \text{ filters}$	$n_H^{[2]} = 17\\ n_W^{[2]} = 17\\ n_c^{[2]} = 20$	$f^{[3]} = 5 \\ s^{[3]} = 2 \\ p^{[3]} = 0 \\ 40 \text{ filters}$	$n_H^{[3]} = 7\\ n_W^{[3]} = 7\\ n_c^{[3]} = 40$

Final stage (layer 4):

Reshape to a vector with 1960 entries.
Fully connected layer with sigmoid or softmax as the activation function.
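
The shapes in this example can be traced layer by layer with the output size formula. The sketch below is only an illustration; the dictionary keys mirror the symbols $f$, $s$, $p$ used above.

```python
def conv_output_size(n, f, p=0, s=1):
    """Output length along one axis: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


layers = [
    {"f": 3, "s": 1, "p": 0, "filters": 10},
    {"f": 5, "s": 2, "p": 0, "filters": 20},
    {"f": 5, "s": 2, "p": 0, "filters": 40},
]

shape = (39, 39, 3)
shapes = [shape]
for layer in layers:
    n_h = conv_output_size(shape[0], layer["f"], layer["p"], layer["s"])
    n_w = conv_output_size(shape[1], layer["f"], layer["p"], layer["s"])
    # The number of output channels equals the number of filters.
    shape = (n_h, n_w, layer["filters"])
    shapes.append(shape)

flattened = shape[0] * shape[1] * shape[2]
print(shapes, flattened)
```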

Max pooling

1	3	2	1
2	9	1	1
1	3	2	3
5	6	1	2

$\Rightarrow$

9	2
6	3

$4 \times 4$

$2 \times 2$

$f = 2$
$s = 2$

If we have multiple channels, pooling is applied to each channel separately.

Pooling can reduce the size drastically.

Average pooling

1	3	2	1
2	9	1	1
1	3	2	3
5	6	1	2

$\Rightarrow$

3.75	1.25
3.75	2

$4 \times 4$

$2 \times 2$
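
Max and average pooling differ only in the function applied to each window. A minimal sketch (the names `pool2d` and `mean` are illustrative) that reproduces both 4x4 examples with $f = 2$, $s = 2$:

```python
def pool2d(image, f, s, reduce_fn):
    """Apply reduce_fn to every f x f window, moving s pixels per step."""
    n_h, n_w = len(image), len(image[0])
    return [
        [
            reduce_fn([image[i + a][j + b] for a in range(f) for b in range(f)])
            for j in range(0, n_w - f + 1, s)
        ]
        for i in range(0, n_h - f + 1, s)
    ]


def mean(values):
    return sum(values) / len(values)


image = [
    [1, 3, 2, 1],
    [2, 9, 1, 1],
    [1, 3, 2, 3],
    [5, 6, 1, 2],
]

max_pooled = pool2d(image, f=2, s=2, reduce_fn=max)
avg_pooled = pool2d(image, f=2, s=2, reduce_fn=mean)
print(max_pooled)
print(avg_pooled)
```

Pooling has no learnable parameters; $f$ and $s$ are fixed hyperparameters.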

Adaptive Pooling

Idea: choose the stride and filter size so that we always reach a fixed output size.

\[ s^{[l]} = \lfloor \frac{n^{[l-1]}}{n^{[l]}} \rfloor \\[3mm] f^{[l]} = n^{[l-1]} - (n^{[l]} - 1) s^{[l]} \]

With an adaptive pooling layer, we can use input images of arbitrary size.
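
The two formulas can be checked numerically. The sketch below (illustrative function names, one spatial axis only) computes $s$ and $f$ for a desired output size and confirms that the pooled size is always the requested one.

```python
def adaptive_pool_params(n_in, n_out):
    """Stride and filter size that map n_in inputs to exactly n_out outputs."""
    s = n_in // n_out
    f = n_in - (n_out - 1) * s
    return s, f


def pooled_size(n_in, f, s):
    """Output length of a pooling layer without padding."""
    return (n_in - f) // s + 1


target = 4
for n_in in (10, 13, 32, 227):
    s, f = adaptive_pool_params(n_in, target)
    print(n_in, s, f, pooled_size(n_in, f, s))
```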

Layer types for convolutional neural networks

Convolution
Pooling
Fully connected

Example (similar to LeNet-5)

	Activation shape	Activation size	# parameters
Input	(32,32,3)	3072	0
CONV1 (f=5, s=1)	(28,28,6)	4704	456
POOL1 (f=2, s=2)	(14,14,6)	1176	0
CONV2 (f=5, s=1)	(10,10,16)	1600	2416
POOL2 (f=2, s=2)	(5,5,16)	400	0
FC3	(120,1)	120	48120
FC4	(84,1)	84	10164
Softmax	(10,1)	10	850
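
The activation sizes and parameter counts in the table follow from two small formulas, assuming one bias per filter and per neuron. A sketch that recomputes them (helper names are illustrative):

```python
def conv_params(f, c_in, c_out):
    """(f*f*c_in weights + 1 bias) per filter, times c_out filters."""
    return (f * f * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """(n_in weights + 1 bias) per neuron, times n_out neurons."""
    return (n_in + 1) * n_out


def size(shape):
    result = 1
    for dim in shape:
        result *= dim
    return result


rows = [
    ("Input", (32, 32, 3), 0),
    ("CONV1", (28, 28, 6), conv_params(5, 3, 6)),
    ("POOL1", (14, 14, 6), 0),
    ("CONV2", (10, 10, 16), conv_params(5, 6, 16)),
    ("POOL2", (5, 5, 16), 0),
    ("FC3", (120, 1), dense_params(400, 120)),
    ("FC4", (84, 1), dense_params(120, 84)),
    ("Softmax", (10, 1), dense_params(84, 10)),
]

sizes = [size(shape) for _, shape, _ in rows]
params = [p for _, _, p in rows]
total_params = sum(params)
for (name, shape, p), n in zip(rows, sizes):
    print(name, shape, n, p)
print("total parameters:", total_params)
```

Most parameters sit in the first fully connected layer, not in the convolutions.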

Common patterns:

Spatial size shrinks with depth.
Number of channels grows.
Number of activations tends to decrease.
Convolutions followed by pooling layers.
Fully connected layers at the end.

Parameter sharing

A feature detector that works on one part of the image should also be useful on another part.

Translation invariance

Convolutions produce similar output when we shift the image.

Sparsity / locality

Each output is influenced only by inputs from a small region.

How do we find a good CNN architecture?

Copy from other models that work on similar data.

LeNet-5 (1998)

Classify individual digits.

Input: $32 \times 32 \times 1$
6 Convolutions ($f=5, s=1$, tanh) : $28 \times 28 \times 6$
Avg. pooling ($f=2, s=2$) : $14 \times 14 \times 6$
16 Convolutions ($f=5, s=1$, tanh) : $10 \times 10 \times 16$
Avg. pooling ($f=2, s=2$) : $5 \times 5 \times 16$
Flatten : $400$
120 Fully conn. (tanh): $120$
84 Fully conn. (tanh): $84$
10 Fully conn. (rbf): $10$

Approximately 60k parameters

AlexNet (2012)

Image recognition

Input: $227 \times 227 \times 3$
96 Convolutions ($f=11, s=4$, ReLU) : $55 \times 55 \times 96$
Max Pooling ($f=3, s=2$) : $27 \times 27 \times 96$
256 Convolutions ($f=5, p=2$, ReLU) : $27 \times 27 \times 256$
Max Pooling ($f=3, s=2$) : $13 \times 13 \times 256$
384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
384 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 384$
256 Convolutions ($f=3, p=1$, ReLU) : $13 \times 13 \times 256$
Max Pooling ($f=3, s=2$) : $6 \times 6 \times 256$
Flatten : $9216$
4096 Fully conn. (ReLU) : $4096$
4096 Fully conn. (ReLU) : $4096$
1000 Fully conn. (SoftMax) : $1000$

Approximately 60 million parameters

VGG - 16 (2015)

Image recognition

CONV: $f=3, p=1$; POOL: $f=2, s=2$

Input: $224 \times 224 \times 3$
2x CONV 64 : $224 \times 224 \times 64$
POOL : $112 \times 112 \times 64$
2x CONV 128 : $112 \times 112 \times 128$
POOL : $56 \times 56 \times 128$
3x CONV 256 : $56 \times 56 \times 256$
POOL : $28 \times 28 \times 256$
3x CONV 512 : $28 \times 28 \times 512$
POOL : $14 \times 14 \times 512$
3x CONV 512 : $14 \times 14 \times 512$
POOL : $7 \times 7 \times 512$
2 x FC $4096$
Softmax $1000$

Approximately 138 million parameters
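
The parameter count can be reproduced from the layer list above. The sketch assumes one bias per filter and per neuron and uses illustrative helper names. Every POOL halves the spatial size; the "same" convolutions leave it unchanged.

```python
def conv_params(c_in, c_out, f=3):
    """(f*f*c_in weights + 1 bias) per filter, times c_out filters."""
    return (f * f * c_in + 1) * c_out


def dense_params(n_in, n_out):
    """(n_in weights + 1 bias) per neuron, times n_out neurons."""
    return (n_in + 1) * n_out


# (number of CONV layers, number of filters) per block; each block ends with POOL.
blocks = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

size, channels, total = 224, 3, 0
for n_convs, n_filters in blocks:
    for _ in range(n_convs):
        total += conv_params(channels, n_filters)
        channels = n_filters
    size //= 2  # POOL with f=2, s=2

n_in = size * size * channels  # flattened 7 x 7 x 512 volume
for n_out in (4096, 4096, 1000):
    total += dense_params(n_in, n_out)
    n_in = n_out

print(size, channels, total)
```

As in the LeNet-style example, the first fully connected layer dominates the total.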

With very deep architectures, the vanishing / exploding gradients problem keeps getting worse.

Idea: add shortcuts to the network
(residual networks)

Residual block: \[ a^{[l+2]} = g(z^{[l+2]} {\color{red} + a^{[l]}}) \]

Training very deep neural networks quickly becomes fragile.

With ResNets we can increase the depth further.

Intuition for why this works:

\[ a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) \\[3mm] = g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) \]

Set $W^{[l+2]} = 0, b^{[l+2]} = 0$: \[ a^{[l+2]} = g(a^{[l]}) \] which for ReLU means $a^{[l+2]} = a^{[l]}$, because $a^{[l]} \geq 0$ is itself the output of a ReLU.

It is easy for the residual block to learn the identity, so adding it does not make the result worse.
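
The identity argument can be checked with a tiny fully connected residual block. This is a sketch under stated assumptions: plain Python lists instead of tensors, ReLU as $g$, and arbitrary made-up weights for layer $l+1$.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def affine(W, vector, b):
    """Compute W @ vector + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, vector)) + bias for row, bias in zip(W, b)]


def residual_block(a_l, W1, b1, W2, b2):
    a_l1 = relu(affine(W1, a_l, b1))             # a^{[l+1]}
    z_l2 = affine(W2, a_l1, b2)                  # z^{[l+2]}
    return relu([z + a for z, a in zip(z_l2, a_l)])  # g(z^{[l+2]} + a^{[l]})


# a^{[l]} is non-negative, as if it came out of an earlier ReLU.
a_l = [0.5, 0.0, 2.0]
W1 = [[0.3, -1.2, 0.7], [1.5, 0.4, -0.9], [-0.2, 0.8, 0.1]]  # arbitrary values
b1 = [0.1, -0.3, 0.2]
W2 = [[0.0] * 3 for _ in range(3)]
b2 = [0.0] * 3

out = residual_block(a_l, W1, b1, W2, b2)
print(out)
```

With $W^{[l+2]}$ and $b^{[l+2]}$ at zero, the block passes $a^{[l]}$ through unchanged, whatever layer $l+1$ computes.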

If we use skip connections, the dimensions of $a^{[l]}$ and $a^{[l+2]}$ must match.

This requires convolutions with the "same" padding strategy.

What is a good filter size?

Idea: use filters of different sizes in the same layer.

Inception Network

All parts must use the "same" padding strategy and stride 1 so that the output dimensions match.

Number of multiplications needed to apply $32$ filters of size $5 \times 5$ with "same" padding to an input of size $28 \times 28 \times 192$:

A single output value needs $5 \cdot 5 \cdot 192 = 4800$ multiplications.
We have $28 \cdot 28 \cdot 32 = 25088$ outputs.
That makes $4800 \cdot 25088 = 120422400$ multiplications.

Bottleneck Layer

Multiplications:

$1 \cdot 1 \cdot 192 \cdot 28 \cdot 28 \cdot 16 = 2408448$
$5 \cdot 5 \cdot 16 \cdot 28 \cdot 28 \cdot 32 = 10035200$

Total: $12443648$
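
Both counts can be reproduced with one helper (the name is illustrative). The bottleneck version first reduces the 192 channels to 16 with $1 \times 1$ filters and then applies the $5 \times 5$ filters, as in the numbers above.

```python
def conv_multiplications(n_h, n_w, c_in, f, c_out):
    """Multiplications for a 'same' convolution producing an n_h x n_w x c_out output."""
    per_output = f * f * c_in
    n_outputs = n_h * n_w * c_out
    return per_output * n_outputs


direct = conv_multiplications(28, 28, c_in=192, f=5, c_out=32)

reduce_step = conv_multiplications(28, 28, c_in=192, f=1, c_out=16)
expand_step = conv_multiplications(28, 28, c_in=16, f=5, c_out=32)
bottleneck = reduce_step + expand_step

ratio = bottleneck / direct
print(direct, bottleneck, round(ratio, 3))
```

The bottleneck needs roughly a tenth of the multiplications of the direct convolution.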

Inception module

Inception network

https://knowyourmeme.com/photos/531557-we-need-to-go-deeper

MobileNets (2017)

Goal: more efficient computation for inference.

Depthwise Separable Convolution

Idea: apply a separate filter to each channel independently, then merge the results with $1 \times 1$ convolutions.

Normal convolution:		$*$		$=$
Depthwise convolution:		$*$		$=$

\[ n_W \times n_H \times n_c \\[3mm] \xrightarrow{\text{depthw. conv.}} n_W' \times n_H' \times n_c \\[3mm] \xrightarrow{n_c' \text{ conv. } 1\times 1} n_W' \times n_H' \times n_c' \]

Effect on computational cost

The cost is reduced to the following fraction of a normal convolution: \[ \frac{1}{n_c'} + \frac{1}{f^2} \]
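
The fraction follows from counting multiplications for both variants. A sketch with made-up layer sizes (the concrete numbers below are assumptions chosen only for illustration):

```python
from fractions import Fraction


def normal_cost(out_h, out_w, n_c, f, n_c_out):
    """Every output value of every filter sums over an f x f x n_c patch."""
    return out_h * out_w * n_c_out * f * f * n_c


def separable_cost(out_h, out_w, n_c, f, n_c_out):
    depthwise = out_h * out_w * n_c * f * f    # one f x f filter per input channel
    pointwise = out_h * out_w * n_c_out * n_c  # n_c_out filters of size 1 x 1 x n_c
    return depthwise + pointwise


# Hypothetical layer: 28 x 28 output, 192 input channels, 3 x 3 filters, 256 output channels.
out_h, out_w, n_c, f, n_c_out = 28, 28, 192, 3, 256

ratio = Fraction(
    separable_cost(out_h, out_w, n_c, f, n_c_out),
    normal_cost(out_h, out_w, n_c, f, n_c_out),
)
expected = Fraction(1, n_c_out) + Fraction(1, f * f)
print(ratio, float(ratio))
```

For $3 \times 3$ filters and many output channels the ratio is close to $1/9$.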

Original MobileNet v1: 13 depthwise separable convolution layers + POOL + FC + SoftMax

MobileNet v2 (2019)

"Bottleneck Block"

EfficientNet (2020)

How do we choose a model for a given budget of compute resources?

Model	Top-1 accuracy (ImageNet)	# Params	# FLOPs
EfficientNet-B0	77.1%	5.3M	0.39B
EfficientNet-B1	79.1%	7.8M	0.70B
EfficientNet-B2	80.1%	9.2M	1.0B
EfficientNet-B3	81.6%	12M	1.8B
EfficientNet-B4	82.9%	19M	4.2B
EfficientNet-B5	83.6%	30M	9.9B
EfficientNet-B6	84.0%	43M	19B
EfficientNet-B7	84.3%	66M	37B

Don't reinvent the wheel!

Open source implementations exist for many models.

Many computer vision models need a lot of data and are expensive to train.

Idea: reuse already trained models and their weights for a new task.

Transfer Learning

Adapt the model to a new task:

Use a new softmax layer at the end and keep the rest fixed.

We turn the first $L-1$ layers into a function $f(x) = a^{[L-1]}$.

Then we train a logistic regression on $a^{[L-1]}$: \[ \hat y = g(W^{[L]} a^{[L-1]} + b^{[L]}) \]
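
The idea can be sketched without any deep learning library. In the toy example below, `frozen_features` is a hypothetical stand-in for the pretrained layers $1$ to $L-1$ (a real model would be far larger), and the data is made up. Only the weights of the final logistic regression are updated.

```python
import math


def frozen_features(x):
    """Stand-in for the pretrained layers; its weights are never updated."""
    return [math.tanh(x[0] + x[1]), math.tanh(x[0] - x[1])]


def predict(w, b, features):
    z = sum(wi * fi for wi, fi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, feats, labels):
    eps = 1e-12
    total = 0.0
    for f, y in zip(feats, labels):
        p = predict(w, b, f)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(labels)


# Made-up data for the new task: label 1 when x0 + x1 > 0.
inputs = [[1.0, 0.5], [0.2, 0.9], [-1.0, -0.3], [-0.4, -0.8], [0.7, -0.1], [-0.6, 0.2]]
labels = [1, 1, 0, 0, 1, 0]

# a^{[L-1]} is computed once, because the frozen layers never change.
feats = [frozen_features(x) for x in inputs]

w, b, lr = [0.0, 0.0], 0.0, 0.5
initial_loss = mean_loss(w, b, feats, labels)
for _ in range(200):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for f, y in zip(feats, labels):
        err = predict(w, b, f) - y
        grad_w = [g + err * fi for g, fi in zip(grad_w, f)]
        grad_b += err
    w = [wi - lr * g / len(labels) for wi, g in zip(w, grad_w)]
    b -= lr * grad_b / len(labels)
final_loss = mean_loss(w, b, feats, labels)
print(initial_loss, final_loss)
```

Precomputing the features once is a common shortcut when the frozen part is expensive to evaluate.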

If we have more data, we can also retrain more layers.

Mixed strategy:

Replace the softmax layer, freeze the rest, and train.
Unfreeze a few layers starting from the end and train with a smaller learning rate.

Using a pretrained model as a starting point is almost always better than training a model from scratch.
