Grundlagen der künstlichen Intelligenz

Machine Learning

Reflex-basierte Modelle

	Such Probleme
Regression	Markov Decision Processes	Constraint Satisfaction Problems
Classifier	Kompetitive Spiele	Bayesian networks
Reflex	Zustände	Variablen	Logik

"Niedrige Intelligenz"		"Höhere Intelligenz"

Reflex-basierte Modelle

input		predictor		output
$x$	$\Rightarrow$	$f$	$\Rightarrow$	$y$

Spam Filter

Input $x$: Email

From: BILL GATES FOUNDATION <al8622787@gmail.com>
Subject: HELLO
Reply me urgently i have a very important Donation to share with you.
Have A Blessed Day
Regards
Mr. Bill Gates Foundation

Output $y$: Klassifikation: Spam oder kein Spam

Regression

				response
$x$	$\Rightarrow$	$f$	$\Rightarrow$	$y \in \mathbb{R}$

Behandlungskosten: Patienteneigenschaften → Behandlungskosten
Auto Verkaufspreis: Automobileigenschaften → Verkaufspreis
Rotweinqualität: Chemische Eigenschaften → sensorische Qualität

Klassifikation

		classifier		label
$x$	$\Rightarrow$	$f$	$\Rightarrow$	$y \in \{-1, 1\}$

Betrugserkennung: Kreditkartentransaktion → Betrug oder kein Betrug
Toxische Kommentare: Online Kommentare → Toxisch oder nicht toxisch
Higgs Boson: Ereignis → Higgs Decay oder Hintergrundrauschen

Structured Prediction

$x$

$\Rightarrow$

$f$

$\Rightarrow$

$y$ ist ein komplexes Objekt

Übersetzung: Deutscher Satz → English sentence
Image captioning: Bild → beschreibender Satz
Image segmentation: Bild → Segmentierung

Daten

Ein Beispiel (Example) ist ein Tupel $(x,y)$ aus Input $x$ und dem korrekten Output (ground-truth) $y$

Trainingsdaten: Eine Liste von Beispielen

$\mathcal{D}_\text{train} = [\\(\text{"...a very important Donation..."}, +1),\\ (\text{"...sende ich Ihnen die Agenda..."}, -1),\\ \ldots]$

				$x$
				$\Downarrow$
$\mathcal{D}_\text{train}$	$\Rightarrow$	Lern Algorithmus	$\Rightarrow$	$f$
				$\Downarrow$
				$y$

Design Entscheidungen

Welche Eigenschaften des Inputs sind relevant? (Feature Extraction)
Welche Predictors sind möglich? (Hypothesis class)
Was ist ein guter Predictor? (Loss function)
Wie ermitteln wir den Predictor? (Optimierungs Algorithmus)

Feature Extraction

Ein Feature Extractor ist eine Funktion, die unseren Input in einen Feature Vektor abbildet: \[ \phi(x) \in \mathbb{R}^d \]

Beispiel

$x$ = "abc@gmail.com"

$\phi(x) = $	$13$	(Länge)
	$1$	(Länge > 10)
	$0.85$	(Anteil Buchst.)
	$1$	(Enthält @)
	$1$	(Endet auf .com)
	$0$	(Endet auf .org)

Machine Learning

Lineare Modelle

Am 1. Januar 1801 entdeckte der Astronom Giuseppe Piazzi den Zwergplanet Ceres. Er machte sehr detaillierte Notizen über Zeit und Position der Beobachtungen.

https://en.wikipedia.org/wiki/File:Costanzo_Angelini,_L%27astronomo_Piazzi_1825_ca.jpg

Wann kann man Ceres wieder beobachten?

https://en.wikipedia.org/wiki/File:Ceres-Beobachtung_von_Piazzi.png

Im September 1801 erstellte Carl F. Gauss ein Modell, welches die Sichtung von Ceres auf 0,5 Grad genau vorhersagte. Dazu verwendete er Linear Least-Squares Regression.

https://commons.wikimedia.org/wiki/File:Carl_Friedrich_Gauss_1840_by_Jensen.jpg

$\mathcal{D}_\text{train}$
$x$	$y$
$1$	$1$
$2$	$3$
$4$	$3$

Hypothesis class?

\[ f(x) = \frac{3}{5} x + 1 \]

\[ f(x) = \frac{2}{5} x + 1.5 \]

\[ f(x) = w_1 x + w_0 \]

Vektor Notation:

Gewichte (weights): $\color{red} w = [w_0, w_1]$

\[ f_w(x) = {\color{red} w} \cdot \phi(x) \]

$\phi(x) = [1, x]$ (Feature Extractor)

\[ f_{[1, 0.6]}(3) = [1, 0.6] \cdot [1, 3] = 1 + 0.6 \cdot 3 = 2.8 \]

Hypothesis class

\[ \mathcal{F} = \{ f_w : w \in \mathbb{R}^2 \} \]

bzw. \[ \mathcal{F} = \{ f_w : w \in \mathbb{R}^d \} \]

Loss function

Residual: ${\color{red} f_w}({\color{blue}x}) - {\color{blue} y}$
Squared Loss: $\operatorname{Loss}(x,y,w) = ({\color{red} f_w}({\color{blue}x}) - {\color{blue} y})^2$
Training Loss: \[\operatorname{TrainLoss}(w) = \frac{1}{|\mathcal{D}_\text{train}|} \sum_{(x,y)\in\mathcal{D}_\text{train}} \operatorname{Loss}(x,y,w)\]

$\operatorname{Loss}({\color{blue} 1},{\color{blue} 1},{\color{red} [1, 0.6]})$	$= ({\color{red} 1} + {\color{red} 0.6} \cdot {\color{blue} 1} - {\color{blue} 1})^2$	$= 0.36$
$\operatorname{Loss}({\color{blue} 2},{\color{blue} 3},{\color{red} [1, 0.6]})$	$= ({\color{red} 1} + {\color{red} 0.6} \cdot {\color{blue} 2} - {\color{blue} 3})^2$	$= 0.64$
$\operatorname{Loss}({\color{blue} 4},{\color{blue} 3},{\color{red} [1, 0.6]})$	$= ({\color{red} 1} + {\color{red} 0.6} \cdot {\color{blue} 4} - {\color{blue} 3})^2$	$= 0.16$
	$\operatorname{TrainLoss}({\color{red} [1, 0.6]})$	$=0.3867$

Andere Loss Funktionen sind möglich, zum Beispiel \[ Loss_\text{Abs.}(x,y,w) = | f_w(x) - y | \]

\[ \min_{w\in \mathbb{R}^n} \frac{1}{|\mathcal{D}_\text{train}|} \sum_{(x,y)\in\mathcal{D}_\text{train}} (f_w(x) - y)^2 \]

Ziel: Minimiere Training Loss

Wie berechnen wir $w$?

Analytische Lösung: \[ w = (X^T X)^{-1} X^T Y \]

Aber: Lösung sehr aufwendig für viele Features oder viele Beispiele

Alternative: Gradient Descent

Idee: Suche iterativ Schritte für kleine Verbesserungen

Grundlage vieler Machine Learning Algorithmen.

Grundlagen: Gradienten

Wir starten mit einer Funktion $f: \mathbb{R}^3 \to \mathbb{R}$: \[ f(\left[ {\begin{array}{c} x_1 \\ x_2 \\ x_3 \end{array}} \right]) = x_1^2 + \frac{1}{x_2} + x_3 \]

Was ist die Ableitung von $f$?

Was wäre die Ableitung von
$f(x_1) = x_1^2 + \frac{1}{x_2} + x_3$
mit $x_2$ und $x_3$ konstant (z.B. $x_2 = 2$, $x_3= 4$)

\[ f'(x_1) = 2 x_1 \]

Partielle Ableitungen

$f(x_1, x_2, x_3) = x_1^2 + \frac{1}{x_2} + x_3$

$\frac{\partial}{\partial x_1}f(x_1, x_2, x_3) = 2 x_1$
$\frac{\partial}{\partial x_2}f(x_1, x_2, x_3) = - \frac{1}{x_2^2}$
$\frac{\partial}{\partial x_3}f(x_1, x_2, x_3) = 1$

Wenn die Parameter "interagieren" ändert sich trotzdem nichts:

$f(x_1, x_2) = x_1 x_2^2$

$\frac{\partial}{\partial x_1}f(x_1, x_2) = x_2^2$
$\frac{\partial}{\partial x_2}f(x_1, x_2) = 2 x_1 x_2$

Die partielle Ableitung sagt uns, wie stark sich die Funktion in der Richtung ändert:

$f(x_1 + \epsilon, x_2) \approx f(x_1, x_2) + \epsilon \frac{\partial}{\partial x_1}f(x_1, x_2)$

$f(x_1, x_2 + \epsilon) \approx f(x_1, x_2) + \epsilon \frac{\partial}{\partial x_2}f(x_1, x_2)$

Beispiel: $f(x_1, x_2) = x_1 x_2^2$

$f(3, 2) = 12$

$\frac{\partial}{\partial x_1} f(x_1, x_2) = x_2^2$

$\frac{\partial}{\partial x_1} f(3, 2) = 2^2 = 4$

$f(3.1, 2) = 12.4$

$f(3,2) + 0.1 \frac{\partial}{\partial x_1} f(3, 2) = 12 + 0.1 \cdot 4 = 12.4$

Beispiel: $f(x_1, x_2) = x_1 x_2^2$

$f(3, 2) = 12$

$\frac{\partial}{\partial x_2} f(x_1, x_2) = 2 x_1 x_2$

$\frac{\partial}{\partial x_1} f(3, 2) = 2 \cdot 3 \cdot 2 = 12$

$f(3, 2.1) = 13.23$

$f(3,2) + 0.1 \frac{\partial}{\partial x_2} f(3, 2) = 12 + 0.1 \cdot 12 = 13.2$

Definition: Gradient

\[ \nabla f(x) = \left[ {\begin{array}{c} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \frac{\partial}{\partial x_3} f(x) \\ \vdots \end{array}} \right] \] Der Gradient ist der Vektor mit allen partiellen Ableitungen

Beispiel: $f(x_1, x_2) = x_1 x_2^2$

\[ \nabla f(x_1, x_2) = \left[ {\begin{array}{c} x_2^2 \\ 2 x_1 x_2 \end{array}} \right] \]

\[ \nabla f(3, 2) = \left[ {\begin{array}{c} 4 \\ 12 \end{array}} \right] \]

Der Gradient ist die Richtung in der $f$ am meisten ansteigt

Auch bei noch wilderen Funktionen funktioniert alles analog:$f: \mathbb{R}^{2\times 2} \to \mathbb{R}$ \[ f(\left[ {\begin{array}{cc} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{array}} \right]) = x_{1,1}^3 x_{2,1} - 2 x_{1,2} x_{2,1}^2 + 5 x_{2,2} \]

\[ \nabla f(\left[ {\begin{array}{cc} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{array}} \right]) = \left[ {\begin{array}{cc} 3 x_{1,1}^2 x_{2,1} & -2 x_{2,1}^2 \\ x_{1,1}^3 - 4 x_{1,2} x_{2,1} & 5 \end{array}} \right] \]

Iterative Lösung:

Starte an einem beliebigen Punkt und verbessere dich schrittweise.

Der Gradient $\nabla \operatorname{TrainLoss}(w)$ ist die Richtung, in die der TrainingLoss am meisten steigt.

Algorithmus: Gradient descent

$w^0 \gets [0,0,\ldots, 0]$
Für $t = 1, 2, 3, \ldots$ (Epoche)
- $w^t \gets w^{t-1} - \eta \nabla \operatorname{TrainLoss}(w^{t-1})$

$\eta$: Lernrate / step size

Abbruchkriterium

Maximale Iterationszahl
Gradient klein genug: $|| \nabla \operatorname{TrainLoss}(w^{t}) || < \epsilon$

\[ \operatorname{TrainLoss}(w) = \frac{1}{|\mathcal{D}_\text{train}|} \sum_{(x,y)\in\mathcal{D}_\text{train}} (w \cdot \phi(x) - y)^2 \]

\[ \nabla \operatorname{TrainLoss}(w) = \frac{1}{|\mathcal{D}_\text{train}|} \sum_{(x,y)\in\mathcal{D}_\text{train}} 2 ({\color{red} \underbrace{w \cdot \phi(x)}_\text{prediction}} - {\color{blue} \underbrace{y}_\text{target}}) \phi(x) \]

Jupyter

Klassifikation

$\mathcal{D}_\text{train}$
$x_1$	$x_2$	$y$
$1$	$1$	$+1$
$2$	$3$	$-1$
$4$	$3$	$+1$

Decision boundary

\[ f(x) = \operatorname{sign}({\color{red} \underbrace{[1.5, 0.2, -1]}_{w}} \cdot {\color{blue} \underbrace{[1, x_1, x_2]}_{\phi(x)}}) \] \[ \operatorname{sign} = \begin{cases} +1 && \text{If } x > 0 \\ 0 && \text{If } x = 0 \\ -1 && \text{else} \end{cases} \]

$f([1,1])$	$=\operatorname{sign}([1.5, 0.2, -1] \cdot [1, 1, 1])$	$=\operatorname{sign}(0.7)$	$=+1$
$f([2,3])$	$=\operatorname{sign}([1.5, 0.2, -1] \cdot [1, 2, 3])$	$=\operatorname{sign}(-1.1)$	$=-1$
$f([4,3])$	$=\operatorname{sign}([1.5, 0.2, -1] \cdot [1, 4, 3])$	$=\operatorname{sign}(-0.7)$	$=-1$

Generischer binärer Classifier

\[ f_w(x) = \operatorname{sign}(w \cdot \phi(x)) \]

Hypothesis class

\[ \mathcal{F} = \{ f_w : w \in \mathbb{R}^n \} \]

Loss function

Erste Idee: Richtig oder falsch?

Zero-one Loss: \[ \operatorname{Loss}_{0-1}(x,y,w) = 1[f_w(x) \neq y] \]

Indikatorfunktion: \[ 1[x] = \begin{cases} 1 && \text{If } x \\ 0 && \text{else} \end{cases} \]

\[ f(x) = \operatorname{sign}({\color{magenta} \underbrace{[1.5, 0.2, -1]}_{w}} \cdot { \underbrace{[1, x_1, x_2]}_{\phi(x)}}) \]

$\mathcal{D}_\text{train}$
$x_1$	$x_2$	$y$
$\color{green}1$	$\color{green}1$	$\color{blue}+1$
$\color{green}2$	$\color{green}3$	$\color{red}-1$
$\color{green}4$	$\color{green}3$	$\color{blue}+1$

$\operatorname{Loss}_{0-1}({\color{green} [1,1]}, {\color{blue} 1}, {\color{magenta} [1.5, 0.2, -1]})$	$=1[\operatorname{sign}(0.7) \neq {\color{blue} 1}]$	$=0$
$\operatorname{Loss}_{0-1}({\color{green} [2,3]}, {\color{red} -1}, {\color{magenta} [1.5, 0.2, -1]})$	$=1[\operatorname{sign}(-1.1) \neq {\color{red} -1}]$	$=0$
$\operatorname{Loss}_{0-1}({\color{green} [4,3]}, {\color{blue} 1}, {\color{magenta} [1.5, 0.2, -1]})$	$=1[\operatorname{sign}(-0.7) \neq {\color{blue} 1}]$	$=1$
$\operatorname{TrainLoss}({\color{magenta} [1.5, 0.2, -1]})$		$=0.3333$

Definitionen

Predicted Label: $f_w(x) = \operatorname{sign}(w \cdot \phi(x))$
Target Label: $y$
Score: $w \cdot \phi(x)$
Margin: $(w \cdot \phi(x)) y$

\[ \operatorname{Loss}_{0-1}(x,y,w) = 1[\operatorname{sign}(\underbrace{w \cdot \phi(x)}_\text{score}) \neq y] \]

\[ = 1[\underbrace{(w \cdot \phi(x))y}_\text{margin} \leq 0] \]

Können wir jetzt gradient descent verwenden, um $\min_w \operatorname{TrainLoss}(w)$ zu ermitteln?

$\nabla \operatorname{TrainLoss}(w)$ ist praktisch immer $0$!

Hinge Loss

\[ \operatorname{Loss}_\text{hinge}(x,y,w) = \max(1 - \underbrace{(w \cdot \phi(x))y}_\text{margin}, 0) \]

\[ \operatorname{Loss}_\text{logistic}(x,y,w) = \log(1 + e^{-(w \cdot \phi(x))y}) \]

Hinge loss Gradient

\[ \operatorname{Loss}_\text{hinge}(x,y,w) = \max(1 - (w \cdot \phi(x))y, 0) \]

\[ \nabla \operatorname{Loss}_\text{hinge}(x,y,w) = \begin{cases} -\phi(x)y && \text{If } 1 - (w \cdot \phi(x)) y > 0 \\ 0 && \text{else} \end{cases} \]

\[ f(x) = \operatorname{sign}({\color{magenta} \underbrace{[1.5, 0.2, -1]}_{w}} \cdot { \underbrace{[1, x_1, x_2]}_{\phi(x)}}) \]

$\mathcal{D}_\text{train}$
$x_1$	$x_2$	$y$
$\color{green}1$	$\color{green}1$	$\color{blue}+1$
$\color{green}2$	$\color{green}3$	$\color{red}-1$
$\color{green}4$	$\color{green}3$	$\color{blue}+1$

$({\color{green} x}, y, {\color{magenta} w})$	$\operatorname{Loss}_\text{hinge}$		$\nabla \operatorname{Loss}_\text{hinge}$
$({\color{green} [1,1]}, {\color{blue} 1}, {\color{magenta} [1.5, 0.2, -1]})$	$\max(1 - (0.7 \cdot {\color{blue} 1}), 0)$	$=0.3$	- $[1, {\color{green} 1}, {\color{green} 1}] \cdot {\color{blue} 1}=$	$[-1, -1, -1]$
$({\color{green} [2,3]}, {\color{red} -1}, {\color{magenta} [1.5, 0.2, -1]})$	$\max(1 - (-1.1 \cdot {\color{red} -1}), 0)$	$=0$		$[0,0,0]$
$({\color{green} [4,3]}, {\color{blue} 1}, {\color{magenta} [1.5, 0.2, -1]})$	$\max(1 - (-0.7 \cdot {\color{blue} 1}), 0)$	$=1.7$	- $[1, {\color{green} 4}, {\color{green} 3}] \cdot {\color{blue} 1}=$	$[-1, -4, -3]$
$\operatorname{TrainLoss}({\color{magenta} [1.5, 0.2, -1]})$		$=0.67$	$\nabla \operatorname{TrainLoss}({\color{magenta} [1.5, 0.2, -1]})=$	$[-0.67, -1.67, -1.33]$

Jupyter

	Regression	Classification
Prediction $f_w$	score	$\operatorname{sign}$(score)
Abgleich mit $y$	residual (score - $y$)	margin (score $\cdot y$)
Loss	quadratisch absolut	zero-one hinge logistic
Opt. Algorithmus	gradient descent	gradient descent

\[ \nabla \operatorname{TrainLoss}(w) = \frac{1}{|\mathcal{D}_\text{train}|} \sum_{(x,y)\in\mathcal{D}_\text{train}} \nabla \operatorname{Loss}(x,y,w) \]

Gradient descent skaliert schlecht mit der Anzahl Beispielen.

Können wir auch schon Fortschritt machen, bevor wir alle Beispiele gesehen haben?

Algorithmus: Gradient descent

$w^0 \gets [0,0,\ldots, 0]$
Für $t = 1, 2, 3, \ldots$ (Epoche)
- $w^t \gets w^{t-1} - \eta \nabla \operatorname{TrainLoss}(w^{t-1})$

Algorithmus: Stochastic gradient descent (SGD)

$w^0 \gets [0,0,\ldots, 0]$
Für $t = 1, 2, 3, \ldots$ (Epoche)
- Für $(x,y) \in \mathcal{D}_\text{train}$
  - $w^t \gets w^{t-1} - \eta {\color{red}\nabla \operatorname{Loss}(x,y,w^{t-1})}$

Wie wählen wir die Schrittgröße $\eta$?

Schnelle Konvergenz vs Stabilität

Simple Strategien:
- Konstante: $\eta = 0.01$
- Mit der Zeit fallend: $\eta = \frac{1}{\sqrt{\text{\# updates}}}$
Komplexere Strategien:
- AdaGrad
- Adam

Jupyter

Kernidee: Lieber viele kleine und ungenaue, als ein präzises Update

Quantität über Qualität

Machine Learning

Feature Engineering

Macht lineare Regression hier Sinn?

Gibt es hier einen guten linearen Classifier?

Statt direkt neuronale Netzwerke oder andere kompliziertere Algorithmen zu nehmen, können wir auch an unseren Featuren arbeiten:

\[ \phi(x) = [1, x, x^2] \]

\[ f(x) = {\color{orange} [18, -8, 1]} \cdot \phi(x)\\ = 18 - 8 x + x^2 \]

$f$ ist linear in $\phi(x)$, jedoch nicht linear in $x$

Die Anzahl quadratischer Features wächst quadratisch mit der Dimension von $x$

\[ x = [x_1, x_2, x_3, \ldots, x_n] \]

\[ \phi(x) = [1, x_1, x_2, \ldots, x_n,\\ x_1^2, x_1 x_2, x_1, x_3, \ldots, x_1 x_n,\\ x_2^2, x_2 x_3, \ldots, x_2 x_n,\\ \ldots,\\ x_n^2] \]

Wir können bei der Wahl von $\phi$ beliebig kreativ werden.

\[ \color{orange} \phi(x) = [1[x < 2], 1[2 \leq x < 4], 1[4 \leq x < 6], 1[6 \leq x]] \]

\[ \color{blue} \phi(x) = [1, x, \cos(x), \cos(2x)] \]

Auswahl guter Features ist eine Kunst für sich.

Domänenwissen kann eine große Rolle spielen.

Aus Sicht des Algorithmus ist jedes Feature optional.

Für Classifier funktioniert das analog.

\[ \phi(x) = [1, x_1, x_2, x_1^2, x_2^2] \] \[ \color{orange} w = [-23, 8, 8, -1, -1] \] Decision boundary: \[ \color{orange} (x_1-4)^2 + (x_2-4)^2 = 3^2 \]

Wie kommen wir allgemein an sinnvolle Features?

$x$ = "abc@gmail.com"

$\phi(x) = $	$13$	(Länge)
	$1$	(Länge > 10)
	$0.85$	(Anteil Buchst.)
	$1$	(Enthält @)
	$1$	(Endet auf .com)
	$0$	(Endet auf .org)

Organisation in Feature Templates

Feature Template: Gruppe Features, die nach dem gleichen System berechnet werden.

$x$ = "abc@gmail.com"

Template: Endet auf ___

$0$	(Endet auf .au)
$1$	(Endet auf .com)
$0$	(Endet auf .de)
$0$	(Endet auf .es)
...	...
$0$	(Endet auf .org)

Feature Template Beispiel

lon = 49.9530
lat = 7.9308

Template	Beispiel
Pixel Intensität in Zeile __ und Spalte __ im Kanal __	Pixel Intensität in Zeile 10 und Spalte 93 im Kanal rot
Latitude liegt zwischen ___ und ___	Latitude liegt zwischen 7.9 und 8.0
Longitude liegt zwischen ___ und ___	Longitude liegt zwischen 49.9 und 50.0

https://www.google.com/maps

In der Praxis sind binäre Feature oft besser als ganzzahlige

Länge der Email Adresse
vs
Länge der Email Adresse ist __

Der Algorithmus interessiert sich nicht für die Templates - wir aber schon!

Beispiel: Unser trainierter Classifier hat Gewichte $w=[0.3, 1.2, -2.1, 204, 1.3, -0.7]$
Welches Feature hat das Gewicht $204$ bekommen?

Wir sollten uns merken, welcher Index welchem benahmtem Feature entspricht:

Idx	Feature Name
0	Endet auf org
1	Endet auf com
2	Endet auf edu
3	Endet auf es
4	Endet auf fr
5	Endet auf de

Je nach Anwendung kann der Feature Vektor dünn besetzt sein.

Beispiel - Feature Template: Email endet auf ___

Statt Vektor \[[0,0,0,0,0,0,\ldots, 0, 1, 0, \ldots, 0, 0, 0]\] speichert man hier lieber ein Mapping: Idx $\to$ Value

{ 123: 1, 456: 1, 123456: 0.4 }

Machine Learning

Neuronale Netzwerke

Mit neuronalen Netzwerken können einen nicht linearen Predictor bauen, auch ohne nicht lineare Features $\phi$.

Alle potentiell sinnvollen nicht linearen Features in $\phi$ abzubilden ist oft nicht machbar, wenn $d$ sehr groß wird.

2-layer neural network classifier

\[ h_{\color{red} V}(x) = \sigma({\color{red}V} \phi(x)) \]

\[ f_{\color{red}V, w}(x) = \operatorname{sign}({\color{red}w} \cdot h_{\color{red} V}(x)) \]

$\sigma$ ist die sogenannte Aktivierungsfunktion (mehr später)

Hypothesis class: \[ \mathcal{F} = \{ f_{\color{red}V, w} : {\color{red}V} \in \mathbb{R}^{k \times d}, {\color{red}w} \in \mathbb{R}^k \} \]

\[ f_{V,w}(x) = \operatorname{sign}( \underbrace{ \left[ {\begin{array}{ccc} \bullet && \bullet && \bullet \end{array}} \right] }_w \cdot \sigma( \underbrace{ \left[ {\begin{array}{cccc} \bullet && \bullet && \bullet && \bullet \\ \bullet && \bullet && \bullet && \bullet \\ \bullet && \bullet && \bullet && \bullet \end{array}} \right] }_V \underbrace{ \left[ {\begin{array}{c} \bullet \\ \bullet \\ \bullet \\ \bullet \end{array}} \right] }_{\phi(x)} ) ) \]

\[ h_{\color{red} V}(x) = \sigma({\color{red}V} \phi(x)) \] ist praktisch eine Art gelernter Feature Extraktor $\phi'(x)$

Das neuronale Netz ist nur ein linearer Predictor über $\phi'(x)$

Was passiert, wenn wir $\sigma$ weglassen?

\[ f_{V,w}(x) = \operatorname{sign}(\underbrace{w \cdot V}_{ = w'} \phi(x)) = \operatorname{sign}(w' \cdot \phi(x)) = f_{w'}(x) \]

Wir können $w \in \mathbb{R}^k$ und $V \in \mathbb{R}^{k \times d}$ zu einem Vektor $w' \in \mathbb{R}^d$ zusammenfassen. Der Classifier bleibt linear.

Gute Aktivierungsfunktion:

Nicht linear
Effizient zu berechnen
Einfach differenzierbar
Vermeide 0 Gradienten

Populäre Aktivierungsfunktionen:

Logistic: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Rectified Linear Unit (ReLU): $\sigma(z) = \max(0, z)$

1-layer neural network \[ score = w \cdot \phi(x) \]
2-layer neural network \[ score = w \cdot \sigma( V \phi(x) ) \]
3-layer neural network \[ score = w \cdot \sigma( V_2 \sigma(V_1 \phi(x)) ) \]
...

Jeder Layer im neuronalen Netzwerk ist eine Funktion welche die Eingangs Feature auf neue, abstraktere Feature abbildet.

https://commons.wikimedia.org/wiki/File:Deep_Learning.jpg

Warum brauchen wir Tiefe?

\[ \phi(x) \Rightarrow h_1(x) \Rightarrow h_2(x) \Rightarrow h_3(x) \Rightarrow \ldots \Rightarrow \text{score} \]

Mehr Abstraktion
Mehr Nicht-Linearität
Funktioniert in der Praxis
Theoretisch wissen wir das (noch) nicht

Wie trainieren wir ein Neuronales Netz?

\[ \operatorname{Loss}(x,y,V_1, V_2, V_3, w) = (w \cdot \sigma(V_3 \sigma(V_2 \sigma(V_1 \phi(x)))) - y)^2 \]

Stochastic gradient descent

\[ V_1 \gets V_1 - \eta \nabla_{\color{red} V_1} \operatorname{Loss}(x, y, {\color{red} V_1}, V_2, V_3, w) \\ V_2 \gets V_2 - \eta \nabla_{\color{red} V_2} \operatorname{Loss}(x, y, V_1, {\color{red} V_2}, V_3, w) \\ V_3 \gets V_3 - \eta \nabla_{\color{red} V_3} \operatorname{Loss}(x, y, V_1, V_2, {\color{red} V_3}, w) \\ w \gets w - \eta \nabla_{\color{red} w} \operatorname{Loss}(x, y, V_1, V_2, V_3, {\color{red} w}) \\ \]

Wie berechnen wir die Gradienten?

Variante 1: Per Hand, als Hausaufgabe
Variante 2: Automatisch mit Backpropagation

Computation Graph: Ein gerichteter, azyklischer Graph. Jeder Knoten repräsentiert eine mathematische Operation.

\[ c = a + b \]

Gradient: Wie viel ändert sich $c$, wenn sich $a$ oder $b$ ändert?

\[ (a+{\color{red}\epsilon}) + b = c + {\color{green} 1} {\color{red} \epsilon} \]	\[ \frac{\partial c}{\partial a} = {\color{green} 1} \]
\[ a + (b+{\color{red}\epsilon}) = c + {\color{green} 1} {\color{red} \epsilon} \]	\[ \frac{\partial c}{\partial b} = {\color{green} 1} \]

\[ c = a \cdot b \]

\[ (a+{\color{red}\epsilon}) b = c + {\color{green} b} {\color{red} \epsilon} \]	\[ \frac{\partial c}{\partial a} = {\color{green} b} \]
\[ a (b+{\color{red}\epsilon}) = c + {\color{green} a} {\color{red} \epsilon} \]	\[ \frac{\partial c}{\partial b} = {\color{green} a} \]

Weitere Bausteine

Subtraktion

Quadrat

ReLU

Maximum

Komplexe Funktionen bauen wir uns aus simplen Grundoperationen.

\[ d = (a \cdot b)^2 \]

Kettenregel

\[ \frac{\partial d}{\partial a} = {\color{green} \frac{\partial d}{\partial c}\frac{\partial c}{\partial a}} = (2c) (b) = (2ab)(b) = 2ab^2 \]

Linearer Classifier mit Hinge Loss

\[ \operatorname{Loss}(x,y,{\color{red} w}) = \max(1 - {\color{red} w} \cdot \phi(x) y, 0) \]

\[ \nabla_{\color{red} w} \operatorname{Loss}(x,y,{\color{red}w}) = {\color{green} 1[\text{margin} < 1] (-1) y \phi(x)} \]

2-layer neural network

\[ \operatorname{Loss}(x,y, {\color{red} V}, {\color{red} w}) = (w \cdot \operatorname{ReLU}(V \phi(x))-y)^2 \]

\[ \nabla_{\color{red} w} \operatorname{Loss}(x,y, {\color{red} V}, {\color{red} w}) = {\color{green} 2\text{residual} \cdot 1 \cdot h} \]

\[ \nabla_{\color{red} V} \operatorname{Loss}(x,y, {\color{red} V}, {\color{red} w}) = \\{\color{green} 2\text{residual} \cdot 1 \cdot w \cdot 1[V\phi(x) > 0] \cdot \phi(x)} \]

Backpropagation

Für jeden Knoten $i$ definieren wir 2 Werte:

$\color{orange} f_i$: den evaluierten Wert im Knoten $i$
${\color{magenta} g_i} = \frac{\partial \text{Parent}}{\partial i}(f_i)$: Einfluss von $i$ auf Elternknoten am Punkt $f_i$

Algorithmus: Backpropagation

Berechne $\color{orange} f_i$ von den Blättern zur Wurzel
Berechne $\color{magenta} g_i$ von der Wurzel zu den Blättern

Linearer Classifier mit Hinge Loss

$ y = -1 $
$ \phi(x) = [-1, 2] $
$ w = [1,2] $	$ \color{magenta} g_w = \phi(x) $
$ \color{orange} f_\text{score} = 3 $	$ \color{magenta} g_\text{score} = 1 $
$ \color{orange} f_\text{margin} = -3 $	$ \color{magenta} g_\text{margin} = -1 $
$ \color{orange} f_a = 4 $	$ \color{magenta} g_a = 1 $
$ \color{orange} f_\text{Loss} = 4 $

Wofür brauchen wir das alles, wenn wir das auch von PyTorch erledigen lassen können?

Debugging!

$\min \operatorname{TrainLoss}(w)$ ist bei einem linearen Predictor konvex und leicht zu berechnen.

Bei neuronalen Netzen ist der Loss nicht konvex und Suche nach einer guten Lösung viel schwerer!

Initiale Parameter

Wir dürfen $w$ und $V_i$ nicht mit 0 initialisieren (wg. Symmetrie)

Lösung: Entweder zufällige Werte oder aus anderem Netzwerk kopieren

Machine Learning

Best practices

Algorithmus: Auswendig Lernen

Training: Speichere $\mathcal{D}_\text{train}$
Inference: \[ f(x) = \begin{cases} y && \text{if } (x,y) \in \mathcal{D}_\text{train} \\ 0 && \text{else} \end{cases} \]

TrainLoss ist 0. Warum taugt der Algorithmus trotzdem nichts?

Problem: Overfitting

Overfitting bedeuted, dass unser Predictor gut auf den Trainingsdaten, aber schlecht auf neuen, noch nicht gesehenen Daten funktioniert. Er generalisiert schlecht.

Overfitting

https://commons.wikimedia.org/wiki/File:Overfitting.svg
https://commons.wikimedia.org/wiki/File:Overfitted_Data.png

Noch ein Beispiel für overfitting:

https://xkcd.com/1122/

Umgekehrt ist es auch nicht gut, das Trainingsset zu schlecht zu treffen (underfitting).

Wir suchen nach einem guten Mittelweg.

https://commons.wikimedia.org/wiki/File:Underfitting_e_overfitting.png

Wie können wir auswerten, wie gut unser Predictor $f$ wirklich ist?

Wir nehmen uns ein Test Set $\mathcal{D}_\text{test}$ mit Beispielen, die nicht zum Training genutzt werden.

Neue Metrik $\operatorname{TestLoss}$: Loss auf dem Testset

Freiheitsgrade

Jupyter

Für Polynome gilt: Ein Polynom vom Grad $n$ kann $n+1$ Datenpunkte "immer" exakt treffen.

Allgemein: Mehr Parameter (Freiheitsgrade / degrees of freedom) → das Modell kann mehr abbilden und mehr auswendig lernen

Mehr Daten verhindert auswendig lernen → wir können komplexere Modelle mit weniger Sorge verwenden.

Wenn wir wenige Daten haben, müssen wir selektiver mit Features und Parametern sein

Als Parameter zählen nicht nur die Werte in $w$ oder $V$, sondern auch:

Die Dimension von $w$ und $V$
Die Anzahl an Layern eines Netzes
Welche Aktivierungsfunktion wir nehmen
Welche Lernrate wir wählen
....

→ Hyperparameter

Wie kalibrieren wir unsere Hyperparameter?

Hyperparameter so wählen, dass TrainLoss minimiert wird?
Nein, Optimum wäre immer möglichst viele Features / Parameter wählen
Hyperparameter so wählen, dass TestLoss minimiert wird?
Nein, damit trainieren wir die Hyperparameter auf dem Test Set und entwerten es.

Wir schneiden aus den Testdaten noch einen dritten Datensatz heraus: Das Validation Set $\mathcal{D}_\text{val}$

Für jede (sinnvolle) Kombination an Hyperparametern können wir jetzt den Prediktor ohne Validation Set trainieren und über den Loss auf dem Validation Set vergleichen.

Best practice: Wir sollten das Test Set nur möglichst selten verwenden und nur, um die finale Qualität des Predictors festzustellen.

Wie groß sollten Test und Validation Set sein?

Groß genug für sinnvolle Abschätzung vom Loss, aber nicht zu viel von Trainingsdaten nehmen.

Es kann Sinn machen, Metriken nach Teilgruppen aufzuschlüsseln

Gruppe	TrainLoss	ValidationLoss
Alle	1000	1100
Rheinland-Pfälzer	950	1030
Hessen	1500	1800

Einfach starten
- Simples Modell
- Kleine Teilmenge der Daten
- Synthetische Daten
- Sanity check: Overfitted das Modell eine handvoll Beispiele?
Alles mitloggen:
- Entwicklung der Gradienten
- Training und Validation Loss während des Trainings
- Hyperparameter Settings
Mehr Metriken (z.B. aufgeschlüsselt nach Teilgruppen)