Artificial Intelligence

Neural networks

Logistic regression is a 1-layer neural network

\[ z = w^T x + b \\ \hat y = \sigma(z) \]

Next step: add another layer.

\[ z^{[1]}_1 = {w^{[1]}_1}^T x + b^{[1]}_1 \\ a^{[1]}_1 = \sigma(z^{[1]}_1) \] \[ z^{[1]}_2 = {w^{[1]}_2}^T x + b^{[1]}_2 \\ a^{[1]}_2 = \sigma(z^{[1]}_2) \] \[ z^{[1]}_3 = {w^{[1]}_3}^T x + b^{[1]}_3 \\ a^{[1]}_3 = \sigma(z^{[1]}_3) \] \[ z^{[2]} = {w^{[2]}}^T a^{[1]} + b^{[2]} \\ \hat y = a^{[2]} = \sigma(z^{[2]}) \]

More vectorization for the hidden layer

\[ z^{[1]} = W^{[1]} x + b^{[1]} \\ a^{[1]} = \sigma(z^{[1]}) \]

\[ W^{[1]} = \begin{bmatrix} - & w^{[1]}_1 & - \\ - & w^{[1]}_2 & - \\ - & w^{[1]}_3 & - \end{bmatrix} \in \mathbb{R}^{3\times 3} \quad b^{[1]} = \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \end{bmatrix} \quad a^{[1]} = \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix} \]

Putting it all together

\[ \hat y = \underbrace{\sigma(\underbrace{{w^{[2]}}^T \underbrace{\sigma(\underbrace{W^{[1]} x + b^{[1]}}_{z^{[1]}})}_{a^{[1]}} + b^{[2]}}_{z^{[2]}})}_{a^{[2]}} \]
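
The same computation as a small numpy sketch (sizes and variable names are illustrative, matching the 3-unit hidden layer above; `w2` is stored directly as a row vector):

          import numpy as np

          def sigmoid(z):
              return 1 / (1 + np.exp(-z))

          # illustrative sizes: 3 inputs, 3 hidden units, 1 output
          x = np.random.randn(3, 1)
          W1, b1 = np.random.randn(3, 3), np.random.randn(3, 1)
          w2, b2 = np.random.randn(1, 3), np.random.randn(1, 1)

          a1 = sigmoid(np.matmul(W1, x) + b1)      # z[1], a[1]
          y_hat = sigmoid(np.matmul(w2, a1) + b2)  # z[2], a[2]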

Single neuron with one output

One neural network layer

Getting matrix and vector dimensions right

  • Size of input vector: $n^{[0]}$
  • Size of activation vector: $n^{[1]}$

\[ W \in \mathbb{R}^{n^{[1]} \times n^{[0]}} \quad b \in \mathbb{R}^{n^{[1]}} \quad a \in \mathbb{R}^{n^{[1]}} \quad \] \[ a = \sigma(W x + b) \]

\[ \underbrace{A}_{\mathbb{R}^{{\color{blue} l} \times {\color{green} m}}} \; \underbrace{B}_{\mathbb{R}^{{\color{green} m} \times {\color{red}n}}} = \underbrace{C}_{\mathbb{R}^{{\color{blue} l} \times {\color{red}n}}} \]

\[ \begin{bmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix} \begin{bmatrix} \cdot & \cdot\\ \cdot & \cdot\\ \cdot & \cdot\\ \end{bmatrix} = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \end{bmatrix} \]

With broadcasting we can vectorize a layer for a batch of examples $X$


            import numpy as np

            X = np.random.randn(4, 5)  # 4 features, batch of 5 examples
            W = np.random.randn(3, 4)  # 3 hidden units
            b = np.random.randn(3, 1)  # broadcast across the batch

            Z = np.matmul(W, X) + b    # Z has shape (3, 5)
          

The $a^{[1]} = \sigma(z^{[1]})$ are the activations, $\sigma$ is the activation function.

Why do we need an activation function?

Let's leave it out

$\hat y = {W^{[2]}} ( W^{[1]} x + b^{[1]} ) + b^{[2]}$

$= \underbrace{({W^{[2]}} W^{[1]})}_{w'} x + \underbrace{(W^{[2]} b^{[1]} + b^{[2]})}_{b'}$

$= w' x + b'$
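
This collapse is easy to check numerically; a minimal sketch with arbitrarily chosen shapes:

          import numpy as np

          W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
          W2, b2 = np.random.randn(2, 4), np.random.randn(2, 1)
          x = np.random.randn(3, 1)

          two_layers = np.matmul(W2, np.matmul(W1, x) + b1) + b2
          w_prime = np.matmul(W2, W1)
          b_prime = np.matmul(W2, b1) + b2
          one_layer = np.matmul(w_prime, x) + b_prime

          print(np.allclose(two_layers, one_layer))  # True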

We can choose a different activation function than the sigmoid

\[ a^{[i]} = {\color{red} g^{[i]}}(W^{[i]} a^{[i-1]} + b^{[i]}) \]

Sigmoid

$g(z) = \frac{1}{1+e^{-z}}$

Hyperbolic tangent

$g(z) = \operatorname{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

Rectified Linear Unit (ReLU)

$g(z) = \max(z, 0) = \begin{cases} z & \text{if } z > 0\\ 0 & \text{else} \end{cases}$

Leaky ReLU

$g(z) = \max(z, 0.01z) = \begin{cases} z & \text{if } z > 0\\ 0.01 z & \text{else} \end{cases}$

Linear

$g(z) = z$

Those are the most common choices, but many more are possible.

Should be nonlinear and have a derivative (almost) everywhere
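
For reference, the activation functions above as numpy one-liners (a sketch; the 0.01 slope of the leaky ReLU is the constant used above):

          import numpy as np

          def sigmoid(z):
              return 1 / (1 + np.exp(-z))     # maps to (0, 1)

          def tanh(z):
              return np.tanh(z)               # maps to (-1, 1)

          def relu(z):
              return np.maximum(z, 0)         # zero for negative z

          def leaky_relu(z):
              return np.maximum(z, 0.01 * z)  # small slope for negative z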

The jobs of the two parts of a neuron:

  • allow inputs to interact
  • introduce non-linearity

The XOR problem

\[ \begin{array}{cc|c} x_1 & x_2 & y \\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{array} \]

Cannot be solved perfectly by any linear classifier
(1-layer neural network).

But it is possible with 2 layers

\[ a_1 = \operatorname{ReLU}(x_1 - x_2) \quad a_2 = \operatorname{ReLU}(-x_1 + x_2) \quad \hat y = a_1 + a_2 \quad \]

\[ \begin{array}{cc|c|cc|c} x_1 & x_2 & y & a_1 & a_2 & \hat y \\ \hline 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 & 0 \end{array} \]
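
A quick numerical check of this construction (a sketch):

          import numpy as np

          def relu(z):
              return np.maximum(z, 0)

          x1 = np.array([0, 0, 1, 1])
          x2 = np.array([0, 1, 0, 1])

          a1 = relu(x1 - x2)   # ReLU(x1 - x2)
          a2 = relu(-x1 + x2)  # ReLU(-x1 + x2)
          print(a1 + a2)       # [0 1 1 0] = x1 XOR x2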

Plotting after the first layer

[Figure: the transformed data after the first layer, with ReLU vs. without ReLU]

Derivatives of activation functions

Sigmoid

$g(z) = \frac{1}{1+e^{-z}}$

$\frac{d}{dz}g(z) = \frac{1}{1 + e^{-z}} (1- \frac{1}{1 + e^{-z}}) = g(z)(1-g(z))$

Hyperbolic tangent

$g(z) = \operatorname{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$

$\frac{d}{dz}g(z) = 1-\operatorname{tanh}(z)^2 = 1 - g(z)^2$

Rectified Linear Unit (ReLU)

$g(z) = \max(z, 0)$

$\frac{d}{dz}g(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

Leaky ReLU

$g(z) = \max(z, 0.01z)$

$\frac{d}{dz}g(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$

Gradient descent with multiple layers

    More parameters:

  • $W^{[1]} \in \mathbb{R}^{n^{[1]} \times n^{[0]}}$
  • $b^{[1]} \in \mathbb{R}^{n^{[1]}}$
  • $W^{[2]} \in \mathbb{R}^{n^{[2]} \times n^{[1]}}$
  • $b^{[2]} \in \mathbb{R}^{n^{[2]}}$

Cost function: \[ J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat y^{(i)}, y^{(i)}) \]

Update step: \[ W^{[1]} \gets W^{[1]} - \eta \nabla_{W^{[1]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] b^{[1]} \gets b^{[1]} - \eta \nabla_{b^{[1]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] W^{[2]} \gets W^{[2]} - \eta \nabla_{W^{[2]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] b^{[2]} \gets b^{[2]} - \eta \nabla_{b^{[2]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] \]
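
In code, one iteration of the update step looks like this (a sketch; `eta` is the learning rate $\eta$, and the gradient arrays come from backpropagation below):

          eta = 0.01  # learning rate (illustrative value)

          W1 = W1 - eta * grad_W1
          b1 = b1 - eta * grad_b1
          W2 = W2 - eta * grad_W2
          b2 = b2 - eta * grad_b2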

    Forward propagation

  • $z^{[1]} = W^{[1]} x + b^{[1]}$
  • $a^{[1]} = g^{[1]}(z^{[1]})$
  • $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
  • $\hat y = a^{[2]} = g^{[2]}(z^{[2]}) = \sigma(z^{[2]})$

    Backward propagation

  • $\nabla_{z^{[2]}} J = a^{[2]} - y$ (for the cross-entropy loss with sigmoid output)
  • $\nabla_{W^{[2]}} J = (\nabla_{z^{[2]}} J) {a^{[1]}}^T$
  • $\nabla_{b^{[2]}} J = \nabla_{z^{[2]}} J$
  • $\nabla_{a^{[1]}} J = {W^{[2]}}^T \nabla_{z^{[2]}} J$
  • $\nabla_{z^{[1]}} J = \nabla_{a^{[1]}} J \odot {g^{[1]}}'(z^{[1]})$
  • $\nabla_{W^{[1]}} J = (\nabla_{z^{[1]}} J) x^T$
  • $\nabla_{b^{[1]}} J = \nabla_{z^{[1]}} J$

The gradient computations can be vectorized over an entire batch. Remember: we need to average over the batch, i.e., divide by $m$.


            # grad_Z2 = A2 - Y comes from the output layer (see above)
            grad_A1 = np.matmul(W2.T, grad_Z2)
            grad_Z1 = grad_A1 * activation_gradient_1(Z1)  # elementwise product
            grad_W1 = 1/m * np.matmul(grad_Z1, X.T)        # average over the m examples
            grad_b1 = 1/m * np.sum(grad_Z1, axis=1, keepdims=True)
          

Initial parameter values

For logistic regression we can start with setting $w$ and $b$ to 0.

Can we also do this now?

If all parameters are the same, then all neurons are symmetric.

\[ \nabla_{W^{[1]}} J = \begin{bmatrix} u & v & \cdots \\ u & v & \cdots \\ \vdots & \vdots & \vdots \\ u & v & \cdots \end{bmatrix} \]

No matter how many iterations we train, every neuron will stay the same.

Solution: Start with random values


          import numpy as np

          W1 = np.random.randn(n1, n0) * 0.01  # small random values break the symmetry
          b1 = np.zeros([n1, 1])               # biases can start at zero
        

Deep neural networks are now straightforward:

  • $l$: Number of layers
  • $n^{[i]}$: Number of units in layer $i$
    • $n^{[0]}$: Number of input features
    • $n^{[l]}$: Number of outputs (predictions)
  • $a^{[i]}$: Activations in layer $i$
    • $\hat y = a^{[l]}$
  • $g^{[i]}$: Activation function in layer $i$
  • $W^{[i]}, b^{[i]}$: parameters in layer $i$
  • $a^{[i]} = g^{[i]}(z^{[i]}) \qquad z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}$

Example dimensions for a network with $l = 4$ layers:

  • $a^{[3]} \in \mathbb{R}^4$
  • $b^{[1]} \in \mathbb{R}^3$
  • $W^{[2]} \in \mathbb{R}^{5 \times 3}$
  • $\nabla_{b^{[1]}} J \in \mathbb{R}^3$
  • $\nabla_{W^{[2]}} J \in \mathbb{R}^{5 \times 3}$

Forward propagation

    For every layer $i = 1, \ldots, l$:

  • $z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]}$
  • $a^{[i]} = g^{[i]}(z^{[i]})$
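
In code, the forward pass is a single loop over the layers (a sketch in the same indexed style as the backward-pass snippet below; `A[0]` holds the input batch and `g[i]` is the activation function of layer $i$):

          A[0] = X                                   # input batch
          for i in range(1, l + 1):
              Z[i] = np.matmul(W[i], A[i-1]) + b[i]  # linear step
              A[i] = g[i](Z[i])                      # activation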

Backward propagation

    For every layer $i = l, \ldots, 1$:

  • $\nabla_{z^{[i]}} J = \nabla_{a^{[i]}} J \odot {g^{[i]}}'(z^{[i]})$
  • $\nabla_{W^{[i]}} J = (\nabla_{z^{[i]}} J) {a^{[i-1]}}^T $
  • $\nabla_{b^{[i]}} J = \nabla_{z^{[i]}} J$
  • $\nabla_{a^{[i-1]}} J = {W^{[i]}}^T \nabla_{z^{[i]}} J$

Again, when batching multiple examples, we need to average the gradients:


          # inside the loop over layers i = l, ..., 1:
          grad_Z[i] = grad_A[i] * activation_gradient[i](Z[i])  # elementwise product
          grad_W[i] = 1/m * np.matmul(grad_Z[i], A[i-1].T)      # average over the batch
          grad_b[i] = 1/m * np.sum(grad_Z[i], axis=1, keepdims=True)
          grad_A[i-1] = np.matmul(W[i].T, grad_Z[i])            # gradient for layer i-1
        

Why do we need more layers?

Non-linear relationships

Hierarchy of feature abstraction

https://commons.wikimedia.org/wiki/File:Deep_Learning.jpg

For some tasks, a shallow neural network needs exponentially more neurons than a deep one to reach the same performance.

Example:
$y = x_1 \operatorname{XOR} x_2 \operatorname{XOR} x_3 \operatorname{XOR} \ldots \operatorname{XOR} x_n$

A deep network (a balanced tree of pairwise XORs) needs:

  • $O(\log(n))$ layers
  • $O(n)$ neurons

With 2 layers we need $O(2^n)$ neurons in the hidden layer.

Best practice: Always start small and monitor if more layers improve the performance

In general, how do we choose hyperparameters?

  • Learning rate $\eta$
  • Number of layers $l$
  • Activation function for each layer
  • Number $n^{[i]}$ of units per layer
  • ...

Hold-out Cross Validation

  • Split entire dataset into 3 parts: $\mathcal{D}_\text{train}$, $\mathcal{D}_\text{val}$, $\mathcal{D}_\text{test}$
  • For different combinations of hyperparameters:
    • Train on $\mathcal{D}_\text{train}$
    • Evaluate on $\mathcal{D}_\text{val}$
  • Choose best hyperparameters based on results on $\mathcal{D}_\text{val}$
  • Evaluate on $\mathcal{D}_\text{test}$
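
Schematically, the procedure looks like this (a sketch; `train`, `evaluate`, and the dataset names are placeholders):

          best_score, best_params = None, None
          for params in hyperparameter_combinations:
              model = train(D_train, params)  # fit on the training split
              score = evaluate(model, D_val)  # validate on the val split
              if best_score is None or score > best_score:
                  best_score, best_params = score, params

          final_model = train(D_train, best_params)
          print(evaluate(final_model, D_test))  # report the test score once, at the end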

Why can't we tune the hyperparameters on the test set?

Then the hyperparameters can overfit to the test data, and we have no way to find out.

Some hyperparameters are more important to tune than others

  • Learning rate $\eta$ must always be tuned
  • Change the number of units per layer before changing the number of layers

How do we try hyperparameters?

Do not use grid search: random sampling tries a new value of every hyperparameter in each trial.

From coarse to fine: decrease search space

Select distribution for sampling hyperparameters

    Uniform distribution:

  • Number of layers
  • Number of units per layer

          import numpy as np

          np.random.uniform(3, 8)     # uniform random float in [3, 8)

          np.random.randint(3, 8+1)   # uniform random integer in {3, ..., 8}
        

Uniform distribution can be a bad idea

Example: Select learning rate $\eta$ between 0.0001 and 1

Better: Sample on a log scale


          import numpy as np

          np.exp(np.random.uniform(np.log(0.0001), np.log(1)))  # uniform on log scale between 0.0001 and 1
        

Similarly, if we want values $\beta \in [0.9, 0.999]$

Equivalent: $(1-\beta) \in [0.001, 0.1]$


          import numpy as np

          # sample (1 - beta) on a log scale in [0.001, 0.1], i.e. beta in [0.9, 0.999]
          1 - np.exp(np.random.uniform(np.log(0.001), np.log(0.1)))
        

Choosing hyperparameters can be automated (e.g. using packages like optuna)

But: testing many combinations of hyperparameters on a regular basis is often infeasible

Bias

Example: Build a robot to automatically kill weeds and train its weed/non-weed detector on images from the web.

Ideally: train/val/test data are generated the same way as the data in production.

This is especially important for the val/test data!

Bias vs Variance

[Figure: example classifiers with high bias vs. high variance]
Train set error: 1%
Val set error: 11%

High variance: the algorithm is overfitting

Train set error: 15%
Val set error: 16%

Assuming human performance is close to 0% error:

High bias: the model has insufficient capacity

Train set error: 15%
Val set error: 30%

High bias & high variance

Train set error: 0.5%
Val set error: 1%

Low bias & low variance

Train set error: 15%
Val set error: 16%

Human performance is 15% error

Low bias & low variance

What to do about bias?

  • Bigger network
    • More units per layer
    • More layers
  • Different NN architecture
  • Improve training
    • Train longer
    • Smaller step size
    • Different algorithm (Adam / Adagrad / LBFGS / ...)

What to do about variance?

  • More data
  • Smaller network
  • Different NN architecture
  • Regularization

Regularization

For logistic regression we can adjust the cost function: \[ J(w,b) = \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} ||w||^2_2} \\[2mm] ||w||_2^2 = \sum_j w_j^2 = w^T w \]

L2 regularization. $||w||_2$ is the $L_2$ norm.

$\lambda \geq 0$ is a hyperparameter.

L1 regularization

\[ J(w,b) = \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} ||w||_1} \\[2mm] ||w||_1 = \sum_j |w_j| \]

L1 vs L2 norm

L1 regularization can lead to sparser $w$.

  • Small $\lambda$: less regularization, less bias, more variance.
  • Big $\lambda$: more regularization, less variance, more bias.

Regularization for neural networks

\[ J(W^{[1]}, b^{[1]}, \ldots, W^{[l]}, b^{[l]}) =\\ \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} \sum_{i = 1}^l ||W^{[i]}||^2_F} \\[2mm] ||W^{[k]}||^2_F = \sum_{i=1}^{n^{[k]}} \sum_{j = 1}^{n^{[k-1]}} (W^{[k]}_{i,j})^2 \]

$||W^{[k]}||_F$ is called the Frobenius norm

Effect on gradient

\[ \nabla_{W^{[i]}} J = \ldots { \color{red} + \frac{\lambda}{m} W^{[i]}} \]
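
In the gradient-descent update this extra term acts as "weight decay"; a minimal sketch for layer 1 (assuming `lam` holds $\lambda$ and `eta` the learning rate):

          grad_W1 = grad_W1 + lam / m * W1  # add the regularization term to the gradient
          W1 = W1 - eta * grad_W1           # the weights shrink a little in every step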

Some intuition for regularization

Smaller values in $W$ keep the pre-activations $z$ closer to zero.

For $z$ close to zero, activations like the sigmoid and tanh are almost linear, so the network behaves more like a linear model.

Dropout regularization

Idea: For each training example remove some units randomly

Dropout means that, for each example, we train on a smaller network.

Note: the network is different for each training example.

New hyper parameter: $0 \leq \text{keep\_prob} \leq 1$

Probability that we do not drop a unit


          import numpy as np

          d_i = np.random.rand(a_i.shape[0], a_i.shape[1]) < keep_prob  # random mask
          a_i = np.multiply(a_i, d_i)  # zero out the dropped units
          a_i /= keep_prob             # rescale so the expected value stays the same
        

Note that we rescale $a$ at the end ("inverted dropout")

Dropout is only applied during training!

Why does dropout work?

  • During training, the network is smaller
  • Each layer cannot rely too much on any single input

The network has to spread out its weights and cannot rely too much on any one unit.

We can have different keep_prob parameters for each layer.

No need to drop units if the layer has just 3 of them.

If there is no problem with high variance, there is no need to use dropout.

Dropout regularization can be a source of stochastic bugs.

Best practice: turn off for debugging to remove source of randomness.

Data augmentation

Idea: create new examples from old ones

  • Flip image
  • Zoom into / crop image
  • Rotate image
  • Add distortions
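
For images stored as numpy arrays, the first two ideas are one-liners (a sketch; `img` is a placeholder array of shape (height, width, channels)):

          import numpy as np

          img = np.random.rand(64, 64, 3)  # placeholder image
          flipped = img[:, ::-1, :]        # horizontal flip
          cropped = img[8:56, 8:56, :]     # crop / zoom into the center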

Early stopping

Idea: Stop training before validation error increases

Intuition: stop training, before we start overfitting

Drawback: strong coupling to optimization algorithm

How do we evaluate models?

Loss function is often hard to interpret

Confusion Matrix

\[ \begin{array}{cc|cc} & & \multicolumn{2}{c}{\text{Predicted}} \\ & & \text{Positive} & \text{Negative} \\ \hline \text{Truth} & \text{Positive} & \text{True positive (TP)} & \text{False negative (FN)} \\ & \text{Negative} & \text{False positive (FP)} & \text{True negative (TN)} \end{array} \]

Accuracy

\[ \frac{\text{TP+TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]

What fraction of our predictions was correct?

Recall
(True Positive Rate)

\[ \frac{\text{TP}}{\text{TP} + \text{FN}} \]

What fraction of actually positive examples did we find?

Specificity
(True Negative Rate)

\[ \frac{\text{TN}}{\text{TN} + \text{FP}} \]

What fraction of actually negative examples did we correctly identify?

Fall-out
(False Positive Rate)

\[ \frac{\text{FP}}{\text{TN} + \text{FP}} \]

What fraction of actually negative examples did we wrongly classify as positive?

Precision

\[ \frac{\text{TP}}{\text{TP} + \text{FP}} \]

What fraction of our positive predictions was correct?

A single metric can be misleading

  • If we always predict positive, we get recall of 100%
  • If we always predict negative, specificity is 100%
  • If 98% of examples are negative, always predicting "negative" leads to 98% accuracy

Tradeoff between

  • Recall ($\frac{\text{TP}}{\text{TP} + \text{FN}}$)
  • Specificity ($\frac{\text{TN}}{\text{TN} + \text{FP}}$) / Precision ($\frac{\text{TP}}{\text{TP} + \text{FP}}$)

$\text{F}_1$ Score

\[ \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

Harmonic mean between precision and recall
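
All of these metrics are one-liners given the confusion-matrix counts; a sketch with illustrative counts:

          TP, FN, FP, TN = 40, 10, 5, 45  # illustrative counts

          accuracy    = (TP + TN) / (TP + TN + FP + FN)
          recall      = TP / (TP + FN)  # true positive rate
          specificity = TN / (TN + FP)  # true negative rate
          fallout     = FP / (TN + FP)  # false positive rate
          precision   = TP / (TP + FP)
          f1          = 2 * precision * recall / (precision + recall)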

We can influence the tradeoff by setting a threshold value for classification.

  • High threshold → better specificity
  • Low threshold → better recall

Receiver operating characteristic (ROC)

For every threshold, plot recall vs. fall-out.

https://commons.wikimedia.org/wiki/File:Roc_curve.svg

Example (Wikipedia):

https://commons.wikimedia.org/wiki/File:Roccurves.png

AUROC
(Area under the ROC curve)

Calculate area under the ROC curve

  • 1: perfect classifier
  • 0.5: coin toss
  • 0: every single prediction is wrong.
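
A minimal numpy sketch that sweeps all thresholds and integrates the ROC curve with the trapezoidal rule (labels and scores are illustrative):

          import numpy as np

          y = np.array([0, 0, 1, 1, 0, 1])                    # true labels
          scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # predicted probabilities

          tpr, fpr = [], []
          for t in np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1])):
              pred = scores >= t                 # classify with threshold t
              tpr.append(np.mean(pred[y == 1]))  # recall at this threshold
              fpr.append(np.mean(pred[y == 0]))  # fall-out at this threshold
          print(np.trapz(tpr, fpr))              # area under the ROC curve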

Multiple classes

Given $C$ different classes $(0, 1, \ldots, C-1)$.

Approach: one output for each class ($\hat y \in \mathbb{R}^C$)

Each output shall give the probability of its class:
$P(y=2 | x) = \hat y_{2+1}$ (classes are 0-indexed, vector entries 1-indexed)

If an example can have multiple labels, then the last activation is a sigmoid.

If only one label is allowed, we add a softmax layer.

  • Given last layer values $z^{[L]} = W^{[L]} a^{[L-1]} + b^{[L]}$
  • Activation function is \[ a^{[L]}_i = \frac{e^{z^{[L]}_i}}{\sum_{j=1}^C e^{z^{[L]}_j}} \]
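
A numerically stable softmax sketch (subtracting the maximum does not change the result, since it cancels in the fraction, but it avoids overflow in `exp`):

          import numpy as np

          def softmax(z):
              e = np.exp(z - np.max(z, axis=0, keepdims=True))  # shift for stability
              return e / np.sum(e, axis=0, keepdims=True)       # columns sum to 1

          z = np.array([[1.0], [2.0], [-1.0]])
          print(softmax(z))  # approximately [[0.259], [0.705], [0.035]]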

Example

\[ z^{[L]} = \left[ {\begin{array}{c} 1 \\ 2 \\ -1 \end{array}} \right] \\[3mm] a^{[L]} = \left[ {\begin{array}{c} \frac{2.718}{2.718 + 7.389 + 0.368} \\[2mm] \frac{7.389}{2.718 + 7.389 + 0.368} \\[2mm] \frac{0.368}{2.718 + 7.389 + 0.368} \end{array}} \right] = \left[ {\begin{array}{c} 0.259 \\[2mm] 0.705 \\[2mm] 0.035 \end{array}} \right] \]

We get a decision boundary for each pair of classes and an area for each class.

The softmax outputs always add up to 1.

If $C=2$, then softmax reduces to logistic regression.

Loss function with softmax

\[ \mathcal{L}(\hat y, y) = - \sum_{j=1}^C y_j \log \hat y_j \]

Example: $y = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, $\hat y = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix}$

$\mathcal{L}(\hat y, y) = -\log 0.2 \approx 1.61$
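
The same computation in numpy (a sketch; with a one-hot $y$ the sum simply picks out $-\log \hat y_j$ of the true class):

          import numpy as np

          y = np.array([0, 1, 0])            # one-hot label
          y_hat = np.array([0.5, 0.2, 0.3])  # softmax output

          loss = -np.sum(y * np.log(y_hat))  # = -log(0.2) ≈ 1.61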

Note that with a vectorized implementation the one-hot labels are stacked as columns of
$Y \in \mathbb{R}^{C \times m}$

The gradient is still friendly (like for the sigmoid): \[ \nabla_{z^{[L]}} J = \hat y - y \]