Logistic regression is a 1-layer neural network
\[ z = w^T x + b \\ \hat y = \sigma(z) \]
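As a minimal numpy sketch (the feature dimension 3 is an assumption for illustration):

import numpy as np
x = np.random.randn(3, 1)       # one example with 3 features
w = np.random.randn(3, 1)       # weights
b = 0.0                         # bias
z = np.matmul(w.T, x) + b       # linear part
y_hat = 1 / (1 + np.exp(-z))    # sigmoid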
Next step: add another layer.
\[ z^{[1]}_1 = {w^{[1]}_1}^T x + b^{[1]}_1 \quad a^{[1]}_1 = \sigma(z^{[1]}_1) \]
\[ z^{[1]}_2 = {w^{[1]}_2}^T x + b^{[1]}_2 \quad a^{[1]}_2 = \sigma(z^{[1]}_2) \]
\[ z^{[1]}_3 = {w^{[1]}_3}^T x + b^{[1]}_3 \quad a^{[1]}_3 = \sigma(z^{[1]}_3) \]
\[ z^{[2]} = {w^{[2]}}^T a^{[1]} + b^{[2]} \quad \hat y = a^{[2]} = \sigma(z^{[2]}) \]
More vectorization for the hidden layer
\[ z^{[1]} = W^{[1]} x + b^{[1]} \\ a^{[1]} = \sigma(z^{[1]}) \]
\[ W^{[1]} = \begin{bmatrix} - & w^{[1]}_1 & - \\ - & w^{[1]}_2 & - \\ - & w^{[1]}_3 & - \end{bmatrix} \in \mathbb{R}^{3\times 3} \quad b^{[1]} = \begin{bmatrix} b^{[1]}_1 \\ b^{[1]}_2 \\ b^{[1]}_3 \end{bmatrix} \quad a^{[1]} = \begin{bmatrix} a^{[1]}_1 \\ a^{[1]}_2 \\ a^{[1]}_3 \end{bmatrix} \]
Putting it all together
\[ \hat y = \underbrace{\sigma(\underbrace{{w^{[2]}}^T \underbrace{\sigma(\underbrace{W^{[1]} x + b^{[1]}}_{z^{[1]}})}_{a^{[1]}} + b^{[2]}}_{z^{[2]}})}_{a^{[2]}} \]
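The same computation as a numpy sketch (3 inputs and 3 hidden units, as in the matrix above):

import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
x = np.random.randn(3, 1)
W1, b1 = np.random.randn(3, 3), np.random.randn(3, 1)   # hidden layer
w2, b2 = np.random.randn(3, 1), np.random.randn(1, 1)   # output layer
a1 = sigmoid(np.matmul(W1, x) + b1)
y_hat = sigmoid(np.matmul(w2.T, a1) + b2)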
Single neuron with one output
One neural network layer
Getting matrix and vector dimensions right
\[ W \in \mathbb{R}^{n^{[1]} \times n^{[0]}} \quad b \in \mathbb{R}^{n^{[1]}} \quad a \in \mathbb{R}^{n^{[1]}} \] \[ a = \sigma(W x + b) \]
\[ \underbrace{A}_{\mathbb{R}^{{\color{blue} l} \times {\color{green} m}}} \; \underbrace{B}_{\mathbb{R}^{{\color{green} m} \times {\color{red}n}}} = \underbrace{C}_{\mathbb{R}^{{\color{blue} l} \times {\color{red}n}}} \]
\[ \begin{bmatrix} \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot \end{bmatrix} \begin{bmatrix} \cdot & \cdot\\ \cdot & \cdot\\ \cdot & \cdot\\ \end{bmatrix} = \begin{bmatrix} \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \\ \cdot & \cdot \end{bmatrix} \]
With broadcasting we can vectorize a layer for a batch of examples $X$
import numpy as np
X = np.random.randn(4, 5)   # 4 features, batch of 5 examples
W = np.random.randn(3, 4)   # layer with 3 units
b = np.random.randn(3, 1)   # bias, broadcast across the batch
Z = np.matmul(W, X) + b     # shape (3, 5)
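Applying the activation element-wise completes the layer (continuing the snippet above):

A = 1 / (1 + np.exp(-Z))    # sigmoid, shape (3, 5): 3 units, 5 examples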
The values $a^{[1]} = \sigma(z^{[1]})$ are called the activations; $\sigma$ is the activation function.
Why do we need an activation function?
Let's leave it out
$\hat y = {W^{[2]}} ( W^{[1]} x + b^{[1]} ) + b^{[2]}$
$= \underbrace{({W^{[2]}} W^{[1]})}_{w'} x + \underbrace{(W^{[2]}b^{[1]} + b^{[2]})}_{b'}$
$= w' x + b'$
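We can confirm this collapse numerically; a small sketch with arbitrary shapes:

import numpy as np
x = np.random.randn(4, 1)
W1, b1 = np.random.randn(3, 4), np.random.randn(3, 1)
W2, b2 = np.random.randn(1, 3), np.random.randn(1, 1)
two_layer = np.matmul(W2, np.matmul(W1, x) + b1) + b2
w_prime = np.matmul(W2, W1)             # collapsed weight
b_prime = np.matmul(W2, b1) + b2        # collapsed bias
print(np.allclose(two_layer, np.matmul(w_prime, x) + b_prime))  # True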
We can choose a different activation function than the sigmoid
\[ a^{[i]} = {\color{red} g^{[i]}}(W^{[i]} a^{[i-1]} + b^{[i]}) \]
Sigmoid
$g(z) = \frac{1}{1+e^{-z}}$
Hyperbolic tangent
$g(z) = \operatorname{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
Rectified Linear Unit (ReLU)
$g(z) = \max(z, 0) = \begin{cases} z & \text{if } z > 0\\ 0 & \text{else} \end{cases}$
Leaky ReLU
$g(z) = \max(z, 0.01z) = \begin{cases} z & \text{if } z > 0\\ 0.01 z & \text{else} \end{cases}$
Linear
$g(z) = z$
Those are the most common choices - but many more are possible
Should be nonlinear and have a derivative (almost) everywhere
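All of these are one-liners in numpy (a sketch):

import numpy as np
def sigmoid(z):    return 1 / (1 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def relu(z):       return np.maximum(z, 0)
def leaky_relu(z): return np.maximum(z, 0.01 * z)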
The jobs of the two parts of a neuron: the linear part $z = w^T x + b$ combines the inputs; the activation $a = g(z)$ introduces the nonlinearity.
The XOR problem
| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Cannot be solved perfectly by any linear classifier
(i.e., a 1-layer neural network).
But it is possible with 2 layers
\[ a_1 = \operatorname{ReLU}(x_1 - x_2) \quad a_2 = \operatorname{ReLU}(-x_1 + x_2) \quad \hat y = a_1 + a_2 \]
| $x_1$ | $x_2$ | $y$ | $a_1$ | $a_2$ | $\hat y$ |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 | 1 |
| 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
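A quick numerical check of this construction (a sketch):

import numpy as np
relu = lambda z: np.maximum(z, 0)
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y_hat = relu(x1 - x2) + relu(-x1 + x2)
    print(x1, x2, y_hat)   # reproduces the XOR column: 0, 1, 1, 0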
Plotting after the first layer
(Figure: the XOR data after the first layer, with ReLU vs. without ReLU)
Derivatives of activation functions
Sigmoid
$g(z) = \frac{1}{1+e^{-z}}$
$\frac{d}{dz}g(z) = \frac{1}{1 + e^{-z}} (1- \frac{1}{1 + e^{-z}}) = g(z)(1-g(z))$
Hyperbolic tangent
$g(z) = \operatorname{tanh}(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
$\frac{d}{dz}g(z) = 1-\operatorname{tanh}(z)^2 = 1 - g(z)^2$
Rectified Linear Unit (ReLU)
$g(z) = \max(z, 0)$
$\frac{d}{dz}g(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
Leaky ReLU
$g(z) = \max(z, 0.01z)$
$\frac{d}{dz}g(z) = \begin{cases} 0.01 & \text{if } z < 0 \\ 1 & \text{if } z \geq 0 \end{cases}$
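These derivatives in numpy (a sketch; at $z = 0$ we use the convention from the formulas above):

import numpy as np
def sigmoid_grad(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)
def tanh_grad(z):       return 1 - np.tanh(z) ** 2
def relu_grad(z):       return (z >= 0).astype(float)
def leaky_relu_grad(z): return np.where(z >= 0, 1.0, 0.01)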
Gradient descent with multiple layers
More parameters:
Cost function: \[ J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat y^{(i)}, y^{(i)}) \]
Update step: \[ W^{[1]} \gets W^{[1]} - \eta \nabla_{W^{[1]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] b^{[1]} \gets b^{[1]} - \eta \nabla_{b^{[1]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] W^{[2]} \gets W^{[2]} - \eta \nabla_{W^{[2]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] b^{[2]} \gets b^{[2]} - \eta \nabla_{b^{[2]}} J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]})\\[2mm] \]
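A minimal sketch of one such update step (eta and all shapes are assumed; in practice the gradients come from backpropagation, not random values):

import numpy as np
eta = 0.01                                   # learning rate (assumed)
W1, b1 = np.random.randn(3, 4), np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3), np.zeros((1, 1))
# placeholder gradients; in practice these come from backpropagation
grad_W1, grad_b1 = np.random.randn(3, 4), np.random.randn(3, 1)
grad_W2, grad_b2 = np.random.randn(1, 3), np.random.randn(1, 1)
W1 -= eta * grad_W1
b1 -= eta * grad_b1
W2 -= eta * grad_W2
b2 -= eta * grad_b2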
Forward propagation
Backward propagation
The gradient computations can be vectorized over an entire batch. Remember: we need to average over the batch, i.e., divide by $m$.
grad_A1 = np.matmul(W2.T, grad_Z2)                      # backprop through layer 2
grad_Z1 = grad_A1 * activation_gradient_1(Z1)           # through the activation
grad_W1 = 1/m * np.matmul(grad_Z1, X.T)                 # average over the batch
grad_b1 = 1/m * np.sum(grad_Z1, axis=1, keepdims=True)
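For context, a self-contained sketch of the forward pass and the layer-2 gradients that produce grad_Z2 above, assuming a sigmoid output unit with cross-entropy loss (all sizes are illustrative):

import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
m = 5                                    # batch size (assumed)
X = np.random.randn(4, m)                # 4 input features
Y = np.random.randint(0, 2, (1, m))      # binary labels
W1, b1 = np.random.randn(3, 4) * 0.01, np.zeros((3, 1))
W2, b2 = np.random.randn(1, 3) * 0.01, np.zeros((1, 1))
Z1 = np.matmul(W1, X) + b1               # forward pass
A1 = sigmoid(Z1)
Z2 = np.matmul(W2, A1) + b2
A2 = sigmoid(Z2)
grad_Z2 = A2 - Y                         # sigmoid output + cross-entropy loss
grad_W2 = 1/m * np.matmul(grad_Z2, A1.T)
grad_b2 = 1/m * np.sum(grad_Z2, axis=1, keepdims=True)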
Initial parameter values
For logistic regression we can start with setting $w$ and $b$ to 0.
Can we also do this now?
If all parameters are the same, then all neurons are symmetric.
\[ \nabla_{W^{[1]}} J = \begin{bmatrix} u & v & \cdots \\ u & v & \cdots \\ \vdots & \vdots & \vdots \\ u & v & \cdots \end{bmatrix} \]
No matter how many iterations we train, every neuron stays the same.
Solution: Start with random values
import numpy as np
W1 = np.random.randn(n1, n0) * 0.01  # small random values break the symmetry
b1 = np.zeros([n1, 1])               # biases can safely start at 0
Deep neural networks are now straightforward:
Example dimensions (for a network with $l = 4$, $n^{[1]} = 3$, $n^{[2]} = 5$, $n^{[3]} = 4$):

| Quantity | Dimension |
|---|---|
| $a^{[3]}$ | $\mathbb{R}^4$ |
| $b^{[1]}$ | $\mathbb{R}^3$ |
| $W^{[2]}$ | $\mathbb{R}^{5 \times 3}$ |
| $\nabla_{b^{[1]}} J$ | $\mathbb{R}^3$ |
| $\nabla_{W^{[2]}} J$ | $\mathbb{R}^{5 \times 3}$ |
Forward propagation
For every layer $i = 1, \ldots, l$ (with $a^{[0]} = x$): \[ z^{[i]} = W^{[i]} a^{[i-1]} + b^{[i]} \quad a^{[i]} = g^{[i]}(z^{[i]}) \]
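A sketch of this forward loop in numpy, with parameters stored in lists indexed by layer (layer sizes and activations are assumptions; index 0 is unused for W and b, matching the gradient code below):

import numpy as np
sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(z, 0)
sizes = [4, 3, 3, 1]                     # n[0], ..., n[l] (assumed)
l = len(sizes) - 1
W = [None] + [np.random.randn(sizes[i], sizes[i-1]) * 0.01 for i in range(1, l+1)]
b = [None] + [np.zeros((sizes[i], 1)) for i in range(1, l+1)]
activation = [None] + [relu] * (l - 1) + [sigmoid]
A = [np.random.randn(sizes[0], 5)] + [None] * l   # A[0] is a batch of 5 examples
Z = [None] * (l + 1)
for i in range(1, l + 1):
    Z[i] = np.matmul(W[i], A[i-1]) + b[i]
    A[i] = activation[i](Z[i])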
Backward propagation
For every layer $i = l, \ldots, 1$:
Again, when batching multiple examples, we need to average the gradients:
grad_Z[i] = grad_A[i] * activation_gradient[i](Z[i])    # through the activation
grad_W[i] = 1/m * np.matmul(grad_Z[i], A[i-1].T)        # average over the batch
grad_b[i] = 1/m * np.sum(grad_Z[i], axis=1, keepdims=True)
grad_A[i-1] = np.matmul(W[i].T, grad_Z[i])              # pass gradient to layer i-1
Why do we need more layers?
Nonlinear relationships
Hierarchy of feature abstraction
For some tasks, a shallow neural network needs exponentially more neurons than a deep one to reach the same performance.
Example:
$y = x_1 \operatorname{XOR} x_2 \operatorname{XOR} x_3 \operatorname{XOR} x_4 \operatorname{XOR} \ldots$
With only 2 layers we need $O(2^n)$ neurons in the hidden layer; a deep network can compute the same function with $O(n)$ units arranged as a tree of pairwise XORs of depth $O(\log n)$.
Best practice: always start small and monitor whether more layers improve performance.
In general, how do we choose hyperparameters?
Hold-out Cross Validation
Why can't we tune the hyperparameters on the test set?
Because then the hyperparameters can overfit to the test data, and we would have no way to detect it.
Some hyperparameters are more important to tune than others
How do we try hyperparameters?
Do not use grid search; sample randomly instead, since random sampling tries more distinct values per hyperparameter
From coarse to fine: decrease search space
Select distribution for sampling hyperparameters
Uniform distribution:
import numpy as np
np.random.uniform(3, 8) # uniform random float in [3, 8)
np.random.randint(3, 8+1) # uniform random integer in [3, 8]
Uniform distribution can be a bad idea
Example: Select learning rate $\eta$ between 0.0001 and 1
Better: Sample on a log scale
import numpy as np
np.exp(np.random.uniform(np.log(0.0001), np.log(1))) # uniform on log scale between 0.0001 and 1
Similarly, if we want values $\beta \in [0.9, 0.999]$
Equivalently: $(1-\beta) \in [0.001, 0.1]$
import numpy as np
1 - np.exp(np.random.uniform(np.log(0.001), np.log(0.1))) # beta in [0.9, 0.999] on a log scale
Choosing hyperparameters can be automated (e.g. using packages like optuna)
But: testing many combinations of hyperparameters on a regular basis is often infeasible
Bias
Example: Build a robot that automatically kills weeds, and train the weed/non-weed detector on images from the web.
Ideally: train/val/test data are generated the same way as the data seen in production.
This matters most for the val/test data!
Bias vs Variance
(Figure: example fits illustrating high bias vs. high variance)
Diagnosing from train/val errors, with human performance as the achievable baseline:

| Train set error | Val set error | Human error | Diagnosis |
|---|---|---|---|
| 1% | 11% | ~0% | High variance: the algorithm is overfitting |
| 15% | 16% | ~0% | High bias: the model has insufficient capacity |
| 15% | 30% | ~0% | High bias & high variance |
| 0.5% | 1% | ~0% | Low bias & low variance |
| 15% | 16% | 15% | Low bias & low variance |
What to do about bias? Use a bigger network, train longer, or try a different architecture.
What to do about variance? Get more data, use regularization, or try a different architecture.
Regularization
For logistic regression we can adjust the cost function: \[ J(w,b) = \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} ||w||^2_2} \\[2mm] ||w||_2^2 = \sum_j w_j^2 = w^T w \]
L2 regularization. $||w||_2$ is the $L_2$ norm.
$\lambda \geq 0$ is a hyperparameter.
L1 regularization
\[ J(w,b) = \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} ||w||_1} \\[2mm] ||w||_1 = \sum_j |w_j| \]
L1 vs L2 norm
L1 regularization can lead to sparser $w$.
Regularization for neural networks
\[ J(W^{[1]}, b^{[1]}, \ldots, W^{[l]}, b^{[l]}) =\\ \frac{1}{m}\sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) {\color{red} + \frac{\lambda}{2m} \sum_{i = 1}^l ||W^{[i]}||^2_F} \\[2mm] ||W^{[k]}||^2_F = \sum_{i=1}^{n^{[k]}} \sum_{j = 1}^{n^{[k-1]}} (W^{[k]}_{i,j})^2 \]
$||W^{[k]}||_F$ is called the Frobenius norm
Effect on gradient
\[ \nabla_{W^{[i]}} J = \ldots { \color{red} + \frac{\lambda}{m} W^{[i]}} \]
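In the update step this amounts to weight decay; a minimal sketch (lam stands for $\lambda$; all values are assumed):

import numpy as np
eta, lam, m = 0.01, 0.7, 64
W = np.random.randn(3, 4)
grad_W = np.random.randn(3, 4)           # gradient of the unregularized cost
W -= eta * (grad_W + lam / m * W)        # L2 penalty shrinks the weights each step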
Some intuition for regularization
Smaller values in $W$ keep the pre-activations $z$ closer to zero.
For $z$ close to zero, sigmoid and tanh are almost linear, so the network behaves more like a simple linear model.
Dropout regularization
Idea: For each training example remove some units randomly
Dropout means that, for each example, we train a smaller network.
Note: the network is different for each training example.
New hyperparameter: $0 \leq \text{keep\_prob} \leq 1$
Probability that we do not drop a unit
import numpy as np
d_i = np.random.rand(a_i.shape[0], a_i.shape[1]) < keep_prob  # random mask
a_i = np.multiply(a_i, d_i)   # drop the masked units
a_i /= keep_prob              # rescale so the expected activation is unchanged
Note that we rescale $a$ at the end ("inverted dropout")
Dropout is only applied during training!
Why does dropout work?
The network has to spread out its weights and cannot rely too much on any single unit
We can use a different keep_prob parameter for each layer.
There is no need to drop units in a layer that has only 3 of them.
If there is no problem with high variance, there is no need to use dropout.
Dropout regularization can be a source of stochastic bugs.
Best practice: turn off for debugging to remove source of randomness.
Data augmentation
Idea: create new examples from old ones
Early stopping
Idea: Stop training once the validation error starts to increase.
Intuition: stop before the model begins overfitting.
Drawback: it ties regularization to the optimization process, so the two can no longer be tuned independently.
How do we evaluate models?
Loss function is often hard to interpret
|  | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | True positive (TP) | False negative (FN) |
| Actually negative | False positive (FP) | True negative (TN) |
Accuracy
\[ \frac{\text{TP+TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]
What fraction of our predictions was correct?
Recall
(True Positive Rate)
\[ \frac{\text{TP}}{\text{TP} + \text{FN}} \]
What fraction of actually positive examples did we find?
Specificity
(True Negative Rate)
\[ \frac{\text{TN}}{\text{TN} + \text{FP}} \]
What fraction of actually negative examples did we correctly identify?
Fall-out
(False Positive Rate)
\[ \frac{\text{FP}}{\text{TN} + \text{FP}} \]
What fraction of actually negative examples did we classify incorrectly?
Precision
\[ \frac{\text{TP}}{\text{TP} + \text{FP}} \]
What fraction of our positive predictions was correct?
A single metric can be misleading
Tradeoff between
$\text{F}_1$ Score
\[ \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
Harmonic mean of precision and recall
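A sketch computing these metrics from raw counts (the example counts are assumed):

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)                  # true positive rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

print(metrics(tp=80, fp=10, fn=20, tn=90))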
We can influence the tradeoff by setting a threshold value for classification.
Receiver operating characteristic (ROC)
For every threshold, plot recall vs. fall-out.
(Figure: example ROC curve, from Wikipedia)
AUROC
(Area under the receiver operating characteristic curve)
Calculate area under the ROC curve
Multiple classes
Given $C$ different classes $(0, 1, \ldots, C-1)$.
Approach: one output for each class ($\hat y \in \mathbb{R}^C$)
Each output shall give probability for its class:
$P(y=2 \mid x) = \hat y_{2+1}$ (entry $3$, since classes count from $0$ but vector entries from $1$)
If an example can have multiple labels, the output activation is a sigmoid.
If only one label is allowed, we add a softmax layer.
Example
\[ z^{[L]} = \left[ {\begin{array}{c} 1 \\ 2 \\ -1 \end{array}} \right] \\[3mm] a^{[L]} = \left[ {\begin{array}{c} \frac{2.718}{2.718 + 7.389 + 0.368} \\[2mm] \frac{7.389}{2.718 + 7.389 + 0.368} \\[2mm] \frac{0.368}{2.718 + 7.389 + 0.368} \end{array}} \right] = \left[ {\begin{array}{c} 0.259 \\[2mm] 0.705 \\[2mm] 0.035 \end{array}} \right] \]
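A sketch of softmax in numpy; subtracting the maximum before exponentiating is a common trick for numerical stability:

import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))   # shift by max for numerical stability
    return e / np.sum(e)

print(softmax(np.array([1.0, 2.0, -1.0])))   # [0.259 0.705 0.035]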
We get a decision boundary for each pair of classes and a region for each class.
The softmax outputs always add up to 1.
If $C=2$, then softmax reduces to logistic regression.
Loss function with softmax
\[ \mathcal{L}(\hat y, y) = - \sum_{j=1}^C y_j \log \hat y_j \]
Example: $y = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$, $\hat y = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \end{bmatrix}$
$\mathcal{L}(\hat y, y) = -\log 0.2 \approx 1.61$
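Checking this example numerically (a small sketch):

import numpy as np
y = np.array([0, 1, 0])
y_hat = np.array([0.5, 0.2, 0.3])
print(-np.sum(y * np.log(y_hat)))  # 1.609..., i.e. -log(0.2)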
Note that in a vectorized implementation the labels are stacked column-wise: $Y \in \mathbb{R}^{C \times m}$.
The gradient is still friendly (as for the sigmoid): \[ \nabla_{z^{[L]}} J = \hat y - y \]