Normalizing inputs
| Center | Scale |
| --- | --- |
| \[ \mu = \frac{1}{m}\sum_{i=1}^m x^{(i)} \\[2mm] x_\text{centered} \gets x - \mu \] | \[ \sigma^2 = \frac{1}{m} \sum_{i=1}^m \left(x_\text{centered}^{(i)}\right)^2 \\[2mm] x_\text{normalized} \gets \frac{x_\text{centered}}{\sigma} \] |
Normalizing: $x_\text{normalized} \gets \frac{x - \mu}{\sigma}$
We calculate $\mu$ and $\sigma$ on $\mathcal{D}_\text{train}$
$\mu$ and $\sigma$ become part of the model and must be the same in production.
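A minimal numpy sketch of this (the data here is made up; in practice `X_train` comes from $\mathcal{D}_\text{train}$):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 3.0, size=(2, 1000))  # hypothetical data, shape (n, m)
X_test = rng.normal(5.0, 3.0, size=(2, 200))

mu = X_train.mean(axis=1, keepdims=True)        # fit on training data only
sigma = X_train.std(axis=1, keepdims=True)

X_train_norm = (X_train - mu) / sigma
X_test_norm = (X_test - mu) / sigma             # reuse the same mu and sigma
```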
Why do we need normalization?
| No normalization | With normalization |
| --- | --- |
| Elongated cost contours; gradient descent oscillates and needs a small learning rate | Roughly round cost contours; gradient descent takes direct steps and tolerates a larger learning rate |
In practice it is sufficient to have all features on roughly the same scale.
In particular: do not normalize binary features!
Can we use normalization on hidden layers?
Batch normalization
For some layer we have intermediate values $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$.
Normalize them: \[ \mu = \frac{1}{m}\sum_{i=1}^m z^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^m (z^{(i)} - \mu)^2, \qquad z_\text{norm}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} \]
Having mean 0 and std. dev. 1 might not be exactly what we need, so we add a learnable scale $\gamma$ and shift $\beta$: \[ \tilde z^{(i)} = \gamma z_\text{norm}^{(i)} + \beta \]
We use $\tilde z^{(i)}$ instead of $z^{(i)}$ as input to the activation function.
\[ x \xrightarrow{W^{[1]}, b^{[1]}} z^{[1]} \xrightarrow{\text{BatchNorm}, \beta^{[1]}, \gamma^{[1]}} \tilde z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde z^{[1]})\\[2mm] a^{[1]} \xrightarrow{W^{[2]}, b^{[2]}} z^{[2]} \xrightarrow{\text{BatchNorm}, \beta^{[2]}, \gamma^{[2]}} \tilde z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde z^{[2]}) \]
New parameter updates: \[ \beta^{[i]} \gets \beta^{[i]} - \eta \nabla_{\beta^{[i]}} J\\[2mm] \gamma^{[i]} \gets \gamma^{[i]} - \eta \nabla_{\gamma^{[i]}} J \]
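A minimal numpy sketch of the Batch Norm forward computation for one layer (function and variable names are illustrative):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: pre-activations of one layer, shape (units, m) for one mini-batch."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)   # mean 0, std. dev. 1 per unit
    return gamma * Z_norm + beta             # learnable scale and shift

Z = np.random.randn(3, 8)                    # 3 units, mini-batch of 8
Z_tilde = batchnorm_forward(Z, gamma=np.ones((3, 1)), beta=np.zeros((3, 1)))
```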
Batch Normalization adds stability for each layer
Each parameter update changes the input distribution of the hidden layers.
Batch Norm keeps those distributions more stable.
How do we get $\mu$ and $\sigma^2$ for normalization in production?
Idea: Take exponentially weighted average during training.
Example
| Mini-batch | Mean on the mini-batch | Running average |
| --- | --- | --- |
| $X^{\{1\}}$ | $\mu^{\{1\}[l]}$ | $\mu^{[l]} \gets 0.1\, \mu^{\{1\}[l]}$ |
| $X^{\{2\}}$ | $\mu^{\{2\}[l]}$ | $\mu^{[l]} \gets 0.9\, \mu^{[l]} + 0.1\, \mu^{\{2\}[l]}$ |
| $X^{\{3\}}$ | $\mu^{\{3\}[l]}$ | $\mu^{[l]} \gets 0.9\, \mu^{[l]} + 0.1\, \mu^{\{3\}[l]}$ |
| $X^{\{4\}}$ | $\mu^{\{4\}[l]}$ | $\mu^{[l]} \gets 0.9\, \mu^{[l]} + 0.1\, \mu^{\{4\}[l]}$ |
Use final $\mu^{[l]}$ in production.
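A small sketch of this running average (the mini-batch means are made-up numbers; the same idea applies to $\sigma^2$):

```python
# 0.9 / 0.1 corresponds to beta = 0.9 in the exponentially weighted average
mu_running = 0.0
for mu_batch in [3.2, 3.0, 3.4, 3.1]:   # per-mini-batch means of z
    mu_running = 0.9 * mu_running + 0.1 * mu_batch
# mu_running is the mu^[l] used for normalization in production
```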
Deep learning frameworks
When choosing a framework, consider ease of development, runtime performance, and how open the ecosystem is.
In most frameworks you specify the compute graph (how forward propagation works).
Back propagation and training are taken care of by the framework.
TensorFlow example
Minimize $w^2 - 10 w + 25$
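A minimal sketch in TensorFlow 2 (eager execution with `GradientTape`; the optimizer and learning rate here are illustrative). We only specify the forward computation of the cost; the gradients come from the framework:

```python
import tensorflow as tf

w = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(1000):
    with tf.GradientTape() as tape:
        cost = w**2 - 10*w + 25          # forward pass only
    grads = tape.gradient(cost, [w])     # backprop handled by the framework
    optimizer.apply_gradients(zip(grads, [w]))

print(w.numpy())  # close to the true minimum w = 5
```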
Assume that we have a deep neural network:
$\hat y = W^{[l]} \operatorname{ReLU}(W^{[l-1]} \operatorname{ReLU}(\ldots W^{[2]} \operatorname{ReLU}(W^{[1]} x) \ldots))$
Assume for all weights:

| $W$ with entries $< 1$ | $W$ with entries $> 1$ |
| --- | --- |
| $W = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$ | $W = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$ |
| Then $\hat y = 0.5^l x_1 + 0.5^l x_2$ | Then $\hat y = 1.5^l x_1 + 1.5^l x_2$ |
| $0.5^{30} = 9.31 \cdot 10^{-10}$ | $1.5^{30} = 191751.06$ |
With many layers, small changes in all weights can make big differences
Gradients are affected in the same way
Vanishing / exploding gradients
Idea: Initialize parameters more carefully
\[ z = w_1 x_1 + w_2 x_2 + \ldots +w_n x_n \]
If the $w_i$ are normally distributed with std. dev. 1,
then $\operatorname{std.dev.}(\sum_i w_i) = \sqrt{n}$
More inputs $\Rightarrow$ more variance
Scale initial weights by $\sqrt{\frac{1}{n^{[i-1]}}}$ so that $z$ has a std. dev. close to 1
```python
import numpy as np

# Layer with n^[i-1] = 4 inputs and 5 units: scale by sqrt(1/4)
W = np.random.randn(5, 4) * np.sqrt(1 / 4)
```
Modification for ReLU layers ("He initialization"):
Scale by $\sqrt{\frac{\color{red}2}{n^{[i-1]}}}$

```python
import numpy as np

# He initialization for a ReLU layer with 4 inputs and 5 units
W = np.random.randn(5, 4) * np.sqrt(2 / 4)
```
Best practice: Monitor magnitude of gradients / weights during training.
```python
import numpy as np

# Assuming parameters W1..W3, b1..b3 of a 3-layer network:
# concatenate everything into one vector and track its norm over training
flat_parameters = np.concatenate([W1.flatten(), W2.flatten(), W3.flatten(),
                                  b1.flatten(), b2.flatten(), b3.flatten()])
parameter_magnitude = np.linalg.norm(flat_parameters)
```
Gradient clipping
A quick fix for exploding gradients is to clip the gradient element-wise to a bounding box: \[ \nabla J \gets \max(\min(\nabla J, 1), -1) \]
Does not fix underlying problems, but helps to avoid single bad steps.
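A one-line numpy sketch of this element-wise clipping (the gradient values are made up):

```python
import numpy as np

grad = np.array([0.3, -7.2, 2.5])       # hypothetical gradient
clipped = np.clip(grad, -1.0, 1.0)      # -> array([ 0.3, -1. ,  1. ])
```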
Checking gradient calculation
Two-sided derivative approximation
$\nabla_\theta f(\theta) \approx \frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}$
Implement gradient checking
What differences are acceptable?
With $\epsilon = 10^{-7}$: a relative difference around $10^{-7}$ is fine; around $10^{-3}$ it points to a bug.
Look at the element with the largest difference to locate the error.
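A sketch of such a check for a 1-D parameter vector (`f` and `grad_f` are assumed to be given; the example function at the end is made up):

```python
import numpy as np

def grad_check(f, grad_f, theta, eps=1e-7):
    """Compare the analytic gradient to the two-sided approximation."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (f(plus) - f(minus)) / (2 * eps)
    analytic = grad_f(theta)
    # Relative difference: ~1e-7 is fine, ~1e-3 points to a bug
    return np.linalg.norm(approx - analytic) / (
        np.linalg.norm(approx) + np.linalg.norm(analytic))

# Example: f(theta) = sum(theta^2) with gradient 2*theta
print(grad_check(lambda t: np.sum(t**2), lambda t: 2*t,
                 np.array([1.0, -2.0, 3.0])))
```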
Problem: With a lot of training data (large $m$), one training step with $X \in \mathbb{R}^{n \times m}$ is slow
With very large $m$ the calculation may not fit in memory
Idea: Split training data into mini-batches.
For example with batch size 100: \[ X^{\{1\}} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(100)} \end{bmatrix} \in \mathbb{R}^{n \times 100} \\ X^{\{2\}} = \begin{bmatrix} x^{(101)} & x^{(102)} & \cdots & x^{(200)} \end{bmatrix}\in \mathbb{R}^{n \times 100}\\ \cdots \]
Analogously for $Y$
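A small numpy sketch of the split, assuming `X` has shape $(n, m)$ and `Y` shape $(1, m)$ as above:

```python
import numpy as np

def make_minibatches(X, Y, batch_size=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)            # shuffle examples once per epoch
    X, Y = X[:, perm], Y[:, perm]
    # The last mini-batch may be smaller if batch_size does not divide m
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]
```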
Mini-batch gradient descent
One "Epoch":
Like gradient descent but on mini-batches instead of all examples.
| Batch gradient descent | Mini-batch gradient descent |
| --- | --- |
| Cost decreases smoothly in every iteration | Cost is noisy from batch to batch, but trends downward |
What is the right batch size?
Need to try several sizes to find a sweet spot
Usually the mini-batch size is a power of 2:
64, 128, 256, ...
We want a mini-batch size that fits in (GPU) memory.
Next: improving gradient descent
Exponentially weighted averages
Daily minimum temperature in Frankfurt
What is the temperature trend?
Input: a time series $\theta_1, \theta_2, \theta_3, \ldots$
Calculate moving average as \[ V_t = \beta V_{t-1} + (1 - \beta) \theta_t\\ V_0 = 0 \]
With $\beta = 0.9$ the average covers roughly the last $\frac{1}{1-\beta} = 10$ points; $\beta = 0.98$ gives a smoother but delayed curve ($\approx 50$ points); $\beta = 0.5$ gives a noisy one ($\approx 2$ points).
\[ V_{100} = (1-\beta) \theta_{100} + (1-\beta) \beta \theta_{99} + (1-\beta) \beta^2 \theta_{98} + (1-\beta) \beta^3 \theta_{97} + \ldots \]
Influence of past data points decays exponentially.
Correcting the bias at the start
Bias correction: instead of $V_t$ use \[ \frac{V_t}{1- \beta^t} \]
For large $t$, $\beta^t$ goes to 0, so the correction only matters at the start.
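A small numpy sketch of the weighted average with bias correction:

```python
import numpy as np

def ewa(theta, beta=0.9, bias_correction=True):
    """Exponentially weighted average of a 1-D series theta_1, theta_2, ..."""
    v, out = 0.0, []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta**t) if bias_correction else v)
    return np.array(out)
```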
Gradient descent with momentum
Take an exponentially weighted average of the gradients: \[ V_W = \beta V_W + (1-\beta) \nabla_W J \]
Update step: \[ W \gets W - \eta V_W \]
RMSprop (root mean square propagation)
Exponentially weighted average of squared gradients: \[ S_W = \beta_2 S_W + (1 - \beta_2) (\nabla_W J)^2 \]
Update step: \[ W \gets W - \eta \frac{\nabla_W J}{\sqrt{S_W} + \epsilon} \]
Momentum + RMSprop = Adam (Adaptive moment estimation)
Hyperparameters: $\eta$ (needs tuning), $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ (common defaults)
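A minimal numpy sketch of one Adam step, combining the two averages above with bias correction (function names and the `state` dict are illustrative):

```python
import numpy as np

def adam_step(W, grad, state, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; state holds V (momentum), S (RMSprop), step count t."""
    state["t"] += 1
    state["V"] = beta1 * state["V"] + (1 - beta1) * grad        # momentum term
    state["S"] = beta2 * state["S"] + (1 - beta2) * grad**2     # RMSprop term
    V_hat = state["V"] / (1 - beta1 ** state["t"])              # bias correction
    S_hat = state["S"] / (1 - beta2 ** state["t"])
    return W - eta * V_hat / (np.sqrt(S_hat) + eps)

state = {"t": 0, "V": 0.0, "S": 0.0}
W = np.array([1.0, -2.0])
W = adam_step(W, grad=2 * W, state=state)   # e.g. gradient of sum(W**2)
```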
Learning rate decay
Idea: Slowly reduce learning rate over time
Bigger steps at the beginning and smaller steps when getting closer to the minimum
$\eta \gets \frac{1}{1 + \text{decayrate} \cdot \text{epoch}} \eta_0$
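A tiny sketch of this formula (here the epoch counter starts at 0):

```python
def decayed_learning_rate(eta0, epoch, decay_rate=1.0):
    # eta0: initial learning rate; decays as 1 / (1 + decay_rate * epoch)
    return eta0 / (1 + decay_rate * epoch)

print(decayed_learning_rate(0.2, epoch=0))   # 0.2
print(decayed_learning_rate(0.2, epoch=9))   # 0.02
```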
Other decay schedules exist, e.g. exponential decay or staircase decay.
Are local optima a problem?
With a lot of dimensions there is usually at least one direction in which we can still make progress, so getting stuck in a true local optimum is unlikely.
More common: saddle points with $\nabla J = 0$
Plateaus are problematic: regions where the gradient stays close to zero for a long time.
Momentum helps to make sufficient progress