Artificial Intelligence

Practical aspects of neural networks

Normalizing inputs

\[ x_\text{centered} \gets x - \mu \\[2mm] \mu = \frac{1}{m}\sum_{i=1}^m x^{(i)} \] \[ x_\text{normalized} \gets \frac{x_\text{centered}}{\sigma} \\[2mm] \sigma^2 = \frac{1}{m} \sum_{i=1}^m \left(x_\text{centered}^{(i)}\right)^2 \]

Normalizing: $x_\text{normalized} \gets \frac{x - \mu}{\sigma}$

We calculate $\mu$ and $\sigma$ on $\mathcal{D}_\text{train}$

$\mu$ and $\sigma$ become part of the model and must be the same in production.
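A minimal numpy sketch (X_train and X_new are assumed arrays with one example per column):

          import numpy as np

          # fit mu and sigma on the training set only
          mu = X_train.mean(axis=1, keepdims=True)
          sigma = X_train.std(axis=1, keepdims=True)

          X_train_norm = (X_train - mu) / sigma

          # new data in production must be normalized
          # with the same mu and sigma
          X_new_norm = (X_new - mu) / sigma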

Why do we need normalization?

(Figure: gradient descent without normalization vs. with normalization)

In practice it is sufficient to have all features on roughly the same scale.

  • All between 0 and 1
  • All between -1 and 1
  • All between 1 and 2

In particular: do not normalize binary features!

Can we use normalization on hidden layers?

Batch normalization

    For some layer we have intermediate values $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$.

  • $\mu \gets \frac{1}{m} \sum_i z^{(i)}$
  • $\sigma^2 \gets \frac{1}{m} \sum_i (z^{(i)} - \mu)^2$
  • $z^{(i)}_\text{norm} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$

    Having mean 0 and std.dev. 1 might not be exactly what we need

  • $\tilde z^{(i)} = \gamma z^{(i)}_\text{norm} + \beta$
  • $\gamma$ and $\beta$ are learnable parameters.

We use $\tilde z^{(i)}$ instead of $z^{(i)}$ as input of the activation function.

\[ x \xrightarrow{W^{[1]}, b^{[1]}} z^{[1]} \xrightarrow{\text{BatchNorm}, \beta^{[1]}, \gamma^{[1]}} \tilde z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde z^{[1]})\\[2mm] a^{[1]} \xrightarrow{W^{[2]}, b^{[2]}} z^{[2]} \xrightarrow{\text{BatchNorm}, \beta^{[2]}, \gamma^{[2]}} \tilde z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde z^{[2]}) \]

New parameter updates: \[ \beta^{[i]} \gets \beta^{[i]} - \eta \nabla_{\beta^{[i]}} J\\[2mm] \gamma^{[i]} \gets \gamma^{[i]} - \eta \nabla_{\gamma^{[i]}} J \]
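A minimal numpy sketch of the forward transform for one layer ($Z$ holds one column per example of the mini-batch; gamma and beta are the learnable parameters):

          import numpy as np

          def batch_norm_forward(Z, gamma, beta, eps=1e-8):
              mu = Z.mean(axis=1, keepdims=True)      # per-unit mean over the batch
              var = Z.var(axis=1, keepdims=True)      # per-unit variance over the batch
              Z_norm = (Z - mu) / np.sqrt(var + eps)  # mean 0, std. dev. 1
              return gamma * Z_norm + beta            # learnable shift and scale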

Batch Normalization adds stability to each layer

Each parameter update changes the input distribution of the hidden layers.

Batch Norm keeps these distributions more stable.

  • Mean and variance are calculated per mini-batch
  • These change for each mini-batch
  • This noise has a regularizing effect like dropout

How do we get $\mu$ and $\sigma^2$ for normalization in production?

Idea: Take an exponentially weighted average during training.

Example

$X^{\{1\}}$: batch mean $\mu^{\{1\}[l]}$, running average $\mu^{[l]} \gets 0.1 \mu^{\{1\}[l]}$
$X^{\{2\}}$: batch mean $\mu^{\{2\}[l]}$, running average $\mu^{[l]} \gets 0.9 \mu^{[l]} + 0.1 \mu^{\{2\}[l]}$
$X^{\{3\}}$: batch mean $\mu^{\{3\}[l]}$, running average $\mu^{[l]} \gets 0.9 \mu^{[l]} + 0.1 \mu^{\{3\}[l]}$
$X^{\{4\}}$: batch mean $\mu^{\{4\}[l]}$, running average $\mu^{[l]} \gets 0.9 \mu^{[l]} + 0.1 \mu^{\{4\}[l]}$

Use final $\mu^{[l]}$ in production.
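A sketch of this update during training (weights 0.9/0.1 as in the example; batch_mu and batch_var are assumed to hold the statistics of the current mini-batch at layer $l$; the variance is tracked the same way as the mean):

          # after processing each mini-batch during training:
          running_mu = 0.9 * running_mu + 0.1 * batch_mu
          running_var = 0.9 * running_var + 0.1 * batch_var

          # in production, normalize with running_mu and running_var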

Deep learning frameworks

  • Caffe/Caffe2
  • CNTK
  • DL4J
  • Keras
  • Lasagne
  • mxnet
  • PaddlePaddle
  • TensorFlow
  • Theano
  • Torch

When choosing a framework consider

  • what the business environment allows
  • supported programming languages
  • available documentation
  • whether it is open source

In most frameworks you specify the compute graph (how forward propagation works).
Back propagation and training are taken care of by the framework.

Tensorflow example

Minimize $w^2 - 10 w + 25$

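A minimal sketch using TensorFlow 2 (gradient descent on the single parameter $w$; the minimum is at $w = 5$):

          import tensorflow as tf

          w = tf.Variable(0.0)
          optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

          for _ in range(1000):
              with tf.GradientTape() as tape:
                  cost = w ** 2 - 10 * w + 25
              grads = tape.gradient(cost, [w])
              optimizer.apply_gradients(zip(grads, [w]))

          print(w.numpy())  # close to 5.0

Only the cost (forward computation) is specified; TensorFlow derives the gradients.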

Assume that we have a deep neural network:

$\hat y = W^{[l]} \operatorname{ReLU}(W^{[l-1]} \operatorname{ReLU}(\ldots W^{[2]} \operatorname{ReLU}(W^{[1]} x) \ldots))$

Assume all weight matrices are

$W = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}$

Then $\hat y = 0.5^l x_1 + 0.5^l x_2$, and $0.5^{30} \approx 9.31 \cdot 10^{-10}$: the signal vanishes.

With $W = \begin{bmatrix} 1.5 & 0 \\ 0 & 1.5 \end{bmatrix}$ instead,

$\hat y = 1.5^l x_1 + 1.5^l x_2$, and $1.5^{30} \approx 191751.06$: the signal explodes.

With many layers, even small deviations of the weights from 1 compound into huge differences.

Gradients are affected in the same way

Vanishing / exploding gradients

Idea: Initialize parameters more carefully

\[ z = w_1 x_1 + w_2 x_2 + \ldots +w_n x_n \]

If the $w_i$ are independent and normally distributed with std. dev. 1,
then $\operatorname{std.dev.}\left(\sum_i w_i\right) = \sqrt{n}$

More inputs $\Rightarrow$ more variance

Scale initial weights so that $z$ has std. dev. of about 1: multiply by $\sqrt{1/n^{[i-1]}}$, where $n^{[i-1]}$ is the number of inputs.


          import numpy as np

          # layer with 4 inputs and 5 units:
          # scale by sqrt(1 / n_inputs) so z keeps std. dev. ~1
          W = np.random.randn(5, 4) * np.sqrt(1 / 4)

Modification for ReLU layers (He initialization):
Scale by $\sqrt{\frac{\color{red}2}{n^{[i-1]}}}$


          import numpy as np

          # ReLU layer with 4 inputs and 5 units:
          # scale by sqrt(2 / n_inputs)
          W = np.random.randn(5, 4) * np.sqrt(2 / 4)

Best practice: Monitor magnitude of gradients / weights during training.


          # collect all parameters into one long vector
          flat_parameters = np.concatenate([
              W1.flatten(), W2.flatten(), W3.flatten(),
              b1.flatten(), b2.flatten(), b3.flatten(),
          ])

          # overall magnitude; monitor the gradients dW1, db1, ... the same way
          parameter_magnitude = np.linalg.norm(flat_parameters)

Gradient clipping

A quick fix for exploding gradients is to clip the gradient element-wise to a bounding box: \[ \nabla J \gets \max(\min(\nabla J, 1), -1) \]

Does not fix underlying problems, but helps to avoid single bad steps.
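A one-line numpy sketch (grad is an assumed gradient array):

          import numpy as np

          # clamp every component of the gradient to [-1, 1]
          grad = np.clip(grad, -1.0, 1.0)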

Checking gradient calculation

Two-sided derivative approximation

$f'(\theta) \approx \frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}$

Implement gradient checking

  • Flatten all parameters $W^{[1]}, b^{[1]}, \ldots, W^{[l]}, b^{[l]}$ into long vector $\theta$
  • Flatten all derivatives to $\nabla_\theta J$.
  • For each parameter $i$:
    • Calculate $\nabla^\text{approx}_{\theta_i} J = \frac{J(\theta_1, \ldots, \theta_i + \epsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \epsilon, \ldots)}{2\epsilon}$
  • Check size of $\frac{||\nabla^\text{approx}_\theta J - \nabla_\theta J||_2}{||\nabla^\text{approx}_\theta J||_2 + ||\nabla_\theta J||_2}$
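A sketch of the whole check, assuming cost(theta) evaluates $J$ for a flat parameter vector and grad is the flattened analytic gradient:

          import numpy as np

          def gradient_check(cost, theta, grad, eps=1e-7):
              approx = np.zeros_like(theta)
              for i in range(theta.size):
                  theta_plus = theta.copy()
                  theta_plus[i] += eps
                  theta_minus = theta.copy()
                  theta_minus[i] -= eps
                  # two-sided derivative approximation per parameter
                  approx[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * eps)
              diff = np.linalg.norm(approx - grad)
              scale = np.linalg.norm(approx) + np.linalg.norm(grad)
              return diff / scale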

What differences are acceptable?

With $\epsilon = 10^{-7}$, typical values of this relative difference:

  • $10^{-7}$: everything fine
  • $10^{-5}$: maybe double check
  • $10^{-3}$: there seems to be a bug

If the check fails, look at the element with the largest difference to locate the bug.

  • Approximation is expensive - only use for debugging
  • Remember the regularization in $J$
  • Does not work with dropout
  • Check at several training steps - not just at initial parameters

Problem: With a lot of training data (large $m$), one training step with $X \in \mathbb{R}^{n \times m}$ is slow

With very large $m$ the calculation may not fit in memory

Idea: Split training data into mini-batches.

For example with batch size 100: \[ X^{\{1\}} = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(100)} \end{bmatrix} \in \mathbb{R}^{n \times 100} \\ X^{\{2\}} = \begin{bmatrix} x^{(101)} & x^{(102)} & \cdots & x^{(200)} \end{bmatrix}\in \mathbb{R}^{n \times 100}\\ \cdots \]

Analogously for $Y$

Mini-batch gradient descent

    One "Epoch":

  • For each mini-batch $i$
    • Perform gradient descent on $X^{\{i\}}$

Like gradient descent but on mini-batches instead of all examples.
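A sketch of one epoch (X and Y hold one example per column; backprop, params and eta are assumed to exist):

          import numpy as np

          perm = np.random.permutation(m)  # reshuffle examples each epoch
          for k in range(0, m, batch_size):
              idx = perm[k:k + batch_size]
              X_batch, Y_batch = X[:, idx], Y[:, idx]
              grads = backprop(X_batch, Y_batch, params)  # hypothetical helper
              for name in params:
                  params[name] -= eta * grads[name]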

(Figure: cost per iteration, batch gradient descent vs. mini-batch gradient descent)

What is the right batch size?

  • Batch size $m$ (batch gradient descent): too slow per iteration
  • Batch size $1$ (stochastic gradient descent): no advantage from vectorization

Need to try several sizes to find a sweet spot

Usually the mini-batch size is a power of 2:
64, 128, 256, ...

We want a mini-batch size that fits in (GPU) memory.

Next: improving gradient descent

Exponentially weighted averages

Daily minimum temperature in Frankfurt

What is the temperature trend?

Input: a time series $\theta_1, \theta_2, \theta_3, \ldots$

Calculate moving average as \[ V_t = \beta V_{t-1} + (1 - \beta) \theta_t\\ V_0 = 0 \]

(Figure: moving averages for $\beta = 0.9$, $\beta = 0.98$, and $\beta = 0.5$)

\[ V_{100} = (1-\beta) \theta_{100} + (1-\beta) \beta \theta_{99} + (1-\beta) \beta^2 \theta_{98} + (1-\beta) \beta^3 \theta_{97} + \ldots \]

Influence of past data points decays exponentially.

Correcting the bias at the start

Bias correction: instead of $V_t$ use \[ \frac{V_t}{1- \beta^t} \]

For large $t$, $\beta^t$ goes to 0, so the correction only matters at the beginning.
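A minimal Python sketch (thetas is an assumed sequence of data points):

          def ewa_bias_corrected(thetas, beta=0.9):
              v = 0.0
              result = []
              for t, theta in enumerate(thetas, start=1):
                  v = beta * v + (1 - beta) * theta
                  result.append(v / (1 - beta ** t))  # bias-corrected estimate
              return result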

Gradient descent with momentum

Take the exponentially weighted average of the gradients: \[ V_W \gets \beta V_W + (1-\beta) \nabla_W J \]

Update step: \[ W \gets W - \eta V_W \]
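As a sketch for a single weight matrix (dW is the current gradient; beta and eta are assumed constants):

          V_W = beta * V_W + (1 - beta) * dW  # exponentially weighted average
          W = W - eta * V_W                   # update with the averaged gradient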

RMSprop (root mean square propagation)

Exponentially weighted average of squared gradients: \[ S_W \gets \beta_2 S_W + (1 - \beta_2) (\nabla_W J)^2 \]

Update step: \[ W \gets W - \eta \frac{\nabla_W J}{\sqrt{S_W} + \epsilon} \]

Momentum + RMSprop = Adam (Adaptive moment estimation)

  • Set $V_W = 0$, $S_W = 0$
  • In each iteration $t$:
    • $V_W \gets \beta_1 V_W + (1 - \beta_1) \nabla_W J$
    • $V^\text{corrected}_W \gets \frac{V_W}{1-\beta_1^t}$
    • $S_W \gets \beta_2 S_W + (1 - \beta_2) (\nabla_W J)^2$
    • $S^\text{corrected}_W \gets \frac{S_W}{1-\beta_2^t}$
    • $W \gets W - \eta \frac{V^\text{corrected}_W}{\sqrt{S^\text{corrected}_W} + \epsilon}$
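The same steps as a numpy sketch for a single parameter matrix (dW is the current gradient, t the iteration counter starting at 1):

          import numpy as np

          def adam_step(W, dW, V, S, t,
                        eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
              V = beta1 * V + (1 - beta1) * dW       # momentum term
              S = beta2 * S + (1 - beta2) * dW ** 2  # RMSprop term
              V_corr = V / (1 - beta1 ** t)          # bias correction
              S_corr = S / (1 - beta2 ** t)
              W = W - eta * V_corr / (np.sqrt(S_corr) + eps)
              return W, V, S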

Hyperparameters

  • Learning rate $\eta$ must be tuned for each problem
  • $\beta_1 = 0.9$ good default
  • $\beta_2 = 0.999$ good default
  • $\epsilon = 10^{-8}$ (usually not changed)

Learning rate decay

Idea: Slowly reduce learning rate over time

Bigger steps at the beginning and smaller steps when getting closer to minimum

$\eta \gets \frac{1}{1 + \text{decayrate} \cdot \text{epoch}} \eta_0$

Other decay schedules

  • $\eta = 0.95^\text{epoch} \cdot \eta_0$ (exponential decay)
  • $\eta = \frac{k}{\sqrt{\text{epoch}}} \cdot \eta_0$
  • Sometimes $t$ (the iteration number) is used instead of $\text{epoch}$ (decrease the learning rate on each iteration)
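As a sketch (eta0, decay_rate and k are assumed constants; pick one schedule per training run):

          eta = eta0 / (1 + decay_rate * epoch)  # inverse decay
          eta = 0.95 ** epoch * eta0             # exponential decay
          eta = k * eta0 / epoch ** 0.5          # 1/sqrt(epoch) decay (epoch >= 1)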

Are local optima a problem?

In high dimensions there is usually at least one direction in which we can still make progress, so true local optima are rare.

More common: saddle points with $\nabla J = 0$

More problematic are plateaus: regions where the gradient stays close to zero for a long time.

Momentum helps to keep making sufficient progress across such regions.