Artificial Intelligence

Backpropagation & logistic regression

What is a neural network?

$\color{green} h(x) = w_1 \cdot x + w_0$

$\color{blue} h(x) = \max(w_1 \cdot x + w_0, 0)$

Activation function: $g(z) = \max(z, 0)$
Rectified Linear Unit (ReLU)
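
As a small NumPy sketch of a single ReLU neuron (the weights $w_1$, $w_0$ below are arbitrary example values, not taken from the plot above):

    import numpy as np

    # A single neuron h(x) = max(w1*x + w0, 0); w1 and w0 are arbitrary example values.
    def neuron(x, w1=2.0, w0=-1.0):
        z = w1 * x + w0          # linear part
        return np.maximum(z, 0)  # ReLU activation g(z) = max(z, 0)

    print(neuron(np.array([-1.0, 0.0, 1.0, 2.0])))  # [0. 0. 1. 3.]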

Stacking neurons

Each layer outputs new, more abstract features

During training, the algorithm decides what features are most useful.

Input (x)             Output (y)              Application
House size, ...       Price                   Real estate
Ad, cookie history    Click on ad?            Online advertising
Image                 Object(s)               Photo tagging
Audio                 Text transcript         Speech recognition
English               Chinese                 Machine translation
Image, Radar, Lidar   Position of other cars  Autonomous driving

Image classification

Input: an image. Output: cat (1) or no cat (0)

Notation

  • Input features $x \in \mathbb{R}^n$
  • Label $y \in \{0, 1\}$
  • $(x, y)$ is a single training example
  • $\mathcal{D}_\text{train} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(m_\text{train})}, y^{(m_\text{train})})\} $ are the training examples
  • $m_\text{train}$: Number of training examples (in $\mathcal{D}_\text{train}$)
  • $m_\text{test}$: Number of test examples (in $\mathcal{D}_\text{test}$)

Vectorize all the things

\[ X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times m} \]

\[ Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m} \]
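
A small NumPy sketch of this layout (the sizes and values are placeholders):

    import numpy as np

    # Stack m examples (n features each) into X and Y as above.
    n, m = 4, 3
    examples = [np.random.rand(n) for _ in range(m)]    # x^(1), ..., x^(m)
    labels = [1.0, 0.0, 1.0]                            # y^(1), ..., y^(m)

    X = np.column_stack(examples)       # shape (n, m), one column per example
    Y = np.array(labels).reshape(1, m)  # shape (1, m)
    print(X.shape, Y.shape)             # (4, 3) (1, 3)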

How to get a feature vector $x$?

\[ { \color{grey} \begin{bmatrix} 87 & 91 & 92 & \\ 87 & 90 & 91 & \cdots \\ 86 & 89 & 90 & \\ & \vdots & & \ddots \end{bmatrix} } \]
A $554 \times 567$ pixel grayscale image is a matrix in $\mathbb{Z}^{567 \times 554}$

Color images have 3 channels

\[ { \color{blue} \begin{bmatrix} 76 & 81 & 84 & \\ 77 & 81 & 83 & \cdots \\ 79 & 80 & 81 & \\ & \vdots & & \ddots \end{bmatrix} } { \color{green} \begin{bmatrix} 82 & 87 & 88 & \\ 83 & 87 & 87 & \cdots \\ 85 & 86 & 85 & \\ & \vdots & & \ddots \end{bmatrix} }\\[2mm] { \color{red} \begin{bmatrix} 101 & 106 & 106 & \\ 102 & 106 & 105 & \cdots \\ 104 & 105 & 103 & \\ & \vdots & & \ddots \end{bmatrix} } \]

Get a feature vector by stacking the matrices

\[ { \color{blue} \begin{bmatrix} 76 & 81 & 84 & \\ 77 & 81 & 83 & \cdots \\ 79 & 80 & 81 & \\ & \vdots & & \ddots \end{bmatrix} }\\[2mm] { \color{green} \begin{bmatrix} 82 & 87 & 88 & \\ 83 & 87 & 87 & \cdots \\ 85 & 86 & 85 & \\ & \vdots & & \ddots \end{bmatrix} }\\[2mm] { \color{red} \begin{bmatrix} 101 & 106 & 106 & \\ 102 & 106 & 105 & \cdots \\ 104 & 105 & 103 & \\ & \vdots & & \ddots \end{bmatrix} } \] $\Rightarrow$ \[ \begin{bmatrix} {\color{blue} 76}\\ {\color{blue} 81}\\ {\color{blue} 84}\\ \vdots \\ {\color{green} 82}\\ {\color{green} 87}\\ {\color{green} 88}\\ \vdots \\ {\color{red} 101}\\ {\color{red} 106}\\ {\color{red} 106}\\ \vdots \end{bmatrix} \]

The resulting vector has \[ 3 \times 567 \times 554 = 942354 \] dimensions.
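
A short NumPy sketch of the stacking step (the pixel values here are random placeholders):

    import numpy as np

    # Flatten an H x W x 3 color image into one feature vector by stacking
    # the three channel matrices.
    H, W = 567, 554
    img = np.random.randint(0, 256, size=(H, W, 3))

    x = img.transpose(2, 0, 1).reshape(-1)  # all of channel 1, then channel 2, then channel 3
    print(x.shape)                          # (942354,) = 3 * 567 * 554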

Every image has different dimensions

Idea: Rescale image

Just large enough that we can classify it

With $64 \times 64$ color pixels we get $n = 12288$ features

If we can get away without color, it is only $n = 4096$ features

Classification with logistic regression

For input $x \in \mathbb{R}^n$, predict the probability that $y=1$: \[ \hat y = P(y=1 | x) \]

Output: $\hat y = \sigma( w^T x + b )$

Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Learned parameters $w \in \mathbb{R}^n$, $b \in \mathbb{R}$

Hypothesis: $h(x) = \sigma(w^T x + b)$

$\sigma(z) = \frac{1}{1+e^{-z}}$ is called the "sigmoid" or "logistic function"

Loss function

For any example $(x, y)$ we want $\hat y = h(x) \approx y$

For logistic regression: \[ \mathcal{L}(\hat y, y) = - ( y \log(\hat y) + (1-y) \log(1-\hat y)) \]

Cost function (training loss)

\[ J(w, b) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) \]
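
A small sketch of the loss and the cost in NumPy (predictions and labels are made-up numbers):

    import numpy as np

    # Per-example loss L(y_hat, y) and the cost J as its average over the training set.
    def loss(y_hat, y):
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    y_hat = np.array([0.9, 0.2, 0.6])   # predictions for three training examples
    y     = np.array([1.0, 0.0, 1.0])   # their labels

    J = np.mean(loss(y_hat, y))         # average loss over the training examples
    print(J)                            # ~0.28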

How to get parameters $w$ and $b$?

Minimize cost function \[ \min_{w, b} J(w, b) \]

Gradient Descent

Repeat the update step \[ w \gets w - \eta \nabla_w J(w) \]

  • $\eta$: learning rate
  • $\nabla_w$: gradient w.r.t. $w$

\[ \nabla f(x) = \left[ {\begin{array}{c} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \frac{\partial}{\partial x_3} f(x) \\ \vdots \end{array}} \right] \]

Example: $f(x_1, x_2) = x_1 x_2^2$

\[ \nabla f(x_1, x_2) = \left[ {\begin{array}{c} x_2^2 \\ 2 x_1 x_2 \end{array}} \right] \]

\[ \nabla f(3, 2) = \left[ {\begin{array}{c} 4 \\ 12 \end{array}} \right] \]
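
A quick numerical sanity check of this gradient with finite differences:

    # Check the gradient of f(x1, x2) = x1 * x2^2 at (3, 2) numerically.
    def f(x1, x2):
        return x1 * x2 ** 2

    eps = 1e-6
    df_dx1 = (f(3 + eps, 2) - f(3 - eps, 2)) / (2 * eps)  # ~ x2^2    = 4
    df_dx2 = (f(3, 2 + eps) - f(3, 2 - eps)) / (2 * eps)  # ~ 2*x1*x2 = 12
    print(df_dx1, df_dx2)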

For each dimension, the gradient tells us the slope (rate of change) in that direction.

The gradient is also the direction in which $f$ increases the most

$f(x_1, x_2) = x_1 x_2^2$

\[ \nabla_{x_1} f(x_1, x_2) = \left[ {\begin{array}{c} x_2^2 \end{array}} \right] \]

Example

$J(w) = (w-2)^2$

$\frac{\partial J}{\partial w}(w) = 2(w-2)$

Start: $w = 0$

$\frac{\partial J}{\partial w}(0) = -4$

$w \gets 0 - \eta \cdot (-4)$
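
The same example as a small gradient-descent loop (the learning rate $\eta = 0.1$ is an arbitrary choice):

    # Gradient descent on J(w) = (w - 2)^2, starting at w = 0.
    w, eta = 0.0, 0.1
    for step in range(25):
        grad = 2 * (w - 2)   # dJ/dw
        w = w - eta * grad   # update step
    print(w)                 # close to the minimum at w = 2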

Full update step \[ w \gets w - \eta \nabla_w J(w, b)\\[2mm] b \gets b - \eta \nabla_b J(w, b) \]

Learning rate $\eta$

  • High $\eta$: Fast descent - but less stable
  • Small $\eta$: More stable - but slower

Finding a good learning rate requires trial and error.

How do we get derivatives $\nabla J(w, b)$?

Computation Graph

A directed acyclic graph - every node represents a mathematical operation.

\[ c = a + b \]

\[ (a+{\color{red}\epsilon}) + b = c + {\color{green} 1} {\color{red} \epsilon} \] \[ \frac{\partial c}{\partial a} = {\color{green} 1} \]
\[ a + (b+{\color{red}\epsilon}) = c + {\color{green} 1} {\color{red} \epsilon} \] \[ \frac{\partial c}{\partial b} = {\color{green} 1} \]

\[ c = a \cdot b \]

\[ (a+{\color{red}\epsilon}) b = c + {\color{green} b} {\color{red} \epsilon} \] \[ \frac{\partial c}{\partial a} = {\color{green} b} \]
\[ a (b+{\color{red}\epsilon}) = c + {\color{green} a} {\color{red} \epsilon} \] \[ \frac{\partial c}{\partial b} = {\color{green} a} \]
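
The same $\epsilon$ argument for the multiplication node, checked numerically (with arbitrary $a$ and $b$):

    a, b, eps = 3.0, 5.0, 1e-6
    c = a * b
    print(((a + eps) * b - c) / eps)  # ~ b = 5, so dc/da = b
    print((a * (b + eps) - c) / eps)  # ~ a = 3, so dc/db = a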

More building blocks

  • Subtraction
  • Square
  • ReLU
  • Maximum

We can now create complex functions from simple building blocks.

\[ d = (a \cdot b)^2 \]

Chain rule

\[ \frac{\partial d}{\partial a} = {\color{green} \frac{\partial d}{\partial c}\frac{\partial c}{\partial a}} = (2c) (b) = (2ab)(b) = 2ab^2 \]
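
The same computation, evaluated node by node in Python (with arbitrary $a$ and $b$):

    a, b = 3.0, 5.0
    c = a * b              # forward value of the multiply node
    d = c ** 2             # forward value of the square node
    dd_dc = 2 * c          # local derivative of the square node
    dc_da = b              # local derivative of the multiply node w.r.t. a
    dd_da = dd_dc * dc_da  # chain rule: 2*a*b^2 = 150
    print(d, dd_da)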

Backpropagation

For every node $i$ we define two values:

  • $\color{orange} f_i$: the forward value in node $i$
  • ${\color{magenta} g_i} = \frac{\partial \text{parent}}{\partial i}(f_i)$: gradient of parent of $i$ w.r.t. $i$ at $f_i$

Algorithm: Backpropagation

  • Calculate $\color{orange} f_i$ starting at the leaves
  • Calculate $\color{magenta} g_i$ starting at the root
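
A from-scratch sketch of this two-pass procedure on a tiny computation graph; this is only an illustration, not any particular library's API:

    # Each Node stores its forward value and the local gradients w.r.t. its inputs.
    class Node:
        def __init__(self, value, inputs=(), local_grads=()):
            self.value = value              # forward value f_i
            self.inputs = inputs            # nodes this node was computed from
            self.local_grads = local_grads  # d(self)/d(input) for each input
            self.grad = 0.0                 # gradient of the root w.r.t. this node

        def __add__(self, other):
            return Node(self.value + other.value, (self, other), (1.0, 1.0))

        def __mul__(self, other):
            return Node(self.value * other.value, (self, other),
                        (other.value, self.value))   # dc/da = b, dc/db = a

        def square(self):
            return Node(self.value ** 2, (self,), (2 * self.value,))

    def backward(node, upstream=1.0):
        # Multiply local gradients along every path from the root to the leaves.
        node.grad += upstream
        for inp, local in zip(node.inputs, node.local_grads):
            backward(inp, upstream * local)

    # Example: d = (a * b)^2, as in the chain-rule example above.
    a, b = Node(3.0), Node(5.0)
    d = (a * b).square()
    backward(d)
    print(a.grad, b.grad)   # 2*a*b^2 = 150 and 2*a^2*b = 90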

Example: Logistic Regression

\[ \mathcal{L}(\hat y, y) = -(y \log(\hat y) + (1-y)\log(1- \hat y)) \\[3mm] \hat y = \sigma(z) = \frac{1}{1+e^{-z}} \\[3mm] z = w_1 x_1 + w_2 x_2 + b \]

\[ \hat y = \sigma(z) = \frac{1}{1+e^{-z}} \]

\[ \color{green} \frac{\partial \sigma}{\partial z} = \frac{e^z}{(1+e^z)^2} = \sigma(z)(1-\sigma(z)) = \hat y (1 - \hat y) \]

\[ \mathcal{L}(\hat y, y) = -(y \log(\hat y) + (1-y)\log(1- \hat y)) \]

\[ \color{green} \frac{\partial \mathcal{L}}{\partial \hat y} = - \frac{y}{\hat y} + \frac{1- y}{1-\hat y} \]

Example for \[ x = \begin{bmatrix} -2 \\ 3 \end{bmatrix}, \quad y = 1, \quad w = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \]

\[ \nabla_w \mathcal{L} = \begin{bmatrix} -2.646 \cdot 0.235 \cdot 1 \cdot 1 \cdot -2 \\ -2.646 \cdot 0.235 \cdot 1 \cdot 1 \cdot 3 \end{bmatrix} = \begin{bmatrix} 1.244 \\ -1.865 \end{bmatrix} \\[1mm] \nabla_b \mathcal{L} = -2.646 \cdot 0.235 \cdot 1 = -0.622 \]
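
The numbers above can be reproduced as follows; the slide does not show $b$, but the values are consistent with $b = 0.5$, which is assumed here. Small differences in the last digits come from rounding the intermediate values.

    import numpy as np

    x = np.array([-2.0, 3.0]); y = 1.0
    w = np.array([2.0, 1.0]);  b = 0.5   # b = 0.5 is the assumed bias

    z = w @ x + b                                    # -0.5
    y_hat = 1.0 / (1.0 + np.exp(-z))                 # ~0.378
    dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)    # ~-2.646
    dyhat_dz = y_hat * (1 - y_hat)                   # ~0.235
    dz = dL_dyhat * dyhat_dz                         # ~-0.622
    print(dz * x)   # gradient w.r.t. w: ~[ 1.245 -1.867]
    print(dz)       # gradient w.r.t. b: ~-0.622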

Combining $\sigma$ and $\mathcal{L}$.

\[ \frac{d}{d \hat y} \mathcal{L} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y}\\ \frac{d}{dz} \sigma = \sigma(z)(1-\sigma(z)) = \hat y (1 - \hat y) \]

$\frac{d}{dz}\mathcal{L} = \frac{d}{d \hat y}\mathcal{L} \frac{d}{dz} \sigma$

$= (-\frac{y}{\hat y} + \frac{1-y}{1-\hat y}) \hat y (1 - \hat y)$

$= - y (1 - \hat y) + (1-y) \hat y$

$= - y + y \hat y + \hat y - y \hat y = \hat y - y$

Remember to vectorize things!

\[ \nabla_w w^T x = x \]
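
Putting it together, a sketch of the fully vectorized gradient over a batch, using the $X \in \mathbb{R}^{n \times m}$ and $Y \in \mathbb{R}^{1 \times m}$ layout from above (the data here is random):

    import numpy as np

    n, m = 4, 5
    X = np.random.randn(n, m)
    Y = (np.random.rand(1, m) > 0.5).astype(float)
    w = np.zeros((n, 1)); b = 0.0

    Z     = w.T @ X + b               # shape (1, m)
    Y_hat = 1.0 / (1.0 + np.exp(-Z))  # predictions for all m examples at once
    dZ    = Y_hat - Y                 # dL/dz for every example (see derivation above)
    dw    = (X @ dZ.T) / m            # gradient of J w.r.t. w, shape (n, 1)
    db    = np.mean(dZ)               # gradient of J w.r.t. b
    print(dw.shape, db)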