What is a neural network?
$\color{green} h(x) = w_1 \cdot x + w_0$
$\color{blue} h(x) = \max(w_1 \cdot x + w_0, 0)$
$h(x) = \max(w_1 \cdot x + w_0, 0)$
Activation function: $g(z) = \max(z, 0)$
Rectified Linear Unit (ReLU)
Stacking neurons
Each layer outputs new, more abstract features
During training, the algorithm decides what features are most useful.
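A minimal numpy sketch of a single ReLU layer and two stacked layers (the weights, shapes, and variable names below are made up for illustration):

```python
import numpy as np

def relu(z):
    # Activation function g(z) = max(z, 0)
    return np.maximum(z, 0)

def layer(a_prev, W, b):
    # One layer: every unit computes max(w . a_prev + w_0, 0)
    return relu(W @ a_prev + b)

# Toy forward pass: 3 input features -> 4 hidden features -> 2 more abstract features
x = np.array([[2.0], [-1.0], [0.5]])
a1 = layer(x, np.random.randn(4, 3), np.zeros((4, 1)))
a2 = layer(a1, np.random.randn(2, 4), np.zeros((2, 1)))
print(a1.shape, a2.shape)  # (4, 1) (2, 1)
```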
| Input (x) | Output (y) | Application |
|---|---|---|
| House size,... | Price | Real estate |
| Ad, cookie history | Click on ad? | Online advertising |
| Image | Object(s) | Photo tagging |
| Audio | Text transcript | Speech recognition |
| English | Chinese | Machine translation |
| Image, Radar, Lidar | Position of other cars | Autonomous driving |
Image classification
| Input | Output |
|---|---|
| (image) | Cat (1) or no cat (0) |
Notation
Vectorize all the things
\[ X = \begin{bmatrix} | & | & & | \\ x^{(1)} & x^{(2)} & \cdots & x^{(m)} \\ | & | & & | \end{bmatrix} \in \mathbb{R}^{n \times m} \]
\[ Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m} \]
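A small numpy sketch of how the columns of $X$ and $Y$ could be assembled (the toy data below is invented):

```python
import numpy as np

# m = 4 toy examples, each a feature vector of length n = 5
examples = [np.random.rand(5) for _ in range(4)]
labels = [1, 0, 0, 1]

X = np.column_stack(examples)        # shape (n, m): one example per column
Y = np.array(labels).reshape(1, -1)  # shape (1, m)
print(X.shape, Y.shape)              # (5, 4) (1, 4)
```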
How to get a feature vector $x$?
A $554 \times 567$ pixel grayscale image is stored as a matrix in $\mathbb{Z}^{567 \times 554}$:
\[ { \color{grey} \begin{bmatrix} 87 & 91 & 92 & \\ 87 & 90 & 91 & \cdots \\ 86 & 89 & 90 & \\ & \vdots & & \ddots \end{bmatrix} } \]
Color images have 3 channels
\[ { \color{blue} \begin{bmatrix} 76 & 81 & 84 & \\ 77 & 81 & 83 & \cdots \\ 79 & 80 & 81 & \\ & \vdots & & \ddots \end{bmatrix} } \quad { \color{green} \begin{bmatrix} 82 & 87 & 88 & \\ 83 & 87 & 87 & \cdots \\ 85 & 86 & 85 & \\ & \vdots & & \ddots \end{bmatrix} }\\[2mm] { \color{red} \begin{bmatrix} 101 & 106 & 106 & \\ 102 & 106 & 105 & \cdots \\ 104 & 105 & 103 & \\ & \vdots & & \ddots \end{bmatrix} } \]
Get a feature vector by stacking the matrices
\[ { \color{blue} \begin{bmatrix} 76 & 81 & 84 & \\ 77 & 81 & 83 & \cdots \\ 79 & 80 & 81 & \\ & \vdots & & \ddots \end{bmatrix} },\; { \color{green} \begin{bmatrix} 82 & 87 & 88 & \\ 83 & 87 & 87 & \cdots \\ 85 & 86 & 85 & \\ & \vdots & & \ddots \end{bmatrix} },\; { \color{red} \begin{bmatrix} 101 & 106 & 106 & \\ 102 & 106 & 105 & \cdots \\ 104 & 105 & 103 & \\ & \vdots & & \ddots \end{bmatrix} } \quad \Rightarrow \quad \begin{bmatrix} {\color{blue} 76}\\ {\color{blue} 81}\\ {\color{blue} 84}\\ \vdots \\ {\color{green} 82}\\ {\color{green} 87}\\ {\color{green} 88}\\ \vdots \\ {\color{red} 101}\\ {\color{red} 106}\\ {\color{red} 106}\\ \vdots \end{bmatrix} \]
The resulting vector has \[ 3 \times 567 \times 554 = 942354 \] dimensions.
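A numpy sketch of this stacking step (the channel matrices are random stand-ins for real pixel values):

```python
import numpy as np

h, w = 567, 554
blue  = np.random.randint(0, 256, size=(h, w))
green = np.random.randint(0, 256, size=(h, w))
red   = np.random.randint(0, 256, size=(h, w))

# Stack the three channel matrices into a single feature vector
x = np.concatenate([blue.reshape(-1), green.reshape(-1), red.reshape(-1)])
print(x.shape)   # (942354,) = 3 * 567 * 554
```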
Every image has different dimensions
Idea: Rescale image
Just large enough that we can classify it
With $64 \times 64$ color pixels we get $n = 12288$ features
If we can get away without color, it is only $n = 4096$ features
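A sketch of the rescaling step, assuming the Pillow library is available (the file name is hypothetical):

```python
import numpy as np
from PIL import Image  # assumes Pillow is installed

img = Image.open("cat.jpg").convert("RGB").resize((64, 64))  # hypothetical file
x = np.asarray(img).reshape(-1)                  # color feature vector
print(x.shape)       # (12288,) = 3 * 64 * 64

gray = np.asarray(img.convert("L")).reshape(-1)  # grayscale feature vector
print(gray.shape)    # (4096,) = 64 * 64
```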
Classification with logistic regression
For input $x \in \mathbb{R}^n$, predict the probability that $y=1$: \[ \hat y = P(y=1 \mid x) \]
Output: $\hat y = \sigma( w^T x + b )$
Sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Learned parameters $w \in \mathbb{R}^n$, $b \in \mathbb{R}$
Hypothesis: $h$
$\sigma(z) = \frac{1}{1+e^{- z}}$
"sigmoid" or "logistic function"
Loss function
For any example $(x, y)$ we want $\hat y = h(x) \approx y$
For logistic regression: \[ \mathcal{L}(\hat y, y) = - ( y \log(\hat y) + (1-y) \log(1-\hat y)) \]
Cost function (training loss)
\[ J(w, b) = \frac{1}{m} \sum_{(x,y) \in \mathcal{D}_\text{train}} \mathcal{L}(\hat y, y) \]
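A vectorized numpy sketch of the loss and cost (the toy data is invented; with zero parameters every prediction is $0.5$, so the cost is $\log 2 \approx 0.693$):

```python
import numpy as np

def loss(y_hat, y):
    # Cross-entropy loss for one example (or elementwise for many)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, Y):
    # X: (n, m), Y: (1, m); average loss over the m training examples
    Y_hat = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    return np.mean(loss(Y_hat, Y))

X = np.array([[1.0, -1.0], [0.5, 2.0]])   # n = 2, m = 2 toy examples
Y = np.array([[1.0, 0.0]])
print(cost(np.zeros((2, 1)), 0.0, X, Y))  # ≈ 0.693
```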
How to get parameters $w$ and $b$?
Minimize cost function \[ \min_{w, b} J(w, b) \]
Gradient Descent
Repeat the update step \[ w \gets w - \eta \nabla_w J(w) \]
\[ \nabla f(x) = \left[ {\begin{array}{c} \frac{\partial}{\partial x_1} f(x) \\ \frac{\partial}{\partial x_2} f(x) \\ \frac{\partial}{\partial x_3} f(x) \\ \vdots \end{array}} \right] \]
Example: $f(x_1, x_2) = x_1 x_2^2$
\[ \nabla f(x_1, x_2) = \left[ {\begin{array}{c} x_2^2 \\ 2 x_1 x_2 \end{array}} \right] \]
\[ \nabla f(3, 2) = \left[ {\begin{array}{c} 4 \\ 12 \end{array}} \right] \]
For each dimension, the gradient tells us the slope (rate of change) in that direction.
The gradient is also the direction in which $f$ increases the most
$f(x_1, x_2) = x_1 x_2^2$
\[ \nabla_{x_1} f(x_1, x_2) = \left[ {\begin{array}{c} x_2^2 \end{array}} \right] \]
Example
$J(w) = (w-2)^2$
$\frac{\partial J}{\partial w}(w) = 2(w-2)$
Start: $w = 0$
$\frac{\partial J}{\partial w}(0) = -4$
$w \gets 0 - \eta \cdot (-4)$
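A short sketch of the full loop for this example (the learning rate and number of steps are arbitrary):

```python
# Gradient descent on J(w) = (w - 2)^2 with dJ/dw = 2 * (w - 2)
eta = 0.1   # learning rate
w = 0.0     # start value

for _ in range(25):
    grad = 2 * (w - 2)
    w = w - eta * grad   # first step: w <- 0 - 0.1 * (-4) = 0.4

print(w)    # close to the minimum at w = 2
```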
Full update step \[ w \gets w - \eta \nabla_w J(w, b)\\[2mm] b \gets b - \eta \nabla_b J(w, b) \]
Learning rate $\eta$
Usually determined by trial and error.
How do we get the derivatives $\nabla J(w, b)$?
Computation Graph
A directed acyclic graph - every node represents a mathematical operation.
\[ c = a + b \]
| \[ (a+{\color{red}\epsilon}) + b = c + {\color{green} 1} {\color{red} \epsilon} \] | \[ \frac{\partial c}{\partial a} = {\color{green} 1} \] |
| \[ a + (b+{\color{red}\epsilon}) = c + {\color{green} 1} {\color{red} \epsilon} \] | \[ \frac{\partial c}{\partial b} = {\color{green} 1} \] |
\[ c = a \cdot b \]
| \[ (a+{\color{red}\epsilon}) b = c + {\color{green} b} {\color{red} \epsilon} \] | \[ \frac{\partial c}{\partial a} = {\color{green} b} \] |
| \[ a (b+{\color{red}\epsilon}) = c + {\color{green} a} {\color{red} \epsilon} \] | \[ \frac{\partial c}{\partial b} = {\color{green} a} \] |
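These local derivatives can be checked numerically with a small perturbation $\epsilon$, for example:

```python
eps = 1e-6
a, b = 3.0, 2.0

# Addition node c = a + b: dc/da = 1
print(((a + eps) + b - (a + b)) / eps)   # ≈ 1
# Multiplication node c = a * b: dc/da = b, dc/db = a
print(((a + eps) * b - a * b) / eps)     # ≈ 2.0 (= b)
print((a * (b + eps) - a * b) / eps)     # ≈ 3.0 (= a)
```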
More building blocks
Subtraction
Square
ReLU
Maximum
We can now create complex functions from simple building blocks.
\[ d = (a \cdot b)^2 \]
Chain rule
\[ \frac{\partial d}{\partial a} = {\color{green} \frac{\partial d}{\partial c}\frac{\partial c}{\partial a}} = (2c) (b) = (2ab)(b) = 2ab^2 \]
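A numeric check of the chain rule for $d = (a \cdot b)^2$ (the values $a = 3$, $b = 2$ are arbitrary):

```python
a, b = 3.0, 2.0
c = a * b            # intermediate node
d = c ** 2

dd_da = (2 * c) * b  # chain rule: (dd/dc) * (dc/da) = 2ab^2
print(dd_da)         # 24.0

eps = 1e-6
print((((a + eps) * b) ** 2 - d) / eps)   # ≈ 24, finite-difference check
```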
Backpropagation
For every node $i$ we define two values: the value $v_i$ computed in the forward pass, and the gradient of the final output with respect to it, computed in the backward pass.
Algorithm: Backpropagation
Run a forward pass to compute every node's value, then walk the graph backwards and multiply the local derivatives along the way (chain rule) to get the gradient with respect to every node.
Example: Logistic Regression
\[ \mathcal{L}(\hat y, y) = -(y \log(\hat y) + (1-y)\log(1- \hat y)) \\[3mm] \hat y = \sigma(z) = \frac{1}{1+e^{-z}} \\[3mm] z = w_1 x_1 + w_2 x_2 + b \]
\[ \hat y = \sigma(z) = \frac{1}{1+e^{-z}} \]
\[ \color{green} \frac{\partial \sigma}{\partial z} = \frac{e^z}{(1+e^z)^2} = \sigma(z)(1-\sigma(z)) = \hat y (1 - \hat y) \]
\[ \mathcal{L}(\hat y, y) = -(y \log(\hat y) + (1-y)\log(1- \hat y)) \]
\[ \color{green} \frac{\partial \mathcal{L}}{\partial \hat y} = - \frac{y}{\hat y} + \frac{1- y}{1-\hat y} \]
Example for \[ x = \begin{bmatrix} -2 \\ 3 \end{bmatrix}, \quad y = 1, \quad w = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad b = 0.5 \]
\[ \nabla_w \mathcal{L} = \begin{bmatrix} -2.646 \cdot 0.235 \cdot 1 \cdot 1 \cdot -2 \\ -2.646 \cdot 0.235 \cdot 1 \cdot 1 \cdot 3 \end{bmatrix} = \begin{bmatrix} 1.244 \\ -1.865 \end{bmatrix} \\[1mm] \nabla_b \mathcal{L} = -2.646 \cdot 0.235 \cdot 1 = -0.622 \]
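A numpy sketch of this forward and backward pass, using $b = 0.5$, which reproduces the gradient values shown above (up to rounding):

```python
import numpy as np

x = np.array([-2.0, 3.0]); y = 1.0
w = np.array([2.0, 1.0]);  b = 0.5   # bias consistent with the numbers above

# Forward pass
z = w @ x + b                      # -0.5
y_hat = 1.0 / (1.0 + np.exp(-z))   # ≈ 0.378

# Backward pass through the computation graph
dL_dyhat = -y / y_hat + (1 - y) / (1 - y_hat)   # ≈ -2.65
dyhat_dz = y_hat * (1 - y_hat)                  # ≈ 0.235
dL_dz = dL_dyhat * dyhat_dz                     # ≈ -0.622 (= y_hat - y)
dL_dw = dL_dz * x                               # ≈ [ 1.24, -1.87]
dL_db = dL_dz
print(dL_dw, dL_db)
```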
Combining $\sigma$ and $\mathcal{L}$.
\[ \frac{d}{d \hat y} \mathcal{L} = -\frac{y}{\hat y} + \frac{1-y}{1-\hat y}\\ \frac{d}{dz} \sigma = \sigma(z)(1-\sigma(z)) = \hat y (1 - \hat y) \]
$\frac{d}{dz}\mathcal{L} = \frac{d}{d \hat y}\mathcal{L} \frac{d}{dz} \sigma$
$= (-\frac{y}{\hat y} + \frac{1-y}{1-\hat y}) \hat y (1 - \hat y)$
$= - y (1 - \hat y) + (1-y) \hat y$
$= - y + y \hat y + \hat y - y \hat y = \hat y - y$
Remember to vectorize things!
\[ \nabla_w w^T x = x \]
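A fully vectorized sketch of one gradient step, using $\frac{d\mathcal{L}}{dz} = \hat y - y$ and $\nabla_w w^T x = x$ (toy data and an arbitrary learning rate):

```python
import numpy as np

def gradients(w, b, X, Y):
    # X: (n, m), Y: (1, m), w: (n, 1), b: scalar
    m = X.shape[1]
    Y_hat = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))   # predictions, shape (1, m)
    dZ = Y_hat - Y                                 # dL/dz for every example
    dw = (X @ dZ.T) / m                            # shape (n, 1)
    db = np.mean(dZ)
    return dw, db

# One gradient descent step on toy data
X = np.random.randn(3, 10)
Y = (np.random.rand(1, 10) > 0.5).astype(float)
w, b, eta = np.zeros((3, 1)), 0.0, 0.1
dw, db = gradients(w, b, X, Y)
w, b = w - eta * dw, b - eta * db
```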