Sequence data
| Sentiment classification | "The series I've enjoyed most in my life thus far." | $\Rightarrow$ | ★★★★★ |
| Machine translation | "Artificial intelligence is useful" | $\Rightarrow$ | "La inteligencia artificial es útil" |
| Speech recognition | Audio clip | $\Rightarrow$ | Transcript |
| Speech synthesis | Written text | $\Rightarrow$ | Audio clip |
| DNA sequence analysis | "ATGCTTCGGCAAGACTCAAAAATA" | $\Rightarrow$ | Properties |
| Music generation | $\emptyset$ | $\Rightarrow$ | Music clip |
Notation
| Marcus | spoke | with | Mary | Jane | about | John | yesterday |
| $x^{< 1 >}$ | $x^{< 2 >}$ | $x^{< 3 >}$ | $x^{< 4 >}$ | $x^{< 5 >}$ | $x^{< 6 >}$ | $x^{< 7 >}$ | $x^{< 8 >}$ |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| $y^{< 1 >}$ | $y^{< 2 >}$ | $y^{< 3 >}$ | $y^{< 4 >}$ | $y^{< 5 >}$ | $y^{< 6 >}$ | $y^{< 7 >}$ | $y^{< 8 >}$ |
What do the vectors $x^{< t >}$ look like?
Possibility: One-hot encoding of words. Each word has a corresponding index in the vector.
\[ x^{< t >} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
Build up a dictionary by assigning an index to, e.g., the top 10000 words in the training data.
Tricks to reduce required dictionary size:
Usually there is one index reserved for unknown words (represented as <UNK>)
Sometimes it makes sense to have an index reserved for end-of-sentence: <EOS>
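A minimal sketch of such a dictionary-based one-hot encoding, including the <UNK> fallback (the tiny vocabulary here is made up for illustration):

```python
import numpy as np

# Hypothetical tiny vocabulary; a real dictionary would hold e.g. the top 10000 words.
vocab = ["<UNK>", "<EOS>", "a", "about", "jane", "john", "marcus", "mary", "spoke", "with", "yesterday"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector for a word; unknown words map to <UNK>."""
    x = np.zeros(len(vocab))
    x[word_to_index.get(word.lower(), word_to_index["<UNK>"])] = 1.0
    return x

print(one_hot("Marcus"))   # 1 at the index of "marcus"
print(one_hot("zebra"))    # 1 at the <UNK> index
```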
Regular, fully connected neural networks do not work well for sequence data.
Recurrent neural network
Different way to draw this:
Prediction at each time step only based on current input and predecessors
Parameters are shared for all time steps
Alternative notation:
\[a^{< t >} = g_1({\color{red} W_{a}} \begin{bmatrix} a^{< t -1>} \\ x^{< t >} \end{bmatrix} + {\color{red} b_a})\] where \[ {\color{red} W_{a}} = \begin{bmatrix} {\color{red} W_{aa}} & {\color{red} W_{ax}} \end{bmatrix} \]
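As an illustration, one recurrent step of this update in NumPy; the sizes and the output layer ($W_y$, $b_y$, softmax) are assumptions that are not spelled out above:

```python
import numpy as np

n_a, n_x, n_y = 5, 3, 2                   # assumed sizes of hidden state, input, output
rng = np.random.default_rng(0)
W_a = rng.normal(size=(n_a, n_a + n_x))   # [W_aa  W_ax] stacked side by side
b_a = np.zeros(n_a)
W_y = rng.normal(size=(n_y, n_a))         # assumed output layer (not shown above)
b_y = np.zeros(n_y)

def rnn_step(a_prev, x_t):
    """One recurrent step: new hidden state and prediction at time t."""
    a_t = np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)   # g_1 = tanh
    z = W_y @ a_t + b_y
    y_hat_t = np.exp(z) / np.exp(z).sum()                      # softmax output
    return a_t, y_hat_t

a, y_hat = rnn_step(np.zeros(n_a), rng.normal(size=n_x))
```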
Loss for a single time step can be regular cross entropy loss:
\[ \mathcal{L}({\hat y}^{< t>}, y^{< t>}) = - y^{< t>} \log {\hat y}^{< t>} - (1-y^{< t>}) \log (1-{\hat y}^{< t>}) \]
Loss for entire sequence:
\[ \mathcal{L}(\hat y, y) = \sum_{t = 1}^{T_y} \mathcal{L}({\hat y}^{< t>}, y^{< t>}) \]
Gradients are calculated back through all past time steps (backpropagation through time)
What if $T_x \neq T_y$?
Case: $T_y = 1$
Example: Sentiment classification
"The series I've enjoyed most in my life thus far." $\Rightarrow$ "★★★★★" (5)
| Many-to-many | Many-to-one |
Loss only for last output
Case: $T_x = 1$
Example: Music generation
| $\emptyset$ | $\Rightarrow$ | Music clip |
Case: $T_x \neq T_y, T_x > 1, T_y > 1$
Example: Machine translation
| "Artificial intelligence is useful" | $\Rightarrow$ | "La inteligencia artificial es útil" |
Two different building blocks: an encoder and a decoder.
The vector in the middle represents the entire input.
Language modelling
| "The apple and pair salad." | vs | "The apple and pear salad." |
| Probability: $3.2 \cdot 10^{-13}$ | | Probability: $5.7 \cdot 10^{-10}$ |
Probability of the sentence occurring somewhere in the wild (e.g. a newspaper, a Wikipedia entry, ...)
To build such a model we need a large training set (corpus of texts).
Given a sentence we turn each token into a target vector:
| The | Egyptian | Mau | is | a | breed | of | cat | <EOS> |
| $y^{<1>}$ | $y^{<2>}$ | $y^{<3>}$ | $y^{<4>}$ | $y^{<5>}$ | $y^{<6>}$ | $y^{<7>}$ | $y^{<8>}$ | $y^{<9>}$ |
With, e.g., a 10000-word vocabulary we need a 10000-way softmax output for the prediction.
Use the "old" targets as features: $x^{<t>} = y^{<t-1>}$
\[ {\hat y}^{<t>} \to P(y^{<t>} | y^{<1>}, \ldots, y^{<t-1>}) \]
This RNN learns to predict the probability of the next word given the preceding words
The loss function is the regular soft-max cross entropy loss: \[ \mathcal{L}({\hat y}^{<t>}, y^{<t>}) = - \sum_i y^{<t>}_i \log {\hat y}^{<t>}_i\\ \mathcal{L} = \sum_t \mathcal{L}({\hat y}^{<t>}, y^{<t>}) \]
We can calculate the probability of some sentence as:
\[ P(y^{<1>}, y^{<2>}, y^{<3>})\\[2mm] = P(y^{<1>}) P(y^{<2>} | y^{<1>}) P(y^{<3>} | y^{<1>}, y^{<2>}) \]
Given a trained sequence model we can sample new sentences.
Each $y^{<t>}$ gives us a distribution over our vocabulary. From there we can sample a word.
import numpy as np

# Sample the next word from the predicted distribution over (part of) the vocabulary.
np.random.choice(["a", "breed", "cat", "Egyptian"],
                 p=[0.7, 0.2, 0.08, 0.02])
Each time we sample a word we can use it as the next input to the RNN
Stop when sampling <EOS>
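A sketch of this sampling loop; `rnn_step` is a hypothetical stand-in for a trained RNN cell, so the "sentences" it produces here are meaningless:

```python
import numpy as np

vocab = ["<EOS>", "the", "egyptian", "mau", "is", "a", "breed", "of", "cat"]   # toy vocabulary
n_a = 8                                     # assumed hidden-state size
rng = np.random.default_rng(0)

def rnn_step(a_prev, x_index):
    """Stand-in for a trained RNN cell (it ignores its inputs): a real model would
    compute the next state and word distribution from learned weights."""
    a_t = np.tanh(rng.normal(size=n_a))                  # dummy state update
    logits = rng.normal(size=len(vocab))
    return a_t, np.exp(logits) / np.exp(logits).sum()    # dummy distribution

def sample_sentence(max_len=20):
    a, last = np.zeros(n_a), 0              # start with zero state; index 0 doubles as start token
    words = []
    for _ in range(max_len):
        a, p = rnn_step(a, last)
        last = rng.choice(len(vocab), p=p)  # sample the next word from the distribution
        if vocab[last] == "<EOS>":
            break
        words.append(vocab[last])
    return " ".join(words)

print(sample_sentence())
```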
Alternative token model: Character level RNNs
Vocabulary = {a, b, c, d, e, ..., A, B, ..., 1, 2, 3, ...}
Consequence: Longer sequences, smaller vocabulary
Example from Andrej Karpathy
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.
Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.
DUKE VINCENTIO:
Well, your wit is in the care of side and that.
Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.
Long term dependencies
The cat, which already ate a lot of food, was full.
The cats, which already ate a lot of food, were full.
Vanishing gradient problem for RNNs:
Errors from late time steps have a hard time affecting weights at early time steps.
Consequence: It is hard for the RNN to learn relevant aspects that should be memorized over a long time.
Note: Exploding gradients are usually not a big problem here.
How to preserve memory?
Assume we have a flag $c^{<t>}$ remembering whether the subject is singular
| $c$: | ? | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ? |
| $x$: | The | cat, | which | already | ate | too | much, | was | full |
We call $c^{<t>}$ a memory cell which can be another output of our RNN cell (similar to $a^{<t>}$)
How do we update $c^{<t>}$?
Assume we have a candidate value ${\tilde c}^{<t>}$
$c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$
$\Gamma_u$ determines if we keep the old value or take the new candidate.
How do we get the candidate value ${\tilde c}^{<t>}$?
${\tilde c}^{<t>} = \tanh (W_c \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c)$
For a basic RNN cell this would be the normal output.
basic RNN unit
How do we get the $\Gamma_u$ parameter?
$\Gamma_u = \sigma (W_u \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$
Putting everything together
basic RNN unit
(simplified) GRU unit
Full GRU
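The full GRU additionally uses a relevance gate $\Gamma_r$ when computing the candidate (standard formulation, written in the notation above):

\[ \Gamma_r = \sigma \left( W_r \begin{bmatrix} c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_r \right) \qquad \Gamma_u = \sigma \left( W_u \begin{bmatrix} c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u \right) \\[2mm] {\tilde c}^{<t>} = \tanh \left( W_c \begin{bmatrix} \Gamma_r * c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c \right) \\[2mm] c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}, \qquad a^{<t>} = c^{<t>} \]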
The long short-term memory (LSTM) unit uses both an $a^{<t>}$ and a $c^{<t>}$ vector to carry information to the next time step.
Difference: Separate parameters for update and forgetting
Difference: Extra Gate to control output
LSTM
LSTM unit
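In the same notation, the LSTM computations are commonly written as:

\[ {\tilde c}^{<t>} = \tanh \left( W_c \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c \right) \\[2mm] \Gamma_u = \sigma \left( W_u \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u \right) \quad \Gamma_f = \sigma \left( W_f \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_f \right) \quad \Gamma_o = \sigma \left( W_o \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_o \right) \\[2mm] c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + \Gamma_f * c^{<t-1>}, \qquad a^{<t>} = \Gamma_o * \tanh(c^{<t>}) \]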
The GRU unit was invented more recently as a simpler version of LSTM.
Which unit is better depends on the use case.
GRU or LSTM units can still only use information from the past.
Using information from the future
Combining outputs: ${\hat y}^{<t>} = g (W_y \begin{bmatrix}\overrightarrow{a}^{<t>} \\ \overleftarrow{a}^{<t>} \end{bmatrix} + b_y)$
Bidirectional RNNs (BRNN)
Deep RNNs
$a^{[2]<3>} = g(W_a^{[2]} \begin{bmatrix} a^{[2]<2>} \\ a^{[1]<3>}\end{bmatrix} + b_a^{[2]})$
Can use GRU or LSTM (or any other RNN) unit for deep RNNs
Deep RNN can also be bidirectional.
RNNs often have only a few layers.
The temporal dimension already adds a lot of depth and computational cost.
How to represent words?
One Hot representation
| Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (456) | Orange (6257) |
| $\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix}$ | $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}$ | $\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ | $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}$ | $\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ | $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \end{bmatrix}$ |
All words are treated the same; similarities are ignored.
One-hot vectors are independent of each other because they are orthogonal:
$v_1 \cdot v_2 = 0$
Word Embeddings
Idea: Find a featurized representation
| Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (456) | Orange (6257) | |
|---|---|---|---|---|---|---|
| Gender | -1 | 1 | -0.95 | 0.97 | 0.00 | 0.01 |
| Royal | 0.01 | 0.02 | 0.93 | 0.95 | -0.01 | 0.00 |
| Age | 0.03 | 0.02 | 0.7 | 0.69 | 0.03 | -0.02 |
| Food | 0.04 | 0.01 | 0.02 | 0.01 | 0.95 | 0.97 |
| ... | ||||||
| $e_{5391}$ | $e_{9853}$ | $e_{4914}$ | $e_{7157}$ | $e_{456}$ | $e_{6257}$ | |
Use embedding vectors $e$ instead of one-hot encodings.
"Apple" and "Orange" have more similar embedding vectors $e_{456}$ and $e_{6257}$
$e_{456} \cdot e_{6257} > 0$
Language model can base prediction on, e.g., the "fruit" feature:
We want to learn the word embedding instead of specifying it.
Do not necessarily get interpretable features.
Example: Named entity recognition RNN
Using embedding vectors as inputs helps with generalization.
| $Y$ | 1 | 1 | 0 | 0 | 0 | 0 |
| | Sally | Johnson | is | an | orange | farmer |
| | Robert | Lin | is | a | durian | cultivator |
Idea: Learn the embedding on a (large) dataset / corpus
(possibly different from the actual task).
Can reuse embedding for different tasks.
Can use larger vocabulary.
Transfer learning: Use learned knowledge from one dataset on different task.
Transfer learning
Similar to using a pre-trained CNN and adjusting only the last layers.
Analogies
Given the embedding vectors
\[ e_\text{man} = \begin{bmatrix} -1 \\ 0.01 \\ 0.03 \\ 0.09 \end{bmatrix} \quad e_\text{woman} = \begin{bmatrix} 1 \\ 0.02 \\ 0.02 \\ 0.01 \end{bmatrix} \quad e_\text{king} = \begin{bmatrix} -0.95 \\ 0.93 \\ 0.70 \\ 0.02 \end{bmatrix} \quad e_\text{queen} = \begin{bmatrix} 0.97 \\ 0.95 \\ 0.69 \\ 0.01 \end{bmatrix} \]
Man is to woman as king is to ___
$e_\text{man} - e_\text{woman} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad e_\text{king} - e_\text{queen} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} $
$e_\text{man} - e_\text{woman} \approx e_\text{king} - e_\text{queen}$
$e_\text{king} - e_\text{man} + e_\text{woman} \approx \begin{bmatrix} 1 \\ 1 \\ 0.7 \\ 0 \end{bmatrix}$
Now find word with most similar embedding.
Concept still works with learned features!
Works for many simple analogies:
What does "similar" mean
Cosine distance: $\frac{v_1 \cdot v_2}{||v_1||\; ||v_2||}$
Angle is more important than euclidean distance.
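A small sketch of this analogy lookup using the toy vectors from above and cosine similarity (the query words themselves are excluded from the search):

```python
import numpy as np

# Toy embeddings from the analogy example above.
E = {
    "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
}

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# "man is to woman as king is to ___": find the word closest to e_king - e_man + e_woman
target = E["king"] - E["man"] + E["woman"]
best = max((w for w in E if w not in {"man", "woman", "king"}),
           key=lambda w: cosine(E[w], target))
print(best)   # "queen"
```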
Learning word embeddings
We learn embedding matrix $E$. One column per word.
With a 10000-word vocabulary and a 300-dimensional embedding, $E \in \mathbb{R}^{300 \times 10000}$
With one hot vector $o_{6257}$, we have $E o_{6257} = e_{6257}$.
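A quick NumPy check of this relation with made-up sizes; in practice one simply indexes the column instead of performing the full matrix product:

```python
import numpy as np

vocab_size, emb_dim = 10, 4                  # toy sizes instead of 10000 x 300
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size))   # one column per word

o = np.zeros(vocab_size)
o[6] = 1.0                                   # one-hot vector for word index 6

assert np.allclose(E @ o, E[:, 6])           # E o_i picks column i, i.e. the embedding e_i
```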
Idea: Learn word embedding as part of a language model
| I | want | a | glass | of | orange | ___ |
| 4343 | 9665 | 1 | 3852 | 6163 | 6257 | |
| $o_{4343}$ | $o_{9665}$ | $o_{1}$ | $o_{3852}$ | $o_{6163}$ | $o_{6257}$ |
Predict next word from last $k$ words.
Train like any other neural network
Shared parameters $E$ (similar to convolutions).
Other contexts for language models
I want a glass of orange juice to go along with my cereal.
Target: juice
Context:
When sampling the context from a corpus we want to put more weight on rare words.
(e.g. weight by inverse document frequency $\log(\frac{\text{\# documents}}{\text{\# documents with word}})$)
Skip-gram: Target within n-word window of context word
"I want a glass of orange juice to go along with my cereal."
| Context | Target |
|---|---|
| orange | juice |
| orange | glass |
| orange | my |
| juice | along |
| juice | orange |
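A sketch of how such (context, target) pairs could be generated; the window size and the uniform sampling within the window are illustrative choices:

```python
import random

sentence = "I want a glass of orange juice to go along with my cereal".split()
window = 4   # pick the target within +/- 4 words of the context word

def sample_pair(words):
    """Pick a random context word and a random target within the window."""
    c = random.randrange(len(words))
    lo, hi = max(0, c - window), min(len(words) - 1, c + window)
    t = c
    while t == c:
        t = random.randint(lo, hi)
    return words[c], words[t]

print([sample_pair(sentence) for _ in range(5)])
```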
Word2Vec model
Context $c$ (e.g. "orange" 6257) to target $t$ (e.g. "juice" 4834)
Learned parameters: the embedding matrix $E$ and the softmax weights $\theta$
2 layers: an embedding lookup ($e_c = E\, o_c$) followed by a softmax over all targets
Problem: the softmax is slow for a large number of targets
\[\hat y_t = \frac{e^{\theta_t \cdot e_c}}{\sum^{10^7}_{j=1} e^{\theta_j \cdot e_c}}\]
Hierarchical softmax
Idea: Use tree of binary classifiers
In practice the tree is often unbalanced, with rare words placed deeper than common ones
Negative sampling
Idea: Avoid softmax by changing the objective
"I want a glass of orange juice to go along with my cereal."
| Context | Word | Target? |
|---|---|---|
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |
For each positive example generate $k$ negative ones
Model
The input $x$ is the pair of context $c$ and word $t$; the label $y$ says whether $t$ is an actual target for $c$:
| Context $c$ | Word $t$ | Target? $y$ |
|---|---|---|
| orange | juice | 1 |
| orange | king | 0 |
| orange | book | 0 |
| orange | the | 0 |
| orange | of | 0 |
$\hat y = \sigma(\theta_t \cdot e_c)$
One $\theta$ for each possible target $t$.
How to sample the negative target words?
Sampling by frequency in the corpus yields too many words like "the", "of", "and", "a", ...
Sampling uniformly over the dictionary is too far from the actual distribution.
\[ P(w_i) = \frac{f(w_i)^{0.75}}{\sum_j f(w_j)^{0.75}} \] $f(w_i)$ is the frequency of $w_i$.
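A sketch of this sampling distribution with made-up word frequencies:

```python
import numpy as np

# Made-up corpus frequencies; "the" and "of" dominate the raw counts.
freq = {"the": 5000, "of": 3000, "orange": 40, "juice": 25, "durian": 2}
words = list(freq)
f = np.array([freq[w] for w in words], dtype=float)

p = f**0.75 / (f**0.75).sum()                # dampened frequencies: rare words get boosted
rng = np.random.default_rng(0)
negatives = rng.choice(words, size=4, p=p)   # k = 4 negative words for one positive pair
print(dict(zip(words, p.round(3))), negatives)
```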
GloVe (global vectors for word representation)
"I want a glass of orange juice to go along with my cereal."
$x_{ct}$: Number of times $t$ appears in context of $c$
(with symmetric context window: $x_{ct} = x_{tc}$)
Objective: \[ \min_{\theta, \tilde E} \sum_c \sum_t f(x_{ct}) \left(\theta_t \cdot \tilde e_c + b_t + \tilde b_c - \log(x_{ct})\right)^2 \]
$f$ is a weighting function with $f(x) = 0$ if $x=0$, that gives a small weight to stop words and a large one to rare words.
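One common choice, taken from the original GloVe paper, caps the weight for frequent co-occurrences:

\[ f(x) = \begin{cases} (x / x_\text{max})^{0.75} & \text{if } x < x_\text{max} \\ 1 & \text{otherwise} \end{cases} \qquad \text{e.g. } x_\text{max} = 100 \]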
The roles of $\tilde e_w$ and $\theta_w$ are fully symmetric.
Assign final word vector as average: \[ e_w = \frac{\tilde e_w + \theta_w}{2} \]
The individual entries of learned word vectors are usually not interpretable features
(like the outputs of hidden layers)
The analogy reasoning (like "man is to woman as king is to ___") still works.
Using word embeddings
Example: sentiment classification
"The desert is excellent" $\Rightarrow$ ★★★★
Simple sentiment classification model
\[ \hat y = \text{softmax}( \theta \cdot \frac{1}{T_x} \sum_t e_t + b) \]
Averaging the implied features of the embeddings can yield very reasonable predictors.
Especially with very small training data.
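A sketch of such an averaging classifier; the embeddings and classifier weights here are random stand-ins for pretrained and learned parameters:

```python
import numpy as np

emb_dim, n_classes = 4, 5               # toy embedding size, ratings 1-5 stars
rng = np.random.default_rng(0)
embedding = {w: rng.normal(size=emb_dim)            # stand-in for pretrained embeddings
             for w in "the dessert is excellent".split()}
theta = rng.normal(size=(n_classes, emb_dim))       # stand-in for learned classifier weights
b = np.zeros(n_classes)

def predict(sentence):
    e_avg = np.mean([embedding[w] for w in sentence.lower().split()], axis=0)
    z = theta @ e_avg + b
    return np.argmax(np.exp(z) / np.exp(z).sum()) + 1    # predicted number of stars

print(predict("The dessert is excellent"))
```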
Averaging shares the missing-context problem of other bag-of-words models:
"Completely lacking in good taste, good service, and good ambience."
RNN with embeddings as input features
Almost any NLP model will generalize better with pretrained word embeddings as inputs.
Bias in word embeddings
Man:Woman as King:Queen
Man:Computer_Programmer as Woman:Homemaker
Father:Doctor as Mother:Nurse
Word embedding learns bias from its corpus.
Bolukbasi et al.: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
Machine learning models should not reinforce present biases that are undesirable.
Especially in sensitive domains like loan applications, judicial recommendations, ...
Idea can be applied to other explicit biases:
Most word embeddings you find for download are not debiased.
And if they are, usually just for gender.
Multilingual word embeddings
Core idea:
With a multilingual word embedding, models can even work for unseen languages.
Transfer has limits (e.g. grammar is not learned).
Good pre-trained word embedding: ConceptNet
Token-based encodings
If it is not feasible to build a simple dictionary, we can learn a token-based embedding
A token is a string of contiguous characters.
It can be a word or part of a word.
Byte pair encoding
Idea: we start with one token for each possible byte (256)
Most common pairs of tokens are merged into a new token.
Example
String: aaabdaaabac
4 tokens: a, b, c, d
aaabdaaabac
aa is the most common pair
ZabdZabac
5 tokens: a, b, c, d, Z=aa
ZabdZabac
ab is the most common pair
ZYdZYac
6 tokens: a, b, c, d, Z=aa, Y=ab
ZYdZYac
ZY is the most common pair
XdXac
7 tokens: a, b, c, d, Z=aa, Y=ab, X=aaab
At this point no more compression is possible
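A sketch of this merge loop; ties between equally frequent pairs are broken arbitrarily here, so the intermediate merges may differ from the example above, although the final segmentation of "aaabdaaabac" still comes out as X d X a c:

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Repeatedly merge the most frequent adjacent token pair into a new token."""
    tokens = list(text)                       # start with one token per character/byte
    merges = {}
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                         # no pair occurs more than once: stop
            break
        new = a + b
        merges[new] = (a, b)
        merged, i = [], 0
        while i < len(tokens):                # greedy left-to-right replacement
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(new)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

print(bpe_merges("aaabdaaabac", num_merges=3))
```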
With BPE we can control the dictionary size, as we can choose how many merges to perform
No unknown characters (each byte has a token by default)
Systems like ChatGPT use token-based encodings
Token-based encodings can also handle cases where building a dictionary is challenging
For machine translation we turn an input sequence into an output sequence
Encoder-decoder architecture
We use two different RNNs, one for encoding, one for decoding.
Both share the same state vector: the encoder's final state initializes the decoder.
Entire input sequence must be encoded in a single state vector.
As information capacity is limited, we cannot translate arbitrarily long texts in a single step.
The translation system works similarly to the language model above.
Neural network approximates
\[ P(y^{<i>} | x^{<1>}, \ldots, x^{<T_x>}, y^{<1>}, \ldots, y^{<i-1>}) \]
The probability of the next translated word given the original text and the translation so far.
When picking the next word, why might sampling by the calculated probability not be the best idea?
A bad translation still has a positive probability. We want to get a good translation.
"Jane visite l'Afrique en septembre."
Idea: Always pick the most likely next symbol.
This is called greedy search.
Greedy search does not necessarily yield the overall most likely sentence.
Translation so far: "Jane is"
Options: e.g. "going" or "visiting"
As "going" is a more common word, we could have
\[ P(\text{going}| x, \text{Jane}, \text{is}) > P(\text{visiting}| x, \text{Jane}, \text{is}) \]
(where $x$ is the input sentence "Jane visite l'Afrique en septembre.")
Still the probability of the entire continued sentence
"Jane is visiting Africa in September"
should be higher than for
"Jane is going to be visiting Africa in September"
Probability of the entire sentence:
\[ P(\text{Jane is visiting Africa in September}| x) = \\ P(\text{Jane}| x) \cdot P(\text{is}| x, \text{Jane}) \cdot \\ P(\text{visiting}| x, \text{Jane}, \text{is}) \cdot \\ P(\text{Africa}| x, \text{Jane}, \text{is}, \text{visiting}) \cdot \\ \ldots \]
Greedy search easily ends up in a local optimum.
Beam search
Idea: Keep a list with multiple ($K$) options for the most likely sentence.
At each step, we calculate the most likely next words for each of the $K$ prefixes. Of all options, we keep the $K$ most likely ones.
Greedy Search
Beam Search (K=3)
Example: Translate "Jane visite l'Afrique en septembre."
Most likely first words:
We now need to run the neural network 3 times to get the most likely continuations for each possible prefix:
Using the joint probability we decide which prefixes are most likely: $P(y^{<1>} | x) \cdot P(y^{<2>} | x, y^{<1>})$
We repeat using the three new prefixes
Beam search is still not guaranteed to find a global optimum.
Only a chance to find a better solution than with greedy search.
With $K=1$ we get greedy search.
What beam width $K$ is good?
Computation scales linearly with $K$. We can choose $K$ based on our computational budget.
Often something between 10 and 100.
The product of many probabilities ($< 1$) will lead to very small numbers.
\[ \prod_{i=1}^{T_y} P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]
This can lead to numerical issues (rounding errors).
Better: Use the logarithm
\[ \log \prod_{i=1}^{T_y} P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \\ = \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]
Longer sentences tend to have a smaller probability
(multiplying more numbers below 1)
We can remove the length penalty by normalizing by sequence length:
\[ \frac{1}{T_y} \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]
In practice a softer normalization often yields good results:
\[ \frac{1}{{T_y}^\alpha} \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]with $0 < \alpha < 1$ (e.g. $\alpha=0.7$).
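A sketch of beam search over log probabilities with length normalization; `next_word_logprobs` is a hypothetical stand-in for the decoder network:

```python
import numpy as np

vocab = ["<EOS>", "jane", "is", "visiting", "going", "africa", "in", "september"]
rng = np.random.default_rng(0)

def next_word_logprobs(prefix):
    """Stand-in for P(y^<t> | x, y^<1..t-1>); a real system would run the decoder RNN here."""
    logits = rng.normal(size=len(vocab))
    return logits - np.log(np.exp(logits).sum())           # log-softmax

def beam_search(K=3, max_len=10, alpha=0.7):
    beams = [([], 0.0)]                                     # (prefix, sum of log probabilities)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = next_word_logprobs(prefix)
            for i, lp in enumerate(logp):                   # extend each prefix by every word
                candidates.append((prefix + [vocab[i]], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:K]:                # keep the K best extensions
            (finished if prefix[-1] == "<EOS>" else beams).append((prefix, score))
        if not beams:
            break
    finished += beams
    # normalize by length^alpha to avoid favouring short sentences
    return max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)

print(beam_search())
```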
How to evaluate the performance of a NLP system?
In many cases just looking at the loss function (on the test set) is misleading
Example: Machine translation
How to automatically evaluate if a translation is good?
Especially as multiple translations can be correct.
BLEU (bilingual evaluation understudy) score
Idea: Measure distance from (multiple) reference translations
French: "Le chat est sur le tapis."
Translation: "The the the the the the the."
Instead of individual words, we can look at bigrams.
French: "Le chat est sur le tapis."
Translation: "The cat the cat on the mat."
Score: $\frac{1+1+1+1}{2+1+1+1+1} = \frac{4}{6}$
We can do that for each n-gram
\[ p_n = \frac{\sum_{x \in \text{n-grams}(\hat y)} \text{ClippedCount}(x)}{\sum_{x \in \text{n-grams}(\hat y)} \text{Count}(x)} \](n-gram precision)
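A sketch of this clipped-count precision; the two reference translations are assumptions chosen for illustration:

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def precision(candidate, references, n):
    """Modified n-gram precision: counts are clipped at the maximum reference count."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    max_ref = Counter()
    for ref in references:
        max_ref |= Counter(ngrams(ref.lower().split(), n))   # element-wise maximum
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

refs = ["The cat is on the mat", "There is a cat on the mat"]   # assumed references
print(precision("The the the the the the the", refs, 1))        # 2/7
print(precision("The cat the cat on the mat", refs, 2))         # 4/6
```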
If a translation equals one of the references, then
\[ p_1 = 1, p_2 = 1, p_3 = 1, \ldots \]
The BLEU score is a geometric mean of the n-gram precisions:
\[ \text{BLEU} = \text{BP} \cdot \exp \left( \frac{1}{4} \sum_{n=1}^4 \log p_n \right) \]
BP (brevity penalty) is a term to penalize short translations.
BP is 1 if the translation is at least as long as the reference.
BP is smaller than 1 if the translation is shorter.
$c$ is the length of the candidate translation, $r$ the length of the reference.
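In the original BLEU definition the brevity penalty is:

\[ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases} \]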
BLEU correlates well with human judgement of translation quality and is easy to automate
BLEU is not a loss function!