Artificial Intelligence

Sequence Models

Sequence data

Sentiment classification "The series I've enjoyed most in my life thus far." $\Rightarrow$ ★★★★★
Machine translation "Artificial intelligence is useful" $\Rightarrow$ "La inteligencia artificial es útil"
Speech recognition $\Rightarrow$ Transcript
Speech synthesis Written text $\Rightarrow$
DNA sequence analysis "ATGCTTCGGCAAGACTCAAAAATA" $\Rightarrow$ Properties
Music generation $\emptyset$ $\Rightarrow$

Notation

Example: mark which words are part of a name.

Marcus     spoke      with       Mary       Jane       about      John       yesterday
$x^{<1>}$  $x^{<2>}$  $x^{<3>}$  $x^{<4>}$  $x^{<5>}$  $x^{<6>}$  $x^{<7>}$  $x^{<8>}$
1          0          0          1          1          0          1          0
$y^{<1>}$  $y^{<2>}$  $y^{<3>}$  $y^{<4>}$  $y^{<5>}$  $y^{<6>}$  $y^{<7>}$  $y^{<8>}$
  • $T_x = 8$
  • $T_y = 8$

What do the vectors $x^{< t >}$ look like?

Possibility: One hot encoding of words.

Each word has a corresponding index in the vector.

\[ x^{< t >} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

Build up a dictionary by assigning an index to, e.g., the top 10000 words in the training data.

Tricks to reduce required dictionary size:

  • Lower case words
  • Stemming
  • List of stop words

Usually there is one index reserved for unknown words (represented as <UNK>)

Sometimes it makes sense to have an index reserved for end-of-sentence: <EOS>
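A minimal sketch of this lookup in Python, in the spirit of the numpy snippet used later in these notes; the tiny vocabulary here is made up purely for illustration:

    import numpy as np

    # Made-up toy vocabulary; a real dictionary would hold e.g. the top 10000 words.
    vocab = {"a": 0, "glass": 1, "of": 2, "orange": 3, "<UNK>": 4}

    def one_hot(word, vocab):
        """Return the one-hot vector x^<t> for a word (unknown words map to <UNK>)."""
        x = np.zeros(len(vocab))
        x[vocab.get(word, vocab["<UNK>"])] = 1.0
        return x

    print(one_hot("orange", vocab))  # [0. 0. 0. 1. 0.]
    print(one_hot("apple", vocab))   # falls back to <UNK>: [0. 0. 0. 0. 1.]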

Regular, fully connected neural networks do not work well for sequence data.

  • Input / Output size not fixed.
  • Inputs cannot be scaled to fixed size.
  • Want to share learned features between different positions.
  • Large number of features + large sequence size leads to huge layers.

Recurrent neural network

Different way to draw this:

The prediction at each time step is based only on the current input and its predecessors.

  • He said, "Teddy Roosevelt was a great President."
  • He said, "Teddy bears are on sale."
  • $a^{< 0 >} = 0$
  • $a^{< t >} = g_1({\color{red} W_{aa}} a^{< t -1>} + {\color{red} W_{ax}} x^{< t >} + {\color{red} b_a})$
  • ${\hat y}^{< t >} = g_2({\color{red} W_y} a^{< t>} + {\color{red} b_y})$

Parameters are shared for all time steps

Alternative notation:

\[a^{< t >} = g_1({\color{red} W_{a}} \begin{bmatrix} a^{< t -1>} \\ x^{< t >} \end{bmatrix} + {\color{red} b_a})\] where \[ {\color{red} W_{a}} = \begin{bmatrix} {\color{red} W_{aa}} & {\color{red} W_{ax}} \end{bmatrix} \]
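A minimal numpy sketch of a single forward step following the equations above; the hidden size, the small random initialization, and the choice of tanh/softmax for $g_1$/$g_2$ are assumptions:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(a_prev, x_t, Waa, Wax, Wy, ba, by):
        """a^<t> = tanh(Waa a^<t-1> + Wax x^<t> + ba),  y_hat^<t> = softmax(Wy a^<t> + by)"""
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
        y_hat_t = softmax(Wy @ a_t + by)
        return a_t, y_hat_t

    # Assumed sizes: hidden state 64, vocabulary 10000 (one-hot inputs and outputs).
    n_a, n_x, n_y = 64, 10000, 10000
    rng = np.random.default_rng(0)
    Waa = rng.normal(size=(n_a, n_a)) * 0.01
    Wax = rng.normal(size=(n_a, n_x)) * 0.01
    Wy = rng.normal(size=(n_y, n_a)) * 0.01
    ba, by = np.zeros(n_a), np.zeros(n_y)

    a = np.zeros(n_a)                   # a^<0> = 0
    x1 = np.zeros(n_x); x1[6257] = 1.0  # one-hot vector of the first input word
    a, y_hat = rnn_step(a, x1, Waa, Wax, Wy, ba, by)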

Loss for a single time step can be regular cross entropy loss:

\[ \mathcal{L}({\hat y}^{< t>}, y^{< t>}) = - y^{< t>} \log {\hat y}^{< t>} - (1-y^{< t>}) \log (1-{\hat y}^{< t>}) \]

Loss for entire sequence:

\[ \mathcal{L}(\hat y, y) = \sum_{t = 1}^{T_y} \mathcal{L}({\hat y}^{< t>}, y^{< t>}) \]

  • Forward propagation calculates one time step after the other
  • Backward propagation goes backward in time to calculate gradients (Backpropagation through time)

Gradient calculation through all past time steps

What if $T_x \neq T_y$?

Case: $T_y = 1$

Example: Sentiment classification

"The series I've enjoyed most in my life thus far." $\Rightarrow$ "★★★★★" (5)

Many-to-many Many-to-one

Loss only for last output

Case: $T_x = 1$

Example: Music generation

$\emptyset$ $\Rightarrow$ Music

Case: $T_x \neq T_y, T_x > 1, T_y > 1$

Example: Machine translation

"Artificial intelligence is useful" $\Rightarrow$ "La inteligencia artificial es útil"

Two different building blocks: an encoder and a decoder.

The vector in the middle represents the entire input.

Language modelling

"The apple and pair salad." vs "The apple and pear salad."
Probability:
$3.2 \cdot 10^{-13}$
Probability:
$5.7 \cdot 10^{-10}$
$P(\text{sentence}) = P(y^{< 1>}, y^{< 2>}, \ldots, y^{< T_y>})$

Probability of the sentence occurring somewhere in the wild (e.g. a newspaper, a Wikipedia entry, ...)

To build such a model we need a large training set (corpus of texts).

Given a sentence, we turn each token into a target vector:

The Egyptian Mau is a breed of cat <EOS>
$y^{<1>}$ $y^{<2>}$ $y^{<3>}$ $y^{<4>}$ $y^{<5>}$ $y^{<6>}$ $y^{<7>}$ $y^{<8>}$ $y^{<9>}$

With, e.g., a 10000-word vocabulary we need a 10000-way softmax output for the prediction.

Use the "old" targets as features: $x^{<t>} = y^{<t-1>}$

\[ {\hat y}^{<t>} \to P(y^{<t>} | y^{<1>}, \ldots, y^{<t-1>}) \]

This RNN learns to predict the probability of the next word given the preceding words

The loss function is the regular softmax cross-entropy loss: \[ \mathcal{L}({\hat y}^{<t>}, y^{<t>}) = - \sum_i y^{<t>}_i \log {\hat y}^{<t>}_i\\ \mathcal{L} = \sum_t \mathcal{L}({\hat y}^{<t>}, y^{<t>}) \]

We can calculate the probability of some sentence as:

\[ P(y^{<1>}, y^{<2>}, y^{<3>})\\[2mm] = P(y^{<1>}) P(y^{<2>} | y^{<1>}) P(y^{<3>} | y^{<1>}, y^{<2>}) \]

Given a trained sequence model we can sample new sentences.

Each $y^{<t>}$ gives us a distribution over our vocabulary. From there we can sample a word.


    import numpy as np

    # Sample one word from the predicted distribution over the vocabulary
    np.random.choice(["a", "breed", "cat", "Egyptian"],
                     p=[0.7, 0.2, 0.08, 0.02])

Each time we sample a word we can use it as the next input to the RNN

Stop when sampling <EOS>
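A sketch of this sampling loop; the `rnn_step` below is only a random placeholder standing in for a trained model, and the tiny vocabulary is made up:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["<EOS>", "a", "breed", "cat", "Egyptian", "is", "Mau", "of", "the"]

    def rnn_step(a_prev, word_index):
        """Placeholder for the trained RNN: returns (new state, distribution over vocab)."""
        logits = rng.normal(size=len(vocab))
        return a_prev, np.exp(logits) / np.exp(logits).sum()

    a, word_index, sentence = np.zeros(64), 0, []
    for _ in range(20):                              # safety cap on the sentence length
        a, p = rnn_step(a, word_index)
        word_index = rng.choice(len(vocab), p=p)     # sample y^<t> ...
        if vocab[word_index] == "<EOS>":             # ... stop at <EOS>
            break
        sentence.append(vocab[word_index])           # ... otherwise feed it back as x^<t+1>

    print(" ".join(sentence))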

Alternative token model: Character level RNNs

Vocabulary = {a, b, c, d, e, ..., A, B, ..., 1, 2, 3, ...}

Consequence: Longer sequences, smaller vocabulary

Example from Andrej Karpathy

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.
        

Long term dependencies

The cat, which already ate a lot of food, was full.

The cats, which already ate a lot of food, were full.

Vanishing gradient problem for RNNs:
Errors from late time steps have a hard time affecting weights at early time steps.

Consequence: It is hard for the RNN to learn relevant aspects that should be memorized over a long time.

Note: Exploding gradients are usually not a big problem here.

How to preserve memory?
Assume we have a flag $c^{<t>}$ remembering whether the subject is singular

$x$:  The   cat,   which   already   ate   too   much,   was   full
$c$:   ?     1       1        1       1     1      1      1     ?

We call $c^{<t>}$ a memory cell which can be another output of our RNN cell (similar to $a^{<t>}$)

How do we update $c^{<t>}$?

Assume we have a candidate value ${\tilde c}^{<t>}$

$c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$

$\Gamma_u$ determines if we keep the old value or take the new candidate.

How do we get the candidate value ${\tilde c}^{<t>}$?

${\tilde c}^{<t>} = \tanh (W_c \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c)$

For a basic RNN cell this would be the normal output.

basic RNN unit

How do we get the $\Gamma_u$ parameter?

$\Gamma_u = \sigma (W_u \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$

Putting everything together

  • ${\tilde c}^{<t>} = \tanh (W_c \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c)$
  • $\Gamma_u = \sigma (W_u \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$
  • $c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$
  • ${\hat y}^{<t>} = \text{softmax}(c^{<t>})$

basic RNN unit

(simplified) GRU unit

Full GRU

  • ${\tilde c}^{<t>} = \tanh (W_c \begin{bmatrix} {\color{red} \Gamma_r *} c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c)$
  • $\Gamma_u = \sigma (W_u \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$
  • $\color{red} \Gamma_r = \sigma (W_r \begin{bmatrix}c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_r)$
  • $c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + (1 - \Gamma_u) * c^{<t-1>}$
  • ${\hat y}^{<t>} = \text{softmax}(c^{<t>})$
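A minimal numpy sketch of one full GRU step, transcribing the equations above (the vector sizes and initialization are arbitrary; the softmax output over $c^{<t>}$ is omitted):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(c_prev, x_t, Wc, Wu, Wr, bc, bu, br):
        """Full GRU step: update gate, relevance gate, candidate, new memory cell."""
        concat = np.concatenate([c_prev, x_t])
        gamma_u = sigmoid(Wu @ concat + bu)                                   # update gate
        gamma_r = sigmoid(Wr @ concat + br)                                   # relevance gate
        c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate
        return gamma_u * c_tilde + (1.0 - gamma_u) * c_prev                   # c^<t>

    # Assumed sizes: memory cell 8, input 16.
    n_c, n_x = 8, 16
    rng = np.random.default_rng(0)
    Wc, Wu, Wr = [rng.normal(size=(n_c, n_c + n_x)) * 0.1 for _ in range(3)]
    bc, bu, br = [np.zeros(n_c) for _ in range(3)]

    c = gru_step(np.zeros(n_c), rng.normal(size=n_x), Wc, Wu, Wr, bc, bu, br)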

The long short-term memory (LSTM) unit uses both an $a^{<t>}$ and a $c^{<t>}$ vector to transport information to the next time step.

Difference: Separate gates for updating and forgetting

  • $\Gamma_u = \sigma (W_u \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$
  • $\Gamma_f = \sigma (W_f \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_f)$
  • $c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + \Gamma_f * c^{<t-1>}$

Difference: Extra Gate to control output

  • $\Gamma_o = \sigma (W_o \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_o)$
  • $a^{<t>} = \Gamma_o * c^{<t>}$

LSTM

  • ${\tilde c}^{<t>} = \tanh (W_c \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c)$
  • $\Gamma_u = \sigma (W_u \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u)$
  • $\Gamma_f = \sigma (W_f \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_f)$
  • $\Gamma_o = \sigma (W_o \begin{bmatrix}a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_o)$
  • $c^{<t>} = \Gamma_u * {\tilde c}^{<t>} + \Gamma_f * c^{<t-1>}$
  • $a^{<t>} = \Gamma_o * c^{<t>}$
  • ${\hat y}^{<t>} = \text{softmax}(a^{<t>})$

LSTM unit
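The same kind of numpy sketch for one LSTM step, again transcribing the equations above (these notes use $a^{<t>} = \Gamma_o * c^{<t>}$; other formulations apply a tanh to $c^{<t>}$ first):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(a_prev, c_prev, x_t, Wc, Wu, Wf, Wo, bc, bu, bf, bo):
        """One LSTM step; returns the new hidden state a^<t> and memory cell c^<t>."""
        concat = np.concatenate([a_prev, x_t])
        c_tilde = np.tanh(Wc @ concat + bc)      # candidate memory
        gamma_u = sigmoid(Wu @ concat + bu)      # update gate
        gamma_f = sigmoid(Wf @ concat + bf)      # forget gate
        gamma_o = sigmoid(Wo @ concat + bo)      # output gate
        c_t = gamma_u * c_tilde + gamma_f * c_prev
        a_t = gamma_o * c_t
        return a_t, c_t

    # Assumed sizes: hidden state 8, input 16.
    n_a, n_x = 8, 16
    rng = np.random.default_rng(0)
    Wc, Wu, Wf, Wo = [rng.normal(size=(n_a, n_a + n_x)) * 0.1 for _ in range(4)]
    bc, bu, bf, bo = [np.zeros(n_a) for _ in range(4)]

    a, c = lstm_step(np.zeros(n_a), np.zeros(n_a), rng.normal(size=n_x),
                     Wc, Wu, Wf, Wo, bc, bu, bf, bo)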

The GRU unit was invented more recently as a simpler alternative to the LSTM.

Which unit is better depends on the use case.

GRU and LSTM units can still only use information from the past.

  • He said, "Teddy Roosevelt was a great President."
  • He said, "Teddy bears are on sale."

Using information from the future

Combining outputs: ${\hat y}^{<t>} = g (W_y \begin{bmatrix}\overrightarrow{a}^{<t>} \\ \overleftarrow{a}^{<t>} \end{bmatrix} + b_y)$

Bidirectional RNNs (BRNN)

  • Use information both from past and future of sequence
  • Can use any kind of RNN unit (e.g. GRU or LSTM)

Deep RNNs

$a^{[2]<3>} = g(W_a^{[2]} \begin{bmatrix} a^{[2]<2>} \\ a^{[1]<3>}\end{bmatrix} + b_a^{[2]})$

Can use GRU or LSTM (or any other RNN) unit for deep RNNs

Deep RNN can also be bidirectional.

RNNs often have only a few layers.

The temporal dimension already adds a lot of depth and computational cost.

Artificial Intelligence

Natural Language Processing

How to represent words?

One Hot representation

Man (5391) Woman (9853) King (4914) Queen (7157) Apple (456) Orange (6257)
$\begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \end{bmatrix}$ $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}$ $\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix}$ $\begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$ $\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \\ 0 \end{bmatrix}$

All words are treated the same - similarities are ignored

  • I want a glass of orange ___
  • I want a glass of apple ___

One-hot vectors are pairwise orthogonal, so every pair of distinct words looks equally dissimilar:

$v_1 \cdot v_2 = 0$

Word Embeddings

Idea: Find a featurized representation

Man (5391) Woman (9853) King (4914) Queen (7157) Apple (456) Orange (6257)
Gender -1 1 -0.95 0.97 0.00 0.01
Royal 0.01 0.02 0.93 0.95 -0.01 0.00
Age 0.03 0.02 0.7 0.69 0.03 -0.02
Food 0.04 0.01 0.02 0.01 0.95 0.97
...
$e_{5391}$ $e_{9853}$ $e_{4914}$ $e_{7157}$ $e_{456}$ $e_{6257}$

Use embedding vectors $e$ instead of one-hot encodings.

"Apple" and "Orange" have more similar embedding vectors $e_{456}$ and $e_{6257}$

$e_{456} \cdot e_{6257} > 0$

Language model can base prediction on, e.g., the "fruit" feature:

  • I want a glass of orange ___
  • I want a glass of apple ___

We want to learn the word embedding instead of specifying it.

Do not necessarily get interpretable features.

Example: Named entity recognition RNN

Using embedding vectors as inputs helps with generalization.

$Y$:   1       1        0    0     0        0
       Sally   Johnson  is   an    orange   farmer

       Robert  Lin      is   a     durian   cultivator

Idea: Learn the embedding on a (large) dataset / corpus
(possibly different from the actual task).

Can reuse embedding for different tasks.

Can use larger vocabulary.

Transfer learning: Use learned knowledge from one dataset on different task.

Transfer learning

  1. Learn word embedding on large unlabeled text corpus (billions of texts)
    (or download pre-trained embedding)
  2. Use embedding for input features on new task (with smaller training set).
  3. Optional: fine tune embedding to new task.

Similar to using a pre-trained CNN and adjusting only the last layers.

Analogies

Given the embedding vectors

\[ e_\text{man} = \begin{bmatrix} -1 \\ 0.01 \\ 0.03 \\ 0.09 \end{bmatrix} \quad e_\text{woman} = \begin{bmatrix} 1 \\ 0.02 \\ 0.02 \\ 0.01 \end{bmatrix} \quad e_\text{king} = \begin{bmatrix} -0.95 \\ 0.93 \\ 0.70 \\ 0.02 \end{bmatrix} \quad e_\text{queen} = \begin{bmatrix} 0.97 \\ 0.95 \\ 0.69 \\ 0.01 \end{bmatrix} \]

Man is to woman as king is to ___

$e_\text{man} - e_\text{woman} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} \qquad e_\text{king} - e_\text{queen} \approx \begin{bmatrix} -2 \\ 0 \\ 0 \\ 0 \end{bmatrix} $

$e_\text{man} - e_\text{woman} \approx e_\text{king} - e_\text{queen}$

$e_\text{king} - e_\text{man} + e_\text{woman} \approx \begin{bmatrix} 1 \\ 1 \\ 0.7 \\ 0 \end{bmatrix}$

Now find the word with the most similar embedding.

Concept still works with learned features!

Works for many simple analogies:

  • Man:Woman as Boy:Girl
  • Ottawa:Canada as Nairobi:Kenya
  • Big:Bigger as Tall:Taller
  • Yen:Japan as Pound:England

What does "similar" mean

Cosine distance: $\frac{v_1 \cdot v_2}{||v_1||\; ||v_2||}$

Angle is more important than euclidean distance.
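A small sketch of the analogy lookup with cosine similarity, reusing the four example embedding vectors from above (in practice the query words themselves are usually excluded from the candidates):

    import numpy as np

    # The four example embeddings from above (4 features each).
    E = {
        "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
        "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
        "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
        "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
    }

    def cosine_similarity(v1, v2):
        return (v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    # "man is to woman as king is to ___": word closest to e_king - e_man + e_woman
    target = E["king"] - E["man"] + E["woman"]
    best = max(E, key=lambda w: cosine_similarity(E[w], target))
    print(best)  # queen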

Learning word embeddings

We learn an embedding matrix $E$ with one column per word.

With a 10000 word vocabulary and a 300 dimensional embedding, $E \in \mathbb{R}^{300 \times 10000}$

With one hot vector $o_{6257}$, we have $E o_{6257} = e_{6257}$.
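A quick sketch showing that $E o_{6257}$ is just column 6257 of $E$; in practice the multiplication is replaced by direct indexing (the random $E$ here stands in for a learned matrix):

    import numpy as np

    n_embed, n_vocab = 300, 10000
    rng = np.random.default_rng(0)
    E = rng.normal(size=(n_embed, n_vocab))   # embedding matrix, one column per word

    o = np.zeros(n_vocab)
    o[6257] = 1.0                             # one-hot vector o_6257 ("orange")

    e_matmul = E @ o                          # E o_6257
    e_lookup = E[:, 6257]                     # the same column, much cheaper to get
    print(np.allclose(e_matmul, e_lookup))    # True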

Idea: Learn word embedding as part of a language model

I           want        a         glass       of          orange      ___
4343        9665        1         3852        6163        6257
$o_{4343}$  $o_{9665}$  $o_{1}$   $o_{3852}$  $o_{6163}$  $o_{6257}$

Predict next word from last $k$ words.

Train like any other neural network

Shared parameters $E$ (similar to convolutions).

Other contexts for language models

I want a glass of orange juice to go along with my cereal.

Target: juice

    Context:

  • Last 4 words
  • 4 words on left & right
  • Last 1 word
  • Nearby 1 word

When sampling the context from a corpus we want to put more weight on rare words.
(e.g. weight by inverse document frequency $\log(\frac{\text{\# documents}}{\text{\# documents with word}})$)

Word2Vec

Skip-gram: Target within n-word window of context word

"I want a glass of orange juice to go along with my cereal."

Context   Target
orange    juice
orange    glass
orange    my
juice     along
juice     orange

Word2Vec model

Context $c$ (e.g. "orange" 6257) to target $t$ (e.g. "juice" 4834)

    Learned parameters

  • Embedding matrix $E$
  • Vector $\theta_t$ for each target.
  • 2 Layers

  • $e_c = E o_c$ (embedding)
  • $\hat y_t = \frac{e^{\theta_t \cdot e_c}}{\sum_j e^{\theta_j \cdot e_c}}$ (softmax)
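A minimal sketch of this prediction with random placeholder parameters (the word indices 6257/"orange" and 4834/"juice" are the ones used above); the full softmax over the vocabulary is exactly the expensive part discussed next:

    import numpy as np

    n_vocab, n_embed = 10000, 300
    rng = np.random.default_rng(0)
    E = rng.normal(size=(n_embed, n_vocab)) * 0.01      # embedding matrix
    Theta = rng.normal(size=(n_vocab, n_embed)) * 0.01  # one theta vector per target word

    c = 6257                      # context word "orange"
    e_c = E[:, c]                 # e_c = E o_c
    logits = Theta @ e_c          # theta_t . e_c for every possible target t
    y_hat = np.exp(logits - logits.max())
    y_hat /= y_hat.sum()          # softmax over the entire vocabulary
    print(y_hat[4834])            # P(target = "juice" | context = "orange")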

Problem: the softmax is slow for a large number of targets

\[\hat y_t = \frac{e^{\theta_t \cdot e_c}}{\sum^{10^7}_{j=1} e^{\theta_j \cdot e_c}}\]

Hierarchical softmax

Idea: Use tree of binary classifiers

In practice, the tree is often unbalanced, with rare words placed deeper than common ones.

Negative sampling

Idea: Avoid softmax by changing the objective

For a random context word we either pick an actual nearby target word (a positive example) or a random word from the dictionary (a negative example).

"I want a glass of orange juice to go along with my cereal."

Context   Word    Target?
orange    juice   1
orange    king    0
orange    book    0
orange    the     0
orange    of      0

For each positive example generate $k$ negative ones

Model

               $x$              $y$
Context ($c$)  Word ($t$)       Target?
orange         juice            1
orange         king             0
orange         book             0
orange         the              0
orange         of               0

$\hat y = \sigma(\theta_t \cdot e_c)$

One $\theta$ for each possible target $t$.
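A sketch of one training example of negative sampling with $k = 4$: one positive (context, target) pair plus $k$ random negative words, each scored by an independent sigmoid (parameters and indices are placeholders):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n_vocab, n_embed, k = 10000, 300, 4
    rng = np.random.default_rng(0)
    E = rng.normal(size=(n_embed, n_vocab)) * 0.01
    Theta = rng.normal(size=(n_vocab, n_embed)) * 0.01

    c, positive = 6257, 4834                      # context "orange", target "juice"
    negatives = rng.integers(0, n_vocab, size=k)  # k random "wrong" targets (uniform here;
                                                  # see below for a better distribution)
    targets = np.concatenate([[positive], negatives])
    labels = np.array([1.0] + [0.0] * k)

    y_hat = sigmoid(Theta[targets] @ E[:, c])     # k + 1 independent binary classifiers
    loss = -np.sum(labels * np.log(y_hat) + (1 - labels) * np.log(1 - y_hat))
    print(loss)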

How to sample the negative target words

Sampling by frequency in the corpus yields too many words like "the", "of", "and", "a", ...

Sampling uniformly over the dictionary is too far from the actual distribution.

\[ P(w_i) = \frac{f(w_i)^{0.75}}{\sum_j f(w_j)^{0.75}} \] $f(w_i)$ is the frequency of $w_i$.
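A small sketch of this heuristic distribution; the word counts are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    words = ["the", "of", "orange", "juice", "durian"]
    counts = np.array([5000.0, 3000.0, 40.0, 25.0, 2.0])  # made-up corpus frequencies

    p = counts ** 0.75
    p /= p.sum()                            # P(w_i) = f(w_i)^0.75 / sum_j f(w_j)^0.75

    print(dict(zip(words, p.round(3))))
    print(rng.choice(words, size=5, p=p))   # draw 5 negative target words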

GloVe (global vectors for word representation)

"I want a glass of orange juice to go along with my cereal."

$x_{ct}$: Number of times $t$ appears in context of $c$

(with symmetric context window: $x_{ct} = x_{tc}$)

Objective: \[ \min_{\theta, \tilde E} \sum_c \sum_t f(x_{ct}) (\theta_t \cdot \tilde e_c + b_t + \tilde b_c - \log(x_{ct}))^2 \]

$f$ is a weighting function with $f(x) = 0$ if $x=0$, that gives a small weight to stop words and a large one to rare words.

Roles of $e_w$ and $\theta_w$ are fully symmetric.

Assign final word vector as average: \[ e_w = \frac{\tilde e_w + \theta_w}{2} \]
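A sketch of a single term of this objective. The concrete weighting function used here (a capped power law, as in the original GloVe paper) is an assumption on top of these notes:

    import numpy as np

    def f(x, x_max=100.0, alpha=0.75):
        """Weighting: 0 for x = 0, small for rare co-occurrences, capped for frequent ones."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_term(theta_t, e_c, b_t, b_c, x_ct):
        """f(x_ct) * (theta_t . e_c + b_t + b_c - log x_ct)^2 for one (context, target) pair."""
        if x_ct == 0:
            return 0.0                        # f(0) = 0, and log(0) is never evaluated
        return float(f(x_ct) * (theta_t @ e_c + b_t + b_c - np.log(x_ct)) ** 2)

    rng = np.random.default_rng(0)
    theta_t, e_c = rng.normal(size=50), rng.normal(size=50)
    print(glove_term(theta_t, e_c, 0.1, -0.2, x_ct=37))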

The individual entries of learned word vectors are usually not interpretable features
(like the outputs of hidden layers)

The analogy reasoning (like "man is to woman as king is to ___") still works.

Using word embeddings
Example: sentiment classification

"The desert is excellent" $\Rightarrow$ ★★★★

Simple sentiment classification model

\[ \hat y = \text{softmax}( \theta \cdot \frac{1}{T_x} \sum_t e_t + b) \]

Averaging the implied features of the embeddings can yield very reasonable predictors.

Especially with very little training data.
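A minimal sketch of this averaging classifier; the embeddings and weights are random placeholders rather than a trained model:

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    n_embed, n_classes = 300, 5                # 5 star ratings
    rng = np.random.default_rng(0)
    embedding = {w: rng.normal(size=n_embed)
                 for w in ["the", "dessert", "is", "excellent"]}
    Theta, b = rng.normal(size=(n_classes, n_embed)) * 0.01, np.zeros(n_classes)

    words = "the dessert is excellent".split()
    e_avg = np.mean([embedding[w] for w in words], axis=0)  # average the word embeddings
    y_hat = softmax(Theta @ e_avg + b)                      # distribution over star ratings
    print(y_hat)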

Averaging shares the missing-context problem of other bag-of-words models:

"Completely lacking in good taste, good service, and good ambience."

RNN with embeddings as input features

Almost any NLP model will generalize better with pretrained word embeddings as inputs.

  • Generalize to words not in training data.
  • Fewer features per word $\Rightarrow$ fewer parameters $\Rightarrow$ less overfitting.

Bias in word embeddings

Man:Woman as King:Queen

Man:Computer_Programmer as Woman:Homemaker

Father:Doctor as Mother:Nurse

Word embedding learns bias from its corpus.

Bolukbasi et al.: "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings"

Machine learning models should not reinforce present biases that are undesirable.

Especially in sensitive domains like loan applications, judicial recommendations, ...

  1. Identify bias defining direction: \[ e_\text{he} - e_\text{she}\\ e_\text{man} - e_\text{woman}\\ e_\text{male} - e_\text{female}\\ \Rightarrow \text{average} \]
  2. Neutralize: Project every word that is not definitional onto the subspace orthogonal to the bias direction (see the sketch after this list)
  3. Equalize remaining pairs
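A sketch of the neutralize step from the list above: remove the component of a word vector that lies along the bias direction (all vectors here are random placeholders):

    import numpy as np

    def neutralize(e, g):
        """Project out the component of embedding e along the bias direction g."""
        e_bias = (e @ g) / (g @ g) * g   # projection of e onto g
        return e - e_bias                # the result is orthogonal to g

    rng = np.random.default_rng(0)
    g = rng.normal(size=50)              # bias direction, e.g. averaged e_he - e_she, ...
    e_doctor = rng.normal(size=50)       # a word that should be neutral w.r.t. g

    e_debiased = neutralize(e_doctor, g)
    print(e_debiased @ g)                # ~0: no component along the bias direction left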

Idea can be applied to other explicit biases:

  • ethnicity
  • religion
  • nationality
  • ...

Most word embeddings you find for download are not debiased.

And if they are, usually just for gender.

Multilingual word embeddings

    Core idea:

  • Train with multilingual corpus
  • Ensure similar words are close (small cosine distance)

With a multilingual word embedding, models can even work for unseen languages.

Transfer has limits (e.g. grammar is not learned).

Good pre-trained word embedding: ConceptNet

Token based encodings

If it is not feasible to build a simple word dictionary, we can learn a token-based encoding.

A token is a group of contiguous characters.

It can be a whole word or part of a word.

Byte pair encoding

Idea: We start with one token for each possible byte (256 tokens).

Most common pairs of tokens are merged into a new token.

Example

String: aaabdaaabac

4 tokens: a, b, c, d

aaabdaaabac

aa is the most common pair

ZabdZabac

5 tokens: a, b, c, d, Z=aa

ZabdZabac

ab is the most common pair

ZYdZYac

6 tokens: a, b, c, d, Z=aa, Y=ab

ZYdZYac

ZY is the most common pair

XdXac

7 tokens: a, b, c, d, Z=aa, Y=ab, X=aaab

At this point no further compression is possible (every remaining pair occurs only once)

With BPE we can control the dictionary size, since we choose how many merges to perform

No unknown characters (each byte has a token by default)
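A sketch of the BPE merge loop on the example string above, using characters as the initial tokens to keep it readable; ties between equally frequent pairs may be broken differently than in the worked example, but the final segmentation is the same:

    from collections import Counter

    def byte_pair_encode(tokens, num_merges):
        """Repeatedly merge the most frequent adjacent token pair into a new token."""
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), count = pairs.most_common(1)[0]
            if count < 2:
                break                          # no pair repeats, no further compression
            merges.append(a + b)
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return tokens, merges

    tokens, merges = byte_pair_encode(list("aaabdaaabac"), num_merges=10)
    print(tokens)   # ['aaab', 'd', 'aaab', 'a', 'c']  -- the X d X a c segmentation
    print(merges)   # the learned merge rules (dictionary entries beyond the base tokens)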

Systems like ChatGPT use token-based encodings.

Token-based encodings can also handle cases where building a word dictionary is challenging:

  • Different languages (with different character sets)
  • Misspellings
  • Programming languages

For machine translation we turn an input sequence into an output sequence

Encoder-decoder architecture

We use two different RNNs, one for encoding, one for decoding.

Both share the same state vector: the encoder's final state initializes the decoder.

Entire input sequence must be encoded in a single state vector.

As information capacity is limited, we cannot translate arbitrarily long texts in a single step.

The translation system works similarly to the simple language model from before.

Neural network approximates

\[ P(y^{<i>} | x^{<1>}, \ldots, x^{<T_x>}, y^{<1>}, \ldots, y^{<i-1>}) \]

Probability of the next output word given the original text and the translation so far.

When picking the next word, why might sampling from the calculated probabilities not be the best idea?

A bad translation still has a positive probability. We want to get a good translation.

"Jane visite l'Afrique en septembre."

  • "Jane is visiting Africa in September"
  • "Jane is going to be visiting Africa in September"
  • "In September, Jane will visit Africa"
  • "Her African friend welcomed Jane in September"

Idea: Always pick the most likely next symbol.

This is called greedy search.

Greedy search does not necessarily yield the overall most likely sentence.

Translation so far: "Jane is"

    Options:

  • visiting
  • going

As "going" is a more common word, we could have

\[ P(\text{going}| x, \text{Jane}, \text{is}) > P(\text{visiting}| x, \text{Jane}, \text{is}) \]

(where $x$ is the input sentence "Jane visite l'Afrique en septembre.")

Still, the probability of the entire continued sentence
"Jane is visiting Africa in September"
should be higher than that of
"Jane is going to be visiting Africa in September"

Probability of the entire sentence:

\[ P(\text{Jane is visiting Africa in September}| x) = \\ P(\text{Jane}| x) \cdot P(\text{is}| x, \text{Jane}) \cdot \\ P(\text{visiting}| x, \text{Jane}, \text{is}) \cdot \\ P(\text{Africa}| x, \text{Jane}, \text{is}, \text{visiting}) \cdot \\\ldots \]

Greedy search easily ends up in a local optimum.

Beam search

Idea: Keep a list with multiple ($K$) options for the most likely sentence.

At each step, we calculate the most likely next words for each of the $K$ prefixes. Of all options, we keep the $K$ most likely ones.

Greedy Search

Beam Search (K=3)

Example: Translate "Jane visite l'Afrique en septembre."

Most likely first words:

  • Jane ($p=0.3$)
  • In ($p=0.25$)
  • September ($p=0.15$)

We now need to run the neural network 3 times to get the most likely continuations for each possible prefix:

  • Jane ($p=0.3$)
    • is ($p=0.4$)
    • visits ($p=0.35$)
    • goes ($p=0.1$)
  • In ($p=0.25$)
    • September ($p=0.6$)
    • Africa ($p=0.1$)
    • visiting ($p=0.05$)
  • September ($p=0.15$)
    • is ($p=0.3$)
    • when ($p=0.05$)
    • the ($p=0.02$)

Using the joint probability $P(y^{<1>} | x) \cdot P(y^{<2>} | x, y^{<1>})$ we decide which prefixes are most likely:

  • Jane
    • is ($p=0.12$)
    • visits ($p=0.105$)
    • goes ($p=0.03$)
  • In
    • September ($p=0.15$)
    • Africa ($p=0.025$)
    • visiting ($p=0.0125$)
  • September
    • is ($p=0.045$)
    • when ($p=0.0075$)
    • the ($p=0.003$)

We repeat using the three new prefixes

  • Jane is ($p=0.12$)
    • going ($p=0.22$)
    • visiting ($p=0.2$)
    • traveling ($p=0.15$)
  • Jane visits ($p=0.105$)
    • Africa ($p=0.4$)
    • the ($p=0.15$)
    • in ($p=0.01$)
  • In September ($p=0.15$)
    • Jane ($p=0.3$)
    • Africa ($p=0.05$)
    • a ($p=0.02$)
Multiplying by the prefix probabilities gives the joint probabilities:

  • Jane is
    • going ($p=0.0264$)
    • visiting ($p=0.024$)
    • traveling ($p=0.018$)
  • Jane visits
    • Africa ($p=0.042$)
    • the ($p=0.01575$)
    • in ($p=0.00105$)
  • In September
    • Jane ($p=0.045$)
    • Africa ($p=0.0075$)
    • a ($p=0.003$)

Beam search is still not guaranteed to find a global optimum.

It only gives us a better chance of finding a good solution than greedy search.

With $K=1$ we get greedy search.
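A compact sketch of beam search over log probabilities; `step_fn` is only a random placeholder for the trained decoder, and the tiny vocabulary is made up:

    import numpy as np

    def beam_search(step_fn, vocab, K=3, max_len=8, eos="<EOS>"):
        """Keep the K most likely prefixes, scored by the sum of log probabilities.
        step_fn(prefix) must return a distribution over vocab for the next word."""
        beams = [([], 0.0)]                                # (prefix, log probability)
        for _ in range(max_len):
            candidates = []
            for prefix, logp in beams:
                if prefix and prefix[-1] == eos:
                    candidates.append((prefix, logp))      # finished sentence, keep as-is
                    continue
                p = step_fn(prefix)
                for i, word in enumerate(vocab):
                    candidates.append((prefix + [word], logp + np.log(p[i] + 1e-12)))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:K]
        return beams

    # Placeholder "decoder": random distributions, just to make the sketch runnable.
    rng = np.random.default_rng(0)
    vocab = ["Jane", "is", "visiting", "Africa", "in", "September", "<EOS>"]
    def step_fn(prefix):
        logits = rng.normal(size=len(vocab))
        return np.exp(logits) / np.exp(logits).sum()

    for prefix, logp in beam_search(step_fn, vocab):
        print(round(logp, 2), " ".join(prefix))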

What beam width $K$ is good?

Computation scales linearly with $K$. We can choose $K$ based on our computational budget.

Often something between 10 and 100.

The product of many probabilities ($< 1$) will lead to very small numbers.

\[ \prod_{i=1}^{T_y} P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]

This can lead to numerical issues (underflow and rounding errors).

Better: Use the logarithm

\[ \log \prod_{i=1}^{T_y} P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \\ = \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]

Longer sentences tend to have a smaller probability

(multiplying more numbers below 1)

We can counteract this bias against long sentences by normalizing by the sequence length:

\[ \frac{1}{T_y} \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]

In practice often a softer normalization yields good results:

\[ \frac{1}{{T_y}^\alpha} \sum_{i=1}^{T_y} \log P(y^{<i>} | x, y^{<1>}, \ldots, y^{<i-1>}) \]

with $0 < \alpha < 1$ (e.g. $\alpha=0.7$).
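A tiny sketch of this normalized score; the per-word probabilities below are made up:

    import numpy as np

    def normalized_score(word_probs, alpha=0.7):
        """(1 / T_y^alpha) * sum_i log P(y^<i> | x, ...); higher (less negative) is better."""
        log_probs = np.log(np.asarray(word_probs))
        return log_probs.sum() / (len(log_probs) ** alpha)

    short_candidate = [0.30, 0.40]                          # 2 words
    long_candidate = [0.30, 0.40, 0.35, 0.40, 0.45]         # 5 words
    print(normalized_score(short_candidate), normalized_score(long_candidate))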

How to evaluate the performance of a NLP system?

In many cases just looking at the loss function (on the test set) is misleading

  • Just a proxy
  • Might be "overoptimized"

Example: Machine translation

How to automatically evaluate if a translation is good?

Especially as multiple translations can be correct.

BLEU (bilingual evaluation understudy) score

Idea: Measure distance from (multiple) reference translations

French: "Le chat est sur le tapis."

  • Reference 1: "The cat is on the mat."
  • Reference 2: "There is a cat on the mat."

Translation: "The the the the the the the."

  • Number of words in translation: $7$
  • Maximum number of times the word "the" appears in a reference: $2$
  • Score: $\frac{2}{7}$

Instead of individual words, we can look at bigrams.

French: "Le chat est sur le tapis."

  • Reference 1: "The cat is on the mat."
  • Reference 2: "There is a cat on the mat."

Translation: "The cat the cat on the mat."

Bigram    Count   Clipped count
the cat   2       1
cat the   1       0
cat on    1       1
on the    1       1
the mat   1       1

Score: $\frac{1+0+1+1+1}{2+1+1+1+1} = \frac{4}{6}$

We can do that for each n-gram

\[ p_n = \frac{\sum_{x \in \text{n-grams}(\hat y)} \text{ClippedCount}(x)}{\sum_{x \in \text{n-grams}(\hat y)} \text{Count}(x)} \]

(n-gram precision)

If a translation equals one of the references, then

\[ p_1 = 1, p_2 = 1, p_3 = 1, \ldots \]

The BLEU score is a geometric mean of the n-gram precisions:

\[ \text{BLEU} = \text{BP} \cdot \exp \left( \frac{1}{4} \sum_{n=1}^4 \log p_n \right) \]

BP (brevity penalty) is a term to penalize short translations.

BP is 1 if the translation is at least as long as the reference.

BP is smaller than 1 if the translation is shorter.

\[ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - \frac{r}{c}) & \text{if } c \leq r \end{cases} \]

$c$ is the length of the candidate translation, $r$ the length of the reference.
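A minimal sketch of the score following the formulas above; real BLEU implementations handle the effective reference length and zero n-gram precisions (smoothing) more carefully:

    from collections import Counter
    import math

    def ngrams(words, n):
        return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    def modified_precision(candidate, references, n):
        """Clipped n-gram precision p_n against one or more references."""
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            return 0.0
        clipped = sum(min(count, max(Counter(ngrams(ref, n))[gram] for ref in references))
                      for gram, count in cand_counts.items())
        return clipped / sum(cand_counts.values())

    def bleu(candidate, references, max_n=4):
        c = len(candidate)
        r = min(len(ref) for ref in references)    # shortest reference (a simplification)
        bp = 1.0 if c > r else math.exp(1 - r / c)
        precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
        if min(precisions) == 0:
            return 0.0                             # avoid log(0); real systems smooth instead
        return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

    references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
    candidate = "the cat the cat on the mat".split()
    print(modified_precision(candidate, references, 2))   # 4/6, as in the table above
    print(bleu(candidate, references))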

BLEU correlates well with human judgement of translation quality and is easy to automate.

BLEU is not a loss function!