Sequence models allow using context before or after the current position
Long distance information still vanishes.
Encoder-decoder architectures have a fixed-length memory between the two units.
The state vector is a bottleneck for memorizing long sequences
We would like to always have the entire input sequence available as context.
How can we do that with variable sequence length?
Attention mechanism
Idea: Put more focus on the most relevant words from the input
Attention is a weighted sum of all input words
For each input word $i$ we calculate two vectors: a key vector $k_i$ and a value vector $v_i$.
At every decoding step $j$ we calculate a query vector $q_j$
The product $q_j \cdot k_i$ determines how relevant the word $i$ is for decoding position $j$.
$q$ and $k$ must be of the same dimension $n$.
Using the softmax function (after scaling by $\sqrt{n}$) we normalize these relevance scores
\[ \tilde w_{j,i} = \operatorname{softmax}_i\left(\frac{q_j \cdot k_i}{\sqrt{n}}\right) \]Finally we calculate the weighted sum of all values
\[ \sum_i \tilde w_{j,i} \cdot v_i \]The $v_i$ are word embeddings similar to those used in sequence models.
More relevant embeddings $v_i$ get a larger share in the final result.
Using the softmax of the $q_j \cdot k_i$ products we get a differentiable lookup
With attention, each input word can be accessed at any time, without a decay in relevance over time.
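A minimal NumPy sketch of this computation (our own illustration; the key, value and query vectors here are just random numbers):
import numpy as np

rng = np.random.default_rng(0)
seq_len, n = 5, 8                     # 5 input words, embedding dimension n

K = rng.normal(size=(seq_len, n))     # key vectors k_i
V = rng.normal(size=(seq_len, n))     # value vectors v_i
q = rng.normal(size=n)                # query vector q_j for one decoding step

scores = K @ q / np.sqrt(n)           # relevance of each input word for position j
w = np.exp(scores - scores.max())
w = w / w.sum()                       # softmax -> attention weights, sum to 1

context = w @ V                       # weighted sum of the values
print(w, context.shape)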
The dot product attention scales quadratically with the sequence length ($O(L^2)$ for sequence length $L$).
We still cannot have arbitrarily long input sequences.
(though there are some tricks to make this more efficient)
When calculating the initial embeddings for the input tokens ($k_i$, $v_i$), the algorithm should have a way to know the position of each token.
Otherwise we get a (complicated) bag of words model
Positional encoding
Idea: Add a component to the embedding vector that allows the model to identify the position of the token
The positional encoding of length $d$ at position $p$ is
\[ \begin{bmatrix} \sin(\frac{p}{10000^{\frac{2 \cdot 0}{d}}}) \\[3mm] \cos(\frac{p}{10000^{\frac{2 \cdot 0}{d}}}) \\[3mm] \sin(\frac{p}{10000^{\frac{2 \cdot 1}{d}}}) \\[3mm] \cos(\frac{p}{10000^{\frac{2 \cdot 1}{d}}}) \\[3mm] \vdots \\[3mm] \sin(\frac{p}{10000^{\frac{d-2}{d}}}) \\[3mm] \cos(\frac{p}{10000^{\frac{d-2}{d}}}) \\[3mm] \end{bmatrix} \]
We get a pattern of waves with increasing wavelengths
https://commons.wikimedia.org/wiki/File:Positional_encoding.png
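A small NumPy sketch of the formula above (our own illustration):
import numpy as np

def positional_encoding(p, d):
    # Sinusoidal encoding of position p as a vector of length d (d even)
    i = np.arange(d // 2)
    angles = p / 10000 ** (2 * i / d)
    enc = np.empty(d)
    enc[0::2] = np.sin(angles)   # even entries: sine
    enc[1::2] = np.cos(angles)   # odd entries: cosine
    return enc

print(positional_encoding(p=3, d=8))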
Instead of a static positional encoding it is possible to learn a positional encoding.
https://commons.wikimedia.org/wiki/File:The-Transformer-model-architecture.png
Transformers are very generic and very powerful architectures
They will massively overfit if we do not have enough data.
Current Large Language Models (LLMs) - like GPT-4 - are transformer based language models (decoder only) trained on massive amounts of crawled web texts.
You can download a small to medium-sized pre-trained LLM and run it on your local machine.
%pip install --upgrade torch torchvision torchaudio
%pip install transformers einops accelerate xformers
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "togethercomputer/RedPajama-INCITE-7B-Instruct"
# Download (or load from the local cache) the tokenizer and the model weights
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Move the model to the GPU
model = model.to('cuda')
LLMs are trained to calculate conditional probabilities of the next token:
import numpy as np

inputs = tokenizer("Bananas are yellow and apples are", return_tensors="pt").to(model.device)
generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)

def softmax(x):
    # Subtract the maximum before exponentiating for numerical stability
    e = np.exp(x - x.max())
    return e / e.sum()

# Scores of the single generated token, converted to a NumPy array
probs = softmax(generation_output.scores[0][0].cpu().numpy())
top_10 = np.argsort(-probs)[:10]
for token_id in top_10:
    token = tokenizer.decode(token_id)
    print(f"{repr(token)}: {probs[token_id] * 100:.2f}%")
LLMs show a surprising ability to generalize to new tasks.
prompt = """Classify the text into neutral, negative or positive.
Text: Even though the acting was terrific, the whole movie could not stand up to our expectations.
Sentiment:"""
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(
**inputs, max_new_tokens=3, num_beams=5, return_dict_in_generate=True
)
output_str = tokenizer.decode(outputs.sequences[0,:])
print(output_str)
LLMs can be particularly good at extracting information from unstructured input
prompt = """
Context:
BEIJING, Sept 15 (Reuters) - China's factory output and retail sales grew at a faster pace in August, but tumbling investment in the crisis-hit property sector threatens to undercut a flurry of support steps that are showing signs of stabilising parts of the wobbly economy.
Chinese policymakers are facing a daunting task in trying to revive growth following a brief post-COVID bounce in the wake of persistent weakness in the crucial property industry, a faltering currency and weak global demand for its manufactured goods.
Industrial output rose 4.5% in August from a year earlier, data released on Friday by the National Bureau of Statistics (NBS) showed, accelerating from the 3.7% pace in July and beating expectations for a 3.9% increase in a Reuters poll of analysts. The growth marked the quickest pace since April.
Retail sales, a gauge of consumption, also increased at a faster 4.6% pace in August aided by the summer travel season, and was the quickest growth since May. That compared with a 2.5% increase in July, and an expected 3% rise.
The upbeat data suggest that a flurry of recent measures to shore up a faltering economy are starting to bear fruit.
Yet, a durable recovery is far from assured, analysts say, especially as confidence remains low in the embattled property sector and continues to be a major drag on growth.
"Despite signs of stabilisation in manufacturing and related investment, the deteriorating property investment will continue to pressure economic growth," said Gary Ng, Natixis Asia Pacific senior economist.
The markets, however, showed relief at some of the better-than-expected indicators.
The Chinese yuan touched two-week highs against the dollar, while the blue-chip CSI 300 Index (.CSI300) was up 0.2% and Hong Kong's Hang Seng Index (.HSI) climbed 1% in early morning trade.
Further aiding sentiment, separate commodities data showed China's primary aluminium output hit a record-monthly high in August while oil refinery throughput also rose to a record.
Question:
How does the Chinese yuan perform? (good/bad/neutral)
Answer:"""
prompt = """
Create a JSON representation of the article, containing the following keys: 'source', 'country', 'tags', 'category'.
Article:
BEIJING, Sept 15 (Reuters) - China's factory output and retail sales grew at a faster pace in August, but tumbling investment in the crisis-hit property sector threatens to undercut a flurry of support steps that are showing signs of stabilising parts of the wobbly economy.
Chinese policymakers are facing a daunting task in trying to revive growth following a brief post-COVID bounce in the wake of persistent weakness in the crucial property industry, a faltering currency and weak global demand for its manufactured goods.
Industrial output rose 4.5% in August from a year earlier, data released on Friday by the National Bureau of Statistics (NBS) showed, accelerating from the 3.7% pace in July and beating expectations for a 3.9% increase in a Reuters poll of analysts. The growth marked the quickest pace since April.
Retail sales, a gauge of consumption, also increased at a faster 4.6% pace in August aided by the summer travel season, and was the quickest growth since May. That compared with a 2.5% increase in July, and an expected 3% rise.
The upbeat data suggest that a flurry of recent measures to shore up a faltering economy are starting to bear fruit.
Yet, a durable recovery is far from assured, analysts say, especially as confidence remains low in the embattled property sector and continues to be a major drag on growth.
"Despite signs of stabilisation in manufacturing and related investment, the deteriorating property investment will continue to pressure economic growth," said Gary Ng, Natixis Asia Pacific senior economist.
The markets, however, showed relief at some of the better-than-expected indicators.
The Chinese yuan touched two-week highs against the dollar, while the blue-chip CSI 300 Index (.CSI300) was up 0.2% and Hong Kong's Hang Seng Index (.HSI) climbed 1% in early morning trade.
Further aiding sentiment, separate commodities data showed China's primary aluminium output hit a record-monthly high in August while oil refinery throughput also rose to a record.
JSON:"""
When requesting structured output, one trick is to modify the LLM's output probabilities to enforce a correct structure.
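A toy sketch of that idea (our own illustration, not a specific library feature): mask all logits except a small set of allowed tokens before choosing the next one. Here we assume prompt is a classification prompt such as the sentiment example above.
import torch

# Hypothetical set of allowed continuations for a classification-style prompt
choices = [" positive", " negative", " neutral"]
allowed_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in choices]

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]        # logits for the next token
mask = torch.full_like(logits, float("-inf"))
mask[allowed_ids] = 0.0                           # keep only the allowed tokens
print(tokenizer.decode(torch.argmax(logits + mask).item()))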
The prompts above were instances of "zero-shot" prompting.
Sometimes it can be beneficial to provide examples to improve performance ("few-shot" prompting)
prompt = """
This is awesome! - 9/10
This is bad! - 3/10
This was the most horrific show - 1/10
Wow that movie was rad! - 10/10
The movie was quite okay - 6/10
Unimaginative plot but good performance! -"""
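Using the complete helper sketched earlier (our own addition), we can let the model continue the pattern with a rating:
print(complete(prompt, max_new_tokens=4))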
Question answering works best if the LLM has access to a reference text containing the answer.
prompt = """
Context:
In the time of King Frederick William I (1688), shortly after the Thirty Years' War and a century before today's Brandenburg Gate was constructed, Berlin was a small walled city within a star fort with several named gates: Spandauer Tor, St. Georgen Tor, Stralower Tor, Cöpenicker Tor, Neues Tor, and Leipziger Tor (see map). Relative peace, a policy of religious tolerance, and status as capital of the Kingdom of Prussia facilitated the growth of the city. With the construction of Dorotheenstadt around 1670 and its inclusion in Berlin's city fortifications, a first gate was built on the site, consisting of a breach through the raised wall and a drawbridge over the dug moat.
The Berlin Customs Wall with its eighteen gates, around 1855. The Brandenburger Thor (Brandenburg Gate) is on the left.
With the construction of the Berlin Customs Wall (German: Akzisemauer) in 1734, which enclosed the old fortified city and many of its then suburbs, a predecessor of today's Brandenburg Gate was built by the Court Architect Philipp Gerlach as a city gate on the road to Brandenburg an der Havel. The gate system consisted of two Baroque pylons decorated with pilasters and trophies, to which the gate wings were attached. In addition to the ornamental gate, there were simple passages for pedestrians in the wall, which were decorated with ornamental vases at this point.[3]
The current gate was commissioned by Frederick William II of Prussia to represent peace and was originally named the Peace Gate (German: Friedenstor).[4] It was designed by Carl Gotthard Langhans, the Court Superintendent of Buildings, and built between 1788 and 1791, replacing the earlier simple guardhouses which flanked the original gate in the Customs Wall. The gate consists of twelve Doric columns, six to each side, forming five passageways. Citizens were originally allowed to use only the outermost two on each side. Its design is based on the Propylaea, the gateway to the Acropolis in Athens, and is consistent with Berlin's history of architectural classicism (first Baroque, and then neo-Palladian). The gate was the first element of a "new Athens on the river Spree" by Langhans.[5] Atop the gate is a sculpture by Johann Gottfried Schadow of a quadriga—a chariot drawn by four horses—driven by Victoria, the Roman goddess of victory.
Question:
Who designed the Brandenburg Gate?
Answer:"""
How can we provide an LLM with such a reference?
Idea 1: Use web search and a crawler.
Approach
Idea 2: Use semantic similarity search.
We have seen for word embeddings that similar words have a small cosine distance.
A language model also creates vector representations of the text it processes.
We can use those vectors to find similar texts.
Transformers use the same idea when calculating the attention scores.
(with key and query vectors)
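One rough way to sketch this (our own illustration; dedicated embedding models usually give much better vectors than mean-pooling a causal LM):
import torch

def embed(text):
    # Mean-pool the last hidden layer of the loaded model as a crude text embedding
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
    return hidden.mean(dim=1).squeeze(0)

docs = ["The gate was designed by Carl Gotthard Langhans.",
        "Industrial output rose 4.5% in August."]
query = embed("Who designed the Brandenburg Gate?")
for doc in docs:
    sim = torch.nn.functional.cosine_similarity(query, embed(doc), dim=0)
    print(f"{sim.item():.3f}  {doc}")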
Approach
Problem: With a large knowledge base we need to calculate many cosine distances
We search for the $k$ nearest neighbors (KNN) of a given vector
In general this will take $O(n)$ time for $n$ stored vectors.
A solution is to use approximate nearest neighbor (ANN) search.
Idea: It is okay if we sometimes skip over a closer neighbor
One ANN algorithm: Hierarchical Navigable Small World (HNSW) graphs
Vector databases implementing approximate nearest neighbor search:
As long as we can calculate a meaningful embedding, we can use it for searching.
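An illustration of HNSW-based approximate nearest neighbor search using the hnswlib package (an assumption on our side: install it with pip install hnswlib; the vectors here are random):
import hnswlib
import numpy as np

dim, num_elements = 64, 10000
data = np.float32(np.random.random((num_elements, dim)))

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))
index.set_ef(50)                                    # speed / recall trade-off

labels, distances = index.knn_query(data[:1], k=5)  # approximate 5 nearest neighbors
print(labels, distances)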
LLMs often show good zero-shot or few-shot performance.
If we have at least a few hundred examples, an LLM can be fine-tuned.
(similar to large pretrained image classifiers)
A small fine-tuned LLM can give better results than a large one that has not been fine-tuned.
This can be relevant because inference for LLMs can be costly.
Some LLMs, like ChatGPT, are "instruction-tuned".
They are fine-tuned on a dataset with instruction prompts, so they are better at executing commands.
Transformers are not limited to text data or even sequential data.
A transformer "perceives" its input as a bag of vectors.
Transformers can work on images.
Idea: Divide image into patches
Each patch gets a positional encoding and is flattened (or preprocessed using a CNN)
https://commons.wikimedia.org/wiki/File:Vision_Transformer.gif
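A toy NumPy sketch of the patching step (our own illustration with a random 224x224 RGB image and 16x16 patches):
import numpy as np

image = np.random.rand(224, 224, 3)            # stand-in for a real image
P = 16                                         # patch size
patches = image.reshape(224 // P, P, 224 // P, P, 3).swapaxes(1, 2)
patches = patches.reshape(-1, P * P * 3)       # one flattened vector per patch
print(patches.shape)                           # (196, 768)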
Image and text data can even be combined as input for a transformer.
As long as we use vectors $v$ and $k$ of the same length for all forms of input data.
Very large foundation models change the paradigm of how machine learning applications are developed: