Yegor - Attention and Transformers
Part 1: Attention
Part 2: Transformers
SIGBOVIK papers - TurkSort and RISE
- Sequence-to-sequence problems - most commonly translation, but the same setup applies to other tasks as well.
- Generating the next word is another such task.
- Recurrent networks - the old approach. Lots of information is crammed into a single vector, and information must take a long, meandering path through the system.
- Recurrent networks - previous states are propagated through time steps.
- In seq2seq problems, the output of the encoder is passed into a recurrent decoder.
- Many translation problems have complicated dependencies (e.g. gendered words) that need to be navigated.
- Attention - recurrence free, enables large models. Can model complex dependencies well and is trainable.
- Attention - pick and choose which words have to do with which other words.
- Query, Key, Value
- Query - serves an almost lookup-table-like function: it is what a token asks for.
- Keys - how things are looked up.
- Values - what we’re actually looking up.
- Take every word and embed it into a vector.
- Obtain queries, keys, and values for each token just by taking its embedding and multiplying it by a learned matrix (one matrix each for queries, keys, and values). This transforms it into a different space; fundamentally, it is a step to separate information out.
- Attention operation - one formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Take the dot product between each query and each key, scale by the square root of the key dimension, pass the scores through a softmax, and use the resulting weights to take a weighted average of the value vectors (see the first sketch after this list).
- Multi-head attention.
- Symbols can have multiple meanings. Multi-head attention - project into several smaller sets of queries, keys, and values, perform attention independently in each head, then concatenate the results (see the second sketch after this list).
- Problem - when we take the weighted average of the words, we lose all positional information; we effectively have a bag of words model.
- Hacky fix - positional encoding. Use sine and cosine waves of multiple frequencies to encode positional information, added directly to the embeddings (see the third sketch after this list).
- Transformer - equal path lengths between any pair of positions, fast parallel computation, and it avoids vanishing and exploding gradients.
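A minimal sketch of the single-formula attention operation described above, written in plain NumPy. The function name and toy dimensions are illustrative, not from the talk:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = K.shape[-1]
    # Dot product of every query with every key, scaled so the softmax
    # does not saturate when d_k is large.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Toy usage: 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))                       # token embeddings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # (3, 4)
```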
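Multi-head attention can be sketched the same way: slice the projected vectors into per-head subspaces, attend within each slice, and concatenate. Again, the names and sizes here are just an assumed toy setup:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Project x into queries/keys/values, split into n_heads slices,
    attend within each slice, concatenate, and mix with W_o."""
    d_model = x.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Each head attends in its own smaller subspace, so different
        # heads can pick up on different relationships between tokens.
        weights = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_head))
        heads.append(weights @ V[:, s])
    # Concatenating the heads restores the full model dimension.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy usage: 5 tokens, model dimension 8, 2 heads.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads=2).shape)  # (5, 8)
```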
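And a sketch of the sinusoidal positional encoding mentioned as the "hacky fix": each position gets a vector of sines and cosines at different wavelengths, which is simply added to the token embedding. Dimensions are illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sine/cosine encodings in the style of "Attention is All You Need":
    even dimensions use sin, odd dimensions use cos, with wavelengths
    growing geometrically with the dimension index."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# The encoding is added to the token embeddings so the otherwise
# order-blind attention layers can distinguish positions.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```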
Yegor has recommended additional resources to learn about attention and transformers.
- Illustrated Transformer
- Harvard NLP - Attention
- PyTorch - Transformer Tutorial
- “Attention is All You Need” Paper
- Transformer Slides from CSE 447 @ UW