In this video we're going to keep it simple: let's build GPT in under an hour. We'll go line by line through multihead attention to show what each step does, then build a tiny GPT model around it and feed our data pipeline from Episode 2 through it.
This is a necessary step for coding our paper, because Memorizing Transformers is built on variations of the "standard" GPT-style multihead attention decoder blocks…so that's where we'll start! We'll also build some intuition about what multihead attention does and why it's so widely used today.
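If you'd like a preview of where we're headed, below is a minimal sketch of a causal multihead self-attention block in PyTorch. The class name, dimensions, and hyperparameters here are illustrative assumptions for this description, not code from the notebook:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiheadSelfAttention(nn.Module):
    # A toy causal multihead self-attention block (illustrative, not the video's exact code).
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One linear layer produces queries, keys, and values for all heads at once.
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape  # batch, sequence length, model dimension
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the model dimension into heads: (B, T, C) -> (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention scores: (B, n_heads, T, T)
        scores = (q @ k.transpose(-2, -1)) / (self.d_head ** 0.5)
        # Causal mask: each position may only attend to itself and earlier positions.
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of values, then merge the heads back together: (B, T, C)
        out = (weights @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.out(out)

x = torch.randn(2, 10, 128)               # (batch, tokens, d_model)
print(MultiheadSelfAttention()(x).shape)  # torch.Size([2, 10, 128])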
After that, we’ll cover a few PyTorch-specific topics: how to use the library effectively, avoid common mistakes, and save yourself hours and hours of debugging by understanding a few principles that underlie PyTorch. These include (sketched in code right after this list):
Different matrix multiplication operators
Broadcasting
Multidimensional matrix multiplication
Operation properties: contiguity, reshaping, copying
Last but not least, a tiny, spelled-out example of multihead attention to really solidify your understanding
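To give you a taste before the video, here's a quick sketch of those PyTorch behaviors. The shapes and values are illustrative assumptions, not examples from the notebook:

import torch

# 1) Matrix multiplication operators: * is elementwise, while @ and
#    torch.matmul perform true matrix multiplication.
a = torch.randn(3, 4)
b = torch.randn(4, 5)
print((a @ b).shape)                    # torch.Size([3, 5])
print(torch.matmul(a, b).shape)         # same as @
print((a * a).shape)                    # elementwise: torch.Size([3, 4])

# 2) Broadcasting: dimensions are aligned from the right, and size-1
#    dimensions stretch to match.
row = torch.randn(1, 5)
col = torch.randn(4, 1)
print((row + col).shape)                # torch.Size([4, 5])

# 3) Multidimensional matrix multiplication: @ batches over all leading
#    dimensions and multiplies only the last two. Multihead attention
#    leans on this to handle (batch, heads, tokens, d_head) tensors.
q = torch.randn(2, 8, 10, 16)
k = torch.randn(2, 8, 10, 16)
print((q @ k.transpose(-2, -1)).shape)  # torch.Size([2, 8, 10, 10])

# 4) Contiguity, reshaping, copying: transpose returns a non-contiguous
#    view of the same memory; .view() requires contiguous memory, while
#    .reshape() silently copies when it must.
t = torch.randn(3, 4).transpose(0, 1)
print(t.is_contiguous())                # False
# t.view(12)                            # RuntimeError: view needs contiguous memory
print(t.contiguous().view(12).shape)    # torch.Size([12])
print(t.reshape(12).shape)              # reshape copies for you when needed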
Links:
Link to Colab Notebook: https://colab.research.google.com/dri...
You can follow me on Twitter: @nickcdryan
Check out the membership site for a full course version of the series (coming soon) and lots of other NLP content and code! https://www.chrismccormick.ai/membership
Chapters:
00:00 intro
01:01 building multihead attention overview
04:08 what are query, key, and value vectors?
06:47 creating attention scores
08:33 why do we scale the attention scores?
13:50 attention scores
17:30 causal masking and softmax
24:16 what do we do with attention scores?
26:32 applying values to our attention scores
27:57 story about what happens in self-attention
31:58 why is the transformer so successful?
33:54 ...because it works?
37:21 architecture doesn't matter in the end
39:16 multibatch multihead attention setup
41:45 multihead queries and keys
45:04 reshape vs transpose
46:52 multihead attention
52:01 multihead attention class
54:28 GPT mini!
56:49 matrix multiplication operators
58:30 broadcasting
1:05:35 batch multiplication examples
1:10:33 briefly: operation contiguity, copying, underlying behavior
1:13:49 multihead attention toy example