It seems that the output of this block is simply reshaped from the multiple heads. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the heads are concatenated. May I ask whether this is intentional or an error? Thank you
I had a similar question. It turns out that this is equivalent to the Transformer paper's formulation, but it's a bit tricky to see why.
The paper does the following:
- for each of the n_head heads, it takes the full embedding vector and applies three separate linear projections that reduce its size, one each for q, k, and v
- that yields the q, k, v for each head
So you can think of the paper as doing 3 * n_head linear projections (see the sketch below).
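Here is a minimal sketch of that per-head view. The sizes (n_embd = 768, n_head = 12), the random input x, and the names w_q / w_k / w_v are illustrative assumptions, not taken from the repo:

```python
import torch
import torch.nn as nn

n_embd, n_head = 768, 12          # illustrative sizes
head_size = n_embd // n_head      # 64

# the paper's view: 3 * n_head separate projections, each mapping n_embd -> head_size
w_q = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))
w_k = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))
w_v = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))

x = torch.randn(1, 16, n_embd)    # (batch, time, n_embd)
# each head gets its own small q, k, v of shape (1, 16, head_size)
qkv_per_head = [(w_q[i](x), w_k[i](x), w_v[i](x)) for i in range(n_head)]
```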
This repo instead does both things at once, all via c_attn:
- it calculates q, k, v in a single matmul
- and it does so for every head at the same time, slicing the result into heads afterwards (see the sketch below)
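A minimal sketch of that fused approach, using the same illustrative sizes as above (the c_attn name mirrors the repo, but the shapes and input are assumptions):

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 768, 12, 16   # illustrative sizes
head_size = n_embd // n_head

c_attn = nn.Linear(n_embd, 3 * n_embd)    # one matmul produces q, k, v for all heads

x = torch.randn(1, T, n_embd)             # (batch, time, n_embd)
q, k, v = c_attn(x).split(n_embd, dim=2)  # each (1, T, n_embd)

# "per head" is then just a reshape of that one big projection
q = q.view(1, T, n_head, head_size).transpose(1, 2)  # (1, n_head, T, head_size)
k = k.view(1, T, n_head, head_size).transpose(1, 2)
v = v.view(1, T, n_head, head_size).transpose(1, 2)
```

Because each head's slice of the c_attn output only ever sees its own rows of the weight matrix, this is numerically the same family of functions as n_head separate small projections.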
The paper, by contrast, does not slice up the embedding: each head has its own linear layer that maps the full embedding down to a smaller size, runs attention on those per-head projections, then the paper concatenates everything and reduces the dimension back down via W^O (see the sketch below).
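A sketch of that concat-and-project step, in the paper's notation. The tensor y stands in for the stacked per-head attention output, and the w_o name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 768, 12, 16   # illustrative sizes
head_size = n_embd // n_head

w_o = nn.Linear(n_head * head_size, n_embd)  # W^O from the paper

# suppose y is the per-head attention output: (batch, n_head, T, head_size)
y = torch.randn(1, n_head, T, head_size)

# "concat" in the paper is just undoing the head split...
y = y.transpose(1, 2).contiguous().view(1, T, n_head * head_size)

# ...and W^O is the final linear projection back to n_embd
out = w_o(y)                      # (1, T, n_embd)
```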