Output of CausalSelfAttention #118

Open
whchan05 opened this issue Jul 6, 2023 · 1 comment

whchan05 commented Jul 6, 2023

It seems that the output of this block is simply the multiple heads reshaped and concatenated back together. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the concatenation. May I ask whether omitting it is intentional or an error? Thank you


theicfire commented May 6, 2024

This does have W^O; it's here:

self.c_proj = nn.Linear(config.n_embd, config.n_embd)
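
In the forward pass, c_proj is applied after the per-head outputs are concatenated back together, which is exactly the role of W^O. Here is a minimal, self-contained sketch of that final step (hypothetical shapes and names, not the repo's exact code):

```python
import torch
import torch.nn as nn

B, T, n_head, head_dim = 2, 8, 12, 64
n_embd = n_head * head_dim

c_proj = nn.Linear(n_embd, n_embd)       # this plays the role of W^O
y = torch.randn(B, n_head, T, head_dim)  # stand-in for the per-head attention outputs

y = y.transpose(1, 2).contiguous().view(B, T, n_embd)  # concatenate the heads -> (B, T, n_embd)
y = c_proj(y)                                           # output projection, i.e. W^O
```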

I had a similar question, though. It turns out that this is equivalent to the Transformer paper's formulation, but it's a bit tricky to see why.

The paper does the following:

  • for each of the n_head heads, it applies a linear projection to the embedding vector that reduces its size (and it does this 3 times per head, once each for q, k, v)
  • that results in the q, k, v for each head

You can think of the paper as doing 3 * n_head linear projections.
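
To make that concrete, here is a minimal sketch of the paper's formulation with separate per-head projection matrices (hypothetical names and sizes, causal masking omitted):

```python
import torch
import torch.nn as nn

n_embd, n_head = 768, 12
head_dim = n_embd // n_head  # 64

# one (W_q, W_k, W_v) triple per head, each mapping n_embd -> head_dim
w_q = [nn.Linear(n_embd, head_dim, bias=False) for _ in range(n_head)]
w_k = [nn.Linear(n_embd, head_dim, bias=False) for _ in range(n_head)]
w_v = [nn.Linear(n_embd, head_dim, bias=False) for _ in range(n_head)]

x = torch.randn(1, 10, n_embd)  # (batch, seq, n_embd)
heads = []
for h in range(n_head):
    q, k, v = w_q[h](x), w_k[h](x), w_v[h](x)          # each (1, 10, head_dim)
    att = (q @ k.transpose(-2, -1)) / head_dim ** 0.5  # (1, 10, 10) attention scores
    heads.append(att.softmax(dim=-1) @ v)              # (1, 10, head_dim)

w_o = nn.Linear(n_embd, n_embd, bias=False)  # W^O
out = w_o(torch.cat(heads, dim=-1))          # concat heads -> (1, 10, n_embd), then project
```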

This repo instead does both of those things at once, all via c_attn:

  1. it calculates q, k, v in a single matrix multiply
  2. and it does this for every head at once, by slicing the result along the embedding dimension
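
Schematically, that fused step looks something like this (a sketch in the spirit of the repo's code, not an exact copy):

```python
import torch
import torch.nn as nn

B, T, n_embd, n_head = 1, 10, 768, 12
head_dim = n_embd // n_head

c_attn = nn.Linear(n_embd, 3 * n_embd)  # one matmul produces q, k, v for all heads
x = torch.randn(B, T, n_embd)

q, k, v = c_attn(x).split(n_embd, dim=2)            # each (B, T, n_embd)
q = q.view(B, T, n_head, head_dim).transpose(1, 2)  # (B, n_head, T, head_dim)
k = k.view(B, T, n_head, head_dim).transpose(1, 2)
v = v.view(B, T, n_head, head_dim).transpose(1, 2)
# the view/transpose is the "slicing into heads": each head gets its own
# head_dim-sized chunk of the projected embedding
```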

The paper, by contrast, does not slice up the embeddings; instead, each head has its own linear layer that maps the full embedding down to a smaller size, runs attention on those smaller projections, then concatenates all the head outputs and projects the result back to the model dimension via W^O.
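
The reason the two views are equivalent: each head_dim-sized slice of c_attn's output depends only on the corresponding rows of c_attn's weight matrix, so it is itself a small linear map of the full embedding, i.e. one of the paper's per-head projections. A quick numerical check (hypothetical names):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, n_head = 768, 12
head_dim = n_embd // n_head

big = nn.Linear(n_embd, n_embd, bias=False)  # stand-in for, say, the q part of c_attn
x = torch.randn(3, n_embd)

# head 0's slice of the big projection ...
q0_sliced = big(x)[:, :head_dim]

# ... equals a dedicated per-head linear layer built from the matching rows of the weight
small = nn.Linear(n_embd, head_dim, bias=False)
small.weight.data = big.weight.data[:head_dim, :].clone()
q0_per_head = small(x)

print(torch.allclose(q0_sliced, q0_per_head))  # True
```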

Fwiw, you can see Karpathy talking about this part here: https://youtu.be/kCc8FmEb1nY?feature=shared&t=4919

I found this excerpt from the paper clarifying:
[image: excerpt from the "Attention Is All You Need" paper]
