It seems that the output of this block is simply reshaped from the multiple heads. In the original "Attention Is All You Need" paper, there is another linear layer, W^O, applied after the heads are concatenated. May I ask whether this is intentional or an error? Thank you
I had a similar question. It turns out that this is equivalent to the Transformer paper's formulation, but it's a bit tricky to see why.
The paper does the following:
- for each of the n_head heads, it takes the full embedding vector and applies three separate linear projections that reduce its size, one each for q, k, and v
- that yields the q, k, v for each head
So you can think of the paper as doing 3 * n_head linear projections (see the sketch below).
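Here is a minimal sketch of that per-head view. The sizes (n_embd = 768, n_head = 12), the random input x, and the names w_q / w_k / w_v are illustrative assumptions, not taken from the repo:

```python
import torch
import torch.nn as nn

n_embd, n_head = 768, 12          # illustrative sizes
head_size = n_embd // n_head      # 64

# the paper's view: 3 * n_head separate projections, each mapping n_embd -> head_size
w_q = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))
w_k = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))
w_v = nn.ModuleList(nn.Linear(n_embd, head_size, bias=False) for _ in range(n_head))

x = torch.randn(1, 16, n_embd)    # (batch, time, n_embd)
# each head gets its own small q, k, v of shape (1, 16, head_size)
qkv_per_head = [(w_q[i](x), w_k[i](x), w_v[i](x)) for i in range(n_head)]
```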
This repo instead does both things at once, all via c_attn:
- it calculates q, k, v in a single matmul
- and it does so for every head at the same time, slicing the result into heads afterwards (see the sketch below)
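A minimal sketch of that fused approach, using the same illustrative sizes as above (the c_attn name mirrors the repo, but the shapes and input are assumptions):

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 768, 12, 16   # illustrative sizes
head_size = n_embd // n_head

c_attn = nn.Linear(n_embd, 3 * n_embd)    # one matmul produces q, k, v for all heads

x = torch.randn(1, T, n_embd)             # (batch, time, n_embd)
q, k, v = c_attn(x).split(n_embd, dim=2)  # each (1, T, n_embd)

# "per head" is then just a reshape of that one big projection
q = q.view(1, T, n_head, head_size).transpose(1, 2)  # (1, n_head, T, head_size)
k = k.view(1, T, n_head, head_size).transpose(1, 2)
v = v.view(1, T, n_head, head_size).transpose(1, 2)
```

Because each head's slice of the c_attn output only ever sees its own rows of the weight matrix, this is numerically the same family of functions as n_head separate small projections.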
The paper, by contrast, does not slice up the embedding: each head has its own linear layer that maps the full embedding down to a smaller size, runs attention on those per-head projections, then the paper concatenates everything and reduces the dimension back down via W^O (see the sketch below).
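A sketch of that concat-and-project step, in the paper's notation. The tensor y stands in for the stacked per-head attention output, and the w_o name and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

n_embd, n_head, T = 768, 12, 16   # illustrative sizes
head_size = n_embd // n_head

w_o = nn.Linear(n_head * head_size, n_embd)  # W^O from the paper

# suppose y is the per-head attention output: (batch, n_head, T, head_size)
y = torch.randn(1, n_head, T, head_size)

# "concat" in the paper is just undoing the head split...
y = y.transpose(1, 2).contiguous().view(1, T, n_head * head_size)

# ...and W^O is the final linear projection back to n_embd
out = w_o(y)                      # (1, T, n_embd)
```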