- In the book I'm reading, I see the following words on page [...]
-
Thanks for the feedback! If I understand correctly, when removing the output projection layer, the results from using 1 large head (via the [...]) are similar to those from multiple heads. I think this could be because we use a very simple dataset and short training here. In practice, at a larger scale, I don't think this will hold. I.e., I think the AI answer is incorrect: a single head with the same overall dimensionality is not equivalent to multi-head attention, because the single head computes one attention pattern over the full Q/K/V projection, whereas multi-head attention computes a separate attention pattern in each lower-dimensional per-head subspace. Regarding the [...]
-
You are absolutely right. I redid the test with both large and small dimensions and found that even for small dimensions, and even when sharing W_key, W_query, and W_value between the multi-head and single-head classes, the per-head (sub-dimension) operations of multi-head attention produce different attn_scores in the two classes. Even without masking, scaling, or softmax, the final matrix multiplication in multi-head attention multiplies the sub-dimension attn_scores by the sub-dimension slices of the values and then concatenates along the last dimension (d_out), which leads to a fundamentally and meaningfully different output.
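The test above can be sketched as follows. This is a minimal, self-contained version (the shapes, seed, and variable names here are illustrative, not the exact ones from the book's classes): both paths share the same W_query, W_key, and W_value, and the only difference is whether attention is computed once over the full dimension or per head over sub-dimensions.

```python
import torch

torch.manual_seed(123)

# Illustrative sizes (not the book's exact configuration)
b, num_tokens, d_in, d_out, num_heads = 1, 4, 8, 8, 2
head_dim = d_out // num_heads

# Shared projections, used by both the single-head and multi-head paths
W_query = torch.nn.Linear(d_in, d_out, bias=False)
W_key = torch.nn.Linear(d_in, d_out, bias=False)
W_value = torch.nn.Linear(d_in, d_out, bias=False)

x = torch.randn(b, num_tokens, d_in)
q, k, v = W_query(x), W_key(x), W_value(x)

# Single large head: one attention pattern over the full d_out dimension
attn_single = torch.softmax(q @ k.transpose(1, 2) / d_out**0.5, dim=-1)
out_single = attn_single @ v

# Multi-head: split into (b, num_heads, num_tokens, head_dim) and compute
# a separate attention pattern in each head's sub-dimension
def split_heads(t):
    return t.view(b, num_tokens, num_heads, head_dim).transpose(1, 2)

qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
attn_multi = torch.softmax(qh @ kh.transpose(2, 3) / head_dim**0.5, dim=-1)
# Per-head context vectors, then concatenate along the last dimension (d_out)
out_multi = (attn_multi @ vh).transpose(1, 2).reshape(b, num_tokens, d_out)

# Same weights, same input, but the outputs differ
print(torch.allclose(out_single, out_multi))  # prints False
```

Even with identical weights, the per-head attention scores come from dot products over head_dim-sized slices rather than the full d_out dimension, so the attention patterns (and hence the outputs) cannot coincide in general.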