-
I find it hard to understand how the gated delta rule is exactly computed. Specifically, where can I find the exact computation or formula for the alpha parameter in the gated delta rule? Is this described in any related paper? (See LLMs-from-scratch/ch04/08_deltanet/README.md, lines 187 to 192, at 488bef7.)
alpha seems to be called … Additionally, for Qwen3-Next it seems that alpha and beta have activation functions applied – are these different from what is shown in the explanatory image?
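For context, the recurrence in question can be sketched per token. This is a toy, self-contained version (scalar per-head alpha/beta and random inputs are illustrative stand-ins, not the actual model code): the state is first decayed by alpha, then the delta-rule erase/write step is applied with strength beta.

```python
import torch

torch.manual_seed(0)
d_k, d_v = 8, 8
S = torch.zeros(d_v, d_k)  # the state ("fast weight") matrix

for t in range(16):
    k = torch.randn(d_k)
    k = k / k.norm()                      # keys are typically L2-normalized
    v = torch.randn(d_v)
    beta = torch.rand(()).item()          # writing strength, in (0, 1) (sigmoid-activated in practice)
    alpha = torch.rand(()).item()         # gated decay factor, in (0, 1)

    # gated delta rule: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    S = alpha * S                         # forget part of the old state via alpha
    S = S - beta * torch.outer(S @ k, k)  # erase the old association along direction k
    S = S + beta * torch.outer(v, k)      # write the new association
```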
In the LitGPT code, I think they called it `gk` for "gate for step k" (whereas it is "alpha for step t" in the paper).
But if you consider `gk.float().exp()` later, I think that corresponds to the paper's alpha.
In my code I am calling it alpha:

```python
alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
```

But this is more of a pre-alpha. The real alpha comes later in

```python
S = S * a_t.exp()
```

Maybe to make this clear, I could rename it as follows?

```python
alpha_log = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
alpha = alpha_log.exp()
```
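To see why the rename fits, here is a minimal runnable sketch with toy stand-ins for the module's `A_log`, `W_alpha(x)`, and `dt_bias` (shapes and values are hypothetical): `alpha_log` is always negative because A and the softplus output are both positive, so exponentiating it yields a decay factor strictly inside (0, 1).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, seq_len, num_heads = 2, 5, 4
A_log = torch.zeros(num_heads)               # stand-in for self.A_log
dt_bias = torch.zeros(num_heads)             # stand-in for self.dt_bias
x_proj = torch.randn(b, seq_len, num_heads)  # stand-in for self.W_alpha(x)

# log of the decay factor: always negative, since A > 0 and softplus(...) > 0
alpha_log = -A_log.exp().view(1, 1, -1) * F.softplus(x_proj + dt_bias)
alpha = alpha_log.exp()                      # the actual decay factor, in (0, 1)
```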
-
You might like the original Gated DeltaNet code here (based on the LitGPT library I helped develop a few years ago): https://github.com/NVlabs/GatedDeltaNet/blob/main/lit_gpt/gated_delta_net.py
-
According to the "Improving Mamba2 with Delta Rule" paper, alpha is a value between 0 and 1 that varies with … I might be wrong, but to me it seems that `-self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)` is the log of alpha.
-
Thanks for the awesome feedback and discussion, you two. I'll think about how to best update the figure (in the next few days) to make it more informative. (I'll also update the …)
-
Once you update the figure, there’s one small thing in the text to fix as well:
This should be …
-
I might have found another interesting detail:
Also, as a side note, Kimi Linear uses a 2:1 ratio for the last transformer blocks, which I found out through the config:
-
Do you specifically mean the "which also uses a latent representation of the query" aspect? I had mentioned that in my "The Big Architecture Comparison" article:
I think it was only during training and not during inference, but I have to go back and double-check the code.
Interesting, I haven't checked, but off the top of my head: do you think it's because it otherwise doesn't work with their number of blocks (i.e., it not being divisible by 4)?
-
Yes, you are right. It's explained in Figure 3 of the DeepSeek-V2 paper: "(...) Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference."
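That KV-cache saving can be sketched in a few lines. The toy dimensions and weight names below are mine (not DeepSeek's), and the sketch ignores the decoupled RoPE part: only the shared latent vector `c` is cached per token, while keys and values are reconstructed from it on demand.

```python
import torch

torch.manual_seed(0)
d_model, d_latent, d_head = 64, 8, 16
W_down = torch.randn(d_model, d_latent) / d_model ** 0.5  # joint K/V down-projection
W_uk = torch.randn(d_latent, d_head) / d_latent ** 0.5    # up-projection to keys
W_uv = torch.randn(d_latent, d_head) / d_latent ** 0.5    # up-projection to values

x = torch.randn(1, d_model)   # one token's hidden state
c = x @ W_down                # latent vector: the only thing cached per token
k = c @ W_uk                  # keys/values are rebuilt from c when needed,
v = c @ W_uv                  # so the cache holds d_latent floats instead of 2 * d_head
```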
Yeah, the odd number of transformer blocks was what made me wonder, so I checked the config. Just an interesting hidden detail I wanted to share. 🙂
-
From my limited experience of having re-implemented Qwen3-Next and my understanding, you are right.

Edit: beta is indeed activated by a sigmoid, which isn't shown in the Qwen image; alpha can be considered "activated" too, since it's also squashed, just not by a traditional sigmoid.

The image you linked from Songlin is just DeltaNet.

For the alpha formula, it's derived from eq. 4 of this paper: https://arxiv.org/abs/2312.00752. It's basically what Sebastian wrote above.

```python
def compute_alpha_factor(log_A, a, dt_bias):
    """
    Calculates the state decay factor alpha following the Qwen3-Next/SSM-style formula.

    Alpha is the exponential decay factor applied to the previous state memory in the
    Gated Delta Rule. It controls how much of the previous state memory we keep or forget.

        alpha = e^(-A * Δt)   (can be seen as e^(-Rate * Time))

    where A > 0 and Δt > 0:
    - A is learned as log_A and then exponentiated (e^log_A) to ensure positivity.
    - Δt is passed through a softplus to ensure positivity.
    Both positivity constraints ensure that alpha, via e^(-A * Δt), is always in (0, 1)
    as a final decay factor.

    Δt is the result of the affine function Wx + dt_bias, with "a" as Wx (this makes Δt,
    and thus the decay, dynamic per token). Δt represents how much duration to apply
    the decay (time step).

    Args:
        log_A: (num_v_heads,) base (log) decay rate per value head (constant per head)
        a: (b, seq_len, num_v_heads) the token-to-num_v_heads projections (dynamic per token)
        dt_bias: (num_v_heads,) learnable bias for time step Δt

    Returns:
        alpha: (b, seq_len, num_v_heads) final decay factor per token, range (0, 1)
    """
    A = torch.exp(log_A)                                 # retrieves positive A from the learned logarithm
    delta_t = torch.nn.functional.softplus(a + dt_bias)  # Δt
    alpha = torch.exp(-A * delta_t)                      # e^(-Rate * Time)
    return alpha
```
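A quick sanity check of that formula with the shapes from `compute_alpha_factor`'s docstring (toy values; the three lines mirror the function body). Since both A and Δt come out positive, alpha necessarily lands in (0, 1):

```python
import torch

torch.manual_seed(0)
b, seq_len, num_v_heads = 2, 6, 4
log_A = torch.zeros(num_v_heads)          # e^0 = 1, so A = 1 per head
a = torch.randn(b, seq_len, num_v_heads)  # per-token projections
dt_bias = torch.zeros(num_v_heads)

A = torch.exp(log_A)                                 # positive rate
delta_t = torch.nn.functional.softplus(a + dt_bias)  # positive time step
alpha = torch.exp(-A * delta_t)                      # decay factor, strictly in (0, 1)
```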
-
Yeah, the back and forth between log and exp really confused me, plus the fact that the decay factor is referred to as alpha when the code doesn't have any comment or variable named that.
-
I didn't have a look at Kimi Linear, but now I get why you were mentioning the activation from Sebastian's picture.
-
Exactly! 😄
-
I think that was probably just for stability, like log softmax.
Yes, it would be preferable to use the terminology described in the paper; otherwise it's just unnecessarily confusing, especially since the term "gate" is already used for other things.
