-
I find it hard to understand how the gated delta rule is exactly computed. Specifically, where can I find the exact computation or formula for the alpha parameter in the gated delta rule? Is this described in any related paper? (See LLMs-from-scratch/ch04/08_deltanet/README.md, lines 187 to 192, at 488bef7.)
alpha seems to be called … Additionally, for Qwen3-Next it seems that alpha and beta have activation functions applied – are these different from what is shown in the explanatory image?
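For context, the recurrence in question can be sketched per token. This is a toy, self-contained version (scalar per-head alpha/beta and random inputs are illustrative stand-ins, not the actual model code): the state is first decayed by alpha, then the delta-rule erase/write step is applied with strength beta.

```python
import torch

torch.manual_seed(0)
d_k, d_v = 8, 8
S = torch.zeros(d_v, d_k)  # the state ("fast weight") matrix

for t in range(16):
    k = torch.randn(d_k)
    k = k / k.norm()                      # keys are typically L2-normalized
    v = torch.randn(d_v)
    beta = torch.rand(()).item()          # writing strength, in (0, 1) (sigmoid-activated in practice)
    alpha = torch.rand(()).item()         # gated decay factor, in (0, 1)

    # gated delta rule: S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    S = alpha * S                         # forget part of the old state via alpha
    S = S - beta * torch.outer(S @ k, k)  # erase the old association along direction k
    S = S + beta * torch.outer(v, k)      # write the new association
```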
In the LitGPT code, I think they called it `gk` for "gate for step k" (whereas it is "alpha for step t" in the paper).
But if you consider `gk.float().exp()` later, I think that corresponds to the paper's alpha.
In my code I am calling it alpha:

```python
alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
```

But this is more of a pre-alpha. The real alpha comes later in

```python
S = S * a_t.exp()
```

Maybe to make this clear, I could rename it as follows?

```python
alpha_log = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
alpha = alpha_log.exp()
```
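To see why the rename fits, here is a minimal runnable sketch with toy stand-ins for the module's `A_log`, `W_alpha(x)`, and `dt_bias` (shapes and values are hypothetical): `alpha_log` is always negative because A and the softplus output are both positive, so exponentiating it yields a decay factor strictly inside (0, 1).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, seq_len, num_heads = 2, 5, 4
A_log = torch.zeros(num_heads)               # stand-in for self.A_log
dt_bias = torch.zeros(num_heads)             # stand-in for self.dt_bias
x_proj = torch.randn(b, seq_len, num_heads)  # stand-in for self.W_alpha(x)

# log of the decay factor: always negative, since A > 0 and softplus(...) > 0
alpha_log = -A_log.exp().view(1, 1, -1) * F.softplus(x_proj + dt_bias)
alpha = alpha_log.exp()                      # the actual decay factor, in (0, 1)
```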
-
You might like the original Gated DeltaNet code here (based on the LitGPT library I helped develop a few years ago): https://github.com/NVlabs/GatedDeltaNet/blob/main/lit_gpt/gated_delta_net.py
-
According to the "Improving Mamba2 with Delta Rule" paper, alpha is a value between 0 and 1 that varies with … I might be wrong, but to me it seems that `-self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)` is the log of alpha.
-
Thanks for the awesome feedback and discussion, you two. I'll think about how to best update the figure (in the next few days) to make it more informative. (I'll also update the …)
-
Once you update the figure, there’s one small thing in the text to fix as well:
This should be …
-
I might have found another interesting detail:
Also, as a side note, Kimi Linear uses a 2:1 ratio for the last transformer blocks, which I found out through the config:
-
Do you specifically mean the "which also uses a latent representation of the query" aspect? I had mentioned that in my "The Big Architecture Comparison" article:
I think it was only during training and not during inference, but I have to go back and double-check the code.
Interesting, I haven't checked, but off the top of my head: do you think it's because it otherwise doesn't work with their number of blocks (i.e., it not being divisible by 4)?
-
Yes, you are right. It's explained in Figure 3 of the DeepSeek-V2 paper: "(...) Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference."
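That KV-cache saving can be sketched in a few lines. The toy dimensions and weight names below are mine (not DeepSeek's), and the sketch ignores the decoupled RoPE part: only the shared latent vector `c` is cached per token, while keys and values are reconstructed from it on demand.

```python
import torch

torch.manual_seed(0)
d_model, d_latent, d_head = 64, 8, 16
W_down = torch.randn(d_model, d_latent) / d_model ** 0.5  # joint K/V down-projection
W_uk = torch.randn(d_latent, d_head) / d_latent ** 0.5    # up-projection to keys
W_uv = torch.randn(d_latent, d_head) / d_latent ** 0.5    # up-projection to values

x = torch.randn(1, d_model)   # one token's hidden state
c = x @ W_down                # latent vector: the only thing cached per token
k = c @ W_uk                  # keys/values are rebuilt from c when needed,
v = c @ W_uv                  # so the cache holds d_latent floats instead of 2 * d_head
```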
Yeah, the odd number of transformer blocks was what made me wonder, so I checked the config. Just an interesting hidden detail I wanted to share. 🙂
-
From my limited experience of having re-implemented Qwen3-Next and my understanding, you are right.

Edit: beta is indeed activated by a sigmoid, which isn't shown in the Qwen image; alpha can be considered "activated" too, since it's also squashed, just not by a traditional sigmoid.

The image you linked from Songlin is just DeltaNet.

For the alpha formula, it's derived from eq. 4 of this paper: https://arxiv.org/abs/2312.00752. It's basically what Sebastian wrote above.

```python
def compute_alpha_factor(log_A, a, dt_bias):
    """
    Calculates the state decay factor alpha following the Qwen3-Next/SSM-style formula.

    Alpha is the exponential decay factor applied to the previous state memory in the
    Gated Delta Rule. It controls how much of the previous state memory we keep or forget.

        alpha = e^(-A * Δt)   (can be seen as e^(-Rate * Time))

    where A > 0 and Δt > 0:
    - A is learned as log_A and then exponentiated (e^log_A) to ensure positivity.
    - Δt is passed through a softplus to ensure positivity.
    Both positivity constraints ensure that alpha, via e^(-A * Δt), is always in (0, 1)
    as a final decay factor.

    Δt is the result of the affine function Wx + dt_bias, with "a" as Wx (this makes Δt,
    and thus the decay, dynamic per token). Δt represents how much duration to apply
    the decay (time step).

    Args:
        log_A: (num_v_heads,) base (log) decay rate per value head (constant per head)
        a: (b, seq_len, num_v_heads) the token-to-num_v_heads projections (dynamic per token)
        dt_bias: (num_v_heads,) learnable bias for time step Δt

    Returns:
        alpha: (b, seq_len, num_v_heads) final decay factor per token, range (0, 1)
    """
    A = torch.exp(log_A)                                 # retrieves positive A from the learned logarithm
    delta_t = torch.nn.functional.softplus(a + dt_bias)  # Δt
    alpha = torch.exp(-A * delta_t)                      # e^(-Rate * Time)
    return alpha
```
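A quick sanity check of that formula with the shapes from `compute_alpha_factor`'s docstring (toy values; the three lines mirror the function body). Since both A and Δt come out positive, alpha necessarily lands in (0, 1):

```python
import torch

torch.manual_seed(0)
b, seq_len, num_v_heads = 2, 6, 4
log_A = torch.zeros(num_v_heads)          # e^0 = 1, so A = 1 per head
a = torch.randn(b, seq_len, num_v_heads)  # per-token projections
dt_bias = torch.zeros(num_v_heads)

A = torch.exp(log_A)                                 # positive rate
delta_t = torch.nn.functional.softplus(a + dt_bias)  # positive time step
alpha = torch.exp(-A * delta_t)                      # decay factor, strictly in (0, 1)
```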
-
Yeah, the back and forth between log and exp really confused me, plus the fact that the decay factor is referred to as alpha when the code doesn't have any comment or variable named that.
-
I didn't have a look at Kimi Linear, but now I get why you were mentioning the activation from Sebastian's picture.
-
Exactly! 😄
-
I think that was probably just for stability, like log softmax.
Yes, it would be preferable to use the terminology described in the paper; otherwise it's just unnecessarily confusing, especially since the term "gate" is already used for other things.
