Qwen3-Next: Computation of Gated Delta Rule #902

Answered by rasbt
d-kleine asked this question in Q&A

I find it hard to understand how the gated delta rule is exactly computed. Specifically, where can I find the exact computation or formula for the alpha parameter in the gated delta rule? Is this described in any related paper?

### NEW: Compute delta rule gates
beta = torch.sigmoid(self.W_beta(x))
alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(
    self.W_alpha(x) + self.dt_bias
)
gate = self.W_gate(x)

alpha seems to be called `g` in the HuggingFace implementation of the Qwen3-Next gated delta rule:
https://github.com/huggingface/transformers/blob/dd4e048e75d61512a92faba59d7651aad1ce9519/src/transformers/models/qwen3_next/modular_qwen3_next.py#L594-L596

Additionally, for Qwen3-Next it seems that alpha and beta have activation functions applied – are these different from what is shown in the explanatory image?

Qwen3-Next versus Kimi Linear

Answered by rasbt Nov 4, 2025

In the LitGPT code, I think they called it `gk` for "gate for step k" (whereas it is "alpha for step t" in the paper).

But if you consider `gk.float().exp()` later, I think that corresponds to the paper's $\alpha_t$.

In my code I am calling it alpha:

alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)

But this is more of a pre-alpha. The real alpha comes later in

S = S * a_t.exp()

Maybe to make this clear, I could rename it as follows?

alpha_log = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
alpha = alpha_log.exp()
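To see why the rename helps, here is a self-contained toy sketch (hypothetical sizes and random weights; `W_alpha`, `A_log`, and `dt_bias` are stand-ins for the module's parameters, not the actual ones): since both `A_log.exp()` and the softplus output are positive, `alpha_log` is always negative, so `alpha = alpha_log.exp()` lands in (0, 1) as the paper's decay factor requires.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, t, n_heads, d = 2, 4, 3, 8            # hypothetical batch, seq len, heads, dim

x = torch.randn(b, t, d)
W_alpha = torch.nn.Linear(d, n_heads)    # stand-in for self.W_alpha
A_log = torch.zeros(n_heads)             # so A = A_log.exp() = 1 per head
dt_bias = torch.zeros(n_heads)           # stand-in for self.dt_bias

# the two-step version suggested above
alpha_log = -A_log.exp().view(1, 1, -1) * F.softplus(W_alpha(x) + dt_bias)
alpha = alpha_log.exp()

# softplus(.) > 0 and A > 0, so alpha_log < 0 and alpha is in (0, 1)
assert (alpha_log < 0).all()
assert ((alpha > 0) & (alpha < 1)).all()
```

The log-domain quantity is what gets mixed with the rest of the recurrence; exponentiating it at the end is what yields the actual per-token decay factor.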

Replies: 3 comments 21 replies

Comment options

rasbt
Nov 4, 2025
Maintainer

You might like the original Gated DeltaNet code here (based on the LitGPT library I helped develop a few years ago): https://github.com/NVlabs/GatedDeltaNet/blob/main/lit_gpt/gated_delta_net.py

1 reply
@d-kleine

According to the "Improving Mamba2 with Delta Rule" paper, it says that alpha is a value between 0 and 1 that varies with $t$.

I might be wrong, but for me it seems that

-self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)

is the `gate` (corresponding to `g` in the HF implementation)

rasbt
Nov 4, 2025
Maintainer
16 replies
@rasbt

rasbt Nov 6, 2025
Maintainer

Thanks for the awesome feedback and discussion, you two. I'll think about how to best update the figure (in the next few days) to make it more informative. (I'll also update the `A_log` init, good catch @d-kleine)

@d-kleine

Once you update the figure, there’s one small thing in the text to fix as well:

- β (`alpha`) regulates how much the current token at time step *t* updates the memory.

This should be `beta`.

@d-kleine

I might have found another interesting detail:

  • The MLA in Kimi Linear seems to be based on DeepSeek-V3, which also uses a latent representation of the query, called $c_t$.

Also, as a side note, Kimi Linear uses a 2:1 ratio for the last transformer blocks, which I found out through the config:
https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/blob/main/config.json

@rasbt

@d-kleine

The MLA in Kimi Linear seems to be based on DeepSeek-V3, which also uses a latent representation of the query, called

Do you specifically mean the "which also uses a latent representation of query" aspect? I had mentioned that in my "The Big Architecture Comparison Article":

(As a side note, the queries are also compressed, but only during training, not inference.)

I think it was only during training and not during inference, but I have to go back and double check the code.

Also, as a side note, Kimi Linear uses a 2:1 ratio for the last transformer blocks,

Interesting, I haven't checked, but off the top of my head, do you think it's because it otherwise doesn't work with their number of blocks (i.e., it not being divisible by 4)?

@d-kleine

I think it was only during training and not during inference, but I have to go back and double check the code.

Yes, you are right. It's explained in Figure 3 in the DeepSeek v2 paper:
https://arxiv.org/abs/2405.04434

-> "(...) Through jointly compressing the keys and values into a latent vector, MLA significantly reduces the KV cache during inference."

Interesting, I haven't checked, but off the top of my head, do you think it's because it otherwise doesn't work with their number of blocks (i.e., it not being divisible by 4)?

Yeah, the odd number of transformer blocks was what puzzled me, so I checked the config. Just an interesting hidden detail I wanted to share. 🙂

Answer selected by d-kleine

From my limited experience of having re-implemented Qwen3-Next, and from my understanding, you are right: `g` in the HF implementation is `alpha` in Sebastian's implementation and `gk` in the original code, but it's not the `gate` (if by that you meant the gated SiLU connection).

Edit: beta is indeed activated by a sigmoid, which isn't shown in the Qwen image; alpha can be considered "activated" since it's squashed too, but not by a traditional sigmoid.

The image you linked from Songlin is just DeltaNet

For the alpha formula, it's derived from eq. 4 of this paper: https://arxiv.org/abs/2312.00752.
I was also confused by the goal; in my case I made it a verbose function so I could remember why I was doing that later on.

It's basically what Sebastian wrote above.

def compute_alpha_factor(log_A, a, dt_bias):
    """
    Calculates the state decay factor alpha following the Qwen3-Next/SSM-style formula.

    Alpha is the exponential decay factor applied to the previous state memory in the
    Gated Delta Rule. This controls how much of the previous state memory we keep or forget.

    alpha = e^(-A * Δt) (can be seen as e^(-Rate * Time)) where A > 0 and Δt > 0:
    - A is learned as log_A and then exponentiated (e^log_A) to ensure positivity.
    - Δt is passed through a softplus to ensure positivity.
    Both positivity constraints ensure that alpha, via e^, is always in (0, 1) as a
    final decay factor.

    Δt is the result of the affine function Wx + dt_bias, with "a" as Wx (this makes
    Δt dynamic per token, and thus the decay). Δt represents how much duration to
    apply the decay (time step).

    Args:
        log_A: (num_v_heads,) the base (log) decay rate per value head (constant per head)
        a: (b, seq_len, num_v_heads) the token-to-num_v_heads projections (dynamic per token)
        dt_bias: (num_v_heads,) learnable bias for time step Δt

    Returns:
        alpha: (b, seq_len, num_v_heads) final decay factor per token, range (0, 1)
    """
    A = torch.exp(log_A)  # retrieves positive A from the learned logarithm
    delta_t = torch.nn.functional.softplus(a + dt_bias)  # Δt
    alpha = torch.exp(-A * delta_t)  # e^(-Rate * Time)
    return alpha
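As a quick, self-contained sanity check of that formula (hypothetical toy values, with `A = 1` and zero bias): a larger pre-activation `a` gives a larger Δt and hence stronger decay, pushing alpha toward 0, while a very negative `a` keeps alpha near 1.

```python
import torch
import torch.nn.functional as F

log_A = torch.zeros(2)                 # A = exp(0) = 1 per head (hypothetical)
dt_bias = torch.zeros(2)               # no bias for the toy check
a_small = torch.full((1, 1, 2), -5.0)  # tiny Δt -> alpha near 1 (keep memory)
a_large = torch.full((1, 1, 2), 5.0)   # large Δt -> alpha near 0 (forget)

def alpha_of(a):
    # same computation: alpha = exp(-A * softplus(a + dt_bias))
    return torch.exp(-torch.exp(log_A) * F.softplus(a + dt_bias))

assert (alpha_of(a_small) > 0.99).all()  # softplus(-5) ~ 0.0067 -> alpha ~ 0.993
assert (alpha_of(a_large) < 0.01).all()  # softplus(5) ~ 5.0067 -> alpha ~ 0.0067
```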
4 replies
@d-kleine

Yeah, the back and forth between log and exp really confused me, plus the fact that the decay factor is referred to as alpha when the code doesn't have any comment or variable named that.

@casinca

I didn't have a look at Kimi Linear, but now I get why you were mentioning the activation from Sebastian's picture.
They indeed seem to squash into a (0, 1) range via a classic sigmoid, unlike what Qwen is doing, which is more sophisticated and inspired by Mamba.
It's true that alpha is also called the "gating term" with regard to GDN, which adds confusion with the "gate" from the SiLU gate.
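A toy comparison of the two squashing styles mentioned above (hypothetical scalar inputs; `A = 1`, zero bias): a plain sigmoid versus the Mamba-style `exp(-A * softplus(z))`. Both land in (0, 1); in fact, with `A = 1` and no bias, `exp(-softplus(z))` equals `sigmoid(-z)` exactly, so the real expressiveness difference comes from the learned `A` and `dt_bias`.

```python
import torch
import torch.nn.functional as F

z = torch.linspace(-6, 6, steps=7)

sigmoid_gate = torch.sigmoid(z)                # classic sigmoid squash
mamba_style = torch.exp(-1.0 * F.softplus(z))  # Mamba-style decay with A = 1 (hypothetical)

# both map onto (0, 1)...
assert ((sigmoid_gate > 0) & (sigmoid_gate < 1)).all()
assert ((mamba_style > 0) & (mamba_style < 1)).all()
# ...but the sigmoid is increasing in z while the decay is decreasing in z
assert (sigmoid_gate.diff() > 0).all()
assert (mamba_style.diff() < 0).all()
# exp(-softplus(z)) = 1 / (1 + e^z) = sigmoid(-z)
assert torch.allclose(mamba_style, torch.sigmoid(-z))
```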

@d-kleine

It's true that alpha is also called the "gating term" with regard to GDN, which adds confusion with the "gate" from the SiLU gate.

Exactly! 😄

@rasbt

rasbt Nov 5, 2025
Maintainer

Yeah, the back and forth between log and exp really confused me,

I think that was probably just for stability, like log softmax.
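A toy illustration of that stability point (hypothetical numbers, not from the actual code): multiplying many per-step decay factors directly underflows in float32, while accumulating their logs stays finite and exact.

```python
import torch

alpha_log = torch.full((10000,), -0.5)  # per-step log decay (hypothetical)

direct = alpha_log.exp().prod()  # product of many small alphas underflows to 0.0
in_log = alpha_log.sum()         # log-space accumulation stays finite: -5000.0

assert direct == 0.0             # exp(-5000) is far below float32's smallest value
assert torch.isfinite(in_log)
assert in_log.item() == -5000.0
```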

plus the fact that the decay factor is referred to as alpha when the code doesn't have any comment or variable named that.

Yes, it would be preferable to use the terminology that is described in the paper; otherwise it's just unnecessarily confusing, especially since the term "gate" is already used for other things.
