Commit 488bef7

Image resizing

1 parent c6b8332 commit 488bef7
1 file changed: 4 additions & 4 deletions

ch04/08_deltanet/README.md

@@ -4,7 +4,7 @@ Recently, [Qwen3-Next](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9c
Both Qwen3-Next and Kimi Linear use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there's one block that uses full attention, as shown in the figure below.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear">
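As an aside on the 3:1 schedule mentioned in the context line of the hunk above, the pattern can be sketched as a simple layer-type list. This is a hedged illustration; the helper name `layer_types` and the string labels are hypothetical, not code from the repo:

```python
# Hypothetical sketch of a 3:1 block schedule: every fourth block uses
# full attention, the other three use the linear Gated DeltaNet variant.
def layer_types(n_layers, full_attn_every=4):
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]

print(layer_types(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```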
@@ -125,7 +125,7 @@ The delta rule part refers to computing the difference (delta, Δ) between new a
Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" width=500px>

However, as shown in the figure above, the "gated" in the Gated DeltaNet also refers to several additional gates:
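For context on the gate discussed in the hunk above, here is a minimal sketch of an output gate that applies SiLU instead of a logistic sigmoid. The module and attribute names (`SiLUOutputGate`, `gate_proj`) are assumptions for illustration, not Gated DeltaNet's exact parameterization:

```python
import torch
import torch.nn as nn

class SiLUOutputGate(nn.Module):
    # Hypothetical sketch: gate values are projected from the block input
    # and passed through SiLU (x * sigmoid(x)) rather than a plain sigmoid,
    # then multiplied elementwise with the attention output.
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_out):
        return attn_out * nn.functional.silu(self.gate_proj(x))
```

Because SiLU is unbounded above and smooth near zero, it tends to pass gradients more readily than a sigmoid that saturates at both ends, which matches the stability rationale quoted above.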
@@ -271,7 +271,7 @@ context = context.reshape(b, num_tokens, self.d_out)
-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" style="zoom:67%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" width=500px />

In Gated DeltaNet, there's no *n*-by-*n* attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent update, where `S` is the state that gets updated for each time step *t*.

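To make the recurrent view in the hunk above concrete, the following is a minimal single-step sketch of a gated delta-rule update in one common formulation, S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T. The function name and exact gating parameterization are assumptions, not necessarily the repo's implementation:

```python
import torch

def gated_delta_rule_step(S, k, v, q, alpha, beta):
    """One recurrent time step of a gated delta rule (hedged sketch).

    S:     (d_v, d_k) running state ("memory") from the previous step
    k, q:  (d_k,) key and query vectors for the current token
    v:     (d_v,) value vector for the current token
    alpha: scalar decay gate; beta: scalar write-strength gate
    """
    pred = S @ k                                   # what S currently recalls for k
    S = alpha * (S - beta * torch.outer(pred, k))  # decay and delta-correct
    S = S + beta * torch.outer(v, k)               # write the new association
    return S, S @ q                                # updated state, output for q
```

The state `S` has a fixed size regardless of sequence length, which is why no *n*-by-*n* attention matrix ever materializes.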
@@ -353,4 +353,4 @@ uv run plot_memory_estimates_gated_deltanet.py \
Note that the above computes the `head_dim` as `emb_dim / n_heads`, i.e., 2048 / 16 = 128.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" width=500px>
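As a quick check of the arithmetic in the note above:

```python
emb_dim, n_heads = 2048, 16
head_dim = emb_dim // n_heads  # integer division
print(head_dim)  # 128
```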
