Commit 488bef7

Image resizing

1 parent c6b8332 commit 488bef7
1 file changed: 4 additions & 4 deletions

ch04/08_deltanet/README.md

@@ -4,7 +4,7 @@ Recently, [Qwen3-Next](https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9c
Both Qwen3-Next and Kimi Linear use a 3:1 ratio, meaning for every three transformer blocks employing the linear Gated DeltaNet variant, there's one block that uses full attention, as shown in the figure below.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/01.webp" alt="Qwen3-Next versus Kimi Linear">
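As an aside on the 3:1 schedule mentioned in the context line of the hunk above, the pattern can be sketched as a simple layer-type list. This is a hedged illustration; the helper name `layer_types` and the string labels are hypothetical, not code from the repo:

```python
# Hypothetical sketch of a 3:1 block schedule: every fourth block uses
# full attention, the other three use the linear Gated DeltaNet variant.
def layer_types(n_layers, full_attn_every=4):
    return [
        "full_attention" if (i + 1) % full_attn_every == 0 else "gated_deltanet"
        for i in range(n_layers)
    ]

print(layer_types(8))
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']
```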
@@ -125,7 +125,7 @@ The delta rule part refers to computing the difference (delta, Δ) between new a
Gated DeltaNet has a gate similar to the gate in gated attention discussed earlier, except that it uses a SiLU instead of a logistic sigmoid activation, as illustrated below. (The SiLU choice is likely to improve gradient flow and stability over the standard sigmoid.)

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/02.webp" alt="Gated DeltaNet" width=500px>

However, as shown in the figure above, the "gated" in the Gated DeltaNet also refers to several additional gates:
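For context on the gate discussed in the hunk above, here is a minimal sketch of an output gate that applies SiLU instead of a logistic sigmoid. The module and attribute names (`SiLUOutputGate`, `gate_proj`) are assumptions for illustration, not Gated DeltaNet's exact parameterization:

```python
import torch
import torch.nn as nn

class SiLUOutputGate(nn.Module):
    # Hypothetical sketch: gate values are projected from the block input
    # and passed through SiLU (x * sigmoid(x)) rather than a plain sigmoid,
    # then multiplied elementwise with the attention output.
    def __init__(self, d_model):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model)

    def forward(self, x, attn_out):
        return attn_out * nn.functional.silu(self.gate_proj(x))
```

Because SiLU is unbounded above and smooth near zero, it tends to pass gradients more readily than a sigmoid that saturates at both ends, which matches the stability rationale quoted above.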
@@ -271,7 +271,7 @@ context = context.reshape(b, num_tokens, self.d_out)
-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" style="zoom:67%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/03.webp" alt="Quadratic attention" width=500px />

In Gated DeltaNet, there's no *n*-by-*n* attention matrix. Instead, the model processes tokens one by one. It keeps a running memory (a state) that gets updated as each new token comes in. This is implemented as a recurrent update, where `S` is the state that gets updated for each time step *t*.

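To make the recurrent view in the hunk above concrete, the following is a minimal single-step sketch of a gated delta-rule update in one common formulation, S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T. The function name and exact gating parameterization are assumptions, not necessarily the repo's implementation:

```python
import torch

def gated_delta_rule_step(S, k, v, q, alpha, beta):
    """One recurrent time step of a gated delta rule (hedged sketch).

    S:     (d_v, d_k) running state ("memory") from the previous step
    k, q:  (d_k,) key and query vectors for the current token
    v:     (d_v,) value vector for the current token
    alpha: scalar decay gate; beta: scalar write-strength gate
    """
    pred = S @ k                                   # what S currently recalls for k
    S = alpha * (S - beta * torch.outer(pred, k))  # decay and delta-correct
    S = S + beta * torch.outer(v, k)               # write the new association
    return S, S @ q                                # updated state, output for q
```

The state `S` has a fixed size regardless of sequence length, which is why no *n*-by-*n* attention matrix ever materializes.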
@@ -353,4 +353,4 @@ uv run plot_memory_estimates_gated_deltanet.py \
Note that the above computes the `head_dim` as `emb_dim / n_heads`, i.e., 2048 / 16 = 128.

-<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" style="zoom:47%;" />
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/gated_deltanet/plot.webp" alt="Gated DeltaNet scaling" width=500px>
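As a quick check of the arithmetic in the note above:

```python
emb_dim, n_heads = 2048, 16
head_dim = emb_dim // n_heads  # integer division
print(head_dim)  # 128
```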
