Commit bf039ff

Add alternative attention structure (#880)

1 parent 6eb6adf · commit bf039ff

2 files changed: +17 -3 lines changed

‎README.md‎

Lines changed: 4 additions & 3 deletions

@@ -168,9 +168,10 @@ Several folders contain optional materials as a bonus for interested readers:
 - **Chapter 4: Implementing a GPT model from scratch**
 - [FLOPS Analysis](ch04/02_performance-analysis/flops-analysis.ipynb)
 - [KV Cache](ch04/03_kv-cache)
-- [Grouped-Query Attention](ch04/04_gqa)
-- [Multi-Head Latent Attention](ch04/05_mla)
-- [Sliding Window Attention](ch04/06_swa)
+- [Attention alternatives](ch04/#attention-alternatives)
+- [Grouped-Query Attention](ch04/04_gqa)
+- [Multi-Head Latent Attention](ch04/05_mla)
+- [Sliding Window Attention](ch04/06_swa)
 - **Chapter 5: Pretraining on unlabeled data:**
 - [Alternative Weight Loading Methods](ch05/02_alternative_weight_loading/)
 - [Pretraining GPT on the Project Gutenberg Dataset](ch05/03_bonus_pretraining_on_gutenberg)
‎ch04/README.md‎

Lines changed: 13 additions & 0 deletions

@@ -11,11 +11,24 @@
 - [02_performance-analysis](02_performance-analysis) contains optional code analyzing the performance of the GPT model(s) implemented in the main chapter
 - [03_kv-cache](03_kv-cache) implements a KV cache to speed up the text generation during inference
 - [ch05/07_gpt_to_llama](../ch05/07_gpt_to_llama) contains a step-by-step guide for converting a GPT architecture implementation to Llama 3.2 and loads pretrained weights from Meta AI (it might be interesting to look at alternative architectures after completing chapter 4, but you can also save that for after reading chapter 5)
+
+
+&nbsp;
+## Attention Alternatives
+
+
+
+<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/attention-alternatives/attention-alternatives.webp">
+
+&nbsp;
+
 - [04_gqa](04_gqa) contains an introduction to Grouped-Query Attention (GQA), which is used by most modern LLMs (Llama 4, gpt-oss, Qwen3, Gemma 3, and many more) as alternative to regular Multi-Head Attention (MHA)
 - [05_mla](05_mla) contains an introduction to Multi-Head Latent Attention (MLA), which is used by DeepSeek V3, as alternative to regular Multi-Head Attention (MHA)
 - [06_swa](06_swa) contains an introduction to Sliding Window Attention (SWA), which is used by Gemma 3 and others
 
 
+&nbsp;
+## More
 
 In the video below, I provide a code-along session that covers some of the chapter contents as supplementary material.
 
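The commit's subject, Grouped-Query Attention, shrinks the key/value projections so that several query heads share a single key/value head, which reduces parameters and KV-cache size relative to regular Multi-Head Attention. As a rough illustrative sketch only (not taken from the repository's `ch04/04_gqa` material; the class and parameter names below are made up for the example), a minimal causal GQA layer in PyTorch might look like this:

```python
import torch
import torch.nn as nn


class GroupedQueryAttention(nn.Module):
    """Illustrative sketch of GQA: num_heads query heads share num_kv_groups K/V heads."""

    def __init__(self, d_model, num_heads, num_kv_groups):
        super().__init__()
        assert num_heads % num_kv_groups == 0
        self.num_heads = num_heads            # number of query heads
        self.num_kv_groups = num_kv_groups    # number of shared key/value heads
        self.head_dim = d_model // num_heads

        self.W_q = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        # Keys and values are projected onto fewer heads than the queries
        self.W_k = nn.Linear(d_model, num_kv_groups * self.head_dim, bias=False)
        self.W_v = nn.Linear(d_model, num_kv_groups * self.head_dim, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.W_q(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.num_kv_groups, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.num_kv_groups, self.head_dim).transpose(1, 2)

        # Each key/value head is reused by num_heads // num_kv_groups query heads
        group_size = self.num_heads // self.num_kv_groups
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)

        # Standard scaled dot-product attention with a causal mask
        attn = (q @ k.transpose(-2, -1)) / self.head_dim**0.5
        mask = torch.ones(t, t, device=x.device).triu(diagonal=1).bool()
        attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(b, t, self.num_heads * self.head_dim)
        return self.out_proj(out)


# Example usage with made-up dimensions
x = torch.randn(2, 16, 256)                                  # (batch, tokens, d_model)
gqa = GroupedQueryAttention(d_model=256, num_heads=8, num_kv_groups=2)
print(gqa(x).shape)                                          # torch.Size([2, 16, 256])
```

Setting `num_kv_groups` equal to `num_heads` recovers regular Multi-Head Attention, and `num_kv_groups=1` corresponds to Multi-Query Attention; the bonus folders linked in the diff above cover these trade-offs in detail.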

Comments (0)
