ch04/07_moe/README.md (2 additions, 2 deletions)
@@ -23,7 +23,7 @@ Because only a few experts are active at a time, MoE modules are often referred
For example, DeepSeek-V3 has 256 experts per MoE module and a total of 671 billion parameters. Yet during inference, only 9 experts are active at a time (1 shared expert plus 8 selected by the router). This means just 37 billion parameters are used for each token inference step as opposed to all 671 billion.
-One notable feature of DeepSeek-V3's MoE design is the use of a shared expert. This is an expert that is always active for every token. This idea is not new and was already introduced in the [2022 DeepSeekMoE](https://arxiv.org/abs/2201.05596) and the [2024 DeepSeek MoE](https://arxiv.org/abs/2201.05596) papers.
+One notable feature of DeepSeek-V3's MoE design is the use of a shared expert. This is an expert that is always active for every token. This idea is not new and was already introduced in the [2022 DeepSpeed-MoE](https://arxiv.org/abs/2201.05596) and the [2024 DeepSeek MoE](https://arxiv.org/abs/2401.06066) papers.
@@ -33,7 +33,7 @@ One notable feature of DeepSeek-V3's MoE design is the use of a shared expert. T
-The benefit of having a shared expert was first noted in the [DeepSpeedMoE paper](https://arxiv.org/abs/2201.05596), where they found that it boosts overall modeling performance compared to no shared experts. This is likely because common or repeated patterns don't have to be learned by multiple individual experts, which leaves them with more room for learning more specialized patterns.
+The benefit of having a shared expert was first noted in the [DeepSpeed-MoE paper](https://arxiv.org/abs/2201.05596), where they found that it boosts overall modeling performance compared to no shared experts. This is likely because common or repeated patterns don't have to be learned by multiple individual experts, which leaves them with more room for learning more specialized patterns.
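The routing scheme described in the README text above (1 always-active shared expert plus the top-8 of 256 router-selected experts per token) can be sketched roughly as follows. This is a minimal toy illustration, not DeepSeek-V3's implementation: each "expert" is reduced to a hypothetical scalar gain standing in for a full feed-forward network, and the router is a single random linear layer.

```python
import math
import random

random.seed(0)
d_model, n_experts, top_k = 8, 256, 8  # toy width; DeepSeek-V3 uses 256 routed experts with top-8 selection

# Hypothetical toy "experts": each is just a scalar gain here, standing in for a full FFN.
expert_gain = [random.uniform(0.5, 1.5) for _ in range(n_experts)]
shared_gain = 1.0  # the shared expert is always active for every token

# Toy router: a single linear layer producing one score per routed expert.
router_w = [[random.gauss(0, 1) for _ in range(n_experts)] for _ in range(d_model)]

def moe_forward(x):
    """Route one token through 1 shared expert + the top_k router-selected experts."""
    scores = [sum(xi * router_w[i][e] for i, xi in enumerate(x)) for e in range(n_experts)]
    top = sorted(range(n_experts), key=scores.__getitem__)[-top_k:]  # top-k expert indices
    exp_s = [math.exp(scores[e]) for e in top]
    weights = [s / sum(exp_s) for s in exp_s]  # softmax over the selected experts only
    routed = [sum(w * expert_gain[e] * xi for w, e in zip(weights, top)) for xi in x]
    shared = [shared_gain * xi for xi in x]  # shared expert runs unconditionally
    return [a + b for a, b in zip(shared, routed)], top

x = [random.gauss(0, 1) for _ in range(d_model)]
y, active = moe_forward(x)
print(len(active) + 1)  # experts touched for this token: 8 routed + 1 shared = 9
```

Only the selected experts' parameters participate in each token's forward pass, which is how a 671B-parameter model can use just ~37B parameters per token.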