
Commit 67719f4

[npu] add supplementary content to the npu quick start doc (#6727)

1 parent 0c41385 commit 67719f4

File tree: 2 files changed, +156 -16 lines changed


docs/source/BestPractices/NPU-support.md

Lines changed: 93 additions & 16 deletions
@@ -1,25 +1,48 @@

# NPU Support

We have added Ascend NPU support to ms-swift, so users can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune models, run inference, and deploy on Ascend NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For the base environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
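
If you want to match the pinned versions in the table exactly, it can help to install and then verify them explicitly. This is only a minimal sketch based on the version table above; whether these exact wheels are available depends on your configured pip index.

```shell
# Pin the versions listed in the table above
pip install "torch==2.7.1" "torch-npu==2.7.1"

# Confirm what actually got installed
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"
```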

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (the devices were provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for the support of ModelScope and Swift~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (controls memory usage; training speed will drop somewhat)
pip install deepspeed

# If you need the evaluation functionality, install the following package
pip install evalscope[opencompass]
```

Test whether the environment is installed correctly and whether the NPU can be loaded properly:

```python
from transformers.utils import is_torch_npu_available
import torch
@@ -30,6 +53,7 @@ print(torch.randn(10, device='npu:0'))
```

Check the P2P connections between the NPUs. Here each NPU is connected to every other NPU through 7 HCCS links:

```shell
(valle) root@valle:~/src# npu-smi info -t topo
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
@@ -54,6 +78,7 @@ Legend:
```

Check the NPU status. A detailed explanation of the npu-smi command can be found in the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668):

```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
@@ -89,19 +114,20 @@ Legend:
```

## Fine-tuning

The following describes LoRA fine-tuning; for full-parameter fine-tuning, set `--train_type full` (a minimal sketch follows the table below).

| Model Size | Number of NPUs | deepspeed Type | Max Memory Usage |
| ---------- | -------------- | -------------- | ---------------- |
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |
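
As a rough illustration of the `--train_type full` switch mentioned above, here is a minimal full-parameter sketch on a single NPU. The dataset id and `--output_dir` value are illustrative placeholders, not the commands used to produce the memory numbers in the table.

```shell
# Full-parameter fine-tuning sketch (illustrative only; the dataset and output
# directory are placeholders, not values from this guide)
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --train_type full \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh \
    --output_dir output
```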

### Single-Card Training

@@ -128,6 +154,7 @@ swift sft \

### Data-Parallel Training

We use 4 of the cards for DDP training:

```shell
@@ -150,6 +177,7 @@ swift sft \
### Deepspeed Training

ZeRO2:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28GB
@@ -168,6 +196,7 @@ swift sft \
```

ZeRO3:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
@@ -189,13 +218,15 @@ swift sft \
## Inference

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
@@ -211,18 +242,64 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift infer \

## Deployment

NPUs do not support using vllm for inference/deployment acceleration, but deployment with native PyTorch works.

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge-lora and run inference
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
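
Once the server started by `swift deploy` is up, it can be queried over HTTP. The request below is a hedged sketch: it assumes the OpenAI-compatible `/v1/chat/completions` route and the default port 8000; check your deploy logs (or `GET /v1/models`) for the actual port and served model name.

```shell
# Query the deployed model (port, route, and model name are assumptions)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
      }'
```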

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

------

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
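
The RL recipes above use vllm-ascend as the rollout engine. As a hedged note that is not part of the original guide: vllm-ascend is shipped as a separate plugin package, and its version must match your vllm / torch_npu / CANN stack, so check its compatibility matrix before installing.

```shell
# Install the rollout engine referenced in Table 2
# (pin versions according to the vllm-ascend compatibility matrix)
pip install vllm vllm-ascend
```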

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                                  |
| ------------------------------------- |
| Liger-kernel                          |
| Quantization / QLoRA-related          |
| Megatron-related                      |
| Using sglang as the inference engine  |

docs/source_en/BestPractices/NPU-support.md

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,22 @@

# NPU Support

We have added Ascend NPU support to ms-swift, so you can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For detailed environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
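
Beyond the Python packages, the CANN toolkit has to be visible to the shell before torch_npu can see the devices. The snippet below is a minimal sketch assuming the default CANN install prefix; adjust the path if your toolkit is installed elsewhere.

```shell
# Make the CANN toolkit visible to the current shell
# (assumes the default install prefix /usr/local/Ascend)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Check that torch_npu can see the NPUs
python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"
```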

## Environment Preparation

Experiment Environment: 8 * Ascend 910B3 64G (The device is provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for the support of ModelScope and Swift~)
@@ -17,6 +34,9 @@ pip install ms-swift -U
pip install torch-npu decorator
# If you want to use deepspeed (to control memory usage, training speed might decrease)
pip install deepspeed

# If you need the evaluation functionality, please install the following package
pip install evalscope[opencompass]
```

Check if the test environment is installed correctly and whether the NPU can be loaded properly.
@@ -221,3 +241,46 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                              |
| --------------------------------- |
| Liger-kernel                      |
| Quantization/QLoRA                |
| Megatron-related modules          |
| Using sglang as inference engine  |
