
Commit 67719f4

[npu] add supplementary content to the npu quick start doc (#6727)

1 parent 0c41385 commit 67719f4

File tree: 2 files changed, +156 -16 lines changed


docs/source/BestPractices/NPU-support.md

Lines changed: 93 additions & 16 deletions
@@ -1,25 +1,48 @@

# NPU Support

We have added Ascend NPU support to ms-swift, so users can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune models, run inference, and deploy on Ascend NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For the base environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
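
If you want to match the pinned versions in the table exactly, it can help to install and then verify them explicitly. This is only a minimal sketch based on the version table above; whether these exact wheels are available depends on your configured pip index.

```shell
# Pin the versions listed in the table above
pip install "torch==2.7.1" "torch-npu==2.7.1"

# Confirm what actually got installed
python -c "import torch, torch_npu; print(torch.__version__, torch_npu.__version__)"
```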

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64G (the devices were provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for the support of ModelScope and Swift~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (controls memory usage; training speed will drop somewhat)
pip install deepspeed

# If you need the evaluation functionality, install the following package
pip install evalscope[opencompass]
```

Test whether the environment is installed correctly and whether the NPU can be loaded properly:

```python
from transformers.utils import is_torch_npu_available
import torch
@@ -30,6 +53,7 @@ print(torch.randn(10, device='npu:0'))
```

Check the P2P connections between the NPUs. Here each NPU is connected to every other NPU through 7 HCCS links:

```shell
(valle) root@valle:~/src# npu-smi info -t topo
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
@@ -54,6 +78,7 @@ Legend:
```

Check the NPU status. A detailed explanation of the npu-smi command can be found in the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668):

```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
@@ -89,19 +114,20 @@ Legend:
```

## Fine-tuning

The following describes LoRA fine-tuning; for full-parameter fine-tuning, set `--train_type full` (a minimal sketch follows the table below).

| Model Size | Number of NPUs | deepspeed Type | Max Memory Usage |
| ---------- | -------------- | -------------- | ---------------- |
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |
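
As a rough illustration of the `--train_type full` switch mentioned above, here is a minimal full-parameter sketch on a single NPU. The dataset id and `--output_dir` value are illustrative placeholders, not the commands used to produce the memory numbers in the table.

```shell
# Full-parameter fine-tuning sketch (illustrative only; the dataset and output
# directory are placeholders, not values from this guide)
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2.5-0.5B-Instruct \
    --train_type full \
    --dataset AI-ModelScope/alpaca-gpt4-data-zh \
    --output_dir output
```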

### Single-Card Training

@@ -128,6 +154,7 @@ swift sft \

### Data-Parallel Training

We use 4 of the cards for DDP training:

```shell
@@ -150,6 +177,7 @@ swift sft \
### Deepspeed Training

ZeRO2:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28GB
@@ -168,6 +196,7 @@ swift sft \
```

ZeRO3:

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
@@ -189,13 +218,15 @@ swift sft \
## Inference

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
@@ -211,18 +242,64 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift infer \

## Deployment

NPUs do not support using vllm for inference/deployment acceleration, but deployment with native PyTorch works.

Original model:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA fine-tuning:

```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# merge-lora and run inference
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
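
Once the server started by `swift deploy` is up, it can be queried over HTTP. The request below is a hedged sketch: it assumes the OpenAI-compatible `/v1/chat/completions` route and the default port 8000; check your deploy logs (or `GET /v1/models`) for the actual port and served model name.

```shell
# Query the deployed model (port, route, and model name are assumptions)
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
      }'
```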

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

------

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
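
The RL recipes above use vllm-ascend as the rollout engine. As a hedged note that is not part of the original guide: vllm-ascend is shipped as a separate plugin package, and its version must match your vllm / torch_npu / CANN stack, so check its compatibility matrix before installing.

```shell
# Install the rollout engine referenced in Table 2
# (pin versions according to the vllm-ascend compatibility matrix)
pip install vllm vllm-ascend
```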

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                                  |
| ------------------------------------- |
| Liger-kernel                          |
| Quantization / QLoRA-related          |
| Megatron-related                      |
| Using sglang as the inference engine  |

docs/source_en/BestPractices/NPU-support.md

Lines changed: 63 additions & 0 deletions
@@ -1,5 +1,22 @@

# NPU Support

We have added Ascend NPU support to ms-swift, so you can fine-tune and run inference on Ascend NPUs.

This document describes how to prepare the environment, fine-tune, run inference, and deploy on NPUs.

## Installation

Base environment requirements:

| Software  | Version         |
| --------- | --------------- |
| Python    | >= 3.10, < 3.12 |
| CANN      | == 8.3.RC1      |
| torch     | == 2.7.1        |
| torch_npu | == 2.7.1        |

For detailed environment setup, please refer to the [Ascend PyTorch installation guide](https://gitcode.com/Ascend/pytorch).
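
Beyond the Python packages, the CANN toolkit has to be visible to the shell before torch_npu can see the devices. The snippet below is a minimal sketch assuming the default CANN install prefix; adjust the path if your toolkit is installed elsewhere.

```shell
# Make the CANN toolkit visible to the current shell
# (assumes the default install prefix /usr/local/Ascend)
source /usr/local/Ascend/ascend-toolkit/set_env.sh

# Check that torch_npu can see the NPUs
python -c "import torch, torch_npu; print(torch.npu.is_available(), torch.npu.device_count())"
```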

## Environment Preparation

Experiment Environment: 8 * Ascend 910B3 64G (The device is provided by [@chuanzhubin](https://github.com/chuanzhubin), thanks for the support of ModelScope and Swift~)
@@ -17,6 +34,9 @@ pip install ms-swift -U
pip install torch-npu decorator
# If you want to use deepspeed (to control memory usage, training speed might decrease)
pip install deepspeed

# If you need the evaluation functionality, please install the following package
pip install evalscope[opencompass]
```

Check if the test environment is installed correctly and whether the NPU can be loaded properly.
@@ -221,3 +241,46 @@ ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```

## Current Support Status

### Table 1: SFT Algorithms

| Algorithm | Model Families              | Strategy              | Hardware          |
| --------- | --------------------------- | --------------------- | ----------------- |
| SFT       | Qwen2.5-0.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-1.5B-Instruct       | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-7B-Instruct         | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-3B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-VL-7B-Instruct      | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen2.5-Omni-3B             | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-8B                    | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-32B                   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-VL-30B-A3B-Instruct   | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Qwen3-Omni-30B-A3B-Instruct | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | InternVL3-8B                | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |
| SFT       | Ovis2.5-2B                  | FSDP1/FSDP2/deepspeed | Atlas 900 A2 PODc |

---

### Table 2: RL Algorithms

| Algorithm | Model Families      | Strategy  | Rollout Engine | Hardware          |
| --------- | ------------------- | --------- | -------------- | ----------------- |
| **GRPO**  | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **GRPO**  | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **DPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen2.5-7B-Instruct | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |
| **PPO**   | Qwen3-8B            | deepspeed | vllm-ascend    | Atlas 900 A2 PODc |

---

### Table 3: Modules Not Yet Supported / Not Fully Verified on NPUs

| Item                              |
| --------------------------------- |
| Liger-kernel                      |
| Quantization/QLoRA                |
| Megatron-related modules          |
| Using sglang as inference engine  |
