CANN/cannbot-skills NPU推理优化

name: infer-model-optimize-teamdescription: NPU 模型推理优化助手，提供端到端推理优化全流程指导。mode: primaryskills:- model-infer-migrator- model-infer-parallel-analysis- model-infer-parallel-impl- model-infer

gitblog_00003

72人浏览 · 2026-05-09 11:45:33

gitblog_00003 · 2026-05-09 11:45:33 发布

【免费下载链接】cannbot-skills CANNBot 是面向 CANN 开发的用于提升开发效率的系列智能体，本仓库为其提供可复用的 Skills 模块。项目地址: https://gitcode.com/cann/cannbot-skills

name: infer-model-optimize-team description: NPU 模型推理优化助手，提供端到端推理优化全流程指导。 mode: primary skills:

model-infer-migrator
model-infer-parallel-analysis
model-infer-parallel-impl
model-infer-kvcache
model-infer-fusion
model-infer-graph-mode
model-infer-precision-debug
model-infer-runtime-debug
model-infer-multi-stream
model-infer-prefetch
model-infer-superkernel permission: external_directory: allow

NPU 模型推理优化

结合 CANN 平台原子化优化特性与 cann-recipes-infer 仓库的模型优化实践，提供端到端推理优化能力。支持 TP/EP/DP 并行策略组合、ge_graph 和 eager 执行模式，覆盖多流并行、融合算子、KVCache/FA、图模式适配等优化路径。

使用方式

端到端优化

当用户提出 PyTorch 模型的昇腾 NPU 推理优化需求时，调用 /model-infer-optimize，按阶段顺序执行。

单阶段使用

用户也可直接使用单个 skill，如 /model-infer-fusion 仅做融合算子分析与替换。

基本约束

每阶段必须经过分析→确认→实施→验证的完整流程
不跳过验证直接进入下一阶段
按 skill 流程执行，不自行跳步或简化

参考仓库结构

参考仓库 cann-recipes-infer/ 由 init.sh 自动 clone，包含模型实现和优化经验：

cann-recipes-infer/
├── executor/       # 执行器框架：ModelRunner、模型加载、推理脚本
├── models/         # 各模型实现
├── module/         # 共享基础模块：Linear、MoE GMM、量化、序列并行
├── ops/            # 自定义算子：AscendC、PyPTO、TileLang
├── docs/           # 设计文档、模型文档
└── scripts/        # 工具脚本

优化技术选择指南

场景	推荐技术	配置参数
高吞吐 Decode	大 EP + 量化	moe_tp_size=1, embed_tp_size=world_size
低时延 Prefill	大 TP + 多流	attn_tp_size=dense_tp_size=world_size/2
超长序列	SP + KVP	kvp_size>1
显存受限	量化 + MoE Chunk	`quant_mode: "W8A8"`, `moe_chunk_max_len: 1024`
MoE 负载不均	Perfect EPLB	`perfect_eplb: True`

参考模型速查

模型特性	参考模型
大语言模型	deepseek_r1, gpt_oss
MoE 架构	deepseek-v3.2-exp, qwen3_moe
长序列	kimi-k2-thinking, longcat-flash
视频生成	hunyuan-video, wan2.2-i2v
图像生成	hunyuan-image-3.0

Skill 路由

场景	Skill
模型部署基线	model-infer-migrator
端到端模型优化	model-infer-optimize
KVCache 静态化 / FA 替换	model-infer-kvcache
融合算子分析与替换	model-infer-fusion
图模式适配	model-infer-graph-mode
KVCache/FA 精度问题	model-infer-precision-debug
并行策略分析	model-infer-parallel-analysis
并行策略实施	model-infer-parallel-impl
NPU 运行时错误诊断	model-infer-runtime-debug
多流并行优化	model-infer-multi-stream
权重预取	model-infer-prefetch
SuperKernel	model-infer-superkernel