NPU TileLang User Guide

cann-recipes-infer: optimization samples built on the CANN platform for typical models and acceleration algorithms used in LLM and multimodal inference. Project address: https://gitcode.com/cann/cann-recipes-infer

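1. Preparation

a) Prepare the image

Obtain the Docker image

Download the Docker image from the ARM image address, upload it to every node of the A3 server, and import it with the following command:

docker load -i ds_cann_nightly_aarch_pta2.8_image_v1_1.tar

Start the Docker container

Use the script below to start the container. The default container name is cann_recipes_infer.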

docker run -u root -itd --name cann_recipes_infer --ulimit nproc=65535:65535 --ipc=host \
    --device=/dev/davinci0     --device=/dev/davinci1 \
    --device=/dev/davinci2     --device=/dev/davinci3 \
    --device=/dev/davinci4     --device=/dev/davinci5 \
    --device=/dev/davinci6     --device=/dev/davinci7 \
    --device=/dev/davinci8     --device=/dev/davinci9 \
    --device=/dev/davinci10    --device=/dev/davinci11 \
    --device=/dev/davinci12    --device=/dev/davinci13 \
    --device=/dev/davinci14    --device=/dev/davinci15 \
    --device=/dev/davinci_manager --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /home/:/home \
    -v /data:/data \
    -v /etc/localtime:/etc/localtime \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /etc/ascend_install.info:/etc/ascend_install.info -v /var/log/npu/:/usr/slog \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/sbin:/usr/local/sbin \
    -v /etc/hccn.conf:/etc/hccn.conf -v /root/.pip:/root/.pip -v /etc/hosts:/etc/hosts \
    -v /usr/bin/hostname:/usr/bin/hostname \
    --net=host \
    --shm-size=128g \
    --privileged \
    ds_cann_nightly_aarch_pta2.8_image_v1_1:v1.1 /bin/bash
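The sixteen --device=/dev/davinciN flags in the command above follow a regular pattern. As a minimal sketch, assuming the node exposes devices numbered 0 to 15 as in that command, the flags could be generated instead of typed by hand. The helper davinci_device_flags is hypothetical and not part of Docker or CANN.

```python
def davinci_device_flags(num_devices=16):
    """Return the --device flags for /dev/davinci0 .. /dev/davinci{num_devices - 1}."""
    return [f"--device=/dev/davinci{i}" for i in range(num_devices)]


# Join the flags on one line so they can be pasted into `docker run`.
print(" ".join(davinci_device_flags()))
```

On a node with fewer NPUs, passing a smaller num_devices keeps the command from referencing device files that do not exist.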

Enter the container:

docker attach cann_recipes_infer

b) Set the environment variables

source /usr/local/Ascend/ascend-toolkit/set_env.sh

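c) Quick start with TileLang

We provide a complete Docker image in which the TileLang repository and all of its dependencies are preinstalled, with support for the full set of basic AscendC operator APIs. You can run the code directly, or install TileLang yourself by following Section 2.

Taking the elementwise example elementwise_add.py, running the code directly only requires the following commands: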

cd /workspace/tilelang-ascend
source set_env.sh
python /workspace/tilelang-ascend/examples/elementwise/elementwise_add.py 

On success, the script prints:

init successful!
Kernel Output Match!
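When the example is run from a script or CI job, the two lines above can serve as a success marker. The following is a minimal sketch, assuming the example prints exactly those lines; run_succeeded and EXPECTED_LINES are hypothetical names introduced here for illustration.

```python
# Success lines as listed in this guide.
EXPECTED_LINES = ("init successful!", "Kernel Output Match!")


def run_succeeded(stdout_text):
    """Return True if every expected success line appears in the captured output."""
    printed = {line.strip() for line in stdout_text.splitlines()}
    return all(marker in printed for marker in EXPECTED_LINES)


print(run_succeeded("init successful!\nKernel Output Match!\n"))
```

The text to check could be captured with subprocess.run([...], capture_output=True, text=True).stdout, using the python command shown earlier as the argument list.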

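2. Installing TileLang

TileLang can be installed in several ways. For details, see the installation guide from the TileLang community:

https://github.com/tile-ai/tilelang-ascend?tab=readme-ov-file#installation

This guide installs TileLang by building it from source.

a) Download the code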

git clone --recursive https://github.com/tile-ai/tilelang-ascend
cd tilelang-ascend

b) Build and install

bash install_ascend.sh

c) Set the environment variables

source set_env.sh
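After sourcing set_env.sh, a quick check that the package is visible to Python can catch a failed install early. This is a minimal sketch, assuming the package is importable under the name tilelang (the guide does not state the import name); module_available is a hypothetical helper.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be located."""
    return importlib.util.find_spec(name) is not None


# Assumption: the install exposes a top-level module named "tilelang".
print("tilelang importable:", module_available("tilelang"))
```

The check only locates the module and does not import it, so it will not surface errors raised while the package initializes.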

3. Running

This section shows how to invoke an NPU TileLang operator.

The sparse_flash_attention operator is used as the example.

cd {your_cann_recipes_path}/ops/tilelang/{ds_version}/examples  # for example, replace {your_cann_recipes_path} with /home/code/cann-recipes-infer
python3 test_sfa.py

On success, the script prints:

Test passed!

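The code snippets below are all taken from sparse_flash_attention.py. The rest of this section explains how to run and test the operator.

a) Invoking the operator

sparse_fa_func is the entry point of the operator. Call it to run the sparse_flash_attention computation: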

sparse_fa_func = sparse_attention_fwd(
    heads=128,
    dim=512,
    tail_dim=64,
    topk=2048,
    kv_stride=1,
)
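The arguments above line up with the shapes used by the reference function and the test later in this section (H=128, DQK=576, DV=512, topk=2048). The sketch below checks that arithmetic. It assumes that dim + tail_dim is the Q/K head dimension, a reading inferred from the numbers in this guide rather than stated by it.

```python
# Values passed to sparse_attention_fwd in this guide.
heads, dim, tail_dim, topk = 128, 512, 64, 2048

# Shapes used by the test in section c).
H, DQK, DV, TOPK = 128, 576, 512, 2048

# Assumption: each Q/K row holds `dim` channels plus `tail_dim` extra channels.
qk_dim = dim + tail_dim

assert heads == H
assert qk_dim == DQK  # 512 + 64 == 576
assert dim == DV      # the reference takes V as the first 512 channels of kv
assert topk == TOPK
print("q/k head dim:", qk_dim)
```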

b) Generating the golden values

The following code generates the golden (reference) values for the operator.

import torch


def ref_sparse_attention_fwd_interface(q_param, kv, indices, q_start_index_s, kv_stride=4,
                                       sm_scale_param=None, is_casual=True):
    # Compute the reference in float32.
    q = q_param.float()
    kv = kv.float()
    indices = indices.transpose(1, 2)
    b, sq, h, dim_q = q.shape
    _, sk, g, _ = kv.shape
    if q_start_index_s is None:
        q_start_index_s = sk * kv_stride - sq

    assert kv.shape[-1] == 576, 'you should assign dim otherwise'
    dim = 512
    k = kv
    v = kv[..., :dim]  # V is the first `dim` channels of kv

    b, _, _, dim_v = v.shape
    num_kv_per_index = 1
    g_index = g
    h_index = h // g
    # Causal mask: a query position may attend to KV positions at or before it.
    compressed_casual_mask = (
        torch.arange(q_start_index_s, sq + q_start_index_s, dtype=torch.int32).view(-1, 1)
        >= torch.arange(kv_stride - 1, sk * kv_stride, kv_stride, dtype=torch.int32).view(1, -1)
    )

    # Scatter the selected indices into a boolean mask. The extra column (index sk)
    # absorbs padding entries and is dropped afterwards.
    mask = q.new_zeros(b, g_index, sq, sk + 1, dtype=torch.bool).scatter(3, indices.long(), 1)
    mask = mask[..., :-1]
    mask = mask & compressed_casual_mask.view(1, 1, sq, sk)
    mask[:, :, :kv_stride - 1, 0] = True
    mask = mask.view(b, g_index, 1, sq, sk)

    q = q.view(b, sq, g, -1, dim_q)
    score = torch.einsum("bmghd,bngd->bghmn", q, k)
    # Default softmax scale is 1 / sqrt(dim_q).
    sm_scale = dim_q ** -0.5 if sm_scale_param is None else sm_scale_param
    score = score.masked_fill(~mask, float("-inf")).mul(sm_scale)
    p = score.softmax(dim=-1)
    p = p.view(b, g_index, h_index, -1, sq, sk)
    p = p.view(b, g, -1, sq, sk)
    o = torch.einsum("bghmn,bngd->bmghd", p.type(v.dtype), v)
    o = o.reshape(b, sq, h, dim_v)
    return o.to(torch.float16)
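The reference combines two masks: a causal mask over the strided KV positions, and a selection mask built by scattering the top-k indices, where an index equal to sk acts as padding. The sketch below restates both steps in plain Python on tiny sizes so the logic can be inspected by hand. It is an illustration only, not the author's code: causal_mask and topk_mask are hypothetical helpers, and one index row is reused for every query for brevity.

```python
def causal_mask(sq, sk, q_start, kv_stride):
    """mask[i][j] is True when query i may attend to KV entry j."""
    q_pos = range(q_start, q_start + sq)
    k_pos = range(kv_stride - 1, sk * kv_stride, kv_stride)
    return [[qp >= kp for kp in k_pos] for qp in q_pos]


def topk_mask(indices_row, sk):
    """Mark selected KV entries; index == sk is padding and lands in a dropped column."""
    row = [False] * (sk + 1)
    for idx in indices_row:
        row[idx] = True
    return row[:-1]


causal = causal_mask(sq=3, sk=4, q_start=1, kv_stride=1)
selected = topk_mask([0, 2, 4, 4], sk=4)  # two real indices, two padding entries
combined = [[c and s for c, s in zip(row, selected)] for row in causal]

for row in combined:
    print(row)
```

Entries that are False in the combined mask correspond to the scores that the reference fills with -inf before the softmax.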

c) End-to-end invocation and comparison against the reference

os.environ["ACL_OP_INIT_MODE"] = "1"
B, S, SKV, H, HKV, DQK, DV, topk = 1, 128, 32768, 128, 1, 576, 512, 2048
dtype = torch.float16

KV_stride = 1
q_start_s_index = 4096 * 7

q = torch.randn((B, S, H, DQK), dtype=dtype)
kv = torch.randn((B, SKV, HKV, DQK), dtype=dtype)
indices = torch.full((B, S, HKV, topk), SKV, dtype=torch.int32)
for b in range(B):
    for t in range(S):
        for h in range(HKV):
            i_i = torch.randperm(max(1, ((t + q_start_s_index) // KV_stride)))[:topk]
            indices[b, t, h, :len(i_i)] = i_i

output = torch.empty((B, S, H, DV), dtype=dtype)

workspace_1 = torch.zeros((256, 64, 512), dtype=dtype)
workspace_2 = torch.zeros((256, 64, 64), dtype=dtype)
workspace_3 = torch.zeros((256, 64, 64), dtype=torch.float)
workspace_4 = torch.zeros((256, 64, 64), dtype=dtype)
workspace_5 = torch.zeros((256, 64, 512), dtype=torch.float)

torch.npu.synchronize()
print("init successful!")

output = sparse_fa_func(q, kv, indices, workspace_1, workspace_2, workspace_3, workspace_4, workspace_5)

torch.npu.synchronize()

ref_output = ref_sparse_attention_fwd_interface(q, kv, indices, q_start_s_index, KV_stride)
torch.npu.synchronize()
torch.testing.assert_close(ref_output, output, rtol=1e-2, atol=1e-2)
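The final check uses torch.testing.assert_close, which treats two values as close when |actual - expected| <= atol + rtol * |expected|. The sketch below applies that criterion to plain Python floats with the tolerances used above. is_close is a hypothetical helper, and it leaves out the dtype, shape, and NaN handling that the real function performs.

```python
def is_close(actual, expected, rtol=1e-2, atol=1e-2):
    """Scalar version of the closeness criterion: |a - e| <= atol + rtol * |e|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


print(is_close(1.015, 1.0))  # difference 0.015 against a bound of 0.01 + 0.01 * 1.0
print(is_close(1.05, 1.0))   # difference 0.05 against the same bound
```

With rtol and atol both at 1e-2, the absolute term dominates for outputs near zero and the relative term grows with the magnitude of the expected value.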
