TileLang简介
最近Deepseekv3.2发布关注到TileLang,简单总结下;
·
最近Deepseekv3.2发布关注到TileLang,简单总结下;
TileLang是什么:
官方简介:Tile语言(tile-lang)是一种简洁的领域特定语言(DSL),旨在简化高性能GPU/CPU内核(例如GEMM、反量化GEMM、FlashAttention、线性注意力)的开发。通过在TVM之上采用具有底层编译器基础设施的Python式语法,tile-lang使开发者能够专注于提高生产力,同时又不牺牲实现先进性能所需的底层优化。
简单讲:TileLang类似python语法,只需专注算法逻辑,不用具体的实现,就能开发出高性能的AI算子;
以矩阵乘算子举例,对比多年写的CUDA C的矩阵乘法算子,TiLeLang简单明了;
import tilelang
import tilelang.language as T
# @tilelang.jit(target="cuda")
# target currently can be "cuda" or "hip" or "cpu".
# if not specified, it will be inferred from the input tensors during compile time
@tilelang.jit
def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):
@T.prim_func
def matmul_relu_kernel(
A: T.Tensor((M, K), dtype),
B: T.Tensor((K, N), dtype),
C: T.Tensor((M, N), dtype),
):
# Initialize Kernel Context
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
A_shared = T.alloc_shared((block_M, block_K), dtype)
B_shared = T.alloc_shared((block_K, block_N), dtype)
C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
# Enable rasterization for better L2 cache locality (Optional)
# T.use_swizzle(panel_size=10, enable=True)
# Clear local accumulation
T.clear(C_local)
for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
# Copy tile of A
# This is a sugar syntax for parallelized copy
T.copy(A[by * block_M, ko * block_K], A_shared)
# Copy tile of B
T.copy(B[ko * block_K, bx * block_N], B_shared)
# Perform a tile-level GEMM on the shared buffers
# Currently we dispatch to the cute/hip on Nvidia/AMD GPUs
T.gemm(A_shared, B_shared, C_local)
# relu
for i, j in T.Parallel(block_M, block_N):
C_local[i, j] = T.max(C_local[i, j], 0)
# Copy result back to global memory
T.copy(C_local, C[by * block_M, bx * block_N])
return matmul_relu_kernel
代码说明:
matmul_relu_kernel函数
外循环,N 和 M 分别按block_N 和 block_M大小分块,块索引bx,by;
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
首先在share memory分配了2块内存A_share,B_share 分别存储A、B矩阵的block,
A_shared = T.alloc_shared((block_M, block_K), dtype)
B_shared = T.alloc_shared((block_K, block_N), dtype)
C_local在线程寄存器分配空间,存储A_share和B_share相乘的中间结果;
累加前先清零;
C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
T.clear(C_local)
内循环,按K方向分块,分块大小block_K;
for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
分别将A、B的矩阵的block从全局显存copy到share memory中;
T.copy(A[by * block_M, ko * block_K], A_shared)
T.copy(B[ko * block_K, bx * block_N], B_shared)
矩阵乘法累加结果到C_local;
T.gemm(A_shared, B_shared, C_local)
将累加的结果并行执行标量运算Relu,并将最终结果copy到全局变量中;
# relu
for i, j in T.Parallel(block_M, block_N):
C_local[i, j] = T.max(C_local[i, j], 0)
# Copy result back to global memory
T.copy(C_local, C[by * block_M, bx * block_N])
从上面的TileLang代码中可以看出,开发者只需要关注运算的逻辑,并不需要关注具体执行的细节线程绑定、内存布局等,提高的开发效率;
(完)
更多推荐



所有评论(0)