最近Deepseekv3.2发布关注到TileLang,简单总结下;

TileLang是什么:

官方简介:Tile语言(tile-lang)是一种简洁的领域特定语言(DSL),旨在简化高性能GPU/CPU内核(例如GEMM、反量化GEMM、FlashAttention、线性注意力)的开发。通过在TVM之上采用具有底层编译器基础设施的Python式语法,tile-lang使开发者能够专注于提高生产力,同时又不牺牲实现先进性能所需的底层优化。

简单讲:TileLang类似python语法,只需专注算法逻辑,不用具体的实现,就能开发出高性能的AI算子;

以矩阵乘算子举例,对比多年写的CUDA C的矩阵乘法算子,TiLeLang简单明了;

import tilelang
import tilelang.language as T

# @tilelang.jit(target="cuda")
# target currently can be "cuda" or "hip" or "cpu".
# if not specified, it will be inferred from the input tensors during compile time
@tilelang.jit
def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float"):

    @T.prim_func
    def matmul_relu_kernel(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
    ):
        # Initialize Kernel Context
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            # Enable rasterization for better L2 cache locality (Optional)
            # T.use_swizzle(panel_size=10, enable=True)

            # Clear local accumulation
            T.clear(C_local)

            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                # Copy tile of A
                # This is a sugar syntax for parallelized copy
                T.copy(A[by * block_M, ko * block_K], A_shared)

                # Copy tile of B
                T.copy(B[ko * block_K, bx * block_N], B_shared)

                # Perform a tile-level GEMM on the shared buffers
                # Currently we dispatch to the cute/hip on Nvidia/AMD GPUs
                T.gemm(A_shared, B_shared, C_local)
            
            # relu
            for i, j in T.Parallel(block_M, block_N):
                C_local[i, j] = T.max(C_local[i, j], 0)

            # Copy result back to global memory
            T.copy(C_local, C[by * block_M, bx * block_N])

    return matmul_relu_kernel

代码说明:

matmul_relu_kernel函数

外循环,N 和 M 分别按block_N 和 block_M大小分块,块索引bx,by;

with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):

首先在share memory分配了2块内存A_share,B_share 分别存储A、B矩阵的block,

A_shared = T.alloc_shared((block_M, block_K), dtype)
B_shared = T.alloc_shared((block_K, block_N), dtype)

C_local在线程寄存器分配空间,存储A_share和B_share相乘的中间结果;

累加前先清零;

 C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
 T.clear(C_local) 

内循环,按K方向分块,分块大小block_K;

 for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):

分别将A、B的矩阵的block从全局显存copy到share memory中;

T.copy(A[by * block_M, ko * block_K], A_shared)
T.copy(B[ko * block_K, bx * block_N], B_shared)

矩阵乘法累加结果到C_local;

T.gemm(A_shared, B_shared, C_local)

将累加的结果并行执行标量运算Relu,并将最终结果copy到全局变量中;

# relu
for i, j in T.Parallel(block_M, block_N):
  C_local[i, j] = T.max(C_local[i, j], 0)

# Copy result back to global memory
T.copy(C_local, C[by * block_M, bx * block_N])

从上面的TileLang代码中可以看出,开发者只需要关注运算的逻辑,并不需要关注具体执行的细节线程绑定、内存布局等,提高的开发效率;

(完)

Logo

免费领 100 小时云算力,进群参与显卡、AI PC 幸运抽奖

更多推荐