CANN/cannbot-skills AscendC kernel catalog
Kernel Catalog
For initial filtering, prefer `agent/references/examples/kernel-index.md` (lean one-line table). Come here only to read the `study_for` and `do_not_copy_when` detail of the ≤3 candidates you already picked. Do not copy a kernel body directly just because the formula looks similar.
How to read this file
This file is large (~700 lines) and entries are independent. Do NOT read it linearly.
Workflow:
1. Get one or more candidate paths (e.g. `agent/example/kernels/a2/flash_attn_full.py`) from `kernel-index.md` first.
2. Use Grep to find the matching `###` heading line, e.g. pattern `^### .*kernels/a2/flash_attn_full\.py.*` against this file.
3. Read with `offset=<line>` and `limit=25` to load only that one entry.
4. If the entry includes `deep_note`, open it only when the short entry still is not enough.
5. Repeat for each candidate. Stop after ≤3 entries.
If you find yourself reading a ## section header you did not intend to land in, you are scrolling — go back to step 2.
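The workflow above can be sketched in plain Python. This is a minimal illustration, assuming entry headings have the form `### <path>`; `find_entry`, `read_entry`, and the tiny `catalog` list are hypothetical names made up for this sketch, not tools or data from this repository.

```python
import re

def find_entry(lines, kernel_path):
    """Return the 0-based index of the `###` heading that mentions kernel_path, or None."""
    pattern = re.compile(r"^### .*" + re.escape(kernel_path))
    for idx, line in enumerate(lines):
        if pattern.search(line):
            return idx
    return None

def read_entry(lines, offset, limit=25):
    """Load at most `limit` lines starting at `offset`, stopping before the next heading."""
    entry = [lines[offset]]
    for line in lines[offset + 1 : offset + limit]:
        if line.startswith("## ") or line.startswith("### "):
            break  # landed on another entry or section: stop instead of scrolling
        entry.append(line)
    return entry

# Hypothetical miniature catalog used only to exercise the helpers.
catalog = [
    "## a2 kernels",
    "### `agent/example/kernels/a2/flash_attn_full.py`",
    "- formula: ...",
    "- topology: ...",
    "### `agent/example/kernels/a2/flash_attn_unnorm.py`",
    "- formula: ...",
]
offset = find_entry(catalog, "kernels/a2/flash_attn_full.py")
entry = read_entry(catalog, offset)
```

The early `break` mirrors the rule about not reading past the entry you intended to land in.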
For each entry the schema is:
- `formula`: the main reference contract
- `topology`: the staged pipeline shape
- `study_for`: what this file is actually good for
- `deep_note`: optional extra rationale for the few kernels that need more than one short entry
- `do_not_copy_when`: when the resemblance is misleading
Sections (for orientation only — jump by Grep, not by scrolling)
- Vec-only baselines
- Cube-only baselines
- Cube -> vec postprocess (a5)
- Vec -> cube preprocess (a5)
- Vec -> cube -> vec fusion (a5)
- Vec -> cube -> vec -> cube -> vec state bridge (a5)
- Cube -> vec -> cube -> vec lookahead pipeline (a5)
- Pure vec and micro references
- a2 kernels (cube-only, vec-only, single / double / triple GM bridge, causal and hif8 variants)
Index schema (for machine readers, not for kernel authors)
This file also feeds `agent/index/kernels.json`. The builder reads each `###` entry heading as one kernel record, the surrounding `##` section as the category, and top-level / nested bullets as ordered fields. If you edit this catalog, keep `formula` / `topology` / `study_for` / `do_not_copy_when` stable.
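To make the record shape concrete, here is a minimal parser sketch. It is not the actual builder behind `agent/index/kernels.json`, which is not shown here; `parse_catalog`, the record keys, and the sample text are assumptions for illustration only.

```python
import re

FIELD_RE = re.compile(r"^- (formula|topology|study_for|deep_note|do_not_copy_when):\s*(.*)$")

def parse_catalog(text):
    """Parse `##` sections, `###` entries, and bullet fields into a list of records."""
    records, section, record, field = [], None, None, None
    for raw in text.splitlines():
        line = raw.rstrip()
        if line.startswith("### "):
            # One kernel record per entry heading; the category is the enclosing section.
            record = {"path": line[4:].strip("` "), "category": section, "fields": {}}
            records.append(record)
            field = None
        elif line.startswith("## "):
            section, record, field = line[3:].strip(), None, None
        elif record is not None:
            match = FIELD_RE.match(line)
            if match:
                field, inline = match.group(1), match.group(2)
                record["fields"][field] = [inline] if inline else []
            elif field and line.lstrip().startswith("- "):
                # Nested bullet: append to the most recent top-level field.
                record["fields"][field].append(line.lstrip()[2:])
    return records

sample = """## Cube-only baselines
### `agent/example/kernels/a5/matmul_float_mmad.py`
- formula: `z = x @ y.t()`
- topology: `cube-only`
- study_for:
  - shortest end-to-end cube matmul baseline
  - minimal simulator validation story
- do_not_copy_when:
  - you need mixed cube/vec ownership
"""
records = parse_catalog(sample)
```

Because the fields are matched by name, renaming any of the four stable keys would silently drop that field from a parser written this way, which is one reason to keep them unchanged.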
Vec-only baselines
agent/example/kernels/a2/to_hif8_torch.py
- formula: `y = to_hif8_torch(x)` with float32 output that emulates hif8 rounding and uses finite saturation sentinels for overflow
- topology: `vec-only`
- study_for:
  - a2 pure-vec elementwise quantization without cube stages
  - exponent-bit extraction through `reinterpret` + `vand` / `vnot`
  - explicit `RoundMode.TRUNC` based implementation of `sign(x) * floor(abs(x) + 0.5)`
  - preserving `NaN` / `Inf` inputs while replacing finite overflow with large finite saturation values
- do_not_copy_when:
  - you need true `DT.hif8` runtime output loading rather than float32 emulation
  - your kernel is fundamentally cube-bound or mixed cube/vec rather than vec-only
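The rounding identity named in this entry, `sign(x) * floor(abs(x) + 0.5)` built on truncation, can be checked on the host with a few lines of Python. This sketch only illustrates that identity (round half away from zero); it does not model the hif8 format or the kernel's saturation values, and `round_half_away_trunc` is a hypothetical name.

```python
import math

def round_half_away_trunc(x: float) -> float:
    """sign(x) * floor(abs(x) + 0.5), with floor realized as truncation."""
    if math.isnan(x) or math.isinf(x):
        return x  # NaN / Inf inputs pass through unchanged
    # trunc equals floor here because abs(x) + 0.5 is never negative
    magnitude = math.trunc(abs(x) + 0.5)
    return math.copysign(float(magnitude), x)

samples = [2.5, -2.5, 0.4, 1.49, -3.5]
rounded = [round_half_away_trunc(v) for v in samples]
```

Truncation is enough because the sign is stripped before rounding and restored afterwards, so no round-to-nearest mode is needed.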
agent/example/kernels/a5/chunk_row_cumsum.py
- formula:
  - split `x:[M,H]` into contiguous row chunks of size `chunk_size`
  - for each chunk, `y[0,:] = x[0,:]`
  - for each later row in the same chunk, `y[i,:] = x[i,:] + y[i-1,:]`
- topology: `vec-only`
- study_for:
  - a5 vec-only row-recursive accumulation authored entirely inside `@vf()`
  - keeping the carry on the previous output row instead of trying to reinterpret `cpadd` as a scan primitive
  - flattening each row into 64-lane register slices so the same `@vf()` works for both wide rows and the padded `H < 64` path
  - using `gm_to_ub_pad` / `ub_to_gm_pad` to preserve a logical `x:[chunk_size,H]` contract while storing narrow rows in `[chunk_size,64]` UB buffers
  - handling a tail on the final chunk in the row dimension while keeping the first version column-aligned when `H >= 64`
- do_not_copy_when:
  - you need a true global cumsum across chunk boundaries instead of restarting at each `chunk_size` block
  - your `H` is non-64-aligned and also `>= 64`, which still needs a wider-column tail path
  - your recurrence needs cross-row state more complex than plain add
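A host-side reference for the formula above, written as a minimal pure-Python sketch. The function name and the nested-list representation are assumptions for illustration; the kernel itself works on UB buffers and register slices.

```python
def chunk_row_cumsum(x, chunk_size):
    """x is a list of M rows of H numbers; the scan restarts every chunk_size rows."""
    y = []
    for i, row in enumerate(x):
        if i % chunk_size == 0:
            y.append(list(row))  # first row of a chunk is copied as-is
        else:
            prev = y[i - 1]  # the carry lives on the previous output row
            y.append([a + b for a, b in zip(row, prev)])
    return y

x = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
y = chunk_row_cumsum(x, chunk_size=2)
```

With `M = 5` and `chunk_size = 2`, the final chunk holds a single row, which exercises the row-dimension tail mentioned in `study_for`.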
Cube-only baselines
agent/example/kernels/a5/matmul_float_mmad.py
- formula: `z = x @ y.t()`
- topology: `cube-only`
- study_for:
  - shortest end-to-end cube matmul baseline
  - minimal simulator validation story
  - first sanity check for pure cube lowering
- do_not_copy_when:
  - you need tiled DBuff structure
  - you need mixed cube/vec ownership
  - you need large-shape split selection
agent/example/kernels/a5/matmul_e5m2_shortcut.py
- formula: `z = x.float() @ y.float().t()` with float8 inputs
- topology: `cube-only`
- study_for:
  - float8 input staging into float accumulation
  - minimal float8 guard pattern in the runnable section
- do_not_copy_when:
  - your problem is mainly about tiling, not dtype
  - you need vec-side postprocess or quantized output
agent/example/kernels/a5/matmul_kmkn_fp32_out.py
- formula: `z = x.float().t() @ y.float()` with `x:[K,M]`, `y:[K,N]`
- topology: `cube-only`
- study_for:
  - transpose-at-matmul-call-site pattern
  - `KM @ KN -> MN` layout reasoning
  - explicit `%16` guards and `shape_bindings`
- do_not_copy_when:
  - your data is naturally `MKNK`
  - your main issue is mixed pipeline staging
agent/example/kernels/a5/matmul_mknk_2dgrid_splitn.py
- formula: `z = x.float() @ y.float().t()` with `x:[M,K]`, `y:[N,K]`
- topology: `cube-only`
- study_for:
  - 2D core split for aligned `MKNK`
  - outer tile/core split selection with `splitn`
  - sharing one `l1_cnt` for one operand-pair lifetime
- do_not_copy_when:
  - K-side staging is the real capacity bottleneck
  - you need vec postprocess after cube output
agent/example/kernels/a5/matmul_mknk_2dgrid_splitk.py
- formula: `z = x.float() @ y.float().t()` with `splitk`
- topology: `cube-only`
- study_for:
  - large-`K` aligned `MKNK` strategy
  - keeping outer `TILE_K=256` while legalizing inner staging with `SPLIT_K=64`
  - 2D split selection plus split-`k` accumulation ownership
- do_not_copy_when:
  - N-side width is the real issue and `splitn` is cleaner
  - your kernel is mainly about postprocess logic rather than cube staging
Cube -> vec postprocess
agent/example/kernels/a5/basic_cube_vec_mix.py
- formula: `z = abs(x @ y.t()) + 1.0`
- topology: `cube -> vec`
- study_for:
  - smallest mixed pipeline baseline
  - basic `CvMutex` ownership story
  - standard half-row vec writeback
- do_not_copy_when:
  - your kernel needs advanced tile scheduling
  - your vec stage has rowwise reductions or multiple outputs
agent/example/kernels/a5/matmul_half_splitn_bias10p2_vf.py
- formula: `z = ((x.float() @ y.float()) + 10.2).half()`
- topology: `cube -> vec`
- study_for:
  - float accumulation followed by vec-side downcast
  - stable FIX->V handoff pattern
  - half-output writeback after vec postprocess
- do_not_copy_when:
  - your output should stay float
  - you need large-shape 2D split logic
agent/example/kernels/a5/matmul_rowwise_norm.py
- formula: `z = (x @ y.t()) / row_sum(x @ y.t())`
- topology: `cube -> vec`
- study_for:
  - rowwise vec reduction after cube output
  - `cadd()` + `dup()` normalization pattern
- do_not_copy_when:
  - you need a two-pass large-`N` strategy
  - you need quantized or fp8 output
agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py
- formula: same as `matmul_rowwise_norm.py`
- topology: `cube -> vec`
- study_for:
  - two-pass normalization for larger `N/K`
  - temporary output persistence plus later reload
  - row-sum lifetime that spans multiple `N` tiles
- do_not_copy_when:
  - your normalized stage fits comfortably in one pass
  - your main problem is cube-side capacity rather than vec-side persistence
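The two-pass idea in this entry can be sketched on the host as follows. This is an illustrative pure-Python reference under stated assumptions: `z` stands in for the already-computed matmul result, `tile_n` is a hypothetical tile width, and the function name is made up for this sketch.

```python
def rowwise_norm_two_pass(z, tile_n):
    """Normalize each row of z by its row sum, visiting the N dimension tile by tile."""
    n = len(z[0])
    row_sum = [0.0] * len(z)
    # Pass 1: accumulate per-row sums across N tiles; unnormalized tiles stay persisted in z.
    for start in range(0, n, tile_n):
        for r, row in enumerate(z):
            row_sum[r] += sum(row[start : start + tile_n])
    # Pass 2: reload each tile and divide by the completed row sum.
    out = [[0.0] * n for _ in z]
    for start in range(0, n, tile_n):
        for r, row in enumerate(z):
            for c in range(start, min(start + tile_n, n)):
                out[r][c] = row[c] / row_sum[r]
    return out

z = [[1.0, 2.0, 3.0, 2.0], [4.0, 4.0, 4.0, 4.0]]
out = rowwise_norm_two_pass(z, tile_n=2)
```

The second pass cannot start until every tile has contributed to `row_sum`, which is why the row-sum lifetime has to span all `N` tiles.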
agent/example/kernels/a5/matmul_rowwise_l2_norm.py
- formula:
  - `z = x.float() @ w.float().t()`
  - `out = z / sqrt(sum(z^2, dim=1, keepdim=True))`
- topology: `cube -> vec`
- study_for:
  - two-pass rowwise L2 normalization after matmul
  - per-row squared-sum accumulation across `N` tiles
  - explicit `SHAPE_BINDINGS` and aligned-shape guard in the Python wrapper
- do_not_copy_when:
  - your normalization is sum-based rather than L2-based
  - your shape is not naturally aligned to the validated contract (`M%64`, `N%256`, `K%128`)
agent/example/kernels/a5/matmul_chunk_absmax_norm128.py
- formula: normalize each 128-column chunk by per-row absmax
- topology: `cube -> vec`
- study_for:
  - blockwise row statistics per fixed `CHUNK_N`
  - `abs -> cmax -> dup -> divide` idiom
- do_not_copy_when:
  - your block size is not naturally tied to cube `N` tiles
  - you need scale output rather than normalized values only
agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py
- formula:
  - `z_tmp = x.float().t() @ y.float()`
  - `scale = absmax(z_tmp, block=128) / 224`
  - `z = (z_tmp / scale).to(e5m2)`
- topology: `cube -> vec`
- study_for:
  - blockwise scale generation
  - fp8 quantized output after float accumulation
  - optional `pack4()` fallback path
- do_not_copy_when:
  - you do not need quantized output
  - your layout is `MKNK` rather than `KMKN`
agent/example/kernels/a5/matmul_mknk_2dgrid_splitk_add1.py
- formula: `z = x.float() @ y.float().t() + 1.0`
- topology: `cube -> vec`
- study_for:
  - large-`K` split-`k` cube stage with vec postprocess
  - estimator-driven 2D split selection plus vec tie-break concerns
  - separate counter for longer postprocess lifetime
- do_not_copy_when:
  - you only need a pure cube baseline
  - your vec stage is more complex than in-place elementwise add
agent/example/kernels/a5/cube_vec_atomic_add_two_outputs.py
- formula:
  - `out_cube += x @ y.t()`
  - `out_vec += abs(x @ y.t()) + 1`
- topology: `cube -> vec` with dual outputs and atomics
- study_for:
  - atomics on both cube and vec writeback paths
  - mixed dual-output ownership
- do_not_copy_when:
  - you do not need atomic accumulation
  - you only have one output path
Vec -> cube preprocess
agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py
- formula: `z = x.float().abs().sqrt().half().float() @ y.float().t()`
- topology: `vec -> cube`
- study_for:
  - ND vec preprocess then cube consume
  - subblock row publish into `L1`
  - `VcMutex` ownership edge
- do_not_copy_when:
  - your preprocess should stay packed NZ end-to-end
  - your host-side contract already gives cube-ready input
agent/example/kernels/a5/vec_cube_abs_sqrt_matmul_nz.py
- formula: same as `vec_cube_abs_sqrt_matmul.py`
- topology: `vec -> cube`
- study_for:
  - NZ publish path after vec preprocess
  - deinterleave + `reg_to_ub` packing before `l1 <<= ub.nz()`
- do_not_copy_when:
  - ND publish is enough and simpler
  - you do not actually need packed-NZ staging
agent/example/kernels/a5/recompute_wu_cube_vec.py
- formula:
  - `k_cumdecay = attn.float() @ (k_beta * decay_exp).float()`
  - `kv = attn.float() @ v.float()`
- topology: `vec -> cube`
- study_for:
  - strict `[*,1]` scalar-broadcast preprocess via `single()`
  - dual cube outputs after one vec preprocessing stage
  - flattened batch-axis scheduling with `BHN`
- do_not_copy_when:
  - your dimensions are not close to this specialized recurrent/WU structure
  - you need a generic attention template rather than this specific dual-output path
Vec -> cube -> vec fusion
agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py
- formula: `z = abs((x * 2).half().float() @ y.float().t()) + 1.0`
- topology: `vec -> cube -> vec`
- study_for:
  - one fused preprocess + cube + postprocess chain
  - explicit use of both `VcMutex` and `CvMutex`
  - stage separation with independent counters
- do_not_copy_when:
  - you have not yet validated the simpler vec->cube or cube->vec stage independently
  - your fusion requires delayed reuse across iterations
Vec -> cube -> vec -> cube -> vec state bridge
agent/example/kernels/a5/delta_h_state_bridge_v1_c8.py
- formula:
  - snapshot current recurrent state into `h_out`
  - `vprime = w @ state.T`
  - `v_new = u - vprime`
  - `v_scaled = v_new * exp(g_last - g_row)`
  - `state = state * exp(g_last) + k.T @ v_scaled`
- topology: `vec -> cube -> vec -> cube -> vec`
- study_for:
  - persistent UB state carried across chunk iterations
  - a5 `VcMutex` / `CvMutex` ownership transfer around a delayed second cube stage
  - aligned state-bridge scheduling with fixed `C=8`, `L=64`, `D=128`
- do_not_copy_when:
  - any of `C`, `L`, or `D` must be dynamic or tail-safe
  - your second cube stage does not reuse the same delayed-state bridge pattern
  - you still need the experimental wrappers in `tmp/` rather than the checked-in kernel body
agent/example/kernels/a5/delta_h_psudo_state_bridge_c8.py
- formula: pseudo-reference comparison kernel for the same `delta_h` state bridge contract
- topology: `vec -> cube -> vec -> cube -> vec`
- study_for:
  - keeping a pseudo-reference experiment on the same stable pipeline as the baseline kernel
  - comparing cycle-equivalent kernels against a looser reference tolerance
  - separating experiment wrappers in `tmp/` from the checked-in kernel body
- do_not_copy_when:
  - you need an exact pseudo residual/correction implementation rather than the stable v1-style schedule
  - you want a general reusable state-bridge kernel instead of this fixed aligned experiment specialization
Cube -> vec -> cube -> vec lookahead pipeline
agent/example/kernels/a5/test_mla_entire.py
- formula: streamed MLA-style score, softmax, delayed `p @ k_nope`, and final normalization
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - one-tile lookahead scheduling with warmup and drain
  - delayed-consumer counters (`stage1_cnt` vs `stage2_cnt`)
  - on-chip delayed reuse instead of forced GM round-trip
  - streamed `row_max` / `row_sum` / numerator accumulation
- do_not_copy_when:
  - your kernel does not truly need delayed stage reuse
  - you have not yet stabilized the simpler two-stage or three-stage version of the formula
agent/example/kernels/a5/mha_ifa.py
- formula: streamed single-row attention `softmax(q @ k.t()) @ v`
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - row-specialized `L=1` decode-style attention on a5
  - flattened `BH` scheduling with one query row kept resident while streaming `S`
  - simpler standard-attention lookahead flow than `agent/example/kernels/a5/test_mla_entire.py`
- do_not_copy_when:
  - you need multi-row query tiles
  - you need rope/nope fusion, fp8 staging, or MLA-specific math
  - your delayed stage cannot stay on chip cleanly
agent/example/kernels/a5/mha_ifa_256.py
- formula: streamed single-row attention `softmax(q @ k.t()) @ v` with `BASES=256`
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - keeping a `256`-wide on-chip score/value tile for half-input single-row attention on a5
  - using `splitk=64` for the `q @ k.t()` stage and `splitn=64` for the `p @ v` stage without shrinking the outer `BASES`
  - simpler ND baseline for `BASES=256` before trying NZ-published probability tiles
- do_not_copy_when:
  - your tile does not actually need a `256`-wide outer `S` chunk
  - you need multi-row query tiles
  - you have not first validated the simpler `BASES=128` path
agent/example/kernels/a5/mha_ifa_fp8_scale_256.py
- formula: streamed single-row attention `softmax((q * scale_q) @ (k * scale_k).t() / sqrt(D)) @ (v * scale_v)` with fp8 `q/k/v`, `BASES=256`, and fp8-scaled `p` tiles
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - row-specialized decode-style attention with `e4m3` `q/k/v` plus external float scales on a5
  - tail-safe `valid_cols` masking before `rowmax` when the last `S` tile is narrower than `BASES`
  - publishing vec-produced `p` tiles to `L1` as `e4m3` after `P_SCALE`, then compensating with final `scale_v / P_SCALE`
- do_not_copy_when:
  - your inputs are half and the simpler `agent/example/kernels/a5/mha_ifa_256.py` already matches the contract
  - you want the delayed `p` tile in NZ layout instead of the simpler ND bridge
  - your query side is not truly row-specialized (`L != 1`)
agent/example/kernels/a5/flash_attn_full_fp8_causal.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - score tiles obey left-up causal masking `k_pos <= q_pos`
  - tail columns behave like `-inf` before `rowmax`
  - `curr_m = maximum(prev_m, rowmax(score_j))`
  - `expdiff_j = exp(prev_m - curr_m)`
  - `p_j = exp(score_j - curr_m).to(e5m2)`
  - `row_sum = row_sum * expdiff_j + p_j.float().sum(-1)`
  - `pv_j = p_j.float() @ v_j.float()`
  - `out = out * expdiff_j + pv_j`
  - `out = out / row_sum`
  - returns final `out`, `rowmax`, and `rowsum`
- topology: `cube -> vec -> cube -> vec` (a5 on-chip lookahead with ND `l1p` bridge and fp8 probability tiles)
- study_for:
  - full-sequence multi-row attention on a5 with `TILE_M=TILE_N=128` and fixed `D=128`, not the `L=1` decode-style `mha_ifa*` family
  - tail-safe normalized online softmax when both `S1` and `S2` may be non-aligned, with score-domain tail invalidation and diagonal-tile causal masking
  - keeping delayed `p @ v` fully on chip by publishing vec-produced `e5m2` `p` tiles into `L1`, while keeping separate score / `pv` local families for stability
- deep_note: `agent/references/examples/deep/a5-flash-attn-full-fp8-causal.md`
- do_not_copy_when:
  - your query side is still row-specialized (`L=1`) and the simpler `mha_ifa*` family already matches the contract
  - your delayed stage-2 consumer wants NZ-published probability tiles instead of the ND `l1p` path
  - your contract uses externally scaled or differently formatted fp8 inputs rather than plain `e5m2` `q/k/v`
  - your head dimension is not the validated fixed `D=128`
agent/example/kernels/a5/mha_ifa_nz.py
- formula: streamed single-row attention `softmax(q @ k.t()) @ v` with NZ-published probability tiles
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - publishing vec-produced `p` tiles to `L1` in NZ layout for the delayed cube consumer
  - row-specialized `L=1` decode-style attention when stage 2 wants packed-NZ input
  - explicit `reg_to_ub(...).nz()` bridge inside a lookahead attention pipeline
- do_not_copy_when:
  - delayed stage 2 is fine with the simpler ND `l1p` path from `agent/example/kernels/a5/mha_ifa.py`
  - you need multi-row query tiles
  - your consumer does not actually benefit from packed-NZ staging
agent/example/kernels/a5/mha_ifa_nz_256.py
- formula: streamed single-row attention `softmax(q @ k.t()) @ v` with `BASES=256` and NZ-published probability tiles
- topology: `cube -> vec -> cube -> vec`
- study_for:
  - widening the NZ-published `p` tile to `256` on a5 while keeping the lookahead decode-style schedule
  - splitting a `256`-wide half row into two `128`-lane micro registers before `ub_to_l1_nz`
  - pairing `splitk=64` / `splitn=64` with an NZ `l1p` handoff instead of the simpler ND path
- do_not_copy_when:
  - delayed stage 2 is fine with the simpler ND `l1p` path from `agent/example/kernels/a5/mha_ifa_256.py`
  - you need tail-safe `S` handling without a full `BASES`-wide GM slice
  - your consumer does not actually benefit from packed-NZ staging
Pure vec and micro references
agent/example/kernels/a5/recurrent_state_attn_vec.py
- formula: recurrent attention-state update specialized for `D=128`
- topology: `vec-only`
- study_for:
  - pure vec stateful update pattern
  - `RegList`-heavy row math
  - flattening `(B,H,S,D)` into vec-friendly layouts
- do_not_copy_when:
  - your kernel needs cube compute
  - your dimension pattern is not this specialized state update
agent/example/kernels/a5/vec_unaligned_gm_to_ub_pad.py
- formula: vec compute on padded unaligned GM width (`exp + 2`)
- topology: `vec-only`
- study_for:
  - unaligned-width `gm_to_ub_pad` behavior
  - UB second-dim padding strategy
  - quick padded-transfer sanity checks
- do_not_copy_when:
  - your real problem is cross-side staging rather than vec padding
agent/example/kernels/a5/micro_cast_fp8_pack4_dual.py
- formula:
  - `out_e5m2 = src.to(float8_e5m2)`
  - `out_e4m3 = src.to(float8_e4m3fn)`
- topology: `micro-only`
- study_for:
  - micro cast path
  - `RegLayout.ZERO` plus required `pack4()` squeeze before UB writeback
  - dual-fp8-output micro flow
- do_not_copy_when:
  - your kernel is mainly a cube or vec pipeline
  - you only need a single conventional cast without micro-specific layout concerns
a2 kernels
agent/example/kernels/a2/qk_matmul_batched.py
- formula: `qk = q.float() @ k.float().t()` with batched BH flattening
- topology: `cube-only`
- study_for:
  - simplest a2 kernel baseline
  - batched M-tile distribution with BH flattening
  - L0C capacity verification for a2 (128 KB)
- do_not_copy_when:
  - you need vec postprocessing
  - you target a5
agent/example/kernels/a2/sort_rows.py
- formula: per-row ascending sort of a `[ROWS, COLS]` float32 matrix, emitting `sorted_value` and `sorted_idx` equivalent to `torch.sort(x, dim=-1)` (contract: `COLS=4096`, `ROWS=40`, inter-buffer `INTER_COLS = 2 * COLS`)
- topology: `vec-only`
- study_for:
  - a2 pure-vec row-wise sort pipeline built from `sort32` + `mergesort4` stages + `mergesort_2seq` final merge
  - sign-flip trick (`val * -1`) to reuse an ascending sort primitive for descending-then-flip ordering
  - interleaved `(value, index)` packed layout manipulated via `reinterpret` to `uint32` / `int`
  - `gather` de-interleave with a precomputed `gather_offset_ub` to split merged `(value, idx)` pairs into two contiguous output UBs
  - per-core row slab split by `GetVecNum()` / `GetVecIdx()` with `CeilDiv`
  - single `with auto_sync():` scope covering the full per-row MTE2 -> V -> MTE3 cycle
- do_not_copy_when:
  - your input width is not a power-of-two multiple of 32 (merge-stage radices 32/128/512/2048 are hard-coded)
  - you need stable sort semantics beyond `torch.sort` reference matching
  - the problem is cube-bound or mixes with matmul stages
agent/example/kernels/a2/attn_backward_dense_stage1_tail_dbuf.py
- formula:
  - `qk = q.float() @ k.float().t()`
  - `dp = grad.float() @ v.float().t()`
- topology: `cube-only`
- study_for:
  - tail-safe stage-1 dense backward on a2 while keeping the stage split at `qk/dp`
  - using `DBuff` staging together with tail-time `set_constant_to_l1(...)` on the concrete slot buffer
  - preserving correct NZ/ZZ behavior by letting `matmul(...)` infer layout instead of forcing explicit `m/n/k`
  - using direct `<<=` `L0C -> GM` writeback on tail tiles after the layout path is stabilized
- do_not_copy_when:
  - you already need vec-side `p/dqk` reconstruction
  - you want the final `gq/gk/gv` fused kernel rather than the stage-1 cube slice
agent/example/kernels/a2/attn_backward_dense_stage12_tail.py
- formula:
  - `qk = q.float() @ k.float().t()`
  - `dp = grad.float() @ v.float().t()`
  - `p = exp(qk * scale - qkmax) / qksum`
  - `dqk = p * (dp - sum(o.float() * grad.float(), dim=-1)) * scale`
- topology: `cube -> vec`
- study_for:
  - fusing the dense backward `qk/dp` cube stage directly into the `p/dqk` vec stage on a2 without yet adding the final gradient cube writeback
  - using one `CvMutex`-guarded workspace bridge for both `qk` and `dp` because they share the same stage-1 lifetime
  - keeping the a2 workspace bridge tail-safe by writing and reading full-width workspace tiles, then handling `valid_n` with vec masking and final GM boundaries
  - computing `odo` once per half-row vec tile before the delayed `K/V` loop consumes the previous workspace slot
  - shrinking the vec hot path to `QUAT_M = 32` row chunks so `qk/dp/p/dqk` can move onto `DBuff` lineage without increasing UB usage
  - the follow-on rule for later vec-only extensions such as probability quantization: re-chunk the whole vec hot path so each chunk still owns one complete `MTE2 -> V -> MTE3` story instead of borrowing a live stage buffer as scratch
  - using `bar_all()` around vec-side tail zero-fill that must complete before later `gm_to_ub_pad` loads
- do_not_copy_when:
  - you only need the stage-1 cube slice (`agent/example/kernels/a2/attn_backward_dense_stage1_tail_dbuf.py`)
  - you already need the final `gq/gk/gv` fused kernel rather than the `p/dqk` intermediate
  - you want a minimal aligned-only teaching example instead of the tail-safe a2 workspace bridge pattern
agent/example/kernels/a2/attn_backward_dense_total_tail.py
- formula:
  - `qk = q.float() @ k.float().t()`
  - `dp = grad.float() @ v.float().t()`
  - `p = exp(qk * scale - qkmax) / qksum`
  - `dqk = p * (dp - sum(o.float() * grad.float(), dim=-1)) * scale`
  - `gq = dqk_half.float() @ k.float()`
  - `gk = dqk_half.float().transpose(-1, -2) @ q.float()`
  - `gv = p_half.float().transpose(-1, -2) @ grad.float()`
- topology: `cube -> vec -> cube`
- study_for:
  - tail-safe end-to-end dense attention-backward fusion on a2 with both `S1` and `S2` tails
  - keeping the cube -> vec and vec -> cube GM workspace bridges on full-tile shapes while handling `valid_m/valid_n` only at GM boundaries and vec masks
  - shrinking the stage-1 vec hot path into chunk-local loops so `qk/dp/p/dqk` can move onto `DBuff` lineage without inflating UB usage
  - keeping helper scratch separate from live stage buffers instead of borrowing stage slot families
  - reusing delayed `k_j` on chip for the final `gq += dqk_j @ k_j` matmul instead of reloading `k_j` from GM
  - tile-level `atomic_add()` writeback for `gq/gk/gv` when the fused schedule is split by `Q` tiles first
- deep_note: `agent/references/examples/deep/a2-attn-backward-dense-total-tail.md`
- do_not_copy_when:
  - you want the smallest aligned-only teaching example instead of the fully tail-safe fused version
  - you do not want caller-side zero-initialization before the atomic accumulation phase
agent/example/kernels/a2/attn_backward_dense_total_tail_causal.py
- formula:
  - `qk = q.float() @ k.float().t()`
  - `dp = grad.float() @ v.float().t()`
  - `p = causal_mask(exp(qk * scale - qkmax) / qksum)`
  - `dqk = p * (dp - sum(o.float() * grad.float(), dim=-1)) * scale`
  - `gq = dqk_half.float() @ k.float()`
  - `gk = dqk_half.float().transpose(-1, -2) @ q.float()`
  - `gv = p_half.float().transpose(-1, -2) @ grad.float()`
- topology: `cube -> vec -> cube`
- study_for:
  - tail-safe causal dense attention-backward fusion on a2 with both `S1` and `S2` tails
  - skipping full-future `N` tiles early with `active_tiles_n = Min(tiles_n, CeilDiv(row_in_bh + valid_m, TILE_N))`
  - applying diagonal causal masking in `p`-domain with one packed-`uint8` full-tile `select(...)` over `[HALF_M, TILE_N]`
  - prebuilding one static `[HALF_M, TILE_N // 8]` diagonal mask per subblock for full `128x128` tiles
  - rebuilding the packed diagonal mask only for tail `M` tiles because `half_rows = CeilDiv(valid_m, 2)` changes `row_begin`
  - generating packed mask bytes from a reusable integer column-index tensor instead of per-element mask writes
  - passing the full packed-mask tensor into helpers that internally `reinterpret(...)`, then slicing only at the later `select(...)` site; sliced helper inputs can violate simulator-v2 storage assumptions
  - using non-quantized `16 x 128` stage-1 vec `DBuff` chunks for `qk/dp/p/dqk`, so the causal kernel keeps the newer chunk-local `MTE2 -> V -> MTE3` lineage while staying significantly lighter than the hif8 variant
  - practical UB point for the non-quantized chunked version: about `121.375 KB / 192 KB`
- do_not_copy_when:
  - the caller cannot supply `qkmax/qksum` from the same causal forward contract
  - you want score-domain `-inf` masking before rowmax/running-sum updates rather than `p`-domain zeroing
  - your target kernel does not have a stable full-tile diagonal geometry that benefits from static packed-mask reuse
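The early-skip bound quoted in this entry, `active_tiles_n = Min(tiles_n, CeilDiv(row_in_bh + valid_m, TILE_N))`, can be sanity-checked on the host. The sketch below assumes `row_in_bh` is the first query row of the current `M` tile within its batch-head and that the causal rule is `k_pos <= q_pos`; both readings are assumptions, and the helper names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

def active_tiles_n(row_in_bh: int, valid_m: int, tiles_n: int, tile_n: int) -> int:
    """Number of leading N tiles that hold at least one unmasked column."""
    return min(tiles_n, ceil_div(row_in_bh + valid_m, tile_n))

def active_tiles_n_brute(row_in_bh: int, valid_m: int, tiles_n: int, tile_n: int) -> int:
    """Count tiles whose first column is visible to the last valid query row."""
    last_q = row_in_bh + valid_m - 1
    return sum(1 for t in range(tiles_n) if t * tile_n <= last_q)

first_tile = active_tiles_n(0, 128, 4, 128)      # only the diagonal tile is active
tail_tile = active_tiles_n(128, 100, 4, 128)     # tail M tile on the second tile row
last_row = active_tiles_n(384, 128, 3, 128)      # clamped by tiles_n
```

Every tile at index `active_tiles_n` or beyond lies entirely in the future of the whole `M` tile, so it can be skipped before any mask is built.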
agent/example/kernels/a2/attn_backward_dense_total_tail_causal_hif8.py
- formula:
  - `qk = q.float() @ k.float().t()`
  - `dp = grad.float() @ v.float().t()`
  - `p = causal_mask(exp(qk * scale - qkmax) / qksum)`
  - `p_hif8 = hif8_quantize_positive_finite(p)`
  - `dqk = p * (dp - sum(o.float() * grad.float(), dim=-1)) * scale`
  - `gq = dqk_half.float() @ k.float()`
  - `gk = dqk_half.float().transpose(-1, -2) @ q.float()`
  - `gv = p_hif8.half().float().transpose(-1, -2) @ grad.float()`
- topology: `cube -> vec -> cube`
- study_for:
  - extending the causal dense backward tail kernel with inline hif8 probability quantization while preserving the original causal-mask and delayed stage-3 structure
  - stage-1 vec-side causal `p` reconstruction plus hif8 quantization on chunk-local `16 x 128` `MTE2 -> V -> MTE3` loops, so `qk/dp/p/dqk` stay on stable `DBuff` lineage and `NOT balanced auto_sync events` stays clear
  - implementing the positive-finite `p`-only hif8 path inline: keep `le15/le7/le3` plus `keep_mask`, but skip generic finite/overflow handling because causal probabilities are already finite and non-negative
  - budgeting the extra hif8 scratch explicitly with dedicated `quant_meta` / `quant_scale` / `quant_factor` / `quant_keepflag` / `quant_flag`; this version runs at about `157.875 KB / 192 KB` UB
- do_not_copy_when:
  - you need a plain causal dense backward kernel without probability quantization
  - your probability tensor can contain negative, non-finite, or overflow cases that require the full generic hif8 conversion contract
agent/example/kernels/a2/flash_attn_score.py
- formula: per-block `exp(Q @ K^T / sqrt(D) - row_max)` cast to half
- topology: `cube -> vec` (GM workspace bridge)
- study_for:
  - a2 cube → vec via GM workspace (no `l0c_to_ub`)
  - `CvMutex(FIX → MTE2)` cross-side synchronization
  - `split_workspace` with pingpong double-buffer `[CubeNum, 2, M, N]`
  - sub-block split with `GetSubBlockIdx()` for independent UB
  - `vmax → cmax → brcb → sub` row-max pattern on a2 vec
  - continuous vs sliced vec operation distinction
  - float → half output cast
- do_not_copy_when:
  - target is a5 (use `l0c_to_ub` + `@vf` instead)
  - no vec postprocessing needed
  - the reduction pattern differs from per-row max
agent/example/kernels/a2/flash_attn_score_iter.py
- formula: per-block `exp(Q @ K^T / sqrt(D) - running_row_max)` with cross-tile max accumulation, cast to half
- topology: `cube -> vec` (GM workspace bridge)
- study_for:
  - running state accumulation across inner-loop iterations on a2
  - `dup(neg_large)` initialization for the running-max identity-element pattern (avoids conditional logic while staying hardware-safe)
  - `vmax` on `[M, 1]` scalar format: why it covers all rows while `[M, 8]` does not
  - `dup` placement inside `auto_sync` outer loop (safe, generates extra V→MTE3 event)
  - incremental extension of an existing kernel (diff from `flash_attn_score.py` is 3 lines)
- do_not_copy_when:
  - you need full softmax (this is the unnormalized intermediate, with no sum/divide pass)
  - you need per-tile independent max (use `flash_attn_score.py` instead)
  - target is a5 (use register-level running state instead)
agent/example/kernels/a2/flash_attn_score_pv.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - `m = maximum(m, rowmax(score_j))`
  - `p_j = exp(score_j - m).half()`
  - `pv_j = p_j.float() @ v_j.float()`
- topology: `cube -> vec -> cube` (double GM workspace bridge, one-tile lookahead)
- study_for:
  - a2 delayed-consumer pipeline with `n_loops + 1` warmup/drain schedule
  - reuse of one `L0C` family across two cube stages with one shared `l0c_cnt`
  - a2 `vec -> cube` bridge via `UB -> GM workspace -> L1` when `ub_to_l1_*` is unavailable
  - two-`workspace` design: float score bridge plus half probability bridge
  - preserving per-block running-max semantics while feeding the delayed `p @ v` cube stage
  - flattened output layout `[ (bh * n_tiles + tile_n) * S1 + row, D ]`
- do_not_copy_when:
  - you need normalized online softmax with running sum/divide
  - your target is a5 and direct `UB -> L1` publish is available
  - the second stage does not truly consume the vec result one iteration later
  - your `D` is not fixed/aligned to the validated `128`
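The `n_loops + 1` warmup/drain schedule mentioned in this entry can be pictured with a small scheduling sketch. It only models which tile each stage touches per iteration; it contains no synchronization or data movement, and `lookahead_schedule` is a hypothetical name.

```python
def lookahead_schedule(n_loops: int):
    """Return (iteration, produce_tile, consume_tile) for a one-tile lookahead pipeline."""
    steps = []
    for it in range(n_loops + 1):
        produce = it if it < n_loops else None  # stage 1 is idle in the drain iteration
        consume = it - 1 if it > 0 else None    # stage 2 is idle in the warmup iteration
        steps.append((it, produce, consume))
    return steps

steps = lookahead_schedule(3)
# iteration 0: produce tile 0, consume nothing (warmup)
# iteration 1: produce tile 1, consume tile 0
# iteration 2: produce tile 2, consume tile 1
# iteration 3: produce nothing, consume tile 2 (drain)
```

The delayed stage always consumes the tile produced one iteration earlier, which is why the loop needs one extra trip beyond `n_loops`.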
agent/example/kernels/a2/flash_attn_unnorm.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - `curr_m = maximum(prev_m, rowmax(score_j))`
  - `expdiff_j = exp(prev_m - curr_m)`
  - `p_j = exp(score_j - curr_m).half()`
  - `pv_j = p_j.float() @ v_j.float()`
  - `out = out * expdiff_j + pv_j`
- topology: `cube -> vec -> cube -> vec` (triple GM bridge, one-tile lookahead)
- study_for:
  - a2 streamed unnormalized attention numerator with delayed final vec accumulation
  - reusing one physical `L0C` family across the two cube stages on a2
  - triple ownership edge: `CvMutex -> VcMutex -> CvMutex`
  - keeping running max, delayed `expdiff`, and final `accum` resident in vec UB
  - using one extra GM workspace for delayed `pv_j` because a2 cannot keep the stage-2 output on chip for vec reuse
  - safe copy pattern for `[M,1]` scalar state on a2 (`add(..., zero)` instead of `ub_to_ub`)
- do_not_copy_when:
  - you need normalized online softmax with running sum/final divide
  - your target is a5 and direct on-chip handoff is available
  - your second-stage output does not need to return to vec for delayed accumulation
  - your `D` is not fixed/aligned to the validated `128`
agent/example/kernels/a2/flash_attn_full.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - `curr_m = maximum(prev_m, rowmax(score_j))`
  - `expdiff_j = exp(prev_m - curr_m)`
  - `p_j = exp(score_j - curr_m)`
  - `row_sum = row_sum * expdiff_j + p_j.sum(-1)`
  - `pv_j = p_j.half().float() @ v_j.float()`
  - `out = out * expdiff_j + pv_j`
  - `out = out / row_sum`
- topology: `cube -> vec -> cube -> vec` (triple GM bridge, one-tile lookahead, final vec divide)
- study_for:
  - a2 normalized online flash attention with running `row_max` and running `row_sum`
  - preserving the exact `p_j.half().float()` value-path contract while keeping `row_sum` in float
  - reducing `sum_j` from the float probability tile before the cast
  - final sliced `div` of `[M,128]` accumulators by a narrow `[M,8]` row-sum broadcast
  - reusing the `flash_attn_unnorm.py` delayed numerator pipeline and extending it with full normalization
- do_not_copy_when:
  - you only need the unnormalized numerator (use `flash_attn_unnorm.py`)
  - your target is a5 and direct on-chip handoff is available
  - your contract does not require the exact `p.half().float()` value path
  - your `D` is not fixed/aligned to the validated `128`
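
The formula above can be checked against a direct softmax with a single-row reference written in plain Python. This is a sketch under stated assumptions, not the kernel: `to_half` and `online_attention_row` are hypothetical helpers, the scores are taken as already scaled, the running max starts from an assumed finite `neg_large`, and float16 rounding is emulated with the standard-library `struct` format `e` in place of `.half().float()`.

```python
import math
import struct


def to_half(x):
    """Round a Python float to float16 and back (stand-in for .half().float())."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def online_attention_row(score_tiles, v_tiles, neg_large=-1.0e30):
    """Stream one query row over K/V tiles with running max and running sum."""
    d = len(v_tiles[0][0])
    m = neg_large      # running row max
    row_sum = 0.0      # running row sum, kept in float
    out = [0.0] * d    # unnormalized numerator accumulator
    for scores, v in zip(score_tiles, v_tiles):
        curr_m = max(m, max(scores))
        expdiff = math.exp(m - curr_m)
        p = [math.exp(s - curr_m) for s in scores]
        row_sum = row_sum * expdiff + sum(p)   # summed before the half cast
        p_half = [to_half(x) for x in p]       # only the value path is rounded
        pv = [sum(ph * row[k] for ph, row in zip(p_half, v)) for k in range(d)]
        out = [o * expdiff + x for o, x in zip(out, pv)]
        m = curr_m
    return [o / row_sum for o in out], m, row_sum


# Two tiles of three keys each, with value dimension 2.
score_tiles = [[0.1, 1.2, -0.7], [2.5, 0.3, 1.9]]
v_tiles = [
    [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]],
    [[0.25, -0.5], [3.0, 1.0], [0.0, 4.0]],
]
out, row_max, row_sum = online_attention_row(score_tiles, v_tiles)
```

In this sketch `row_sum` matches the float softmax denominator up to floating-point error, while `out` differs from the float result only by the float16 rounding applied to the probabilities on the value path.
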
agent/example/kernels/a2/flash_attn_full_pj_hif8.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - `curr_m = maximum(prev_m, rowmax(score_j))`
  - `expdiff_j = exp(prev_m - curr_m)`
  - `p_j = exp(score_j - curr_m)`
  - `row_sum = row_sum * expdiff_j + p_j.sum(-1)`
  - `p_q = to_hif8_torch(p_j * 128.0) / 128.0`
  - `pv_j = p_q.half().float() @ v_j.float()`
  - `out = out * expdiff_j + pv_j`
  - `out = out / row_sum`
  - returns final `out`, `rowmax`, and `rowsum`
- topology: `cube -> vec -> cube -> vec` (same triple bridge, scaled hif8 simulation in the stage-1 vec path)
- study_for:
  - the contract-first baseline for this scaled hif8 probability path, with separate vec scratch for stage-1 score and stage-2 `pv`
  - preserving float `row_sum` while swapping the value path from `p.half().float()` to `to_hif8_torch(p * 128) / 128`
  - exporting final `rowmax` / `rowsum` through extra GM outputs without changing the delayed `p @ v` pipeline
  - extending the same kernel family to non-aligned `S2` and `S1` without giving up the triple-bridge contract
- deep_note: `agent/references/examples/deep/a2-flash-attn-full-pj-hif8.md`
- do_not_copy_when:
  - your contract still wants the unscaled `p.half().float()` path (use `flash_attn_full.py`)
  - you need a generic float-domain hif8 kernel instead of the non-negative probability specialization
  - your `D` is not fixed/aligned to the validated `128`
agent/example/kernels/a2/flash_attn_full_pj_hif8_causal.py
- formula:
  - same math and outputs as `flash_attn_full_pj_hif8.py`
  - score tiles additionally obey left-up causal masking `k_pos <= q_pos`
  - returns final `out`, `rowmax`, and `rowsum`
- topology: `cube -> vec -> cube -> vec` (same triple bridge and hif8 probability path, plus shared vec-side slot buffer, diagonal-tile rowwise causal masking, and future-tile skip)
- study_for:
  - the causal extension of the scaled-hif8 online-softmax kernel after moving vec scratch onto the shared `DBuff` lineage used to improve the `MTE2 -> V` `ubin` queueing story
  - treating causal as a score-domain fix before `cmax` / `rowmax`, not as a later `p`-domain repair
  - recognizing that only the diagonal `nt == lmt` tile needs mixed causal invalidation, while future fully-invalid tiles can be skipped with `active_tiles_n = Min(tiles_n, lmt + 1)`
  - prebuilding reusable left/right packed-bit causal masks once per subblock, then reusing them on every diagonal-tile visit
  - generating those packed mask bytes with `compare_scalar(...)` over a reusable `[0..63]` integer column-index row instead of filling mask bytes one by one
  - reducing control overhead by populating the column-index tensor through an `int64` reinterpret so each write covers two `int32` entries
  - using a Python-unrolled row loop only for the row-dependent causal threshold, while the final score invalidation itself is done by packed `select(...)`
  - combining diagonal causal masking with ordinary final-tile `valid_n` tail masking by applying causal first and tail second
  - reusing one shared `ub_score_pv + score_pv_cnt` family for stage-1 score tiles and delayed stage-2 `pv` tiles while still keeping `stage1_cnt` and `stage2_cnt` separate
  - validating the same kernel family across `S1 == S2`, `S1 < S2`, `S1 > S2`, and multi-head shapes
- do_not_copy_when:
  - your contract is non-causal (use `flash_attn_full_pj_hif8.py`)
  - you want the same shared vec scratch lineage without causal masking noise (use `flash_attn_full_pj_hif8_commonub.py`)
  - your causal layout is not the left-up `k_pos <= q_pos` contract validated here
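
The future-tile skip and the packed-bit mask idea can be illustrated on the host side. The sketch below assumes square `128x128` score tiles (so the diagonal n-tile index `lmt` equals the m-tile index) and an LSB-first bit order inside each mask byte; both are assumptions, and `packed_row_mask` and `tile_is_fully_masked` are hypothetical helpers, not the kernel's `compare_scalar(...)` or `select(...)` calls.

```python
TILE = 128  # assumed square score tile, so lmt equals the m-tile index


def active_tiles_n(tiles_n, lmt):
    """Number of n-tiles to visit; tiles beyond the diagonal tile lmt are skipped."""
    return min(tiles_n, lmt + 1)


def packed_row_mask(q_col, width=64):
    """Pack `col <= q_col` over a reusable [0..width-1] index row into bytes."""
    col_index = range(width)  # the reusable column-index row
    bits = [1 if c <= q_col else 0 for c in col_index]
    # LSB-first packing inside each byte is an assumption of this sketch.
    return bytes(
        sum(bit << k for k, bit in enumerate(bits[b:b + 8]))
        for b in range(0, width, 8)
    )


def tile_is_fully_masked(mt, nt):
    """True when every (q_pos, k_pos) in tile (mt, nt) violates k_pos <= q_pos."""
    smallest_k = nt * TILE
    largest_q = mt * TILE + TILE - 1
    return smallest_k > largest_q
```

With these assumptions, every tile with `nt > lmt` is fully masked and can be skipped, every tile with `nt < lmt` is fully valid, and only the diagonal tile needs a row-dependent mask. For a diagonal-tile row `r`, the left 64-column half would use threshold `r` and the right half threshold `r - 64`.
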
agent/example/kernels/a2/flash_attn_full_pj_half_block32_causal.py
- formula:
  - `score_j = q.float() @ k_j.float().t() * scale`
  - `curr_m = maximum(prev_m, rowmax(score_j))`
  - `expdiff_j = exp(prev_m - curr_m)`
  - `p_j = exp(score_j - curr_m)`
  - `row_sum = row_sum * expdiff_j + p_j.sum(-1)`
  - `pv_j = p_j.half().float() @ v_j.float()`
  - score tiles additionally obey blockwise causal masking `floor(k_pos / 32) <= floor(q_pos / 32)`
  - `out = out * expdiff_j + pv_j`
  - `out = out / row_sum`
  - returns final `out`, `rowmax`, and `rowsum`
- topology: `cube -> vec -> cube -> vec` (same triple bridge, half probability value path, plus shared vec-side slot buffer, block-32 diagonal-tile causal masking, and future-tile skip)
- study_for:
  - the contract-first half-probability causal variant that keeps `row_sum` in float while rounding only the delayed `p @ v` value path
  - treating blockwise causal as a score-domain fix before `cmax` / `rowmax`, not as a later `p`-domain repair
  - recognizing that future `128x128` score tiles remain fully invalid under the `32x32` block-causal rule, so `active_tiles_n = Min(tiles_n, lmt + 1)` still applies
  - prebuilding reusable left/right packed-bit masks for the diagonal tile once per subblock, with row-dependent `32/64` valid-column thresholds inside each `64`-column half
  - reusing one shared `ub_score_pv + score_pv_cnt` family for stage-1 score tiles and delayed stage-2 `pv` tiles so the vec `ubin` edge follows the same slot-buffer lineage as `flash_attn_full_pj_hif8_commonub.py`
  - validating block-boundary behavior around `31/32/33`, `127/128/129`, and non-square `S1/S2` shapes without reintroducing hif8 quantization helpers
- do_not_copy_when:
  - your contract is non-causal (use `flash_attn_full.py` or another non-causal variant)
  - your probability path must simulate scaled hif8 values (use `flash_attn_full_pj_hif8.py` or `flash_attn_full_pj_hif8_causal.py`)
  - your causal layout is not the blockwise `floor(k_pos / 32) <= floor(q_pos / 32)` contract validated here
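
The row-dependent thresholds of the block-32 rule follow directly from the masking formula. The sketch below derives them for the diagonal `128x128` tile, assuming tile-local row and column indices; `valid_cols_in_half` and `block_causal_ok` are hypothetical helper names used only for this illustration.

```python
BLOCK = 32  # causal block size from the contract
HALF = 64   # width of one packed-mask half of the 128-column tile


def block_causal_ok(q_pos, k_pos):
    """The blockwise contract: floor(k_pos / 32) <= floor(q_pos / 32)."""
    return k_pos // BLOCK <= q_pos // BLOCK


def valid_cols_in_half(row, half):
    """Valid-column count for one diagonal-tile row inside 64-column half 0 or 1."""
    limit = BLOCK * (row // BLOCK + 1)  # tile-local columns [0, limit) are valid
    return max(0, min(HALF, limit - HALF * half))


# Threshold table for the diagonal tile: one (left, right) pair per 32-row block.
thresholds = [
    (valid_cols_in_half(r, 0), valid_cols_in_half(r, 1)) for r in range(0, 128, BLOCK)
]
```

Under these assumptions each half sees a prefix of `32` or `64` valid columns, and the right half is fully invalid (`0`) for the first 64 rows, so only a few distinct packed masks are needed per half.
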
agent/example/kernels/a2/flash_attn_full_pj_hif8_commonub.py
- formula:
  - same math and outputs as `flash_attn_full_pj_hif8.py`
  - returns final `out`, `rowmax`, and `rowsum`
- topology: `cube -> vec -> cube -> vec` (same triple bridge and delayed `p @ v` contract, but with a shared vec-side slot buffer for stage-1 score tiles and stage-2 `pv` tiles)
- study_for:
  - comparing against `flash_attn_full_pj_hif8.py` to see what changes when vec scratch moves from two plain `Tensor` views to one shared `DBuff`
  - introducing a dedicated scratch-family counter for shared local storage while still keeping `stage1_cnt` and `stage2_cnt` separate
  - improving same-side vec preload / compute overlap without changing the cross-side mutex ownership model
  - studying the queueing win from `ub_score_pv + score_pv_cnt`, not a different math contract
- deep_note: `agent/references/examples/deep/a2-flash-attn-full-pj-hif8-commonub.md`
- do_not_copy_when:
  - you are still deriving the math contract and want the simplest readable version first (start from `flash_attn_full_pj_hif8.py`)
  - you are debugging row-max / row-sum correctness and do not want shared vec scratch lineage in the picture yet
  - your goal is only UB-capacity reduction; this version keeps the same total UB footprint and mainly improves queueing structure

- simplest cube -> vec baseline -> `agent/example/kernels/a5/basic_cube_vec_mix.py`
- float -> half vec postprocess -> `agent/example/kernels/a5/matmul_half_splitn_bias10p2_vf.py`
- rowwise normalize -> `agent/example/kernels/a5/matmul_rowwise_norm.py`
- rowwise L2 normalize -> `agent/example/kernels/a5/matmul_rowwise_l2_norm.py`
- blockwise quantization -> `agent/example/kernels/a5/matmul_kmkn_blockwise_quant128.py`
- vec preprocess before cube -> `agent/example/kernels/a5/vec_cube_abs_sqrt_matmul.py`
- recurrent WU dual-output preprocess -> `agent/example/kernels/a5/recompute_wu_cube_vec.py`
- fused vec -> cube -> vec -> `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py`
- delayed lookahead mixed pipeline -> `agent/example/kernels/a5/test_mla_entire.py`
- a5 multi-row causal full attention with fp8 `p_j` bridge -> `agent/example/kernels/a5/flash_attn_full_fp8_causal.py`