
a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

cannbot-skills: CANNBot is a family of agents that improve development efficiency for CANN; this repository provides its reusable Skills modules. Project: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (`easyasc.a2`, device `b3`) kernel with:

one cube stage that produces a score tile
vec logic that updates running row state and emits a delayed cube input
a later cube stage that consumes that delayed tile
a final vec stage that accumulates the delayed cube output

Typical target formula:

score_j = q.float() @ k_j.float().t() * scale
curr_m = maximum(prev_m, rowmax(score_j))
expdiff_j = exp(prev_m - curr_m)
p_j = exp(score_j - curr_m).half()
pv_j = p_j.float() @ v_j.float()
out = out * expdiff_j + pv_j

This is not normalized online softmax. It keeps a running max and a rescaled numerator only. There is no running sum or final divide. If you need a running `row_sum` and a final `out / row_sum`, switch to `agent/references/patterns/a2-cube-vec-cube-vec-softmax.md`.
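To make the recurrence concrete, here is a minimal host-side sketch in plain Python (no easyasc calls). It assumes `prev_m` starts at negative infinity and `out` starts at zero, which the formula above does not state, and it skips the `.half()` rounding of `p_j`. The helper names `matmul`, `transpose`, and `unnormalized_attention` are illustrative only.

```python
import math

def matmul(a, b):
    """Plain-list matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def unnormalized_attention(q, k_tiles, v_tiles, scale):
    """Running-max recurrence from the target formula, one k/v tile at a time."""
    rows, d = len(q), len(v_tiles[0][0])
    m = [-math.inf] * rows                    # assumed initial running row max
    out = [[0.0] * d for _ in range(rows)]    # assumed initial numerator
    for k_j, v_j in zip(k_tiles, v_tiles):
        score = [[s * scale for s in row] for row in matmul(q, transpose(k_j))]
        curr_m = [max(m[r], max(score[r])) for r in range(rows)]
        expdiff = [math.exp(m[r] - curr_m[r]) for r in range(rows)]
        p = [[math.exp(s - curr_m[r]) for s in score[r]] for r in range(rows)]
        pv = matmul(p, v_j)
        out = [[out[r][c] * expdiff[r] + pv[r][c] for c in range(d)]
               for r in range(rows)]
        m = curr_m
    return out, m
```

Calling it with all tiles concatenated into a single tile should give the same `out` and row max as the tiled call, which is a convenient host-side check that no row sum or divide is involved.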

Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

cube -> vec cannot use `l0c_to_ub`
vec -> cube cannot use `ub_to_l1_*`
the delayed cube output must return to vec for the final accumulation

So the stable data path is:

GM(q,k,v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score) -> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

Workspaces and ownership edges

Use three GM workspaces:

`score_ws`
- dtype: `float`
- shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
- purpose: `L0C(score)` -> `UB(score)`
`p_ws`
- dtype: `half`
- shape: `[GetCubeNum(), 2, TILE_M, TILE_N]`
- purpose: `UB(p_j)` -> `L1(p_j)`
`pv_ws`
- dtype: `float`
- shape: `[GetCubeNum(), 2, TILE_M, D]`
- purpose: `L0C(pv_j)` -> `UB(pv_j)`
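As a quick sizing aid, the three shapes above can be turned into byte counts on the host. This is a sketch under assumptions: `float` is taken to be 4 bytes and `half` 2 bytes, `cube_num` stands in for whatever `GetCubeNum()` returns, and `workspace_specs` / `workspace_bytes` are hypothetical helper names, not part of easyasc.

```python
from math import prod

# Assumed element sizes: float = fp32 (4 bytes), half = fp16 (2 bytes).
DTYPE_BYTES = {"float": 4, "half": 2}

def workspace_specs(cube_num, tile_m, tile_n, d):
    """Return {name: (dtype, shape)} for the three GM workspaces."""
    return {
        "score_ws": ("float", (cube_num, 2, tile_m, tile_n)),
        "p_ws": ("half", (cube_num, 2, tile_m, tile_n)),
        "pv_ws": ("float", (cube_num, 2, tile_m, d)),
    }

def workspace_bytes(specs):
    """Total bytes each workspace needs in GM."""
    return {name: prod(shape) * DTYPE_BYTES[dtype]
            for name, (dtype, shape) in specs.items()}

if __name__ == "__main__":
    for name, size in workspace_bytes(workspace_specs(2, 128, 128, 128)).items():
        print(name, size)
```

The second dimension of 2 in every shape is what gives each cube core two slots per workspace, matching the one-tile lookahead described later.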

Ownership edges:

stage 1 cube -> vec: `CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`
stage 1 vec -> stage 2 cube: `VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)`
stage 2 cube -> stage 3 vec: `CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)`

Stable schedule

Use one-tile lookahead:

for ni in range(0, tiles_n + 1):
    if ni < tiles_n:
        # stage 1: produce tile j = ni
        ...
    if ni > 0:
        # stage 2 + stage 3: consume tile j = ni - 1
        ...

This gives:

warmup: first iteration only produces
steady state: produce `j` while consuming `j - 1`
drain: final iteration only consumes the last delayed tile
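The warmup, steady-state, and drain phases can be checked with a small host-side trace of the same loop. This is only an illustration of the iteration order; `lookahead_schedule` is a hypothetical helper and does no device work.

```python
def lookahead_schedule(tiles_n):
    """Record (action, tile) pairs in the order the one-tile lookahead loop issues them."""
    events = []
    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            events.append(("produce", ni))      # stage 1 for tile j = ni
        if ni > 0:
            events.append(("consume", ni - 1))  # stage 2 + stage 3 for tile j = ni - 1
    return events

if __name__ == "__main__":
    for event in lookahead_schedule(3):
        print(event)
```

For `tiles_n = 3` the trace starts with a lone produce of tile 0, alternates produce/consume in the middle, and ends with a lone consume of tile 2.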

Shared `L0C` rule

Reuse one physical `L0C` family across the two cube stages.

Why this is the stable a2 choice here:

stage 1 writes a full float `[TILE_M, TILE_N]` score tile
stage 2 writes a full float `[TILE_M, D]` `pv_j` tile with the same validated `D == 128`
a2 only has 128 KB of `L0C`, so a second full float family would be a misleading design target

Stable ownership story:

keep one `l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)`
let stage 1 publish `score_ws` before stage 2 reuses that slot
let stage 2 publish `pv_ws` before the next stage-1 reuse
advance one shared `l0c_cnt`

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical `L0C` family is shared. Other stage-owned lifetimes stay separate.
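The capacity argument is simple arithmetic. The sketch below assumes `TILE_M == TILE_N == 128` (inferred from the `[64, 1]` half-tile views and the validated `D == 128`, not stated outright), that `float` is 4 bytes, and that a `DBuff` holds two slots; `dbuff_bytes` is a hypothetical helper.

```python
L0C_BYTES = 128 * 1024   # a2 L0C capacity quoted in this document
FLOAT_BYTES = 4          # assumption: DT.float is fp32

def dbuff_bytes(rows, cols, elem_bytes, slots=2):
    """Bytes used by a double buffer of [rows, cols] tiles (slots=2 is assumed)."""
    return slots * rows * cols * elem_bytes

one_family = dbuff_bytes(128, 128, FLOAT_BYTES)   # shared score / pv family
two_families = 2 * one_family                     # separate family per cube stage

if __name__ == "__main__":
    print(one_family, one_family <= L0C_BYTES)
    print(two_families, two_families <= L0C_BYTES)
```

Under these assumptions one family already fills `L0C` exactly, so a second one cannot fit.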

Counter layout

Keep these lifetimes separate:

`l1qk_cnt`: stage-1 `q`/`k` loads
`l1pv_cnt`: stage-2 `p`/`v` loads
`l0c_cnt`: shared physical `L0C` family across the two cube stages
`stage1_cnt`: delayed slot rhythm for `score_ws`, `p_ws`, and `expdiff`
`stage2_cnt`: delayed slot rhythm for `p_ws` consumption and `pv_ws`

Do not hide the delayed accumulator lifetime behind `stage1_cnt`.

Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

running row max: `[HALF_M, 1]`
delayed `expdiff` slots: `DBuff(DT.float, [HALF_M, 1], Position.UB)`
final numerator accumulation: `[HALF_M, D]`

Use `GetSubBlockIdx()` so each vec lane owns only its own `HALF_M` rows.

Critical scalar-state rule on a2

Do not copy `[HALF_M, 1]` scalar-format state with `ub_to_ub`.

Reason:

`ub_to_ub` infers burst length in units of `C0` blocks
for `[64, 1]` float views, that means copying 8 elements per row
this silently miscopies row-scalar state such as `prev_m`
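The "8 elements per row" figure follows from block granularity. The sketch below assumes a `C0` block is 32 bytes, which is consistent with 8 floats per block but is not stated in this document; it models only the element count a block-granular copy would touch, not the actual `ub_to_ub` behavior, and both helper names are hypothetical.

```python
BLOCK_BYTES = 32  # assumption: one C0 block is 32 bytes

def elems_per_block(dtype_bytes):
    """Elements of the given size that fit in one block."""
    return BLOCK_BYTES // dtype_bytes

def block_granular_elems(rows, cols, dtype_bytes):
    """Elements touched when every row is rounded up to whole blocks."""
    per_block = elems_per_block(dtype_bytes)
    blocks_per_row = -(-cols // per_block)   # ceiling division
    return rows * blocks_per_row * per_block

if __name__ == "__main__":
    intended = 64 * 1
    touched = block_granular_elems(64, 1, 4)
    print(intended, touched)
```

For a `[64, 1]` float view the intended copy is 64 elements, while the block-granular count is 8 times larger, which is why the row-scalar state ends up wrong.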

Stable fix:

keep scalar state in `[HALF_M, 1]`
copy it with a vec binary op that respects the `[M, 1]` stride model, for example:

dup(ub_zero_s, 0.0)
add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)

Then update or transform that copied buffer with more vec ops.

Delayed `expdiff` handling

`expdiff_j` belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

stage 1 copies `prev_m` into the delayed `expdiff` slot
stage 1 updates running max
stage 1 overwrites the delayed slot with `exp(prev_m - curr_m)`
stage 3 later reads that same slot and broadcasts it before scaling `accum`

Use `stage1_cnt` parity for the write slot and `stage2_cnt` parity for the read slot.
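A host-side trace can confirm that the two parities line up under the one-tile lookahead. This sketch stores the tile index in place of the real `expdiff_j` values and assumes each counter is incremented once per stage pass; `simulate_expdiff_slots` is a hypothetical helper.

```python
def simulate_expdiff_slots(tiles_n):
    """Return the tile index stage 3 finds in its read slot, one entry per consumed tile."""
    slots = [None, None]   # two delayed expdiff slots
    stage1_cnt = 0
    stage2_cnt = 0
    reads = []
    for ni in range(tiles_n + 1):
        if ni < tiles_n:
            slots[stage1_cnt % 2] = ni           # stage 1 writes expdiff for tile ni
            stage1_cnt += 1
        if ni > 0:
            reads.append(slots[stage2_cnt % 2])  # stage 3 reads the delayed slot
            stage2_cnt += 1
    return reads

if __name__ == "__main__":
    print(simulate_expdiff_slots(4))
```

Each consumed tile `j` should find its own index in the slot it reads, including in the drain iteration where stage 1 no longer runs.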

Final vec accumulation

After loading `pv_j` back into UB:

`brcb` the delayed `expdiff` slot to `[HALF_M, 8]`
scale `accum[:, 0:64]`
scale `accum[:, 64:128]`
`add(accum, accum, pv_j)`

Why sliced scaling is required:

`accum` is wide (`[HALF_M, 128]`)
`expdiff` broadcast is narrow (`[HALF_M, 8]`)
follow the same sliced-row rule used for row-max subtraction
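The four steps above can be modeled on the host with plain lists. The functions below are stand-ins that only mimic the shapes involved (`[HALF_M, 1]` -> `[HALF_M, 8]` broadcast, two 64-column slices); they are not the easyasc `brcb` or vec multiply, and the names `brcb_model`, `scale_slice`, and `accumulate` are hypothetical.

```python
def brcb_model(col, width=8):
    """Broadcast each row scalar across `width` lanes: [M, 1] -> [M, width]."""
    return [[row[0]] * width for row in col]

def scale_slice(accum, bcast, start, stop):
    """Scale accum[:, start:stop] in place, reusing the narrow per-row broadcast."""
    width = len(bcast[0])
    for r, row in enumerate(accum):
        for c in range(start, stop):
            row[c] *= bcast[r][(c - start) % width]

def accumulate(accum, expdiff, pv):
    """accum = accum * expdiff + pv, with the scaling done in two 64-column slices."""
    bcast = brcb_model(expdiff)
    scale_slice(accum, bcast, 0, 64)
    scale_slice(accum, bcast, 64, 128)
    for r, row in enumerate(accum):
        for c in range(len(row)):
            row[c] += pv[r][c]
    return accum
```

The result should match the unsliced `out * expdiff_j + pv_j` from the target formula; the slicing only reflects that the broadcast tile is narrower than `accum`.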

Validation target

Keep the first validated contract narrow:

D == 128
S1 % 128 == 0
S2 % 128 == 0
input `q`/`k`/`v` are `float16`
output is `float32`

Suggested cases:

(1, 1, 256, 512, 128)
(1, 3, 256, 512, 128)
(1, 3, 2048, 4096, 128)
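A small guard can keep test inputs inside this contract. The sketch assumes each suggested case is ordered as `(B, N, S1, S2, D)`, which the list above does not spell out; `check_contract` and its string dtype arguments are hypothetical.

```python
def check_contract(case, q_dtype="float16", out_dtype="float32"):
    """Return a list of contract violations; an empty list means the case is in scope."""
    b, n, s1, s2, d = case   # assumed order: (B, N, S1, S2, D)
    problems = []
    if d != 128:
        problems.append("D must be 128")
    if s1 % 128 != 0:
        problems.append("S1 must be a multiple of 128")
    if s2 % 128 != 0:
        problems.append("S2 must be a multiple of 128")
    if q_dtype != "float16":
        problems.append("q/k/v must be float16")
    if out_dtype != "float32":
        problems.append("output must be float32")
    return problems

SUGGESTED_CASES = [
    (1, 1, 256, 512, 128),
    (1, 3, 256, 512, 128),
    (1, 3, 2048, 4096, 128),
]

if __name__ == "__main__":
    for case in SUGGESTED_CASES:
        print(case, check_contract(case))
```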

Files to study

agent/example/kernels/a2/flash_attn_score_iter.py
agent/example/kernels/a2/flash_attn_score_pv.py
agent/example/kernels/a2/flash_attn_unnorm.py
agent/references/patterns/a2-cube-vec.md
agent/references/patterns/a2-cube-vec-cube.md
agent/references/constraints/a2-device.md
agent/references/constraints/vec-reduction-a2.md
agent/references/constraints/vec-stride.md

