news 2026/6/3 4:35:11

CANN Skills Library: a2 Pattern Document


张小明

Front-end Development Engineer


a2 Cube-to-Vec-to-Cube-to-Vec Pattern (Triple Bridge, Delayed Numerator Accumulation)

[Free download link] cannbot-skills: CANNBot is a series of agents for improving CANN development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

Read this file when writing an a2 (easyasc.a2, device b3) kernel with:

  • one cube stage that produces a score tile
  • vec logic that updates running row state and emits a delayed cube input
  • a later cube stage that consumes that delayed tile
  • a final vec stage that accumulates the delayed cube output

Typical target formula:

  • score_j = q.float() @ k_j.float().t() * scale
  • curr_m = maximum(prev_m, rowmax(score_j))
  • expdiff_j = exp(prev_m - curr_m)
  • p_j = exp(score_j - curr_m).half()
  • pv_j = p_j.float() @ v_j.float()
  • out = out * expdiff_j + pv_j

This is not normalized online softmax. It keeps a running max and a rescaled numerator only. There is no running sum or final divide. If you need a running row_sum and a final out / row_sum, switch to agent/references/patterns/a2-cube-vec-cube-vec-softmax.md.
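The target formula above can be checked on the host with a small NumPy reference. This is a minimal sketch, not the kernel itself: the function name `unnormalized_attention_reference`, the `tile_n` parameter, and the choice of `-inf` as the initial running max are assumptions made for illustration.

```python
import numpy as np

def unnormalized_attention_reference(q, k, v, scale, tile_n=128):
    """Tile-by-tile reference for the delayed numerator accumulation.

    q: [S1, D], k: [S2, D], v: [S2, D]. Returns (out, running_max).
    The initial running max of -inf is an assumption of this sketch.
    """
    s1, d = q.shape
    qf = q.astype(np.float32)
    running_max = np.full((s1, 1), -np.inf, dtype=np.float32)
    out = np.zeros((s1, d), dtype=np.float32)
    for start in range(0, k.shape[0], tile_n):
        k_j = k[start:start + tile_n].astype(np.float32)
        v_j = v[start:start + tile_n].astype(np.float32)
        # score_j = q.float() @ k_j.float().t() * scale
        score_j = (qf @ k_j.T) * np.float32(scale)
        # curr_m = maximum(prev_m, rowmax(score_j))
        curr_m = np.maximum(running_max, score_j.max(axis=1, keepdims=True))
        # expdiff_j = exp(prev_m - curr_m); this is 0 on the first tile
        expdiff_j = np.exp(running_max - curr_m)
        # p_j = exp(score_j - curr_m).half()
        p_j = np.exp(score_j - curr_m).astype(np.float16)
        # pv_j = p_j.float() @ v_j.float()
        pv_j = p_j.astype(np.float32) @ v_j
        # out = out * expdiff_j + pv_j (no row sum, no final divide)
        out = out * expdiff_j + pv_j
        running_max = curr_m
    return out, running_max
```

Up to the half-precision rounding of p_j, the result should match the one-shot numerator exp(score - rowmax(score)) @ v, which makes it a convenient golden reference for the narrow validation contract.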

Why this needs its own a2 pattern

This topology combines all a2 bridge constraints in one kernel:

  • cube -> vec cannot use l0c_to_ub
  • vec -> cube cannot use ub_to_l1_*
  • the delayed cube output must return to vec for the final accumulation

So the stable data path is:

GM(q, k, v) -> L1 -> L0 -> L0C(score) -> GM(score_ws) -> UB(score) -> GM(p_ws) -> L1 -> L0 -> L0C(pv) -> GM(pv_ws) -> UB(pv) -> UB(accum) -> GM(out)

Use explicit workspaces instead of pretending this can stay on chip end-to-end.

Workspaces and ownership edges

Use three GM workspaces:

  1. score_ws

    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: L0C(score) -> UB(score)
  2. p_ws

    • dtype: half
    • shape: [GetCubeNum(), 2, TILE_M, TILE_N]
    • purpose: UB(p_j) -> L1(p_j)
  3. pv_ws

    • dtype: float
    • shape: [GetCubeNum(), 2, TILE_M, D]
    • purpose: L0C(pv_j) -> UB(pv_j)
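To size the three workspaces on the host, the shapes above translate directly into byte counts. The sketch below is illustrative only: `workspace_bytes` is a hypothetical helper, `cube_num` stands in for whatever GetCubeNum() returns on the target, and the defaults TILE_M = TILE_N = D = 128 are assumptions taken from the validated contract rather than stated tile sizes.

```python
# Bytes per element for the two dtypes used by the workspaces.
ITEMSIZE = {"float": 4, "half": 2}

def workspace_bytes(cube_num, tile_m=128, tile_n=128, d=128):
    """Return the GM footprint in bytes of score_ws, p_ws and pv_ws."""
    specs = {
        "score_ws": ("float", (cube_num, 2, tile_m, tile_n)),
        "p_ws": ("half", (cube_num, 2, tile_m, tile_n)),
        "pv_ws": ("float", (cube_num, 2, tile_m, d)),
    }
    sizes = {}
    for name, (dtype, shape) in specs.items():
        count = 1
        for dim in shape:
            count *= dim
        sizes[name] = count * ITEMSIZE[dtype]
    return sizes

if __name__ == "__main__":
    # One cube core: 128 KiB, 64 KiB and 128 KiB respectively.
    print(workspace_bytes(cube_num=1))
```

The dimension of size 2 appears to be the two delayed slots per cube core, so every workspace scales linearly with the core count.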

Ownership edges:

  • stage 1 cube -> vec: CvMutex(0, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)
  • stage 1 vec -> stage 2 cube: VcMutex(1, src_end_pipe=Pipe.MTE3, dst_end_pipe=Pipe.FIX)
  • stage 2 cube -> stage 3 vec: CvMutex(2, src_end_pipe=Pipe.FIX, dst_end_pipe=Pipe.MTE2)

Stable schedule

Use one-tile lookahead:

    for ni in range(0, tiles_n + 1):
        if ni < tiles_n:
            # stage 1: produce tile j = ni
            ...
        if ni > 0:
            # stage 2 + stage 3: consume tile j = ni - 1
            ...

This gives:

  • warmup: first iteration only produces
  • steady state: produce j while consuming j - 1
  • drain: final iteration only consumes the last delayed tile
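The warmup, steady-state, and drain phases fall straight out of the loop bounds. A minimal host-side trace, with `lookahead_schedule` as a hypothetical helper name, makes that visible:

```python
def lookahead_schedule(tiles_n):
    """Return, per loop iteration, the (action, tile) pairs that run."""
    steps = []
    for ni in range(0, tiles_n + 1):
        step = []
        if ni < tiles_n:
            step.append(("produce", ni))      # stage 1 for tile j = ni
        if ni > 0:
            step.append(("consume", ni - 1))  # stage 2 + 3 for tile j = ni - 1
        steps.append(step)
    return steps

if __name__ == "__main__":
    for ni, step in enumerate(lookahead_schedule(3)):
        print(ni, step)
```

For tiles_n = 3 the first iteration only produces tile 0, the middle iterations produce one tile while consuming the previous one, and the last iteration only consumes tile 2.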

Shared L0C rule

Reuse one physical L0C family across the two cube stages.

Why this is the stable a2 choice here:

  • stage 1 writes a full float [TILE_M, TILE_N] score tile
  • stage 2 writes a full float [TILE_M, D] pv_j tile with the same validated D == 128
  • a2 only has 128 KB of L0C, so a second full float family would be a misleading design target

Stable ownership story:

  • keep one l0c = DBuff(DT.float, [TILE_M, TILE_N], Position.L0C)
  • let stage 1 publish score_ws before stage 2 reuses that slot
  • let stage 2 publish pv_ws before the next stage-1 reuse
  • advance one shared l0c_cnt

This is a capacity-driven exception, not a general license to merge unrelated counters. Only the physical L0C family is shared. Other stage-owned lifetimes stay separate.
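The capacity argument is simple arithmetic. The sketch below assumes TILE_M = TILE_N = 128 and that a DBuff holds two slots; neither value is spelled out in this note, so treat both as assumptions. The helper names are hypothetical.

```python
L0C_BYTES = 128 * 1024  # a2 L0C capacity quoted in this note

def dbuff_bytes(rows, cols, itemsize=4, slots=2):
    """Bytes used by a double-buffered tile family (two slots assumed)."""
    return rows * cols * itemsize * slots

def families_that_fit(rows, cols, itemsize=4, capacity=L0C_BYTES):
    """How many such families fit into L0C at once."""
    return capacity // dbuff_bytes(rows, cols, itemsize)

if __name__ == "__main__":
    one_family = dbuff_bytes(128, 128)
    print(one_family, families_that_fit(128, 128))
```

Under these assumptions one float family already occupies the full 128 KB, which is why both cube stages have to share it.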

Counter layout

Keep these lifetimes separate:

  • l1qk_cnt: stage-1 q/k loads
  • l1pv_cnt: stage-2 p/v loads
  • l0c_cnt: shared physical L0C family across the two cube stages
  • stage1_cnt: delayed slot rhythm for score_ws, p_ws, and expdiff
  • stage2_cnt: delayed slot rhythm for p_ws consumption and pv_ws

Do not hide the delayed accumulator lifetime behind stage1_cnt.
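A host-side trace shows why l0c_cnt advances on its own rhythm while stage1_cnt and stage2_cnt each follow their own stage. This is a sketch under assumptions: each counter is incremented once per use, the slot is chosen by counter parity, and stage 1 runs before stage 2 inside an iteration. The L1 counters are left out, and `trace_counters` is a hypothetical name.

```python
def trace_counters(tiles_n):
    """Trace l0c_cnt, stage1_cnt and stage2_cnt over the lookahead loop."""
    cnt = {"l0c_cnt": 0, "stage1_cnt": 0, "stage2_cnt": 0}
    l0c_uses = []  # (stage, tile, L0C slot) in issue order
    for ni in range(tiles_n + 1):
        if ni < tiles_n:
            # stage 1 takes the next shared L0C slot for the score tile
            l0c_uses.append(("stage1", ni, cnt["l0c_cnt"] % 2))
            cnt["l0c_cnt"] += 1
            cnt["stage1_cnt"] += 1
        if ni > 0:
            # stage 2 takes the next shared L0C slot for the pv tile
            l0c_uses.append(("stage2", ni - 1, cnt["l0c_cnt"] % 2))
            cnt["l0c_cnt"] += 1
            cnt["stage2_cnt"] += 1
    return cnt, l0c_uses

if __name__ == "__main__":
    counters, uses = trace_counters(3)
    print(counters)
    for use in uses:
        print(use)
```

In this model consecutive L0C writers always land in different slots, and l0c_cnt ends at the sum of the two stage counters, which is the only place where the two stages share a lifetime.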

Vec-resident persistent state

Keep these values in per-subblock UB across the whole inner loop:

  • running row max: [HALF_M, 1]
  • delayed expdiff slots: DBuff(DT.float, [HALF_M, 1], Position.UB)
  • final numerator accumulation: [HALF_M, D]

Use GetSubBlockIdx() so each vec lane owns only its own HALF_M rows.

Critical scalar-state rule on a2

Do not copy [HALF_M, 1] scalar-format state with ub_to_ub.

Reason:

  • ub_to_ub infers burst length in units of C0 blocks
  • for [64, 1] float views, that means copying 8 elements per row
  • this silently miscopies row-scalar state such as prev_m

Stable fix:

  • keep scalar state in [HALF_M, 1]
  • copy it with a vec binary op that respects the [M, 1] stride model, for example:

        dup(ub_zero_s, 0.0)
        add(expdiff_buf[slot], ub_rmax_s, ub_zero_s)

Then update or transform that copied buffer with more vec ops.
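The "8 elements per row" figure follows from the block size. The NumPy sketch below assumes a 32-byte C0 block, which is an assumption of this example and not a value stated in this note, and mirrors the dup/add copy as an add-zero identity. It models the arithmetic only, not the hardware copy.

```python
import numpy as np

C0_BLOCK_BYTES = 32  # assumed block size; not stated in this note

def elements_per_c0_block(dtype):
    """Number of elements of `dtype` that one C0 block spans."""
    return C0_BLOCK_BYTES // np.dtype(dtype).itemsize

def copy_row_scalars(src):
    """Copy [HALF_M, 1] row-scalar state via an elementwise add with zero."""
    zero = np.zeros_like(src)  # mirrors dup(ub_zero_s, 0.0)
    return src + zero          # mirrors add(dst, src, ub_zero_s)

if __name__ == "__main__":
    prev_m = np.arange(64, dtype=np.float32).reshape(64, 1)
    print(elements_per_c0_block(np.float32))  # elements touched per block
    print(np.array_equal(copy_row_scalars(prev_m), prev_m))
```

A block-granular copy would move a full block for every single-element row, whereas the elementwise add touches exactly one value per row.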

Delayed expdiff handling

expdiff_j belongs to the delayed consumer lifetime, not only to stage 1.

Stable pattern:

  1. stage 1 copies prev_m into the delayed expdiff slot
  2. stage 1 updates running max
  3. stage 1 overwrites the delayed slot with exp(prev_m - curr_m)
  4. stage 3 later reads that same slot and broadcasts it before scaling accum

Use stage1_cnt parity for the write slot and stage2_cnt parity for the read slot.
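The parity rule can be sanity-checked with a tiny simulation of the two expdiff slots under the one-tile lookahead loop. `simulate_expdiff_slots` is a hypothetical helper, and plain Python values stand in for the [HALF_M, 1] buffers.

```python
def simulate_expdiff_slots(expdiff_values):
    """Write with stage1_cnt parity, read with stage2_cnt parity.

    Returns the values in the order the delayed consumer reads them.
    """
    tiles_n = len(expdiff_values)
    slots = [None, None]  # the two delayed expdiff slots
    stage1_cnt = 0
    stage2_cnt = 0
    consumed = []
    for ni in range(tiles_n + 1):
        if ni < tiles_n:
            # stage 1 writes expdiff for tile ni
            slots[stage1_cnt % 2] = expdiff_values[ni]
            stage1_cnt += 1
        if ni > 0:
            # stage 3 reads expdiff for tile ni - 1
            consumed.append(slots[stage2_cnt % 2])
            stage2_cnt += 1
    return consumed

if __name__ == "__main__":
    print(simulate_expdiff_slots(["e0", "e1", "e2", "e3"]))
```

Each value is read back in production order, and a slot is overwritten only after the consumer of its previous contents has run, so two slots are enough for a lookahead of one tile.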

Final vec accumulation

After loading pv_j back into UB:

  1. brcb the delayed expdiff slot to [HALF_M, 8]
  2. scale accum[:, 0:64]
  3. scale accum[:, 64:128]
  4. add(accum, accum, pv_j)

Why sliced scaling is required:

  • accum is wide ([HALF_M, 128])
  • expdiff broadcast is narrow ([HALF_M, 8])
  • follow the same sliced-row rule used for row-max subtraction
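The four steps above can be emulated in NumPy to confirm that sliced scaling matches a plain row-wise multiply. This is a host-side sketch only: repeating the 8-wide broadcast across a 64-column slice is an assumed model of how the narrow operand is reused, and the function names are hypothetical.

```python
import numpy as np

def scale_accum_sliced(accum, expdiff, block=8, slice_cols=64):
    """Scale accum [HALF_M, D] by expdiff [HALF_M, 1] in 64-column slices."""
    d = accum.shape[1]
    # Step 1: broadcast each row scalar to one 8-element block.
    bc = np.repeat(expdiff, block, axis=1)             # [HALF_M, 8]
    # Assumed model: the 8-wide block is reused across a 64-column slice.
    slice_scale = np.tile(bc, (1, slice_cols // block))  # [HALF_M, 64]
    out = accum.copy()
    # Steps 2 and 3: scale one slice at a time.
    for start in range(0, d, slice_cols):
        out[:, start:start + slice_cols] *= slice_scale
    return out

def accumulate(accum, expdiff, pv):
    """Step 4: add the delayed pv_j tile to the rescaled accumulator."""
    return scale_accum_sliced(accum, expdiff) + pv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    accum = rng.standard_normal((64, 128)).astype(np.float32)
    expdiff = rng.random((64, 1)).astype(np.float32)
    pv = rng.standard_normal((64, 128)).astype(np.float32)
    print(np.allclose(accumulate(accum, expdiff, pv), accum * expdiff + pv))
```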

Validation target

Keep the first validated contract narrow:

  • D == 128
  • S1 % 128 == 0
  • S2 % 128 == 0
  • input q/k/v are float16
  • output is float32

Suggested cases:

  1. (1, 1, 256, 512, 128)
  2. (1, 3, 256, 512, 128)
  3. (1, 3, 2048, 4096, 128)
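A small guard can reject shapes outside the narrow contract before a kernel launch. The tuple order (batch, heads, S1, S2, D) is an assumption of this sketch, since the note lists the cases without naming the fields, and `check_contract` is a hypothetical helper.

```python
SUGGESTED_CASES = [
    (1, 1, 256, 512, 128),
    (1, 3, 256, 512, 128),
    (1, 3, 2048, 4096, 128),
]

def check_contract(case):
    """Return a list of contract violations; an empty list means valid.

    Assumes the tuple order (batch, heads, S1, S2, D).
    """
    _batch, _heads, s1, s2, d = case
    problems = []
    if d != 128:
        problems.append("D must be 128")
    if s1 % 128 != 0:
        problems.append("S1 must be a multiple of 128")
    if s2 % 128 != 0:
        problems.append("S2 must be a multiple of 128")
    return problems

if __name__ == "__main__":
    for case in SUGGESTED_CASES:
        print(case, check_contract(case))
```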

Files to study

  • agent/example/kernels/a2/flash_attn_score_iter.py
  • agent/example/kernels/a2/flash_attn_score_pv.py
  • agent/example/kernels/a2/flash_attn_unnorm.py
  • agent/references/patterns/a2-cube-vec.md
  • agent/references/patterns/a2-cube-vec-cube.md
  • agent/references/constraints/a2-device.md
  • agent/references/constraints/vec-reduction-a2.md
  • agent/references/constraints/vec-stride.md


Authoring statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
