5 Comments
Ara

Incredible

Pavan Jayasinha

100% amazing

Ali
May 26 (edited)

Very solid.

remark:

"We see that tensor core operations are 2x faster than loads. Store operation is 6.4x slower, but only runs once compared to Load+Tensor Core loop which runs 128 times. These numbers change with different tile sizes, but quantifying them gives us a good picture what’s happening"

It's actually 256 times, since

K = 4096, BK = 64, WGMMA_K = 16

Number of BK chunks = K / BK = 4096 / 64 = 64, and within each BK chunk the number of WGMMA calls = BK / WGMMA_K = 64 / 16 = 4, for a total of 64 × 4 = 256.
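
A quick way to check the count in the remark above is to redo the arithmetic in a few lines of Python. This is only a sketch: the values of K, BK, and WGMMA_K are the ones quoted in the comment, and the variable names are illustrative, not taken from the article's kernel.

```python
# Values quoted in the comment above (names are illustrative).
K = 4096        # reduction dimension of the matmul
BK = 64         # K-extent of one tile in the main loop
WGMMA_K = 16    # K-depth consumed by one WGMMA call

bk_chunks = K // BK                  # main-loop iterations over K
wgmma_per_chunk = BK // WGMMA_K      # WGMMA calls inside one BK chunk
total_wgmma_calls = bk_chunks * wgmma_per_chunk

print(bk_chunks, wgmma_per_chunk, total_wgmma_calls)  # 64 4 256
```

Changing BK changes the number of chunks and the number of calls per chunk in opposite directions, so the total stays at K / WGMMA_K as long as both divisions are exact.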

Kane

Amazing!

"n=256 demands a whopping 40KB of SMEM and 128 registers per thread!"

(64 + 256) × 16 × 2 bytes = 10 KB. Maybe it's a typo?
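
The 10 KB figure can be reproduced with a short Python sketch. It assumes, as the comment's formula does, an m = 64, n = 256, k = 16 tile with 2-byte (16-bit) operands, and it counts only one A tile and one B tile; any extra buffering in the article's kernel is not modeled here.

```python
# Tile shape and element size taken from the comment's formula.
M, N, K_TILE = 64, 256, 16
BYTES_PER_ELEM = 2   # assumption: 16-bit operands such as fp16/bf16

a_bytes = M * K_TILE * BYTES_PER_ELEM   # A tile: m x k
b_bytes = N * K_TILE * BYTES_PER_ELEM   # B tile: n x k
total_bytes = a_bytes + b_bytes

print(a_bytes, b_bytes, total_bytes, total_bytes / 1024)  # 2048 8192 10240 10.0
```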

Kane

There's a mistake in the picture: A's shape is denoted m×n, but shouldn't it be m×k?
