"We see that tensor core operations are 2x faster than loads. Store operation is 6.4x slower, but only runs once compared to Load+Tensor Core loop which runs 128 times. These numbers change with different tile sizes, but quantifying them gives us a good picture what’s happening"
It's actually 256 times, since K = 4096, BK = 64, and WGMMA_K = 16: the number of BK chunks is K / BK = 4096 / 64 = 64, and within each BK chunk the number of WGMMA calls is BK / WGMMA_K = 64 / 16 = 4, totalling 64 × 4 = 256.
Incredible
100% amazing
very solid
"n=256 demands a whopping 40KB of SMEM and 128 registers per thread!"
(64 + 256) × 16 × 2 = 10 KB, so maybe it's a typo?
There's also a mistake in the figure: A's shape is denoted m×n, but shouldn't it be m×k?