"We see that tensor core operations are 2x faster than loads. Store operation is 6.4x slower, but only runs once compared to Load+Tensor Core loop which runs 128 times. These numbers change with different tile sizes, but quantifying them gives us a good picture what’s happening"
It's actually 256 times, since K = 4096, BK = 64, and WGMMA_K = 16: the number of BK chunks is K / BK = 4096 / 64 = 64, and within each BK chunk the number of WGMMA calls is BK / WGMMA_K = 64 / 16 = 4, totalling 64 × 4 = 256.
Incredible
100% amazing
very solid
"n=256 demands a whopping 40KB of SMEM and 128 registers per thread!"
(64 + 256) × 16 × 2 = 10 KB, so maybe it's a typo?
There's also a mistake in the figure: A's shape is denoted m×n, but shouldn't it be m×k?