hgemm
Here are 5 public repositories matching this topic...
📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).
- Updated Mar 19, 2025 - Cuda
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.
- Updated Sep 8, 2024 - Cuda
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs to achieve peak⚡️ performance.
- Updated Mar 4, 2025 - Cuda
Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.
- Updated Nov 3, 2023 - Cuda