# flash-mla
Here are 3 public repositories matching this topic...
📖 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, Flash-Attention, Paged-Attention, MLA, Parallelism, Prefix-Cache, Chunked-Prefill, etc. 🎉🎉 (A minimal MLA sketch follows this entry.)
Topics: mla, vllm, llm-inference, awesome-llm, flash-attention, tensorrt-llm, paged-attention, deepseek, flash-attention-3, deepseek-v3, minimax-01, deepseek-r1, flash-mla
- Updated Mar 4, 2025
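Since MLA (Multi-head Latent Attention) is the technique this topic revolves around, here is a minimal PyTorch sketch of its core idea: hidden states are down-projected into a small latent that is the only thing cached during decoding, then up-projected back to per-head keys and values. This is a simplified illustration, not DeepSeek's implementation; the module and parameter names (`SimpleMLA`, `kv_lora_rank`, etc.) are invented for this sketch, and the decoupled RoPE path of the real design is omitted.

```python
# Minimal sketch of the MLA low-rank KV-compression idea (hypothetical names;
# decoupled RoPE omitted). Not the FlashMLA or DeepSeek implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLA(nn.Module):
    def __init__(self, d_model=1024, n_heads=8, head_dim=64, kv_lora_rank=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
        # Down-project hidden states into a small latent; only this latent
        # needs to be cached during decoding, instead of full per-head K/V.
        self.kv_down = nn.Linear(d_model, kv_lora_rank, bias=False)
        # Up-project the latent back to per-head keys and values.
        self.k_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        latent = self.kv_down(x)                      # (b, t, kv_lora_rank)
        if latent_cache is not None:                  # append new latents to the cache
            latent = torch.cat([latent_cache, latent], dim=1)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # Causal mask only matters for the prefill pass (no cache yet).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out), latent               # latent doubles as the new cache

x = torch.randn(2, 16, 1024)
attn = SimpleMLA()
y, cache = attn(x)
print(y.shape, cache.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 16, 128])
```

Note that the cached latent per token has width `kv_lora_rank` (128 here) rather than `n_heads * head_dim` (512 here), which is where the KV-cache saving comes from.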
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ HGEMM with WMMA, MMA, and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉). (A baseline-measurement sketch follows this entry.)
- Updated Mar 19, 2025 - Cuda
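The "98%~100% of cuBLAS TFLOPS" claim above is relative to the half-precision GEMM throughput that cuBLAS achieves on the same sizes. The sketch below shows one common way to measure that cuBLAS baseline from Python via `torch.matmul`; the matrix sizes, warm-up count, and iteration count are arbitrary illustrative choices, not the repository's own benchmark harness.

```python
# Rough FP16 GEMM throughput measurement via torch.matmul (which dispatches to
# cuBLAS), i.e. the kind of baseline a hand-written WMMA/MMA HGEMM kernel is
# compared against. Sizes and iteration counts are illustrative.
import torch

def hgemm_tflops(m, n, k, iters=50):
    a = torch.randn(m, k, device="cuda", dtype=torch.half)
    b = torch.randn(k, n, device="cuda", dtype=torch.half)
    for _ in range(10):                               # warm-up
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters              # average ms per GEMM
    return 2 * m * n * k / (ms * 1e-3) / 1e12         # 2*M*N*K FLOPs per GEMM

if torch.cuda.is_available():
    print(f"cuBLAS hgemm baseline: {hgemm_tflops(4096, 4096, 4096):.1f} TFLOPS")
```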
📚 FFPA (Split-D): yet another faster flash prefill attention with O(1) GPU SRAM complexity for headdim > 256, ~2x speedup 🎉 vs. SDPA EA. (A split-D sketch follows this entry.)
Topics: cuda, attention, sdpa, mla, mlsys, tensor-cores, flash-attention, deepseek, deepseek-v3, deepseek-r1, fused-mla, flash-mla
- Updated Mar 17, 2025 - Cuda
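The arithmetic behind a "split-D" scheme can be illustrated without any CUDA: for very large head dimensions, Q·Kᵀ can be accumulated over headdim chunks and P·V emitted one headdim chunk at a time, so an on-chip buffer never needs to hold a full headdim-sized tile. The reference below only demonstrates that this chunked math is exact; it is plain PyTorch, not the fused FFPA kernel, and the chunk size and shapes are assumptions for illustration.

```python
# Conceptual split-D reference: accumulate Q·K^T and emit P·V in headdim chunks.
# This is a math illustration, not the FFPA kernel.
import torch

def split_d_attention(q, k, v, d_chunk=64):
    # q, k, v: (batch, heads, seq, headdim); headdim may exceed 256
    scale = q.shape[-1] ** -0.5
    scores = torch.zeros(*q.shape[:-1], k.shape[-2], device=q.device, dtype=q.dtype)
    for d0 in range(0, q.shape[-1], d_chunk):         # accumulate Q·K^T over D chunks
        scores += q[..., d0:d0 + d_chunk] @ k[..., d0:d0 + d_chunk].transpose(-2, -1)
    p = torch.softmax(scores * scale, dim=-1)
    out = torch.empty_like(q)
    for d0 in range(0, v.shape[-1], d_chunk):         # emit P·V one D chunk at a time
        out[..., d0:d0 + d_chunk] = p @ v[..., d0:d0 + d_chunk]
    return out

q = torch.randn(1, 4, 128, 320)
k = torch.randn(1, 4, 128, 320)
v = torch.randn(1, 4, 128, 320)
ref = torch.softmax((q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5, dim=-1) @ v
print(torch.allclose(split_d_attention(q, k, v), ref, atol=1e-5))  # True
```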
Add this topic to your repo
To associate your repository with the flash-mla topic, visit your repo's landing page and select "manage topics."