flash-attention
Here are 33 public repositories matching this topic...
The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Updated Feb 25, 2025 - Python
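Several of the model repos in this list can use FlashAttention through Hugging Face Transformers. As a minimal sketch only (the checkpoint name and generation settings are illustrative assumptions, not taken from the Qwen repo), a Qwen-style chat model can request the FlashAttention-2 backend at load time:

```python
# Minimal sketch: load a causal LM with the FlashAttention-2 backend in
# Hugging Face Transformers. Requires a CUDA GPU plus torch, transformers,
# accelerate, and the flash-attn package. The checkpoint name is an
# example placeholder, not prescribed by the repo above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,              # FlashAttention-2 needs fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("What is FlashAttention?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```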
Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) + 64K long-context models.
Updated Sep 23, 2024 - Python
Official release of InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).
Updated Feb 7, 2025 - Python
📖 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, etc. 🎉🎉
Updated Mar 4, 2025
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ hgemm with WMMA, MMA and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
Updated Mar 22, 2025 - Cuda
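The "98%~100% of cuBLAS/FA2 TFLOPS" claim is a throughput comparison against vendor kernels. A rough sketch of how such a baseline number is typically measured, using a generic PyTorch timing harness (not that repository's own benchmark), is:

```python
# Rough sketch: measure fp16 GEMM throughput of cuBLAS via torch.matmul;
# a custom kernel's TFLOPS is then reported as a fraction of this number.
# Generic harness, not the benchmark used by the repo above.
import torch

def hgemm_tflops(m, n, k, iters=50):
    a = torch.randn(m, k, dtype=torch.half, device="cuda")
    b = torch.randn(k, n, dtype=torch.half, device="cuda")
    for _ in range(10):            # warm-up so cuBLAS heuristics settle
        torch.matmul(a, b)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        torch.matmul(a, b)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3 / iters
    return 2 * m * n * k / seconds / 1e12   # 2*M*N*K FLOPs per GEMM

if __name__ == "__main__":
    print(f"cuBLAS fp16 GEMM: {hgemm_tflops(4096, 4096, 4096):.1f} TFLOPS")
```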
FlashInfer: Kernel Library for LLM Serving
Updated Mar 24, 2025 - Cuda
MoBA: Mixture of Block Attention for Long-Context LLMs
Updated Mar 7, 2025 - Python
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without extensive dependencies.
Updated Mar 20, 2025 - Python
[CVPR 2025] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.
Updated Jan 16, 2025 - Python
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
Updated Mar 23, 2025 - Cuda
Triton implementation of FlashAttention-2 that adds support for custom masks.
Updated Aug 14, 2024 - Python
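Custom masks are the feature stock FlashAttention kernels typically lack (causal masking is supported, arbitrary per-position masks are not). Purely as a point of reference for the semantics, and not that repository's Triton API, PyTorch's built-in scaled_dot_product_attention accepts an arbitrary boolean mask:

```python
# Reference sketch of attention with a custom mask via PyTorch's SDPA.
# Illustrates the semantics the Triton kernel above targets; it is not
# that repository's API.
import torch
import torch.nn.functional as F

batch, heads, seq, head_dim = 2, 8, 128, 64
q = torch.randn(batch, heads, seq, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Custom mask: causal attention restricted to a sliding window of 16 tokens.
# True means "may attend"; every row keeps at least its own position.
idx = torch.arange(seq, device="cuda")
mask = (idx[:, None] >= idx[None, :]) & (idx[:, None] - idx[None, :] < 16)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # (batch, heads, seq, head_dim)
```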
Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode. Faster than ZeRO/ZeRO++/FSDP.
Updated Feb 5, 2024 - Python
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
Updated Feb 27, 2025 - C++
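For comparison with the C++ interface benchmarked above, the flash-attn project also ships a Python interface. A minimal sketch of calling it, assuming the flash-attn package is installed and a supported CUDA GPU is available:

```python
# Minimal sketch of the flash-attn Python API for the standard
# (batch, seqlen, nheads, headdim) layout; inputs must be fp16 or bf16
# and live on a supported CUDA GPU.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # (batch, seqlen, nheads, headdim)
print(out.shape)
```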
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference.
Updated Mar 9, 2025 - C++
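The decoding stage differs from prefill in that each step attends a single new query token against the accumulated KV cache, so these kernels optimize a matrix-vector-like workload. A plain PyTorch sketch of one grouped-query attention (GQA) decode step, shown only to illustrate the shapes involved and not that repository's CUDA implementation:

```python
# Plain PyTorch sketch of one GQA decode step: a single query token per
# sequence attends over the full KV cache. Shapes-only illustration; the
# repository above implements this with hand-written CUDA kernels.
import torch

batch, q_heads, kv_heads, head_dim, cache_len = 2, 32, 8, 128, 4096
group = q_heads // kv_heads  # query heads sharing each KV head

q = torch.randn(batch, q_heads, 1, head_dim, device="cuda", dtype=torch.float16)
k_cache = torch.randn(batch, kv_heads, cache_len, head_dim, device="cuda", dtype=torch.float16)
v_cache = torch.randn(batch, kv_heads, cache_len, head_dim, device="cuda", dtype=torch.float16)

# Expand KV heads so each group of query heads sees its shared KV head.
k = k_cache.repeat_interleave(group, dim=1)  # (batch, q_heads, cache_len, head_dim)
v = v_cache.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5   # (batch, q_heads, 1, cache_len)
probs = torch.softmax(scores.float(), dim=-1).to(q.dtype)
out = probs @ v                                      # (batch, q_heads, 1, head_dim)
print(out.shape)
```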
Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.
Updated Nov 4, 2024 - Python
Python package for rematerialization-aware gradient checkpointing
Updated Oct 31, 2023 - Python
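Gradient checkpointing trades compute for memory: intermediate activations are discarded during the forward pass and recomputed (rematerialized) during backward. As general background rather than that package's API, PyTorch's built-in utility expresses the idea in a few lines:

```python
# Generic illustration of gradient checkpointing with PyTorch's built-in
# utility: the block's activations are not stored and are recomputed when
# backward runs. Background for the idea of rematerialization, not the
# API of the package above.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Forward without saving the block's intermediates; they are rematerialized
# during .backward().
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
print(x.grad.shape)
```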
A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).
Updated Mar 4, 2025 - Python
Utilities for efficient fine-tuning, inference and evaluation of code generation models
Updated Oct 3, 2023 - Python
A simple PyTorch implementation of flash multi-head attention.
Updated Feb 5, 2024 - Jupyter Notebook
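For readers comparing against such minimal implementations, a compact multi-head attention module can also be built around PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention-style kernel when the inputs allow it. A sketch for illustration, not the notebook's own code:

```python
# Compact multi-head attention built on torch's fused SDPA; on CUDA with
# fp16/bf16 tensors it can dispatch to a FlashAttention-style kernel.
# Sketch for illustration, not the notebook's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHA(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, s, d) -> (b, heads, s, head_dim)
        shape = (b, s, self.num_heads, d // self.num_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        return self.proj(out.transpose(1, 2).reshape(b, s, d))

mha = FlashMHA(512, 8).cuda().half()
x = torch.randn(2, 256, 512, device="cuda", dtype=torch.float16)
print(mha(x).shape)  # (2, 256, 512)
```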
🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations
Updated Nov 30, 2024 - Shell