
flash-attention

Here are 33 public repositories matching this topic...
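Most of the repositories below implement, accelerate, or wrap fused scaled-dot-product attention. As a minimal orientation sketch only (assuming PyTorch 2.3+ with a flash-capable CUDA build; not tied to any repo in this list), the stock PyTorch call with the FlashAttention backend pinned looks like this:

# Minimal sketch (assumes PyTorch >= 2.3 and a CUDA build whose flash kernels
# support these shapes/dtypes): the fused scaled-dot-product attention call
# that most repositories in this listing build on or accelerate.
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict dispatch to the FlashAttention backend; errors if it cannot be used.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([1, 8, 1024, 64])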

The official repo of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.

  • Updated Feb 25, 2025
  • Python
Chinese-LLaMA-Alpaca-2

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large language model project, with 64K ultra-long-context models.

  • Updated Sep 23, 2024
  • Python

Official release of the InternLM series (InternLM, InternLM2, InternLM2.5, InternLM3).

  • Updated Feb 7, 2025
  • Python
Awesome-LLM-Inference

📖 A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉

  • Updated Mar 4, 2025
CUDA-Learn-Notes

📚 200+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).

  • Updated Mar 22, 2025
  • Cuda

FlashInfer: Kernel Library for LLM Serving

  • Updated Mar 24, 2025
  • Cuda

MoBA: Mixture of Block Attention for Long-Context LLMs

  • Updated Mar 7, 2025
  • Python

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

  • Updated Mar 20, 2025
  • Python

[CVPR 2025] The official CLIP training codebase of Inf-CL: "Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss". A highly memory-efficient CLIP training scheme.

  • Updated Jan 16, 2025
  • Python
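For context only, a hedged sketch of the plain (untiled) CLIP contrastive loss; the B x B logits matrix built here is exactly what memory-efficient schemes such as Inf-CL avoid materializing at once, and the repo's tiled, distributed version is not shown. Function and variable names are illustrative:

# Plain (non-tiled) CLIP-style contrastive loss for reference. The full
# B x B similarity matrix below is the memory bottleneck that tiled schemes
# compute block-by-block instead.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy over image->text and text->image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))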
ffpa-attn-mma

📚 FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑ vs SDPA EA 🎉.

  • Updated Mar 23, 2025
  • Cuda

Triton implementation of FlashAttention2 that adds support for custom masks.

  • Updated Aug 14, 2024
  • Python
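For reference, a hedged sketch of the unfused semantics such a custom-mask kernel reproduces, expressed with PyTorch's scaled_dot_product_attention and an additive float mask; the block-diagonal mask and all names are illustrative, not the repo's Triton API:

# Reference semantics for a custom attention mask (not the repo's Triton API):
# an additive float mask in which disallowed positions get -inf before softmax.
import torch
import torch.nn.functional as F

B, H, L, D = 2, 4, 128, 64
q = torch.randn(B, H, L, D)
k, v = torch.randn_like(q), torch.randn_like(q)

# Example custom mask: block-diagonal attention over chunks of 32 tokens.
block = 32
idx = torch.arange(L)
allowed = (idx[:, None] // block) == (idx[None, :] // block)
mask = torch.zeros(L, L).masked_fill(~allowed, float("-inf"))

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)  # (B, H, L, D)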

Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP.

  • Updated Feb 5, 2024
  • Python

Performance of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.

  • Updated Feb 27, 2025
  • C++

Decoding Attention is specially optimized for MHA, MQA, GQA and MLA, using CUDA cores for the decoding stage of LLM inference.

  • Updated Mar 9, 2025
  • C++
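As a reference point only (not the repo's C++/CUDA interface), a hedged PyTorch sketch of grouped-query attention at decode time, where each KV head is shared by a group of query heads; shapes and names are illustrative:

# Reference GQA decode step in plain PyTorch (illustrative only): n_q_heads
# query heads share n_kv_heads cached KV heads, expanded before a standard
# scaled-dot-product attention call over one new query token.
import torch
import torch.nn.functional as F

B, n_q_heads, n_kv_heads, D = 1, 32, 8, 128
cache_len = 1024

q = torch.randn(B, n_q_heads, 1, D)               # single new token at decode
k_cache = torch.randn(B, n_kv_heads, cache_len, D)
v_cache = torch.randn(B, n_kv_heads, cache_len, D)

# Expand each KV head across its group of query heads.
group = n_q_heads // n_kv_heads
k = k_cache.repeat_interleave(group, dim=1)        # (B, n_q_heads, cache_len, D)
v = v_cache.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v)      # (B, n_q_heads, 1, D)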

Fast and memory-efficient PyTorch implementation of the Perceiver with FlashAttention.

  • Updated Nov 4, 2024
  • Python

Python package for rematerialization-aware gradient checkpointing

  • Updated Oct 31, 2023
  • Python
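For context, a hedged sketch of ordinary activation checkpointing with torch.utils.checkpoint; the rematerialization-aware policy the package above provides (choosing which activations to recompute) is not modeled here:

# Plain activation checkpointing for context: the block's intermediate
# activations are discarded during the forward pass and recomputed in backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(8, 1024, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # recomputes block(x) in backward
y.sum().backward()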

A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/Pallas/JAX).

  • Updated Mar 4, 2025
  • Python

Utilities for efficient fine-tuning, inference and evaluation of code generation models

  • Updated Oct 3, 2023
  • Python

A simple PyTorch implementation of flash multi-head attention.

  • Updated Feb 5, 2024
  • Jupyter Notebook
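In the same spirit as the entry above, though not the notebook's code, a hedged minimal multi-head attention layer that delegates the attention itself to F.scaled_dot_product_attention so a fused flash kernel can be picked up when available:

# Generic minimal multi-head attention (illustrative, not the notebook's code):
# projections in plain PyTorch, attention delegated to scaled_dot_product_attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        assert dim % n_heads == 0
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, L, dim) -> (B, n_heads, L, head_dim)
        q, k, v = (t.view(B, L, self.n_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(B, L, -1))

mha = MultiHeadAttention(dim=256, n_heads=8)
print(mha(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])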

🚀 Automated deployment stack for AMD MI300 GPUs with optimized ML/DL frameworks and HPC-ready configurations

  • Updated Nov 30, 2024
  • Shell



[8]ページ先頭

©2009-2025 Movatter.jp