@Bruce-Lee-LY

Pinned

  1. decoding_attention (Public)

    Decoding Attention is optimized specifically for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference (a minimal decode-attention sketch follows this list).

    C++ · 35 stars · 2 forks

  2. flash_attention_inference (Public)

    Benchmarks the performance of the C++ interfaces of flash attention and flash attention v2 in large language model (LLM) inference scenarios.

    C++ · 35 stars · 3 forks

  3. cuda_hgemm (Public)

    Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions (a WMMA sketch follows this list).

    CUDA · 374 stars · 76 forks

  4. cuda_hook (Public)

    Hooks CUDA-related dynamic libraries using automated code-generation tools (a dlsym-interposition sketch follows this list).

    C · 150 stars · 41 forks

  5. cuda_hgemv (Public)

    Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores (a warp-per-row sketch follows this list).

    CUDA · 59 stars · 5 forks

  6. cutlass_gemm (Public)

    Multiple GEMM operators constructed with CUTLASS to support LLM inference (a basic CUTLASS usage sketch follows this list).

    C++ · 17 stars · 2 forks
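
At the decode stage each step attends a single query token per head against the cached K/V sequence, so the work is memory-bound dot products plus a softmax, which is why CUDA cores (rather than Tensor Cores) are a natural fit. Below is a minimal sketch of that pattern in CUDA, assuming a half-precision per-head K/V layout and one thread block per head; the kernel name and layout are illustrative, not decoding_attention's actual code.

#include <cuda_fp16.h>
#include <math.h>

// One thread block per head. For head h (blockIdx.x):
//   q: [head_dim] query, k/v: [seq_len, head_dim] caches, out: [head_dim].
// Launch: <<<num_heads, 128, seq_len * sizeof(float)>>>.
__global__ void decode_attention_naive(const half *q, const half *k,
                                       const half *v, half *out,
                                       int seq_len, int head_dim,
                                       float scale) {
    extern __shared__ float score[];  // one score per cached position

    const half *qh = q + (size_t)blockIdx.x * head_dim;
    const half *kh = k + (size_t)blockIdx.x * seq_len * head_dim;
    const half *vh = v + (size_t)blockIdx.x * seq_len * head_dim;
    half *oh = out + (size_t)blockIdx.x * head_dim;

    // 1. Scaled dot product q . k[s] for each cached position s.
    for (int s = threadIdx.x; s < seq_len; s += blockDim.x) {
        float acc = 0.f;
        for (int d = 0; d < head_dim; ++d)
            acc += __half2float(qh[d]) *
                   __half2float(kh[(size_t)s * head_dim + d]);
        score[s] = acc * scale;
    }
    __syncthreads();

    // 2. Softmax statistics (serial in thread 0 for brevity; a real
    //    kernel would use a parallel reduction here).
    __shared__ float s_max, s_sum;
    if (threadIdx.x == 0) {
        float m = -INFINITY, sum = 0.f;
        for (int s = 0; s < seq_len; ++s) m = fmaxf(m, score[s]);
        for (int s = 0; s < seq_len; ++s) sum += expf(score[s] - m);
        s_max = m;
        s_sum = sum;
    }
    __syncthreads();

    // 3. Output: out[d] = sum_s softmax(score)[s] * v[s][d].
    for (int d = threadIdx.x; d < head_dim; d += blockDim.x) {
        float acc = 0.f;
        for (int s = 0; s < seq_len; ++s)
            acc += expf(score[s] - s_max) / s_sum *
                   __half2float(vh[(size_t)s * head_dim + d]);
        oh[d] = __float2half(acc);
    }
}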

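The WMMA API in cuda_hgemm's description is nvcuda::wmma, which drives Tensor Cores through 16x16x16 matrix fragments. The textbook starting point looks like the sketch below (one warp per 16x16 output tile, FP16 inputs, FP32 accumulation, M/N/K assumed to be multiples of 16); the repo's optimized kernels layer tiling, shared memory, and MMA PTX well beyond this baseline.

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// C = A * B. A: M x K row-major, B: K x N column-major, C: M x N row-major.
// One warp computes one 16x16 tile of C; launch with blockDim.x a multiple
// of 32, e.g. dim3 block(128, 4) and a grid covering M/16 x N/16 tiles.
__global__ void wmma_hgemm_naive(const half *A, const half *B, float *C,
                                 int M, int N, int K) {
    int warp_m = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warp_n = blockIdx.y * blockDim.y + threadIdx.y;
    if (warp_m * 16 >= M || warp_n * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // March along K in 16-wide steps, accumulating into the C fragment.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + warp_m * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + warp_n * 16 * K + k, K);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warp_m * 16 * N + warp_n * 16, c_frag,
                            N, wmma::mem_row_major);
}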

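The mechanism behind cuda_hook is standard dynamic-library interposition; the repo automates generating the shims. A hand-written sketch of the idea for a single symbol (cudaMalloc), with the CUDA error type stubbed so the example is self-contained — the generated code and symbol coverage in the repo will differ:

#define _GNU_SOURCE 1  // for RTLD_NEXT on glibc
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

// Stand-in for the enum in cuda_runtime_api.h so this sketch builds
// without CUDA headers; the real cudaError_t is ABI-compatible with int.
typedef int cudaError_t;

// Interposed cudaMalloc: log the request, then forward to the real
// symbol, resolved once with dlsym(RTLD_NEXT, ...).
extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    typedef cudaError_t (*cudaMalloc_fn)(void **, size_t);
    static cudaMalloc_fn real_fn =
        (cudaMalloc_fn)dlsym(RTLD_NEXT, "cudaMalloc");

    fprintf(stderr, "[hook] cudaMalloc(%zu bytes)\n", size);
    return real_fn(devPtr, size);
}

Built with g++ -shared -fPIC hook.cpp -o libhook.so -ldl and loaded via LD_PRELOAD=./libhook.so, this logs every cudaMalloc before forwarding it. Note that interposition only intercepts calls when the application links libcudart dynamically; a statically linked runtime never goes through the dynamic linker.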
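
HGEMV (y = A x) is bandwidth-bound, and a common CUDA-core strategy is to assign one warp per output row, each lane accumulating a strided slice of the dot product before a warp-shuffle reduction. A minimal sketch of that assignment (illustrative, not cuda_hgemv's kernels):

#include <cuda_fp16.h>

// y = A * x with A: M x N row-major, all FP16, FP32 accumulation.
// One warp per row; launch e.g. with 128-thread blocks and
// grid.x = ceil(M / 4.0) so each block covers four rows.
__global__ void hgemv_warp_per_row(const half *A, const half *x, half *y,
                                   int M, int N) {
    int row  = blockIdx.x * (blockDim.x / warpSize) + threadIdx.x / warpSize;
    int lane = threadIdx.x % warpSize;
    if (row >= M) return;

    // Each lane sums a strided slice of the row's dot product.
    float acc = 0.f;
    for (int col = lane; col < N; col += warpSize)
        acc += __half2float(A[(size_t)row * N + col]) * __half2float(x[col]);

    // Reduce the 32 partial sums with warp shuffles.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) y[row] = __float2half(acc);
}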
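
CUTLASS composes GEMMs from templates, and its device-level entry point is cutlass::gemm::device::Gemm. The sketch below is patterned after CUTLASS's basic GEMM example; the element types, layouts, and defaults shown are assumptions for illustration, not the operator set cutlass_gemm actually instantiates.

#include <cutlass/gemm/device/gemm.h>

// FP16 GEMM with column-major A/B/C and default tile shapes/epilogue.
// With these defaults the accumulator and epilogue scalars are half_t.
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // A
    cutlass::half_t, cutlass::layout::ColumnMajor,   // B
    cutlass::half_t, cutlass::layout::ColumnMajor>;  // C and D

// Computes D = alpha * A * B + beta * C, writing D over C in place.
cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const *A, int lda,
                         cutlass::half_t const *B, int ldb,
                         cutlass::half_t *C, int ldc,
                         cutlass::half_t alpha, cutlass::half_t beta) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},       // problem size (GemmCoord)
                         {A, lda},        // TensorRef for A
                         {B, ldb},        // TensorRef for B
                         {C, ldc},        // source C
                         {C, ldc},        // destination D
                         {alpha, beta});  // linear-combination epilogue
    return gemm_op(args);                 // checks, initializes, launches
}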