Squeezed Attention is a method to accelerate attention for long input prompts where a large portion of the input prompt is fixed across successive user queries.
Many LLM applications require processing long input prompts for tasks such as document analysis and code generation. However, long context length inference is prohibitively expensive, since inference costs grow linearly with sequence length. For many LLM applications, a large portion of the input prompt is fixed across successive user queries (for example, when a user asks multiple questions against a document or codebase). This means we can perform offline optimizations that allow us to accelerate attention for user inputs as they are received.
Squeezed Attention accelerates attention from the user input to the fixed context in the prompt by identifying which keys are important for a given query, and then only computing attention with these important keys. This identification is performed by comparing the query with clusters of keys, and then progressively refining this comparison using finer-grained clusters in order to identify the important keys for the query.
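For intuition, below is a minimal PyTorch sketch of the two steps described above: scoring the query against coarse key centroids, refining within the retained clusters using finer-grained centroids, and then computing attention only over the keys that belong to the selected clusters. All tensor layouts, helper names, and top-k parameters here are illustrative assumptions, not the repository's actual implementation (which uses optimized Triton kernels).

```python
import torch

def identify_important_keys(query, coarse_centroids, coarse_to_fine,
                            fine_centroids, fine_to_keys,
                            top_coarse=4, top_fine=16):
    """Hierarchical lookup of important keys for one query vector (illustrative).

    query:            (d,)   query vector from the user input
    coarse_centroids: (C, d) centroids of coarse key clusters (precomputed offline)
    coarse_to_fine:   list of C LongTensors with the fine-cluster ids in each coarse cluster
    fine_centroids:   (F, d) centroids of finer-grained key clusters
    fine_to_keys:     list of F LongTensors with the key indices in each fine cluster
    """
    # Stage 1: compare the query against coarse centroids and keep the best clusters.
    coarse_scores = coarse_centroids @ query                         # (C,)
    kept_coarse = torch.topk(coarse_scores,
                             min(top_coarse, coarse_scores.numel())).indices

    # Stage 2: refine -- only compare against fine centroids inside the kept clusters.
    candidate_fine = torch.cat([coarse_to_fine[c] for c in kept_coarse])
    fine_scores = fine_centroids[candidate_fine] @ query
    kept_fine = candidate_fine[
        torch.topk(fine_scores, min(top_fine, candidate_fine.numel())).indices]

    # The important keys are all keys belonging to the selected fine clusters.
    return torch.cat([fine_to_keys[f] for f in kept_fine])


def squeezed_attention(query, keys, values, important_idx):
    """Attention computed only over the identified important keys (illustrative)."""
    k = keys[important_idx]                                          # (n_important, d)
    v = values[important_idx]
    scores = (k @ query) / (query.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```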
- Create a conda environment
```bash
conda create --name fixedprompt python=3.9 -y
conda activate fixedprompt
```
- Clone and install the dependencies (including the local transformers environment)
```bash
cd transformers
pip install -e .
cd ..
pip install -e .
```
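After installation, a quick way to confirm that the editable install picked up the bundled transformers fork is to check which copy Python imports. The exact version string and path depend on your checkout, so treat this only as an illustrative check:

```python
import transformers

# Should point inside the local ./transformers checkout rather than a site-packages install.
print(transformers.__version__)
print(transformers.__file__)
```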
- `run_offline_clustering.sh` provides an example of how to run offline clustering for a given LongBench dataset.
- `LongBench/run_evaluation.sh` provides an example of how to run online evaluation for a given LongBench dataset (a minimal end-to-end invocation sketch is shown below).
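A typical end-to-end run performs the offline clustering step once and then runs online evaluation against the precomputed clusters. The sketch below simply invokes the two scripts in that order; any dataset, model, or argument configuration is assumed to be set inside the scripts themselves, so check them before running.

```bash
# Offline step: cluster the keys of the fixed context for a LongBench dataset.
bash run_offline_clustering.sh

# Online step: evaluate user queries using the precomputed key clusters.
bash LongBench/run_evaluation.sh
```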
Planned additions:
- add evaluation code for RULER
- add dataset generation and evaluation code for PreFixQA
- merge efficient kernels for generation into the evaluation pipeline
- add an environment for kernel benchmarking
Squeezed Attention has been developed as part of the following paper. If you find this library useful for your work, please cite:
```
@article{hooper2024squeezed,
  title={Squeezed Attention: Accelerating Long Context Length LLM Inference},
  author={Hooper, Coleman and Kim, Sehoon and Mohammadzadeh, Hiva and Maheswaran, Monishwaran and Paik, June and Mahoney, Michael W and Keutzer, Kurt and Gholami, Amir},
  journal={arXiv preprint arXiv:2411.09688},
  year={2024}
}
```
This code reuses components from existing libraries, including SqueezeLLM and LongBench.
The Triton kernel implementations in this repository are built on top of the FlashAttention2 Triton Tutorial, as well as prior work on Sparse FlashAttention kernels.