So far this is a very standard FIM completion using "local" context. Adding more context usually improves the quality of the completion, but it also increases the latency. As a datapoint, a 7B LLM running on a 76-core M2 Ultra GPU takes roughly 1 second to process 1000 tokens of context. Modern LLMs have training contexts of more than 32k tokens, so filling the entire context with local context and reprocessing it on each completion request is not feasible for local completion, as it would be exceedingly slow. For a good user experience, we aim for a latency of about 1 second or less per completion suggestion, while still utilizing the full context of the model. How we solve this problem is described further down.
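As a rough illustration of the "local" context mentioned above, the sketch below picks a window of lines around the cursor and splits the current line at the cursor column. The parameter names `n_prefix` and `n_suffix` and the exact newline handling are assumptions made for this example, not taken from the plugin; only the field names `input_prefix`, `input_suffix` and `prompt` come from the description.

```python
def local_context(lines, row, col, n_prefix, n_suffix):
    """Build the local FIM fields from buffer lines and a 0-based cursor.

    n_prefix / n_suffix are hypothetical window sizes (in lines).
    """
    prefix = lines[max(0, row - n_prefix):row]
    suffix = lines[row + 1:row + 1 + n_suffix]
    cur = lines[row]
    return {
        # lines above the cursor line
        "input_prefix": "".join(line + "\n" for line in prefix),
        # rest of the cursor line, followed by the lines below it
        "input_suffix": cur[col:] + "\n" + "".join(line + "\n" for line in suffix),
        # text to the left of the cursor on the current line
        "prompt": cur[:col],
    }


buf = ["a", "b", "def f(x):", "    return x", "e"]
ctx = local_context(buf, row=3, col=7, n_prefix=2, n_suffix=5)
```

Growing `n_prefix` and `n_suffix` gives the model more to work with, but every extra line has to be processed on each request, which is the latency trade-off described above.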
Global context
In addition to the local context around the current cursor location, we can significantly improve the quality of the generated suggestions by including extra "global" context. This extra context can come either from other places in the file that we are currently editing, or from other recently edited or opened files. There are many techniques for deciding which extra context is likely to be relevant to the current completion task. In the llama.vim plugin, we use a simple approach:
We create a ring buffer of g:llama_config.ring_n_chunks chunks of g:llama_config.ring_chunk_size lines each
On every completion request we add 1 prefix and 1 suffix chunk, randomly picked relative to the cursor position but in a much larger scope (g:llama_config.ring_scope lines around the cursor)
Upon entering and leaving a Vim buffer, we pick a chunk around the last cursor position
Upon saving a file, we pick a chunk around the current cursor position
Upon yanking a text block, we add it as a chunk to the ring buffer
Upon trying to add a chunk, we evict old chunks that are very similar to the new one

    " gather chunks upon yanking
    autocmd TextYankPost * if v:event.operator ==# 'y' | call s:pick_chunk(v:event.regcontents, v:false, v:true) | endif

    " gather chunks upon entering/leaving a buffer
    autocmd BufEnter * call timer_start(100, {-> s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)})
    autocmd BufLeave * call s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)

    " gather chunk upon saving the file
    autocmd BufWritePost * call s:pick_chunk(getline(max([1, line('.') - g:llama_config.ring_chunk_size/2]), min([line('.') + g:llama_config.ring_chunk_size/2, line('$')])), v:true, v:true)
Upon each FIM completion request, we now send both the local and global contexts together. The latter is passed through the "input_extra" field of the /infill request in the following format:

    [
      {
        "filename": string,
        "text": string
      },
      ... max of g:llama_config.ring_n_chunks ...
    ]
With this design, as we edit the files in our Neovim session, the overall context grows to a certain amount (determined by the ring buffer size) and usually contains up-to-date relevant information for the editing task at hand. The specific events and logic for gathering chunks can be easily modified and customized if needed.
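The ring buffer behaviour can be sketched as follows. This is a minimal illustration, not the plugin's Vim script: the similarity measure (fraction of shared lines) and the 0.9 threshold are assumptions chosen for the example, and `ChunkRing` is a hypothetical name.

```python
def chunk_similarity(a, b):
    """Assumed measure: 2 * |shared lines| / (|a| + |b|), in [0, 1]."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2.0 * len(sa & sb) / (len(sa) + len(sb))


class ChunkRing:
    def __init__(self, n_chunks, sim_threshold=0.9):
        self.n_chunks = n_chunks            # plays the role of ring_n_chunks
        self.sim_threshold = sim_threshold  # assumed value
        self.chunks = []                    # oldest first

    def add(self, lines, filename):
        lines = list(lines)
        # an identical chunk is already present: nothing to do
        if any(c["lines"] == lines for c in self.chunks):
            return False
        # evict old chunks that are very similar to the new one
        self.chunks = [c for c in self.chunks
                       if chunk_similarity(c["lines"], lines) <= self.sim_threshold]
        # ring buffer is full: drop the oldest chunk
        if len(self.chunks) >= self.n_chunks:
            self.chunks.pop(0)
        self.chunks.append({"filename": filename, "lines": lines})
        return True

    def as_input_extra(self):
        """Chunks in the shape of the "input_extra" array."""
        return [{"filename": c["filename"], "text": "\n".join(c["lines"]) + "\n"}
                for c in self.chunks]
```

Evicting near-duplicates keeps the buffer from filling up with slightly different snapshots of the same region, for example when the same file is saved repeatedly while being edited.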
Note that the entire state of the context is stored client-side and is sent to the server on each request.
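Because the state lives on the client, every request body carries the full list of chunks. A minimal sketch of assembling such a body is shown below; the field names come from this description, while the helper name `build_infill_body` and the truncation to the newest chunks are assumptions for illustration.

```python
import json


def build_infill_body(input_prefix, input_suffix, prompt, chunks, ring_n_chunks):
    """Serialize one /infill request body (hypothetical helper).

    `chunks` is a list of {"filename": str, "text": str} dicts, oldest first.
    """
    body = {
        "input_prefix": input_prefix,
        "input_suffix": input_suffix,
        "prompt": prompt,
        # at most ring_n_chunks entries, keeping the original order
        "input_extra": chunks[-ring_n_chunks:],
        # ask the server to reuse the KV cache of the previous request
        "cache_prompt": True,
    }
    return json.dumps(body)


payload = build_infill_body(
    "def add(a, b):\n", "\n", "    return ",
    [{"filename": "util.py", "text": "def sub(a, b):\n    return a - b\n"}],
    ring_n_chunks=64,
)
```

Keeping the chunk order stable between requests matters: the cache reuse described later depends on consecutive requests sharing as much of their token sequence as possible.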
Server-side processing
Upon receiving a request with N extra context chunks, the server constructs the following repo-level FIM prompt:

    <|repo_name|>{repo_name}      " --\
    <|file_sep|>{filename_0}      "   |
    {text_0}                      "   |
    <|file_sep|>{filename_1}      "   | extra (global) prompt
    {text_1}                      "   |
    ...                           "   |
    <|file_sep|>{filename_N-1}    "   |
    {text_N-1}                    " --/
    <|file_sep|>{filename}        " --\
    <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt} " --/ local FIM prompt
This is based on the work in https://arxiv.org/pdf/2409.12186. Note that not all models are trained for this pattern, so it is recommended to use models that support it, such as Qwen2.5-Coder. This prompt format has important advantages that allow efficient context reuse, discussed in the following paragraphs.
In this FIM prompt, the components correspond to:
<|repo_name|>, <|file_sep|>, <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> - special tokens defined by the model
filename_i - the filename of the i'th chunk in the "input_extra" array
text_i - the text of the i'th chunk in the "input_extra" array
prefix, suffix, prompt - the input from the "input_prefix", "input_suffix", and "prompt" fields of the request
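The assembly of this prompt can be sketched at string level as follows. The server works on tokens rather than strings, so the newline placement here is an assumption made only to keep the example readable, and `build_repo_fim_prompt` is a hypothetical name.

```python
def build_repo_fim_prompt(repo_name, filename, extra, prefix, suffix, prompt):
    """Concatenate the extra (global) chunks and the local FIM prompt.

    `extra` is the "input_extra" array: a list of {"filename", "text"} dicts.
    """
    parts = [f"<|repo_name|>{repo_name}\n"]
    for chunk in extra:
        text = chunk["text"]
        if not text.endswith("\n"):
            text += "\n"  # assumed: each chunk ends on a line boundary
        parts.append(f"<|file_sep|>{chunk['filename']}\n{text}")
    # the file being edited comes last, followed by the local FIM prompt
    parts.append(f"<|file_sep|>{filename}\n")
    parts.append(f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}")
    return "".join(parts)


text = build_repo_fim_prompt(
    "demo", "b.py",
    [{"filename": "a.py", "text": "x = 1\n"}],
    prefix="P", suffix="S", prompt="M",
)
```

Placing the extra chunks first and the frequently changing local part last is what makes the prefix-based cache reuse described next effective.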
The server processes the constructed prompt and then generates a maximum number of tokens that represent the FIM completion. The generation can be terminated early by several different conditions:
An end-of-generation (EOG) token is sampled
A maximum time limit optionally specified by the client is exceeded
An indentation constraint optionally specified by the client is not satisfied
The generated text is sent back to the client for display as a suggestion via virtual text overlay.
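The stopping conditions listed above can be sketched as a simple loop. This is an illustration under stated assumptions, not the server code: the order of the checks, the `<EOG>` marker, and the interpretation of the indentation constraint (a generated line after the first must not be indented less than `n_indent` spaces) are all assumptions, and `sample_next` / `elapsed_ms` are hypothetical callables supplied by the caller.

```python
def violates_indent(text, n_indent):
    """True if any non-empty line after the first is indented less than n_indent."""
    for line in text.split("\n")[1:]:
        if line.strip() and len(line) - len(line.lstrip(" ")) < n_indent:
            return True
    return False


def generate(sample_next, max_tokens, elapsed_ms, eog="<EOG>",
             t_max_ms=None, n_indent=None):
    out = ""
    for _ in range(max_tokens):          # hard cap on generated tokens
        piece = sample_next()
        if piece == eog:                 # end-of-generation token sampled
            break
        if t_max_ms is not None and elapsed_ms() > t_max_ms:
            break                        # client-specified time limit exceeded
        candidate = out + piece
        if n_indent is not None and violates_indent(candidate, n_indent):
            break                        # indentation constraint not satisfied
        out = candidate
    return out


pieces = iter(["foo(", ")\n", "    bar()\n", "baz", "<EOG>"])
suggestion = generate(pieces.__next__, max_tokens=16,
                      elapsed_ms=lambda: 0, n_indent=4)
```

In this example the generation stops at `baz`, because that line would fall outside the indentation level of the block being completed.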
KV cache reuse : global prefix
The first optimization technique for improving long-context performance is to simply reuse the KV cache computed for the common prefix shared with the previous request. This allows us to very efficiently append new chunks of extra context between the existing chunks and the <|fim_prefix|> token:

    <|repo_name|>{repo_name}      " --\
    <|file_sep|>{filename_0}      "   |
    {text_0}                      "   |
    <|file_sep|>{filename_1}      "   | extra context, cached and reused (no processing)
    {text_1}                      "   |
    ...                           "   |
    <|file_sep|>{filename_N-1}    "   |
    {text_N-1}                    " --/
    <|file_sep|>{filename_N}      " --\ new chunk,
    {text_N}                      " --/ processed with the new request
    <|file_sep|>{filename}
    <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}
Reusing the KV cache prefix is generally supported by the llama-server and only requires setting the "cache_prompt": true flag in the completion requests. With this option, each new completion request reuses the longest common prefix of tokens between the old and the new request. This saves a large part of the prompt processing in situations where the extra context does not change, or was extended by appending a new chunk at the end.
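The effect of prefix reuse can be illustrated with token ids. The ids below are made up for the example (0 stands for the repo-name token, 1 for the file separator); only the idea of taking the longest common prefix comes from the description above.

```python
def common_prefix_len(old_tokens, new_tokens):
    """Number of leading tokens shared by the previous and the new prompt."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


# previous request: two extra chunks, then the local FIM part (90, 91)
old = [0, 1, 10, 11, 1, 20, 21, 1, 90, 91]
# new request: a third chunk (30, 31) was appended before the local part
new = [0, 1, 10, 11, 1, 20, 21, 1, 30, 31, 1, 90, 92]

n_cached = common_prefix_len(old, new)   # tokens served from the KV cache
n_to_process = len(new) - n_cached       # tokens the server still has to compute
```

Only the new chunk and the local part are processed, regardless of how many chunks are already cached.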
KV cache reuse : context shift
The previous optimization is only useful up to g:llama_config.ring_n_chunks chunks of extra context. When the ring buffer becomes full, the first chunk is evicted, which "shifts" all following chunks to a new position relative to the start of the prompt:

    <|repo_name|>{repo_name}      " --\
    <|file_sep|>{filename_1}      "   |
    {text_1}                      "   |
    <|file_sep|>{filename_2}      "   | chunk 0 has been evicted
    {text_2}                      "   | the rest of the chunks have 'moved' one step towards the front
    ...                           "   |
    <|file_sep|>{filename_N}      "   |
    {text_N}                      " --/
    <|file_sep|>{filename}
    <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}
Because of this relative shift of D0 tokens, it is no longer possible to directly reuse the KV cache of the extra context. The reason is that the position of each token is encoded inside the KV cache data (e.g. via the RoPE operator) and the tokens are no longer at those positions (for more info, see #71 (comment)).
However, quite early in the project (#2060), we realized that the cache in this case can actually be efficiently reused by "updating" the encoded positions in the K cache. This follows from the observation that the RoPE operator is "additive". Roughly speaking, applying RoPE with position p1 = p0 + d is equivalent to applying:
RoPE at position p0
RoPE at position d on the already RoPE'd data from the previous step
This provides a very cheap way to "move" the remaining chunks in the ring buffer forward, towards the beginning of the context: simply apply RoPE with position -D0 to all tokens in the K cache that we want to reuse. Doing so, we again save the computation of a large portion of the extra prompt.
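The additivity can be checked numerically with a toy RoPE. The sketch below treats each pair of dimensions as one complex number rotated by an angle proportional to the position; the frequency schedule and the base of 10000 follow the common RoPE formulation and are assumptions here, not details taken from llama.cpp.

```python
import cmath


def rope(pairs, pos, base=10000.0):
    """Rotate each complex pair i by the angle pos * base**(-i / n)."""
    n = len(pairs)
    return [z * cmath.exp(1j * pos * base ** (-i / n)) for i, z in enumerate(pairs)]


def close(a, b, eps=1e-9):
    return all(abs(x - y) < eps for x, y in zip(a, b))


k = [1 + 2j, -0.5 + 0.25j, 3 - 1j]   # an un-rotated K vector (3 pairs)
p0, d, D0 = 17, 5, 12

# RoPE(p0 + d) equals RoPE(d) applied on top of RoPE(p0)
additive = close(rope(rope(k, p0), d), rope(k, p0 + d))

# context shift: a cached K at position p0 is moved D0 positions to the front
k_cached = rope(k, p0)
k_shifted = rope(k_cached, -D0)
shift_ok = close(k_shifted, rope(k, p0 - D0))
```

The shift is a single rotation per cached K vector, which is far cheaper than running the model again over the shifted tokens.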
Note that the described context shifting method is not mathematically identical to recomputing the entire prompt from scratch. The embeddings at each token position are "entangled" with all the embeddings before that position, so simply "shifting" the K cache positions will not produce exactly the same numbers as full reprocessing. Regardless, the context shifting feature has been used by the local llama.cpp community for more than a year now, and empirical results indicate that it is very effective and does not seem to degrade the quality of the output in a significant way. The cache reuse techniques described here rely heavily on this "trick".
The described context shifting strategy can also be applied when the evicted chunk is somewhere in the middle of the ring buffer, or even if there are multiple evicted chunks at a time. A detailed description of the implementation can be found in #5793 and #9866.
This context reuse strategy requires the llama-server to be started with the --cache-reuse N command-line argument. The N argument is the minimum size of the chunks (in number of tokens) that we will accept and shift in the KV cache for reuse purposes. The logic is that we don't want to reuse very small bits (e.g. individual tokens) from random places of the old context; instead, we are interested in reusing large contiguous blocks. Note that the implementation preserves the order of the reused chunks, so a shifted chunk will never move over another chunk (i.e. reused chunks always appear in the same relative order as when they were originally computed).
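A simplified sketch of such order-preserving block matching is shown below. It is an illustration of the idea only and is not the server implementation: `find_reusable_blocks` is a hypothetical name, and the greedy single-pass scan is an assumption.

```python
def find_reusable_blocks(cache, prompt, n_min):
    """Return (n_prefix, blocks) where blocks are (cache_pos, prompt_pos, length).

    Blocks shorter than n_min tokens are ignored. Both cursors only move
    forward, so reused blocks keep their original relative order.
    """
    n_prefix = 0
    while (n_prefix < min(len(cache), len(prompt))
           and cache[n_prefix] == prompt[n_prefix]):
        n_prefix += 1

    blocks = []
    head_c = head_p = n_prefix
    while head_c < len(cache) and head_p < len(prompt):
        n_match = 0
        while (head_c + n_match < len(cache)
               and head_p + n_match < len(prompt)
               and cache[head_c + n_match] == prompt[head_p + n_match]):
            n_match += 1
        if n_match >= n_min:
            # this block would be shifted by (head_p - head_c) positions
            blocks.append((head_c, head_p, n_match))
            head_c += n_match
            head_p += n_match
        else:
            head_c += 1   # too small to be worth reusing; skip this cache token
    return n_prefix, blocks


cache = [1, 2, 3, 10, 11, 12, 20, 21, 22, 23, 30, 31, 32, 33]
prompt = [1, 2, 3, 20, 21, 22, 23, 30, 31, 32, 33, 40, 41]
n_prefix, blocks = find_reusable_blocks(cache, prompt, n_min=3)
```

Here the chunk `10, 11, 12` was evicted, the eight tokens after it are reused with a shift of -3, and only `40, 41` remain to be computed.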
Applying these two techniques, we can now efficiently update the extra context of our FIM requests by adding and evicting chunks in any way the client decides. Existing chunks will not be recomputed and the server will process only new chunks that were not present in the previous request. The llama.vim plugin periodically updates the extra context ring buffer on the client side and sends the information to the server whenever it detects inactivity (i.e. the cursor hasn't moved for a certain period of time or we are currently in Normal mode). This makes the processing of the extra global context almost entirely seamless for the user, mitigating a huge portion of the latency of the naive approach.
KV cache reuse : local prefix
Let's now focus again on the local context part of the request and explain one additional cache reuse strategy that helps to further reduce the completion latency in some typical cases. All of the following examples assume the PSM (Prefix-Suffix-Middle) FIM pattern. A similar analysis can be made for the SPM (Suffix-Prefix-Middle) pattern, which is supported via the --spm-infill command-line argument of llama-server.

    " the PSM FIM pattern
    <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>{prompt}
Assume also that we are in the middle of editing a line of text and the client has already received a suggestion from the server:

    " this is the text file that we are editing:

    {prefix lines outside the scope of the local FIM context}
    {prefix_line_1}
    {prefix_line_2}
    {prefix_line_3}
    ...
    {prefix_line_P}
    {cur_line_prefix}█{cur_line_suffix}     " --> currently have a completion suggestion for this position
    {suffix_line_1}
    {suffix_line_2}
    ...
    {suffix_line_S}
    {suffix lines outside the scope of the local FIM context}
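Here is what the local FIM prompt looks like in more detail:

    <|fim_prefix|>{prefix_line_1}
    {prefix_line_2}
    {prefix_line_3}
    ...
    {prefix_line_P}
    <|fim_suffix|>{cur_line_suffix}
    {suffix_line_1}
    {suffix_line_2}
    ...
    {suffix_line_S}
    <|fim_middle|>{cur_line_prefix}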
From here, there are 3 typical follow-up completion requests that occur in most situations:
Same line FIM: the cursor moves left or right on the same line
Next line FIM: the cursor moves to the next line
Prev line FIM: the cursor moves to the previous line
Same line FIM
For clarity, assume the cursor moved {dx} tokens to the right (moving to the left follows the same logic). The new FIM prompt would look like this:

    <|fim_prefix|>{prefix_line_1}           " --\
    {prefix_line_2}                         "   |
    {prefix_line_3}                         "   | the cache is reused from the previous request
    ...                                     "   |
    {prefix_line_P}                         " --/
    <|fim_suffix|>{cur_line_suffix - dx}    " --\
    {suffix_line_1}                         "   |
    {suffix_line_2}                         "   |
    ...                                     "   | computed in the new request
    {suffix_line_S}                         "   |
    <|fim_middle|>{cur_line_prefix + dx}    " --/
In this case the entire local prefix will be reused since its contents and position are the same as in the previous request. This means that attempting FIM anywhere on the same line is quite cheap and involves recomputing only the suffix and middle tokens.
Next line FIM
In this case, the new FIM prompt after moving to the next line looks like this:

    <|fim_prefix|>{prefix_line_2}       " --\
    {prefix_line_3}                     "   |
    {prefix_line_4}                     "   | the cache is reused from previous request via context shift
    ...                                 "   |
    {prefix_line_P}                     " --/
    {prefix_line_P+1}                   " --> this is a new line added to the FIM prefix
    <|fim_suffix|>{new_line_suffix}     " --\
    {suffix_line_2}                     "   |
    {suffix_line_3}                     "   |
    ...                                 "   | computed in the new request
    {suffix_line_S+1}                   "   |
    <|fim_middle|>{new_line_prefix}     " --/
The old {prefix_line_1} line is now out of the FIM prefix scope and a new {prefix_line_P+1} line is within the FIM prefix scope. We can reuse the cache for lines [2, P] via context shifting, as explained earlier. So in this case, we compute only the new prefix line {prefix_line_P+1}, together with the new FIM suffix.
Prev line FIM
This case is the most cache-unfriendly one. Moving a line up, the new FIM prompt will look like this:

    <|fim_prefix|>{prefix_line_0}       " --> this line is completely new, so it breaks the cache reuse sequence very early
    {prefix_line_1}
    {prefix_line_2}
    ...
    {prefix_line_P-1}
    <|fim_suffix|>{new_line_suffix}
    {suffix_line_0}
    {suffix_line_1}
    ...
    {suffix_line_S-1}
    <|fim_middle|>{new_line_prefix}
Because we haven't computed the {prefix_line_0} line in the previous request, the cache reuse logic has to stop at the very start of the local FIM prompt. Therefore, in this case we don't reuse any of the previous local FIM cache and we need to compute the entire local FIM prompt.
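The three cases can be compared with a toy model in which every line (and every special token) counts as one item. The matching routine is a simplified, forward-only scan written for this example, not the server code, and the window sizes P = 4 and S = 3 are arbitrary.

```python
def count_reused(cache, prompt, n_min=2):
    """Items of `prompt` served from `cache`: common prefix plus shifted blocks."""
    n = 0
    while n < min(len(cache), len(prompt)) and cache[n] == prompt[n]:
        n += 1
    reused, hc, hp = n, n, n
    while hc < len(cache) and hp < len(prompt):
        m = 0
        while (hc + m < len(cache) and hp + m < len(prompt)
               and cache[hc + m] == prompt[hp + m]):
            m += 1
        if m >= n_min:
            reused, hc, hp = reused + m, hc + m, hp + m
        else:
            hc += 1
    return reused


FILE = [f"line{i}" for i in range(12)]
P, S = 4, 3  # prefix / suffix window sizes in lines (arbitrary)


def prompt_at(row, col):
    """PSM prompt for a cursor at (row, col); the split line is kept symbolic."""
    cur = FILE[row]
    return (["<FP>"] + FILE[row - P:row]
            + ["<FS>", f"{cur}[{col}:]"] + FILE[row + 1:row + 1 + S]
            + ["<FM>", f"{cur}[:{col}]"])


previous = prompt_at(5, 3)
same_line = count_reused(previous, prompt_at(5, 4))  # prefix token, 4 lines, suffix token
next_line = count_reused(previous, prompt_at(6, 0))  # prefix token + 3 shifted lines
prev_line = count_reused(previous, prompt_at(4, 0))  # only the prefix token itself
```

The counts fall from the same-line case to the previous-line case, matching the ordering described above.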
Expected performance
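On each FIM request, the server takes a maximum of 1 full batch of tokens from the provided local context. The prefix and suffix tokens are split in a ratio of 3:1 (see https://github.com/ggerganov/llama.cpp/blob/32927e68b7fbfd6dfa82e531d186f1b6b22612ae/examples/server/server.cpp#L2055-L2062).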
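A minimal sketch of such a 3:1 split is shown below. It only illustrates the ratio; the actual server code may reserve room for special tokens or handle short prefixes and suffixes differently, and `split_local_context` is a hypothetical name.

```python
def split_local_context(prefix_tokens, suffix_tokens, n_batch):
    """Keep at most 3/4 of a batch from the prefix and 1/4 from the suffix."""
    n_prefix = min(len(prefix_tokens), 3 * n_batch // 4)
    n_suffix = min(len(suffix_tokens), n_batch // 4)
    # keep the tokens closest to the cursor:
    # the end of the prefix and the start of the suffix
    kept_prefix = prefix_tokens[len(prefix_tokens) - n_prefix:]
    kept_suffix = suffix_tokens[:n_suffix]
    return kept_prefix, kept_suffix


prefix = list(range(2000))
suffix = list(range(5000, 7000))
kept_prefix, kept_suffix = split_local_context(prefix, suffix, n_batch=1024)
```

With a batch of 1024 tokens this keeps the last 768 prefix tokens and the first 256 suffix tokens.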
This means that for new FIM requests, there will be at most --batch tokens to process, while in most cases far fewer tokens are processed thanks to the cache reuse optimizations described above. Knowing this, we can estimate the typical performance of FIM requests using the llama-batched-bench tool. Here is some analysis on M1 Pro and M2 Ultra using Qwen2.5-Coder 1.5B and 7B models:
From these numbers we can estimate the prompt processing and text generation speeds, as well as the expected FIM time at different levels of context occupation. Here we assume that the FIM request requires processing 1/4 of --batch tokens as prompt and generating 32 tokens as a suggestion:
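The estimate amounts to simple arithmetic, sketched below. The speeds used in the example are placeholders: 1000 tokens/s for prompt processing mirrors the rough "1 second per 1000 tokens" datapoint quoted earlier for a 7B model on M2 Ultra, while the 50 tokens/s generation speed is an invented value, not a measurement.

```python
def expected_fim_ms(n_batch, pp_tokens_per_s, tg_tokens_per_s, n_gen=32):
    """Expected FIM time: process n_batch/4 prompt tokens, then generate n_gen."""
    n_prompt = n_batch / 4
    seconds = n_prompt / pp_tokens_per_s + n_gen / tg_tokens_per_s
    return 1000.0 * seconds


# placeholder speeds, see the note above
estimate = expected_fim_ms(n_batch=1024, pp_tokens_per_s=1000.0, tg_tokens_per_s=50.0)
```

Since both speeds drop as the context fills up, the same formula would be evaluated with the speeds measured at each level of context occupation.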
Examples
Using llama.vim on M1 Pro (2021) with Qwen2.5-Coder 1.5B Q8_0:
The orange text is the generated suggestion. The green text contains performance stats for the FIM request: the currently used context is 15186 tokens and the maximum is 32768. There are 30 chunks in the ring buffer with extra context (out of 64). So far, 1 chunk has been evicted in the current session and there are 0 chunks in queue. The newly computed prompt tokens for this request were 260 and the generated tokens were 25. It took 1245 ms to generate this suggestion after entering the letter c on the current line.
Using llama.vim on M2 Ultra with Qwen2.5-Coder 7B Q8_0:
llama.vim-0-lq.mp4
Demonstrates that the global context is accumulated and maintained across different files and showcases the overall latency when working in a large codebase.
TODO
add infill sampler
suggest gray text
auto-suggest
rolling context
clipboard/yanked contents
test if multi-user works (increase -np and -c as needed)
This plugin (or script?) was quite fun to implement! Will be merging after a few days of testing. If anyone gives this a try, would be happy to hear any feedback. This is running pretty smooth on M2 Ultra with Qwen2.5 7B Q8, though I think it should work reasonably well even on lower end hardware.
As far as I know, the async job and virtual text APIs are a bit different in Vim. Though it's probably quite easy to adapt the script to work both with Vim and Neovim.
The llama.vim plugin is now available as a standalone repo at https://github.com/ggml-org/llama.vim. This makes it possible to install the plugin through popular plugin managers.
The llama.cpp server relies heavily on KV cache reuse to achieve low-latency completions, as explained here: ggml-org/llama.cpp#9787

When we send the contents of the ring buffer (via `input_extra`) for a FIM completion request, we only send chunks from the ring buffer that correspond to the active project. (The llama.vim implementation does not have the concept of an active project, only a single global ring buffer.)

However, we were priming the KV cache in the background by sending *all* chunks from the global ring buffer, undermining the ability to reuse the KV cache for real FIM completion requests.

This commit changes the implementation of background cache priming to be project aware, including chunks only from a single project per request.
ref ggml-org/p1#1
The plugin is now developed here: https://github.com/ggml-org/llama.vim
Overview
Add a simple Neovim plugin for local LLM-assisted code/text completion.
Features
Insert mode, Ctrl+F, Tab, Shift+Tab
Usage
Set up a llama-server instance with a FIM-compatible model (RoPE required). The plugin works best with Qwen2.5-Coder models (not Instruct)
Copy or symlink examples/llama.vim to ~/.config/nvim/autoload/llama.vim
Start Neovim and run:
For more advanced options, check the parameters in g:llama_config in examples/llama.vim: https://github.com/ggerganov/llama.cpp/blob/acf6d1909ca38cbd5e057debbd291e89205c3afc/examples/llama.vim#L43-L86
Sample configs based on hardware
High-end hardware with GPU
Mid-end hardware with GPU
Low-end hardware with GPU
Low-end hardware (CPU only)
Backend changes
Debugging
llama-server .. -lv 1
GGML_DEBUG_SAMPLER_INFILL in llama-sampling.cpp
Technical details
The plugin uses the /infill endpoint of the llama-server. It sends asynchronous FIM requests to the server via the curl tool.
The "input_prefix" and "input_suffix" are constructed by picking nearby lines around the cursor location.
The "prompt" is set as the text to the left of the cursor on the current line.