Project status #3471

Closed
ggerganov started this conversation in General
Oct 4, 2023 · 4 comments · 10 replies

[NO LONGER UPDATED]

Below is a summary of the functionality provided by the llama.cpp project.

  • The goal is to have a bird's-eye view of what works and what does not
  • Collaborators are encouraged to add things to the list and update the status of existing entries as needed
  • The list should stay simple, without too many details about specific problems - those belong in dedicated issues

Legend (feel free to update):

✅ - Working correctly
☁️ - Partially working
❌ - Failing
❓ - Status unknown (needs testing)
🔬 - Under investigation
🚧 - Currently in development

| Feature | Executable | Status | Issues |
| --- | --- | --- | --- |
| **Inference** | | | |
| Single-batch decoding | `main`, `simple` | | |
| Parallel / batched decoding | `batched` | | |
| Continuous batching | `parallel` | | |
| Speculative sampling | `speculative` | | |
| Tree-based speculative sampling | `speculative` | | |
| Self-speculative sampling | `speculative` | 🚧 | #3565 |
| Lookahead sampling | `lookahead` | | |
| Infill | `infill` | | |
| REST API | `server` | | |
| Embeddings | `embedding` | | |
| Grouped Query Attention CPU | `main` | | |
| Grouped Query Attention CUDA | `main` | | |
| Grouped Query Attention OpenCL | `main` | | |
| Grouped Query Attention Metal | `main` | | |
| Session load / save | `main` | | |
| K-quants (256) CUDA | `main` | | |
| K-quants (64) CUDA | `main` | | |
| K-quants (256) Metal | `main` | | |
| K-quants (64) Metal | `main` | ☁️ | #3276 |
| Special tokens | `main` | | |
| Grammar sampling | `main`, `server` | | |
| Beam search | `beam-search` | | #3471 (comment) |
| LoRA | `main` | ☁️ | #3333, #3519 |
| SPM tokenizer | `test-tokenizer-0-llama` | | |
| BPE tokenizer | `test-tokenizer-0-falcon` | | |
| **Models** | | | |
| LLaMA v1 | `main` | | |
| LLaMA v2 | `main` | | |
| Falcon | `main` | | |
| StarCoder | `main` | | |
| Baichuan | `main` | | |
| MPT | `main` | | |
| Persimmon | `main` | | |
| LLaVA | `llava` | | |
| Refact | `main` | | |
| Bloom | `main` | | |
| StableLM-3b-4e1t | `main` | | |
| **Training** | | | |
| Finetuning CPU | `finetune` | | |
| Finetuning Metal | `finetune` | 🔬 | |
| **Backends** | | | |
| CPU x64 | `ggml` | | |
| CPU Arm | `ggml` | | |
| GPU CUDA | `ggml-cuda` | | |
| GPU ROCm | `ggml-cuda` | | |
| GPU Metal | `ggml-metal` | | |
| GPU OpenCL | `ggml-opencl` | | |
| GPU Vulkan | `ggml-vulkan` | 🚧 | #2059 |

Replies: 4 comments 10 replies


What does the "☁️" mean?

2 replies
@shibe2

I don't know what the icon means, but the current status of the OpenCL back-end is: it works with supported models, but it is buggy and perhaps slower than it could be.

@ggerganov (Maintainer, Author) · Oct 4, 2023

Yup, this was my impression from reading a few issues lately. If you think that's not the case, feel free to update it. I just haven't set up OpenCL in my environment and cannot run tests atm.


So "Parallel decoding" is done bybatched and "Continuous batching" is done byparallel? Are these reversed?

1 reply
@ggerganov (Maintainer, Author) · Oct 5, 2023

Parallel decoding is also called "batched decoding", hence `batched`. The `parallel` example demonstrates a basic server that serves clients in parallel - it just happens to have the continuous batching feature as an option.

Naming things is hard :) Sorry if these are confusing
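
To make the distinction concrete, here is a hypothetical toy scheduler (plain C++, not the actual `parallel` example) showing the core idea of continuous batching: all busy slots decode one token per step as a single batch, and a slot whose sequence finishes is refilled from the pending queue right away instead of waiting for the whole batch to drain:

```cpp
#include <cstdio>
#include <deque>
#include <string>
#include <vector>

struct Request { std::string name; int remaining; }; // tokens left to generate
struct Slot    { Request req; bool busy = false; };

int main() {
    std::deque<Request> pending = {
        {"req A", 3}, {"req B", 1}, {"req C", 2}, {"req D", 2},
    };
    std::vector<Slot> slots(2); // batch width: 2 sequences decoded per step

    auto any_busy = [&] {
        for (const auto & s : slots) if (s.busy) return true;
        return false;
    };

    for (int step = 0; !pending.empty() || any_busy(); ++step) {
        // refill free slots from the queue - this is the "continuous" part
        for (auto & s : slots) {
            if (!s.busy && !pending.empty()) {
                s.req = pending.front();
                pending.pop_front();
                s.busy = true;
            }
        }
        // one batched decode step: every busy slot produces one token
        for (auto & s : slots) {
            if (!s.busy) continue;
            std::printf("step %d: decode one token for %s\n", step, s.req.name.c_str());
            if (--s.req.remaining == 0) s.busy = false; // slot frees mid-run
        }
    }
    return 0;
}
```

With a batch width of 2 and the four queued requests above, "req B" finishes after one step and "req C" takes over its slot on the very next step while "req A" is still decoding, so the batch stays full.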


Should beam search be added here? I think it is broken atm, at least with CUDA.

4 replies
@ggerganov (Maintainer, Author) · Oct 8, 2023

Yes, it should be added. The list is far from complete.

@Mihaiii

Fwiw, for me beam search is broken even without CUDA, in the sense that when I run the example, nothing happens (it just hangs for minutes at this line until I Ctrl+C it).

If it's an unknown problem, I'll open an issue (tbh, it's strange that nobody has mentioned it before, so maybe I'm doing something wrong).

Update: when it hangs on the above-mentioned line, I see 0 hard page faults/sec.

@slaren

With CUDA it works for a while, but then it starts generating gibberish. I think the calls to `llama_decode` are failing and it is not catching that. It's probably missing some KV cache management after the batched decoding change.
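
As a sketch of what catching this could look like (a hypothetical helper, not code from the beam-search example, assuming the return-value convention documented in llama.h: 0 means success, a positive value means no KV cache slot was found for the batch, and a negative value is a fatal error):

```cpp
#include "llama.h"

#include <cstdio>

// Hypothetical wrapper: decode a batch and surface failures instead of
// silently continuing (silent failures are one way to end up with gibberish).
static bool decode_checked(llama_context * ctx, llama_batch batch) {
    const int ret = llama_decode(ctx, batch);
    if (ret < 0) {
        std::fprintf(stderr, "llama_decode: fatal error %d\n", ret);
        return false;
    }
    if (ret > 0) {
        // Not fatal, but the batch was not processed: the KV cache is full.
        // The caller needs to free space (e.g. remove finished sequences with
        // llama_kv_cache_seq_rm) and retry - the kind of KV cache management
        // that may be missing here.
        std::fprintf(stderr, "llama_decode: no KV cache slot for batch (%d)\n", ret);
        return false;
    }
    return true;
}
```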

@ggerganov (Maintainer, Author) · Oct 18, 2023

The beam search functionality should be moved out of the library and implemented as a standalone example.


What would be the criteria for considering the OpenCL back-end to be working correctly? I've fixed all known bugs in ggml-opencl.cpp and am now working on refactoring, like #3669.

3 replies
@ggerganov (Maintainer, Author) · Oct 18, 2023

The criterion is: if it runs correctly on your machine, then it is ✅ until someone reports a reproducible problem; at that point it becomes ☁️ or ❌, depending on how broken the thing is.

@shibe2

Alright, turning on the green light then!

@Yossef-Dawoad

Maybe you could ditch the icons for letter grades (A+, A, A-, B, ...). That would make it obvious when something works fine but still needs improvement (it gets an A-), and so on. Maybe something like this:

[ A+ ] or [ A ]: working like a charm
[ A- ]: working correctly but needs improvement
[ B ]: partially working
[ B- ]: partially working, with big issues to be resolved
[ C ]: status unknown (needs testing)
[ D+ ]: under investigation
[ D ]: currently in development
[ F ]: failing

Maybe you should also add a column for support tier, e.g. whether a feature is tier 1 or tier 2. What do you think?
