Implement multimodal models (LLaVA) #3436
Conversation
staviq commented Oct 2, 2023
Some time ago I was playing with the idea of allowing images to be uploaded. Would it be helpful for testing if I make a PR with this change? The idea was to import images client-side, in the browser, draw them on a hidden canvas, and export as PPM; this would allow such an image to be processed server-side without relying on any external libraries/dependencies. I could add image upload to the UI as well. Let me know if you are interested.
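What makes PPM attractive here is that the binary P6 variant is just a short ASCII header followed by raw RGB bytes, so the server can read it with no image library at all. A minimal round-trip sketch of the format (in Python for illustration; the proposal above would do the writing client-side in JavaScript via a canvas):

```python
import os, tempfile

def write_ppm(path, width, height, pixels):
    # Binary P6 PPM: ASCII header ("P6", dimensions, max value) + raw RGB bytes.
    assert len(pixels) == 3 * width * height
    with open(path, "wb") as f:
        f.write(b"P6\n%d %d\n255\n" % (width, height))
        f.write(pixels)

def read_ppm(path):
    # Minimal P6 parser; ignores the optional '#' comment lines a full parser allows.
    with open(path, "rb") as f:
        assert f.readline().strip() == b"P6"
        width, height = map(int, f.readline().split())
        assert int(f.readline()) == 255
        return width, height, f.read(3 * width * height)

# Round-trip a 2x1 image: one red pixel, one green pixel.
path = os.path.join(tempfile.gettempdir(), "demo.ppm")
write_ppm(path, 2, 1, bytes([255, 0, 0, 0, 255, 0]))
w, h, px = read_ppm(path)
```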
monatis commented Oct 2, 2023
Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great after this PR matures. The inference code of CLIP, copied from another repo of mine, seems to require some refactoring due to the different versions of GGML used. Currently I'm trying to debug and fix it; once done, I can move faster and we can collaborate on integration with the UI.
staviq commented Oct 2, 2023
I completely missed that stb is licensed under MIT, that's cool. No format shenanigans necessary then. OK, take your time, I'll wait until you feel comfortable with UI integration.
monatis commented Oct 7, 2023
Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do so with my physical machine being slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend.
monatis commented Oct 7, 2023
This is now working with the recently published LLaVA v1.5. The CLIP part consumes a huge amount of memory; I'll optimize it.
monatis commented Oct 8, 2023
@josephilome this shouldn't be that hard; I can implement it once the current implementation is optimized.
monatis commented Oct 9, 2023 • edited
There are still some tasks to do, but I think this is ready for testing / feedback / reviews. A pre-converted model can be found here. You need to download one of the ggml-model-[f16|q5_k|q4_k].gguf models and the mmproj-model-f16.gguf (the image encoder). This two-file format is faster to move right now, but we can think of a single-file format in the future. Also see the README. I'll add more documentation, do code cleanup, and address reviews this afternoon. Any feedback is welcome.
ggerganov commented Oct 9, 2023 • edited
@monatis Awesome stuff! I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try! Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck?
ExtReMLapin commented Jan 31, 2024
Any plan to update the GGUF for LLaVA 1.6?
Green-Sky commented Jan 31, 2024 • edited
Oh, they released them a few days ago: https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2
I had only seen the 1.6 preview in their HF space, but no mention of it anywhere else on the internet :)
edit: blog post: https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
ExtReMLapin commented Feb 1, 2024 • edited
Even if you convert the safetensors file into a torch .bin file, you will get this error when trying to convert to GGUF.
gamester2665 commented Feb 1, 2024 • edited
Yup, can confirm: following #2948 doesn't yield a valid llava-v1.6-mistral-7b GGUF... any suggestions?
ExtReMLapin commented Feb 1, 2024
And that's the first one that fails (pretty much the first or second layer, lmao).
chigkim commented Feb 1, 2024 • edited
Looping in @haotian-liu and @cmp-nct in case they could help with LLaVA v1.6.
cjpais commented Feb 1, 2024 • edited
I've got a hacked-up script that works for 1.6; I will share it shortly on a fork. Raw script (breaks llava 1.5 support): llava1.6-surgery-hack.py
Note: the location of the mmproj is different between 34B and 7B; it's probably best to search for all of the mmproj tensors, split them all out, save them, and re-save each checkpoint without them.
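The splitting step described in the note above amounts to filtering the checkpoint's state dict by tensor name. A sketch of that idea (the `mm_projector` marker substring and the toy keys are assumptions for illustration; on a real checkpoint you would load and save the dicts with torch.load/torch.save):

```python
def split_mmproj(state_dict, marker="mm_projector"):
    # Split tensors whose key mentions the multimodal projector (assumed
    # marker substring, since its path differs between the 34B and 7B
    # checkpoints) from the rest of the model.
    proj = {k: v for k, v in state_dict.items() if marker in k}
    rest = {k: v for k, v in state_dict.items() if marker not in k}
    return proj, rest

# Toy stand-in for a real state dict (values would be tensors):
ckpt = {
    "model.layers.0.self_attn.q_proj.weight": "tensor-a",
    "model.mm_projector.0.weight": "tensor-b",
    "model.mm_projector.0.bias": "tensor-c",
}
proj, rest = split_mmproj(ckpt)
```

The projector dict would become the mmproj file, and the remaining dict would be re-saved as the checkpoint to convert to GGUF.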
cmp-nct commented Feb 1, 2024 • edited
I'm also halfway there but occupied with real-world stuff. I've created a draft PR to use as a base for 1.6: #5267. Right now I am struggling with the new ViT. Even without the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good.
cjpais commented Feb 2, 2024 • edited
gamester2665 commented Feb 2, 2024
Awesome! Thanks @cjpais. Throwing it into LM Studio for testing now.
BBC-Esq commented Feb 2, 2024
Did it work in LM Studio?
gamester2665 commented Feb 2, 2024
@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf is working successfully in LM Studio.
BBC-Esq commented Feb 2, 2024
You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp; any good issues/discussions to see whether you move that fast with whisper.cpp too?
ExtReMLapin commented Feb 2, 2024
bruh moment
aymenabid-lab commented Mar 25, 2024
I'm using LLaVA; how do I modify the batch size to avoid this error?
tokenizer, model, image_processor, context_len = load_pretrained_model(
cebtenzzre commented Mar 25, 2024
You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo.
Closes #3332
This is still WIP and highly experimental.
The work started in lmm.cpp, but it turned out to be also OK to implement it in this repo, which I believe will be much simpler.
The plan is to perform a surgery on LLaVA models and export:
llava executable. Usage:
This will output a detailed description of the image.
Note: You can override the default textual prompt "Describe the image in detail." by adding -p "custom prompt comes here". Run ./bin/llava for other options.
Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.