Implement multimodal models (LLaVA) #3436


Merged

monatis merged 36 commits into master from llava on Oct 12, 2023

Conversation

@monatis
Collaborator

monatis commented Oct 2, 2023

closes #3332

This is still WIP and highly experimental.

The work started in lmm.cpp,
but it turned out to be fine to implement it in this repo as well, which I believe will be much simpler.

The plan is to perform surgery on LLaVA models (see the sketch below) and export:

  1. a regular llama.gguf file,
  2. a custom CLIP model with multimodal projector on top of it.
  • GGUF support for CLIP and LLaVA model surgery is already done.
  • E2E inference of LLaVA V1.5.
  • Use the GGML allocator API and cleanup the code.
  • Better CLI args handling in the llava executable.
  • Upload pre-converted models and write a readme.
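
To make the surgery step above concrete, here is a minimal, untested sketch of what splitting a LLaVA checkpoint can look like in Python (the input file name and the "mm_projector" tensor prefix are assumptions for illustration; the actual surgery script in this PR may differ):

```python
# Split a LLaVA checkpoint: keep the plain LLaMA weights for convert.py and
# set the multimodal projector tensors aside so they can be packed into the
# CLIP/mmproj GGUF together with the vision encoder.
import torch

checkpoint = torch.load("pytorch_model.bin", map_location="cpu")  # hypothetical path

projector = {k: v for k, v in checkpoint.items() if "mm_projector" in k}
for k in projector:
    del checkpoint[k]  # what remains is a regular LLaMA-style state dict

torch.save(projector, "llava.projector")               # later merged into the mmproj GGUF
torch.save(checkpoint, "pytorch_model.stripped.bin")   # safe input for convert.py
```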

usage:

  • Build with cmake.
  • From this link, download mmproj-model-f16.gguf and one of ggml-model-[f16|q5_k|q4_k].gguf.
  • Run:
./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg

This will output the detailed description of the image.

Note: You can override the default textual prompt "Describe the image in detail." by adding -p "custom prompt comes here". Run ./bin/llava for other options.

Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.
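
For example, combining the two notes above into a single invocation (the flags are the ones documented here; the model, mmproj, and image paths are placeholders):

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg -p "What is shown in this image?" --temp 0.1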

@staviq
Contributor

Some time ago I was playing with the idea of allowing images to be uploaded via the server web UI. I had a working PoC, but dropped the idea since nobody was working on multimodal functionality back then.

Would it be helpful for testing if I make a PR with this change?

The idea was to import images client side, in the browser, draw them on a hidden canvas and export them as PPM; this would allow such an image to be processed server side without relying on any external libraries/dependencies.

I could add image upload to the server UI and a simple image wrapper class/functions on the cpp side.

Let me know if you are interested.
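
As a side note on why PPM needs no extra dependencies: parsing a binary (P6) PPM takes only a few lines. A rough Python sketch of the idea (the real server-side code would of course be C++, and this ignores comment lines and non-8-bit files):

```python
import re

def read_ppm_p6(path):
    """Return (width, height, raw RGB bytes) from a binary P6 PPM file."""
    with open(path, "rb") as f:
        data = f.read()
    # Header: "P6", width, height, maxval, then one whitespace byte followed
    # by width * height * 3 bytes of RGB data.
    m = re.match(rb"P6\s+(\d+)\s+(\d+)\s+(\d+)\s", data)
    if not m:
        raise ValueError("not a P6 PPM file")
    width, height, maxval = (int(g) for g in m.groups())
    if maxval != 255:
        raise ValueError("only 8-bit channels handled in this sketch")
    pixels = data[m.end():m.end() + width * height * 3]
    return width, height, pixels
```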


@monatis
Collaborator (Author)

Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great after this PR matures. It seems to require some refactoring of the CLIP inference code, copied from another repo of mine, due to the different versions of GGML used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

@staviq
Contributor

> Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great after this PR matures. It seems to require some refactoring of the CLIP inference code, copied from another repo of mine, due to the different versions of GGML used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

I completely missed that stb is licensed under MIT, that's cool. No format shenanigans necessary then.

OK, take your time then. I'll wait until you feel ready for UI integration.


@ggerganov added the model (Model specific) label on Oct 3, 2023
@monatis
Collaborator (Author)

Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do with my physical machine, which is slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend.


@monatis
Collaborator (Author)

This is now working with the recently published LLaVA V1.5. The CLIP part consumes a huge amount of memory -- I'll optimize it with ggml_allocr and clean up the implementation tomorrow.


@monatis
Collaborator (Author)

@josephilome this shouldn't be that hard -- I can implement it once the current implementation is optimized.


@monatis
Collaborator (Author)

monatis commented Oct 9, 2023

There are still some tasks to do but I think this is ready for testing / feedback / reviews.

A pre-converted model can be found here.

You need to download one of the ggml-model-[f16|q5_k|q4_k].gguf models and mmproj-model-f16.gguf (the image encoder). This two-file format is faster to move with right now, but we can think of a single-file format in the future. Also see the readme.

I'll add more documentation, do code cleanup and address reviews this afternoon. Any feedback is welcome.


@monatis marked this pull request as ready for review on October 9, 2023, 06:55
@ggerganov
Member

ggerganov commented Oct 9, 2023

@monatis Awesome stuff!

I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try!

Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck?


@ggerganov added the high priority (Very important issue) label on Oct 9, 2023
@ExtReMLapin
Contributor

Any plan to update the GGUF for LLaVA 1.6?


@Green-Sky
Collaborator

Green-Sky commented Jan 31, 2024

Oh, they released them: https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2

A few days ago I only saw the 1.6 preview in their HF space, but no mention of it anywhere else on the internet :)

Edit: blog post https://llava-vl.github.io/blog/2024-01-30-llava-1-6/


@ExtReMLapin
Contributor

ExtReMLapin commented Feb 1, 2024

Even if you convert the safetensors files into torch .bin files, you will get this error when trying to convert to GGUF:

  File "/opt/LLaVA/llama.cpp/convert.py", line 1474, in <module>    main()  File "/opt/LLaVA/llama.cpp/convert.py", line 1460, in main    model   = convert_model_names(model, params)  File "/opt/LLaVA/llama.cpp/convert.py", line 1198, in convert_model_names    raise Exception(f"Unexpected tensor name: {name}")Exception: Unexpected tensor name: model.image_newline

@gamester2665

gamester2665 commented Feb 1, 2024

Yup.. can confirm that following #2948 doesn't yield a valid llava-v1.6-mistral-7b GGUF... any suggestions?

$ python llama.cpp/convert.py llava-hf \
>   --outfile llava-v1.6-mistral-7b-GGUF.gguf \
>   --outtype f32
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00002-of-00004.safetensors
Loading model file llava-hf\model-00003-of-00004.safetensors
Loading model file llava-hf\model-00004-of-00004.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.AllF32: 0>, path_model=WindowsPath('llava-hf'))
Found vocab files: {'tokenizer.model': WindowsPath('llava-hf/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': WindowsPath('llava-hf/tokenizer.json')}
Loading vocab file 'llava-hf\tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
...
Permuting layer 31
model.embed_tokens.weight -> token_embd.weight | BF16 | [32000, 4096]
Traceback (most recent call last):
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1474, in <module>
    main()
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1460, in main
    model = convert_model_names(model, params)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline
(llama-new)

@ExtReMLapin
Contributor

And that's the first one that fails (pretty much the first or second layer lmao)

@chigkim

chigkim commented Feb 1, 2024

Looping in @haotian-liu and @cmp-nct in case they could help with LLaVA V1.6.

@cjpais
Contributor

cjpais commented Feb 1, 2024

I've got a hacked-up script that works for 1.6; will share shortly on a fork.

Raw script (breaks llava 1.5 support): llava1.6-surgery-hack.py

  • loads safetensors
  • removes "model.image_newline" for convert.py; I don't know the impact of this
  • splits mm_projector into a new file
  • saves the updated safetensors which have been modified

Note: the location of the mmproj is different between 34B and 7B; it's probably best to search for all of the mmproj tensors, split them all out, save them, and resave each checkpoint without them (a rough sketch of this idea follows below).
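
For illustration only, a minimal, untested sketch of that search-and-split approach across sharded safetensors (the shard glob, output file names, and the "mm_projector"/"model.image_newline" matching are assumptions; see llava1.6-surgery-hack.py and #5267 for the real scripts):

```python
# Scan every checkpoint shard, pull out the multimodal projector tensors (plus
# the LLaVA 1.6 model.image_newline tensor), and resave the shards without
# them so convert.py only sees plain LLM tensors.
import glob
from safetensors.torch import load_file, save_file

projector = {}
for shard in sorted(glob.glob("model-*.safetensors")):
    tensors = load_file(shard)
    mm_keys = [k for k in tensors if "mm_projector" in k or k == "model.image_newline"]
    for k in mm_keys:
        projector[k] = tensors.pop(k)
    save_file(tensors, shard)  # resave the shard without the multimodal tensors

save_file(projector, "llava.projector.safetensors")  # to be packed into the mmproj GGUF
```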

@cmp-nct
Contributor

cmp-nct commented Feb 1, 2024

I'm also halfway there, but occupied with real-world stuff.
The main task for 1.6 is to implement the new 'unpad' mechanism.

I've created a draft PR to use as a base for 1.6: #5267
It uses a clean surgery script which should work with all variants of LLaVA, and it also supports searching for tensors (though it currently does not search for the projector, only for the ViT).
The projector GGUF file is also prepared for the new features (spatial_unpad); the new tensor is moved in there.

Right now I am struggling with the new ViT:

  size mismatch for vision_model.encoder.layers.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([13824]).

That's ffn_down and ffn_up.

Even without the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good.


@cjpais
Contributor

cjpais commented Feb 2, 2024

Not sure if it's okay to share here...
For those who are looking, here are initial GGUF quants for LLaVA 1.6:

Please note they are very early, built from the hacked surgery script. Improvements are coming in #5267 from @cmp-nct; I will try to contribute where I can, but I am nothing close to an expert.

7b mistral
34b


@gamester2665

Awesome! Thanks @cjpais .. throwing it into LM Studio for testing now.

@BBC-Esq

Did it work in LM Studio?

@gamester2665

@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf is working successfully in LM Studio.


@BBC-Esq

You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp; are there any good issues/discussions to check whether you move that fast with whisper.cpp too?

@ExtReMLapin
Contributor

  • removes "model.image_newline" forconvert.py, I don't know the impact of this

bruh moment


@aymenabid-lab

I'm using LLaVA.

How do I modify the batch size to avoid this error?

  • From Python within the terminal:
    python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /home/dl_g15/llava-v1.5-13b
    =>
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 8.06 MiB is free. Including non-PyTorch memory, this process has 7.73 GiB memory in use. Of the allocated memory 7.60 GiB is allocated by PyTorch, and 7.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  • From Anaconda:
    model_path = "/home/dl_g15/llava-v1.5-13b"

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )
=>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@cebtenzzre
Collaborator

> I'm using LLaVA.

You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo.



Reviewers

@ggerganov approved these changes


@doomed151 left review comments

@Green-Sky left review comments


Assignees

No one assigned

Labels

high priority (Very important issue), llava (LLaVa and multimodal), model (Model specific), need feedback (Testing and feedback with results are needed)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

llama : add multimodal support (LLaVA)

34 participants

@monatis @staviq @ggerganov @Green-Sky @Galunid @aiaicode @BarfingLemurs @gcardoso2314 @cebtenzzre @rlancemartin @ilteris @cednats @haotian-liu @bkbasavaraju @QueryType @aisensiy @LumenYoung @TikaToka @pudepiedj @Lurrobert @kiiwee @ASmallPotato @djasil @RachelShalom @ExtReMLapin @gamester2665 @chigkim @cjpais @cmp-nct @BBC-Esq @aymenabid-lab @doomed151 @phymbert
