Implement multimodal models (LLaVA) #3436


Merged

monatis merged 36 commits into master from llava on Oct 12, 2023

Conversation

@monatis
Collaborator

monatis commented Oct 2, 2023

closes #3332

This is still WIP and highly experimental.

The work started in lmm.cpp,
but it turned out to be fine to implement it in this repo as well, which I believe will be much simpler.

The plan is to perform surgery on LLaVA models (see the sketch below) and export:

  1. a regular llama.gguf file,
  2. a custom CLIP model with multimodal projector on top of it.
  • GGUF support for CLIP and LLaVA model surgery is already done.
  • E2E inference of LLaVA V1.5.
  • Use the GGML allocator API and cleanup the code.
  • Better CLI args handling in the llava executable.
  • Upload pre-converted models and write a readme.
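
To make the surgery step above concrete, here is a minimal, untested sketch of what splitting a LLaVA checkpoint can look like in Python (the input file name and the "mm_projector" tensor prefix are assumptions for illustration; the actual surgery script in this PR may differ):

```python
# Split a LLaVA checkpoint: keep the plain LLaMA weights for convert.py and
# set the multimodal projector tensors aside so they can be packed into the
# CLIP/mmproj GGUF together with the vision encoder.
import torch

checkpoint = torch.load("pytorch_model.bin", map_location="cpu")  # hypothetical path

projector = {k: v for k, v in checkpoint.items() if "mm_projector" in k}
for k in projector:
    del checkpoint[k]  # what remains is a regular LLaMA-style state dict

torch.save(projector, "llava.projector")               # later merged into the mmproj GGUF
torch.save(checkpoint, "pytorch_model.stripped.bin")   # safe input for convert.py
```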

usage:

  • Build with cmake.
  • From this link, download mmproj-model-f16.gguf and one of ggml-model-[f16|q5_k|q4_k].gguf.
  • Run:
./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg

This will output the detailed description of the image.

Note: You can override the default textual prompt "Describe the image in detail." by adding -p "custom prompt comes here". Run ./bin/llava for other options.

Note: A lower temperature value like 0.1 is recommended. Add --temp 0.1 to your command to do so.
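
For example, combining the two notes above into a single invocation (the flags are the ones documented here; the model, mmproj, and image paths are placeholders):

./bin/llava -m ggml-model-q5_k.gguf --mmproj mmproj-model-f16.gguf --image path/to/an/image.jpg -p "What is shown in this image?" --temp 0.1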

@staviq
Contributor

Some time ago I was playing with the idea of allowing images to be uploaded via the server web UI. I had a working PoC, but dropped the idea since nobody was working on multimodal functionality back then.

Would it be helpful for testing if I make a PR with this change?

The idea was to import images client side, in the browser, draw them on a hidden canvas and export them as PPM; this would allow such an image to be processed server side without relying on any external libraries/dependencies.

I could add image upload to the server UI and a simple image wrapper class/functions on the cpp side.

Let me know if you are interested.
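
As a side note on why PPM needs no extra dependencies: parsing a binary (P6) PPM takes only a few lines. A rough Python sketch of the idea (the real server-side code would of course be C++, and this ignores comment lines and non-8-bit files):

```python
import re

def read_ppm_p6(path):
    """Return (width, height, raw RGB bytes) from a binary P6 PPM file."""
    with open(path, "rb") as f:
        data = f.read()
    # Header: "P6", width, height, maxval, then one whitespace byte followed
    # by width * height * 3 bytes of RGB data.
    m = re.match(rb"P6\s+(\d+)\s+(\d+)\s+(\d+)\s", data)
    if not m:
        raise ValueError("not a P6 PPM file")
    width, height, maxval = (int(g) for g in m.groups())
    if maxval != 255:
        raise ValueError("only 8-bit channels handled in this sketch")
    pixels = data[m.end():m.end() + width * height * 3]
    return width, height, pixels
```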


@monatis
Collaborator (Author)

Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great after this PR matures. It seems to require some refactoring of the CLIP inference code, copied from another repo of mine, due to the different versions of GGML used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

@staviq
Contributor

> Thanks @staviq! We can work with images thanks to a single-header C library included in this branch (stb_image.h), but integration with the UI would be great after this PR matures. It seems to require some refactoring of the CLIP inference code, copied from another repo of mine, due to the different versions of GGML used. Currently I'm trying to debug and fix it -- once done, I can move faster and we can collaborate on integration with the UI.

I completely missed that stb is licensed under MIT, that's cool. No format shenanigans necessary then.

OK, take your time then. I'll wait until you feel ready for UI integration.


@ggerganov added the model (Model specific) label on Oct 3, 2023
@monatis
Collaborator (Author)

Sorry for the delay here. There was an issue with evaluating embedding input that I needed to debug, and it was too painful to do with my physical machine, which is slow at generation. I obtained a faster VM in the cloud and hope to move faster this weekend.


@monatis
Collaborator (Author)

This is now working with the recently published LLaVA V1.5. The CLIP part consumes a huge amount of memory -- I'll optimize it with ggml_allocr and clean up the implementation tomorrow.


@monatis
Collaborator (Author)

@josephilome this shouldn't be that hard -- I can implement it once the current implementation is optimized.


@monatis
Collaborator (Author)

monatis commented Oct 9, 2023

There are still some tasks to do but I think this is ready for testing / feedback / reviews.

A pre-converted model can be found here.

You need to download one of the ggml-model-[f16|q5_k|q4_k].gguf models and mmproj-model-f16.gguf (the image encoder). This two-file format is faster to move with right now, but we can think of a single-file format in the future. Also see the readme.

I'll add more documentation, do code cleanup and address reviews this afternoon. Any feedback is welcome.


@monatis marked this pull request as ready for review on October 9, 2023, 06:55
@ggerganov
Member

ggerganov commented Oct 9, 2023

@monatis Awesome stuff!

I haven't had a detailed look or run tests yet, but looking at the progress, it's quite amazing to have something that can understand images. Looking forward to giving this a try!

Just curious, how much of the total compute is done by CLIP? I.e. is it a bottleneck?


@ggerganov added the high priority (Very important issue) label on Oct 9, 2023
@ExtReMLapin
Contributor

Any plan to update the GGUF for LLaVA 1.6?


@Green-Sky
Collaborator

Green-Sky commented Jan 31, 2024

Oh, they released them: https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2

A few days ago I only saw the 1.6 preview in their HF space, but no mention of it anywhere else on the internet :)

Edit: blog post https://llava-vl.github.io/blog/2024-01-30-llava-1-6/


@ExtReMLapin
Contributor

ExtReMLapin commented Feb 1, 2024

Even if you convert the safetensors files into torch .bin files, you will get this error when trying to convert to GGUF:

  File "/opt/LLaVA/llama.cpp/convert.py", line 1474, in <module>    main()  File "/opt/LLaVA/llama.cpp/convert.py", line 1460, in main    model   = convert_model_names(model, params)  File "/opt/LLaVA/llama.cpp/convert.py", line 1198, in convert_model_names    raise Exception(f"Unexpected tensor name: {name}")Exception: Unexpected tensor name: model.image_newline

@gamester2665

gamester2665 commented Feb 1, 2024

Yup.. can confirm that following #2948 doesn't yield a valid llava-v1.6-mistral-7b GGUF... any suggestions?

$ python llama.cpp/convert.py llava-hf \
>   --outfile llava-v1.6-mistral-7b-GGUF.gguf \
>   --outtype f32
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00001-of-00004.safetensors
Loading model file llava-hf\model-00002-of-00004.safetensors
Loading model file llava-hf\model-00003-of-00004.safetensors
Loading model file llava-hf\model-00004-of-00004.safetensors
params = Params(n_vocab=32000, n_embd=4096, n_layer=32, n_ctx=32768, n_ff=14336, n_head=32, n_head_kv=8, n_experts=None, n_experts_used=None, f_norm_eps=1e-05, rope_scaling_type=None, f_rope_freq_base=1000000.0, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.AllF32: 0>, path_model=WindowsPath('llava-hf'))
Found vocab files: {'tokenizer.model': WindowsPath('llava-hf/tokenizer.model'), 'vocab.json': None, 'tokenizer.json': WindowsPath('llava-hf/tokenizer.json')}
Loading vocab file 'llava-hf\tokenizer.model', type 'spm'
Vocab info: <SentencePieceVocab with 32000 base tokens and 0 added tokens>
Special vocab info: <SpecialVocab with 0 merges, special tokens {'bos': 1, 'eos': 2, 'unk': 0, 'pad': 0}, add special tokens {'bos': True, 'eos': False}>
Permuting layer 0
Permuting layer 1
...
Permuting layer 31
model.embed_tokens.weight -> token_embd.weight | BF16 | [32000, 4096]
Traceback (most recent call last):
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1474, in <module>
    main()
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1460, in main
    model = convert_model_names(model, params)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "F:\SANDBOX\convert_llava\llama.cpp\convert.py", line 1198, in convert_model_names
    raise Exception(f"Unexpected tensor name: {name}")
Exception: Unexpected tensor name: model.image_newline
(llama-new)

@ExtReMLapin
Contributor

And that's the first one that fails (pretty much the first or second layer lmao)

@chigkim

chigkim commented Feb 1, 2024

Looping in @haotian-liu and @cmp-nct in case they could help with LLaVA V1.6.

@cjpais
Contributor

cjpais commented Feb 1, 2024

I've got a hacked-up script that works for 1.6; will share shortly on a fork.

Raw script (breaks llava 1.5 support): llava1.6-surgery-hack.py

  • loads safetensors
  • removes "model.image_newline" for convert.py; I don't know the impact of this
  • splits mm_projector into a new file
  • saves the updated safetensors which have been modified

Note: the location of the mmproj is different between 34B and 7B; it's probably best to search for all of the mmproj tensors, split them all out, save them, and resave each checkpoint without them (a rough sketch of this idea follows below).
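
For illustration only, a minimal, untested sketch of that search-and-split approach across sharded safetensors (the shard glob, output file names, and the "mm_projector"/"model.image_newline" matching are assumptions; see llava1.6-surgery-hack.py and #5267 for the real scripts):

```python
# Scan every checkpoint shard, pull out the multimodal projector tensors (plus
# the LLaVA 1.6 model.image_newline tensor), and resave the shards without
# them so convert.py only sees plain LLM tensors.
import glob
from safetensors.torch import load_file, save_file

projector = {}
for shard in sorted(glob.glob("model-*.safetensors")):
    tensors = load_file(shard)
    mm_keys = [k for k in tensors if "mm_projector" in k or k == "model.image_newline"]
    for k in mm_keys:
        projector[k] = tensors.pop(k)
    save_file(tensors, shard)  # resave the shard without the multimodal tensors

save_file(projector, "llava.projector.safetensors")  # to be packed into the mmproj GGUF
```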

@cmp-nct
Contributor

cmp-nct commented Feb 1, 2024

I'm also halfway there, but occupied with real-world stuff.
The main task for 1.6 is to implement the new 'unpad' mechanism.

I've created a draft PR to use as a base for 1.6: #5267
It uses a clean surgery script which should work with all variants of LLaVA, and it also supports searching for tensors (though it currently does not search for the projector, only for the ViT).
The projector GGUF file is also prepared for the new features (spatial_unpad); the new tensor is moved in there.

Right now I am struggling with the new ViT:

  size mismatch for vision_model.encoder.layers.1.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([13824]).

That's ffn_down and ffn_up.

Even without the correct ViT I could already test llava-1.6, and despite not including the proper image manipulation and resolution, it is already very good.


@cjpais
Contributor

cjpais commented Feb 2, 2024

Not sure if it's okay to share here...
For those who are looking, here are initial GGUF quants for LLaVA 1.6:

Please note they are very early, built from the hacked surgery script. Improvements are coming in #5267 from @cmp-nct; I will try to contribute where I can, but I am nothing close to an expert.

7b mistral
34b


@gamester2665

Awesome! Thanks @cjpais .. throwing it into LM Studio for testing now.

@BBC-Esq

Did it work in LM Studio?

@gamester2665

@BBC-Esq Yes! cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf is working successfully in LM Studio.


@BBC-Esq

You guys move fast. I'm considering moving my stuff from ctranslate2 to llama.cpp; are there any good issues/discussions to check whether you move that fast with whisper.cpp too?

@ExtReMLapin
Contributor

  • removes "model.image_newline" forconvert.py, I don't know the impact of this

bruh moment


@aymenabid-lab

I'm using LLaVA.

How do I modify the batch size to avoid this error?

  • From Python within the terminal:
    python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path /home/dl_g15/llava-v1.5-13b
    =>
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 22.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 8.06 MiB is free. Including non-PyTorch memory, this process has 7.73 GiB memory in use. Of the allocated memory 7.60 GiB is allocated by PyTorch, and 7.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  • From Anaconda:
    model_path = "/home/dl_g15/llava-v1.5-13b"

    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path=model_path,
        model_base=None,
        model_name=get_model_name_from_path(model_path)
    )
=>
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@cebtenzzre
Collaborator

> I'm using LLaVA.

You're almost certainly looking for https://github.com/haotian-liu/LLaVA. This is the llama.cpp repo.



Reviewers

@ggerganov approved these changes


@doomed151 left review comments

@Green-Sky left review comments


Assignees

No one assigned

Labels

high priority (Very important issue), llava (LLaVa and multimodal), model (Model specific), need feedback (Testing and feedback with results are needed)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

llama : add multimodal support (LLaVA)

34 participants

@monatis @staviq @ggerganov @Green-Sky @Galunid @aiaicode @BarfingLemurs @gcardoso2314 @cebtenzzre @rlancemartin @ilteris @cednats @haotian-liu @bkbasavaraju @QueryType @aisensiy @LumenYoung @TikaToka @pudepiedj @Lurrobert @kiiwee @ASmallPotato @djasil @RachelShalom @ExtReMLapin @gamester2665 @chigkim @cjpais @cmp-nct @BBC-Esq @aymenabid-lab @doomed151 @phymbert
