Add PaliGemma Support #7553
Conversation
Hm yes, this is not currently supported. Need to figure out what changes would be necessary. Explicitly setting the mask through the API would be possible, but I think it would be too difficult to use. There are a couple of other options (referred to as options 1 and 2 below); not sure yet what would be best.
I'm partial to this (the explicit mask API) if it's the most straightforward to implement, just because it offers the most general API, but I understand the concern. Otherwise, I think option 1 is the better approach, as it's centered around the batch API and would require minimal changes to existing code.
@ggerganov I'll see what I can come up with along those lines. We could probably limit complexity by only allowing future attention to work within a batch; otherwise we would need multiple graph compute calls to update previously computed KV positions. I might be mistaken about that though.
Yes, you are correct: future attention can only be done within the batch. Not sure if option 1 would require fewer changes though. We have to remember, for each token, the set of tokens that it attends to. So even if we can provide this information for the current batch through the batch API, we would still have to keep track of it for the rest of the context. On the other hand, option 2 might need to just maintain a container with ranges in the KV cache.
cc @iamlemec, @compilade - bringing your attention to this discussion in case you have some additional ideas on how to support this functionality.
I think that the choice between 1 and 2 might come down to whether the causal/non-causal block positions are generally fixed. It seems like with PaliGemma the image block is of fixed length, determined by the image processing model. With something like GritLM, the non-causal segments can be of varying size. Also, a 3rd, slightly more general implementation would be to add a per-token causal-attention flag. That said, there might be a way to do it with the current codebase: can we just evaluate the image tokens in a first batch with non-causal attention, and then evaluate the rest of the tokens in a second batch with causal attention?
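For illustration (a sketch that is not from the thread; the block sizes are made up), the attention mask implied by a fixed-size non-causal image prefix followed by causal text would look roughly like this:

```python
import numpy as np

def prefix_lm_mask(n_prefix: int, n_total: int) -> np.ndarray:
    """Boolean mask where mask[i, j] == True means token i may attend to token j.

    The first n_prefix tokens (e.g. image embeddings) attend to each other
    bidirectionally; the remaining tokens are causal but can see the prefix.
    """
    mask = np.tril(np.ones((n_total, n_total), dtype=bool))  # causal base
    mask[:n_prefix, :n_prefix] = True                        # full attention inside the prefix
    return mask

print(prefix_lm_mask(3, 6).astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 0]
#  [1 1 1 1 1 1]]
```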
The problem I think is that during the second batch, the attention mask will be causal for the entire context, which would lead to incorrect masking of the tokens from the first batch, no?
But those KVs from the first block are already computed, cached, and won't be recomputed. So any given token in the second batch will be attending to every single token in the first batch, and the KVs they pick up will have been computed non-causally in the first batch execution.
The KV data is cached yes, but the softmax computation in the attention during the new batch still needs the correct mask for the entire context. With a causal mask, the tokens from the new batch would correctly attend to all previous ones, but the previous tokens would not attend correctly to themselves - they still attend to each other via the softmax, even though they are from the previous batch.
Thanks for the reply @ggerganov. I'm still not 100% convinced, but it seems possible that I'm confused about the premise here. Partially for my own edification, and partially to clear up any ambiguity, I decided to write up a minimal working example in Python with Hugging Face transformers. Here's a link to the Gist: https://gist.github.com/iamlemec/3febf59b41b7f32a450fcfcb4be0713c. I used RoBERTa because it's still relatively small and allows one to specify full 2D attention matrices, rather than just 1D attention masks like in base BERT. Anyway, those asserts pass! So that's potential validation?
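For readers without access to the Gist, here is a self-contained sketch of the same argument for a single attention layer in plain NumPy (illustrative names only, not the Gist or llama.cpp code): evaluating the prefix non-causally, caching its K/V, and then running the suffix causally against that cache matches a single pass with the combined prefix-LM mask.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prefix, n_suffix = 8, 3, 4
n = n_prefix + n_suffix

x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = x @ wq, x @ wk, x @ wv

def attend(q, k, v, mask):
    # mask[i, j] == True means query i may attend to key j
    scores = np.where(mask, q @ k.T / np.sqrt(d), -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

# Single pass: bidirectional attention inside the prefix, causal afterwards.
mask = np.tril(np.ones((n, n), dtype=bool))
mask[:n_prefix, :n_prefix] = True
full = attend(q, k, v, mask)

# Two passes: non-causal prefix first, then the suffix attends to the cached
# prefix K/V plus its own causal part.
prefix_out = attend(q[:n_prefix], k[:n_prefix], v[:n_prefix],
                    np.ones((n_prefix, n_prefix), dtype=bool))
suffix_mask = np.concatenate([np.ones((n_suffix, n_prefix), dtype=bool),
                              np.tril(np.ones((n_suffix, n_suffix), dtype=bool))], axis=1)
suffix_out = attend(q[n_prefix:], k, v, suffix_mask)

assert np.allclose(full, np.concatenate([prefix_out, suffix_out]))
```

This is the per-layer version of the argument; in a deeper model the prefix hidden states never depend on the suffix in either scheme, so what the first non-causal batch caches is exactly what a single prefix-LM pass would have produced.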
@iamlemec I stand corrected. Thanks for this example and for helping to figure this out! So this is actually very good news - we should be able to support PaliGemma with the existing API, correct?
@iamlemec @ggerganov Sounds good. If I understand correctly, the approach would be to update the non-causal attention path so that it still stores the computed KV data in the cache. This may negatively impact performance of pure embedding models, as they don't store anything in the cache AFAIK, but maybe there's a way to distinguish the two scenarios.
@abetlen Yes, I think this should work
Hopefully no change would be needed. Note that we have two parameters: hparams.causal_attn (a property of the model) and cparams.causal_attn (a property of the context). The latter can be changed via llama_set_causal_attn.
@abetlen @ggerganov Do we need to actually modify llama.cpp itself, or can we just change this in the llava example:

```cpp
eval_string(ctx_llava->ctx_llama, system_prompt.c_str(), params->n_batch, &n_past, true);
llava_eval_image_embed(ctx_llava->ctx_llama, image_embed, params->n_batch, &n_past);
eval_string(ctx_llava->ctx_llama, user_prompt.c_str(), params->n_batch, &n_past, false);
```

to

```cpp
llama_set_causal_attn(ctx_llava->ctx_llama, true);
eval_string(ctx_llava->ctx_llama, system_prompt.c_str(), params->n_batch, &n_past, true);

llama_set_causal_attn(ctx_llava->ctx_llama, false);
llava_eval_image_embed(ctx_llava->ctx_llama, image_embed, params->n_batch, &n_past);

llama_set_causal_attn(ctx_llava->ctx_llama, true);
eval_string(ctx_llava->ctx_llama, user_prompt.c_str(), params->n_batch, &n_past, false);
```

I'm not quite sure how the system prompt works with PaliGemma, though.
Yes, it should work without changes to the core library.
That's the approach I was initially trying, but it caused this assert to fail, as the logits aren't reserved in that case. However, I think I was just missing a one-line change.
Hello, I have cloned abetlen's work. I am trying to run the conversion script on this model https://huggingface.co/google/paligemma-3b-pt-224/tree/main, but I keep getting the following error:

```
Traceback (most recent call last):
  File "/home/belkov.arseniy/paligemma/convert.py", line 305, in <module>
    special_vocab.add_to_gguf(fout)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/vocab.py", line 69, in add_to_gguf
    add_handler(value)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 530, in add_add_bos_token
    self.add_bool(Keys.Tokenizer.ADD_BOS, value)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 201, in add_bool
    self.add_key_value(key, val, GGUFValueType.BOOL)
  File "/home/belkov.arseniy/paligemma/llama.cpp/gguf-py/gguf/gguf_writer.py", line 166, in add_key_value
    raise ValueError(f'Duplicated key name {key!r}')
ValueError: Duplicated key name 'tokenizer.ggml.add_bos_token'
```

Can someone please help?
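A guess from the traceback alone (not confirmed anywhere in this thread): the key tokenizer.ggml.add_bos_token is being written through two different code paths, once directly by the convert script and once by SpecialVocab.add_to_gguf, and the gguf-py writer rejects duplicates. The failure mode itself can be reproduced in a couple of lines:

```python
# Minimal illustration (not the actual convert.py) of what the error means with
# the gguf-py version from the traceback: writing the same key twice raises.
import gguf

writer = gguf.GGUFWriter("/tmp/demo.gguf", "gemma")
writer.add_add_bos_token(True)   # first write of tokenizer.ggml.add_bos_token succeeds
writer.add_add_bos_token(True)   # second write raises ValueError: Duplicated key name
```

If that is indeed the cause here, removing one of the two writes in the conversion script (for example, letting SpecialVocab.add_to_gguf be the only place that sets it) should get past this error.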
I don't get it: is the feature meant to have already been implemented, or is it a work in progress?
@ggerganov I was able to come back to this and finally get it working.

The below Python pseudocode is what's currently required to evaluate the prefill correctly:

```python
cparams.n_ubatch = 512  # needs to be large enough to fit the image embeddings __and__ the text embeddings from the input prompt in a single llama_decode call
# ...
tokens = tokenize(prompt)
token_embeddings = llama_token_inp_embd(lctx, tokens)  # we need to embed the input tokens because llama_decode supports __either__ token or embedding inputs
# ...
image_embeddings = embed_from_bytes(llava_ctx, image_bytes)
# ...
memcpy(batch.embd, image_embeddings.embd, n_image_pos * embedding_dim * sizeof(float))
memcpy(batch.embd + n_image_pos * embedding_dim, token_embeddings, len(tokens) * embedding_dim * sizeof(float))
# ...
llama_set_causal_attn(lctx, False)
llama_decode(batch)
llama_set_causal_attn(lctx, True)
```
@abetlen Great to see that this is working. Merging this PR might have to wait, or be implemented within the scope of the new effort for adding vision capabilities to the core library.
@ggerganov no problem, I'll work with @ngxson and see if I can provide support on that PR.
@abetlen FYI, this week I'm quite busy with @huggingface hub stuff 😅 My current plan is to be able to add both tokens and embeddings into the batch. In the meantime, would you mind taking a look at how the batch can be divided into ubatches with both the tokens and embd inside?
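As a rough illustration of that splitting question (toy Python with made-up names, not the actual llama.cpp structures), one way to build a ubatch from a batch holding both kinds of input is to resolve every position to an embedding:

```python
import numpy as np

# Toy stand-ins: each batch position is either ("token", id) or ("embd", vector),
# and tok_embd_table plays the role of the model's token-embedding matrix.
embedding_dim, n_vocab = 4, 100
tok_embd_table = np.random.default_rng(0).normal(size=(n_vocab, embedding_dim))

batch = [("embd", np.ones(embedding_dim))] * 3 + [("token", 42), ("token", 7)]

# Resolve every position to an embedding, looking token ids up in the table, so
# the rest of the compute graph only ever sees a dense [n_tokens, n_embd] input.
ubatch_embd = np.stack([tok_embd_table[value] if kind == "token" else value
                        for kind, value in batch])
print(ubatch_embd.shape)  # (5, 4)
```

That mirrors what the pseudocode above does by hand with llama_token_inp_embd and the memcpy into batch.embd.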
Very much still a work in progress, however I've been able to convert the weights and update the clip model to support the differences in the PaliGemma implementation of SigLIP (the second projector MLP layer is missing).

The next missing piece is that PaliGemma evaluates the prefix part of the prompt (which includes the image embeddings) using a fully-connected attention mask, as opposed to causal attention (see the HF implementation).

I haven't played around too much with the llama_set_causal_attn function, but I believe it may be sufficient; otherwise it will be necessary to update the API to specify a custom attention mask.

I've created a custom script to generate the f16 ggufs here. I've opted for a custom script as the current convert-hf-to-gguf.py is not suited for converting VLMs at the moment.
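For anyone reproducing the conversion, the rough shape of what such a custom script has to do is to partition the HF checkpoint between the vision gguf and the language-model gguf. This sketch assumes the checkpoint uses the usual PaliGemmaForConditionalGeneration tensor prefixes and a single safetensors file; the custom script linked above is the authoritative version.

```python
from safetensors import safe_open

# Hypothetical sketch of the conversion split; prefixes and filename are assumptions.
vision, language = {}, {}
with safe_open("model.safetensors", framework="np") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        if name.startswith(("vision_tower.", "multi_modal_projector.")):
            vision[name] = tensor      # goes into the SigLIP/clip gguf
        elif name.startswith("language_model."):
            language[name] = tensor    # goes into the Gemma language-model gguf

print(len(vision), "vision tensors,", len(language), "language tensors")
```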