
Roberta embeddings fixes #10856

Open
wants to merge 2 commits into base: master

Conversation

@Ssukriti Ssukriti (Contributor) commented Dec 16, 2024

Since the last PR, we identified that the embedding values being produced were of poor quality. On further debugging, we found that the tokenizer should be mapped to gpt-2 and that the position embeddings needed some modifications.

This PR makes the changes needed to get correct embeddings from the Roberta model.

This PR has been tested by comparing embedding values against the sentence-transformers library for the Roberta architecture:

from sentence_transformers import SentenceTransformer
embeddings_model = SentenceTransformer(path)
embedding_vector = embeddings_model.encode([prompt])[0]

now matches the output of:

python3 convert_hf_to_gguf.py model_path --outfile model_path.gguf
llama-embedding -m model_path.gguf -p [prompt] -c 514

Hence, the embeddings now have the correct values for Roberta models.
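As a quick way to verify the claim above, the two embedding vectors can be compared with cosine similarity. This is a minimal sketch; `ref_vec` and `test_vec` are hypothetical stand-ins for the sentence-transformers output and the vector parsed from llama-embedding's output, not part of either tool's API:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms;
    # a value near 1.0 means the two embeddings point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for the outputs of the two pipelines above.
ref_vec = [0.12, -0.53, 0.88]
test_vec = [0.12, -0.53, 0.88]
print(cosine_similarity(ref_vec, test_vec))  # close to 1.0 when the embeddings match
```

Note that the two tools may normalize or pool embeddings slightly differently, so a similarity very close to (but not exactly) 1.0 is still a match.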

We will add documentation examples in a follow-up PR.

gabe-l-hart and others added 2 commits December 13, 2024 16:41
Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@github-actions github-actions bot added the python python script changes label Dec 16, 2024
# adds the cls/sep tokens as bos/eos. This is handled as a
# post-processor in tokenizers, so the chkhsh is different, but
# it still maps to gpt-2 internally.
res = "gpt-2"
@Ssukriti (Contributor, Author) commented:
As per the guidelines, we shouldn't be modifying this value in convert_hf_to_gguf.py, so that it can be autogenerated from convert_hf_to_gguf_update.py.

We want it to map to the gpt-2 tokenization type.

Would the correct way to do this be to keep it as roberta-bpe, as generated, and then add it to the mapping here (https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6483) so that it maps to gpt-2?

Any input would be appreciated.
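For context, the autogeneration mentioned in the comment above works roughly by hashing the token ids each tokenizer produces for a fixed probe string and emitting the matching `res = ...` branches. The sketch below illustrates the idea only; the probe string, table contents, and function names are illustrative, not the real ones from convert_hf_to_gguf_update.py:

```python
from hashlib import sha256

# Illustrative table mapping a tokenizer checksum to the internal
# pre-tokenizer name; the real branches live in convert_hf_to_gguf.py
# and are regenerated by convert_hf_to_gguf_update.py.
KNOWN_CHKHSH = {
    # "<roberta-bpe checksum>": "gpt-2",  # the cls/sep post-processor
    #                                     # changes the hash, but the
    #                                     # pre-tokenization still matches gpt-2
}

def chkhsh(encode, probe="Hello, world! 123"):
    # Tokenizers that pre-tokenize identically yield identical id sequences
    # for the probe string, and hence the same checksum.
    ids = encode(probe)
    return sha256(str(ids).encode("utf-8")).hexdigest()

def detect_pre_tokenizer(encode):
    # Unknown checksums fall through, prompting a rerun of the update script.
    return KNOWN_CHKHSH.get(chkhsh(encode), "unknown")
```

This is why the RoBERTa tokenizer hashes to a different checksum than plain GPT-2 even though both should resolve to the same gpt-2 pre-tokenization type.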
