
Roberta embeddings fixes #10856

Open
wants to merge 2 commits into base: master

Conversation

@Ssukriti Ssukriti (Contributor) commented Dec 16, 2024

Since the last PR, we identified that the embedding values being produced were of poor quality. On further debugging, we found that the tokenizer should be mapped to gpt-2 and that the position embeddings needed some modifications.

This PR makes the changes needed to get correct embeddings from the Roberta model.

This PR has been tested by comparing embedding values against the sentence-transformers library for the Roberta architecture:

from sentence_transformers import SentenceTransformer
embeddings_model = SentenceTransformer(path)
embedding_vector = embeddings_model.encode([prompt])[0]

now matches the output of:

python3 convert_hf_to_gguf.py model_path --outfile model_path.gguf
llama-embedding -m model_path.gguf -p [prompt] -c 514

Hence, the embeddings now have the correct values for Roberta models.
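As a quick way to verify the claim above, the two embedding vectors can be compared with cosine similarity. This is a minimal sketch; `ref_vec` and `test_vec` are hypothetical stand-ins for the sentence-transformers output and the vector parsed from llama-embedding's output, not part of either tool's API:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms;
    # a value near 1.0 means the two embeddings point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors standing in for the outputs of the two pipelines above.
ref_vec = [0.12, -0.53, 0.88]
test_vec = [0.12, -0.53, 0.88]
print(cosine_similarity(ref_vec, test_vec))  # close to 1.0 when the embeddings match
```

Note that the two tools may normalize or pool embeddings slightly differently, so a similarity very close to (but not exactly) 1.0 is still a match.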

We will add documentation examples in a follow-up PR.

gabe-l-hart and others added 2 commits December 13, 2024 16:41
Branch: RobertaTokenizer

Signed-off-by: Gabe Goodhart <[email protected]>
Signed-off-by: Sukriti-Sharma4 <[email protected]>
@github-actions github-actions bot added the python python script changes label Dec 16, 2024
# adds the cls/sep tokens as bos/eos. This is handled as a
# post-processor in tokenizers, so the chkhsh is different, but
# it still maps to gpt-2 internally.
res = "gpt-2"
@Ssukriti (Contributor, Author) commented:
As per the guidelines, we shouldn't be modifying this value in convert_hf_to_gguf.py, so that it can be autogenerated from convert_hf_to_gguf_update.py.

We want it to map to the gpt-2 tokenization type.

Would the correct way to do this be to keep it as roberta-bpe, as generated, and then add it to the mapping here (https://github.com/ggerganov/llama.cpp/blob/master/src/llama.cpp#L6483) so that it maps to gpt-2?

Any input would be appreciated.
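For context, the autogeneration mentioned in the comment above works roughly by hashing the token ids each tokenizer produces for a fixed probe string and emitting the matching `res = ...` branches. The sketch below illustrates the idea only; the probe string, table contents, and function names are illustrative, not the real ones from convert_hf_to_gguf_update.py:

```python
from hashlib import sha256

# Illustrative table mapping a tokenizer checksum to the internal
# pre-tokenizer name; the real branches live in convert_hf_to_gguf.py
# and are regenerated by convert_hf_to_gguf_update.py.
KNOWN_CHKHSH = {
    # "<roberta-bpe checksum>": "gpt-2",  # the cls/sep post-processor
    #                                     # changes the hash, but the
    #                                     # pre-tokenization still matches gpt-2
}

def chkhsh(encode, probe="Hello, world! 123"):
    # Tokenizers that pre-tokenize identically yield identical id sequences
    # for the probe string, and hence the same checksum.
    ids = encode(probe)
    return sha256(str(ids).encode("utf-8")).hexdigest()

def detect_pre_tokenizer(encode):
    # Unknown checksums fall through, prompting a rerun of the update script.
    return KNOWN_CHKHSH.get(chkhsh(encode), "unknown")
```

This is why the RoBERTa tokenizer hashes to a different checksum than plain GPT-2 even though both should resolve to the same gpt-2 pre-tokenization type.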
