
DO NOT MERGE Add olmo2 tokenizer to convert script (leaving open for discussion) #10535

Draft: bartowski1182 wants to merge 2 commits into master
Conversation

@bartowski1182 (Contributor) commented Nov 26, 2024

This isn't the correct solution: the tokenizer is wrong for the instruct models, so rather than adding anything to support it, the tokenizer.json should be changed.

The checksum for the new tokenizer was missing from convert_hf_to_gguf.py; I added it there and to convert_hf_to_gguf_update.py.
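For reference, this is roughly the mechanism involved (a sketch of the pattern used by convert_hf_to_gguf.py, not the exact diff from this PR): convert_hf_to_gguf_update.py encodes a fixed test string with each model's tokenizer and prints a hash of the result, and get_vocab_base_pre() matches that hash to a pre-tokenizer name.

from hashlib import sha256
from transformers import AutoTokenizer

# Hash the encoding of the fixed test string, as convert_hf_to_gguf_update.py does.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
chktxt = "..."  # the actual multi-script test string lives in the script itself
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()

# get_vocab_base_pre() in convert_hf_to_gguf.py then maps hash -> name:
if chkhsh == "...":  # the hash printed by convert_hf_to_gguf_update.py
    res = "olmo2"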

The github-actions bot added the "python" label (python script changes) on Nov 26, 2024
@bartowski1182 (Contributor, Author)

Hmm, this still fails to run for some reason, saying: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'olmo2'

I must have missed adding it somewhere, looking...

@bartowski1182 (Contributor, Author)

Okay, confirmed: with these changes it seems to recognize the model without issue (I haven't confirmed coherency, but imatrix is working now, which points to generation working).

@slaren (Collaborator) commented Nov 26, 2024

It is necessary to check that the pre-tokenizer regex matches the one used in llama.cpp as well; otherwise, a new one needs to be added.

@bartowski1182 (Contributor, Author)

@slaren (Collaborator) commented Nov 26, 2024

That's one part of it, but as it is, it's just using the default BPE regex. If it's different, it needs to be added here:

struct llm_tokenizer_bpe : llm_tokenizer {

@bartowski1182 (Contributor, Author) commented Nov 26, 2024

Ahh, I see. Looking in https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/raw/main/tokenizer.json, I don't see any regex defined. Does that imply the default is fine?
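A quick way to check (a minimal sketch; the URL is the one above) is to load the file and look for a Split pre-tokenizer step with an explicit Regex:

import json, urllib.request

url = "https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/raw/main/tokenizer.json"
with urllib.request.urlopen(url) as f:
    tok = json.load(f)

pre = tok.get("pre_tokenizer") or {}
# A "Sequence" pre-tokenizer may contain a "Split" step with an explicit regex;
# a bare "ByteLevel" pre-tokenizer carries no custom regex of its own.
for step in pre.get("pretokenizers", [pre]):
    if step.get("type") == "Split":
        print(step["pattern"].get("Regex"))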

@slaren (Collaborator) commented Nov 26, 2024

It seems to use the GPT2 tokenizer, so I imagine it needs to use the GPT2 regex (which olmo also uses). @2015aroras might know if there are changes to the tokenizer compared to olmo.
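For context, the pattern in question is the original GPT-2 BPE pre-tokenizer regex (quoted here from memory of the GPT-2 encoder, so treat it as a reference sketch; Python needs the third-party regex module for the \p{...} classes):

import regex  # pip install regex; the stdlib re module lacks \p{...} support

gpt2_pat = regex.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)
print(gpt2_pat.findall("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']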

@2015aroras (Contributor) commented Nov 26, 2024

I'm not very familiar with tokenizers, so bear with me. OLMo 1 and OLMo 2 have different tokenizers with different vocab sizes. When I implemented OLMo 2 here, I tested only the base OLMo 2 model, since that's all I had at hand. It seems to have the check hash a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e (corresponding to dbrx). The only tokenizer change that the Instruct model is supposed to have relative to the base model is a chat template.

@bartowski1182 (Contributor, Author)

Yes, I noticed the base model converted fine but the instruct versions didn't; there must be some other, less obvious change to the tokenizer.

@2015aroras (Contributor)

Using this with the 13B model seems to yield a fair conversation. No idea why one of the prompts got no response the first time, though.

<|system|>
You are OLMo 2 Instruct, a helpful, open-source AI Assistant built by the Allen Institute for AI.

> Who are you?
I am OLMo 2 Instruct, an AI language model developed by the Allen Institute for Artificial Intelligence (Ai2). I am designed to assist with a wide range of tasks, from answering questions to generating text-based content. My responses are generated based on patterns in data, and while I strive to provide accurate and helpful information, please keep in mind that my capabilities are not equivalent to human understanding or expertise.

> Why is the earth round?

> Why is the earth round?
The Earth is round (more accurately, an oblate spheroid) because of gravity. Over millions of years, the force of gravity has acted on the mass of the Earth, pulling it toward the center from all directions equally. This uniform pull has resulted in a roughly spherical shape.

The spinning motion of the Earth also plays a role in its shape. This rotational force, known as the centrifugal force, causes the planet to bulge slightly at the equator and flatten at the poles, creating an oblate
> What is 38 * 17?
38 multiplied by 17 equals 646.

@bartowski1182 (Contributor, Author)

Seems reasonable enough. Is this made using my PR?

@bartowski1182 (Contributor, Author)

Can you use llama-tokenize to see if it's tokenizing stuff properly?

@bartowski1182 (Contributor, Author)

After trying it with llama-tokenize, I can see that, no, it's not tokenizing correctly even with my changes.
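For anyone reproducing this, one way to compare (a sketch using the llama-cpp-python bindings and an assumed local GGUF path, rather than the llama-tokenize CLI) is to tokenize the same text on both sides and diff the IDs:

from transformers import AutoTokenizer
from llama_cpp import Llama  # pip install llama-cpp-python

text = "<|system|>\nYou are an assistant\n<|user|>\nhello"

hf = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
hf_ids = hf.encode(text)

# vocab_only loads just the tokenizer, not the weights; the path is assumed.
gguf = Llama(model_path="olmo2-instruct.gguf", vocab_only=True)
gguf_ids = gguf.tokenize(text.encode("utf-8"), special=True)

print("match" if hf_ids == gguf_ids else f"mismatch: {hf_ids} vs {gguf_ids}")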

@2015aroras (Contributor)

I used the PR, but didn't try llama-tokenize.

We're looking internally into the tokenizer differences. I'll report back when we have a conclusion.

@bartowski1182 (Contributor, Author)

Okay, thanks, sounds good. I'll also take a look now that I'm home.

@bartowski1182 (Contributor, Author) commented Nov 27, 2024

Well, here's something, @2015aroras: the non-instruct has a regex string in its pre_tokenizer, but the instruct doesn't.

non-instruct:

"pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

instruct:

"pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  },

I'm guessing that's what's making the difference.

Also, the post_processors are different, but I'm not sure it matters, since llama.cpp is detecting it fine and presumably won't use that anyway post-conversion.

@bartowski1182 (Contributor, Author)

Yup, so adding the pre_tokenizer makes it pass without my changes.
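Concretely, the workaround amounts to something like this (a sketch, assuming both tokenizer.json files have been downloaded to local paths): copy the base model's pre_tokenizer block over the instruct model's.

import json

# Assumed local paths to the two downloaded tokenizer.json files.
with open("OLMo-2-1124-7B/tokenizer.json") as f:
    base = json.load(f)
with open("OLMo-2-1124-7B-Instruct/tokenizer.json") as f:
    instruct = json.load(f)

# Replace the instruct model's bare ByteLevel pre_tokenizer with the base
# model's Sequence/Split one, which carries the explicit regex.
instruct["pre_tokenizer"] = base["pre_tokenizer"]

with open("OLMo-2-1124-7B-Instruct/tokenizer.json", "w") as f:
    json.dump(instruct, f, ensure_ascii=False, indent=2)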

@bartowski1182 (Contributor, Author) commented Nov 27, 2024

However, even with that change it doesn't seem to be tokenizing correctly :/

100257 -> '<|endoftext|>'
    27 -> '<'
    91 -> '|'
  9125 -> 'system'
    91 -> '|'
   397 -> '>
'
  2675 -> 'You'
   527 -> ' are'
   459 -> ' an'
 18328 -> ' assistant'
   198 -> '
'
    27 -> '<'
    91 -> '|'
   882 -> 'user'
    91 -> '|'
    29 -> '>'
 15339 -> 'hello'

@bartowski1182 (Contributor, Author)

But maybe that's just how the model behaves by default? I don't see any tokens for <|user|> or <|system|>, so it won't be able to tokenize them properly D:
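One quick check (a sketch, using the HF tokenizer for the instruct model) is whether the chat-role strings encode to single token IDs or get split into pieces:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
for s in ("<|system|>", "<|user|>", "<|assistant|>", "<|endoftext|>"):
    ids = tok.encode(s, add_special_tokens=False)
    # A dedicated special token encodes to exactly one ID.
    print(s, ids, "single token" if len(ids) == 1 else "split into pieces")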

@bartowski1182 bartowski1182 changed the title Add olmo2 tokenizer to convert script DO NOT MERGE Add olmo2 tokenizer to convert script (leaving open for discussion) Nov 27, 2024
@bartowski1182 bartowski1182 marked this pull request as draft November 27, 2024 17:06
@bartowski1182 (Contributor, Author)

Any updates, @2015aroras?
