
DO NOT MERGE Add olmo2 tokenizer to convert script (leaving open for discussion) #10535

Draft: bartowski1182 wants to merge 2 commits into master
Conversation

@bartowski1182 (Contributor) commented Nov 26, 2024

This isn't the correct solution: the tokenizer is wrong for the instruct models, so rather than adding anything to support it, the tokenizer.json should be changed.

The checksum for the new tokenizer was missing from convert_hf_to_gguf.py; I added it there and to convert_hf_to_gguf_update.py.
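For reference, this is roughly the mechanism involved (a sketch of the pattern used by convert_hf_to_gguf.py, not the exact diff from this PR): convert_hf_to_gguf_update.py encodes a fixed test string with each model's tokenizer and prints a hash of the result, and get_vocab_base_pre() matches that hash to a pre-tokenizer name.

from hashlib import sha256
from transformers import AutoTokenizer

# Hash the encoding of the fixed test string, as convert_hf_to_gguf_update.py does.
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
chktxt = "..."  # the actual multi-script test string lives in the script itself
chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()

# get_vocab_base_pre() in convert_hf_to_gguf.py then maps hash -> name:
if chkhsh == "...":  # the hash printed by convert_hf_to_gguf_update.py
    res = "olmo2"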

The github-actions bot added the "python" label (python script changes) on Nov 26, 2024
@bartowski1182 (Contributor, Author)

Hmm, this still fails to run for some reason, saying: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'olmo2'

I must have missed adding it somewhere, looking...

@bartowski1182 (Contributor, Author)

Okay, confirmed: with these changes it seems to recognize the model without issue (I haven't confirmed coherency, but imatrix is working now, which points to generation working).

@slaren (Collaborator) commented Nov 26, 2024

It is necessary to check that the pre-tokenizer regex matches the one used in llama.cpp as well; otherwise, a new one needs to be added.

@bartowski1182 (Contributor, Author)

@slaren (Collaborator) commented Nov 26, 2024

That's one part of it, but as it is, it's just using the default BPE regex. If it's different, it needs to be added here:

struct llm_tokenizer_bpe : llm_tokenizer {

@bartowski1182 (Contributor, Author) commented Nov 26, 2024

Ahh, I see. Looking in https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/raw/main/tokenizer.json, I don't see any regex defined. Does that imply the default is fine?
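A quick way to check (a minimal sketch; the URL is the one above) is to load the file and look for a Split pre-tokenizer step with an explicit Regex:

import json, urllib.request

url = "https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct/raw/main/tokenizer.json"
with urllib.request.urlopen(url) as f:
    tok = json.load(f)

pre = tok.get("pre_tokenizer") or {}
# A "Sequence" pre-tokenizer may contain a "Split" step with an explicit regex;
# a bare "ByteLevel" pre-tokenizer carries no custom regex of its own.
for step in pre.get("pretokenizers", [pre]):
    if step.get("type") == "Split":
        print(step["pattern"].get("Regex"))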

@slaren (Collaborator) commented Nov 26, 2024

It seems to use the GPT2 tokenizer, so I imagine it needs to use the GPT2 regex (which olmo also uses). @2015aroras might know if there are changes to the tokenizer compared to olmo.
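For context, the pattern in question is the original GPT-2 BPE pre-tokenizer regex (quoted here from memory of the GPT-2 encoder, so treat it as a reference sketch; Python needs the third-party regex module for the \p{...} classes):

import regex  # pip install regex; the stdlib re module lacks \p{...} support

gpt2_pat = regex.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"
)
print(gpt2_pat.findall("Hello world, it's 2024!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']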

@2015aroras (Contributor) commented Nov 26, 2024

I'm not very familiar with tokenizers, so bear with me. OLMo 1 and OLMo 2 have different tokenizers with different vocab sizes. When I implemented OLMo 2 here, I tested only the base OLMo 2 model, since that's all I had at hand. It seems to have the check hash a8594e3edff7c29c003940395316294b2c623e09894deebbc65f33f1515df79e (corresponding to dbrx). The only tokenizer change that the Instruct model is supposed to have relative to the base model is a chat template.

@bartowski1182 (Contributor, Author)

Yes, I noticed the base model converted fine but the instruct versions didn't; there must be some other, less obvious change to the tokenizer.

@2015aroras (Contributor)

Using this with the 13B model seems to yield a fair conversation. No idea why one of the prompts got no response the first time, though.

<|system|>
You are OLMo 2 Instruct, a helpful, open-source AI Assistant built by the Allen Institute for AI.

> Who are you?
I am OLMo 2 Instruct, an AI language model developed by the Allen Institute for Artificial Intelligence (Ai2). I am designed to assist with a wide range of tasks, from answering questions to generating text-based content. My responses are generated based on patterns in data, and while I strive to provide accurate and helpful information, please keep in mind that my capabilities are not equivalent to human understanding or expertise.

> Why is the earth round?

> Why is the earth round?
The Earth is round (more accurately, an oblate spheroid) because of gravity. Over millions of years, the force of gravity has acted on the mass of the Earth, pulling it toward the center from all directions equally. This uniform pull has resulted in a roughly spherical shape.

The spinning motion of the Earth also plays a role in its shape. This rotational force, known as the centrifugal force, causes the planet to bulge slightly at the equator and flatten at the poles, creating an oblate
> What is 38 * 17?
38 multiplied by 17 equals 646.

@bartowski1182 (Contributor, Author)

Seems reasonable enough. Is this made using my PR?

@bartowski1182 (Contributor, Author)

Can you use llama-tokenize to see if it's tokenizing stuff properly?

@bartowski1182 (Contributor, Author)

After trying it with llama-tokenize, I can see that, no, it's not tokenizing correctly even with my changes.
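For anyone reproducing this, one way to compare (a sketch using the llama-cpp-python bindings and an assumed local GGUF path, rather than the llama-tokenize CLI) is to tokenize the same text on both sides and diff the IDs:

from transformers import AutoTokenizer
from llama_cpp import Llama  # pip install llama-cpp-python

text = "<|system|>\nYou are an assistant\n<|user|>\nhello"

hf = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
hf_ids = hf.encode(text)

# vocab_only loads just the tokenizer, not the weights; the path is assumed.
gguf = Llama(model_path="olmo2-instruct.gguf", vocab_only=True)
gguf_ids = gguf.tokenize(text.encode("utf-8"), special=True)

print("match" if hf_ids == gguf_ids else f"mismatch: {hf_ids} vs {gguf_ids}")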

@2015aroras (Contributor)

I used the PR, but didn't try llama-tokenize.

We're looking internally into the tokenizer differences. I'll report back when we have a conclusion.

@bartowski1182 (Contributor, Author)

Okay, thanks, sounds good. I'll also take a look now that I'm home.

@bartowski1182 (Contributor, Author) commented Nov 27, 2024

Well, here's something, @2015aroras: the non-instruct has a regex string in its pre_tokenizer, but the instruct doesn't.

non-instruct:

"pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

instruct:

"pre_tokenizer": {
    "type": "ByteLevel",
    "add_prefix_space": false,
    "trim_offsets": true,
    "use_regex": true
  },

I'm guessing that's what's making the difference.

Also, the post_processors are different, but I'm not sure it matters, since llama.cpp is detecting it fine and presumably won't use that anyway post-conversion.

@bartowski1182 (Contributor, Author)

Yup, so adding the pre_tokenizer makes it pass without my changes.
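Concretely, the workaround amounts to something like this (a sketch, assuming both tokenizer.json files have been downloaded to local paths): copy the base model's pre_tokenizer block over the instruct model's.

import json

# Assumed local paths to the two downloaded tokenizer.json files.
with open("OLMo-2-1124-7B/tokenizer.json") as f:
    base = json.load(f)
with open("OLMo-2-1124-7B-Instruct/tokenizer.json") as f:
    instruct = json.load(f)

# Replace the instruct model's bare ByteLevel pre_tokenizer with the base
# model's Sequence/Split one, which carries the explicit regex.
instruct["pre_tokenizer"] = base["pre_tokenizer"]

with open("OLMo-2-1124-7B-Instruct/tokenizer.json", "w") as f:
    json.dump(instruct, f, ensure_ascii=False, indent=2)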

@bartowski1182 (Contributor, Author) commented Nov 27, 2024

However, even with that change it doesn't seem to be tokenizing correctly :/

100257 -> '<|endoftext|>'
    27 -> '<'
    91 -> '|'
  9125 -> 'system'
    91 -> '|'
   397 -> '>
'
  2675 -> 'You'
   527 -> ' are'
   459 -> ' an'
 18328 -> ' assistant'
   198 -> '
'
    27 -> '<'
    91 -> '|'
   882 -> 'user'
    91 -> '|'
    29 -> '>'
 15339 -> 'hello'

@bartowski1182 (Contributor, Author)

But maybe that's just how the model behaves by default? I don't see any tokens for <|user|> or <|system|>, so it won't be able to tokenize them properly D:
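One quick check (a sketch, using the HF tokenizer for the instruct model) is whether the chat-role strings encode to single token IDs or get split into pieces:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/OLMo-2-1124-7B-Instruct")
for s in ("<|system|>", "<|user|>", "<|assistant|>", "<|endoftext|>"):
    ids = tok.encode(s, add_special_tokens=False)
    # A dedicated special token encodes to exactly one ID.
    print(s, ids, "single token" if len(ids) == 1 else "split into pieces")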

@bartowski1182 bartowski1182 changed the title Add olmo2 tokenizer to convert script DO NOT MERGE Add olmo2 tokenizer to convert script (leaving open for discussion) Nov 27, 2024
@bartowski1182 bartowski1182 marked this pull request as draft November 27, 2024 17:06
@bartowski1182 (Contributor, Author)

Any updates, @2015aroras?
