subword `#` should be an option. #33

FFengIll · 2023-09-14T11:02:52Z

Many BERT models use `##` as the subword symbol, but not all of them do.
Some popular BERT-based models define their own subword symbol.

For example, in e5 the symbol is ▁.

>>> a = '▁'
>>> a.encode('utf-8')
b'\xe2\x96\x81'
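
One detail worth checking before matching on this symbol: it is not the ASCII underscore. A small sketch using only the Python standard library (the variable name `marker` is just for illustration) confirms the code point and the byte length a byte-oriented tokenizer would have to compare against:

```python
import unicodedata

# The symbol from the example above, written as an escape to avoid confusion with "_".
marker = "\u2581"

print(unicodedata.name(marker))     # LOWER ONE EIGHTH BLOCK
print(marker == "_")                # False
print(len(marker.encode("utf-8")))  # 3
```

So a marker comparison in byte-oriented code has to handle a 3-byte sequence here, versus 2 bytes for `##`.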


FFengIll · 2023-09-14T11:03:27Z

Furthermore, no rule requires the subword symbol to be `##`.

FFengIll · 2023-09-14T11:20:14Z

In the model's tokenizer configuration, the subword symbol is usually called `replacement` or `continuing_subword_prefix`.
It can be found in tokenizer.json.
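
As a rough sketch of how the symbol could be read from a parsed tokenizer.json: the function name `find_subword_marker` is hypothetical, and the layout it assumes (a `continuing_subword_prefix` field under `model`, or a `Metaspace` entry with a `replacement` field under `decoder` or `pre_tokenizer`, possibly nested in a `Sequence`) should be checked against the actual file for each model.

```python
def find_subword_marker(tokenizer_config: dict):
    """Return (style, marker) guessed from a parsed tokenizer.json dict.

    style is "continuation_prefix" (marker starts non-initial pieces, e.g. "##"),
    "word_start" (marker starts the first piece of a word, e.g. "▁"),
    or None if nothing was found. The key layout is an assumption.
    """
    model = tokenizer_config.get("model") or {}
    prefix = model.get("continuing_subword_prefix")
    if prefix:
        return ("continuation_prefix", prefix)

    # Look for a Metaspace component, walking into nested Sequence lists.
    for key in ("decoder", "pre_tokenizer"):
        stack = [tokenizer_config.get(key)]
        while stack:
            node = stack.pop()
            if not isinstance(node, dict):
                continue
            if node.get("type") == "Metaspace" and node.get("replacement"):
                return ("word_start", node["replacement"])
            for child_key in ("pretokenizers", "decoders"):
                stack.extend(node.get(child_key) or [])
    return (None, None)


# Hand-written configs for illustration only, not copied from any model.
wordpiece_like = {"model": {"type": "WordPiece", "continuing_subword_prefix": "##"}}
metaspace_like = {
    "model": {"type": "Unigram"},
    "pre_tokenizer": {
        "type": "Sequence",
        "pretokenizers": [
            {"type": "WhitespaceSplit"},
            {"type": "Metaspace", "replacement": "\u2581"},
        ],
    },
}
print(find_subword_marker(wordpiece_like))
print(find_subword_marker(metaspace_like))
```

In practice the dict would come from `json.load` on the model's tokenizer.json; the two cases also differ in meaning (continuation prefix versus word-start marker), which is why the sketch returns a style along with the symbol.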

skeskinen · 2023-09-18T13:13:22Z

Hi, I was also wondering about the subword rules with regard to #31.
I remember trying to get the tokens from the tokenizer, like you did in the PR.
But I also remember having some issue with the subwords when I tried to do this.

Does the code in #31 handle subwords?
Do you have an idea on how to handle models like e5?

Also, unrelated, but a thought I had earlier: it would be nice to convert test_tokenizer.cpp to Python and run the tests against the reference tokenizers.

FFengIll · 2023-09-19T02:05:50Z

@skeskinen no, #31 only makes the vocab optional (because it may be missing).

This issue is a separate problem with subwords (I found it because I ran into too many unknown tokens when using e5).

Below are some token samples from BERT-based models.

In m3e, the subword symbol is ##, as in many BERT models.

"##a": 8139,
"03": 8140,
"09": 8141,
"08": 8142,
"28": 8143,
"##2": 8144,

In e5, the subword symbol is ▁ because they trained a new tokenizer (below is an excerpt from tokenizer.json).

      [
        "▁si",
        -7.355116367340088
      ],
      [
        "▁ja",
        -7.370460510253906
      ],
      [
        "▁za",
        -7.37307596206665
      ],
      [
        "▁v",
        -7.385393142700195
      ],
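
The two samples above differ in more than the symbol: `##` is attached to pieces that continue a word, while `▁` is attached to the piece that starts a word. A minimal sketch of a greedy longest-match lookup with the marker as an option might look like the following. The function `tokenize_word`, its parameters, and the toy vocabularies are hypothetical (the bare piece `"a"` is invented for the example), and greedy matching is only a simplification: the score stored next to each e5 piece suggests a scored model whose real segmentation would use those scores.

```python
def tokenize_word(word, vocab, marker="##", style="continuation_prefix", unk="[UNK]"):
    """Greedy longest-match split of one whitespace-delimited word.

    style="continuation_prefix": marker is prepended to non-initial pieces ("##a").
    style="word_start": marker is prepended to the first piece only ("▁si").
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if style == "continuation_prefix" and start > 0:
                piece = marker + piece
            elif style == "word_start" and start == 0:
                piece = marker + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabularies built from the samples above, plus a made-up bare "a".
m3e_like = {"03", "##a", "##2"}
e5_like = {"\u2581si", "\u2581ja", "\u2581v", "a"}

print(tokenize_word("03a", m3e_like))                                      # ['03', '##a']
print(tokenize_word("sia", e5_like, marker="\u2581", style="word_start"))  # ['▁si', 'a']
print(tokenize_word("sia", e5_like))                                       # ['[UNK]']
```

The last call applies the hard-coded `##` convention to the e5-style vocabulary and gets only an unknown token, which matches the symptom described above.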

FFengIll · 2023-09-19T02:07:27Z

For now, I do not have a good idea for how to solve this, so I have not opened a PR for it.
Maybe we need more research and discussion.

cgisky1980 · 2023-09-19T03:08:12Z

> For now, I do not have a good idea for this issue, so I do not implement a PR for it. Maybe we need to more research and discuss.

Keep it up! We need cross-platform embeddings for Chinese and English text. The multilingual version of E5 is quite good.

FFengIll mentioned this issue Sep 18, 2023

GGUF file format specification ggerganov/ggml#302

Merged
