I haven't yet tested how Chinese/Japanese characters behave in tokenization. Some special handling is required, since these languages don't separate words with spaces.
It should be relatively simple to copy the existing implementation:

- Implement the CJK handling in bert.cpp -> `bert_normalize_prompt` (see the sketch after this list).
- Add some test cases with Asian languages to test_tokenizer.cpp, taking the expected results from the Python Transformers tokenizer.
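A minimal sketch of the first step, assuming `bert_normalize_prompt` splits on whitespace before WordPiece (as the reference BERT BasicTokenizer does): insert a space around every CJK codepoint so each ideograph becomes its own "word". The helper names (`is_cjk_char`, `pad_cjk`) and the codepoint-vector representation are assumptions, not existing bert.cpp functions; the Unicode ranges mirror the ones in the Python BasicTokenizer.

```cpp
#include <cstdint>
#include <vector>

// True for the CJK ideograph blocks checked by the reference BERT BasicTokenizer.
static bool is_cjk_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||
           (cp >= 0x3400  && cp <= 0x4DBF)  ||
           (cp >= 0x20000 && cp <= 0x2A6DF) ||
           (cp >= 0x2A700 && cp <= 0x2B73F) ||
           (cp >= 0x2B740 && cp <= 0x2B81F) ||
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||
           (cp >= 0xF900  && cp <= 0xFAFF)  ||
           (cp >= 0x2F800 && cp <= 0x2FA1F);
}

// Insert spaces around CJK codepoints so the whitespace-based word splitter
// treats each ideograph as a separate word. `codepoints` is assumed to be the
// prompt already decoded from UTF-8 into Unicode codepoints.
static std::vector<uint32_t> pad_cjk(const std::vector<uint32_t> & codepoints) {
    std::vector<uint32_t> out;
    out.reserve(codepoints.size() * 2);
    for (uint32_t cp : codepoints) {
        if (is_cjk_char(cp)) {
            out.push_back(' ');
            out.push_back(cp);
            out.push_back(' ');
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```

For the test cases, the actual harness in test_tokenizer.cpp may look different; the point is only that the expected IDs come straight from the Python Transformers tokenizer rather than being written by hand. A hypothetical entry (UTF-8 source encoding assumed):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct tokenizer_case {
    std::string          prompt;
    std::vector<int32_t> expected_ids;  // copied verbatim from the HF tokenizer output
};

static const tokenizer_case cjk_cases[] = {
    { "你好，世界",       { /* fill in from the Python tokenizer */ } },
    { "これはテストです", { /* fill in from the Python tokenizer */ } },
};
```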
Alternatively:
Replace the whole tokenizer with the HuggingFace Rust implementation? It would probably need to be simplified at least a little, but I would be fine with adding some Rust code here if it doesn't complicate the build too much.