I haven't yet tested how Chinese/Japanese characters behave in tokenization. Some special handling is required, since these languages don't separate words with spaces.
It should be relatively simple to copy the existing implementation:

- Implement the CJK handling in bert.cpp -> `bert_normalize_prompt` (see the sketch after this list).
- Add some test cases with Asian languages to test_tokenizer.cpp, taking the expected results from the Python Transformers tokenizer.
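A minimal sketch of the first step, assuming `bert_normalize_prompt` splits on whitespace before WordPiece (as the reference BERT BasicTokenizer does): insert a space around every CJK codepoint so each ideograph becomes its own "word". The helper names (`is_cjk_char`, `pad_cjk`) and the codepoint-vector representation are assumptions, not existing bert.cpp functions; the Unicode ranges mirror the ones in the Python BasicTokenizer.

```cpp
#include <cstdint>
#include <vector>

// True for the CJK ideograph blocks checked by the reference BERT BasicTokenizer.
static bool is_cjk_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||
           (cp >= 0x3400  && cp <= 0x4DBF)  ||
           (cp >= 0x20000 && cp <= 0x2A6DF) ||
           (cp >= 0x2A700 && cp <= 0x2B73F) ||
           (cp >= 0x2B740 && cp <= 0x2B81F) ||
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||
           (cp >= 0xF900  && cp <= 0xFAFF)  ||
           (cp >= 0x2F800 && cp <= 0x2FA1F);
}

// Insert spaces around CJK codepoints so the whitespace-based word splitter
// treats each ideograph as a separate word. `codepoints` is assumed to be the
// prompt already decoded from UTF-8 into Unicode codepoints.
static std::vector<uint32_t> pad_cjk(const std::vector<uint32_t> & codepoints) {
    std::vector<uint32_t> out;
    out.reserve(codepoints.size() * 2);
    for (uint32_t cp : codepoints) {
        if (is_cjk_char(cp)) {
            out.push_back(' ');
            out.push_back(cp);
            out.push_back(' ');
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```

For the test cases, the actual harness in test_tokenizer.cpp may look different; the point is only that the expected IDs come straight from the Python Transformers tokenizer rather than being written by hand. A hypothetical entry (UTF-8 source encoding assumed):

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct tokenizer_case {
    std::string          prompt;
    std::vector<int32_t> expected_ids;  // copied verbatim from the HF tokenizer output
};

static const tokenizer_case cjk_cases[] = {
    { "你好，世界",       { /* fill in from the Python tokenizer */ } },
    { "これはテストです", { /* fill in from the Python tokenizer */ } },
};
```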
Alternatively:
Replace the whole tokenizer with the HuggingFace Rust implementation? It would probably need to be simplified at least a little, but I would be fine with adding some Rust code here if it doesn't complicate the build too much.