
implement do_handle_chinese_characters in tokenizing #1

Open
skeskinen opened this issue Apr 27, 2023 · 1 comment
Labels
good first issue Good for newcomers

Comments

@skeskinen
Owner

I haven't yet tried what happens with Chinese/Japanese characters in tokenization. Some special handling is required, since these languages don't put spaces between words.

It should be relatively simple to copy an existing implementation:

  1. Get inspiration from an existing implementation, e.g.: https://github.com/huggingface/tokenizers/blob/ef5f50605ddf9f8caef1598c0e4853862b9707a7/tokenizers/src/normalizers/bert.rs#L98
  2. Implement that in bert.cpp -> bert_normalize_prompt (a sketch of the core logic follows this list).
  3. Add some test cases with Asian languages to test_tokenizer.cpp; get the expected results from the Python Transformers library tokenizer.
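
For reference, a minimal sketch of what that step could look like. The Unicode ranges follow Google's original BERT tokenization.py (HuggingFace's Rust normalizer implements the same idea); the function names here (`is_chinese_char`, `handle_chinese_chars`) are illustrative, not existing bert.cpp symbols, and it assumes the prompt has already been decoded from UTF-8 into codepoints, which bert_normalize_prompt would need to do anyway:

```cpp
#include <cstdint>
#include <vector>

// True if cp falls in a CJK ideograph block. The BERT basic tokenizer
// treats every such codepoint as its own "word".
static bool is_chinese_char(uint32_t cp) {
    return (cp >= 0x4E00  && cp <= 0x9FFF)  ||  // CJK Unified Ideographs
           (cp >= 0x3400  && cp <= 0x4DBF)  ||  // Extension A
           (cp >= 0x20000 && cp <= 0x2A6DF) ||  // Extension B
           (cp >= 0x2A700 && cp <= 0x2B73F) ||  // Extension C
           (cp >= 0x2B740 && cp <= 0x2B81F) ||  // Extension D
           (cp >= 0x2B820 && cp <= 0x2CEAF) ||  // Extension E
           (cp >= 0xF900  && cp <= 0xFAFF)  ||  // Compatibility Ideographs
           (cp >= 0x2F800 && cp <= 0x2FA1F);    // Compatibility Supplement
}

// Surround every CJK codepoint with spaces so the subsequent
// whitespace split isolates each character as its own token.
static std::vector<uint32_t> handle_chinese_chars(const std::vector<uint32_t> & cps) {
    std::vector<uint32_t> out;
    out.reserve(cps.size() * 3);
    for (uint32_t cp : cps) {
        if (is_chinese_char(cp)) {
            out.push_back(' ');
            out.push_back(cp);
            out.push_back(' ');
        } else {
            out.push_back(cp);
        }
    }
    return out;
}
```

Note that these ranges cover kanji but not hiragana/katakana; the upstream BERT tokenizer deliberately leaves kana to the normal whitespace + WordPiece path, so matching that behavior keeps outputs comparable.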

Alternatively:
Replace the whole tokenizer with the HuggingFace Rust implementation? It would probably have to be simplified at least a bit, but I would be fine adding some Rust code here if it doesn't complicate the build too much.

@skeskinen added the good first issue label on Apr 27, 2023
@skeskinen
Owner Author

Another implementation of BERT tokenization: https://github.com/zhihu/cuBERT/blob/master/src/cuBERT/tokenization.cpp
Also, it would probably make sense to move the tokenization tests to Python; that way it would be easy to compare against the hf-transformers output. (A sketch of the C++ side of such a comparison follows.)
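
The C++ side could be a small driver that tokenizes each stdin line and prints the ids, letting a Python harness diff them against the reference tokenizer. This is a hypothetical sketch: `bert_load_from_file`, `bert_tokenize`, and `bert_vocab_id` are assumed to match the declarations in bert.h and may need adjusting:

```cpp
#include <cstdio>
#include <vector>
#include "bert.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.bin < prompts.txt\n", argv[0]);
        return 1;
    }
    bert_ctx * ctx = bert_load_from_file(argv[1]);

    char line[4096];
    while (fgets(line, sizeof(line), stdin)) {
        std::vector<bert_vocab_id> tokens(512);
        int32_t n_tokens = 0;
        bert_tokenize(ctx, line, tokens.data(), &n_tokens, (int32_t) tokens.size());
        // One line of space-separated ids per input line, easy to diff.
        for (int32_t i = 0; i < n_tokens; i++) {
            printf("%d ", tokens[i]);
        }
        printf("\n");
    }
    return 0;
}
```

On the Python side, `transformers.AutoTokenizer.from_pretrained(...)` followed by `.encode(...)` would give the reference ids to diff against.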
