Token mis-manipulation for Chinese Characters #108

yytang220 · 2024-05-24T08:37:09Z

Code is here https://github.com/noamgat/lm-format-enforcer/blob/main/lmformatenforcer/integrations/transformers.py#L69

This line affects generation of Chinese characters. Replacing it with
cleaned = decoded works around the problem temporarily.

Since I don't know the whole logic, I'm looking for a permanent solution.
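
For readers following along, here is a minimal sketch of the likely mechanism. It does not use the library. It assumes the linked line strips trailing U+FFFD replacement characters from the decoded text, which is what the later "with rstrip / without rstrip" comparison suggests. The function names below are hypothetical. A Chinese character spans several UTF-8 bytes, so a token sequence that ends partway through a character decodes to a replacement character, and stripping it makes the partial token look like it added nothing.

```python
# Sketch only: plain byte decoding stands in for a byte-level tokenizer.
REPLACEMENT = "\ufffd"  # U+FFFD, emitted for incomplete or invalid UTF-8


def decode_with_rstrip(raw: bytes) -> str:
    """Hypothetical stand-in for a decode function that strips trailing U+FFFD."""
    return raw.decode("utf-8", errors="replace").rstrip(REPLACEMENT)


def decode_plain(raw: bytes) -> str:
    """Hypothetical stand-in for the workaround (cleaned = decoded)."""
    return raw.decode("utf-8", errors="replace")


char_bytes = "中".encode("utf-8")  # one Chinese character, three bytes
partial = char_bytes[:2]           # sequence cut off mid-character

# With rstrip, the unfinished character vanishes from the decoded text.
print(ascii(decode_with_rstrip(b"a" + partial)))
# Without rstrip, the unfinished character stays visible as U+FFFD.
print(ascii(decode_plain(b"a" + partial)))
# Once all three bytes are present, both variants agree.
print(ascii(decode_with_rstrip(char_bytes)), ascii(decode_plain(char_bytes)))
```

Whether this is the exact cause in the enforcer depends on how the decoded text is used downstream, which this sketch does not model.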

noamgat · 2024-05-25T05:35:30Z

Can you give a code sample of a reproduction of a problematic scenario that this line change solves?

yytang220 · 2024-05-29T12:02:16Z

> Can you give a code sample of a reproduction of a problematic scenario that this line change solves?

before: (screenshot not preserved)

after: (screenshot not preserved)

Per-token decode using different decode_fn (top: with rstrip, bottom: without rstrip).

noamgat · 2024-05-31T06:26:48Z

Hi, thanks for the detailed example.
In order to reproduce it in code, can you share the model + prompt + schema that you are using in order to generate text?
Alternatively, the tokenizer + token sequence (numbers, not letters) that you get.

yytang220 · 2024-06-04T09:07:24Z

> Hi, thanks for the detailed example. In order to reproduce it in code, can you share the model + prompt + schema that you are using in order to generate text? Alternatively, the tokenizer + token sequence (numbers, not letters) that you get.

token_list: [19788, 818, 2828, 440, 13501, 4202, 8798, 697, 121, 2256]
tokenizer: deepseek-coder (https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base)
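
A possible way to turn the token list above into a reproduction, as a rough sketch: decode each growing prefix of the sequence and report the prefixes where stripping a trailing U+FFFD changes the text. The helper name `compare_prefix_decodes` and the byte-level stand-in decoder are hypothetical and only there so the snippet runs offline. The commented-out lines show how the real tokenizer could be plugged in, assuming `transformers` is installed and the model files can be downloaded.

```python
from typing import Callable, List, Sequence, Tuple

REPLACEMENT = "\ufffd"  # U+FFFD replacement character


def compare_prefix_decodes(
    decode: Callable[[List[int]], str],
    token_ids: Sequence[int],
) -> List[Tuple[int, str, str]]:
    """Return (prefix_length, plain, stripped) for prefixes where rstrip changes the text."""
    rows = []
    for end in range(1, len(token_ids) + 1):
        plain = decode(list(token_ids[:end]))
        stripped = plain.rstrip(REPLACEMENT)
        if plain != stripped:
            rows.append((end, plain, stripped))
    return rows


def byte_decode(ids: List[int]) -> str:
    """Offline stand-in decoder: treats every token id as one UTF-8 byte."""
    return bytes(ids).decode("utf-8", errors="replace")


demo_ids = list("好".encode("utf-8"))  # three byte-sized "tokens"
for end, plain, stripped in compare_prefix_decodes(byte_decode, demo_ids):
    print(end, ascii(plain), ascii(stripped))

# With the tokenizer from this thread (not executed here):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")
# token_list = [19788, 818, 2828, 440, 13501, 4202, 8798, 697, 121, 2256]
# for row in compare_prefix_decodes(tok.decode, token_list):
#     print(row)
```

Any prefix lengths printed for the real token list would mark the positions where the two decode variants disagree.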
