[Streamer] Fix UTF-8 handling in streamer (mlc-ai#2978)
This PR fixes a bug in the streamer's handling of UTF-8 characters.
Prior to this PR, the streamer assumed that a replacement
character (`�`) always corresponds to an entire token. However, with
the Qwen2 model tokenizer, some tokens become ` �` when decoded directly,
which breaks that assumption and causes the streamer to produce
incorrect output.

This PR fixes the issue with a safer behavior that does not rely
on that assumption.
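
To illustrate why the fixed-width trim fails, here is a minimal sketch. It assumes a toy decoder (`ToyDecode`, a hypothetical stand-in that just concatenates per-token strings, not the real tokenizer API) and two hypothetical helpers that mirror the old and new lines of the diff. The replacement character U+FFFD takes three bytes in UTF-8, so a token that decodes to ` �` takes four, and trimming three bytes leaves a stray space behind.

```cpp
#include <cassert>
#include <string>
#include <vector>

// U+FFFD (the replacement character) encoded as UTF-8: three bytes.
const std::string kReplacement = "\xEF\xBF\xBD";

// Hypothetical toy decoder: concatenates a fixed string per token id.
// Token 2 decodes to a space followed by U+FFFD, like the Qwen2 case
// described in the commit message. This is an assumption for illustration.
std::string ToyDecode(const std::vector<int>& tokens) {
  static const std::vector<std::string> kVocab = {"Hello", ",", " " + kReplacement};
  std::string out;
  for (int id : tokens) out += kVocab.at(id);
  return out;
}

// Old behavior: assume the trailing token is exactly one U+FFFD (3 bytes).
std::string StripByFixedWidth(const std::string& validated_str) {
  return validated_str.substr(0, validated_str.length() - 3);
}

// New behavior: drop the trailing token, decode what remains, and remove
// the prefix that was already emitted.
std::string StripByRedecoding(std::vector<int> all_tokens, const std::string& prefix_str) {
  all_tokens.pop_back();
  return ToyDecode(all_tokens).substr(prefix_str.length());
}
```

With tokens `{0, 1, 2}` and an already emitted prefix of `"Hello"`, the fixed-width trim yields `", "` (comma plus a leftover space), while re-decoding yields `","`. Re-decoding costs an extra decode call but does not depend on how many bytes the trailing token contributes.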
MasterJH5574 authored Oct 14, 2024
1 parent 82b9d85 commit fead3e5
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion cpp/tokenizers/streamer.cc
@@ -64,7 +64,8 @@ std::string TextStreamerObj::Put(const std::vector<int32_t>& delta_tokens) {
 0) {
 new_pending_tokens.push_back(pending_tokens_.back());
 pending_tokens_.pop_back();
-validated_str = validated_str.substr(0, validated_str.length() - 3);
+all_tokens.pop_back();
+validated_str = tokenizer_->Decode(all_tokens).substr(prefix_str.length());
 }
 } else {
 // Case 2. prefix_str is not a prefix of `full_str`.