Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding the intended behaviour of _encode_bytes #288

Open
ashleyholman opened this issue Apr 29, 2024 · 0 comments
Open

Understanding the intended behaviour of _encode_bytes #288

ashleyholman opened this issue Apr 29, 2024 · 0 comments

Comments

@ashleyholman
Copy link

ashleyholman commented Apr 29, 2024

I'm working on a PR and would like to understand the reason for the behaviour of this _encode_bytes function when it hits an invalid UTF-8 sequence, to ensure I don't break this functionality.

tiktoken/src/lib.rs

Lines 474 to 495 in 1b9faf2

fn _encode_bytes(&self, py: Python, bytes: &[u8]) -> Vec<Rank> {
py.allow_threads(|| {
match std::str::from_utf8(bytes) {
Ok(text) => self._encode_ordinary_native(text),
Err(e) => {
let text = unsafe { std::str::from_utf8_unchecked(&bytes[..e.valid_up_to()]) };
let (tokens, last_piece_token_len) = self._encode_native(text, &HashSet::new());
let (mut tokens, last_piece_token_len) =
self._increase_last_piece_token_len(tokens, last_piece_token_len);
if !tokens.is_empty() && last_piece_token_len > 0 {
// Lop off the tokens from the last piece and run BPE on the remaining bytes
// Somewhat niche, but this may not be correct if we'd have had a regex
// split between the valid UTF-8 and the invalid bytes, which is why this
// method is private
let mut unstable_bytes =
self._decode_native(&tokens[tokens.len() - last_piece_token_len..]);
unstable_bytes.extend_from_slice(&bytes[e.valid_up_to()..]);
tokens.truncate(tokens.len() - last_piece_token_len);
match self.encoder.get(&unstable_bytes) {
Some(token) => tokens.push(*token),
None => tokens.extend(&byte_pair_encode(&unstable_bytes, &self.encoder)),

How come only the first valid UTF-8 sequence is encoded with _encode_native (honouring regex splits) but all subsequent bytes are encoded as a single piece with byte_pair_encode? The Utf8Error returned by std::str::from_utf8 contains an error_len() property which gives the length of the invalid byte sequence. So couldn't byte_pair_encode be used only for the invalid sequence, and then use _encode_native again for any subsequent valid sequence? This can be implemented in a loop similar to the example loop in these Rust docs: https://doc.rust-lang.org/std/str/struct.Utf8Error.html

And more generally I'm looking to understand the current use cases that this is supporting and the reason it's implemented like it is. Thanks if you can share any further context.

@ashleyholman ashleyholman changed the title Further documentation of encode_with_unstable Understanding the intended behaviour of _encode_bytes Apr 29, 2024
@ashleyholman ashleyholman changed the title Understanding the intended behaviour of _encode_bytes Understanding the intended behaviour of _encode_bytes Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant