I'm working on a PR and would like to understand the reason for the behaviour of this _encode_bytes function when it hits an invalid UTF-8 sequence, to ensure I don't break this functionality.
How come only the first valid UTF-8 sequence is encoded with _encode_native (honouring regex splits), while all subsequent bytes are encoded as a single piece with byte_pair_encode? The Utf8Error returned by std::str::from_utf8 has an error_len() method which gives the length of the invalid byte sequence (or None if the input ends partway through a character). So couldn't byte_pair_encode be used only for the invalid sequence, with _encode_native used again for any subsequent valid sequence? This could be implemented in a loop similar to the example loop in the Rust docs: https://doc.rust-lang.org/std/str/struct.Utf8Error.html
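To make the proposal concrete, here is a minimal sketch of such a loop. It is not tiktoken code: `Chunk` and `split_utf8_chunks` are hypothetical names invented for illustration, and the sketch only splits the input into valid and invalid runs. In the real function, each valid run would presumably go through _encode_native and each invalid run through byte_pair_encode; that wiring is left out here.

```rust
use std::str;

/// A run of input that is either valid UTF-8 text or raw invalid bytes.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum Chunk<'a> {
    Valid(&'a str),
    Invalid(&'a [u8]),
}

/// Hypothetical helper: split `input` into alternating valid and invalid runs,
/// following the pattern of the example loop in the `Utf8Error` docs.
fn split_utf8_chunks(mut input: &[u8]) -> Vec<Chunk<'_>> {
    let mut chunks = Vec::new();
    while !input.is_empty() {
        match str::from_utf8(input) {
            Ok(valid) => {
                // Everything that remains is valid UTF-8.
                chunks.push(Chunk::Valid(valid));
                break;
            }
            Err(err) => {
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    // valid_up_to() guarantees this prefix is valid UTF-8.
                    chunks.push(Chunk::Valid(str::from_utf8(valid).unwrap()));
                }
                // error_len() is None when the input ends mid-character;
                // in that case treat the whole remainder as invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                let (bad, after) = rest.split_at(bad_len);
                chunks.push(Chunk::Invalid(bad));
                input = after;
            }
        }
    }
    chunks
}
```

For example, `split_utf8_chunks(b"hello\xffworld")` yields a valid run `"hello"`, an invalid run containing the single byte `0xff`, and a valid run `"world"`. Consecutive invalid sequences come out as separate `Invalid` chunks; whether those should be merged before byte-pair encoding is an open design question this sketch does not settle.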
And more generally I'm looking to understand the current use cases that this is supporting and the reason it's implemented like it is. Thanks if you can share any further context.
ashleyholman changed the title from "Further documentation of encode_with_unstable" to "Understanding the intended behaviour of _encode_bytes" on Apr 29, 2024
Referenced code: tiktoken/src/lib.rs, lines 474 to 495 in 1b9faf2