Understanding the intended behaviour of `_encode_bytes` #288

ashleyholman · 2024-04-29T09:39:42Z

I'm working on a PR and would like to understand the reason for the behaviour of this _encode_bytes function when it hits an invalid UTF-8 sequence, to ensure I don't break this functionality.

tiktoken/src/lib.rs

Lines 474 to 495 in 1b9faf2

    
           fn _encode_bytes(&self, py: Python, bytes: &[u8]) -> Vec<Rank> { 
        
               py.allow_threads(|| { 
        
                   match std::str::from_utf8(bytes) { 
        
                       Ok(text) => self._encode_ordinary_native(text), 
        
                       Err(e) => { 
        
                           let text = unsafe { std::str::from_utf8_unchecked(&bytes[..e.valid_up_to()]) }; 
        
                           let (tokens, last_piece_token_len) = self._encode_native(text, &HashSet::new()); 
        
                           let (mut tokens, last_piece_token_len) = 
        
                               self._increase_last_piece_token_len(tokens, last_piece_token_len); 
        
                           if !tokens.is_empty() && last_piece_token_len > 0 { 
        
                               // Lop off the tokens from the last piece and run BPE on the remaining bytes 
        
                               // Somewhat niche, but this may not be correct if we'd have had a regex 
        
                               // split between the valid UTF-8 and the invalid bytes, which is why this 
        
                               // method is private 
        
                               let mut unstable_bytes = 
        
                                   self._decode_native(&tokens[tokens.len() - last_piece_token_len..]); 
        
                               unstable_bytes.extend_from_slice(&bytes[e.valid_up_to()..]); 
        
                               tokens.truncate(tokens.len() - last_piece_token_len); 
        
                               match self.encoder.get(&unstable_bytes) { 
        
                                   Some(token) => tokens.push(*token), 
        
                                   None => tokens.extend(&byte_pair_encode(&unstable_bytes, &self.encoder)),

How come only the first valid UTF-8 sequence is encoded with _encode_native (honouring regex splits) but all subsequent bytes are encoded as a single piece with byte_pair_encode? The Utf8Error returned by std::str::from_utf8 contains an error_len() property which gives the length of the invalid byte sequence. So couldn't byte_pair_encode be used only for the invalid sequence, and then use _encode_native again for any subsequent valid sequence? This can be implemented in a loop similar to the example loop in these Rust docs: https://doc.rust-lang.org/std/str/struct.Utf8Error.html

And more generally I'm looking to understand the current use cases that this is supporting and the reason it's implemented like it is. Thanks if you can share any further context.

The text was updated successfully, but these errors were encountered:

ashleyholman changed the title ~~Further documentation of encode_with_unstable~~ Understanding the intended behaviour of _encode_bytes Apr 29, 2024

ashleyholman changed the title ~~Understanding the intended behaviour of _encode_bytes~~ Understanding the intended behaviour of _encode_bytes Apr 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Understanding the intended behaviour of `_encode_bytes` #288

Understanding the intended behaviour of `_encode_bytes` #288

ashleyholman commented Apr 29, 2024 •

edited

Loading

Understanding the intended behaviour of _encode_bytes #288

Understanding the intended behaviour of _encode_bytes #288

Comments

ashleyholman commented Apr 29, 2024 • edited Loading

Understanding the intended behaviour of `_encode_bytes` #288

Understanding the intended behaviour of `_encode_bytes` #288

ashleyholman commented Apr 29, 2024 •

edited

Loading