Fixed Length Pre-Tokenizer #1713

jonvet · 2025-01-05T22:59:02Z

Introduces a pre-tokenizer that splits text into fixed-length chunks (closes #1697).
The pre_tokenize method could be made more concise by first collecting the character positions into a vector, like so:

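    let mut splits = Vec::new();

    // char_positions holds (byte offset, char) pairs; each chunk is up to self.length chars.
    for chunk in char_positions.chunks(self.length) {
        // Byte offset of the first char in the chunk.
        let start = chunk.first().map(|(i, _)| *i).unwrap_or(0);
        // Byte offset just past the last char in the chunk.
        let end = chunk.last().map(|(i, c)| i + c.len_utf8()).unwrap_or(text.len());
        splits.push(normalized.slice(Range::Normalized(start..end))
            .ok_or("Failed to slice normalized text")?);
    }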

but that would use a bit more memory for the intermediate vector, so I kept the current approach instead.
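For comparison, here is a minimal sketch of the streaming alternative, which walks `char_indices()` once and never builds a vector of positions. It is not the code in this PR: `fixed_length_ranges` is a hypothetical standalone helper that works on a plain `&str` and returns byte ranges, whereas the real pre-tokenizer would slice a `NormalizedString` with those ranges.

```rust
use std::ops::Range;

/// Hypothetical helper (not part of the tokenizers crate): returns the byte
/// ranges that cut `text` into chunks of at most `length` characters.
/// Ranges always fall on char boundaries, so `&text[range]` is safe.
fn fixed_length_ranges(text: &str, length: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    // Assumption: a zero length yields no splits rather than an error.
    if length == 0 || text.is_empty() {
        return ranges;
    }

    let mut start = 0; // byte offset where the current chunk begins
    let mut count = 0; // number of chars seen in the current chunk

    for (i, c) in text.char_indices() {
        count += 1;
        if count == length {
            // Close the chunk just past the current (possibly multi-byte) char.
            let end = i + c.len_utf8();
            ranges.push(start..end);
            start = end;
            count = 0;
        }
    }

    // Emit the trailing chunk if the text length is not a multiple of `length`.
    if start < text.len() {
        ranges.push(start..text.len());
    }
    ranges
}
```

Calling `fixed_length_ranges("héllo wörld", 4)` gives the ranges `0..5`, `5..10` and `10..13`, i.e. the chunks "héll", "o wö" and "rld". Only two counters are kept between iterations, which is the memory trade-off mentioned above.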

jonvet added 2 commits January 5, 2025 22:46

Fixed Length PRe-Tokenizer

6f73c55

remove comment

221d55e
