Rethink the core overlap computation for the tokenizers -- AIList and speed tokenizers #81

Open
nleroy917 opened this issue Feb 4, 2025 · 0 comments
Assignees
Labels
AIList brainstorming enhancement New feature or request tokenizers Region tokenization

Comments

@nleroy917
Member

The current tokenizers use rust-lapper for overlap computation. It works well, but we should be using our own algorithms -- namely AIList. Moreover, tokenization could become even faster if we make some assumptions about our data, e.g., that it is already sorted.
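To ground the discussion, here is a minimal sketch of the core AIList idea (a single-component version): intervals sorted by start, augmented with a running maximum of end positions so a backward scan from a binary-searched position can stop early. All names here are illustrative, not from the gtars codebase.

```rust
/// Illustrative interval type; half-open coordinates [start, end).
#[derive(Debug, Clone, Copy)]
struct Interval {
    start: u32,
    end: u32,
}

/// Single-component AIList-style index (the full AIList additionally
/// decomposes long-spanning intervals into separate components).
struct AugmentedList {
    intervals: Vec<Interval>,
    max_ends: Vec<u32>, // max_ends[i] = max end over intervals[0..=i]
}

impl AugmentedList {
    fn new(mut intervals: Vec<Interval>) -> Self {
        intervals.sort_by_key(|iv| iv.start);
        let mut max_ends = Vec::with_capacity(intervals.len());
        let mut running = 0u32;
        for iv in &intervals {
            running = running.max(iv.end);
            max_ends.push(running);
        }
        Self { intervals, max_ends }
    }

    /// Collect all intervals overlapping [qstart, qend).
    fn query(&self, qstart: u32, qend: u32) -> Vec<Interval> {
        // First interval whose start >= qend; nothing at or after it overlaps.
        let hi = self.intervals.partition_point(|iv| iv.start < qend);
        let mut hits = Vec::new();
        // Scan backward, stopping once the running max end drops below qstart:
        // every earlier interval must then end at or before qstart too.
        for i in (0..hi).rev() {
            if self.max_ends[i] <= qstart {
                break;
            }
            if self.intervals[i].end > qstart {
                hits.push(self.intervals[i]);
            }
        }
        hits.reverse();
        hits
    }
}
```

The early-exit on `max_ends` is what lets AIList avoid the tree traversal overhead of interval trees while keeping queries sublinear in practice.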

Things we should decide on:

  1. Can we replace rust-lapper with AIList?
  2. Can we create a version of the tokenizers that assumes sorted files? Call it a SpeedTokenizer.
  3. Should that SpeedTokenizer verify sortedness, or trust the caller?
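As a sketch of what points 2 and 3 could look like: if both the query regions and the universe are sorted by start, overlaps can be found with a single merge-style sweep, and a cheap O(n) pre-check can guard against unsorted input. All function names here are hypothetical, not part of any existing tokenizer API.

```rust
/// Hypothetical pre-check a SpeedTokenizer could run once before
/// trusting the sweep below (point 3). Regions are (start, end) pairs.
fn is_sorted_by_start(regions: &[(u32, u32)]) -> bool {
    regions.windows(2).all(|w| w[0].0 <= w[1].0)
}

/// Merge-style sweep over two start-sorted region lists; returns index
/// pairs of overlapping (query, universe) regions. Half-open [start, end).
fn sweep_overlaps(queries: &[(u32, u32)], universe: &[(u32, u32)]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    let mut lo = 0; // first universe region that could still overlap
    for (qi, &(qs, qe)) in queries.iter().enumerate() {
        // A universe region ending at or before qs can never match this
        // query, nor any later one, since query starts are non-decreasing.
        while lo < universe.len() && universe[lo].1 <= qs {
            lo += 1;
        }
        for (ui, &(us, ue)) in universe.iter().enumerate().skip(lo) {
            if us >= qe {
                break; // universe is start-sorted; no later region can overlap
            }
            if ue > qs {
                hits.push((qi, ui));
            }
        }
    }
    hits
}
```

This avoids any per-query binary search or tree walk, which is where the speedup over a general-purpose structure like rust-lapper would come from, at the cost of requiring sorted input.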
@nleroy917 nleroy917 self-assigned this Feb 4, 2025
@nleroy917 nleroy917 added enhancement New feature or request tokenizers Region tokenization brainstorming AIList labels Feb 4, 2025