Rethink the core overlap computation for the tokenizers -- AIList and speed tokenizers #81

Open
nleroy917 opened this issue Feb 4, 2025 · 0 comments
Assignees
Labels
AIList brainstorming enhancement New feature or request tokenizers Region tokenization

Comments

@nleroy917
Member

The current tokenizers use rust-lapper for overlap computation. It works well, but we should be using our own algorithms -- namely AIList. Moreover, tokenization could become even faster if we make some assumptions about our data, e.g., that it is already sorted.
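To ground the discussion, here is a minimal sketch of the core AIList idea (a single-component version): intervals sorted by start, augmented with a running maximum of end positions so a backward scan from a binary-searched position can stop early. All names here are illustrative, not from the gtars codebase.

```rust
/// Illustrative interval type; half-open coordinates [start, end).
#[derive(Debug, Clone, Copy)]
struct Interval {
    start: u32,
    end: u32,
}

/// Single-component AIList-style index (the full AIList additionally
/// decomposes long-spanning intervals into separate components).
struct AugmentedList {
    intervals: Vec<Interval>,
    max_ends: Vec<u32>, // max_ends[i] = max end over intervals[0..=i]
}

impl AugmentedList {
    fn new(mut intervals: Vec<Interval>) -> Self {
        intervals.sort_by_key(|iv| iv.start);
        let mut max_ends = Vec::with_capacity(intervals.len());
        let mut running = 0u32;
        for iv in &intervals {
            running = running.max(iv.end);
            max_ends.push(running);
        }
        Self { intervals, max_ends }
    }

    /// Collect all intervals overlapping [qstart, qend).
    fn query(&self, qstart: u32, qend: u32) -> Vec<Interval> {
        // First interval whose start >= qend; nothing at or after it overlaps.
        let hi = self.intervals.partition_point(|iv| iv.start < qend);
        let mut hits = Vec::new();
        // Scan backward, stopping once the running max end drops below qstart:
        // every earlier interval must then end at or before qstart too.
        for i in (0..hi).rev() {
            if self.max_ends[i] <= qstart {
                break;
            }
            if self.intervals[i].end > qstart {
                hits.push(self.intervals[i]);
            }
        }
        hits.reverse();
        hits
    }
}
```

The early-exit on `max_ends` is what lets AIList avoid the tree traversal overhead of interval trees while keeping queries sublinear in practice.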

Things we should decide on:

  1. Can we replace rust-lapper with AIList?
  2. Can we create a version of the tokenizers that assumes sorted files? Call it a SpeedTokenizer.
  3. Should that SpeedTokenizer verify sortedness, or trust the caller?
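As a sketch of what points 2 and 3 could look like: if both the query regions and the universe are sorted by start, overlaps can be found with a single merge-style sweep, and a cheap O(n) pre-check can guard against unsorted input. All function names here are hypothetical, not part of any existing tokenizer API.

```rust
/// Hypothetical pre-check a SpeedTokenizer could run once before
/// trusting the sweep below (point 3). Regions are (start, end) pairs.
fn is_sorted_by_start(regions: &[(u32, u32)]) -> bool {
    regions.windows(2).all(|w| w[0].0 <= w[1].0)
}

/// Merge-style sweep over two start-sorted region lists; returns index
/// pairs of overlapping (query, universe) regions. Half-open [start, end).
fn sweep_overlaps(queries: &[(u32, u32)], universe: &[(u32, u32)]) -> Vec<(usize, usize)> {
    let mut hits = Vec::new();
    let mut lo = 0; // first universe region that could still overlap
    for (qi, &(qs, qe)) in queries.iter().enumerate() {
        // A universe region ending at or before qs can never match this
        // query, nor any later one, since query starts are non-decreasing.
        while lo < universe.len() && universe[lo].1 <= qs {
            lo += 1;
        }
        for (ui, &(us, ue)) in universe.iter().enumerate().skip(lo) {
            if us >= qe {
                break; // universe is start-sorted; no later region can overlap
            }
            if ue > qs {
                hits.push((qi, ui));
            }
        }
    }
    hits
}
```

This avoids any per-query binary search or tree walk, which is where the speedup over a general-purpose structure like rust-lapper would come from, at the cost of requiring sorted input.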
@nleroy917 nleroy917 self-assigned this Feb 4, 2025
@nleroy917 nleroy917 added enhancement New feature or request tokenizers Region tokenization brainstorming AIList labels Feb 4, 2025