
Dedicated tokenizer for byte level transformers #36202

Open
apehex opened this issue Feb 14, 2025 · 1 comment · May be fixed by #36216
Labels
Feature request

Comments


apehex commented Feb 14, 2025

Feature request

There are alternative transformer architectures that handle bytes directly, such as the Byte Latent Transformer.

Instead of tokenizing according to a vocabulary, the idea would be to feed the model the raw encoding bytes, as in the sketch below.
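
A minimal sketch of the idea, assuming plain UTF-8 encoding (this is not the PR's implementation, just an illustration): each byte becomes a token ID drawn from a fixed 256-entry "vocabulary".

```python
# Hedged sketch: treat each UTF-8 byte as a token ID in [0, 255].
text = "Hello, world!"

# Encode: raw UTF-8 bytes, no vocabulary lookup needed.
ids = list(text.encode("utf-8"))

# Decode: reassemble the byte string and turn it back into text.
decoded = bytes(ids).decode("utf-8", errors="replace")
assert decoded == text

# Optional: group bytes into fixed-size patches of 4, so each patch
# indexes one of 256**4 possible values (see Motivation below).
patch_size = 4
padded = ids + [0] * (-len(ids) % patch_size)
patches = [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]
```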

Motivation

Combinations of bytes are more expressive than a flat vocabulary and avoid embedding and output layers with dimensions on the order of 100k.
A patch of 4 bytes can represent 256^4 = 4,294,967,296 distinct tokens of length 4.

Your contribution

I have a draft that I will PR shortly!

apehex added the Feature request label on Feb 14, 2025
Rocketknight1 (Member) commented

cc @ArthurZucker - I'm not sure about this, so asking a core maintainer! I think byte-level tokenization is simple enough that it doesn't need a dedicated class yet, but we may change that if we start seeing a lot of models using it (note that we haven't seen any actual released models using the Byte Latent Transformer architecture!)
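
For context, a minimal sketch of byte-level tokenization with the byte-level tokenizer already shipped for ByT5 (the google/byt5-small checkpoint is used only as an illustration; the offset of 3 for its special tokens is a detail of that tokenizer, not of the proposal above):

```python
from transformers import AutoTokenizer

# ByT5's tokenizer already works at the byte level: token IDs are the
# UTF-8 bytes shifted by the number of special tokens (3 for ByT5).
tok = AutoTokenizer.from_pretrained("google/byt5-small")

ids = tok("Hello").input_ids          # byte values + 3, then the EOS id
text = tok.decode(ids, skip_special_tokens=True)
assert text == "Hello"
```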
