
Dedicated tokenizer for byte level transformers #36202

Open
apehex opened this issue Feb 14, 2025 · 1 comment · May be fixed by #36216
Labels
Feature request

Comments


apehex commented Feb 14, 2025

Feature request

There are alternative transformer architectures that handle bytes directly, such as the Byte Latent Transformer.

Instead of tokenizing according to a vocabulary, the idea would be to feed the model the raw encoding bytes, as in the sketch below.
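
A minimal sketch of the idea, assuming plain UTF-8 encoding (this is not the PR's implementation, just an illustration): each byte becomes a token ID drawn from a fixed 256-entry "vocabulary".

```python
# Hedged sketch: treat each UTF-8 byte as a token ID in [0, 255].
text = "Hello, world!"

# Encode: raw UTF-8 bytes, no vocabulary lookup needed.
ids = list(text.encode("utf-8"))

# Decode: reassemble the byte string and turn it back into text.
decoded = bytes(ids).decode("utf-8", errors="replace")
assert decoded == text

# Optional: group bytes into fixed-size patches of 4, so each patch
# indexes one of 256**4 possible values (see Motivation below).
patch_size = 4
padded = ids + [0] * (-len(ids) % patch_size)
patches = [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]
```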

Motivation

Combinations of bytes are more expressive than a flat vocabulary and avoid embedding and output layers with dimensions on the order of 100k.
A patch of 4 bytes can represent 256^4 = 4,294,967,296 distinct tokens of length 4.

Your contribution

I have a draft that I will PR shortly!

apehex added the Feature request label on Feb 14, 2025
Rocketknight1 (Member) commented

cc @ArthurZucker - I'm not sure about this, so asking a core maintainer! I think byte-level tokenization is simple enough that it doesn't need a dedicated class yet, but we may change that if we start seeing a lot of models using it (note that we haven't seen any actual released models using the Byte Latent Transformer architecture!)
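
For context, a minimal sketch of byte-level tokenization with the byte-level tokenizer already shipped for ByT5 (the google/byt5-small checkpoint is used only as an illustration; the offset of 3 for its special tokens is a detail of that tokenizer, not of the proposal above):

```python
from transformers import AutoTokenizer

# ByT5's tokenizer already works at the byte level: token IDs are the
# UTF-8 bytes shifted by the number of special tokens (3 for ByT5).
tok = AutoTokenizer.from_pretrained("google/byt5-small")

ids = tok("Hello").input_ids          # byte values + 3, then the EOS id
text = tok.decode(ids, skip_special_tokens=True)
assert text == "Hello"
```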
