Feature request
There are alternative transformer architectures that handle bytes directly:
Instead of tokenizing according to a vocabulary, the idea would be to get the raw encoding bytes.
Motivation
Combinations of bytes have more expressive power than a flat vocabulary, and they avoid embedding and output dimensions on the order of 100k in the first and last layers.
A patch of 4 bytes can represent 256^4 = 4,294,967,296 distinct tokens of length 4.
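To make that arithmetic concrete, here is a minimal, hypothetical sketch in plain Python (not part of any transformers API): it encodes text as raw UTF-8 bytes, groups them into 4-byte patches, and maps each patch to a single integer in the range [0, 256^4).

```python
def bytes_to_patches(text: str, patch_size: int = 4) -> list[int]:
    """Encode text as UTF-8 bytes and map each fixed-size patch to one integer."""
    data = text.encode("utf-8")
    # Zero-pad so the byte length is a multiple of the patch size.
    if len(data) % patch_size:
        data += bytes(patch_size - len(data) % patch_size)
    # Interpret every patch_size bytes as one big-endian integer in [0, 256**patch_size).
    return [int.from_bytes(data[i:i + patch_size], "big")
            for i in range(0, len(data), patch_size)]

print(bytes_to_patches("hello"))  # [1751477356, 1862270976]
```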
Your contribution
I have a draft that I will PR shortly!
cc @ArthurZucker - I'm not sure about this, so asking a core maintainer! I think byte-level tokenization is simple enough that it doesn't need a dedicated class yet, but we may change that if we start seeing a lot of models using it (note that we haven't seen any actual released models using the Byte Latent Transformer architecture!)
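As an illustration of the "no dedicated class" point, here is a minimal, hypothetical sketch of byte-level tokenization in plain Python (not an existing transformers API): UTF-8 byte values are used directly as token ids, shifted past a couple of reserved special ids, similar in spirit to how ByT5 handles bytes.

```python
PAD_ID, EOS_ID = 0, 1
OFFSET = 2  # ids 0..1 are reserved; byte b maps to id b + OFFSET

def encode(text: str) -> list[int]:
    """Map each UTF-8 byte to an id and append an end-of-sequence id."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS_ID]

def decode(ids: list[int]) -> str:
    """Drop special ids, shift back to byte values, and decode as UTF-8."""
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="replace")

ids = encode("héllo")
print(ids)          # multi-byte UTF-8 characters become several byte ids
print(decode(ids))  # "héllo"
```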