Automate vocab support and model conversion #7379
base: master
Conversation
- Add imports for json and hashlib
- Add missing models: phi, stablelm, mistral, and mixtral
- Fix constructor logic
- Fix how models are accessed
- Apply model schema to download_model method
…ecksums and tests
Update: The JSON structure is clear, but the hierarchical nature of the associated metadata presents some challenges.
Challenges: There are two paths I am considering to implement this solution effectively.
Considerations: I am currently developing a pseudo-registry to aid my understanding of the underlying implementation and to make it easier to add new models through a CLI tool without complications. The sha256 sums for encodings are primarily used to validate vocabularies, but I'm starting to question their usefulness; they feel more like a distraction than an essential feature. To mitigate potential issues with regular expressions, validating them in aggregate might be beneficial. Regular expressions are difficult to read and understand unless you are familiar with the author's intentions, making them write-only by nature.
Notes: The rules applied differ depending on the model. Deepseek, for example, uses the following normalizer and pre-tokenizer configuration:
"normalizer": {
"type": "Sequence",
"normalizers": []
},
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": "[\r\n]"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "\\s?\\p{L}+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "\\s?\\p{P}+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"Regex": "[一-龥ࠀ-一가-퟿]+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Digits",
"individual_digits": true
},
{
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": false
}
]
},
You could associate each normalizer and pre-tokenizer rule via an array and an associated index, but this would be far more cumbersome than one might hope, even simplified to the way it is currently implemented. The apparent elegance of simply treating the pre-tokenizers as a flat array is appealing, but problematic at best: we then have no idea whether to trim, add a prefix space, or use the associated regular expression. Some rules don't need a regex at all; a regex can be available without being meant to be used, which I've noticed is quite common in trained tokenizers. This is obviously affecting the tokenized output.
Notes: Pair scores can be calculated using the standard scoring formula. This only applies to WordPiece models.
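For reference, the usual WordPiece pair score is score(a, b) = freq(ab) / (freq(a) * freq(b)); the exact formula meant above isn't shown in the thread, so this is the standard definition, not necessarily the one intended. A toy computation:

```python
from collections import Counter


def wordpiece_pair_score(pair_freq: int, left_freq: int, right_freq: int) -> float:
    # Standard WordPiece merge score: frequent pairs of rare symbols score highest.
    return pair_freq / (left_freq * right_freq)


# Hypothetical corpus statistics, purely for illustration.
symbols = Counter({"h": 10, "u": 6, "g": 8})
pairs = Counter({("h", "u"): 5, ("u", "g"): 6})

scores = {p: wordpiece_pair_score(f, symbols[p[0]], symbols[p[1]])
          for p, f in pairs.items()}
best = max(scores, key=scores.get)
```

Here ("u", "g") wins despite ("h", "u") being nearly as frequent, because "u" and "g" are individually rarer.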
Notes:
// NOTE: Renamed to prevent name conflicts and support backward compatibility
static std::string ggml_format(const char* fmt, ...) {
va_list ap;
va_list ap2;
va_start(ap, fmt);
va_copy(ap2, ap);
int size = vsnprintf(NULL, 0, fmt, ap);
GGML_ASSERT(size >= 0 && size < INT_MAX); // NOLINT
std::vector<char> buf(size + 1);
int size2 = vsnprintf(buf.data(), size + 1, fmt, ap2);
GGML_ASSERT(size2 == size);
va_end(ap2);
va_end(ap);
return std::string(buf.data(), size);
}
I spent way more time on this than I would have liked. C++20 and higher has std::format. A possible remedy is to utilize C++ namespacing and apply a proper prefix for each respective namespace, e.g. ggml and llama namespaces.
I can definitely do this. The only issue is how much I'm allowed to refactor to achieve this goal. It will break backwards compatibility, but will end up with much cleaner code and a more automated process as a result. The workflow would remain the same.
02:22:41 | ~/Local/code/tokenizers
(.venv) git:(main | Δ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /home/austin/Local/code/tokenizers/.venv/bin/python
>>> import logging
>>> import pprint
>>> from pathlib import Path
>>> from tok.gguf.huggingface_hub import HFHubModel
>>>
>>> auth_token = "<auth_read_token>"
>>> path = Path("models")
>>> logger = logging.getLogger('test')
>>> hub_model = HFHubModel(auth_token, path, logger)
Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (cache).
Your token has been saved to /home/austin/.cache/huggingface/token
Login successful
>>> hub_model.request.list_remote_files
<bound method HFHubRequest.list_remote_files of <tok.gguf.huggingface_hub.HFHubRequest object at 0x787e7ab5e990>>
>>> model_repo = "stabilityai/stablelm-2-zephyr-1_6b"
>>> model_files = hub_model.request.list_remote_model_parts(model_repo)
>>> pprint.pprint(model_files, indent=2)
['model.safetensors']
>>> stablelm = "stabilityai/stablelm-2-zephyr-1_6b"
>>> llama3 = "meta-llama/Meta-Llama-3-8B"
>>> llama2 = "meta-llama/Llama-2-7b-hf"
>>> hub_model.request.list_remote_model_parts(llama2)
['model-00001-of-00002.safetensors', 'model-00002-of-00002.safetensors']
>>> hub_model.request.list_remote_model_parts(llama3)
['model-00001-of-00004.safetensors', 'model-00002-of-00004.safetensors', 'model-00003-of-00004.safetensors', 'model-00004-of-00004.safetensors']
>>>
This works fine in most cases, but fails every now and then:
>>> falcon = "tiiuae/falcon-7b-instruct"
>>> hub_model.request.list_remote_model_parts(falcon)
['coreml/text-generation/falcon-7b-64-float32.mlpackage/Data/com.apple.CoreML/weights/weight.bin', 'pytorch_model-00001-of-00002.bin', 'pytorch_model-00002-of-00002.bin']
I'm not sure if I should hardcode the model parts on a model-by-model basis or use a selection method for edge cases, e.g. "expected 2 model parts but got 3 instead", with a selection prompt to mediate it. This is nice because these aren't local; they are completely remote. So if you have read access on a gated model, it works. If it's public, no problem.
Model parts are arbitrary. The same model with the same tokenizer and weights could be split into a different number of valid model parts. I don't think this should be hardcoded. Missing model parts are already detected in
@compilade Yes, I understand, but that's not always true.
>>> mistral3 = "mistralai/Mistral-7B-Instruct-v0.3"
>>>
>>> model_parts = hub_model.request.list_remote_model_parts(mistral3)
>>>
>>> model_parts
['consolidated.safetensors', 'model-00001-of-00003.safetensors', 'model-00002-of-00003.safetensors', 'model-00003-of-00003.safetensors']
This isn't the only situation like this. For example, llama3 returns the expected results, but falcon doesn't, even though they both have nested paths (e.g. llama3 has the original model files and the HF files). The API's results are inconsistent. Regardless, this might hold true depending on the user. It's easier to have a list the user can select from to verify whenever a deviation is encountered. The only other mitigation I can think of is for the user to specify the model file types, parts, and number of parts on a model-by-model basis. Another potential solution is to have a list of these files available in each of the model definitions. If the desire is to avoid prompts and just "magically" convert, the only rational move is to list the model parts on a model-by-model basis in each model definition and reference that list as necessary. That is the most sane thing I can think of, rather than hacky workarounds that will keep causing issues unless the problem is known and understood by the user.
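As a middle ground between hardcoding and prompting, the shard names shown above do follow a predictable pattern, so a first-pass filter could separate recognized model parts from extras like consolidated.safetensors, prompting only when the leftovers are ambiguous. The regexes and function names below are a sketch, not the PR's code:

```python
import re

# Hypothetical patterns: standard HF shard naming plus single-file fallbacks.
SHARD_RE = re.compile(r"^(model|pytorch_model)-(\d{5})-of-(\d{5})\.(safetensors|bin)$")
SINGLE_RE = re.compile(r"^(model\.safetensors|pytorch_model\.bin)$")


def filter_model_parts(files: list[str]) -> tuple[list[str], list[str]]:
    """Split a remote file listing into recognized parts and leftovers."""
    parts, extras = [], []
    for name in files:
        if SHARD_RE.match(name) or SINGLE_RE.match(name):
            parts.append(name)
        else:
            extras.append(name)
    return sorted(parts), extras


listing = [
    "consolidated.safetensors",
    "model-00001-of-00003.safetensors",
    "model-00002-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]
parts, extras = filter_model_parts(listing)
```

For the Mistral v0.3 listing this keeps the three shards and flags consolidated.safetensors as an extra, which is exactly the case where a prompt (or a flag) would be needed.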
Do all valid model parts match the globs? I'm not sure listing model parts on a model-by-model basis is possible, since there are many different models using the same model arch. Some have multiple sizes, and some are used by completely different types of models.
Hm. I agree that this can indeed be more challenging than initially anticipated, especially with the arbitrary nature of model files across different architectures. Most of the code for handling tokenizers is already set up on the Python end, and my goal here is to integrate the rest into the pipeline. For simplicity's sake, let's address this by giving users a list of available file types to select from when fetching files remotely, or using local ones if they exist. This way the user keeps control over the process and can handle edge cases as needed:

def select_file_type(self, tensor_types: set[str]) -> str:
    # If more than one type is found, ask the user which to read
    if len(tensor_types) > 1:
        questions = [
            inquirer.List(
                "reader",
                message="Which tensor type would you like to read?",
                choices=list(tensor_types),
            ),
        ]
        answers = inquirer.prompt(questions)
        return answers["reader"]
    elif len(tensor_types) == 1:
        return next(iter(tensor_types))
    else:
        raise ValueError("No recognized tensor files found.")

This approach was used for reading metadata from Safetensors or Torch models, but the idea is similar: read the repo file listing, list the available files, have the user select the ones they need, and proceed accordingly. The tokenizer files are relatively straightforward, as they're almost always the same aside from a few edge cases, and those usually don't cause significant issues due to their simplicity. Model files, on the other hand, can be completely arbitrary, with no way to determine the file name in advance. It is impressive that the convert script currently fails so infrequently for supported architectures given how difficult it is to guess model file names accurately. We know from experience, however, that this implementation isn't robust enough to handle all edge cases consistently. A user-friendly selection method makes the process more manageable and less prone to unexpected errors when dealing with arbitrary model files across various architectures.
There are likely people using the convert scripts non-interactively, so I don't think asking the user a question like this would be appropriate for all cases. But I might be wrong. For selecting the file type, a command-line flag should be appropriate, which could be used to override the default preference.
I don't know if they really are that arbitrary, since models on HuggingFace pretty much all seem to fit the same naming pattern. It seems most model makers have converged on the HuggingFace naming of model files, and that's what is currently expected. I agree that the errors still have a lot of room for user-friendliness, though, like in #7215 (comment), which was caused by missing files.
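The command-line flag suggested above could be sketched like this; the flag names, choices, and defaults are placeholders, not an agreed-on interface:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="conversion front-end sketch")
    # Hypothetical flag: prefer safetensors by default, allow explicit override.
    parser.add_argument("--tensor-type", choices=["safetensors", "pt", "bin"],
                        default="safetensors",
                        help="which tensor file type to read when several exist")
    # Hypothetical flag: only fall back to an interactive prompt when asked to,
    # so non-interactive/CI usage never blocks on a question.
    parser.add_argument("--interactive", action="store_true",
                        help="prompt instead of failing on ambiguous listings")
    return parser


args = build_parser().parse_args(["--tensor-type", "bin"])
```

Non-interactive runs would then either get the default, the overridden type, or a hard error, and only opt in to prompts explicitly.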
UINT8 = auto()
INT8 = auto()
UINT16 = auto()
INT16 = auto()
UINT32 = auto()
INT32 = auto()
UINT64 = auto()
INT64 = auto()
FLOAT32 = auto()
FLOAT64 = auto()
BOOL = auto()
STRING = auto()
ARRAY = auto()
OBJECT = auto()
These are stored in the GGUF model files, so I think they should stay explicitly defined with a number, to more easily match the C/C++ side.
Also note that they are explicitly defined in the GGUF spec: https://github.com/ggerganov/ggml/blob/2aae01fd9b8f9399f343cf18f46f38996ef52e2c/docs/gguf.md?plain=1#L155-L186
Yeah, I knew this when I modified this. It was easier to parse and quickly scan which is why I did this. Adding the object and grouping like objects helped reduce mental overhead. Technically, enumerations shouldn't be so coupled, but I understand why that's the case here.
# NOTE: Maybe use numpy, e.g. np.dtypes, to determine data type?
# Using base types is unreliable in python as all numbers in python are 64-bits.
Integers in Python 3 have arbitrary precision; they are not merely 64 bits wide (for the float type, I don't know).
I think using Numpy number types could be done, if that kind of control is needed, but unfortunately the numbers have to be explicitly converted to the desired Numpy number type beforehand for type comparisons to work properly.
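To illustrate both points with stdlib tools (NumPy's explicit dtype conversion is the analogous step): Python ints never truncate on their own, and fixed-width semantics only exist after an explicit conversion. The values here are arbitrary examples:

```python
import struct

# Python 3 ints have arbitrary precision: no overflow at 64 bits.
big = 2 ** 100 + 12345
width = big.bit_length()  # 101 bits, far past any machine word

# Fixed-width semantics require an explicit conversion, e.g. packing the
# truncated value into a concrete little-endian uint64 layout (with NumPy,
# the equivalent explicit step would be np.uint64(...) before comparing).
packed = struct.pack("<Q", big & (2 ** 64 - 1))  # truncate to uint64
unpacked = struct.unpack("<Q", packed)[0]
```

The round-trip keeps only the low 64 bits, which is exactly the kind of silent behavior one has to opt into explicitly rather than getting it from plain ints.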
This is something left for another PR, but I felt the note was worth adding. I'll dig up where I read that when I have some time.
gguf-py/gguf/constants.py
MODEL = "tokenizer.model"  # STRING: e.g. llama, gpt2, etc...
TYPE = "tokenizer.type"  # STRING: BPE, SPM, WPM, etc.
NORM = "tokenizer.norm"  # OBJECT: {"type": "ByteLevel", ...}
PRE = "tokenizer.pre"  # OBJECT: {"type": "ByteLevel", ...}
ADDED = "tokenizer.added"  # ARRAY of OBJECTs: [{"id": 1, ...}, ...]
VOCAB = "tokenizer.vocab"  # ARRAY of STRINGs: ["[BOS]", ...]
MERGES = "tokenizer.merges"  # ARRAY of STRINGs: ["▁ t", ...]
TOKEN_TYPE = "tokenizer.token_type"  # ARRAY of INTs: [2, ...]
TOKEN_TYPE_COUNT = "tokenizer.token_type_count"  # BERT token types
SCORES = "tokenizer.scores"  # WPM only
BOS_ID = "tokenizer.bos_token_id"
EOS_ID = "tokenizer.eos_token_id"
UNK_ID = "tokenizer.unknown_token_id"
SEP_ID = "tokenizer.seperator_token_id"
PAD_ID = "tokenizer.padding_token_id"
CLS_ID = "tokenizer.cls_token_id"
MASK_ID = "tokenizer.mask_token_id"
ADD_BOS = "tokenizer.add_bos_token"
ADD_EOS = "tokenizer.add_eos_token"
ADD_PREFIX = "tokenizer.add_space_prefix"
Right, so this is what you meant in #7379 (comment).
Renaming the GGUF metadata keys is definitely a breaking change. If you really want to rename these, you'll have to also rename them in llama.cpp, as well as in https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#tokenizer. This would also likely need a GGUF format version bump. There were probably reasons for using tokenizer.ggml. instead of simply tokenizer. I don't know for sure, but the structure of the spec might mean it's intended to be possible to have multiple tokenizer types in the same model file.
(Also note that tokenizer.ggml.scores are also used by SentencePiece models, so the comment # WPM only is probably incomplete.)
I need consensus on this. If anyone disagrees with this, I'm open to alternatives, but this will become clearer once I'm ready to fully flesh everything out. I already looked at the C/C++ source code, but I'm waiting on @jaime-m-p's and @mofosyne's PRs because it will cause conflicts otherwise.
I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there. I needed to separate it from this branch though so I could freely experiment with some ideas without radically changing anything here.
The change here is "contained".
I published a "tokenizers" repo on my account if you want to play around with the toy examples. I have it "semi-functional" there.

Interesting! I think there's likely a way to make your idea of the registry file work with the intentions of the existing GGUF metadata keys. It could even be included directly in the GGUF model files. Something similar to tokenizer.huggingface.json, but for a subset of it (to avoid including the vocab twice (or not?)), to include the pre-tokenizer, normalizer, and other relevant configs in the GGUF models.
From what I see in your vocab.cpp, I think this is a great start! I didn't yet think about parsing JSON in C++. Sometimes I forget that common/json.hpp (aka nlohmann/json) is there.
I'm throwing some ideas here. Maybe convert-hf-to-gguf.py should simply start including tokenizer.json (and tokenizer_config.json?) in the tokenizer.huggingface.json GGUF metadata key? This way the tokenizer problems could be solved later without necessarily having to re-convert the model files. But I don't know to what extent parsing JSON (especially something like Llama 3's 9MB tokenizer.json) would affect the load time. If it's too slow, maybe subsets of it could be in different GGUF metadata keys?
Even though I tend to nitpick a lot, it's nice to see that you are experimenting with this.
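The "subset of tokenizer.json" idea above might be sketched as follows: keep the normalizer and pre-tokenizer configs but drop the vocab-sized sections before embedding them in a metadata key. The field names mirror HF's tokenizer.json; which sections count as "duplicated" is an assumption here, not a decided design:

```python
import json

# Sections that would duplicate what GGUF already stores (vocab, merges,
# added tokens) -- assumed, for illustration only.
DUPLICATED = {"model", "added_tokens"}


def tokenizer_subset(tokenizer_json: str) -> str:
    """Return tokenizer.json minus the vocab-sized sections, as compact JSON."""
    data = json.loads(tokenizer_json)
    slim = {k: v for k, v in data.items() if k not in DUPLICATED}
    return json.dumps(slim, separators=(",", ":"))


sample = json.dumps({
    "normalizer": {"type": "Sequence", "normalizers": []},
    "pre_tokenizer": {"type": "ByteLevel", "add_prefix_space": False},
    "model": {"type": "BPE", "vocab": {"a": 0}, "merges": []},
    "added_tokens": [{"id": 0, "content": "<unk>"}],
})
slim = json.loads(tokenizer_subset(sample))
```

The slimmed string is what would go into a key like the tokenizer.huggingface.json one discussed above, keeping the parse cost at load time proportional to the config rather than the vocab.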
Yeah, the refactored structure here is aligned with that idea: partition the most common key-value pairs so we only load, read, and write once. This should negligibly affect the load time by isolating what's needed at runtime.
I'm keeping in mind that we also rely on the SPM tokenizer.model files as well, which is why I kept the rest intact for the most part. I'm thinking about it in terms of compatibility.
I'd also like to eventually build a BPE tokenizer for llama.cpp that we can use natively. I think this would be cool and potentially useful for training. It would effectively reduce the reliance on third-party sources for in-house models.
I'm very interested in training my own models, which is why this fascinates me. I have a few older cards, so I'm thinking I can rig them up and use Vulkan, but that's another task entirely.
I digress, but in short, I think I may partially revert the structure for backwards compatibility, as I understand this is a valid concern. Maybe a set of transitional attributes could aid in streamlining conversion and inference in the interim.
Automate Vocab Support and Model Conversion
Motivation:
This PR aims to automate the downloading of supported model architectures from HuggingFace, handle conversions (if needed), and address existing fragile implementations related to tokenizers for proper training, finetuning, and inference. The goal is to streamline the process by reducing manual implementations while making it easier for users.
Overview:
User navigates to the huggingface.co website and downloads a supported model architecture (torch, safetensors, or GGUF). If the desired format (e.g., GGUF) is not available, we download one of the other formats for conversion.
Download the necessary files: *.pt(h), *.bin, or *.safetensors.
Convert the model files to a GGUF model file (consolidated by default, but can be split if needed). During conversion, metadata is extracted and written into the resulting GGUF model file. However, some metadata, like normalizer and pre-tokenizer information, is currently missing in this process.
Ideally, users should only need to add the vocab type, model type, and the repository name.
Goals:
Notes: Convert supported formats, e.g. torch and safetensors, into GGUF format.
Model Paths:
Models are still added to the models path by default unless explicitly passed as an option to be written somewhere else. This enables user preference.
The models are organized in a way that mirrors the expectation of a HuggingFace user, e.g.:
(.venv) git:(auto-model-support | θ) λ tree models -I private
models
├── mistralai
│   ├── Mistral-7B-Instruct-v0.2
│   │   ├── config.json
│   │   ├── ggml-vocab-mistral-7b-instruct-v0.gguf
│   │   ├── tokenizer_config.json
│   │   ├── tokenizer.json
│   │   └── tokenizer.model
│   └── Mixtral-8x7B-Instruct-v0.1
│       ├── config.json
│       ├── ggml-vocab-mixtral-8x7b-instruct-v0.gguf
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       └── tokenizer.model
If a user does not have access to the repo, the path will be empty. This is a bug; I still need to fix it. There are workarounds currently implemented to ignore this unintended side effect, e.g.:
Advantages:
This enables better organization, keeping the models path clean and easy to navigate. It also consolidates model paths, allowing us to reliably and predictably reference a model path.
What I like about this approach is that it allows us to get the vocab, the model, or both if desired. It also removes any concern about growing repo size, as we no longer need to include vocab files; we can dynamically generate them before or at runtime.
The model will be automatically downloaded if it is not locally available to the user. This allows us to fetch the model as needed.
Validation:
A JSON file, checksums.json, is automatically generated, removing the need to update the conversion scripts every time a hash is generated. We can commit this file instead, so we have a trusted reference we can all use.
This allows us to read from the dynamically generated JSON file as a source for validating tokenizers as well as other models. Using a regular expression to read, match, and then write to the file is error-prone; this is a cleaner solution, and I have considered other alternatives.
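The checksums.json generation described here might look roughly like the following; the manifest layout and key names are guesses, not the PR's actual schema:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_checksums(model_dir: Path, out_file: Path) -> dict[str, str]:
    """Hash every file under model_dir into a JSON manifest (sketch)."""
    sums = {}
    for path in sorted(model_dir.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Keys are repo-relative POSIX paths, values are hex digests.
            sums[path.relative_to(model_dir).as_posix()] = digest
    out_file.write_text(json.dumps(sums, indent=2))
    return sums


# Usage sketch against a throwaway directory standing in for a model repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tokenizer.json").write_text('{"version": "1.0"}')
    sums = write_checksums(root, root / "checksums.json")
```

Validation is then a matter of re-hashing a downloaded file and comparing it against the committed manifest entry, with no regex-driven edits to the conversion scripts.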
I'm open to feedback on how to handle this. E.g., it is possible to dynamically construct an AST, but this can introduce other issues without careful consideration.
Storing Tokenizer Models:
A key issue I've noticed is name conflicts. As more tokenizer models are added, the likelihood of collisions grows. Instead, we can rely upon what type of tokenizer is available.
We already have an enum for the tokenizer type set up, with enumerations for BPE, SPM, and WPM. We can use a string or integer to identify the type of vocabulary, effectively reducing the amount of code and redundancy that would otherwise pile up. Personally, I'm leaning towards a string, and this would be the tokenizer.ggml.pre field value, e.g. the current pre-tokenizer name would become the tokenizer type string.
This is cleaner and easier to identify.
I've been considering how to do this for quite some time now. If this is undesirable, then I genuinely recommend adding a field for properly identifying the tokenizer model and properly labeling the type.
This is crucial to avoid confusion, keep in sync with modern terminology, and reduce redundancy in the code base. The behavior, use, and application of the "pre-tokenizer" term here is in direct conflict with what I assume most of us would expect.
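A sketch of the type-based identification: a small enum plus the string form that would land in the metadata field. The enum, mapping, and function names are placeholders, not code from this PR:

```python
from enum import Enum


class VocabType(Enum):
    """Tokenizer model types already enumerated in the conversion code."""
    BPE = "BPE"   # byte-pair encoding (GPT-2 style)
    SPM = "SPM"   # SentencePiece
    WPM = "WPM"   # WordPiece


def vocab_type_for(tokenizer_model: str) -> VocabType:
    # Hypothetical mapping from a HF tokenizer.json model.type to the enum.
    mapping = {
        "BPE": VocabType.BPE,
        "Unigram": VocabType.SPM,
        "WordPiece": VocabType.WPM,
    }
    return mapping[tokenizer_model]


# The string value is what would be stored in the metadata field.
field_value = vocab_type_for("WordPiece").value
```

Storing the short type string instead of a per-model name sidesteps the name-conflict problem entirely: new models reuse an existing type rather than registering a new identifier.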
Identifying and Labeling Tokenizers:
Note that this is a rough sketch. I'm open to suggestions here:
Possible options might be:
- tokenizer.ggml.pre: A member variable to set the regular expression.
- tokenizer.ggml.model: A member variable that has the model's name.
- tokenizer.model.type: A member variable that has the tokenizer type.
TODO
Adds the following support:
- generate-vocab.sh script
I have other plans as well. I will update this as I progress.
Tagging @CISC @compilade @ggerganov