convert-hf : reduce repeated boilerplate from write_tensors #7031
Conversation
For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

  `torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

  Tested to work even with a SentencePiece-based tokenizer.
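As a side note on the MoE fix above, here is a quick standalone illustration (the shapes are made up, not taken from the PR) of why `torch.stack` and `torch.cat` differ: `stack` adds a new experts dimension, while `cat` merges along an existing one.

```python
import torch

# Two stand-in "expert" weight tensors of shape (4, 8).
experts = [torch.zeros(4, 8), torch.ones(4, 8)]

stacked = torch.stack(experts, dim=0)  # shape (2, 4, 8): experts kept as a leading dim
catted = torch.cat(experts, dim=0)     # shape (8, 8): experts merged along dim 0

print(stacked.shape)  # torch.Size([2, 4, 8])
print(catted.shape)   # torch.Size([8, 8])
```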
`os.listdir` is said to list files in arbitrary order. Sorting the file names should let "model-00009-of-00042.safetensors" be loaded before "model-00010-of-00042.safetensors".
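A small sketch of the sorting fix described in that commit message (the directory path and the filename filter here are assumptions for illustration, not the PR's exact code):

```python
import os

model_dir = "./path/to/original/model/"  # hypothetical directory with sharded safetensors

# os.listdir gives no ordering guarantee, so sort explicitly; the zero-padded part
# numbers make lexicographic order match numeric order (00009 before 00010).
part_names = sorted(
    name for name in os.listdir(model_dir)
    if name.startswith("model-") and name.endswith(".safetensors")
)

for part_name in part_names:
    print("loading", part_name)
```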
@@ -101,7 +101,7 @@ def copy_with_new_metadata(reader: gguf.GGUFReader, writer: gguf.GGUFWriter, new

     for tensor in reader.tensors:
         # Dimensions are written in reverse order, so flip them first
-        shape = np.flipud(tensor.shape)
+        shape = np.flipud(tensor.shape).tolist()
Any particular reason for coercing shape to list here (and not elsewhere)?
It's passed to `GGUFWriter.add_tensor_info`, which expects a `Sequence[int]` for the shape, and this shape is of type `NDArray[uint32]`, which caused the error:

    Argument of type "NDArray[uint32]" cannot be assigned to parameter "tensor_shape" of type "Sequence[int]" in function "add_tensor_info"
        "NDArray[uint32]" is incompatible with "Sequence[int]" (reportGeneralTypeIssues)
This comes from the `shape` field of a `ReaderTensor`, and it is coerced to list in other places, like in gguf-dump.py:

llama.cpp/gguf-py/scripts/gguf-dump.py, line 93 in b0d943d:

    "shape": tensor.shape.tolist(),

llama.cpp/gguf-py/scripts/gguf-dump.py, line 54 in b0d943d:

    prettydims = ', '.join('{0:5}'.format(d) for d in list(tensor.shape) + [1] * (4 - len(tensor.shape)))
But the `shape` field of `ReaderTensor` is only used in 7 places (including its definition). In other places, the "shape" usually comes directly from either Numpy or PyTorch, which use `tuple[int, ...]` for the shape type, which is compatible with `Sequence[int]`.
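As a self-contained illustration of the type mismatch being discussed (the shape values are made up): `np.flipud` keeps the `NDArray[uint32]` type, while `.tolist()` yields a plain `list[int]`, which is what `Sequence[int]` describes.

```python
import numpy as np

shape = np.array([4096, 32000], dtype=np.uint32)  # like ReaderTensor.shape

flipped = np.flipud(shape)            # still an ndarray of uint32
as_list = np.flipud(shape).tolist()   # plain Python list of ints

print(type(flipped).__name__, flipped.dtype)  # ndarray uint32
print(type(as_list).__name__, as_list)        # list [32000, 4096]
```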
I see; although `GGUFWriter.add_tensor_info`'s typing is then perhaps not correct, I understand why it's simpler to just make it a list.
`GGUFWriter.add_tensor_info`'s typing seems correct to me; it's used in 2 other places, and both use shapes which are already compatible with `Sequence[int]`.

llama.cpp/gguf-py/gguf/gguf_writer.py, lines 247 to 248 in b0d943d:

    shape: Sequence[int] = raw_shape if raw_shape is not None else tensor.shape
    self.add_tensor_info(name, shape, tensor.dtype, tensor.nbytes, raw_dtype = raw_dtype)

Line 1145 in b0d943d:

    self.gguf.add_tensor_info(name, tensor.shape, data_type, data_nbytes, raw_dtype=raw_dtype)
So using `Sequence[int]` there seems appropriate, as it's the most general type (I think?) which can be indexed (it avoids having to cast `tuple[int, ...]` into `list[int]`, or `list[int]` into `tuple[int, ...]`).

This is how the shape is used in `add_tensor_info`:

llama.cpp/gguf-py/gguf/gguf_writer.py, lines 210 to 211 in b0d943d:

    for i in range(n_dims):
        self.ti_data += self._pack("Q", tensor_shape[n_dims - 1 - i])
Sure, all I'm saying is that that also works with NDArray[uint32] (even though it's not compatible with Sequence[int]).
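To illustrate that point, a standalone sketch mirroring the packing loop quoted above (not the actual gguf-py code): indexing an `NDArray[uint32]` yields numpy integer scalars, which `struct.pack` accepts at runtime even though the array isn't a `Sequence[int]` for the type checker.

```python
import struct

import numpy as np

tensor_shape = np.array([2, 3, 4], dtype=np.uint32)  # NDArray[uint32], not a Sequence[int]

ti_data = b""
n_dims = len(tensor_shape)
for i in range(n_dims):
    # numpy integer scalars implement __index__, so struct.pack accepts them
    ti_data += struct.pack("Q", tensor_shape[n_dims - 1 - i])

print(len(ti_data))  # 24: three dimensions packed as 8-byte unsigned integers
```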
An attempt to convert MPT fails early with an unhelpful error message, as shown below:
Could be a bound method being treated as a static method.
Not sure what that means in terms of fixing it "properly". The error disappears when I add an explicit `__init__` method in `MPTModel` which calls `super().__init__` with the same parameters. Then a different error occurs, which seems to be less related to Python and more to the actual conversion at hand:
It seems Protocol can't be used as a statically type-checked ABC, because its subclasses also can't be instantiated. (why did it seem to work?) At least there's still a way to throw an error when forgetting to define the `model_arch` property of any registered Model subclasses.
…tion

There are no abstract methods used anyway, so using ABC isn't really necessary.
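A minimal sketch of the kind of check described in these commit messages (assuming the `gguf` package is installed; this is illustrative, not the exact code from the PR): a plain base class whose `__init_subclass__` raises when a subclass forgets to set `model_arch`.

```python
import gguf


class Model:
    model_arch: gguf.MODEL_ARCH

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs when the subclass is *defined*, so even unused subclasses fail fast.
        if "model_arch" not in cls.__dict__:
            raise TypeError(f"{cls.__name__} must define a 'model_arch' class attribute")


class MPTModel(Model):
    model_arch = gguf.MODEL_ARCH.MPT  # omitting this line raises TypeError at class-definition time
```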
I can't recommend anything in that regard. But I checked that after your last commit the script works and produces exactly the same GGUF for MPT-7B as the master branch version (and the resulting model works).
     # we don't need these
-    if name.endswith((".attention.masked_bias", ".attention.bias", ".attention.rotary_emb.inv_freq")):
+    if name.endswith((".attention.masked_bias", ".attention.bias", ".rotary_emb.inv_freq")):
If you want to blacklist tensors on a per-model basis, you can use

llama.cpp/gguf-py/gguf/constants.py, lines 732 to 743 in 6ecf318:

    MODEL_TENSOR_SKIP: dict[MODEL_ARCH, list[MODEL_TENSOR]] = {
        MODEL_ARCH.LLAMA: [
            MODEL_TENSOR.ROPE_FREQS,
            MODEL_TENSOR.ATTN_ROT_EMBD,
        ],
        MODEL_ARCH.BAICHUAN: [
            MODEL_TENSOR.ROPE_FREQS,
            MODEL_TENSOR.ATTN_ROT_EMBD,
        ],
        MODEL_ARCH.PERSIMMON: [
            MODEL_TENSOR.ROPE_FREQS,
        ],
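For context, a small sketch of how that table can be looked up (illustrative only; as noted in the reply below, only convert.py actually consults it today):

```python
from gguf.constants import MODEL_ARCH, MODEL_TENSOR, MODEL_TENSOR_SKIP

# Tensor types to skip for LLaMA, per the table quoted above.
skipped = MODEL_TENSOR_SKIP.get(MODEL_ARCH.LLAMA, [])

print(MODEL_TENSOR.ROPE_FREQS in skipped)     # True
print(MODEL_TENSOR.ATTN_ROT_EMBD in skipped)  # True
```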
Nice idea! I didn't know about `MODEL_TENSOR_SKIP`. Right now it's only used for `MODEL_ARCH.LLAMA` in convert.py. The exclusion lists for the other architectures were never really tested (e.g. `MODEL_ARCH.PLAMO` and `MODEL_ARCH.GROK` both have `MODEL_TENSOR.ATTN_ROT_EMBD` in `MODEL_TENSORS`, but they are not in `MODEL_TENSOR_SKIP`, since they rely on matching the name like you pointed out).

I'm not sure about using this; `modify_tensors` can already be used to skip tensors, and I think it's cleaner this way for architecture-specific use, while common skips (as it is) belong in the base `Model.write_tensors` (the line on which this comment thread was started). `BertModel` has a nice example for an architecture-specific exclusion:
llama.cpp/convert-hf-to-gguf.py, lines 1970 to 1972 in 3e5e0dc:

    # we are only using BERT for embeddings so we don't need the pooling layer
    if name in ("embeddings.position_ids", "pooler.dense.weight", "pooler.dense.bias"):
        return []  # we don't need these
It was already done similarly in `write_tensors` (I tried to keep most of the model-specific code from `write_tensors` unchanged in `modify_tensors`). Otherwise, this means tensor types that won't be serialized should still get a serialized name, even if they're never used. BERT's pooling layers don't have an enum, and they probably should not have one if it's to be carried around but still remain unused, like `LLM_TENSOR_ATTN_ROT_EMBD` in the `llama.cpp` file. (If they do get used, I agree they should get added, but they aren't for now.)

Line 488 in a2ac89d:

    { LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },

So instead, maybe it would be worth considering removing `MODEL_TENSOR.ROPE_FREQS` and `MODEL_TENSOR.ATTN_ROT_EMBD`? But this is likely out of the scope of this PR.
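For reference, a minimal sketch of the `modify_tensors`-based skip described above. The class is hypothetical and the signature mirrors the pattern quoted from convert-hf-to-gguf.py, so treat it as illustrative rather than the exact code:

```python
from __future__ import annotations

from typing import Iterable

import torch


class ExampleArchModel:  # would derive from Model in convert-hf-to-gguf.py; standalone stub here
    def modify_tensors(self, data_torch: torch.Tensor, name: str, bid: int | None) -> Iterable[tuple[str, torch.Tensor]]:
        # Architecture-specific exclusion: returning an empty list drops the tensor.
        if name in ("embeddings.position_ids", "pooler.dense.weight", "pooler.dense.bias"):
            return []
        # Otherwise keep it; the real script would also map `name` to a GGUF tensor name here.
        return [(name, data_torch)]
```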
@compilade BTW, I added your typing changes to #6778 (and a progress bar, inspired by your other PR), so it will be a simpler rebase for whichever gets merged last. :)
(For anyone following only this and not #7075) #7075 is starting to be more ready, and I'm considering unifying both PRs to make the branches easier to manage. To keep the naming less confusing, I think it would be best to change the base branch of #7075. However, doing this won't move over the review comments, so I'm wondering if that would still be fine? (@CISC, @Galunid)
Fine by me, it was more for my benefit than for this PR anyway. :)
Sure, we can do that
Closing in favor of #7075
A lot of the lines in convert-hf-to-gguf.py are repeated from `Model.write_tensors`. With some refactoring, it is possible to remove 600+ lines while mostly keeping the original behavior.

Important

Continued at #7075. The testing section below is still valid for both PRs.

Changes Summary

- Added a `pyrightconfig.json` to let it load the `gguf` library correctly
- Bumped a version requirement to `~=0.2.0` from `~=0.1.98`
- Changed `encoding="utf8"` to `encoding="utf-8"`, for consistency, even if they are the same in Python.
- `Model` is now ~~a Protocol~~ a plain class instead of an `ABC` (Abstract Base Class)
  - `ABC` doesn't seem necessary anymore
  - Common code moved into `Model.__init__`
  - Allows defining the `model_arch` field in subclasses exactly as it's already done, but without making pyright output type errors about `Literal[gguf.MODEL_ARCH.SOMETHING]` not compatible with type `property`
  - An error is raised when forgetting to define the `model_arch` field, by checking for this in `__init_subclass__` (so it's checked even for unused subclasses of `Model`)
- Added a `modify_tensors` callback which is used in the model classes instead of overriding `write_tensors`
  - It works on `data_torch`, which should make lazy computation easier to implement in the future (instead of also having to wrap complex `numpy` operations)
- When `f16` output is chosen, it's possible to selectively override the type of output tensors by overriding `extra_f32_tensors` and/or `extra_f16_tensors` in the architecture-specific model classes (see the sketch after this list)
- Single-part models can use a `model-00001-of-00001.safetensors` part (previously, single-part models were expected to name their model file `model.safetensors`)
  - e.g. the `v0-mamba-*` models in https://huggingface.co/delphi-suite
- ps. Added `einops` to the `llama-python-extra` Nix packages, for `InternLM2Model` in convert-hf-to-gguf.py
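To illustrate the selective f16/f32 override mentioned in the list above, here is a hypothetical sketch; the method name follows the summary, but the base class is stubbed and the exact signature in the script may differ:

```python
from __future__ import annotations


class Model:
    # Stub of the refactored base class, only to keep this sketch self-contained.
    def extra_f32_tensors(self, name: str, new_name: str, bid: int | None, n_dims: int) -> bool:
        return False  # by default, no tensor is forced to stay f32


class ExampleModel(Model):
    def extra_f32_tensors(self, name: str, new_name: str, bid: int | None, n_dims: int) -> bool:
        # Hypothetical override: keep bias tensors in f32 even when f16 output is chosen.
        return name.endswith(".bias")


print(ExampleModel().extra_f32_tensors("h.0.attn.c_attn.bias", "blk.0.attn_qkv.bias", 0, 1))  # True
```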
Testing
To make sure I didn't break previous work, I need help with testing conversion of existing architectures; I'm sure some of you already have the original model files downloaded, so testing conversion would be greatly appreciated.
$ python3 convert-hf-to-gguf.py --outtype f16 --outfile ./path/to/output.f16.gguf ./path/to/original/model/
If your username is mentioned below, it's because I think you might have the relevant model nearby, and/or I'd like you to check whether or not I broke your previous work, because I did modify quite a bit of model-specific conversion code.
- `BloomModel` (@xingchensong)
- `MPTModel` (@jploski)
- `BaichuanModel` (@foldl)
- `FalconModel` (@ggerganov)
- `RefactModel` (@ds5t5)
- `PersimmonModel` (@ggerganov): conversion seems already broken on `master`. This model arch might need to be re-implemented from scratch since the graph code is quite complicated
- `StableLMModel` with StableLM-2-12B (@ashishdatta)
- `LlamaModel` with MoE (@slaren)
- `LlamaModel` without MoE (@ggerganov)
- `convert.py` with Llama models using a SentencePiece tokenizer
- `GrokModel` (@arki05)
- `DbrxModel` (@phymbert)
- `Qwen2MoeModel` (@ggerganov)
- `GPT2Model` (@ggerganov)
- `InternLM2Model` (@SolenoidWGT)
- `GemmaModel` (@ggerganov)
- `MambaModel` (@compilade)

It would be nice to automate such testing in the future, though I'm not sure how feasible it is. It could require finding the smallest model for each architecture, though unfortunately some architectures have only been used for very large models.
Also, @mofosyne, note that this will cause (some) conflicts with #6511, though they shouldn't be too hard to resolve (since most of them will be caused by removed code). I would be fine with your PR getting merged first to handle conflicts here, though I don't know if you'd prefer to do it on your side instead.
Also pinging @cebtenzzre because of #5825, which was another effort at simplifying `convert-hf-to-gguf.py`, so you should be familiar with the code.

Out of scope
Anything related to BPE pre-tokenizer errors was not caused by this PR and won't be fixed here to reduce potential conflicts.