gguf-split: split and merge gguf per batch of tensors #6135
Conversation
Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4G for FAT or 50G for HF). The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call …
Thanks for looking into this feature. Your PR overall LGTM, just don't forget to include the Makefile.
This would be useful for my wllama, since loading 5 MB-10 MB chunks in parallel will be faster in a web environment. So I'm looking forward to the implementation in `llama_model_loader`.

For the syscalls that @Artefact2 proposed, I think we can implement them in v2 of this PR; for now it's already a good start to test whether the modification to `llama_model_loader` works or not.
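The exact syscalls proposed are not quoted above; assuming one of them is Linux's `copy_file_range` (mentioned in the notes at the end of this PR), a minimal sketch of copying a byte range between GGUF files without extra user-space buffering could look like this (file names, offsets and sizes are hypothetical):

```cpp
// Hypothetical sketch: copy a range of tensor data from the source GGUF into a
// split file without routing the bytes through user space (Linux-only).
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd_in  = open("ggml-model-q4_0.gguf", O_RDONLY);
    int fd_out = open("ggml-out-00001-of-00003.gguf", O_WRONLY | O_CREAT, 0644);
    if (fd_in < 0 || fd_out < 0) { perror("open"); return 1; }

    off64_t off_in  = 4096;        // hypothetical: offset of the tensor batch in the source file
    size_t  to_copy = 128 * 1024;  // hypothetical: size of the tensor batch in bytes

    while (to_copy > 0) {
        // off_out == nullptr: append at fd_out's current offset; off_in is advanced for us
        ssize_t n = copy_file_range(fd_in, &off_in, fd_out, nullptr, to_copy, 0);
        if (n <= 0) { perror("copy_file_range"); return 1; }
        to_copy -= (size_t) n;
    }

    close(fd_in);
    close(fd_out);
    return 0;
}
```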
Thanks. You cannot exactly predict the size of the GGUF, as tensor sizes can vary, and we want to have valid GGUF files (i.e. not truncated as in your example) for later on having …
@ggerganov Hi Georgi, can I merge and continue on common?
* gguf-split: split and merge gguf files per tensor
* gguf-split: build with make toolchain
* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split
* split : minor style + fix compile warnings
* gguf-split: remove --upload not implemented

Co-authored-by: Georgi Gerganov <[email protected]>
Motivation
Distributing and storing GGUF files is difficult for 13B+ models, especially at f16, and a lot of issues can happen during file transfers. Typically, the files need to be transferred from Hugging Face to an internal storage such as S3, MinIO, Git LFS, Nexus or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example). They also cannot be stored in a Dockerfile, but IMHO that is for the best.
This PR introduces a `gguf-split` CLI to ease the split and merge of multiple GGUF files. Examples:
`--split`
```
gguf-split --split --split-tensors-size 128 ggml-model-q4_0.gguf /tmp/ggml-out-q4_0-2
gguf_split: ggml-model-q4_0.gguf -> /tmp/ggml-out-q4_0-2-00001-of-00003.gguf (128 tensors per file)
split_start: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_split: 3 gguf split written with a total of 325 tensors.
```
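For reference, the `-00001-of-00003.gguf` suffix seen in the output above follows a simple `%05d-of-%05d` pattern. A standalone sketch of producing such paths (the helper name is hypothetical, not the tool's actual code):

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: build the path of split i_split (0-based) out of n_split,
// matching the "<prefix>-%05d-of-%05d.gguf" pattern shown in the example output.
static std::string split_path(const std::string & prefix, int i_split, int n_split) {
    char buf[512];
    snprintf(buf, sizeof(buf), "%s-%05d-of-%05d.gguf", prefix.c_str(), i_split + 1, n_split);
    return buf;
}

int main() {
    // prints /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ... -00003-of-00003.gguf
    for (int i = 0; i < 3; i++) {
        printf("%s\n", split_path("/tmp/ggml-out-q4_0-2", i, 3).c_str());
    }
    return 0;
}
```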
`--merge`
```
gguf-split --merge /tmp/ggml-out-q4_0-2-00001-of-00003.gguf /tmp/ggml-out-q4_0-2-merge.gguf
gguf_merge: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf -> /tmp/ggml-out-q4_0-2-merge.gguf
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: /tmp/ggml-out-q4_0-2-merge.gguf merged from 3 split with 325 tensors.
```
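To give a rough idea of what the "reading metadata" step involves, each split can be opened with the gguf API in no-alloc mode to enumerate its tensors without loading any tensor data. A minimal sketch under that assumption (not the actual merge code, and with error handling kept to a minimum):

```cpp
#include <cstdio>
#include "ggml.h"   // note: in newer ggml trees the gguf_* API lives in a separate "gguf.h"

int main() {
    const char * splits[] = {
        "/tmp/ggml-out-q4_0-2-00001-of-00003.gguf",
        "/tmp/ggml-out-q4_0-2-00002-of-00003.gguf",
        "/tmp/ggml-out-q4_0-2-00003-of-00003.gguf",
    };

    int n_total = 0;
    for (const char * path : splits) {
        // no_alloc = true: read only the metadata (KV pairs and tensor infos),
        // do not allocate or load the tensor data itself
        struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
        struct gguf_context * ctx = gguf_init_from_file(path, params);
        if (!ctx) {
            fprintf(stderr, "failed to read %s\n", path);
            return 1;
        }
        int n = (int) gguf_get_n_tensors(ctx);
        printf("%s: %d tensors\n", path, n);
        n_total += n;
        gguf_free(ctx);
    }
    printf("total: %d tensors\n", n_total);
    return 0;
}
```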
Notes
If this approach is accepted, we can later on adapt `llama_load_model_from_file` and `llama_load_model_from_url` to support the `general.split_count` KV in GGUF (a sketch of reading this KV follows below). `mmap` is not used in this first implementation, and neither are `copy_file_range` iops. The only split strategy supported at the moment is `--split-max-tensors N`, which creates split GGUFs with at most N tensors each, regardless of their size in bytes. Later on, another split strategy based on a maximum file size can be introduced.
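As a hedged illustration of how a loader might later consume the proposed `general.split_count` KV, here is a small sketch using the gguf API; the value type (u16) and the exact key handling are assumptions, not something this PR specifies:

```cpp
#include <cstdint>
#include <cstdio>
#include "ggml.h"   // note: in newer ggml trees the gguf_* API lives in a separate "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <first-split.gguf>\n", argv[0]);
        return 1;
    }

    // read only the metadata of the first split
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to read %s\n", argv[1]);
        return 1;
    }

    auto key_id = gguf_find_key(ctx, "general.split_count");
    if (key_id < 0) {
        printf("not a split model (general.split_count not set)\n");
    } else {
        // assumption: the KV is stored as u16 - real code should check gguf_get_kv_type()
        uint16_t n_split = gguf_get_val_u16(ctx, key_id);
        printf("model is split into %u files\n", (unsigned) n_split);
    }

    gguf_free(ctx);
    return 0;
}
```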