
gguf-split: split and merge gguf per batch of tensors #6135

Merged

phymbert merged 5 commits into ggerganov:master from hp/feature/gguf-split on Mar 19, 2024

Conversation

@phymbert (Collaborator) commented Mar 18, 2024

Motivation

Distributing and storing GGUF files is difficult for 13b+ models, especially on f16. Many issues can happen during file transfers, for example:

  • temporary disk full
  • network interruption

Typically, they need to be transferred from Hugging Face to an internal storage like S3, MinIO, Git LFS, Nexus or Artifactory, then downloaded by the inference server and stored locally (or on a k8s PVC, for example). Also, they cannot be stored in a Dockerfile, but IMHO that is for the better.

This PR introduces a gguf-split CLI to ease splitting a GGUF model into multiple GGUF files and merging them back.

Examples:

  • --split
gguf-split --split --split-tensors-size 128 ggml-model-q4_0.gguf /tmp/ggml-out-q4_0-2

gguf_split: ggml-model-q4_0.gguf -> /tmp/ggml-out-q4_0-2-00001-of-00003.gguf (128 tensors per file)
split_start: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
split_start: /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_split: 3 gguf split written with a total of 325 tensors.
  • --merge
gguf-split --merge /tmp/ggml-out-q4_0-2-00001-of-00003.gguf /tmp/ggml-out-q4_0-2-merge.gguf

gguf_merge: /tmp/ggml-out-q4_0-2-00001-of-00003.gguf -> /tmp/ggml-out-q4_0-2-merge.gguf
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: reading metadata /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00001-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00002-of-00003.gguf ...done
gguf_merge: writing tensors /tmp/ggml-out-q4_0-2-00003-of-00003.gguf ...done
gguf_merge: /tmp/ggml-out-q4_0-2-merge.gguf merged from 3 split with 325 tensors.

Notes

If this approach is accepted, we can later adapt llama_load_model_from_file and llama_load_model_from_url to support a general.split_count KV in GGUF.
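
For illustration only, here is a minimal sketch of how a loader might read such a KV with the existing gguf C API. The key name follows this PR; the helper name get_split_count and the u32 value type are assumptions, not this PR's implementation:

```cpp
// Sketch: read general.split_count from the first split file, if present.
// Assumes the count is stored as a u32; the actual type chosen may differ.
#include <cstdint>
#include <cstdio>

#include "ggml.h" // gguf_* C API (gguf.h in newer ggml versions)

static uint32_t get_split_count(const char * fname) {
    struct gguf_init_params params = { /*.no_alloc =*/ true, /*.ctx =*/ nullptr };
    struct gguf_context * ctx = gguf_init_from_file(fname, params);
    if (ctx == nullptr) {
        fprintf(stderr, "%s: failed to read %s\n", __func__, fname);
        return 1; // treat as a regular, non-split model
    }
    uint32_t count = 1;
    const int key_id = gguf_find_key(ctx, "general.split_count");
    if (key_id >= 0) {
        count = gguf_get_val_u32(ctx, key_id);
    }
    gguf_free(ctx);
    return count;
}
```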

Neither mmap nor copy_file_range I/O is used in this first implementation.

The only split strategy supported at the moment is --split-max-tensors N, which creates split GGUF files with at most N tensors each, regardless of their size in bytes. Another split strategy based on a maximum file size can be introduced later.
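
For readers unfamiliar with the naming scheme, here is a standalone sketch (the helper split_file_names is hypothetical, not this PR's code) showing how a tensors-per-file limit maps onto the -%05d-of-%05d.gguf file names used in the example output above:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Sketch: compute the split file names for n_tensors tensors with at most
// max_tensors per split, using the <prefix>-%05d-of-%05d.gguf naming scheme.
static std::vector<std::string> split_file_names(const std::string & prefix, int n_tensors, int max_tensors) {
    const int n_split = (n_tensors + max_tensors - 1) / max_tensors; // ceiling division
    std::vector<std::string> names;
    for (int i = 0; i < n_split; i++) {
        char buf[64];
        snprintf(buf, sizeof(buf), "-%05d-of-%05d.gguf", i + 1, n_split);
        names.push_back(prefix + buf);
    }
    return names;
}

int main() {
    // 325 tensors at 128 per split -> 3 files, matching the example output above
    for (const auto & name : split_file_names("/tmp/ggml-out-q4_0-2", 325, 128)) {
        printf("%s\n", name.c_str());
    }
}
```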

@phymbert added the labels demo (Demonstrate some concept or idea, not intended to be merged) and need feedback (Testing and feedback with results are needed) on Mar 18, 2024
@Artefact2 (Collaborator)

Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4G for FAT or 50G for HF).

The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call copy_file_range() or ioctl(FICLONERANGE) on supported systems, or, as an alternative, add the splitting logic directly to tools that produce ggufs, like convert.py and quantize.
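
For reference, a minimal sketch of the copy_file_range() idea on Linux (glibc 2.27+). This only illustrates the suggested approach; the helper copy_range is hypothetical and not part of this PR:

```cpp
// Sketch: copy `len` bytes from fd_in (at off_in) to fd_out (at off_out)
// without bouncing the data through user-space buffers. Linux-only.
#define _GNU_SOURCE
#include <unistd.h>

#include <cstdio>

static bool copy_range(int fd_in, off64_t off_in, int fd_out, off64_t off_out, size_t len) {
    while (len > 0) {
        ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out, len, 0);
        if (n <= 0) {
            perror("copy_file_range"); // e.g. cross-filesystem copies fail on older kernels
            return false;
        }
        len -= (size_t) n;
    }
    return true;
}
```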

@ngxson (Collaborator) left a comment

Thanks for looking into this feature. Your PR overall LGTM, just don't forget to include the Makefile.

This would be useful for my wllama, since loading 5-10 MB chunks in parallel will be faster in a web environment. So I'm looking forward to the implementation in llama_model_loader.

As for the syscalls that @Artefact2 proposed, I think we can implement them in a v2 of this PR; for now it's already a good start to test whether the modification to llama_model_loader works.

@phymbert (Collaborator, Author)

> Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4G for FAT or 50G for HF).
>
> The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call copy_file_range() or ioctl(FICLONERANGE) on supported systems, or, as an alternative, add the splitting logic directly to tools that produce ggufs, like convert.py and quantize.

Thanks. You cannot exactly predict the size of a GGUF since tensor sizes vary, and we want every split to be a valid GGUF (i.e. not truncated as in your example) so that llama_model can later support tensors distributed across multiple GGUF files. But I agree file size is more intuitive; we might introduce a --split-max-size strategy later on (see the sketch below). Feel free to implement it once this first implementation is merged.
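
To make the idea concrete, here is a rough sketch of how a hypothetical --split-max-size strategy could plan the splits by greedily grouping tensors until the next one would exceed the byte limit. The helper plan_splits_by_size is illustrative only, not code from this PR:

```cpp
#include <cstddef>
#include <vector>

// Sketch: group tensors (given by their byte sizes) into splits so that each
// split stays under max_bytes where possible, without producing empty splits.
static std::vector<std::vector<size_t>> plan_splits_by_size(const std::vector<size_t> & tensor_sizes, size_t max_bytes) {
    std::vector<std::vector<size_t>> splits(1);
    size_t cur_bytes = 0;
    for (size_t sz : tensor_sizes) {
        if (!splits.back().empty() && cur_bytes + sz > max_bytes) {
            splits.emplace_back(); // start a new split file
            cur_bytes = 0;
        }
        splits.back().push_back(sz); // an oversized tensor still gets its own split
        cur_bytes += sz;
    }
    return splits;
}
```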

@phymbert removed review requests for Artefact2 and slaren on Mar 19, 2024 at 08:48
@phymbert (Collaborator, Author)

@ggerganov Hi Georgi, can I merge and continue on common?

examples/gguf-split/gguf-split.cpp: review thread (outdated, resolved)
@phymbert removed the demo and need feedback labels on Mar 19, 2024
@phymbert changed the title from "proposal: gguf-split: split and merge gguf per batch of tensors" to "gguf-split: split and merge gguf per batch of tensors" on Mar 19, 2024
@phymbert phymbert merged commit d0d5de4 into ggerganov:master Mar 19, 2024
29 of 53 checks passed
@phymbert phymbert deleted the hp/feature/gguf-split branch March 19, 2024 11:05
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* gguf-split: split and merge gguf files per tensor

* gguf-split: build with make toolchain

* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split

* split : minor style + fix compile warnings

* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024