[MPI] Add support for per-node options, thread counts, and layer allocations #3334
Conversation
Have you, by any chance, encountered this problem? It seems like in the original MPI implementation there was a sync step missing somewhere, and rank 0 was done while the other instances were stuck. Not sure if it's applicable to this PR, but you seem to know MPI better than me at least, so maybe you'll have some idea as to why it's happening.
If you mean the issue that the worker nodes don't terminate when the model outputs the end-of-stream token, that is a known issue. It's not a missing sync anywhere; rather, the architecture of the MPI backend didn't take it into account. Each node only expects one type of message to be sent to it, and since the sampling is done only at the head node, the workers don't have any information about when it's time to stop. This PR does not fix that problem because it is out of scope, but it will likely be fixed in future PRs I am planning.
llama.h
Outdated
@@ -230,6 +230,8 @@ extern "C" {
    const char * path_model,
    struct llama_context_params params);

    LLAMA_API void llama_split_layers_weighted(struct llama_context * ctx, std::vector<float> device_weights);
We should not use C++ in the C-style API
Sounds good, would replacing with a float array and a size_t length be sufficient?
Should be fixed now
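For reference, the C-compatible declaration that replaces the std::vector parameter (the float array plus length suggested above, matching the signature that later appears in the build log):

LLAMA_API void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights);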
Overall the PR seems OK. We should try to adapt to the changes from #3228
Yep, that's what I will be doing over the weekend
Force-pushed from a0ee1eb to 77cd3e0
This PR is now fully functional again after the recent changes and has been rebased on master. Only basic inferencing functionality has been tested; more advanced functionality like batching and speculation is unlikely to work.
Hi, thanks for taking the time. I'll probably interfere a bit with your change as I'm making some refactoring changes. Had a quick glance at the PR and will look more later.
Yep, that's one reason this PR is still a draft; I just copied main to use as a scratch pad. The original idea used
Looks like the names of the tensors have been changed, which breaks MPI. The current implementation relied on there being
Oops, I forgot about the purpose of these names and removed them recently.
I adjusted the command-line argument parsing so you can pass a comma-separated list to both. I also added the function call needed for scattering the layer ranges to the other nodes. After that's done, I should be able to remove the MPI example entirely.
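A minimal sketch of how such a comma-separated, per-node thread list could be parsed and applied; the helper names below are hypothetical and not necessarily what this PR uses:

// Hypothetical helpers: turn "8,4,4" into per-node thread counts and pick the
// entry for this node's rank, falling back to a default when the list is
// shorter than the number of nodes.
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

static std::vector<int32_t> parse_thread_list(const std::string & arg) {
    std::vector<int32_t> counts;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        counts.push_back((int32_t) std::stoi(item));
    }
    return counts;
}

static int32_t threads_for_node(const std::vector<int32_t> & counts, int node_id, int32_t fallback) {
    return (node_id < (int) counts.size()) ? counts[node_id] : fallback;
}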
Performance with this branch looks interesting. I was able to run Llama 2 70B across a homemade cluster of 3 CPU-only nodes: i7 9th gen, i5 4th gen, and i5 2nd gen, with 16 GB DDR4 2666 MHz, 16 GB DDR3, and 8 GB DDR3 respectively. On this cluster I got around 0.58 tokens/second for 70B Q3_K_M. htop showed roughly 40-60% CPU utilization across all hardware cores when processing the allocated layers, but it's unclear whether that's because the spikes are so short and htop isn't sampling often enough. Curiously, this isn't much slower than running on a second cluster of much more powerful hardware: a Ryzen 5 5600G and an i7 9th gen, with 32 GB DDR4 3200 MHz each. The second cluster got roughly 0.64 tokens/second while being much more expensive. I attempted to run it on the Ryzen machine alone to gauge MPI overhead via offloading to my 6700 XT, but ROCm wouldn't install and OpenCL caused hangs. I plan on doing more in-depth performance investigations to determine where the bottleneck is. I have access to a proper university cluster as well that I'll be testing on.
I'm 99.9% certain raw perf counters come from the Linux kernel directly, and are not calculated at a point in time but from aggregated ticks, effectively being deltas between samples, so you cannot "miss" a sample. You can always dump raw perf counters to a tmpfs file in a loop and parse them later. But the chances that htop or top are wrong are low.
Found what was up with htop: there's a command-line switch. After tuning the clusters by adjusting the layer split percentages so that no node was swapping to disk, I achieved 0.69 tokens/second on the weaker cluster and 0.78 tokens/second on the Ryzen cluster. Running on an AMD EPYC 7543P 32-core processor without MPI resulted in 1.01 tokens/second, although that system was NUMA and I didn't have permissions to adjust the memory configuration.
Discovered a bug in this implementation regarding the KV cache: syncing the sequence IDs isn't enough.
Force-pushed from b4c7045 to 51f3f8f
I can confirm that this PR is not building on Apple silicon. If it's unexpected, I can provide every bit of information needed to help you fellas.
I don't have Apple silicon devices to test on, so whatever information you have would be greatly appreciated.
Actually, it's the same with your CI logs, but I'll add more context with this message soon.

Edit:

Context:

System: Apple Silicon, M1 Max 64GB/2TB

gh pr checkout 3334 # check out this PR
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 LLAMA_NO_METAL=1 -j10 # make for compiling.
# output of make:
...
examples/batched/batched.cpp:81:41: error: assigning to 'uint32_t' (aka 'unsigned int') from incompatible type 'std::vector<int32_t>' (aka 'vector<int>')
81 | ctx_params.n_threads = params.n_threads;
| ~~~~~~~^~~~~~~~~
examples/batched/batched.cpp:82:57: error: invalid operands to binary expression ('std::vector<int32_t>' (aka 'vector<int>') and 'int')
82 | ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
| ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~
...
2 errors generated.
make: *** [simple] Error 1
make: *** Waiting for unfinished jobs....
2 errors generated.
make: *** [batched-bench] Error 1
2 errors generated.
make: *** [batched] Error 1

Versions:

make --version ##
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
This program built for i386-apple-darwin11.3.0
#########
mpirun --version
mpirun (Open MPI) 5.0.1
#########
mpicc --version # or mpicxx --version, they are the same dependency.
Homebrew clang version 17.0.6
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /opt/homebrew/opt/llvm/bin

Edit II: I couldn't resist and fixed it by removing the problematic build options from the Makefile, and the build succeeded with make. But the result is a failure since it crashes with a segmentation fault.
Ah yes, those errors are due to me not updating all of the examples; it should be a simple fix. I would certainly appreciate help though, as I've been terribly busy with graduate school the last few months! I plan to rebase on master later this week, assuming nothing pops up on my calendar.
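A sketch of the kind of per-example fix implied here, assuming gpt_params::n_threads and n_threads_batch are now per-node std::vector<int32_t> values (as the build log later in this thread suggests) and that the standalone examples just take the head node's entry; index 0 and the fallback are assumptions, not the committed fix:

// examples/batched/batched.cpp (illustrative patch, not the actual commit)
ctx_params.n_threads = params.n_threads.empty()
        ? get_num_physical_cores()            // common.cpp helper used as a fallback
        : (uint32_t) params.n_threads[0];     // head node's per-node thread count
ctx_params.n_threads_batch = params.n_threads_batch.empty()
        ? ctx_params.n_threads                // mirror the old "-1 means same as n_threads"
        : (uint32_t) params.n_threads_batch[0];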
@AutonomicPerfectionist, but the functionality itself is not working, though.
… buffer type be a host buffer, fix rebase errors
Force-pushed from d407058 to 2217b02
I've ported over the fixes for the KV operations.

The MPI backend now functions on a "transaction" principle where each operation sent down the pipeline is considered atomic and strictly ordered. This is accomplished by first sending a message with a tag designating the beginning of the transaction and an identifier for the type of transaction. The transaction type is then used to determine which function the worker node needs to execute. Tags are also used within the various functions to designate the type of information being sent or received, and thanks to MPI's ordering guarantees, all of this means that the system transparently preserves the order of operations throughout the entire pipeline as dictated by the head node. For example, the head node could begin decoding (once I implement the pipeline-parallel / async operators), then rearrange KV cache sequences (firing off more messages down the pipeline), and finally wait for the results of the initial decode. The downstream worker nodes would maintain that ordering so that the result of the pipeline is guaranteed to use the correct KV cache entries.

Transactions are only used for operations that need consistent ordering; there's still the ability to send messages and have them "jump" the queue for high-priority or exotic messages. An example is the "shutdown" message, which isn't fully implemented here yet but will fix the issue with the MPI backend hanging once the program finishes. The atomicity of transactions is still preserved; the "shutdown" command would only be processed once whatever transaction is currently being processed is finished. To break that atomicity, applications would need to explicitly probe for messages within the processing function of a transaction. This would only really be useful for canceling processing mid-stream, which will be supported as well.
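A rough sketch of the transaction handshake described above, illustrative only; the tag value, type constants, and function names here are invented for the example and are not the backend's actual identifiers:

#include <mpi.h>

// Hypothetical transaction types and the tag that opens a transaction.
enum mpi_trans_type { TRANS_DECODE = 1, TRANS_KV_SEQ_RM = 2 };
static const int TAG_TRANS_BEGIN = 100;

// Head node: open a transaction of a given type for the next rank, then send
// the transaction body as further tagged messages.
static void send_transaction_begin(int next_rank, int type) {
    MPI_Send(&type, 1, MPI_INT, next_rank, TAG_TRANS_BEGIN, MPI_COMM_WORLD);
    // ... messages making up the transaction body follow, each with its own tag ...
}

// Worker node: block until a transaction begins, then dispatch on its type.
static void worker_process_one_transaction() {
    int type = 0;
    MPI_Recv(&type, 1, MPI_INT, MPI_ANY_SOURCE, TAG_TRANS_BEGIN,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    switch (type) {
        case TRANS_DECODE:    /* receive the batch, run local layers, forward results */ break;
        case TRANS_KV_SEQ_RM: /* receive sequence IDs, update the local KV cache      */ break;
        default:              /* unknown transaction type                              */ break;
    }
}

Because MPI point-to-point messages between a given pair of ranks on the same communicator are non-overtaking, the body of one transaction cannot interleave with the next, which is what provides the atomic, strictly ordered behavior described above.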
@AutonomicPerfectionist so the PR is graduating from draft :D
Not quite yet, there are still some things to be fixed before I'd consider it ready for general usage. Primarily, memory leaks and missing bounds checks need to be fixed. At the moment, if you run it without using the correct number of
TL;DR: I simply solved it by adding the missing include directive:

#include <map>

When I try to make, I get this:

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -L/opt/homebrew/opt/llvm/lib -L/opt/homebrew/opt/bison/lib -L/opt/homebrew/opt/llvm/lib -L/opt/homebrew/opt/bison/lib
I CC: Homebrew clang version 17.0.6
I CXX: Homebrew clang version 17.0.6
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml.c -o ggml.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c llama.cpp -o llama.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/common.cpp -o common.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/sampling.cpp -o sampling.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/grammar-parser.cpp -o grammar-parser.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/console.cpp -o console.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c ggml-mpi.cpp -o ggml-mpi.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-metal.m -o ggml-metal.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o
ggml-mpi.cpp:101:5: warning: no previous prototype for function 'ggml_mpi_next_node' [-Wmissing-prototypes]
101 | int ggml_mpi_next_node(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:101:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
101 | int ggml_mpi_next_node(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:105:5: warning: no previous prototype for function 'ggml_mpi_prev_node' [-Wmissing-prototypes]
105 | int ggml_mpi_prev_node(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:105:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
105 | int ggml_mpi_prev_node(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:138:6: warning: no previous prototype for function 'ggml_mpi_barrier' [-Wmissing-prototypes]
138 | void ggml_mpi_barrier(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:138:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
138 | void ggml_mpi_barrier(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:173:37: warning: unused parameter 'n_seq_max' [-Wunused-parameter]
173 | uint32_t n_seq_max) {
| ^
ggml-mpi.cpp:408:10: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_get_comm' [-Wmissing-prototypes]
408 | MPI_Comm ggml_backend_mpi_buffer_type_get_comm(ggml_backend_buffer_type_t buft) {
| ^
ggml-mpi.cpp:408:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
408 | MPI_Comm ggml_backend_mpi_buffer_type_get_comm(ggml_backend_buffer_type_t buft) {
| ^
| static
ggml-mpi.cpp:414:10: warning: no previous prototype for function 'ggml_backend_mpi_buffer_get_comm' [-Wmissing-prototypes]
414 | MPI_Comm ggml_backend_mpi_buffer_get_comm(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:414:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
414 | MPI_Comm ggml_backend_mpi_buffer_get_comm(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:418:10: warning: no previous prototype for function 'ggml_backend_mpi_get_comm' [-Wmissing-prototypes]
418 | MPI_Comm ggml_backend_mpi_get_comm(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:418:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
418 | MPI_Comm ggml_backend_mpi_get_comm(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:424:5: warning: no previous prototype for function 'ggml_backend_mpi_buffer_local_rank' [-Wmissing-prototypes]
424 | int ggml_backend_mpi_buffer_local_rank(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:424:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
424 | int ggml_backend_mpi_buffer_local_rank(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:438:5: warning: no previous prototype for function 'ggml_backend_mpi_local_rank' [-Wmissing-prototypes]
438 | int ggml_backend_mpi_local_rank(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:438:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
438 | int ggml_backend_mpi_local_rank(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:445:5: warning: no previous prototype for function 'ggml_backend_mpi_buffer_rank' [-Wmissing-prototypes]
445 | int ggml_backend_mpi_buffer_rank(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:445:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
445 | int ggml_backend_mpi_buffer_rank(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:458:5: warning: no previous prototype for function 'ggml_backend_mpi_rank' [-Wmissing-prototypes]
458 | int ggml_backend_mpi_rank(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:458:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
458 | int ggml_backend_mpi_rank(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:463:23: warning: no previous prototype for function 'ggml_backend_mpi_buffer_unwrap' [-Wmissing-prototypes]
463 | ggml_backend_buffer_t ggml_backend_mpi_buffer_unwrap(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:463:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
463 | ggml_backend_buffer_t ggml_backend_mpi_buffer_unwrap(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:473:28: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_unwrap' [-Wmissing-prototypes]
473 | ggml_backend_buffer_type_t ggml_backend_mpi_buffer_type_unwrap(ggml_backend_buffer_type_t buft) {
| ^
ggml-mpi.cpp:473:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
473 | ggml_backend_buffer_type_t ggml_backend_mpi_buffer_type_unwrap(ggml_backend_buffer_type_t buft) {
| ^
| static
ggml-mpi.cpp:506:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_copy_ctx' [-Wmissing-prototypes]
506 | GGML_CALL void ggml_backend_mpi_buffer_type_copy_ctx(ggml_backend_buffer_type_t src, ggml_backend_buffer_type_t dst) {
| ^
ggml-mpi.cpp:506:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
506 | GGML_CALL void ggml_backend_mpi_buffer_type_copy_ctx(ggml_backend_buffer_type_t src, ggml_backend_buffer_type_t dst) {
| ^
| static
ggml-mpi.cpp:514:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_copy_ctx' [-Wmissing-prototypes]
514 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx(ggml_backend_buffer_t src, ggml_backend_buffer_t dst) {
| ^
ggml-mpi.cpp:514:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
514 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx(ggml_backend_buffer_t src, ggml_backend_buffer_t dst) {
| ^
| static
ggml-mpi.cpp:523:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_copy_ctx_from_type' [-Wmissing-prototypes]
523 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx_from_type(ggml_backend_buffer_type_t src, ggml_backend_buffer_t dst) {
| ^
ggml-mpi.cpp:523:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
523 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx_from_type(ggml_backend_buffer_type_t src, ggml_backend_buffer_t dst) {
| ^
| static
ggml-mpi.cpp:710:30: warning: no previous prototype for function 'ggml_mpi_available_devices_internal' [-Wmissing-prototypes]
710 | std::vector<ggml_mpi_device> ggml_mpi_available_devices_internal() {
| ^
ggml-mpi.cpp:710:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
710 | std::vector<ggml_mpi_device> ggml_mpi_available_devices_internal() {
| ^
| static
ggml-mpi.cpp:733:16: warning: no previous prototype for function 'ggml_backend_is_mpi' [-Wmissing-prototypes]
733 | GGML_CALL bool ggml_backend_is_mpi(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:733:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
733 | GGML_CALL bool ggml_backend_is_mpi(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:776:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:776:13: error: a type specifier is required for all declarations
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ~~~~~~ ^
ggml-mpi.cpp:776:13: error: template specialization requires 'template<>'
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:776:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:776:72: error: expected ';' after top level declarator
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ^
| ;
ggml-mpi.cpp:778:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:778:13: error: a type specifier is required for all declarations
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ~~~~~~ ^
ggml-mpi.cpp:778:13: error: template specialization requires 'template<>'
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:778:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:778:62: error: expected ';' after top level declarator
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ^
| ;
ggml-mpi.cpp:780:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:780:13: error: a type specifier is required for all declarations
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ~~~~~~ ^
ggml-mpi.cpp:780:13: error: template specialization requires 'template<>'
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:780:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:780:50: error: expected ';' after top level declarator
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ^
| ;
ggml-mpi.cpp:816:5: error: use of undeclared identifier 'cached_wrappers'
816 | cached_wrappers[buft] = ggml_backend_wrapped_buffer_type;
| ^
ggml-mpi.cpp:910:5: error: use of undeclared identifier 'cached_buffer_wrappers'
910 | cached_buffer_wrappers[buf] = buffer;
| ^
ggml-mpi.cpp:917:55: warning: unused parameter 'backend_src' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:83: warning: unused parameter 'backend_dst' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:123: warning: unused parameter 'src' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:149: warning: unused parameter 'dst' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:6: warning: no previous prototype for function 'ggml_backend_mpi_cpy_tensor_async' [-Wmissing-prototypes]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
| static
ggml-mpi.cpp:942:115: warning: unused parameter 'offset' [-Wunused-parameter]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:130: warning: unused parameter 'size' [-Wunused-parameter]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:6: warning: no previous prototype for function 'ggml_backend_mpi_set_tensor_async' [-Wmissing-prototypes]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
| static
ggml-mpi.cpp:985:5: warning: missing field 'offload_op' initializer [-Wmissing-field-initializers]
985 | };
| ^
ggml-mpi.cpp:993:5: error: use of undeclared identifier 'cached_backends'; did you mean 'wrapped_backends'?
993 | cached_backends[wrapped_backends] = mpi_backend;
| ^~~~~~~~~~~~~~~
| wrapped_backends
ggml-mpi.cpp:956:55: note: 'wrapped_backends' declared here
956 | ggml_backend_t ggml_backend_mpi_init(ggml_backend_t * wrapped_backends, size_t num_backends, int rank) {
| ^
ggml-mpi.cpp:993:20: error: array subscript is not an integer
993 | cached_backends[wrapped_backends] = mpi_backend;
| ^~~~~~~~~~~~~~~~~
ggml-mpi.cpp:998:77: warning: unused parameter 'user_data' [-Wunused-parameter]
998 | static ggml_backend_t ggml_backend_reg_mpi_init(const char * params, void * user_data) {
| ^
28 warnings and 19 errors generated.
make: *** [ggml-mpi.o] Error 1
make: *** Waiting for unfinished jobs....
common/common.cpp:1948:26: warning: comparison of integers of different signs: 'int' and 'size_type' (aka 'unsigned long') [-Wsign-compare]
1948 | n_threads = (node_id >= params.n_threads.size()) ? get_num_physical_cores() : params.n_threads[node_id];
| ~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
common/common.cpp:1949:40: warning: comparison of integers of different signs: 'int' and 'size_type' (aka 'unsigned long') [-Wsign-compare]
1949 | int32_t n_threads_batch = (node_id >= params.n_threads_batch.size()) ? -1 : params.n_threads_batch[node_id];
| ~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llama.cpp:9076:9: warning: unused variable 'old_tokens' [-Wunused-variable]
9076 | int old_tokens = batch_all.n_tokens;
| ^~~~~~~~~~
llama.cpp:13016:12: warning: 'return' will never be executed [-Wunreachable-code-return]
13016 | return 0;
| ^
llama.cpp:13461:57: warning: unused parameter 'ctx' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:13461:68: warning: unused parameter 'device_weights' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:13461:93: warning: unused parameter 'num_weights' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:14475:5: warning: no previous prototype for function 'llama_process_mpi_transaction' [-Wmissing-prototypes]
14475 | int llama_process_mpi_transaction(
| ^
llama.cpp:14475:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
14475 | int llama_process_mpi_transaction(
| ^
| static
llama.cpp:14512:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14512 | break;
| ^~~~~
llama.cpp:14487:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14487 | break;
| ^~~~~
llama.cpp:14522:13: warning: unused variable 'count' [-Wunused-variable]
14522 | int32_t count;
| ^~~~~
llama.cpp:14517:5: warning: no previous prototype for function 'llama_process_mpi_worker' [-Wmissing-prototypes]
14517 | int llama_process_mpi_worker(
| ^
llama.cpp:14517:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
14517 | int llama_process_mpi_worker(
| ^
| static
llama.cpp:14532:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14532 | break;
| ^~~~~
llama.cpp:14537:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14537 | break;
| ^~~~~
llama.cpp:14550:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14550 | break;
| ^~~~~
2 warnings generated.
13 warnings generated.
Interesting... At the moment the wrapper caches are the only parts using maps, but those caches are not actually being used. I'll add the include in my next push if I end up needing the caches; otherwise I'll remove them, and therefore the dependency on <map>.
@AutonomicPerfectionist, may I ask, is it expected to get a crash with the MPS backend when the -ngl parameter is set > 1?
Yeah, that's pretty much expected; I haven't tested with anything but CPU as the wrapped backend, so Metal, CUDA, Vulkan, and SYCL I expect to all crash horrifically. The plan is to fix that so you can run with any backend. If you can, could you put all the details of what you tried here so I can fix it? I only have access to NVIDIA and AMD GPUs, so if you have a Mac with Metal it would be a big help to provide any error messages. Also try to build with debug mode on, and in the case of a SEGFAULT you can either use a debugger to investigate further or decompile.
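Not from this PR, but one way to get a backtrace from each rank, assuming the Makefile's LLAMA_DEBUG flag and an Open MPI install; the model path and generation options here are placeholders:

make clean
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 LLAMA_DEBUG=1 -j
mpirun -np 2 lldb --batch -o run -o bt -- ./main -m model.gguf -p "hello" -n 16

Output from the ranks will interleave on the terminal, but each crashing rank should still print a usable backtrace.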
Yes, of course. I will inform you when I get a chance to test it. Probably this afternoon.

Edit: Sorry for my late response, @AutonomicPerfectionist.

$ mpirun -np 1 --hostfile ~/.config/llama.cpp/hostfile /projects/open-source/llama.cpp.mpi.latest/main -m ~/Desktop/model.gguf -b 256 -ngl 1 -t 8 -tb 8 -c 500 --temp 0.0 -n 100 -p "[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer with \`\`\`:
Write a purely functional Haskell code for fibonacci:\n\n
[/INST]
" &> mpi.log

mpi.log (attached)
Overview
This PR adds a new example and adds functionality to the MPI backend to support per-node options. The new example was created to keep MPI-specific enhancements and workarounds separate from the main codebase as much as possible, and is based on the main example. There are several new functions in the MPI backend, one in the llama API, and one new command-line argument.

Major Changes
MPI Example
The major difference between the MPI example and the main example currently is that the mpi example reads in options from a file instead of from the command line. This is done using the wordexp functions available in POSIX.1-2001 compliant systems.
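A minimal illustration of the wordexp() approach, not the example's actual code; the option string and flag values below are placeholders:

// Turn one line of an options file into argv-style words using POSIX wordexp().
#include <cstdio>
#include <wordexp.h>

int main(void) {
    const char * line = "-m model.gguf -c 512 --mpi-layer-split 0.5,0.3,0.2";
    wordexp_t exp;
    if (wordexp(line, &exp, 0) != 0) {
        return 1;   // expansion error
    }
    for (size_t i = 0; i < exp.we_wordc; ++i) {
        printf("argv[%zu] = %s\n", i, exp.we_wordv[i]);   // shell-like word splitting and expansion
    }
    wordfree(&exp);
    return 0;
}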
Llama API Additions

The mpi example also calls newly created llama functions pertaining to the MPI backend. Currently, there is one such function: llama_split_layers_weighted(). This function takes in a vector of weights and splits the layers among the available compute devices (nodes in the case of MPI) according to those weights, rather than requiring direct layer counts like --n-gpu-layers. This function was added primarily as a timesaver, to prevent needing to calculate the layer counts manually when changing models or when swapping more powerful nodes with less powerful ones.

The llama_split_layers_weighted() function is currently only implemented for MPI. The implementation calculates the layer ranges for each node only on the head node, and then distributes these ranges to the other nodes via an MPI_Scatter() collective operation.
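A sketch of how the head node might compute and distribute the ranges, assuming one weight per rank and a flattened start/end pair per node; this is illustrative, not the PR's implementation:

#include <mpi.h>
#include <cstdint>
#include <vector>

// Head node computes [start, end) layer ranges from the weights and scatters
// one pair to every rank (including itself).
static void scatter_layer_ranges(const std::vector<float> & weights, int n_layers,
                                 uint16_t my_range[2], MPI_Comm comm) {
    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<uint16_t> ranges;                   // start0, end0, start1, end1, ...
    if (rank == 0) {
        float total = 0.0f;
        for (float w : weights) total += w;
        uint16_t start = 0;
        for (int i = 0; i < size; ++i) {
            uint16_t end = (i == size - 1)
                ? (uint16_t) n_layers               // last node absorbs any rounding remainder
                : (uint16_t) (start + n_layers * (weights[i] / total));
            ranges.push_back(start);
            ranges.push_back(end);
            start = end;
        }
    }

    MPI_Scatter(rank == 0 ? ranges.data() : nullptr, 2, MPI_UINT16_T,
                my_range, 2, MPI_UINT16_T, 0, comm);
    // my_range now holds this node's [start, end) layer interval
}

On the command line this would presumably be driven by the new argument, something like --mpi-layer-split 0.5,0.3,0.2, though the exact syntax is defined by this PR's help text rather than by this sketch.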
MPI Backend Changes

Within the ggml-mpi backend, I added the ability to use other communicators besides MPI_COMM_WORLD. This is not yet used but will be utilized in further studies and experiments. This is in addition to the change to layer ranges described above. I also added Doxygen-style doc comments to the MPI backend header, primarily for my own use, as I tend to forget details if they are not written down.
Llama Internal Changes

Finally, some modifications were made to llama.cpp and common.cpp to work around issues. I had moved the infinite loop used by the worker nodes into the llama_eval() function, so that operations with the llama context could be done on all nodes. This caused worker nodes to enter their infinite loops early due to the model warmup in llama_init_from_gpt_params(), so that warmup is disabled in MPI mode.
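A minimal sketch of how the warmup skip could be gated; the actual PR may use a different mechanism (for example a runtime check rather than the preprocessor), so treat the details as assumptions:

// Inside llama_init_from_gpt_params(): skip the warmup decode for MPI builds
// so worker ranks don't drop into their evaluation loop prematurely.
#ifndef GGML_USE_MPI
    {
        std::vector<llama_token> tmp = { llama_token_bos(model) };   // single-token warmup batch
        llama_decode(lctx, llama_batch_get_one(tmp.data(), (int) tmp.size(), 0, 0));
        llama_kv_cache_clear(lctx);                                  // discard the warmup KV entries
    }
#endif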
Why is this a draft?

There are still tasks that must be completed before this PR is ready to merge:

- Add --mpi-layer-split to the help text
- ranges in llama_split_layers_weighted still needs to be freed

Additionally, a large change in the API is coming in #3228 that will require changes to the MPI backend. Those changes may as well be done here.
Reviewing
Please let me know of any changes desired or if there are any questions. I tried to stick to the code style I've seen in this project, but please point out any areas I missed. I believe the API additions are non-breaking, but please let me know your thoughts on them and whether I should change or remove them.