[MPI] Add support for per-node options, thread counts, and layer allocations #3334
Conversation
Have you, by any chance, encountered this problem? It seems like in the original MPI implementation there was a sync step missing somewhere, and rank 0 was done while the other instances were stuck. Not sure if it's applicable to this PR, but you seem to know MPI better than me at least, so maybe you'll have some idea as to why it's happening.
If you mean the issue that the worker nodes don't terminate when the model outputs the end-of-stream token, that is a known issue. It's not a missing sync anywhere; rather, the architecture of the MPI backend didn't take it into account. Each node only expects one type of message to be sent to it, and since the sampling is done only at the head node, the workers don't have any information about when it's time to stop. This PR does not fix that problem because it is out of scope, but it will likely be fixed in future PRs I am planning.
llama.h
Outdated
@@ -230,6 +230,8 @@ extern "C" {
    const char * path_model,
    struct llama_context_params params);

    LLAMA_API void llama_split_layers_weighted(struct llama_context * ctx, std::vector<float> device_weights);
We should not use C++ in the C-style API
Sounds good, would replacing with a float array and a size_t length be sufficient?
Should be fixed now
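For reference, the C-compatible declaration that replaces the std::vector parameter (the float array plus length suggested above, matching the signature that later appears in the build log):

LLAMA_API void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights);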
Overall the PR seems OK. We should try to adapt to the changes from #3228
Yep, that's what I will be doing over the weekend
Force-pushed from a0ee1eb to 77cd3e0
This PR is now fully functional again after the recent changes and has been rebased on master. Only basic inferencing functionality has been tested; more advanced functionality like batching and speculation is unlikely to work.
Hi, thanks for taking the time. I'll probably interfere a bit with your change as I'm making some refactoring changes. Had a quick glance at the PR and will look more later.
Yep, that's one reason this PR is still a draft; I just copied main to use as a scratch pad. The original idea used
Looks like the names of the tensors have been changed, which breaks MPI. The current implementation relied on there being
Oops, I forgot about the purpose of these names and removed them recently.
I adjusted the command-line argument parsing so you can pass a comma-separated list to both. I also added the function call needed for scattering the layer ranges to the other nodes. After that's done, I should be able to remove the MPI example entirely.
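A minimal sketch of how such a comma-separated, per-node thread list could be parsed and applied; the helper names below are hypothetical and not necessarily what this PR uses:

// Hypothetical helpers: turn "8,4,4" into per-node thread counts and pick the
// entry for this node's rank, falling back to a default when the list is
// shorter than the number of nodes.
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

static std::vector<int32_t> parse_thread_list(const std::string & arg) {
    std::vector<int32_t> counts;
    std::stringstream ss(arg);
    std::string item;
    while (std::getline(ss, item, ',')) {
        counts.push_back((int32_t) std::stoi(item));
    }
    return counts;
}

static int32_t threads_for_node(const std::vector<int32_t> & counts, int node_id, int32_t fallback) {
    return (node_id < (int) counts.size()) ? counts[node_id] : fallback;
}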
Performance with this branch looks interesting. I was able to run Llama 2 70B across a homemade cluster of 3 CPU-only nodes: i7 9th gen, i5 4th gen, and i5 2nd gen, with 16 GB DDR4 2666 MHz, 16 GB DDR3, and 8 GB DDR3 respectively. On this cluster I got around 0.58 tokens/second for 70B Q3_K_M. htop showed roughly 40-60% CPU utilization across all hardware cores when processing the allocated layers, but it's unclear whether that's because the spikes are so short and htop isn't sampling often enough. Curiously, this isn't much slower than running on a second cluster of much more powerful hardware: a Ryzen 5 5600G and an i7 9th gen, with 32 GB DDR4 3200 MHz each. The second cluster got roughly 0.64 tokens/second while being much more expensive. I attempted to run it on the Ryzen machine alone to gauge MPI overhead via offloading to my 6700 XT, but ROCm wouldn't install and OpenCL caused hangs. I plan on doing more in-depth performance investigations to determine where the bottleneck is. I have access to a proper university cluster as well that I'll be testing on.
I'm 99.9% certain raw perf counters come from the Linux kernel directly, and are not calculated at a point in time but from aggregated ticks, effectively being deltas between samples, so you cannot "miss" a sample. You can always dump raw perf counters to a tmpfs file in a loop and parse them later. But the chances that htop or top are wrong are low.
Found what was up with htop: there's a command-line switch. After tuning the clusters by adjusting the layer split percentages so that no node was swapping to disk, I achieved 0.69 tokens/second on the weaker cluster and 0.78 tokens/second on the Ryzen cluster. Running on an AMD EPYC 7543P 32-core processor without MPI resulted in 1.01 tokens/second, although that system was NUMA and I didn't have permissions to adjust the memory configuration.
Discovered a bug in this implementation regarding the KV cache: syncing the sequence IDs isn't enough.
Force-pushed from b4c7045 to 51f3f8f
I can confirm that this PR is not building on Apple silicon. If it's unexpected, I can provide every bit of information needed to help you fellas.
I don't have Apple silicon devices to test on, so whatever information you have would be greatly appreciated.
Actually, it's the same with your CI logs, but I'll add more context with this message soon.

Edit:

Context:

System: Apple Silicon, M1 Max 64GB/2TB

gh pr checkout 3334 # check out this PR
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 LLAMA_NO_METAL=1 -j10 # make for compiling.
# output of make:
...
examples/batched/batched.cpp:81:41: error: assigning to 'uint32_t' (aka 'unsigned int') from incompatible type 'std::vector<int32_t>' (aka 'vector<int>')
81 | ctx_params.n_threads = params.n_threads;
| ~~~~~~~^~~~~~~~~
examples/batched/batched.cpp:82:57: error: invalid operands to binary expression ('std::vector<int32_t>' (aka 'vector<int>') and 'int')
82 | ctx_params.n_threads_batch = params.n_threads_batch == -1 ? params.n_threads : params.n_threads_batch;
| ~~~~~~~~~~~~~~~~~~~~~~ ^ ~~
...
2 errors generated.
make: *** [simple] Error 1
make: *** Waiting for unfinished jobs....
2 errors generated.
make: *** [batched-bench] Error 1
2 errors generated.
make: *** [batched] Error 1

Versions:

make --version ##
GNU Make 3.81
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
This program built for i386-apple-darwin11.3.0
#########
mpirun --version
mpirun (Open MPI) 5.0.1
#########
mpicc --version # or mpicxx --version, they are the same dependency.
Homebrew clang version 17.0.6
Target: arm64-apple-darwin23.2.0
Thread model: posix
InstalledDir: /opt/homebrew/opt/llvm/bin

Edit II: I couldn't resist and fixed it by removing the problematic build options from the Makefile, and the build succeeded with make. But the result is a failure since it crashes with a segmentation fault.
Ah yes, those errors are due to me not updating all of the examples; it should be a simple fix. I would certainly appreciate help though, as I've been terribly busy with graduate school the last few months! I plan to rebase on master later this week, assuming nothing pops up on my calendar.
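A sketch of the kind of per-example fix implied here, assuming gpt_params::n_threads and n_threads_batch are now per-node std::vector<int32_t> values (as the build log later in this thread suggests) and that the standalone examples just take the head node's entry; index 0 and the fallback are assumptions, not the committed fix:

// examples/batched/batched.cpp (illustrative patch, not the actual commit)
ctx_params.n_threads = params.n_threads.empty()
        ? get_num_physical_cores()            // common.cpp helper used as a fallback
        : (uint32_t) params.n_threads[0];     // head node's per-node thread count
ctx_params.n_threads_batch = params.n_threads_batch.empty()
        ? ctx_params.n_threads                // mirror the old "-1 means same as n_threads"
        : (uint32_t) params.n_threads_batch[0];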
@AutonomicPerfectionist, but the functionality itself is not working, though.
… buffer type be a host buffer, fix rebase errors
Force-pushed from d407058 to 2217b02
I've ported over the fixes for the KV operations.

The MPI backend now functions on a "transaction" principle where each operation sent down the pipeline is considered atomic and strictly ordered. This is accomplished by first sending a message with a tag designating the beginning of the transaction and an identifier for the type of transaction. The transaction type is then used to determine which function the worker node needs to execute. Tags are also used within the various functions to designate the type of information being sent or received, and thanks to MPI's ordering guarantees, all of this means that the system transparently preserves the order of operations throughout the entire pipeline as dictated by the head node. For example, the head node could begin decoding (once I implement the pipeline-parallel / async operators), then rearrange KV cache sequences (firing off more messages down the pipeline), and finally wait for the results of the initial decode. The downstream worker nodes would maintain that ordering so that the result of the pipeline is guaranteed to use the correct KV cache entries.

Transactions are only used for operations that need consistent ordering; there's still the ability to send messages and have them "jump" the queue for high-priority or exotic messages. An example is the "shutdown" message, which isn't fully implemented here yet but will fix the issue with the MPI backend hanging once the program finishes. The atomicity of transactions is still preserved; the "shutdown" command would only be processed once whatever transaction is currently being processed is finished. To break that atomicity, applications would need to explicitly probe for messages within the processing function of a transaction. This would only really be useful for canceling processing mid-stream, which will be supported as well.
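A rough sketch of the transaction handshake described above, illustrative only; the tag value, type constants, and function names here are invented for the example and are not the backend's actual identifiers:

#include <mpi.h>

// Hypothetical transaction types and the tag that opens a transaction.
enum mpi_trans_type { TRANS_DECODE = 1, TRANS_KV_SEQ_RM = 2 };
static const int TAG_TRANS_BEGIN = 100;

// Head node: open a transaction of a given type for the next rank, then send
// the transaction body as further tagged messages.
static void send_transaction_begin(int next_rank, int type) {
    MPI_Send(&type, 1, MPI_INT, next_rank, TAG_TRANS_BEGIN, MPI_COMM_WORLD);
    // ... messages making up the transaction body follow, each with its own tag ...
}

// Worker node: block until a transaction begins, then dispatch on its type.
static void worker_process_one_transaction() {
    int type = 0;
    MPI_Recv(&type, 1, MPI_INT, MPI_ANY_SOURCE, TAG_TRANS_BEGIN,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    switch (type) {
        case TRANS_DECODE:    /* receive the batch, run local layers, forward results */ break;
        case TRANS_KV_SEQ_RM: /* receive sequence IDs, update the local KV cache      */ break;
        default:              /* unknown transaction type                              */ break;
    }
}

Because MPI point-to-point messages between a given pair of ranks on the same communicator are non-overtaking, the body of one transaction cannot interleave with the next, which is what provides the atomic, strictly ordered behavior described above.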
@AutonomicPerfectionist so the PR is graduating from draft :D
Not quite yet, there are still some things to be fixed before I'd consider it ready for general usage. Primarily, memory leaks and missing bounds checks need to be fixed. At the moment, if you run it without using the correct number of
TL;DR: I simply solved it by adding the missing include directive:

#include <map>

When I try to make, I get this:

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion
I CXXFLAGS: -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include
I NVCCFLAGS: -std=c++11 -O3
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -L/opt/homebrew/opt/llvm/lib -L/opt/homebrew/opt/bison/lib -L/opt/homebrew/opt/llvm/lib -L/opt/homebrew/opt/bison/lib
I CC: Homebrew clang version 17.0.6
I CXX: Homebrew clang version 17.0.6
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml.c -o ggml.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c llama.cpp -o llama.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/common.cpp -o common.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/sampling.cpp -o sampling.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/grammar-parser.cpp -o grammar-parser.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c common/console.cpp -o console.o
mpicxx -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread -Wno-cast-qual -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/bison/include -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -c ggml-mpi.cpp -o ggml-mpi.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-metal.m -o ggml-metal.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-alloc.c -o ggml-alloc.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-backend.c -o ggml-backend.o
mpicc -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_MPI -DGGML_USE_METAL -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include -std=c11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wno-cast-qual -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion -c ggml-quants.c -o ggml-quants.o
ggml-mpi.cpp:101:5: warning: no previous prototype for function 'ggml_mpi_next_node' [-Wmissing-prototypes]
101 | int ggml_mpi_next_node(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:101:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
101 | int ggml_mpi_next_node(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:105:5: warning: no previous prototype for function 'ggml_mpi_prev_node' [-Wmissing-prototypes]
105 | int ggml_mpi_prev_node(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:105:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
105 | int ggml_mpi_prev_node(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:138:6: warning: no previous prototype for function 'ggml_mpi_barrier' [-Wmissing-prototypes]
138 | void ggml_mpi_barrier(struct ggml_mpi_context * ctx_mpi) {
| ^
ggml-mpi.cpp:138:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
138 | void ggml_mpi_barrier(struct ggml_mpi_context * ctx_mpi) {
| ^
| static
ggml-mpi.cpp:173:37: warning: unused parameter 'n_seq_max' [-Wunused-parameter]
173 | uint32_t n_seq_max) {
| ^
ggml-mpi.cpp:408:10: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_get_comm' [-Wmissing-prototypes]
408 | MPI_Comm ggml_backend_mpi_buffer_type_get_comm(ggml_backend_buffer_type_t buft) {
| ^
ggml-mpi.cpp:408:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
408 | MPI_Comm ggml_backend_mpi_buffer_type_get_comm(ggml_backend_buffer_type_t buft) {
| ^
| static
ggml-mpi.cpp:414:10: warning: no previous prototype for function 'ggml_backend_mpi_buffer_get_comm' [-Wmissing-prototypes]
414 | MPI_Comm ggml_backend_mpi_buffer_get_comm(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:414:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
414 | MPI_Comm ggml_backend_mpi_buffer_get_comm(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:418:10: warning: no previous prototype for function 'ggml_backend_mpi_get_comm' [-Wmissing-prototypes]
418 | MPI_Comm ggml_backend_mpi_get_comm(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:418:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
418 | MPI_Comm ggml_backend_mpi_get_comm(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:424:5: warning: no previous prototype for function 'ggml_backend_mpi_buffer_local_rank' [-Wmissing-prototypes]
424 | int ggml_backend_mpi_buffer_local_rank(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:424:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
424 | int ggml_backend_mpi_buffer_local_rank(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:438:5: warning: no previous prototype for function 'ggml_backend_mpi_local_rank' [-Wmissing-prototypes]
438 | int ggml_backend_mpi_local_rank(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:438:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
438 | int ggml_backend_mpi_local_rank(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:445:5: warning: no previous prototype for function 'ggml_backend_mpi_buffer_rank' [-Wmissing-prototypes]
445 | int ggml_backend_mpi_buffer_rank(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:445:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
445 | int ggml_backend_mpi_buffer_rank(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:458:5: warning: no previous prototype for function 'ggml_backend_mpi_rank' [-Wmissing-prototypes]
458 | int ggml_backend_mpi_rank(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:458:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
458 | int ggml_backend_mpi_rank(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:463:23: warning: no previous prototype for function 'ggml_backend_mpi_buffer_unwrap' [-Wmissing-prototypes]
463 | ggml_backend_buffer_t ggml_backend_mpi_buffer_unwrap(ggml_backend_buffer_t buffer) {
| ^
ggml-mpi.cpp:463:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
463 | ggml_backend_buffer_t ggml_backend_mpi_buffer_unwrap(ggml_backend_buffer_t buffer) {
| ^
| static
ggml-mpi.cpp:473:28: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_unwrap' [-Wmissing-prototypes]
473 | ggml_backend_buffer_type_t ggml_backend_mpi_buffer_type_unwrap(ggml_backend_buffer_type_t buft) {
| ^
ggml-mpi.cpp:473:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
473 | ggml_backend_buffer_type_t ggml_backend_mpi_buffer_type_unwrap(ggml_backend_buffer_type_t buft) {
| ^
| static
ggml-mpi.cpp:506:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_type_copy_ctx' [-Wmissing-prototypes]
506 | GGML_CALL void ggml_backend_mpi_buffer_type_copy_ctx(ggml_backend_buffer_type_t src, ggml_backend_buffer_type_t dst) {
| ^
ggml-mpi.cpp:506:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
506 | GGML_CALL void ggml_backend_mpi_buffer_type_copy_ctx(ggml_backend_buffer_type_t src, ggml_backend_buffer_type_t dst) {
| ^
| static
ggml-mpi.cpp:514:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_copy_ctx' [-Wmissing-prototypes]
514 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx(ggml_backend_buffer_t src, ggml_backend_buffer_t dst) {
| ^
ggml-mpi.cpp:514:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
514 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx(ggml_backend_buffer_t src, ggml_backend_buffer_t dst) {
| ^
| static
ggml-mpi.cpp:523:16: warning: no previous prototype for function 'ggml_backend_mpi_buffer_copy_ctx_from_type' [-Wmissing-prototypes]
523 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx_from_type(ggml_backend_buffer_type_t src, ggml_backend_buffer_t dst) {
| ^
ggml-mpi.cpp:523:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
523 | GGML_CALL void ggml_backend_mpi_buffer_copy_ctx_from_type(ggml_backend_buffer_type_t src, ggml_backend_buffer_t dst) {
| ^
| static
ggml-mpi.cpp:710:30: warning: no previous prototype for function 'ggml_mpi_available_devices_internal' [-Wmissing-prototypes]
710 | std::vector<ggml_mpi_device> ggml_mpi_available_devices_internal() {
| ^
ggml-mpi.cpp:710:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
710 | std::vector<ggml_mpi_device> ggml_mpi_available_devices_internal() {
| ^
| static
ggml-mpi.cpp:733:16: warning: no previous prototype for function 'ggml_backend_is_mpi' [-Wmissing-prototypes]
733 | GGML_CALL bool ggml_backend_is_mpi(ggml_backend_t backend) {
| ^
ggml-mpi.cpp:733:11: note: declare 'static' if the function is not intended to be used outside of this translation unit
733 | GGML_CALL bool ggml_backend_is_mpi(ggml_backend_t backend) {
| ^
| static
ggml-mpi.cpp:776:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:776:13: error: a type specifier is required for all declarations
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ~~~~~~ ^
ggml-mpi.cpp:776:13: error: template specialization requires 'template<>'
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:776:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:776:72: error: expected ';' after top level declarator
776 | static std::map<ggml_backend_buffer_type_t, ggml_backend_buffer_type_t> cached_wrappers;
| ^
| ;
ggml-mpi.cpp:778:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:778:13: error: a type specifier is required for all declarations
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ~~~~~~ ^
ggml-mpi.cpp:778:13: error: template specialization requires 'template<>'
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:778:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:778:62: error: expected ';' after top level declarator
778 | static std::map<ggml_backend_buffer_t, ggml_backend_buffer_t> cached_buffer_wrappers;
| ^
| ;
ggml-mpi.cpp:780:13: error: no template named 'map' in namespace 'std'; did you mean 'max'?
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ~~~~~^~~
| max
/opt/homebrew/opt/llvm/bin/../include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
31 | max(_LIBCPP_LIFETIMEBOUND const _Tp& __a, _LIBCPP_LIFETIMEBOUND const _Tp& __b, _Compare __comp)
| ^
ggml-mpi.cpp:780:13: error: a type specifier is required for all declarations
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ~~~~~~ ^
ggml-mpi.cpp:780:13: error: template specialization requires 'template<>'
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| template<>
ggml-mpi.cpp:780:13: error: no variable template matches specialization; did you mean to use 'max' as function template instead?
ggml-mpi.cpp:780:50: error: expected ';' after top level declarator
780 | static std::map<ggml_backend_t *, ggml_backend_t> cached_backends;
| ^
| ;
ggml-mpi.cpp:816:5: error: use of undeclared identifier 'cached_wrappers'
816 | cached_wrappers[buft] = ggml_backend_wrapped_buffer_type;
| ^
ggml-mpi.cpp:910:5: error: use of undeclared identifier 'cached_buffer_wrappers'
910 | cached_buffer_wrappers[buf] = buffer;
| ^
ggml-mpi.cpp:917:55: warning: unused parameter 'backend_src' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:83: warning: unused parameter 'backend_dst' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:123: warning: unused parameter 'src' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:149: warning: unused parameter 'dst' [-Wunused-parameter]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:6: warning: no previous prototype for function 'ggml_backend_mpi_cpy_tensor_async' [-Wmissing-prototypes]
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
ggml-mpi.cpp:917:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
917 | bool ggml_backend_mpi_cpy_tensor_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, const struct ggml_tensor * src, struct ggml_tensor * dst) {
| ^
| static
ggml-mpi.cpp:942:115: warning: unused parameter 'offset' [-Wunused-parameter]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:130: warning: unused parameter 'size' [-Wunused-parameter]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:6: warning: no previous prototype for function 'ggml_backend_mpi_set_tensor_async' [-Wmissing-prototypes]
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
ggml-mpi.cpp:942:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
942 | void ggml_backend_mpi_set_tensor_async(ggml_backend_t backend, struct ggml_tensor * dst, const void* data, size_t offset, size_t size) {
| ^
| static
ggml-mpi.cpp:985:5: warning: missing field 'offload_op' initializer [-Wmissing-field-initializers]
985 | };
| ^
ggml-mpi.cpp:993:5: error: use of undeclared identifier 'cached_backends'; did you mean 'wrapped_backends'?
993 | cached_backends[wrapped_backends] = mpi_backend;
| ^~~~~~~~~~~~~~~
| wrapped_backends
ggml-mpi.cpp:956:55: note: 'wrapped_backends' declared here
956 | ggml_backend_t ggml_backend_mpi_init(ggml_backend_t * wrapped_backends, size_t num_backends, int rank) {
| ^
ggml-mpi.cpp:993:20: error: array subscript is not an integer
993 | cached_backends[wrapped_backends] = mpi_backend;
| ^~~~~~~~~~~~~~~~~
ggml-mpi.cpp:998:77: warning: unused parameter 'user_data' [-Wunused-parameter]
998 | static ggml_backend_t ggml_backend_reg_mpi_init(const char * params, void * user_data) {
| ^
28 warnings and 19 errors generated.
make: *** [ggml-mpi.o] Error 1
make: *** Waiting for unfinished jobs....
common/common.cpp:1948:26: warning: comparison of integers of different signs: 'int' and 'size_type' (aka 'unsigned long') [-Wsign-compare]
1948 | n_threads = (node_id >= params.n_threads.size()) ? get_num_physical_cores() : params.n_threads[node_id];
| ~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~
common/common.cpp:1949:40: warning: comparison of integers of different signs: 'int' and 'size_type' (aka 'unsigned long') [-Wsign-compare]
1949 | int32_t n_threads_batch = (node_id >= params.n_threads_batch.size()) ? -1 : params.n_threads_batch[node_id];
| ~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
llama.cpp:9076:9: warning: unused variable 'old_tokens' [-Wunused-variable]
9076 | int old_tokens = batch_all.n_tokens;
| ^~~~~~~~~~
llama.cpp:13016:12: warning: 'return' will never be executed [-Wunreachable-code-return]
13016 | return 0;
| ^
llama.cpp:13461:57: warning: unused parameter 'ctx' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:13461:68: warning: unused parameter 'device_weights' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:13461:93: warning: unused parameter 'num_weights' [-Wunused-parameter]
13461 | void llama_split_layers_weighted(struct llama_context * ctx, float device_weights[], size_t num_weights) {
| ^
llama.cpp:14475:5: warning: no previous prototype for function 'llama_process_mpi_transaction' [-Wmissing-prototypes]
14475 | int llama_process_mpi_transaction(
| ^
llama.cpp:14475:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
14475 | int llama_process_mpi_transaction(
| ^
| static
llama.cpp:14512:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14512 | break;
| ^~~~~
llama.cpp:14487:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14487 | break;
| ^~~~~
llama.cpp:14522:13: warning: unused variable 'count' [-Wunused-variable]
14522 | int32_t count;
| ^~~~~
llama.cpp:14517:5: warning: no previous prototype for function 'llama_process_mpi_worker' [-Wmissing-prototypes]
14517 | int llama_process_mpi_worker(
| ^
llama.cpp:14517:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
14517 | int llama_process_mpi_worker(
| ^
| static
llama.cpp:14532:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14532 | break;
| ^~~~~
llama.cpp:14537:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14537 | break;
| ^~~~~
llama.cpp:14550:13: warning: 'break' will never be executed [-Wunreachable-code-break]
14550 | break;
| ^~~~~
2 warnings generated.
13 warnings generated.
Interesting... At the moment the wrapper caches are the only parts using maps, but those caches are not actually being used. I'll add the include in my next push if I end up needing the caches; otherwise I'll remove them, and therefore the dependency on <map>.
@AutonomicPerfectionist, may I ask, is it expected to get a crash with the MPS backend when the -ngl parameter is set > 1?
Yeah, that's pretty much expected; I haven't tested with anything but CPU as the wrapped backend, so Metal, CUDA, Vulkan, and SYCL I expect to all crash horrifically. The plan is to fix that so you can run with any backend. If you can, could you put all the details of what you tried here so I can fix it? I only have access to NVIDIA and AMD GPUs, so if you have a Mac with Metal it would be a big help to provide any error messages. Also try to build with debug mode on, and in the case of a SEGFAULT you can either use a debugger to investigate further or decompile.
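Not from this PR, but one way to get a backtrace from each rank, assuming the Makefile's LLAMA_DEBUG flag and an Open MPI install; the model path and generation options here are placeholders:

make clean
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 LLAMA_DEBUG=1 -j
mpirun -np 2 lldb --batch -o run -o bt -- ./main -m model.gguf -p "hello" -n 16

Output from the ranks will interleave on the terminal, but each crashing rank should still print a usable backtrace.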
Yes, of course. I will inform you when I get a chance to test it. Probably this afternoon.

Edit: Sorry for my late response, @AutonomicPerfectionist.

$ mpirun -np 1 --hostfile ~/.config/llama.cpp/hostfile /projects/open-source/llama.cpp.mpi.latest/main -m ~/Desktop/model.gguf -b 256 -ngl 1 -t 8 -tb 8 -c 500 --temp 0.0 -n 100 -p "[INST] Write code to solve the following coding problem that obeys the constraints and passes the example test cases. Please wrap your code answer with \`\`\`:
Write a purely functional Haskell code for fibonacci:\n\n
[/INST]
" &> mpi.log

mpi.log (attached)
Overview
This PR adds a new example and adds functionality to the MPI backend to support per-node options. The new example was created to keep MPI-specific enhancements and workarounds separate from the main codebase as much as possible, and is based on the main example. There are several new functions in the MPI backend, one in the llama API, and one new command-line argument.

Major Changes
MPI Example
The major difference between the MPI example and the main example currently is that the mpi example reads in options from a file instead of from the command line. This is done using the wordexp functions available in POSIX.1-2001 compliant systems.
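A minimal illustration of the wordexp() approach, not the example's actual code; the option string and flag values below are placeholders:

// Turn one line of an options file into argv-style words using POSIX wordexp().
#include <cstdio>
#include <wordexp.h>

int main(void) {
    const char * line = "-m model.gguf -c 512 --mpi-layer-split 0.5,0.3,0.2";
    wordexp_t exp;
    if (wordexp(line, &exp, 0) != 0) {
        return 1;   // expansion error
    }
    for (size_t i = 0; i < exp.we_wordc; ++i) {
        printf("argv[%zu] = %s\n", i, exp.we_wordv[i]);   // shell-like word splitting and expansion
    }
    wordfree(&exp);
    return 0;
}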
Llama API Additions

The mpi example also calls newly created llama functions pertaining to the MPI backend. Currently, there is one such function: llama_split_layers_weighted(). This function takes in a vector of weights and splits the layers among the available compute devices (nodes in the case of MPI) according to those weights, rather than requiring direct layer counts like --n-gpu-layers. This function was added primarily as a timesaver, to prevent needing to calculate the layer counts manually when changing models or when swapping more powerful nodes with less powerful ones.

The llama_split_layers_weighted() function is currently only implemented for MPI. The implementation calculates the layer ranges for each node only on the head node, and then distributes these ranges to the other nodes via an MPI_Scatter() collective operation.
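A sketch of how the head node might compute and distribute the ranges, assuming one weight per rank and a flattened start/end pair per node; this is illustrative, not the PR's implementation:

#include <mpi.h>
#include <cstdint>
#include <vector>

// Head node computes [start, end) layer ranges from the weights and scatters
// one pair to every rank (including itself).
static void scatter_layer_ranges(const std::vector<float> & weights, int n_layers,
                                 uint16_t my_range[2], MPI_Comm comm) {
    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<uint16_t> ranges;                   // start0, end0, start1, end1, ...
    if (rank == 0) {
        float total = 0.0f;
        for (float w : weights) total += w;
        uint16_t start = 0;
        for (int i = 0; i < size; ++i) {
            uint16_t end = (i == size - 1)
                ? (uint16_t) n_layers               // last node absorbs any rounding remainder
                : (uint16_t) (start + n_layers * (weights[i] / total));
            ranges.push_back(start);
            ranges.push_back(end);
            start = end;
        }
    }

    MPI_Scatter(rank == 0 ? ranges.data() : nullptr, 2, MPI_UINT16_T,
                my_range, 2, MPI_UINT16_T, 0, comm);
    // my_range now holds this node's [start, end) layer interval
}

On the command line this would presumably be driven by the new argument, something like --mpi-layer-split 0.5,0.3,0.2, though the exact syntax is defined by this PR's help text rather than by this sketch.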
MPI Backend Changes

Within the ggml-mpi backend, I added the ability to use other communicators besides MPI_COMM_WORLD. This is not yet used but will be utilized in further studies and experiments. This is in addition to the change to layer ranges described above. I also added Doxygen-style doc comments to the MPI backend header, primarily for my own use, as I tend to forget details if they are not written down.
Llama Internal Changes

Finally, some modifications were made to llama.cpp and common.cpp to work around issues. I had moved the infinite loop used by the worker nodes into the llama_eval() function, so that operations with the llama context could be done on all nodes. This caused worker nodes to enter their infinite loops early due to the model warmup in llama_init_from_gpt_params(), so that warmup is disabled in MPI mode.
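A minimal sketch of how the warmup skip could be gated; the actual PR may use a different mechanism (for example a runtime check rather than the preprocessor), so treat the details as assumptions:

// Inside llama_init_from_gpt_params(): skip the warmup decode for MPI builds
// so worker ranks don't drop into their evaluation loop prematurely.
#ifndef GGML_USE_MPI
    {
        std::vector<llama_token> tmp = { llama_token_bos(model) };   // single-token warmup batch
        llama_decode(lctx, llama_batch_get_one(tmp.data(), (int) tmp.size(), 0, 0));
        llama_kv_cache_clear(lctx);                                  // discard the warmup KV entries
    }
#endif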
Why is this a draft?

There are still tasks that must be completed before this PR is ready to merge:

- Add --mpi-layer-split to the help text
- ranges in llama_split_layers_weighted still needs to be freed

Additionally, a large change in the API is coming in #3228 that will require changes to the MPI backend. Those changes may as well be done here.
Reviewing
Please let me know of any changes desired or if there are any questions. I tried to stick to the code style I've seen in this project, but please point out any areas I missed. I believe the API additions are non-breaking, but please let me know your thoughts on them and whether I should change or remove them.