
3 GPU - Tensor split - unexpected behavior #1271

Open
schonsense opened this issue Dec 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@schonsense

This is more a description of an unintuitive experience and a disagreement between the user-facing information in the GUI and the backend.

I am running two RTX 4090s and a single 16 GB RTX 4060. Depending on how I select the GPU ID, the order of my CUDA devices changes unintuitively.

In the GUI the pulldown lists my devices as GPU ID:

1 - 4090#1
2 - 4060
3 - 4090#2

Running without attempting to manually split tensors works as expected.

A tensor split of 3,2,3 with the GPU ID set to ALL gives me an error and closes KoboldCpp. The VRAM for my cards is split 24 GB, 16 GB, 24 GB according to the GPU ID order displayed, so if I understand correctly this should just replicate the default tensor splitting.

However, that same tensor split works if I set the GPU ID to 1, and it utilizes all three of my cards.

And to top it off, if I go back and set the GPU ID to ALL again but set the tensor split to 3,3,2, it works as expected.

It appears that when the GPU ID is set to ALL, the backend re-orders the CUDA devices: ggml_cuda_init finds the devices in a different order depending on whether a specific GPU is selected or ALL. If I specify a single GPU ID, the backend recognizes and agrees with the GUI, and I can even force data onto unselected cards via the tensor split. If I select ALL, my devices get re-ordered in the backend this way:

1 - 4090#1
2 - 4090#2
3 - 4060

And any tensor split I set in the GUI applies that split to this new device order.
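The failure pattern above is consistent with the same split ratio being applied to two different device orders. A minimal sketch (hypothetical helper name `split_fits`; VRAM sizes taken from this report, model size invented for illustration) shows why 3,2,3 fits the GUI order 24/16/24 but overflows the 16 GB card in the re-ordered backend list 24/24/16, while 3,3,2 fits the re-ordered list:

```python
# Sketch: how a tensor-split ratio distributes a model across an ordered
# device list, and whether each share fits that device's VRAM.
# Device VRAM values are from this issue; the 60 GB model size is hypothetical.

def split_fits(split, vram_gb, model_gb):
    """Return True if dividing model_gb by `split` fits every device's VRAM."""
    total = sum(split)
    shares = [model_gb * s / total for s in split]
    return all(share <= vram for share, vram in zip(shares, vram_gb))

gui_order     = [24, 16, 24]  # 4090, 4060, 4090 (order shown in the GUI pulldown)
backend_order = [24, 24, 16]  # 4090, 4090, 4060 (re-ordered when GPU ID = ALL)
model_gb = 60                 # hypothetical model that needs all three cards

print(split_fits([3, 2, 3], gui_order, model_gb))      # True: matches GUI order
print(split_fits([3, 2, 3], backend_order, model_gb))  # False: 22.5 GB lands on the 16 GB card
print(split_fits([3, 3, 2], backend_order, model_gb))  # True: matches re-ordered list
```

This is only a model of the symptom, not KoboldCpp's actual allocation logic, but it reproduces exactly which splits crash and which succeed in the report above.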

@LostRuins
Owner

The GPU ID sent to the backend is what the "main gpu" will be set to; that index is determined by the CUDA device order, which should be set to PCI_BUS_ID.

That is the order the devices appear in nvidia-smi.
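For context, CUDA's default enumeration order is FASTEST_FIRST, which would put the two 4090s ahead of the 4060, exactly the re-ordering described above. The documented `CUDA_DEVICE_ORDER` environment variable pins enumeration to PCI bus order so it matches nvidia-smi. Whether KoboldCpp sets it on every code path is the open question here; this is just a sketch for checking it manually before launch:

```shell
# Pin CUDA device enumeration to PCI bus order (the order nvidia-smi uses).
# Without this, CUDA defaults to FASTEST_FIRST, which sorts fastest cards first.
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # Windows cmd equivalent: set CUDA_DEVICE_ORDER=PCI_BUS_ID
echo "$CUDA_DEVICE_ORDER"
```

Launch KoboldCpp from the same shell afterwards so the variable is inherited by the backend process.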

Can you please run nvidia-smi and show the output and order it lists the devices?

@schonsense
Author

C:\Users\markm\Desktop\Utilities>nvidia-smi
Tue Dec 17 10:59:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36                 Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:41:00.0 Off |                  Off |
| 30%   26C    P8              6W /  472W |   23498MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti   WDDM  |   00000000:81:00.0  On |                  N/A |
| 33%   28C    P8              7W /  165W |   13812MiB /  16380MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090      WDDM  |   00000000:82:00.0  On |                  Off |
| 31%   29C    P8             43W /  472W |   23572MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

There is probably a better way to copy that over, but it reflects what is seen in the GPU ID pulldown.

@LostRuins LostRuins added the bug Something isn't working label Dec 18, 2024
@LostRuins
Owner

Hi, can you please try the latest version 1.80 and see if this is fixed?
