
3 GPU - Tensor split - unexpected behavior #1271

Open
schonsense opened this issue Dec 17, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@schonsense

This is more a description of an unintuitive experience and a disagreement between the user-facing information in the GUI and the backend.

I am running two RTX 4090s and a single 16 GB RTX 4060. Depending on how I select the GPU ID, the order of my CUDA devices changes unintuitively.

In the GUI the pulldown lists my devices as GPU ID:

1 - 4090#1
2 - 4060
3 - 4090#2

Running without attempting to manually split tensors works as expected.

A tensor split of 3,2,3 with the GPU ID set to ALL gives me an error and closes KoboldCpp. The VRAM for my cards is split 24 GB, 16 GB, 24 GB according to the GPU ID order displayed, so if I understand correctly this should just replicate the default tensor splitting.

However, that same tensor split works if I set the GPU ID to 1, and it utilizes all three of my cards.

And to top it off, if I go back and set the GPU ID to ALL again but set the tensor split to 3,3,2, it works as expected.

It appears that when the GPU ID is set to ALL, the backend re-orders the CUDA devices: ggml_cuda_init finds the devices in a different order depending on whether a specific GPU is selected or ALL. If I specify a single GPU ID, the backend recognizes and agrees with the GUI, and I can even force data onto unselected cards via the tensor split. If I select ALL, my devices get re-ordered in the backend this way:

1 - 4090#1
2 - 4090#2
3 - 4060

And any tensor split I set in the GUI applies that split to this new device order.
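The failure pattern above is consistent with the same split ratio being applied to two different device orders. A minimal sketch (hypothetical helper name `split_fits`; VRAM sizes taken from this report, model size invented for illustration) shows why 3,2,3 fits the GUI order 24/16/24 but overflows the 16 GB card in the re-ordered backend list 24/24/16, while 3,3,2 fits the re-ordered list:

```python
# Sketch: how a tensor-split ratio distributes a model across an ordered
# device list, and whether each share fits that device's VRAM.
# Device VRAM values are from this issue; the 60 GB model size is hypothetical.

def split_fits(split, vram_gb, model_gb):
    """Return True if dividing model_gb by `split` fits every device's VRAM."""
    total = sum(split)
    shares = [model_gb * s / total for s in split]
    return all(share <= vram for share, vram in zip(shares, vram_gb))

gui_order     = [24, 16, 24]  # 4090, 4060, 4090 (order shown in the GUI pulldown)
backend_order = [24, 24, 16]  # 4090, 4090, 4060 (re-ordered when GPU ID = ALL)
model_gb = 60                 # hypothetical model that needs all three cards

print(split_fits([3, 2, 3], gui_order, model_gb))      # True: matches GUI order
print(split_fits([3, 2, 3], backend_order, model_gb))  # False: 22.5 GB lands on the 16 GB card
print(split_fits([3, 3, 2], backend_order, model_gb))  # True: matches re-ordered list
```

This is only a model of the symptom, not KoboldCpp's actual allocation logic, but it reproduces exactly which splits crash and which succeed in the report above.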

@LostRuins
Owner

The GPU ID sent to the backend is what the "main gpu" will be set to; that index is determined by the CUDA device order, which should be set to PCI_BUS_ID.

That is the order the devices appear in nvidia-smi.
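For context, CUDA's default enumeration order is FASTEST_FIRST, which would put the two 4090s ahead of the 4060, exactly the re-ordering described above. The documented `CUDA_DEVICE_ORDER` environment variable pins enumeration to PCI bus order so it matches nvidia-smi. Whether KoboldCpp sets it on every code path is the open question here; this is just a sketch for checking it manually before launch:

```shell
# Pin CUDA device enumeration to PCI bus order (the order nvidia-smi uses).
# Without this, CUDA defaults to FASTEST_FIRST, which sorts fastest cards first.
export CUDA_DEVICE_ORDER=PCI_BUS_ID   # Windows cmd equivalent: set CUDA_DEVICE_ORDER=PCI_BUS_ID
echo "$CUDA_DEVICE_ORDER"
```

Launch KoboldCpp from the same shell afterwards so the variable is inherited by the backend process.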

Can you please run nvidia-smi and show the output and order it lists the devices?

@schonsense
Author

C:\Users\markm\Desktop\Utilities>nvidia-smi
Tue Dec 17 10:59:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.36                 Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:41:00.0 Off |                  Off |
| 30%   26C    P8              6W /  472W |   23498MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4060 Ti   WDDM  |   00000000:81:00.0  On |                  N/A |
| 33%   28C    P8              7W /  165W |   13812MiB /  16380MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090      WDDM  |   00000000:82:00.0  On |                  Off |
| 31%   29C    P8             43W /  472W |   23572MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

There is probably a better way to copy that over, but it reflects what is seen in the GPU ID pulldown.

@LostRuins LostRuins added the bug Something isn't working label Dec 18, 2024
@LostRuins
Owner

Hi, can you please try the latest version 1.80 and see if this is fixed?
