6tempcontrol: fan speeds not getting updated #219
Comments
I never managed to reproduce this on Ubuntu 16.04, but from some details shared by @papampi it was quite clear that the issue is on the nvidia applets' side. I'm not sure we can do anything to reliably handle this situation without a fix from NVIDIA.
For me the problem is present in remote mode, but is it also present in local mode? I propose trying the following to see if the situation improves: nvidia-smi --gpu-reset -i 0. It's just a suggestion.
Are there any hung nvidia-smi or nvidia-settings instances running when the issue occurs?
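One way to answer this on a live rig is to list nvidia-settings and nvidia-smi processes that have been alive far longer than a single query should take. The sketch below is a hypothetical bash helper (`stale_nvidia_pids` is not part of nvOC); it assumes procps `ps` with the `etimes` (elapsed seconds) column and an arbitrary 60-second threshold.

```bash
# Hypothetical helper, not part of nvOC.
# Reads "pid etimes comm" lines on stdin, i.e. the output of:
#   ps -eo pid=,etimes=,comm=
# and prints the PIDs of nvidia-settings / nvidia-smi processes that have
# been running longer than max_age seconds (default 60, an assumption).
stale_nvidia_pids() {
  local max_age=${1:-60} pid etimes comm
  while read -r pid etimes comm; do
    case $comm in
      nvidia-settings|nvidia-smi)
        if (( etimes > max_age )); then
          printf '%s\n' "$pid"
        fi
        ;;
    esac
  done
}
```

On a rig this would be run as `ps -eo pid=,etimes=,comm= | stale_nvidia_pids 60`; any PID it prints is a candidate hung instance. Matching on `comm` works here because `nvidia-settings` is exactly 15 characters, the length at which Linux truncates process names.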
nvidia-persistenced only.
No, the persistence daemon is not involved here. It just keeps a device file open so the GPUs stay "in use" and the driver module is not unloaded. In nvOC, to allow overclocking we have to run Xorg on each GPU, so they are already kept in use by Xorg, and this is enough to keep the drivers loaded.
GPU 0, Tue Oct 30 18:07:24 CET 2018 - Adjusting fan from: 30 to: 35 Temp: 67
6tempcontrol has been reworked in the upcoming 3.1-dev branch. Nothing has been changed specifically to address this issue, but if you manage to confirm that it still exists there, it should be easier to debug, in case it really depends on nvOC and is not driver-related.
moi162001: Try this change to 6tempcontrol. Change this line:
This issue seems to be solved with updated miners. Basically, some buggy miner versions prevented nvidia-settings from updating fan speeds regularly.
Hey LuKePicci: No, this is definitely an issue with 6tempcontrol after CUDA is updated to 10 or later. When running nvidia-settings as root (with sudo), you will get this error:
ERROR: The control display is undefined; please run
The easiest fix is to just remove the sudo. The reason for that error is that XAUTHORITY is not set correctly for nvidia-settings to work as root. The fix for that would be to set XAUTHORITY correctly, like:
export XAUTHORITY=$(ps a |grep X|grep -v grep|tr -s " "|cut -d " " -f 11)
I did not recommend that, though, because it is easier to just remove the sudo.
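The `cut -d " " -f 11` pipeline depends on the auth file sitting at a fixed column of the `ps a` output, which can shift with the display manager and the Xorg arguments. A slightly sturdier sketch, assuming the Xorg command line contains `-auth <file>` and the path has no spaces, looks the path up by flag instead of by column. `xauth_from_args` and `xorg_xauthority` are hypothetical names, not nvOC functions, and `xorg_xauthority` simply takes the first process named `Xorg`.

```bash
# Hypothetical helper: print the word that follows "-auth" in one Xorg
# command line read from stdin. Returns 1 if there is no "-auth <file>".
# Assumes the auth file path contains no whitespace.
xauth_from_args() {
  local -a args
  local i
  read -r -a args   # split the command line into words
  for ((i = 0; i < ${#args[@]} - 1; i++)); do
    if [ "${args[i]}" = "-auth" ]; then
      printf '%s\n' "${args[i+1]}"
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: look up the auth file of the first running Xorg.
xorg_xauthority() {
  ps -C Xorg -o args= | head -n 1 | xauth_from_args
}
```

Used the same way as the command in the comment above: `sudo DISPLAY=:0 XAUTHORITY="$(xorg_xauthority)" nvidia-settings -t -q "[fan:0]/GPUCurrentFanSpeedRPM"`. The `$(...)` is expanded by the calling shell before sudo runs, and quoting the attribute keeps the brackets from being treated as a glob.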
I implemented the same XAUTHORITY fix for nvOC running as a systemd service almost a year ago, but here nvidia-settings was called with sudo for some reason I can't remember. However, in my case, with no issues with Xorg cookies, nvidia-settings was still hanging without applying new fan speeds, only on some miners and only under certain load conditions. What is the correlation between CUDA 10 and nvidia-settings permissions?
I started with the old Ubuntu 18.04 image linked here and updated to CUDA 10, so in reality the issue could be with the Ubuntu 18.04 image and not specific to CUDA 10. To find the source of the problem, I modified tempcontrol so that the output of "sudo ${NVD} ${NVD_SETTINGS} ...." was no longer sent to /dev/null but written to a log. Once I saw the error (as per my last post), I started to troubleshoot and correct the problem. Consider what happens here on one of my rigs:
m1@Miner1: > lsb_release -a ERROR: The control display is undefined; please run
m1@Miner1: > sudo DISPLAY=:0 XAUTHORITY=$(ps a |grep X|grep -v grep|tr -s " "|cut -d " " -f 11) nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeedRPM
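Capturing the output instead of discarding it is what surfaced the error here. A minimal sketch of that idea, assuming bash: a hypothetical `run_logged` wrapper that runs a command, appends a timestamped record of its exit status and combined stdout/stderr to a log file, and still hands the output and status back to the caller. The log path and the wrapper name are assumptions, not part of nvOC.

```bash
# Hypothetical log location; override by setting NVS_LOG beforehand.
NVS_LOG=${NVS_LOG:-/tmp/6tempcontrol-nvidia-settings.log}

# Run "$@", log its exit status and combined stdout/stderr, then echo the
# output back (stderr merged in) and return the command's exit status.
run_logged() {
  local out status
  out=$("$@" 2>&1)
  status=$?
  printf '%s [exit %d] %s\n%s\n' "$(date '+%F %T')" "$status" "$*" "$out" >> "$NVS_LOG"
  printf '%s\n' "$out"
  return "$status"
}
```

For example, `run_logged sudo nvidia-settings -t -q "[fan:0]/GPUCurrentFanSpeedRPM"` would leave the "control display is undefined" message and a non-zero exit status in the log instead of losing them in /dev/null.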
OK, so this was definitely a different issue from the one I had on the old 16.04. I never tried running on the 18.04 base OS. In my case, nvidia-settings did not print any error; it simply hung with some miners.
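When nvidia-settings hangs silently like this, a deadline at least keeps the rest of the control loop running and leaves a trace of which call stalled. A minimal sketch, assuming GNU coreutils `timeout`, which exits with status 124 when the time limit is reached; `run_with_deadline` and the 10-second default are hypothetical.

```bash
# Hypothetical default deadline in seconds; override via NVS_TIMEOUT.
NVS_TIMEOUT=${NVS_TIMEOUT:-10}

# Run "$@" with a time limit so one stuck call cannot stall the whole loop.
# Reports on stderr when the limit was hit and returns the command's status
# (124 on timeout, per GNU timeout).
run_with_deadline() {
  local status=0
  timeout "$NVS_TIMEOUT" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "timed out after ${NVS_TIMEOUT}s: $*" >&2
  fi
  return "$status"
}
```

For example, `run_with_deadline nvidia-settings -t -q "[fan:0]/GPUCurrentFanSpeed"`. `timeout` stops the command with SIGTERM; a process blocked inside the driver may not exit even then, so this detects the hang rather than curing it.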
After a few tests, I noticed that it is not 6temp itself that fails but the application of the fan speed change: 6temp reads the temperature correctly and raises the target speed correctly, but the new speed is never applied.
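Since the symptom is that the requested speed is never applied, reading the speed back after each change would make the failure visible in the logs. A sketch under stated assumptions: one fan per GPU with matching indexes, the `GPUCurrentFanSpeed` attribute reporting a percentage, and a 5-point tolerance. `fan_speed_now` and `fan_applied` are hypothetical helpers, not nvOC code.

```bash
# Hypothetical helper: current fan speed (percent) for fan index $1.
fan_speed_now() {
  nvidia-settings -t -q "[fan:$1]/GPUCurrentFanSpeed"
}

# Hypothetical helper: succeed if actual ($2) is within tol ($3, default 5)
# percentage points of target ($1). Returns 2 if actual is not a number,
# so a failed or garbled query never counts as "applied".
fan_applied() {
  local target=$1 actual=$2 tol=${3:-5}
  [[ $actual =~ ^[0-9]+$ ]] || return 2
  local diff=$(( actual - target ))
  (( diff < 0 )) && diff=$(( -diff ))
  (( diff <= tol ))
}
```

After giving the fan a few seconds to spin up, a control loop could run `fan_applied 35 "$(fan_speed_now 0)" || echo "GPU 0: fan target 35% not applied" >&2` to flag exactly the case described above.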