
6tempcontrol: fan speeds not getting updated #219

Open
moi162001 opened this issue Oct 26, 2018 · 13 comments

@moi162001

After a few tests I noticed that it is not 6temp that does not work, but the application of the fan speed change. 6temp reads the temperature correctly and raises the speed, but the change is not applied.

@moi162001 moi162001 changed the title 6temcontrol 6tempcontrol Oct 26, 2018
@LuKePicci LuKePicci changed the title 6tempcontrol 6tempcontrol: fan speeds not getting updated Oct 26, 2018
@LuKePicci
Collaborator

I never managed to reproduce this on Ubuntu 16.04, but from some details shared by @papampi it was quite clear the issue is on the nvidia applets side. I'm not sure we can do anything to reliably handle this situation without a fix from nvidia.

@moi162001
Author

moi162001 commented Oct 26, 2018

For me the problem is present in remote mode, but is it also present in local mode?

I propose exploring the following avenues to see if the situation improves:

```shell
nvidia-smi --gpu-reset -i 0
sudo nvidia-smi -r
kill -9 $(nvidia-smi | sed -n 's/|\s*[0-9]\s\s*\([0-9]\+\)\s.*/\1/p' | sort | uniq | sed '/^$/d')
```

The `sed -n 's/|\s*[0-9]\s\s*\([0-9]\+\)\s.*/\1/p'` part finds the PIDs, `sort | uniq` keeps only the unique ones, and the final `sed '/^$/d'` drops empty lines; then follow up with a restart of gdm.

It's just a suggestion.
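As a sanity check of that pipeline, here is a minimal sketch running the same `sed`/`sort`/`uniq` chain over a fabricated nvidia-smi process table (the PIDs and process paths below are made up, and GNU sed is assumed for the `\s` escapes):

```shell
# Two fabricated lines in the shape of nvidia-smi's "Processes" table.
sample='|    0      4321      C   /usr/bin/ethminer       120MiB |
|    0      8765      C   /usr/bin/ethminer       120MiB |'

# Same extraction chain as suggested above: pull the PID column,
# deduplicate, and drop empty lines.
printf '%s\n' "$sample" \
  | sed -n 's/|\s*[0-9]\s\s*\([0-9]\+\)\s.*/\1/p' \
  | sort | uniq | sed '/^$/d'
# prints 4321 and 8765, one per line
```

Whether killing those PIDs actually unsticks the fan-speed update is exactly what would need testing on an affected rig.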

@LuKePicci
Collaborator

Are there any hung nvidia-smi or nvidia-settings instances running when the issue occurs?

@moi162001
Author

moi162001 commented Oct 26, 2018

nvidia-persistenced only

@LuKePicci
Collaborator

LuKePicci commented Oct 26, 2018

No, the persistence daemon is not involved here; it just keeps a device file open to keep the GPUs "in use" so that the driver module is not unloaded. In nvOC, to allow overclocking we have to run xorg on each GPU, so they are already kept in use by xorg, and this is enough to keep the drivers loaded.

@moi162001
Author

```
GPU 0, Tue Oct 30 18:07:24 CET 2018 - Adjusting fan from: 30 to: 35 Temp: 67
```

but nvidia-smi shows:

```
| NVIDIA-SMI 410.73       Driver Version: 410.73       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...    On | 00000000:03:00.0 Off |                  N/A |
| 30%   67C    P2    90W /  90W |    144MiB /  3019MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
```

@LuKePicci
Collaborator

6tempcontrol has been reworked in the upcoming 3.1-dev branch. Nothing was changed to specifically address this issue, but if you manage to confirm it still exists, it should be easier to debug in case it really depends on nvOC and is not driver-related.

@nvOC-Stubo
Collaborator

moi162001:

Try this change to 6tempcontrol. Change this line:

```shell
sudo ${NVD} ${NVD_SETTINGS} >/dev/null 2>&1
```

to:

```shell
${NVD} ${NVD_SETTINGS} >/dev/null 2>&1
```

@LuKePicci
Collaborator

This issue seems to be solved with updated miners. Basically, some buggy miner versions prevented nvidia-settings from updating fan speeds regularly.

@nvOC-Stubo
Collaborator

Hey LuKePicci:

No, this is definitely an issue with 6tempcontrol after CUDA is updated to 10 or later. When running nvidia-settings as root (with sudo), you will get this error:

```
No protocol specified
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run nvidia-settings --help
for usage information.
```

The easiest fix is to just remove the sudo. The reason for that error is that XAUTHORITY is not set correctly for it to work as root. The fix for that would be to set XAUTHORITY correctly, like:

```shell
export XAUTHORITY=$(ps a |grep X|grep -v grep|tr -s " "|cut -d " " -f 11)
```

But I do not recommend that, because it is easier to just remove the sudo.
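For reference, a slightly more defensive variant of that lookup can be sketched: instead of cutting a fixed field from `ps a`, parse the argument that follows `-auth` on the X server's command line. The helper name and the sample Xorg command line below are illustrative assumptions, not taken from nvOC:

```shell
# Hypothetical helper: extract the value following "-auth" from an X
# server command line fed on stdin.
extract_xauth() {
  sed -n 's/.*-auth \([^ ]*\).*/\1/p'
}

# Example command line in the shape `pgrep -a Xorg` would print (made up):
echo '1234 /usr/lib/xorg/Xorg :0 -auth /var/run/lightdm/root/:0 -nolisten tcp' \
  | extract_xauth
# prints: /var/run/lightdm/root/:0
```

On a live rig this would become something like `export XAUTHORITY=$(pgrep -a Xorg | extract_xauth | head -n1)`, but, as noted, simply removing the sudo sidesteps the problem entirely.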

@LuKePicci
Collaborator

LuKePicci commented Jun 14, 2019

I implemented the same XAUTHORITY fix for nvOC as a systemd service almost a year ago, but here nvidia-settings was called with sudo for some reason I cannot remember. However, in my case, with no issues with the xorg cookies, nvidia-settings would still hang without applying new fan speeds, only with some miners and only under certain load conditions. What is the correlation between CUDA 10 and nvidia-settings permissions?

@nvOC-Stubo
Collaborator

I started with the old Ubuntu 18.04 image linked here and updated to Cuda 10 so, in reality, the issue could be with the Ubuntu 18.04 image and not specific to Cuda 10.

What I did to find the source of the problem was to modify tempcontrol so that the output of `sudo ${NVD} ${NVD_SETTINGS} ....` was no longer being sent to null and was written to a log instead. Once I saw the error (as per my last post), I started to troubleshoot and correct the problem. Consider what happens here on one of my rigs:

```
m1@Miner1: > lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
m1@Miner1: > export DISPLAY=:0
m1@Miner1: > nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeedRPM
1829
m1@Miner1: > sudo nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeedRPM
No protocol specified
Unable to init server: Could not connect: Connection refused

ERROR: The control display is undefined; please run nvidia-settings --help for usage information.

m1@Miner1: > sudo DISPLAY=:0 XAUTHORITY=$(ps a |grep X|grep -v grep|tr -s " "|cut -d " " -f 11) nvidia-settings -t -q [fan:0]/GPUCurrentFanSpeedRPM
1832
```
So, you can see that the easiest fix for me was to just remove the sudo. This may or may not be the proper fix nvOC-wide, as you have other scenarios and Ubuntu versions to consider.
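The logging change described above can be sketched as follows; `echo` stands in for the real nvidia-settings wrapper, and the log path and settings string are placeholders, since the actual values live inside 6tempcontrol:

```shell
# Debugging sketch: capture nvidia-settings output in a log instead of
# discarding it. NVD and NVD_SETTINGS are placeholders for the real values.
LOG=$(mktemp)                                   # hypothetical log location
NVD=echo                                        # stand-in for nvidia-settings
NVD_SETTINGS='-a [fan:0]/GPUTargetFanSpeed=50'  # illustrative settings string
${NVD} ${NVD_SETTINGS} >>"$LOG" 2>&1            # was: >/dev/null 2>&1
cat "$LOG"
```

With the real command in place, the `No protocol specified` error shown earlier surfaces in the log instead of vanishing into `/dev/null`.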

@LuKePicci
Collaborator

OK, so this was definitely a different issue from the one I had here on the old 16.04; I never tried running on the 18.04 base OS. In my case nvidia-settings did not print any error, it was simply stuck with some miners.
