auto-pl #1

Open
Nebemires opened this issue Aug 12, 2020 · 16 comments
@Nebemires

Running on the latest stable HiveOS version. To install, I had to create the /etc/sonm/ directory, because it does not exist in HiveOS. After installation everything works, but after a reboot the process runs without taking any action. Also, after a reboot MAX_PL is not applied.

@avsigaev
Owner

@Nebemires thanks for the feedback.
Could you show the logs of the service after reboot?
journalctl -u auto-pl.service

@avsigaev avsigaev self-assigned this Aug 12, 2020
@Nebemires
Author

Nebemires commented Aug 13, 2020

Sorry for the delay :)
https://ibb.co/Dtms81D
https://ibb.co/XSSxRZW

@avsigaev
Owner

avsigaev commented Aug 13, 2020

@Nebemires thanks. And what is the MAX_TEMP value in the config file?

UPD: And please show me the first log entries after the reboot:
journalctl -u auto-pl -b | less

@Nebemires
Author

-- Logs begin at Thu 2020-08-13 23:51:59 EEST, end at Thu 2020-08-13 23:52:32 EEST. --
Aug 13 23:51:59 AGA-AN3 systemd[1]: Starting Auto-PL script for NVIDIA GPUs (see /etc/sonm/auto-pl.cfg for options)...
Aug 13 23:52:09 AGA-AN3 systemd[1]: Started Auto-PL script for NVIDIA GPUs (see /etc/sonm/auto-pl.cfg for options).
Aug 13 23:52:09 AGA-AN3 auto-pl[859]: Auto-PL service, version 0.2
Aug 13 23:52:09 AGA-AN3 auto-pl[859]: Hive auto-fan DISABLED, using MAX_TEMP from /etc/sonm/auto-pl.cfg
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Found 7 GPU(s)

@avsigaev
Owner

@Nebemires I need more logs ))
The first 50 entries will be enough. You may post them on pastebin.com or any similar service.
Also, you didn't say what MAX_TEMP value you've set in the config.

@Nebemires
Author

-- Logs begin at Thu 2020-08-13 23:51:59 EEST, end at Thu 2020-08-13 23:52:32 EEST. --
Aug 13 23:51:59 AGA-AN3 systemd[1]: Starting Auto-PL script for NVIDIA GPUs (see /etc/sonm/auto-pl.cfg for options)...
Aug 13 23:52:09 AGA-AN3 systemd[1]: Started Auto-PL script for NVIDIA GPUs (see /etc/sonm/auto-pl.cfg for options).
Aug 13 23:52:09 AGA-AN3 auto-pl[859]: Auto-PL service, version 0.2
Aug 13 23:52:09 AGA-AN3 auto-pl[859]: Hive auto-fan DISABLED, using MAX_TEMP from /etc/sonm/auto-pl.cfg
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Found 7 GPU(s)
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: PL MANAGEMENT: 1 (1-on, 0-off), Max 72°C, MaxPL 76%, MinPL 50%; PL change step 5W; Check every 5 sec
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL management is ENABLED in config
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Enabling persistence mode..
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU0 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU0: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU1 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU1: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU2 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU2: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU3 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU3: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU4 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU4: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU5 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU5: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Default (driver settings) PL for GPU6 is 250
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: PL range for GPU6: 125 - 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU0 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU1 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU2 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU3 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU4 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU5 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: Initial PL for GPU6 adjusted to 190
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU0 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU1 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU2 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU3 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU4 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU5 checks: normal 1, high 0
Aug 13 23:52:16 AGA-AN3 auto-pl[859]: INFO: GPU6 checks: normal 1, high 0
Aug 13 23:52:21 AGA-AN3 auto-pl[859]: Hive auto-fan DISABLED, using MAX_TEMP from /etc/sonm/auto-pl.cfg
Aug 13 23:52:21 AGA-AN3 auto-pl[859]: INFO: GPU0 checks: normal 2, high 0
Aug 13 23:52:21 AGA-AN3 auto-pl[859]: INFO: GPU1 checks: normal 2, high 0
Aug 13 23:52:21 AGA-AN3 auto-pl[859]: INFO: GPU2 checks: normal 2, high 0
Aug 13 23:52:21 AGA-AN3 auto-pl[859]: INFO: GPU3 checks: normal 2, high 0
Aug 13 23:52:22 AGA-AN3 auto-pl[859]: INFO: GPU4 checks: normal 2, high 0
Aug 13 23:52:22 AGA-AN3 auto-pl[859]: INFO: GPU5 checks: normal 2, high 0
Aug 13 23:52:22 AGA-AN3 auto-pl[859]: INFO: GPU6 checks: normal 2, high 0
Aug 13 23:52:27 AGA-AN3 auto-pl[859]: Hive auto-fan DISABLED, using MAX_TEMP from /etc/sonm/auto-pl.cfg

@Nebemires
Author

Nebemires commented Aug 13, 2020

## Delay - time period between checks
DELAY=5


## PL MANAGEMENT SETTINGS


MAX_TEMP=72


# Set MANAGE_PL to 1 to enable flexible PL management, and 0 to disable
MANAGE_PL=1


# Max PL, % of drivers' default value for particular GPU
MAX_PL=76


# Min PL, % of drivers' default value for particular GPU
MIN_PL=50


# PL change step, watts
PL_CHANGE_STEP=5
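
The percentages above line up with the log lines "Default (driver settings) PL for GPU0 is 250" and "PL range for GPU0: 125 - 190". A minimal sketch of that conversion, assuming plain integer arithmetic (whether auto-pl truncates or rounds is not shown in this thread, and `pl_range` is a hypothetical helper name, not a function from the script):

```bash
#!/bin/sh
# Sketch only: convert MIN_PL / MAX_PL percentages into watt limits.
# pl_range is a hypothetical helper, not taken from auto-pl itself.
pl_range() (
    default_pl=$1   # driver default PL in watts, e.g. 250
    max_pct=$2      # MAX_PL from the config, e.g. 76
    min_pct=$3      # MIN_PL from the config, e.g. 50
    # Integer arithmetic: fractions of a watt are truncated (assumption).
    echo "$(( default_pl * min_pct / 100 )) $(( default_pl * max_pct / 100 ))"
)

pl_range 250 76 50   # prints "125 190", matching the logged PL range
```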



@avsigaev
Owner

avsigaev commented Aug 13, 2020

@Nebemires nice.
Well, this is not an error.
I've added a threshold for PL adjustment of +/- 2 degrees around MAX_TEMP. This means the service will decrease PL only when two conditions are met:

  • more than 2 checks registered a high temperature
  • the current temperature is above MAX_TEMP+2

PL will increase again if the temperature drops below MAX_TEMP-2.

I did this because I don't want to adjust PL too frequently; some miners don't like that.

According to your logs, you had 73 degrees on GPU6, which is less than MAX_TEMP+2 (72+2 = 74).
In other words, you may set MAX_TEMP to 70, and PL will start decreasing once your GPU reaches 73 degrees (i.e. > 70+2).
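
The threshold rule described in this comment can be sketched roughly as follows. This is an illustration under assumptions, not the actual auto-pl code: `next_pl` and its argument order are hypothetical, the high-check counter is passed in rather than maintained, and clamping to the logged PL range is assumed.

```bash
#!/bin/sh
# Sketch only: decide the next power limit (watts) for one GPU.
# next_pl is a hypothetical helper; argument order is an assumption.
next_pl() (
    temp=$1; high_checks=$2; cur_pl=$3
    max_temp=$4; step=$5; min_w=$6; max_w=$7
    new_pl=$cur_pl
    if [ "$high_checks" -gt 2 ] && [ "$temp" -gt $(( max_temp + 2 )) ]; then
        # Both conditions met: lower PL by one step.
        new_pl=$(( cur_pl - step ))
    elif [ "$temp" -lt $(( max_temp - 2 )) ]; then
        # Comfortably below the target: raise PL by one step.
        new_pl=$(( cur_pl + step ))
    fi
    # Keep the result inside the allowed range (assumed behaviour).
    [ "$new_pl" -lt "$min_w" ] && new_pl=$min_w
    [ "$new_pl" -gt "$max_w" ] && new_pl=$max_w
    echo "$new_pl"
)

next_pl 73 3 190 72 5 125 190   # 73 is not above 72+2, PL stays at 190
next_pl 73 3 190 70 5 125 190   # 73 is above 70+2, PL drops to 185
```

Temperatures inside the band from MAX_TEMP-2 to MAX_TEMP+2 leave the limit untouched, which is what keeps the adjustments infrequent.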

@avsigaev
Owner

I will add a check for the /etc/sonm folder, and that's all.
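
Such a check could look something like the sketch below. The installer itself is not shown in this thread, so `ensure_cfg_dir` is a hypothetical helper; only the /etc/sonm path comes from the discussion.

```bash
#!/bin/sh
# Sketch only: make sure the config directory exists before copying
# auto-pl.cfg into it. ensure_cfg_dir is a hypothetical helper name.
ensure_cfg_dir() {
    cfg_dir=${1:-/etc/sonm}
    # mkdir -p also creates missing parents and succeeds if the
    # directory is already there.
    [ -d "$cfg_dir" ] || mkdir -p "$cfg_dir"
}

# In an installer run as root this would be called without arguments:
#   ensure_cfg_dir
```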

@Nebemires
Author

But why does MAX_PL not change after a reboot? It stays as set in HiveOS.

@avsigaev
Owner

Hmm. I set it on the GPU; you can see this in the logs.
You will not see it in the Hive OS UI.
But let me see the nvidia-smi output, just to check.

@Nebemires
Author

| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  P102-100            On   | 00000000:01:00.0 Off |                  N/A |
| 80%   66C    P0   174W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  P102-100            On   | 00000000:02:00.0 Off |                  N/A |
| 80%   68C    P0   174W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  P102-100            On   | 00000000:03:00.0 Off |                  N/A |
| 80%   64C    P0   177W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  P102-100            On   | 00000000:04:00.0 Off |                  N/A |
| 80%   70C    P0   173W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  P102-100            On   | 00000000:09:00.0 Off |                  N/A |
| 81%   67C    P0   173W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  P102-100            On   | 00000000:0A:00.0 Off |                  N/A |
| 80%   66C    P0   176W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  P102-100            On   | 00000000:0E:00.0 Off |                  N/A |
| 81%   71C    P0   175W / 175W |   4026MiB /  5059MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

I set it manually to 175, but I tried 160 before, and after a reboot it stayed at 160.

@avsigaev
Owner

Okay, I see 175 W, and this should be 190 W.
Maybe you've set a lower value in the Hive settings, and this overrides my script.
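
The same comparison can be done without reading the full table. The sketch below assumes `nvidia-smi` is on PATH and uses its `--query-gpu=index,power.limit` CSV output; `check_limits` is a hypothetical helper, and 190 is the value the service logged as the adjusted initial PL.

```bash
#!/bin/sh
# Sketch only: read "index, power.limit" CSV lines on stdin and report
# GPUs whose limit differs from the expected value (whole watts).
# check_limits is a hypothetical helper, not part of auto-pl.
check_limits() {
    awk -F', *' -v want="$1" '
        { got = int($2) }
        got != want {
            printf "GPU%s: limit %dW, expected %dW\n", $1, got, want
            bad = 1
        }
        END { exit bad }   # non-zero exit status if any GPU mismatched
    '
}

# On a rig:
#   nvidia-smi --query-gpu=index,power.limit --format=csv,noheader,nounits | check_limits 190
# Demo with canned input instead of a real GPU:
printf '0, 175.00\n1, 190.00\n' | check_limits 190 || echo "mismatch found"
```

A mismatch like the 175 W seen here suggests that something else set the limit after the service did.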

@Nebemires
Author

> Okay, I see 175 W, and this should be 190 W.
> Maybe you've set a lower value in the Hive settings, and this overrides my script.

Yes... :)
Should I set the same value as in the HiveOS settings?

@avsigaev
Owner

avsigaev commented Aug 13, 2020

I think you should disable the PL setting in Hive; otherwise I guess it may conflict with the script's settings.
Try disabling it and then reboot. Let's see.

@Nebemires
Author

Will try, thanks.
