fix: avoid nvidia-smi query if the process will get killed #943
base: main
Conversation
Avoid the GPU query whenever running in a SLURM environment without a GPU. If the node doesn't have NVIDIA drivers, this query will fail and the process will get killed.
Fixes #854
Thanks @PabloNA97 - I'm going to put a block on this because we need to discuss this a bit on our end. My initial take is that I'm not sure that this is the solution to the deeper issue of "why are we even running this if there's no GPU".
I'm also not sure I understand why SLURM behaves differently here; having a special case like this shouldn't be necessary.
Ideally we should only run this if a user has requested a GPU.
- 'SLURM_JOB_GPUS'
- 'SLURM_GPUS'
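For context, here is a minimal sketch of the kind of gate this diff appears to introduce. The function name, the exact environment-variable list, and the nvidia-smi query string are illustrative assumptions, not the actual openfe code:

```python
import os
import shutil
import subprocess


def _probe_gpu_info():
    """Hypothetical sketch: skip the nvidia-smi query when running under
    SLURM without any GPUs allocated to the job."""
    slurm_gpu_vars = ("SLURM_JOB_GPUS", "SLURM_GPUS")
    in_slurm = "SLURM_JOB_ID" in os.environ
    gpu_requested = any(os.environ.get(var) for var in slurm_gpu_vars)

    if in_slurm and not gpu_requested:
        # No GPU was allocated to this SLURM job; the node may not even
        # have NVIDIA drivers, so calling nvidia-smi could fail hard.
        return None

    if shutil.which("nvidia-smi") is None:
        # nvidia-smi is not on PATH, so there is nothing to query.
        return None

    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout if result.returncode == 0 else None
```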
I'm afraid these might be very specific to some deployments of SLURM (at least I know in the clusters I've managed in the past we didn't have these env variables set).
Thanks for the feedback, appreciate it! :D
I applied these changes on my end to test it with the CPU. But if using the CPU for production is a bad idea in itself, then it isn't worth merging into the main branch.
Not sure this is mentioned in the openfe docs - maybe it would be worth including a small note about it if it's not.
Codecov Report
Attention: Patch coverage is …

@@            Coverage Diff             @@
##             main     #943      +/-   ##
==========================================
- Coverage   94.59%   91.53%   -3.06%
==========================================
  Files         134      134
  Lines        9934     9952      +18
==========================================
- Hits         9397     9110     -287
- Misses        537      842     +305

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
I've made a comment on the issue, but just to add to this PR: the heuristic of "this is SLURM so we don't want to run nvidia-smi" is a bad one, since many people are using this software on SLURM systems and don't have this issue. I'd rather we just improve our try/except around running the command, rather than trying to decide whether we should run it.
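A hedged sketch of what "improve our try/except" might look like; the helper name, query flags, and timeout are assumptions for illustration, not the actual openfe implementation:

```python
import subprocess


def _safe_nvidia_smi_query():
    """Illustrative only: run nvidia-smi defensively and degrade to None
    on any failure, instead of guessing the environment up front."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except (FileNotFoundError, OSError,
            subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # A missing binary, missing drivers, or a hung/failed query all
        # mean "no usable GPU information"; report that instead of dying.
        return None
    return result.stdout.strip() or None
```

The design point of this approach is that the failure handling lives at the call site, so no SLURM-specific (or any scheduler-specific) detection is needed.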
The behaviour of the original code only changes when you (1) run on SLURM and (2) no GPU is requested. In that case I thought it didn't make sense to run nvidia-smi. But as @IAlibay explained, maybe it doesn't make sense to handle this case after all.