Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to psutils for resource utilization monitoring at runtime #11677

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

todor-ivanov
Copy link
Contributor

Fixes #11667

Status

not-tested

Description

The issue #11667 can be solved in 3 different ways. This is the third out of 3 suggested fixes.

  • The current one completely avoids running any shell commands at runtime. But instead, it relies on the fact there is already the psutils library distributed with cmssw and directly uses it to fetch the step's resource statistics inside the PerformanceMonitor.py module itself.

This approach requires some more dramatic changes to the code but also achieves:

  • code improvement
  • avoid dangerous and cumbersome shell command line execution at runtime
  • better resource logging, since we no longer need to use the standard linux %cpu and %mem mappings which are difficult to interpret at a first glance. (Those are resource utilization rations e.g. cputime/realtime ratio, expressed as a percentage), but we are now logging the much more informative system cputime and virtual memory for the process.

This would require more detailed testing though.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

No

@todor-ivanov todor-ivanov requested review from amaltaro and vkuznet July 25, 2023 17:56
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 2 warnings
    • 15 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14374/artifact/artifacts/PullRequestReport.html

Typo && Remove unneded import of SubprocessAlgos
@todor-ivanov todor-ivanov force-pushed the bugfix_SmapsParsingError_fix-11667-3 branch from 9a47116 to b27645c Compare July 25, 2023 18:34
@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 15 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14375/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Copy link
Contributor Author

todor-ivanov commented Jul 25, 2023

Hi @smuzaffar,
In the context of the current PR, since we need to import psutil module at runtime, we need some more information about it's presence in CMSSW. Can you tell us which is the earliest CMSSW release providing it, i.e. can we rely on it if a workflow trying to use a 4-5 years old release tries to load it?

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 15 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14376/artifact/artifacts/PullRequestReport.html

@smuzaffar
Copy link

@todor-ivanov , I see that it is available in CMSSW_9_0_0 release which is 6 years old (Mar 2017). So it should be available in all release above CMSSW_9_0_0 but its version might be different in different releases.

@todor-ivanov
Copy link
Contributor Author

Thanks @smuzaffar , where could I check the psutil versions included in the different CMSSW releases. I am testing the few methods from psutils we are using, to check if they are backwards compatible. I started with a really old version like 3.0.0, but it seams they are not. So I need to double check how far back in psutil versions we may go and map it to the relevant CMSSW release.

@smuzaffar
Copy link

you can create cmssw dev area for CMSSW_9_0_0, set env and then import psutil to check the version. This should be the min psutil version you should support. You can also do the same for latest CMSSW_13_2_0_pre2 release and check the psutil version there.

Note that if you take psutil from cmssw then things can break as for future releases we will move to some newer version of psutil which might not be compatible with your code.

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 15 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 2 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14379/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Copy link
Contributor Author

todor-ivanov commented Jul 26, 2023

Thanks @smuzaffar!

I just created a CMSSW_9_0_0 dev area and checked the version: [1]. So the earliest psutil version available is 5.0.1. Which is good for us. I also made a thorough history check of all the methods from psutil we are using in this module (log file included as attachment to the current comment: psutil.vtest.log). All methods we need start with version 5.0.0 and has not changed their signature since then. So I'd say as long as we do not have workflows in the system
going bellow CMSSW_9_0_0 we are safe.

As of the future, you are correct. We will have to think of a way to have this validated if the version changes. One thing is for sure - it won't be left unnoticed in production, because jobs from workflows which had the malchance to include a new and backwords incompatible version of psutil, will be massively killed by the watchdog. And even though it may seem unlikely, it is still a third parity library, which none of us maintains, so the possibility is indeed nonzero.

@amaltaro I'd be glad to hear your opinion on this.

[1]

$ source /cvmfs/cms.cern.ch/cmsset_default.sh
$ export SCRAM_ARCH=slc7_amd64_gcc630
$ cmsrel CMSSW_9_0_0
$ cd CMSSW_9_0_0/src/
$ cmsenv 
$ python -c "import psutil; print(str(psutil.version_info))" 
(5, 0, 1)

@cmsdmwmbot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 2 warnings
    • 16 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14380/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative to fix this has been merged with PR: #11676

We will keep discussing on how to get psutils in the WMRuntime environment. So I suggest we keep this PR open as discussions progress.

@VinInn
Copy link

VinInn commented Jul 26, 2023

please consider to,
If the job is killed because it exceeds the memory (or time) limit , to dump as much info as possible in particular
/proc//smaps_rollup and global info as well such as /proc/meminfo or vmstat -s

@cmsdmwmbot
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read smaps_rollup instead of parsing smaps using "awk"
6 participants