-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feature: Support AMDGPU Data Collection #1641
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1641 +/- ##
==========================================
- Coverage 42.03% 41.37% -0.67%
==========================================
Files 116 118 +2
Lines 17625 17926 +301
==========================================
+ Hits 7409 7417 +8
- Misses 10216 10509 +293
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
CI failing due to lacking libdrm/libdrm_amdgpu libraries. What's the best way forward here? |
Can the libraries be installed? |
libdrm should exist in most linux distros via their package managers, i'm atm investigating a dependency-free solution by parsing |
Most recent commit parses AMD GPU metrics via procfs (for per-process utilization and video memory usage) and sysfs (for overall AMDGPU memory usage and temperature sensors) and as such doesn't rely on any libraries. However the code is significantly more complex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments for now, mostly looks good though.
} | ||
|
||
// needs previous state for usage calculation | ||
static PROC_DATA: LazyLock<Mutex<HashMap<PathBuf, HashMap<u32, AMDGPUProc>>>> = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
non-blocking: This doesn't need to be changed for this PR, but a mutex around this seems a little overkill.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the only way I could get a mutable reference to PROC_DATA without making the static itself mutable, and thus requiring unsafe code to access.
// get vram memory info from sysfs | ||
let vram_total_path = device_path.join("mem_info_vram_total"); | ||
let vram_used_path = device_path.join("mem_info_vram_used"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you happen to know whether any of these checks in this file wake up the GPU if it is currently sleeping? We had an issue with temperature checks in https://github.com/ClementTsang/bottom/blob/main/src/data_collection/temperature/linux.rs#L226 where we were waking up devices (mainly GPUs) when checking their temperature, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not, but I can ask around.
}) | ||
} | ||
|
||
pub fn get_amd_temp(device_path: &Path) -> Option<Vec<AMDGPUTemperature>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides having better names, no. They end up symlinking to the same hwmon endpoint
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, guess it's fine to leave it in for now but I might integrate this into that code path in a later PR.
I will make the necessary changes tomorrow. |
e2cd20d
to
a26e2b2
Compare
Force pushes were rewording commit messages. |
Co-authored-by: lvxnull2 <[email protected]>
gpu: fix clippy issues Co-authored-by: lvxnull2 <[email protected]>
…ead of current memory usage gpu: requested syntax changes Co-authored-by: lvxnull2 <[email protected]>
I accidentally reset the signature of the 4th commit from HEAD, which I just fixed by resetting the entire branch, apologies! History should be preserved now. |
Sorry for the delay, was on vacation - will take a look at this again in a sec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for doing this!
@all-contributors please add @yretenai for code. |
I've put up a pull request to add @yretenai! 🎉 |
Description
Adds GPU metrics gathering
via amdgpu_top's libamdgpu_top crateon Linux.Some notes: the library queries /proc/{pid}/fdinfo, which can probably be parsed without amdgpu_top's libraries. Intel Arc apparently also uses this fdinfo but I cannot confirm.
Testing
If relevant, please state how this was tested. All changes must be tested to work:
If this is a code change, please also indicate which platforms were tested:
Checklist
If relevant, ensure the following have been met:
cargo fmt
)README.md
, help menu, doc pages, etc.)