Skip to content

Commit

Permalink
Merge amd-master into release/rocm-rel-6.2 20240619
Browse files Browse the repository at this point in the history
Signed-off-by: Maisam Arif <[email protected]>
Change-Id: I32114669f2b7a247406cbff0a01c289bfc1bc1f4
  • Loading branch information
Maisam Arif authored and Maisam Arif committed Jun 20, 2024
2 parents 723cc6b + bc2d5be commit d0d1823
Show file tree
Hide file tree
Showing 21 changed files with 1,144 additions and 548 deletions.
5 changes: 2 additions & 3 deletions .github/dependabot.yml
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,10 @@ updates:
directory: "/docs/sphinx" # Location of package manifests
open-pull-requests-limit: 10
schedule:
interval: "monthly"
target-branch: "master"
interval: "weekly"
target-branch: "develop"
labels:
- "documentation"
- "dependencies"
- "ci:docs-only"
reviewers:
- "samjwu"
60 changes: 56 additions & 4 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,19 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr

### Additions

- **Added optional process table under `amd-smi monitor -q`**.
The monitor subcommand within the CLI Tool now has the `-q` option to enable an optional process table underneath the original monitored output.

```shell
$ amd-smi monitor -q
GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK ENC_UTIL ENC_CLOCK DEC_UTIL DEC_CLOCK SINGLE_ECC DOUBLE_ECC PCIE_REPLAY VRAM_USED VRAM_TOTAL PCIE_BW
0 199 W 103 °C 84 °C 99 % 1920 MHz 31 % 1000 MHz N/A 0 MHz N/A 0 MHz 0 0 0 1235 MB 16335 MB N/A Mb/s

PROCESS INFO:
GPU NAME PID GTT_MEM CPU_MEM VRAM_MEM MEM_USAGE GFX ENC
0 rvs 1564865 0.0 B 0.0 B 1.1 GB 0.0 B 0 ns 0 ns
```

- **Added Handling to detect VMs with passthrough configurations in CLI Tool**.
CLI Tool had only allowed a restricted set of options for Virtual Machines with passthrough GPUs. Now we offer an expanded set of functions availble to passthrough configured GPUs.

Expand All @@ -16,7 +29,7 @@ VMs now have the ability to set the process isolation and clear the sram from th

```shell
amd-smi set --process-isolation <0 or 1>
amd-smi reset --clear-sram-data
amd-smi reset --clean_local_data
```

- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**.
Expand All @@ -27,6 +40,9 @@ Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_not

### Optimizations

- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**.
Changed the naming to be more accurate to what the function was doing. This change also extends to the CLI where we changed the `clear-sram-data` command to `clean_local_data`.

- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locked value. New struct will be in the following format:

Expand Down Expand Up @@ -66,7 +82,7 @@ The file structure referenced to dpm_policy changed to soc_pstate and we have ch
Previously on devices without a FRU we would not populate the product name in the `amdsmi_board_info_t` structure, now we will fallback to using the name listed according to the pciids file if available.

- **Updated CLI voltage curve command output**.
The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point or outputs N/A if not applicable
The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point or outputs N/A for each curve point if not applicable

```shell
GPU: 0
Expand All @@ -85,6 +101,42 @@ ASIC products. This requires users to update any ABIs using this structure.

### Fixes

- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**.
Multiple process outputs in the CLI tool were not being registered correctly. The json output did not handle multiple processes and is now in a new valid json format:

```shell
[
{
"gpu": 0,
"process_list": [
{
"process_info": {
"name": "TransferBench",
"pid": 420157,
"mem_usage": {
"value": 0,
"unit": "B"
}
}
},
{
"process_info": {
"name": "rvs",
"pid": 420315,
"mem_usage": {
"value": 0,
"unit": "B"
}
}
}
]
}
]
```

- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**.
Throttle status may work for older ASICs, but will be replaced with PVIOL and TVIOL metrics for future ASIC support. It remains a field in the gpu_metrics API and in `amd-smi metric --power`.

- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**.
Previously if there was a partial failure to retrieve character strings, we would return
garbage output to users using the API. This fix intends to populate as many values as possible.
Expand All @@ -96,7 +148,7 @@ The parsing of `pp_od_clk_voltage` was not dynamic enough to work with the dropp

### Known Issues

- N/A
- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do no currently work and will be supported in a future release**.

## amd_smi_lib for ROCm 6.1.2

Expand All @@ -106,7 +158,7 @@ The parsing of `pp_od_clk_voltage` was not dynamic enough to work with the dropp
Added APIs CLI and APIs to address LeftoverLocals security issues. Allowing clearing the sram data and setting process isolation on a per GPU basis. New APIs:
- `amdsmi_get_gpu_process_isolation()`
- `amdsmi_set_gpu_process_isolation()`
- `amdsmi_set_gpu_run_cleaner_shader()`
- `amdsmi_set_gpu_clear_sram_data()`

- **Added `MIN_POWER` to output of `amd-smi static --limit`**.
This change helps users identify the range to which they can change the power cap of the GPU. The change is added to simplify why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap <value in W>` or `amd-smi reset -g all --power-cap`.
Expand Down
2 changes: 1 addition & 1 deletion CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ find_program(GIT NAMES git)

## Setup the package version based on git tags.
set(PKG_VERSION_GIT_TAG_PREFIX "amdsmi_pkg_ver")
get_package_version_number("24.6.0" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
get_package_version_number("24.6.1" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
message("Package version: ${PKG_VERSION_STR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MAJOR "${CPACK_PACKAGE_VERSION_MAJOR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MINOR "${CPACK_PACKAGE_VERSION_MINOR}")
Expand Down
5 changes: 2 additions & 3 deletions amdsmi_cli/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,7 +79,7 @@ amd-smi will report the version and current platform detected when running the c
~$ amd-smi
usage: amd-smi [-h] ...

AMD System Management Interface | Version: 24.6.0.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.6.1.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal

options:
-h, --help show this help message and exit
Expand Down Expand Up @@ -594,7 +594,7 @@ Command Modifiers:
usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
[-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]]
[-w INTERVAL] [-W TIME] [-i ITERATIONS] [-p] [-t] [-u] [-m] [-n]
[-d] [-s] [-e] [-v] [-r]
[-d] [-e] [-v] [-r]

Monitor a target device for the specified arguments.
If no arguments are provided, all arguments will be enabled.
Expand Down Expand Up @@ -626,7 +626,6 @@ Monitor Arguments:
-m, --mem Monitor memory utilization (%) and clock (MHz)
-n, --encoder Monitor encoder utilization (%) and clock (MHz)
-d, --decoder Monitor decoder utilization (%) and clock (MHz)
-s, --throttle-status Monitor thermal throttle status
-e, --ecc Monitor ECC single bit, ECC double bit, and PCIe replay error counts
-v, --vram-usage Monitor memory usage in MB
-r, --pcie Monitor PCIe bandwidth in Mb/s
Expand Down
Loading

0 comments on commit d0d1823

Please sign in to comment.