
Commit

Merge amd-master into release/rocm-rel-6.2 20240710
Signed-off-by: Maisam Arif <[email protected]>
Change-Id: I67885b9df87d2300c53c576f7b36369154b835a7
Maisam Arif authored and Maisam Arif committed Jul 11, 2024
2 parents d0d1823 + 4732919 commit 32e3fda
Showing 20 changed files with 348 additions and 176 deletions.
44 changes: 44 additions & 0 deletions CHANGELOG.md
@@ -8,6 +8,8 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr

### Additions

- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**.

- **Added optional process table under `amd-smi monitor -q`**.
The `monitor` subcommand in the CLI tool now accepts a `-q` option that displays a process table below the standard monitor output.

@@ -40,6 +42,48 @@ Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_not

### Optimizations

- **Updated CLI error strings to specify the device type of an invalid query**

```shell
$ amd-smi static --asic --gpu 123123
Can not find a device: GPU '123123' Error code: -3
```

- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**.
Previously, if a process with elevated permissions was running, amd-smi required sudo to display all output. Now amd-smi populates all process data and returns N/A for the names of elevated processes instead. When run with sudo, amd-smi also shows the process name:

```shell
$ amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: N/A
        PID: 1693982
        MEMORY_USAGE:
            GTT_MEM: 0.0 B
            CPU_MEM: 0.0 B
            VRAM_MEM: 10.1 GB
        MEM_USAGE: 0.0 B
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
```

```shell
$ sudo amd-smi process
GPU: 0
    PROCESS_INFO:
        NAME: TransferBench
        PID: 1693982
        MEMORY_USAGE:
            GTT_MEM: 0.0 B
            CPU_MEM: 0.0 B
            VRAM_MEM: 10.1 GB
        MEM_USAGE: 0.0 B
        USAGE:
            GFX: 0 ns
            ENC: 0 ns
```

- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**.
The new name more accurately describes what the function does. The change also extends to the CLI, where the `clear-sram-data` command was renamed to `clean_local_data`.

2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -28,7 +28,7 @@ find_program(GIT NAMES git)

## Setup the package version based on git tags.
set(PKG_VERSION_GIT_TAG_PREFIX "amdsmi_pkg_ver")
get_package_version_number("24.6.1" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
get_package_version_number("24.6.2" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
message("Package version: ${PKG_VERSION_STR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MAJOR "${CPACK_PACKAGE_VERSION_MAJOR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MINOR "${CPACK_PACKAGE_VERSION_MINOR}")
9 changes: 5 additions & 4 deletions amdsmi_cli/README.md
@@ -73,13 +73,13 @@ Type "help", "copyright", "credits" or "license" for more information.

## Usage

amd-smi will report the version and current platform detected when running the command without arguments:
AMD-SMI reports the version and current platform detected when running the command line interface (CLI) without arguments:

``` bash
~$ amd-smi
usage: amd-smi [-h] ...

AMD System Management Interface | Version: 24.6.1.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.6.2.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal

options:
-h, --help show this help message and exit
@@ -97,7 +97,7 @@ AMD-SMI Commands:
topology Displays topology information of the devices
set Set options for devices
reset Reset options for devices
monitor Monitor metrics for target devices
monitor (dmon) Monitor metrics for target devices
xgmi Displays xgmi information of the devices
```

@@ -594,7 +594,7 @@ Command Modifiers:
usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
[-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]]
[-w INTERVAL] [-W TIME] [-i ITERATIONS] [-p] [-t] [-u] [-m] [-n]
[-d] [-e] [-v] [-r]
[-d] [-e] [-v] [-r] [-q]

Monitor a target device for the specified arguments.
If no arguments are provided, all arguments will be enabled.
@@ -629,6 +629,7 @@ Monitor Arguments:
-e, --ecc Monitor ECC single bit, ECC double bit, and PCIe replay error counts
-v, --vram-usage Monitor memory usage in MB
-r, --pcie Monitor PCIe bandwidth in Mb/s
-q, --process Enable Process information table below monitor output

Command Modifiers:
--json Displays output in JSON format (human readable by default).
47 changes: 36 additions & 11 deletions amdsmi_cli/amdsmi_cli_exceptions.py
@@ -69,6 +69,7 @@ def __init__(self):
        self.stdout_message = ''
        self.message = ''
        self.output_format = ''
        self.device_type = ''

    def __str__(self):
        # Return message according to the current output format
@@ -83,7 +84,7 @@ def __str__(self):


class AmdSmiInvalidCommandException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -1
        self.command = command
@@ -98,7 +99,7 @@ def __init__(self, command, outputformat):


class AmdSmiInvalidParameterException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -2
        self.command = command
@@ -113,13 +114,22 @@ def __init__(self, command, outputformat):


class AmdSmiDeviceNotFoundException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str, gpu: bool, cpu: bool, core: bool):
        super().__init__()
        self.value = -3
        self.command = command
        self.output_format = outputformat

        common_message = f"Can not find a device with the corresponding identifier: '{self.command}'"
        # Handle different devices
        self.device_type = ""
        if gpu:
            self.device_type = "GPU"
        elif cpu:
            self.device_type = "CPU"
        elif core:
            self.device_type = "CPU CORE"

        common_message = f"Can not find a device: {self.device_type} '{self.command}'"

        self.json_message["error"] = common_message
        self.json_message["code"] = self.value
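The `gpu`/`cpu`/`core` flags above decide which label appears in the error. The following is a minimal stand-alone sketch of that mapping; `device_type_label` and `device_not_found_message` are hypothetical helpers, not amd-smi functions, and the `"{message} Error code: {value}"` stdout layout is assumed from the sibling exceptions and the CHANGELOG example.

```python
def device_type_label(gpu: bool, cpu: bool, core: bool) -> str:
    """Map the selector flags to a device label; the first true flag wins."""
    if gpu:
        return "GPU"
    if cpu:
        return "CPU"
    if core:
        return "CPU CORE"
    return ""


def device_not_found_message(identifier: str, gpu: bool, cpu: bool, core: bool,
                             code: int = -3) -> str:
    """Build the human-readable error line for an identifier that matched no device."""
    label = device_type_label(gpu, cpu, core)
    common_message = f"Can not find a device: {label} '{identifier}'"
    return f"{common_message} Error code: {code}"


if __name__ == "__main__":
    # Same flag order as the --gpu selector: gpu=True, cpu=False, core=False
    print(device_not_found_message("123123", True, False, False))
```

With the GPU flag set, this reproduces the CHANGELOG line `Can not find a device: GPU '123123' Error code: -3`.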
@@ -128,7 +138,7 @@ def __init__(self, command, outputformat):


class AmdSmiInvalidFilePathException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -4
        self.command = command
@@ -143,7 +153,7 @@ def __init__(self, command, outputformat):


class AmdSmiInvalidParameterValueException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -5
        self.command = command
@@ -158,7 +168,7 @@ def __init__(self, command, outputformat):


class AmdSmiMissingParameterValueException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -6
        self.command = command
@@ -172,8 +182,23 @@ def __init__(self, command, outputformat):
        self.stdout_message = f"{common_message} Error code: {self.value}"


class AmdSmiCommandNotSupportedException(AmdSmiException):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -7
        self.command = command
        self.output_format = outputformat

        common_message = f"Command '{self.command}' is not supported on the system. Run '--help' for more info."

        self.json_message["error"] = common_message
        self.json_message["code"] = self.value
        self.csv_message = f"error,code\n{common_message}, {self.value}"
        self.stdout_message = f"{common_message} Error code: {self.value}"


class AmdSmiParameterNotSupportedException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -8
        self.command = command
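Each exception precomputes a JSON, CSV, and stdout variant of the same message, and the base class's `__str__` (whose body this diff does not show) returns one of them according to the output format. The sketch below mimics that for the new `-7` error. `command_not_supported_messages` and `render` are hypothetical, the format names (`"json"`, `"csv"`, anything else as human-readable) are assumptions, and `json.dumps` stands in for however amd-smi actually serializes JSON.

```python
import json


def command_not_supported_messages(command: str, code: int = -7) -> dict:
    """Precompute the three renderings, mirroring AmdSmiCommandNotSupportedException."""
    common_message = (f"Command '{command}' is not supported on the system. "
                      "Run '--help' for more info.")
    return {
        "json": {"error": common_message, "code": code},
        "csv": f"error,code\n{common_message}, {code}",
        "stdout": f"{common_message} Error code: {code}",
    }


def render(messages: dict, output_format: str) -> str:
    """Hypothetical stand-in for AmdSmiException.__str__: select by output format."""
    if output_format == "json":
        return json.dumps(messages["json"])
    if output_format == "csv":
        return messages["csv"]
    return messages["stdout"]  # human-readable default


if __name__ == "__main__":
    msgs = command_not_supported_messages("xgmi")
    for fmt in ("human", "json", "csv"):
        print(render(msgs, fmt))
```

Building all three strings up front keeps `__str__` a plain lookup, so the exception can be raised anywhere without re-deriving the message.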
@@ -188,7 +213,7 @@ def __init__(self, command, outputformat):


class AmdSmiRequiredCommandException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -9
        self.command = command
@@ -203,7 +228,7 @@ def __init__(self, command, outputformat):


class AmdSmiUnknownErrorException(AmdSmiException):
    def __init__(self, command, outputformat):
    def __init__(self, command, outputformat: str):
        super().__init__()
        self.value = -100
        self.command = command
@@ -218,7 +243,7 @@ def __init__(self, command, outputformat):


class AmdSmiAMDSMIErrorException(AmdSmiException):
    def __init__(self, outputformat, error_code):
    def __init__(self, outputformat: str, error_code):
        super().__init__()
        self.value = -1000 - abs(error_code)
        self.smilibcode = error_code
14 changes: 9 additions & 5 deletions amdsmi_cli/amdsmi_commands.py
@@ -115,9 +115,15 @@ def version(self, args):
        self.logger.output['rocm_version'] = f'{rocm_version_str}'

        if self.logger.is_human_readable_format():
            print(f'AMDSMI Tool: {__version__} | '\
                  f'AMDSMI Library version: {amdsmi_lib_version_str} | ' \
                  f'ROCm version: {rocm_version_str}')
            human_readable_output = f"AMDSMI Tool: {__version__} | " \
                                    f"AMDSMI Library version: {amdsmi_lib_version_str} | " \
                                    f"ROCm version: {rocm_version_str}"
            # Custom human readable handling for version
            if self.logger.destination == 'stdout':
                print(human_readable_output)
            else:
                with self.logger.destination.open('a') as output_file:
                    output_file.write(human_readable_output + '\n')
        elif self.logger.is_json_format() or self.logger.is_csv_format():
            self.logger.print_output()
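The new `version` branch builds the line once and then either prints it or appends it to the `--file` destination. A self-contained sketch of that logic follows. `format_version_line` and `emit_human_readable` are hypothetical names, and treating the destination as either the string `'stdout'` or a `pathlib.Path` is inferred from the `self.logger.destination.open('a')` call rather than confirmed by the diff.

```python
from pathlib import Path
from typing import Union


def format_version_line(tool_version: str, lib_version: str, rocm_version: str) -> str:
    """Join the three version strings in the same order as the diff."""
    return (f"AMDSMI Tool: {tool_version} | "
            f"AMDSMI Library version: {lib_version} | "
            f"ROCm version: {rocm_version}")


def emit_human_readable(text: str, destination: Union[str, Path]) -> None:
    """Print to stdout, or append one line to the destination file."""
    if destination == "stdout":
        print(text)
    else:
        # Mode 'a' appends, so repeated calls add lines instead of overwriting the file
        with Path(destination).open("a") as output_file:
            output_file.write(text + "\n")


if __name__ == "__main__":
    # Placeholder version strings, not real release numbers
    emit_human_readable(format_version_line("1.0", "2.0", "3.0"), "stdout")
```

Because the file is opened in append mode, pointing `--file` at an existing file extends it rather than truncating it.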

@@ -2631,8 +2637,6 @@ def process(self, args, multiple_devices=False, watching_output=False,
        try:
            process_list = amdsmi_interface.amdsmi_get_gpu_process_list(args.gpu)
        except amdsmi_exception.AmdSmiLibraryException as e:
            if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
                raise PermissionError('Command requires elevation') from e
            logging.debug("Failed to get process list for gpu %s | %s", gpu_id, e.get_error_info())
            raise e
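With the two `AMDSMI_STATUS_NO_PERM` lines removed, a permission problem no longer becomes a `PermissionError` for the whole `process` command; per the CHANGELOG, names of elevated processes are reported as N/A instead. The sketch below illustrates only that degrade-instead-of-raise pattern. `process_rows` and the injected `lookup_name` callable are hypothetical, and this diff does not show where amd-smi actually performs the substitution.

```python
from typing import Callable, Dict, Iterable, List


def process_rows(pids: Iterable[int],
                 lookup_name: Callable[[int], str]) -> List[Dict[str, object]]:
    """Build one row per PID, using 'N/A' when the name cannot be read."""
    rows = []
    for pid in pids:
        try:
            name = lookup_name(pid)
        except PermissionError:
            # Keep the rest of the row instead of aborting the whole listing
            name = "N/A"
        rows.append({"name": name, "pid": pid})
    return rows


if __name__ == "__main__":
    def fake_lookup(pid: int) -> str:
        # Hypothetical lookup: pretend PID 1 belongs to an elevated process
        if pid == 1:
            raise PermissionError("elevated process")
        return f"proc-{pid}"

    print(process_rows([1, 2], fake_lookup))
```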

2 changes: 1 addition & 1 deletion amdsmi_cli/amdsmi_logger.py
@@ -465,7 +465,7 @@ def print_output(self, multiple_device_enabled=False, watching_output=False, tab
            self._print_tabular_output(multiple_device_enabled=multiple_device_enabled, watching_output=watching_output)
        else:
            self._print_human_readable_output(multiple_device_enabled=multiple_device_enabled,
                                              watching_output=watching_output)
                                              watching_output=watching_output)


    def _print_json_output(self, multiple_device_enabled=False, watching_output=False):
20 changes: 16 additions & 4 deletions amdsmi_cli/amdsmi_parser.py
@@ -111,6 +111,11 @@ def __init__(self, version, list, static, firmware, bad_pages, metric,
            help="Descriptions:",
            metavar='')

        # Store possible subcommands & aliases for later errors
        self.possible_commands = ['version', 'list', 'static', 'firmware', 'ucode', 'bad-pages',
                                  'metric', 'process', 'profile', 'event', 'topology', 'set',
                                  'reset', 'monitor', 'dmon', 'xgmi']

        # Add all subparsers
        self._add_version_parser(self.subparsers, version)
        self._add_list_parser(self.subparsers, list)
@@ -257,7 +262,9 @@ def __call__(self, parser, args, values, option_string=None):
                    if selected_device_handles == '':
                        raise amdsmi_cli_exceptions.AmdSmiMissingParameterValueException("--gpu", _GPUSelectAction.ouputformat)
                    else:
                        raise amdsmi_cli_exceptions.AmdSmiDeviceNotFoundException(selected_device_handles, _GPUSelectAction.ouputformat)
                        raise amdsmi_cli_exceptions.AmdSmiDeviceNotFoundException(selected_device_handles,
                            _GPUSelectAction.ouputformat,
                            True, False, False)

        return _GPUSelectAction

@@ -283,7 +290,8 @@ def __call__(self, parser, args, values, option_string=None):
                        raise amdsmi_cli_exceptions.AmdSmiMissingParameterValueException("--cpu", _CPUSelectAction.ouputformat)
                    else:
                        raise amdsmi_cli_exceptions.AmdSmiDeviceNotFoundException(selected_device_handles,
                            _CPUSelectAction.ouputformat)
                            _CPUSelectAction.ouputformat,
                            False, True, False)
        return _CPUSelectAction


@@ -308,7 +316,8 @@ def __call__(self, parser, args, values, option_string=None):
                        raise amdsmi_cli_exceptions.AmdSmiMissingParameterValueException("--core", _CoreSelectAction.ouputformat)
                    else:
                        raise amdsmi_cli_exceptions.AmdSmiDeviceNotFoundException(selected_device_handles,
                            _CoreSelectAction.ouputformat)
                            _CoreSelectAction.ouputformat,
                            False, False, True)
        return _CoreSelectAction


@@ -1129,7 +1138,7 @@ def _add_monitor_parser(self, subparsers, func):
        process_help = "Enable Process information table below monitor output"

        # Create monitor subparser
        monitor_parser = subparsers.add_parser('monitor', help=monitor_help, description=monitor_subcommand_help)
        monitor_parser = subparsers.add_parser('monitor', help=monitor_help, description=monitor_subcommand_help, aliases=["dmon"])
        monitor_parser._optionals.title = monitor_optionals_title
        monitor_parser.formatter_class=lambda prog: AMDSMISubparserHelpFormatter(prog)
        monitor_parser.set_defaults(func=func)
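Passing `aliases=["dmon"]` to `add_parser` is standard `argparse` behavior: both names resolve to the same subparser and therefore to the same `set_defaults(func=...)`. The sketch below reproduces just that wiring. `build_parser` is a hypothetical name, and the `-q/--process` definition with `action="store_true"` is an assumption, since the hunk shows only its help string.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser with a 'monitor' subcommand that is also reachable as 'dmon'."""
    parser = argparse.ArgumentParser(prog="amd-smi-sketch")
    subparsers = parser.add_subparsers(dest="command")

    # aliases= registers 'dmon' as a second name for the same subparser
    monitor_parser = subparsers.add_parser("monitor", aliases=["dmon"],
                                           help="Monitor metrics for target devices")
    # Assumed flag definition: a boolean switch for the process table
    monitor_parser.add_argument("-q", "--process", action="store_true",
                                help="Enable Process information table below monitor output")
    monitor_parser.set_defaults(func=lambda args: "monitor")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["dmon", "-q"])
    print(args.command, args.process, args.func(args))
```

`args.command` records the name that was actually typed (`dmon`), while `func` is shared, so the alias needs no separate handler.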
@@ -1232,6 +1241,9 @@ def error(self, message):
            l = len("argument : invalid choice: ") + 1
            message = message[l:]
            message = message.split("'")[0]
            # Check if the command is possible in other system configurations and error accordingly
            if message in self.possible_commands:
                raise amdsmi_cli_exceptions.AmdSmiCommandNotSupportedException(message, outputformat)
            raise amdsmi_cli_exceptions.AmdSmiInvalidCommandException(message, outputformat)
        elif "unrecognized arguments: " in message:
            l = len("unrecognized arguments: ")
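The new branch in `error()` separates commands that amd-smi knows about but that are unavailable on the current system (code `-7`) from genuine typos (code `-1`). The sketch below isolates that classification. `classify_invalid_choice` is hypothetical, and the sample messages assume argparse's `argument : invalid choice: '<name>' ...` wording, whose exact form may vary across Python versions.

```python
INVALID_CHOICE_PREFIX = "argument : invalid choice: "


def classify_invalid_choice(message: str, possible_commands: set) -> tuple:
    """Return (error_code, command) for an argparse 'invalid choice' message."""
    # Skip the prefix plus the opening quote, then cut at the closing quote
    command = message[len(INVALID_CHOICE_PREFIX) + 1:].split("'")[0]
    if command in possible_commands:
        return -7, command  # a real amd-smi command, not available on this system
    return -1, command      # not an amd-smi command at all


if __name__ == "__main__":
    known = {"version", "list", "monitor", "dmon", "xgmi"}
    print(classify_invalid_choice(
        "argument : invalid choice: 'xgmi' (choose from 'version', 'list')", known))
```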
2 changes: 1 addition & 1 deletion docs/doxygen/Doxyfile
@@ -48,7 +48,7 @@ PROJECT_NAME = AMD SMI
# could be handy for archiving the generated documentation or if some version
# control system is used.

PROJECT_NUMBER = "24.6.1.0"
PROJECT_NUMBER = "24.6.2.0"

# Using the PROJECT_BRIEF tag one can provide an optional one line description
# for a project that appears at the top of each page and should give viewer a
5 changes: 3 additions & 2 deletions docs/how-to/using-AMD-SMI-CLI-tool.md
@@ -6,7 +6,7 @@ AMD-SMI reports the version and current platform detected when running the comma
~$ amd-smi
usage: amd-smi [-h] ...

AMD System Management Interface | Version: 24.6.1.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.6.2.0 | ROCm version: 6.2.0 | Platform: Linux Baremetal

options:
-h, --help show this help message and exit
@@ -521,7 +521,7 @@ Command Modifiers:
usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
[-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]]
[-w INTERVAL] [-W TIME] [-i ITERATIONS] [-p] [-t] [-u] [-m] [-n]
[-d] [-e] [-v] [-r]
[-d] [-e] [-v] [-r] [-q]

Monitor a target device for the specified arguments.
If no arguments are provided, all arguments will be enabled.
@@ -556,6 +556,7 @@ Monitor Arguments:
-e, --ecc Monitor ECC single bit, ECC double bit, and PCIe replay error counts
-v, --vram-usage Monitor memory usage in MB
-r, --pcie Monitor PCIe bandwidth in Mb/s
-q, --process Enable Process information table below monitor output

Command Modifiers:
--json Displays output in JSON format (human readable by default).