amdsmi

Latest version: v6.2.4


Fixes

- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**.
For devices which do not report PCIe bandwidth (e.g. Navi3X/Navi2X/MI100), checks were added to confirm these devices return `AMDSMI_STATUS_NOT_SUPPORTED`. Otherwise, tests now display a return string.
- **Fix for devices which have an older pyyaml installed**.
On platforms identified as having an older pyyaml or pip version, we now manually update both pip and pyyaml as needed. This corrects the error shown below. The fix impacts the following CLI commands:
- `amd-smi list`
- `amd-smi static`
- `amd-smi firmware`
- `amd-smi metric`
- `amd-smi topology`

```shell
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'
```


- **Fix for crash when user is not a member of video/render groups**.
AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user.


6.3.0

Changes

- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**.
Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery (a usage sketch follows the field list):
- `uint64_t accumulation_counter` - used for all throttled calculations
- `uint64_t prochot_residency_acc` - Processor hot accumulator
- `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations)
- `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations)
- `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator
- `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator
- `uint16_t num_partition` - corresponds to the current total number of partitions
- `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with the current GPU, provides gfx busy values & accumulators, plus jpeg and decoder (VCN) engine utilizations
  - `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphics engine utilization (%)
  - `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - jpeg engine utilization (%)
  - `uint16_t vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%)
  - `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - accumulated graphics engine utilization (%)
- `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter
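
As a quick illustration, the sketch below reads these new fields through the amdsmi Python bindings. It is a minimal sketch, assuming the dict returned by `amdsmi_get_gpu_metrics_info()` uses keys mirroring the field names above; exact key names may vary by release.

```python
# Minimal sketch: read gpu_metrics 1.6 fields via the Python bindings.
# Assumes dict keys mirror the field names listed above (may vary by release).
import amdsmi

amdsmi.amdsmi_init()
try:
    for handle in amdsmi.amdsmi_get_processor_handles():
        metrics = amdsmi.amdsmi_get_gpu_metrics_info(handle)
        # Accumulators used in PVIOL/TVIOL calculations
        print("accumulation_counter:", metrics.get("accumulation_counter"))
        print("ppt_residency_acc:", metrics.get("ppt_residency_acc"))
        print("socket_thm_residency_acc:", metrics.get("socket_thm_residency_acc"))
        # Per-partition (XCP) engine utilization
        for i, xcp in enumerate(metrics.get("xcp_stats") or []):
            print(f"XCP_{i} gfx_busy_inst:", xcp.get("gfx_busy_inst"))
finally:
    amdsmi.amdsmi_shut_down()
```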

- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**.
***Only available for MI300+ ASICs.***
Users can now retrieve violation statuses through either our Python or C APIs. Additionally, we have
added the capability to view these outputs conveniently through `amd-smi metric --throttle` and `amd-smi monitor --violation`.
Example outputs are listed below (for reference only; output is subject to change):

```shell
$ amd-smi metric --throttle
GPU: 0
THROTTLE:
ACCUMULATION_COUNTER: 3808991
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 585613
SOCKET_THERMAL_ACCUMULATED: 2190
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_STATUS: NOT ACTIVE
PPT_VIOLATION_STATUS: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_STATUS: NOT ACTIVE
VR_THERMAL_VIOLATION_STATUS: NOT ACTIVE
HBM_THERMAL_VIOLATION_STATUS: NOT ACTIVE
PROCHOT_VIOLATION_ACTIVITY: 0 %
PPT_VIOLATION_ACTIVITY: 0 %
SOCKET_THERMAL_VIOLATION_ACTIVITY: 0 %
VR_THERMAL_VIOLATION_ACTIVITY: 0 %
HBM_THERMAL_VIOLATION_ACTIVITY: 0 %



GPU: 1
THROTTLE:
ACCUMULATION_COUNTER: 3806335
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 586332
SOCKET_THERMAL_ACCUMULATED: 18010
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_STATUS: NOT ACTIVE
PPT_VIOLATION_STATUS: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_STATUS: NOT ACTIVE
VR_THERMAL_VIOLATION_STATUS: NOT ACTIVE
HBM_THERMAL_VIOLATION_STATUS: NOT ACTIVE
PROCHOT_VIOLATION_ACTIVITY: 0 %
PPT_VIOLATION_ACTIVITY: 0 %
SOCKET_THERMAL_VIOLATION_ACTIVITY: 0 %
VR_THERMAL_VIOLATION_ACTIVITY: 0 %
HBM_THERMAL_VIOLATION_ACTIVITY: 0 %

...
```


```shell
$ amd-smi monitor --violation
GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL
0 0 % 0 % 0 % 0 % 0 %
1 0 % 0 % 0 % 0 % 0 %
2 0 % 0 % 0 % 0 % 0 %
3 0 % 0 % 0 % 0 % 0 %
4 0 % 0 % 0 % 0 % 0 %
5 0 % 0 % 0 % 0 % 0 %
6 0 % 0 % 0 % 0 % 0 %
7 0 % 0 % 0 % 0 % 0 %
8 0 % 0 % 0 % 0 % 0 %
9 0 % 0 % 0 % 0 % 0 %
10 0 % 0 % 0 % 0 % 0 %
11 0 % 0 % 0 % 0 % 0 %
12 0 % 0 % 0 % 0 % 0 %
13 0 % 0 % 0 % 0 % 0 %
14 0 % 0 % 0 % 0 % 0 %
15 0 % 0 % 0 % 0 % 0 %
...
```
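
For API consumers, here is a minimal Python sketch of the new call. The function name comes from this release; the exact shape of the returned data is an assumption.

```python
# Minimal sketch: query violation status on MI300+ ASICs.
# The structure of the returned data is an assumption.
import amdsmi

amdsmi.amdsmi_init()
try:
    for handle in amdsmi.amdsmi_get_processor_handles():
        status = amdsmi.amdsmi_get_violation_status(handle)
        print(status)  # accumulators plus per-type status/activity fields
finally:
    amdsmi.amdsmi_shut_down()
```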


- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`**.
***Partition-specific features are only available on MI300+ ASICs.***
Users can now retrieve graphics utilization statistics on a per-XCP (per-partition) basis. All XCP activities are listed,
and the current XCP is the partition ID listed under both `amd-smi list` and `amd-smi static --partition`.
Example outputs are listed below (for reference only; output is subject to change):

```shell
$ amd-smi metric --usage
GPU: 0
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]

GPU: 1
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]

...
```


- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value**.
***Feature is only available on MI300+ ASICs.***
Users can now retrieve this counter through `amdsmi_get_pcie_info()`, which has an updated structure:

```c
typedef struct {
    ...
    struct pcie_metric_ {
        uint16_t pcie_width;                            //!< current PCIe width
        uint32_t pcie_speed;                            //!< current PCIe speed in MT/s
        uint32_t pcie_bandwidth;                        //!< current instantaneous PCIe bandwidth in Mb/s
        uint64_t pcie_replay_count;                     //!< total number of the replays issued on the PCIe link
        uint64_t pcie_l0_to_recovery_count;             //!< total number of times the PCIe link transitioned from L0 to the recovery state
        uint64_t pcie_replay_roll_over_count;           //!< total number of replay rollovers issued on the PCIe link
        uint64_t pcie_nak_sent_count;                   //!< total number of NAKs issued on the PCIe link by the device
        uint64_t pcie_nak_received_count;               //!< total number of NAKs issued on the PCIe link by the receiver
        uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter
        uint64_t reserved[12];
    } pcie_metric;
    uint64_t reserved[32];
} amdsmi_pcie_info_t;
```


- Example outputs are listed below (for reference only; output is subject to change):

```shell
$ amd-smi metric --pcie
GPU: 0
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0

GPU: 1
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0
...
```
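
For reading the same counter programmatically, a minimal Python sketch follows; it assumes the Python binding returns a dict whose `pcie_metric` entry mirrors the C struct above.

```python
# Minimal sketch: read the PCIe other-end-recovery counter (MI300+).
# Assumes a dict return whose "pcie_metric" entry mirrors the C struct above.
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    pcie = amdsmi.amdsmi_get_pcie_info(handle)
    metric = pcie.get("pcie_metric", {})
    print("other end recovery:", metric.get("pcie_lc_perf_other_end_recovery_count"))
finally:
    amdsmi.amdsmi_shut_down()
```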


- **Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**.
This aligns BDF output with ROCm SMI. As with `rsmi_dev_pci_id_get()`, the returned value now provides the partition ID; see the API documentation for details. Previously, the bits right before the domain were reserved and the partition ID was encoded within the function bits. The layout is now (a decoding sketch follows the list):
- bits [63:32] = domain
- bits [31:28] = partition id
- bits [27:16] = reserved
- bits [15: 0] = pci bus/device/function
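
A small decoding sketch for this layout; `decode_bdf64` is a hypothetical helper, and the split of bits [15:0] into bus/device/function follows standard PCI packing.

```python
# Hypothetical helper: decode a 64-bit ID per the bit layout above.
def decode_bdf64(bdf64: int) -> dict:
    return {
        "domain":       (bdf64 >> 32) & 0xFFFFFFFF,  # bits [63:32]
        "partition_id": (bdf64 >> 28) & 0xF,         # bits [31:28]
        # bits [27:16] are reserved
        "bus":          (bdf64 >> 8) & 0xFF,   # standard PCI packing
        "device":       (bdf64 >> 3) & 0x1F,
        "function":     bdf64 & 0x7,
    }

print(decode_bdf64(0x10000C00))  # partition 1, bus 0x0c, device 0, function 0
```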

- **Moved the Python tests directory install location**.
- `/opt/<rocm-path>/share/amd_smi/pytest/..` to `/opt/<rocm-path>/share/amd_smi/tests/python_unittest/..`
- On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed.
- Removed the pytest dependency; our Python testing now only depends on the unittest framework.

- **Added retrieving a set of GPUs that are nearest to a given device at a specific link type level**.
- Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries.
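
A hedged Python sketch of the new call follows; the `AmdSmiLinkType` enum value and the shape of the printed result are assumptions.

```python
# Sketch: find the GPUs nearest to a device at a given link type.
# The AmdSmiLinkType enum name and result shape are assumptions.
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    nearest = amdsmi.amdsmi_get_link_topology_nearest(
        handle, amdsmi.AmdSmiLinkType.XGMI)
    print(nearest)
finally:
    amdsmi.amdsmi_shut_down()
```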

- **Added more supported utilization count types to `amdsmi_get_utilization_count()`**.

- **Added `amd-smi set -L/--clk-limit ...` command**.
Equivalent to rocm-smi's `--extremum` command, which sets the soft minimum or soft maximum clock frequency for sclk or mclk.

- **Added unittest functionality to test amdsmi API calls in Python**.

- **Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`**.
- The change propagates forward into the Python interface as well; however, we are maintaining backwards compatibility and keeping the `power` field in the Python API until ROCm 6.4.
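
A minimal Python sketch of the rename, treating the deprecated `power` key only as a fallback (key names per this changelog):

```python
# Sketch: prefer the new energy_accumulator key; fall back to the
# deprecated power key (kept for backwards compatibility until ROCm 6.4).
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    energy = amdsmi.amdsmi_get_energy_count(handle)
    value = energy.get("energy_accumulator", energy.get("power"))
    print("energy accumulator:", value)
finally:
    amdsmi.amdsmi_shut_down()
```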

- **Added GPU memory overdrive percentage to `amd-smi metric -o`**.
- Added `amdsmi_get_gpu_mem_overdrive_level()` function to amd-smi C and Python Libraries.

- **Added retrieving connection type and P2P capabilities between two GPUs**.
- Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries.
- Added retrieving P2P link capabilities to CLI `amd-smi topology`.

```shell
$ amd-smi topology -h
usage: amd-smi topology [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
[-g GPU [GPU ...]] [-a] [-w] [-o] [-t] [-b]

If no GPU is specified, returns information for all GPUs on the system.
If no topology argument is provided all topology information will be displayed.

Topology arguments:
-h, --help show this help message and exit
-g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices:
ID: 0 | BDF: 0000:0c:00.0 | UUID: <redacted>
ID: 1 | BDF: 0000:22:00.0 | UUID: <redacted>
ID: 2 | BDF: 0000:38:00.0 | UUID: <redacted>
ID: 3 | BDF: 0000:5c:00.0 | UUID: <redacted>
ID: 4 | BDF: 0000:9f:00.0 | UUID: <redacted>
ID: 5 | BDF: 0000:af:00.0 | UUID: <redacted>
ID: 6 | BDF: 0000:bf:00.0 | UUID: <redacted>
ID: 7 | BDF: 0000:df:00.0 | UUID: <redacted>
all | Selects all devices

-a, --access Displays link accessibility between GPUs
-w, --weight Displays relative weight between GPUs
-o, --hops Displays the number of hops between GPUs
-t, --link-type Displays the link type between GPUs
-b, --numa-bw Display max and min bandwidth between nodes
-c, --coherent Display cache coherant (or non-coherant) link capability between nodes
-n, --atomics Display 32 and 64-bit atomic io link capability between nodes
-d, --dma Display P2P direct memory access (DMA) link capability between nodes
-z, --bi-dir Display P2P bi-directional link capability between nodes

Command Modifiers:
--json Displays output in JSON format (human readable by default).
--csv Displays output in CSV format (human readable by default).
--file FILE Saves output into a file on the provided path (stdout by default).
--loglevel LEVEL Set the logging level from the possible choices:
DEBUG, INFO, WARNING, ERROR, CRITICAL
```


```shell
$ amd-smi topology -cndz
CACHE COHERANCY TABLE:
0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0
0000:0c:00.0 SELF C NC NC C C C NC
```
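
For the corresponding API, here is a rough Python sketch of `amdsmi_topo_get_p2p_status()` between two devices; the shape of the returned data is an assumption.

```python
# Sketch: query connection type and P2P capabilities between two GPUs.
# The shape of the returned data is an assumption.
import amdsmi

amdsmi.amdsmi_init()
try:
    gpus = amdsmi.amdsmi_get_processor_handles()
    if len(gpus) >= 2:
        status = amdsmi.amdsmi_topo_get_p2p_status(gpus[0], gpus[1])
        print(status)  # link type plus capability flags
finally:
    amdsmi.amdsmi_shut_down()
```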

6.2.1

Additions

- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
Guest VMs do not support getting current ECC counts from the Host cards.

- **Added `amd-smi static --ras` on Guest VMs**.
Guest VMs can view enabled/disabled RAS features on Host cards.

Optimizations

- N/A

Fixes

- **Fixed TypeError in `amd-smi process -G`**.

- **Updated CLI error strings to handle empty and invalid GPU/CPU inputs**.

- **Fixed Guest VM showing passthrough options**.

- **Fixed firmware formatting where leading 0s were missing**.

Known Issues

- N/A

6.2.0

Additions

- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**.

- **Added optional process table under `amd-smi monitor -q`**.
The monitor subcommand within the CLI Tool now has the `-q` option to enable an optional process table underneath the original monitored output.

```shell
$ amd-smi monitor -q
GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK ENC_UTIL ENC_CLOCK DEC_UTIL DEC_CLOCK SINGLE_ECC DOUBLE_ECC PCIE_REPLAY VRAM_USED VRAM_TOTAL PCIE_BW
0 199 W 103 °C 84 °C 99 % 1920 MHz 31 % 1000 MHz N/A 0 MHz N/A 0 MHz 0 0 0 1235 MB 16335 MB N/A Mb/s

PROCESS INFO:
GPU NAME PID GTT_MEM CPU_MEM VRAM_MEM MEM_USAGE GFX ENC
0 rvs 1564865 0.0 B 0.0 B 1.1 GB 0.0 B 0 ns 0 ns
```


- **Added handling to detect VMs with passthrough configurations in the CLI Tool**.
The CLI tool had previously allowed only a restricted set of options for Virtual Machines with passthrough GPUs. Now we offer an expanded set of functions available to passthrough-configured GPUs.

- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**.
VMs can now set process isolation and clear the SRAM from the CLI tool using the following commands:

```shell
amd-smi set --process-isolation <0 or 1>
amd-smi reset --clean_local_data
```
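
The equivalent Python calls, going by the API names used elsewhere in these notes, might look like the sketch below; note the 6.2.0 Known Issues entry stating these calls do not yet work.

```python
# Sketch: process isolation and local-data cleanup via the Python API.
# Function names per these release notes; per the 6.2.0 Known Issues,
# they may not work yet on this release.
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    amdsmi.amdsmi_set_gpu_process_isolation(handle, 1)  # 0 or 1
    amdsmi.amdsmi_clean_gpu_local_data(handle)
finally:
    amdsmi.amdsmi_shut_down()
```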


- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**.
Added macros to reference max size limitations for certain amdsmi functions, such as max DPM policies and max fan speed.

- **Added Ring Hang event**.
Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_notification_type_t` enum.

Optimizations

- **Updated CLI error strings to specify the invalid device type queried**.

```shell
$ amd-smi static --asic --gpu 123123
Can not find a device: GPU '123123' Error code: -3
```


- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**.
Previously, if a process with elevated permissions was running, amd-smi required sudo to display all output. Now amd-smi will populate all process data and return N/A for elevated process names instead. However, if run with sudo, you will be able to see the name like so:

```shell
$ amd-smi process
GPU: 0
PROCESS_INFO:
NAME: N/A
PID: 1693982
MEMORY_USAGE:
GTT_MEM: 0.0 B
CPU_MEM: 0.0 B
VRAM_MEM: 10.1 GB
MEM_USAGE: 0.0 B
USAGE:
GFX: 0 ns
ENC: 0 ns
```


```shell
$ sudo amd-smi process
GPU: 0
PROCESS_INFO:
NAME: TransferBench
PID: 1693982
MEMORY_USAGE:
GTT_MEM: 0.0 B
CPU_MEM: 0.0 B
VRAM_MEM: 10.1 GB
MEM_USAGE: 0.0 B
USAGE:
GFX: 0 ns
ENC: 0 ns
```


- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**.
Changed the naming to more accurately reflect what the function does. This change also extends to the CLI, where the `clear-sram-data` command was changed to `clean_local_data`.

- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Changed `cur_clk` to `clk`, changed `sleep_clk` to `clk_deep_sleep`, and added a `clk_locked` value. The new struct has the following format:

```diff
 typedef struct {
+  uint32_t clk;
   uint32_t min_clk;
   uint32_t max_clk;
+  uint8_t clk_locked;
+  uint8_t clk_deep_sleep;
   uint32_t reserved[4];
 } amdsmi_clk_info_t;
```


- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Multiple structures used by APIs were changed for alignment unification:
- Changed the `amdsmi_vram_info_t` field `vram_size_mb` to `vram_size`
- Updated the `amdsmi_vram_type_t` enum to include new values and added the `AMDSMI` prefix
- Updated `amdsmi_status_t`: some enums were missing the `AMDSMI_STATUS` prefix
- Added the `AMDSMI_PROCESSOR_TYPE` prefix to `processor_type_t` enums
- Removed the fields structure definition in favor of an anonymous definition in `amdsmi_bdf_t`

- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Multiple structures used by APIs were changed for alignment unification. The `AMDSMI` prefix was added to the following enums:
- Added AMDSMI prefix to `amdsmi_container_types_t` enums
- Added AMDSMI prefix to `amdsmi_clk_type_t` enums
- Added AMDSMI prefix to `amdsmi_compute_partition_type_t` enums
- Added AMDSMI prefix to `amdsmi_memory_partition_type_t` enums
- Added AMDSMI prefix to `amdsmi_temperature_type_t` enums
- Added AMDSMI prefix to `amdsmi_fw_block_t` enums

- **Changed dpm_policy references to soc_pstate**.
The file structure referenced by dpm_policy changed to soc_pstate, and we have changed the APIs and CLI tool to be in line with the current structure. `amdsmi_get_dpm_policy()` and `amdsmi_set_dpm_policy()` are no longer valid; the new APIs are `amdsmi_get_soc_pstate()` and `amdsmi_set_soc_pstate()`. The CLI tool option has been changed from `--policy` to `--soc-pstate`.

- **Updated `amdsmi_get_gpu_board_info()` product_name to fall back to pciids**.
Previously, on devices without a FRU, we would not populate the product name in the `amdsmi_board_info_t` structure; now we fall back to the name listed in the pciids file, if available.

- **Updated CLI voltage curve command output**.
The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point, or outputs N/A for each curve point if not applicable:

```shell
GPU: 0
VOLTAGE_CURVE:
POINT_0_FREQUENCY: 872 Mhz
POINT_0_VOLTAGE: 736 mV
POINT_1_FREQUENCY: 1354 Mhz
POINT_1_VOLTAGE: 860 mV
POINT_2_FREQUENCY: 1837 Mhz
POINT_2_VOLTAGE: 1186 mV
```


- **Updated `amdsmi_get_gpu_board_info()`: larger structure sizes for `amdsmi_board_info_t`**.
Updated sizes that work for retrieving relevant board information across AMD's
ASIC products. This requires users to update any ABIs using this structure.

Fixes

- **Fixed leftover mutex deadlock when running multiple instances of the CLI tool**.
When running `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either clearing the mutex in /dev/shm or rebooting the machine.

- **Fixed multiple processes not being registered in `amd-smi process` with JSON and CSV formats**.
Multiple process outputs in the CLI tool were not being registered correctly. The JSON output did not handle multiple processes and is now in a new, valid JSON format:

```json
[
    {
        "gpu": 0,
        "process_list": [
            {
                "process_info": {
                    "name": "TransferBench",
                    "pid": 420157,
                    "mem_usage": {
                        "value": 0,
                        "unit": "B"
                    }
                }
            },
            {
                "process_info": {
                    "name": "rvs",
                    "pid": 420315,
                    "mem_usage": {
                        "value": 0,
                        "unit": "B"
                    }
                }
            }
        ]
    }
]
```


- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**.
Throttle status may work for older ASICs, but will be replaced with PVIOL and TVIOL metrics for future ASIC support. It remains a field in the gpu_metrics API and in `amd-smi metric --power`.

- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**.
Previously, a partial failure to retrieve character strings returned
garbage output to users of the API. This fix populates as many values as possible;
for any failure(s) found along the way, `\0` is written to the `amdsmi_board_info_t`
data members which cannot be populated, ensuring empty char string values.

- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**.
The parsing of `pp_od_clk_voltage` was not dynamic enough to handle the dropping of voltage curve support on MI series cards. This propagates down to correcting the CLI's `amd-smi metric --voltage-curve` output to N/A when the voltage curve is not enabled.

Known Issues

- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not currently work and will be supported in a future release**.

6.1.2

Additions

- **Added process isolation and clean shader APIs and CLI commands**.
Added CLI commands and APIs to address the LeftoverLocals security issue, allowing clearing of SRAM data and setting process isolation on a per-GPU basis. New APIs:
- `amdsmi_get_gpu_process_isolation()`
- `amdsmi_set_gpu_process_isolation()`
- `amdsmi_set_gpu_clear_sram_data()`

- **Added `MIN_POWER` to output of `amd-smi static --limit`**.
This change helps users identify the range within which they can change the power cap of the GPU, and clarifies whether a device supports power capping (also known as overdrive). See `amd-smi set -g all --power-cap <value in W>` or `amd-smi reset -g all --power-cap`.

```shell
$ amd-smi static --limit
GPU: 0
LIMIT:
MAX_POWER: 203 W
MIN_POWER: 0 W
SOCKET_POWER: 203 W
SLOWDOWN_EDGE_TEMPERATURE: 100 °C
SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C
SLOWDOWN_VRAM_TEMPERATURE: 100 °C
SHUTDOWN_EDGE_TEMPERATURE: 105 °C
SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C
SHUTDOWN_VRAM_TEMPERATURE: 105 °C

GPU: 1
LIMIT:
MAX_POWER: 213 W
MIN_POWER: 213 W
SOCKET_POWER: 213 W
SLOWDOWN_EDGE_TEMPERATURE: 109 °C
SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C
SLOWDOWN_VRAM_TEMPERATURE: 100 °C
SHUTDOWN_EDGE_TEMPERATURE: 114 °C
SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C
SHUTDOWN_VRAM_TEMPERATURE: 105 °C
```


Optimizations

- **Updated `amd-smi monitor --pcie` output**.
The source for the PCIe bandwidth monitor output was a legacy file we no longer support, and it was causing delays within the monitor command. The output no longer uses TX/RX but instantaneous bandwidth from gpu_metrics instead; updated output:

```shell
$ amd-smi monitor --pcie
GPU PCIE_BW
0 26 Mb/s
```


- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**.
`amdsmi_get_power_cap_info` now returns uW, as originally reflected by the driver. Previously it returned values in W; this conflicted with our set functions and modified the values retrieved from the driver. We decided to keep the values returned from the driver untouched (in their original units, uW); the CLI then converts to watts, as previously done, with no changes there. Additionally, the driver updated the minimum power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).
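
A short Python sketch of the unit change; the dict keys are assumptions based on the field names used in these notes.

```python
# Sketch: power cap values now arrive in microwatts (uW).
# The dict keys are assumptions based on field names in these notes.
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    info = amdsmi.amdsmi_get_power_cap_info(handle)
    cap_uw = info["power_cap"]
    print(f"power cap: {cap_uw} uW ({cap_uw / 1_000_000:.0f} W)")
finally:
    amdsmi.amdsmi_shut_down()
```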

- **Updated Python Library return types for `amdsmi_get_gpu_memory_reserved_pages` & `amdsmi_get_gpu_bad_page_info`**.
Previously these calls returned the string "No bad pages found." if no pages were found; now they only return the list type, which can be empty.
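
A brief sketch of the new contract: callers test for an empty list rather than a sentinel string.

```python
# Sketch: bad-page queries now always return a list (possibly empty)
# instead of the string "No bad pages found.".
import amdsmi

amdsmi.amdsmi_init()
try:
    handle = amdsmi.amdsmi_get_processor_handles()[0]
    bad_pages = amdsmi.amdsmi_get_gpu_bad_page_info(handle)
    if not bad_pages:
        print("no bad pages")
    else:
        print(f"{len(bad_pages)} bad page record(s)")
finally:
    amdsmi.amdsmi_shut_down()
```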

- **Updated `amd-smi metric --ecc-blocks` output**.
The ECC blocks argument was outputting blocks without counters available; the filtering was updated to show only blocks for which counters are available:

```shell
$ amd-smi metric --ecc-block
GPU: 0
ECC_BLOCKS:
UMC:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
SDMA:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
GFX:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
MMHUB:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
PCIE_BIF:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
HDP:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
XGMI_WAFL:
CORRECTABLE_COUNT: 0
UNCORRECTABLE_COUNT: 0
DEFERRED_COUNT: 0
```


- **Removed `amdsmi_get_gpu_process_info` from the Python library**.
`amdsmi_get_gpu_process_info` was removed from the C library in an earlier build, but the API was still present in the Python interface.

Fixes

- **Fixed `amd-smi metric --power`; it now provides power output for Navi2x/Navi3x/MI1x**.
These systems use an older version of gpu_metrics in amdgpu. This fix only updates what the CLI outputs;
there is no change in any of our APIs.

```shell
$ amd-smi metric --power
GPU: 0
POWER:
SOCKET_POWER: 11 W
GFX_VOLTAGE: 768 mV
SOC_VOLTAGE: 925 mV
MEM_VOLTAGE: 1250 mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED

GPU: 1
POWER:
SOCKET_POWER: 17 W
GFX_VOLTAGE: 781 mV
SOC_VOLTAGE: 806 mV
MEM_VOLTAGE: 1250 mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
```


- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**.
The fix required `amdsmi_get_power_cap_info` to return uW, as originally reflected by the driver. Previously it returned values in W; this conflicted with our set functions and modified the values retrieved from the driver. We decided to keep the values returned from the driver untouched (in their original units, uW); the CLI then converts to watts, as previously done, with no changes there. Additionally, the driver updated the minimum power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).

- **Fixed Python interface calls `amdsmi_get_gpu_memory_reserved_pages` & `amdsmi_get_gpu_bad_page_info`**.
Previously, Python interface calls to populate bad pages resulted in a `ValueError: NULL pointer access`. This also fixes the `bad-pages` CLI subcommand.

Known Issues

- N/A
