Changes
- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**.
Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery:
- `uint64_t accumulation_counter` - used for all throttled calculations
- `uint64_t prochot_residency_acc` - Processor hot accumulator
- `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations)
- `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations)
- `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator
- `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator
- `uint16_t num_partition` - corresponds to the current total number of partitions
- `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with current GPU, provides gfx busy & accumulators, jpeg, and decoder (VCN) engine utilizations
- `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphic engine utilization (%)
- `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - jpeg engine utilization (%)
- `uint16_t vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%)
- `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - graphic engine utilization accumulated (%)
- `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter
- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**.
***Only available for MI300+ ASICs.***
Users can now retrieve violation status' through either our Python or C++ APIs. Additionally, we have
added capability to view these outputs conviently through `amd-smi metric --throttle` and `amd-smi monitor --violation`.
Example outputs are listed below (below is for reference, output is subject to change):
shell
$ amd-smi metric --throttle
GPU: 0
THROTTLE:
ACCUMULATION_COUNTER: 3808991
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 585613
SOCKET_THERMAL_ACCUMULATED: 2190
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_STATUS: NOT ACTIVE
PPT_VIOLATION_STATUS: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_STATUS: NOT ACTIVE
VR_THERMAL_VIOLATION_STATUS: NOT ACTIVE
HBM_THERMAL_VIOLATION_STATUS: NOT ACTIVE
PROCHOT_VIOLATION_ACTIVITY: 0 %
PPT_VIOLATION_ACTIVITY: 0 %
SOCKET_THERMAL_VIOLATION_ACTIVITY: 0 %
VR_THERMAL_VIOLATION_ACTIVITY: 0 %
HBM_THERMAL_VIOLATION_ACTIVITY: 0 %
GPU: 1
THROTTLE:
ACCUMULATION_COUNTER: 3806335
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 586332
SOCKET_THERMAL_ACCUMULATED: 18010
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_STATUS: NOT ACTIVE
PPT_VIOLATION_STATUS: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_STATUS: NOT ACTIVE
VR_THERMAL_VIOLATION_STATUS: NOT ACTIVE
HBM_THERMAL_VIOLATION_STATUS: NOT ACTIVE
PROCHOT_VIOLATION_ACTIVITY: 0 %
PPT_VIOLATION_ACTIVITY: 0 %
SOCKET_THERMAL_VIOLATION_ACTIVITY: 0 %
VR_THERMAL_VIOLATION_ACTIVITY: 0 %
HBM_THERMAL_VIOLATION_ACTIVITY: 0 %
...
shell
$ amd-smi monitor --violation
GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL
0 0 % 0 % 0 % 0 % 0 %
1 0 % 0 % 0 % 0 % 0 %
2 0 % 0 % 0 % 0 % 0 %
3 0 % 0 % 0 % 0 % 0 %
4 0 % 0 % 0 % 0 % 0 %
5 0 % 0 % 0 % 0 % 0 %
6 0 % 0 % 0 % 0 % 0 %
7 0 % 0 % 0 % 0 % 0 %
8 0 % 0 % 0 % 0 % 0 %
9 0 % 0 % 0 % 0 % 0 %
10 0 % 0 % 0 % 0 % 0 %
11 0 % 0 % 0 % 0 % 0 %
12 0 % 0 % 0 % 0 % 0 %
13 0 % 0 % 0 % 0 % 0 %
14 0 % 0 % 0 % 0 % 0 %
15 0 % 0 % 0 % 0 % 0 %
...
- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`**.
***Partition specific features are only available on MI300+ ASICs***
Users can now retrieve graphic utilization statistic on a per-XCP (per-partition) basis. Here all XCP activities will be listed,
but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`.
Example outputs are listed below (below is for reference, output is subject to change):
shell
$ amd-smi metric --usage
GPU: 0
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
GPU: 1
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
...
- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value**.
***Feature is only available on MI300+ ASICs***
Users can now retrieve both through `amdsmi_get_pcie_info()` which has an updated structure:
C
typedef struct {
...
struct pcie_metric_ {
uint16_t pcie_width; //!< current PCIe width
uint32_t pcie_speed; //!< current PCIe speed in MT/s
uint32_t pcie_bandwidth; //!< current instantaneous PCIe bandwidth in Mb/s
uint64_t pcie_replay_count; //!< total number of the replays issued on the PCIe link
uint64_t pcie_l0_to_recovery_count; //!< total number of times the PCIe link transitioned from L0 to the recovery state
uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link
uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device
uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver
uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter
uint64_t reserved[12];
} pcie_metric;
uint64_t reserved[32];
} amdsmi_pcie_info_t;
- Example outputs are listed below (below is for reference, output is subject to change):
shell
$ amd-smi metric --pcie
GPU: 0
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0
GPU: 1
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0
...
- **Updated BDF commands to look use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**.
This aligns BDF output with ROCm SMI.
See below for overview as seen from `rsmi_dev_pci_id_get()` now provides partition ID. See API for better detail. Previously these bits were reserved bits (right before domain) and partition id was within function.
- bits [63:32] = domain
- bits [31:28] = partition id
- bits [27:16] = reserved
- bits [15: 0] = pci bus/device/function
- **Moved python tests directory path install location**.
- `/opt/<rocm-path>/share/amd_smi/pytest/..` to `/opt/<rocm-path>/share/amd_smi/tests/python_unittest/..`
- On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed.
- Removed pytest dependency, our python testing now only depends on the unittest framework.
- **Added retrieving a set of GPUs that are nearest to a given device at a specific link type level**.
- Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries.
- **Added more supported utilization count types to `amdsmi_get_utilization_count()`**.
- **Added `amd-smi set -L/--clk-limit ...` command**.
Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency.
- **Added unittest functionality to test amdsmi API calls in Python**.
- **Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`**.
- Changes propagate forwards into the python interface as well, however we are maintaing backwards compatibility and keeping the `power` field in the python API until ROCm 6.4.
- **Added GPU memory overdrive percentage to `amd-smi metric -o`**.
- Added `amdsmi_get_gpu_mem_overdrive_level()` function to amd-smi C and Python Libraries.
- **Added retrieving connection type and P2P capabilities between two GPUs**.
- Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries.
- Added retrieving P2P link capabilities to CLI `amd-smi topology`.
shell
$ amd-smi topology -h
usage: amd-smi topology [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
[-g GPU [GPU ...]] [-a] [-w] [-o] [-t] [-b]
If no GPU is specified, returns information for all GPUs on the system.
If no topology argument is provided all topology information will be displayed.
Topology arguments:
-h, --help show this help message and exit
-g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices:
ID: 0 | BDF: 0000:0c:00.0 | UUID: <redacted>
ID: 1 | BDF: 0000:22:00.0 | UUID: <redacted>
ID: 2 | BDF: 0000:38:00.0 | UUID: <redacted>
ID: 3 | BDF: 0000:5c:00.0 | UUID: <redacted>
ID: 4 | BDF: 0000:9f:00.0 | UUID: <redacted>
ID: 5 | BDF: 0000:af:00.0 | UUID: <redacted>
ID: 6 | BDF: 0000:bf:00.0 | UUID: <redacted>
ID: 7 | BDF: 0000:df:00.0 | UUID: <redacted>
all | Selects all devices
-a, --access Displays link accessibility between GPUs
-w, --weight Displays relative weight between GPUs
-o, --hops Displays the number of hops between GPUs
-t, --link-type Displays the link type between GPUs
-b, --numa-bw Display max and min bandwidth between nodes
-c, --coherent Display cache coherant (or non-coherant) link capability between nodes
-n, --atomics Display 32 and 64-bit atomic io link capability between nodes
-d, --dma Display P2P direct memory access (DMA) link capability between nodes
-z, --bi-dir Display P2P bi-directional link capability between nodes
Command Modifiers:
--json Displays output in JSON format (human readable by default).
--csv Displays output in CSV format (human readable by default).
--file FILE Saves output into a file on the provided path (stdout by default).
--loglevel LEVEL Set the logging level from the possible choices:
DEBUG, INFO, WARNING, ERROR, CRITICAL
shell
$ amd-smi topology -cndz
CACHE COHERANCY TABLE:
0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0
0000:0c:00.0 SELF C NC NC C C C NC