What's Changed
------------
- Merged PR 3160: [security] bump onnx to 1.13.0. [Lisa Ong]
This resolves a high-severity Dependabot alert.
- Merged PR 3157: Dynamic split dim tests. [Mason Remy]
- Merged PR 3158: Do not unroll the profiling ops when vectorization
enabled. [Denny Sun]
When vectorization is enabled, the ops in the kernel get unrolled. For example, without this fix a timer added to the inner kernel ends up with 8 copies, which is incorrect.
- Merged PR 3153: Fix the lowering issue of the profiling ops. [Denny
Sun]
With this fix, kernel-level profiling support works end to end. Here is an example of how to use it:
    @tile_nest.iteration_logic
    def _tile_logic():
        EnterProfileRegion("pack_b_fn_outer")
        pack_b_fn(B, B_temp, j, k)
        ExitProfileRegion("pack_b_fn_outer")

        EnterProfileRegion("matmul_fn_outer")
        matmul_fn(A, B, C, B_temp, i, j, k)
        ExitProfileRegion("matmul_fn_outer")

        PrintProfileResults()
The timings printed out look like:
matmul_fn_outer 1 0.000100 ms
pack_b_fn_outer 1 0.000400 ms
matmul_fn_outer 2 0.000400 ms
pack_b_fn_outer 2 0.001200 ms
matmul_fn_outer 3 0.000600 ms
pack_b_fn_outer 3 0.001700 ms
matmul_fn_outer 4 0.000800 ms
pack_b_fn_outer 4 0.002300 ms
matmul_fn_outer 5 0.000900 ms
pack_b_fn_outer 5 0.002700 ms
matmul_fn_outer 6 0.001200 ms
pack_b_fn_outer 6 0.003200 ms
matmul_fn_outer 7 0.001500 ms
pack_b_fn_outer 7 0.003700 ms
matmul_fn_outer 8 0.001700 ms
pack_b_fn_outer 8 0.004000 ms
matmul_fn_outer 9 0.002000 ms
pack_b_fn_outer 9 0.004500 ms
matmul_fn_outer 10 0.002200 ms
pack_b_fn_outer 10 0.004800 ms
matmul_fn_outer 11 0.002400 ms
pack_b_fn_outer 11 0.005300 ms
matmul_fn_outer 12 0.002700 ms
pack_b_fn_outer 12 0.006500 ms
matmul_fn_outer 13 0.003100 ms
pack_b_fn_outer 13 0.007400 ms
matmul_fn_outer 14 0.003400 ms
pack_b_fn_outer 14 0.007800 ms
matmul_fn_outer 15 0.003700 ms
pack_b_fn_outer 15 0.008300 ms
matmul_fn_outer 16 0.004000 ms
pack_b_fn_outer 16 0.008800 ms
matmul_fn_outer 17 0.004400 ms
pack_b_fn_outer 17 0.009199 ms
matmul_fn_outer 18 0.004800 ms
pack_b_fn_outer 18 0.009599 ms
matmul_fn_outer 19 0.005100 ms
pack_b_fn_outer 19 0.010099 ms
matmul_fn_outer 20 0.005400 ms
pack_b_fn_outer 20 0.010599 ms
matmul_fn_outer 21 0.006000 ms
pack_b_fn_outer 21 0.011299 ms
matmul_fn_outer 22 0.006300 ms
pack_b_fn_outer 22 0.011899 ms
matmul_fn_outer 23 0.006500 ms
pack_b_fn_outer 23 0.012299 ms
matmul_fn_outer 24 0.006701 ms
pack_b_fn_outer 24 0.012699 ms
matmul_fn_outer 25 0.006901 ms
pack_b_fn_outer 25 0.013099 ms
matmul_fn_outer 26 0.007101 ms
pack_b_fn_outer 26 0.013399 ms
matmul_fn_outer 27 0.007300 ms
pack_b_fn_outer 27 0.013799 ms
matmul_fn_outer 28 0.007401 ms
pack_b_fn_outer 28 0.014100 ms
matmul_fn_outer 29 0.007601 ms
pack_b_fn_outer 29 0.014600 ms
matmul_fn_outer 30 0.007801 ms
pack_b_fn_outer 30 0.015000 ms
matmul_fn_outer 31 0.007901 ms
pack_b_fn_outer 31 0.015399 ms
matmul_fn_outer 32 0.008101 ms
pack_b_fn_outer 32 0.015699 ms
matmul_fn_outer 33 0.008301 ms
pack_b_fn_outer 33 0.015999 ms
matmul_fn_outer 34 0.008601 ms
pack_b_fn_outer 34 0.016...
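Output in this shape is easy to post-process. The following is a minimal sketch, not part of Accera, that assumes each line has the form `<region> <count> <time> ms` and that the count and time are running values, so the last line seen for a region is its final reading. That interpretation is inferred from the sample above; `parse_profile_output` is a hypothetical helper name.

```python
import re
from typing import Dict, Tuple

# Matches lines such as "matmul_fn_outer 3 0.000600 ms".
# Truncated or unrelated lines do not match and are skipped.
_LINE = re.compile(r"^(\S+)\s+(\d+)\s+(\d+(?:\.\d+)?)\s+ms$")


def parse_profile_output(text: str) -> Dict[str, Tuple[int, float]]:
    """Return {region: (count, time_ms)}, keeping the last line per region."""
    results: Dict[str, Tuple[int, float]] = {}
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match is None:
            continue
        name, count, ms = match.group(1), int(match.group(2)), float(match.group(3))
        results[name] = (count, ms)  # later lines overwrite earlier ones
    return results


sample = """\
matmul_fn_outer 1 0.000100 ms
pack_b_fn_outer 1 0.000400 ms
matmul_fn_outer 2 0.000400 ms
pack_b_fn_outer 2 0.001200 ms
"""

summary = parse_profile_output(sample)
for region, (count, ms) in sorted(summary.items()):
    print(f"{region}: count={count}, time={ms:.6f} ms")
```

Keeping only the last reading per region avoids double counting if the values are cumulative. If they turn out to be per-call timings, sum the third column instead.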
- Merged PR 3152: [nfc] [test] Skip fast_exp mlas tests on unsupported
Aarch64. [Lisa Ong]
These tests generate `llvm.x86.avx.max.ps.256`, an x86 AVX intrinsic that is not supported on non-x86 processors such as the Apple M1:
%28 = load <8 x float>, <8 x float>* %27, align 4, !dbg !19
%29 = call <8 x float> @llvm.x86.avx.max.ps.256(<8 x float> %28, <8 x float> <float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000, float 0xC0561814A0000000>), !dbg !20
%30 = call <8 x float> @llvm.fmuladd.v8f32(<8 x float> %29, <8 x float> <float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000, float 0x3FF7154760000000>, <8 x float> <float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000>), !dbg !21
%31 = fsub <8 x float> %30, <float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000, float 0x4168000000000000>, !dbg !22
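The release notes do not show how the tests are skipped. One common way to gate a test on the host architecture is sketched below, assuming a `unittest`-based suite. The class name, test name, and the `is_x86` helper are hypothetical, and the set of machine names is an assumption about what `platform.machine()` reports on x86-64 hosts.

```python
import platform
import unittest

# Machine strings assumed to indicate an x86-64 host (compared in lowercase).
X86_64_MACHINES = {"x86_64", "amd64"}


def is_x86(machine: str = "") -> bool:
    """Return True if the given (or current) machine name looks like x86-64."""
    name = (machine or platform.machine()).lower()
    return name in X86_64_MACHINES


class FastExpMlasTests(unittest.TestCase):  # hypothetical test class
    @unittest.skipUnless(is_x86(), "emits x86 AVX intrinsics; needs an x86-64 host")
    def test_fast_exp_mlas(self):
        # Placeholder body: a real test would build and run the kernel here.
        self.assertTrue(is_x86())
```

On an Aarch64 machine such as an Apple M1, `platform.machine()` typically returns `arm64` or `aarch64`, so the test is reported as skipped rather than failing at code generation.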
**Full Changelog**: https://github.com/microsoft/Accera/compare/v1.2.24...v1.2.25