Author: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
Date: Sat Oct 14 02:05:41 2023 -0500
Fixed HPX barrier synchronization (#783)
Details:
- Fixed HPX barrier synchronization. HPX was hanging at larger core
counts because BLIS was using non-HPX synchronization primitives. When
running under the HPX runtime, only HPX synchronization primitives
should be used. Hence, a C-style wrapper, hpx_barrier_t, is introduced
to perform HPX barrier operations.
- Replaced hpx::for_loop with hpx::futures. Using hpx::for_loop with
hpx::barrier when n_threads exceeds the actual hardware thread count
causes synchronization issues that make HPX hang. This can be avoided
by using hpx::futures, which are lightweight, robust, and scalable.
commit 8fff1e31da1c87e46cacec112b0ac280ab47cd8b
Author: Field G. Van Zee <fgvanzee@gmail.com>
Date: Thu Oct 12 15:51:41 2023 -0500
Fixed bug in sup threshold registration. (#782)
Details:
- Fixed a bug that resulted in BLIS non-deterministically calling the
gemmsup handler, irrespective of the thresholds that are registered
via bli_cntx_set_blkszs().
- Deep dive: In bli_cntx_init_ref.c, the default values for the gemmsup
thresholds (BLIS_[MNK]T blocksizes) were being set to zero so that no
operation ever matched the criteria for gemmsup (unless specific sup
thresholds are registered). HOWEVER, these thresholds are set via
bli_cntx_set_blkszs() which calls bli_blksz_copy_if_pos(), which was
only copying the thresholds into the gks' cntx_t if the values were
strictly positive. Thus, the zero values passed into
bli_cntx_set_blkszs() were being ignored and those threshold slots
within the gks were left uninitialized. The upshot of this is that the
reference gemmsup handler was being called for gemm problems
essentially at random (and as it turns out, very rarely the reference
gemmsup implementation would encounter a divide-by-zero error).
- The problem was fixed by changing bli_blksz_copy_if_pos() so that it
copies values that are non-negative (values >= 0 instead of > 0). The
function was also renamed to bli_blksz_copy_if_nonneg().
- Also needed to standardize use of -1 as the sole value to embed into
blksz_t structs as a signal to bli_cntx_set_blkszs() to *not* register
a value for that slot (and instead let whatever existing values
remain). This required updates to the bli_cntx_init_*() functions for
bgq, cortexa9, knc, penryn, power7, and template subconfigs, as some
of these codes were using 0 instead of -1.
- Fixes #781. Thanks to Devin Matthews for identifying, diagnosing, and
proposing a fix for this issue.
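
Illustration of the copy semantics described above. This is a minimal
sketch, not the BLIS source: my_blksz_t and my_blksz_copy_if_nonneg()
are hypothetical stand-ins for blksz_t and bli_blksz_copy_if_nonneg(),
and the four-slot layout is an assumption made for brevity.

```c
/* Hypothetical stand-in for a blocksize object: one value per datatype.
   This is NOT the real blksz_t layout. */
enum { MY_N_DT = 4 };
typedef struct { long v[MY_N_DT]; } my_blksz_t;

/* Copy each value from src into dst only if it is non-negative.
   A value of -1 means "leave the existing dst value alone". */
void my_blksz_copy_if_nonneg(const my_blksz_t* src, my_blksz_t* dst)
{
    for (int i = 0; i < MY_N_DT; ++i)
    {
        /* A "> 0" test here would silently skip zero thresholds,
           which is the behavior the commit above fixes. */
        if (src->v[i] >= 0)
            dst->v[i] = src->v[i];
    }
}
```

With this rule, a zero threshold is registered like any other value, and
only -1 leaves a slot untouched.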
commit 1e264a42474b535431768ef925bbd518412d392e
Author: Abhishek Bagusetty <59661409+abagusetty@users.noreply.github.com>
Date: Mon Oct 2 18:29:46 2023 -0500
Update zen3 subconfig to support NVHPC compilers. (#779)
Details:
- Parse $(CC_VENDOR) values of "nvc" in 'zen3' make_defs.mk file.
- Minor refactor to accommodate above edit.
- CREDITS file update.
commit c2099ed2519dcac8ee421faf999b36e1c2260be7
Author: Field G. Van Zee <fgvanzee@gmail.com>
Date: Mon Oct 2 14:56:48 2023 -0500
Fixed brokenness when sba is disabled. (#777)
Details:
- Previously, disabling the sba via --disable-sba-pools resulted in a
segfault due to a sanity-check-triggering abort(). The problem was
that the sba, as currently used in the l3 thread decorators, did not
yet (fully) support pools being disabled. The solution entailed
creating a wrapper function, bli_sba_array_elem(), which either calls
bli_apool_array_elem() (when sba pools are enabled at configure time)
or returns a NULL sba_pool pointer (when sba pools are disabled), and
calling bli_sba_array_elem() in place of bli_apool_array_elem(). Note
that the NULL pointer returned by bli_sba_array_elem() when the sba
pools are disabled does no harm since in that situation the pointer
goes unreferenced when acquiring and releasing small blocks. Thanks to
John Mather for reporting this bug.
- Guarded the bodies of bli_sba_init() and bli_sba_finalize() with
#ifdef BLIS_ENABLE_SBA_POOLS. I don't think this was actually necessary
to fix the aforementioned bug, but it seems like good practice.
- Moved the code in bli_l3_thrinfo_create() that checked that the array*
pointer is non-NULL before calling bli_sba_array_elem() (previously
bli_apool_array_elem()) into the definition of bli_sba_array_elem().
- Renamed various instances of 'pool' variables and function parameters
to 'sba_pool' to emphasize what kind of pool it represents.
- Whitespace changes.
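
Illustration of the wrapper pattern described above. This is a sketch
under stated assumptions: the my_* types, functions, and the
MY_ENABLE_SBA_POOLS macro are hypothetical simplifications, not the
actual BLIS definitions or signatures.

```c
#include <stddef.h>

/* Hypothetical, simplified types; not the real BLIS pool/array types. */
typedef struct { int n_blocks; } my_pool_t;
typedef struct { my_pool_t* elems; size_t n; } my_array_t;

/* Stand-in for the underlying array lookup. */
my_pool_t* my_apool_array_elem(size_t i, my_array_t* array)
{
    return &array->elems[i];
}

/* Wrapper: performs the lookup only when pools are enabled at build
   time. The NULL check on 'array' lives here rather than in callers. */
my_pool_t* my_sba_array_elem(size_t i, my_array_t* array)
{
#ifdef MY_ENABLE_SBA_POOLS
    if (array == NULL) return NULL;
    return my_apool_array_elem(i, array);
#else
    /* Pools disabled: callers never dereference the result. */
    (void)i;
    (void)array;
    return NULL;
#endif
}
```

Callers use my_sba_array_elem() unconditionally, so the enabled and
disabled builds share one code path.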
commit 37ca4fd168525a71937d16aaf6a13c0de5b4daef
Author: Field G. Van Zee <fgvanzee@gmail.com>
Date: Thu Sep 28 16:37:57 2023 -0500
Implemented [cz]symv_(), [cz]syr_(), [cz]rot_(). (#778)
Details:
- Expanded existing BLAS compatibility APIs to provide interfaces to
[cz]symv_(), [cz]syr_(). This was easy since those operations were
already implemented natively in BLIS; the APIs were previously
omitted only because they were not formally part of the BLAS.
- Implemented [cz]rot_() by feeding code from LAPACK 3.11 through
f2c.
- Thanks to James Foster for pointing out that LAPACK contains these
additional symbols, which prompted these additions, as well as for
testing the [cz]rot_() functions from Julia's test infrastructure.
- CREDITS file update.
commit 6f412204004666abac266409a203cb635efbabf3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Tue Sep 26 18:00:54 2023 -0500
Added 'altra', 'altramax' subconfigs. (#775)
Details:
- Forward-ported 'altra' and 'altramax' subconfigurations from the
older 'stable' branch lineage [1]. These subconfigs primarily target
the Ampere Altra and AltraMax (ARM) processors. They also contain
"QuickStart" directories with information and scripts to help
use BLIS on these microarchitectures. Thanks to Jeff Diamond and
Leick Robinson for developing these subconfigs and resources.
- Updated kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c according to
changes in the 'stable' lineage, mostly related to re-enabling of
assembly code branches that target general stride IO.
[1] Note that the 'stable' branch is being used to make sure that more
recent commits do not introduce unreasonable performance
regressions. As such, the name should be interpreted as shorthand
for "performance stable," not "API stable."
commit a4a63295b96ed5b32f4df6477d24db07bf431202
Author: Srinivas Yadav <43375352+srinivasyadav18@users.noreply.github.com>
Date: Tue Sep 26 17:58:38 2023 -0500
Fixes to HPX runtime code path. (#773)
Details:
- Fixed the hpx::for_each invocation and replaced it with hpx::for_loop.
The HPX runtime was initialized using hpx::start, but the hpx::for_each
function was being called on a non-HPX runtime (i.e., the standard BLIS
runtime with a single main thread). To run hpx::for_each on the HPX
runtime correctly, the code now uses
hpx::run_as_hpx_thread(func, args...).
- Replaced hpx::for_each with hpx::for_loop, which eliminates use of
hpx::util::counting_iterator.
- Employ hpx::execution::chunk_size(1) to make sure that a thread
resides on a particular core.
- Replaced hpx::apply() with updated version hpx::post().
- Initialize tdata->id to 0 in libblis.c, as it is the main thread
and is needed for writing results to the output file.
- By default, if not specified, the HPX runtime uses all N threads/cores
available in the system. To use only n_threads of those N threads, we
use hpx::execution::experimental::num_cores(n_threads).
commit c6546c1131b1ddd45ef13f9f2b620ce2e955dbf8
Author: John Mather <54645798+jmather-sesi@users.noreply.github.com>
Date: Wed Sep 20 13:41:07 2023 -0400
Fixed broken link in Multithreading.md. (#774)
Details:
- Replaced 404'd link in docs/Multithreading.md with an archive from
The Wayback Machine.
- CREDITS file update.
commit 6dcf7666eff14348e82fbc2750be4b199321e1b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sun Aug 27 14:18:57 2023 -0500
Revamped bli_init() to use TLS where feasible. (#767)
Details:
- Revamped bli_init_apis() and bli_finalize_apis() to use separate
bli_pthread_switch_t objects for each of the five sub-API init
functions, with the objects for the 'ind' and 'rntm' sub-APIs being
declared with BLIS_THREAD_LOCAL. This allows some APIs to be treated
as thread-local and the rest as thread-shared. Thanks to Edward Smyth
for requesting application thread-specific rntm_t structs, which
inspired this change.
- Combined bli_thread_init_from_env() and bli_pack_init_from_env() into
a new function, bli_rntm_init_rntm_from_env(), and placed the combined
code in bli_rntm.c inside of a new bli_rntm_init() function. Then
removed the (now empty) bli_pack_init() and _finalize() function defs.
- Deprecated bli_rntm_init() for the purposes of initializing a rntm_t
(temporarily preserving it as bli_rntm_clear() in a cpp-undefined code
block) so that the function name could be used for the aforementioned
bli_rntm_init() function.
- Updated libblis_test_pobj_create() in test_libblis.c to use a static
rntm_t initializer instead of the deprecated bli_rntm_init()
function-based option.
- Minor updates to docs/Multithreading.md, including removal of
bli_rntm_init() in the example of how to initialize rntm_t structs.
- Changed the return value of bli_gks_init(), bli_ind_init(),
bli_memsys_init(), bli_thread_init(), and bli_rntm_init() (and their
finalize() counterparts) from 'void' to 'int' so that those functions
match the function type expected by bli_pthread_switch_on()/_off().
Those init/finalize functions now return 0 to indicate success, which
is needed so that the switch actually changes state from off to on
and vice versa.
- Defined bli_thread_reset(), which copies the contents of the
global_rntm_at_init() struct into the global_rntm struct (for the
current application thread).
- Guard calls to bli_pthread_mutex_lock()/_unlock() in
- bli_pack_set_pack_a() and _pack_b()
- bli_rntm_init_from_global()
- bli_thread_set_ways()
- bli_thread_set_num_threads()
- bli_thread_set_thread_impl()
- bli_thread_reset()
- bli_l3_ind_oper_set_enable()
with #ifdef BLIS_DISABLE_TLS (since TLS precludes the possibility of
race conditions).
- In frame/base/bli_rntm.c, declare global_rntm, global_rntm_at_init,
and global_rntm_mutex as BLIS_THREAD_LOCAL so that separate
application threads can change the number of ways of BLIS parallelism
independently from one another.
- Access global_rntm only via a new private (not exported) function,
bli_global_rntm(). Defined a similar function for a rntm_t new to
this commit, global_rntm_at_init, which preserves the state of the
global rntm at initialization-time.
- In frame/3/bli_l3_ind.c, added a guard to the declaration of the
static variable oper_st_mutex with #ifdef BLIS_DISABLE_TLS so that the
mutex is omitted altogether when TLS is enabled (which prevents the
compiler from warning about an unused variable).
- Removed redundant code from bli_thread.c:
#ifdef BLIS_ENABLE_HPX
#include "bli_thread_hpx.h"
#endif
since this code is already present in bli_thread.h.
- Thanks to Minh Quan Ho for his review of and feedback on this commit.
- Comment updates.
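
Illustration of why the init/finalize functions return int. This is a
sketch under stated assumptions: my_switch_t, my_switch_on(), and
my_switch_off() are hypothetical simplifications of the
bli_pthread_switch_t mechanism, and the locking a real implementation
would need is omitted.

```c
/* Hypothetical on/off switch; a real one would also hold a mutex. */
typedef struct { int status; } my_switch_t; /* 0 = off, 1 = on */

/* Run init() only if the switch is off. The state changes to "on"
   only when init() reports success by returning 0. */
int my_switch_on(my_switch_t* sw, int (*init)(void))
{
    int r_val = 0;
    if (sw->status == 0)
    {
        r_val = init();
        if (r_val == 0) sw->status = 1;
    }
    return r_val;
}

/* Run fin() only if the switch is on; flip to "off" on success. */
int my_switch_off(my_switch_t* sw, int (*fin)(void))
{
    int r_val = 0;
    if (sw->status == 1)
    {
        r_val = fin();
        if (r_val == 0) sw->status = 0;
    }
    return r_val;
}

/* Example sub-API init/finalize functions that count their calls. */
static int my_n_init_calls = 0;
int my_rntm_init(void)     { ++my_n_init_calls; return 0; }
int my_rntm_finalize(void) { return 0; }
```

Declaring one such switch per sub-API, and marking some of them
thread-local, is what lets some sub-APIs initialize once per application
thread while others initialize once per process.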
commit fa6a9b24ae2ddbd5f30f657d46004843581c768c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sat Aug 19 12:44:34 2023 -0500
Fixed error when using common.mk from testsuite. (#768)
Details:
- Commit 2db31e0 (#755) inserted logic into common.mk that attempts to
preprocess build/detect/android/bionic.h to determine whether the
__BIONIC__ macro is defined (in which case -lrt should not be included
in LDFLAGS). However, the path to bionic.h was encoded without regard
to DIST_PATH, and so utilizing common.mk anywhere that isn't the top-
level directory (such as in the testsuite directory) resulted in a
compiler error:
gcc: error: build/detect/android/bionic.h: No such file or directory
gcc: fatal error: no input files
compilation terminated.
This commit adds a $(DIST_PATH) prefix to the path to bionic.h so that
it can be located from other applications' Makefiles that use BLIS's
makefile fragments.
commit 634e532c8dcce7383d96ba33276df65c656b2198
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Aug 9 21:54:49 2023 -0500
Set thrcomm timpl_t id inside init functions. (#766)
Details:
- Previously, the timpl_t id being used when a thrcomm_t is being
initialized was set within the bli_thrcomm_init() dispatch function
after the timpl_t-specific bli_thrcomm_init_*() function returned. But
it just occurred to me that each bli_thrcomm_init_*() function already
intrinsically knows its own timpl_t value. This commit shifts the
setting of the thrcomm_t.ti field into the corresponding
bli_thrcomm_init_*() function for each timpl_t type (e.g. single,
openmp, pthreads, hpx).
- Removed long-deprecated code dating back nearly 10 years.
- Whitespace changes.
- Comment updates.
commit 3cf17b4a91232709bc6a205b0e4d7ecc96579aa9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Aug 7 13:46:20 2023 -0500
Small fixes/improvements to docs/Multithreading.md. (#764)
Details:
- Added reminders that #include "blis.h" must be added to source files
in order to access BLIS API function prototypes. Thanks to Barry Smith
for suggesting this improvement.
- Fixed pre-existing typos.
- CREDITS file update.
commit dbc79812c390f812c7bf030bfcf87e947a1443c4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Jul 28 18:16:38 2023 -0500
CREDITS file update.
Details:
- Thanks to Igor Zhuravlov for PR #753 (commit 915daaa).
commit 915daaa43cd189c86d93d72cd249714f126e9425
Author: Igor Zhuravlov <zhuravlov.ip@ya.ru>
Date: Thu Jul 27 20:33:59 2023 +0000
Fix typos in docs + example code comments. (#753)
Details:
- Fixed various typos in API documentation in docs/BLIS*API.md and
comments in the source code examples within examples/?api/*.c.
commit 2db31e057e7e9c97fc60021b5ae72a01a48d7588
Author: Lee Killough <15950023+leekillough@users.noreply.github.com>
Date: Thu Jul 27 15:27:21 2023 -0500
Exclude -lrt on Android with Bionic libraries. (#755)
Details:
- Added build/detect/android/bionic.h header to test whether the
__BIONIC__ cpp macro is defined.
- In common.mk, only add -lrt to LDFLAGS when Bionic is not present.
- CREDITS file update.
commit 22ad8c1b752364784f320168b31995945ad84a59
Author: ct-clmsn <ct.clmsn@gmail.com>
Date: Thu Jul 27 16:23:29 2023 -0400
Small fixes to support hpx in the testsuite (#759)
Details:
- Minor changes to test_libblis.c to support hpx.
commit c91b41d022e33da82b3b06c82be047a29873d9b6
Author: Lee Killough <15950023+leekillough@users.noreply.github.com>
Date: Wed Jul 26 14:37:08 2023 -0500
Auto-detect the RISC-V ABI of the compiler and use -mabi= during RISC-V Builds (#750)
Details:
- Generate a build error if there is a 32/64-bit mismatch between the
RISC-V ABI or architecture and the BLIS configuration selected.
- Handle Q, Zicsr, ZiFencei, Zba, Zbb, Zbc, Zbs and Zfh extensions in
the RISC-V architecture auto-detection. ZiFencei and Zicsr are not
detectable with built-in RISC-V macros right now.
- ZiFencei is not important for BLIS because BLIS doesn't have
Just-In-Time compilation or self-modifying code, and Zicsr is implied
by the floating-point extensions, which are required for good
performance in BLIS.
- Move RISC-V autodetect header files to build/detect/riscv/.
commit a0b04e3c007f1207e5678bf20c07752906742fb7 (origin/aocl-blas, aocl-blas)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Jun 26 17:59:21 2023 -0500
Rewrote regen-symbols.sh (gen-libblis-symbols.sh). (#751)
Details:
- Wrote an alternative to regen-symbols.sh, gen-libblis-symbols.sh,
that generates a list of exported symbols from the monolithic blis.h
file rather than peeking inside of the shared object via nm. (This new
script lives in the 'build' directory and the older script has been
retired to build/old.) Special thanks to Devin Matthews for authoring
gen-libblis-symbols.sh.
- Added a 'symbols' target to the top-level Makefile which will refresh
build/libblis-symbols.def, with supporting changes to common.mk.
- Updates to build/libblis-symbols.def using the new symbol-generating
script.
commit 6b894c30b9bb2c2518848d74e4c8d96844f77f24
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Jun 12 17:22:44 2023 -0500
Rewrote/fixed broken tree barrier implementation.
Details:
- Rewrote the definition of bli_thrcomm_tree_barrier() so that it (a)
actually worked again, and (b) used atomics instead of a basic C99
spin loop. (Note that the conventional barrier implementation is
still enabled by default; the tree barrier must be toggled on
manually within the configuration.)
- Added an early return to the definition of bli_thrcomm_barrier() in
the cases where comm == NULL or comm->n_threads == 1.
- Reordered thread-related and thread-dependent header include
directives in blis.h so that the BLIS_TREE_BARRIER and
BLIS_TREE_BARRIER_ARITY macros, which would be defined in the target
configuration's bli_family_*.h file, would be included prior
to the inclusion of the thrcomm_t header that uses them.
- Changed the type of barrier_t.count from 'int' to 'dim_t'.
- Changed the type of barrier_t.signal from 'volatile int' to 'gint_t'.
- Special thanks to Leick Robinson for contributing these changes.
- Whitespace changes.
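
Illustration of a barrier built on atomics with the early return
described above. This is a sketch, not the BLIS code: it shows a flat
sense-reversing barrier rather than the tree barrier itself, it uses
C11 <stdatomic.h> as an assumption about the atomics involved, and
my_comm_t, my_comm_init(), and my_barrier() are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, simplified communicator; not the real thrcomm_t. */
typedef struct
{
    int        n_threads; /* number of threads sharing the barrier */
    atomic_int arrived;   /* threads that have reached the barrier */
    atomic_int sense;     /* flips each time the barrier opens     */
} my_comm_t;

void my_comm_init(my_comm_t* comm, int n_threads)
{
    comm->n_threads = n_threads;
    atomic_init(&comm->arrived, 0);
    atomic_init(&comm->sense, 0);
}

void my_barrier(my_comm_t* comm)
{
    /* Early return: there is nothing to synchronize. */
    if (comm == NULL || comm->n_threads == 1) return;

    /* Record the current sense before announcing arrival. */
    int my_sense = atomic_load_explicit(&comm->sense,
                                        memory_order_acquire);

    int n_arrived = atomic_fetch_add_explicit(&comm->arrived, 1,
                                              memory_order_acq_rel) + 1;

    if (n_arrived == comm->n_threads)
    {
        /* Last thread in: reset the counter, then flip the sense. */
        atomic_store_explicit(&comm->arrived, 0, memory_order_relaxed);
        atomic_fetch_xor_explicit(&comm->sense, 1, memory_order_release);
    }
    else
    {
        /* All other threads wait until the sense changes. */
        while (atomic_load_explicit(&comm->sense,
                                    memory_order_acquire) == my_sense)
            ; /* spin */
    }
}
```

A tree barrier applies the same arrive-and-release idea per node of a
tree, so that fewer threads contend on any single counter.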
commit d639554894b6252a86bd3164921bce6fbb9e3b5e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Jun 7 16:11:14 2023 -0500
Pad thrcomm_t fields to avoid false sharing.
Details:
- Inserted a cache line of padding between various fields of the
thrcomm_t and, in the case of the (presently defunct) tree barrier,
fields of the barrier_t. This additional padding ensures that these
fields, which both serve different purposes when performing a thread
barrier, are only accessed when needed (and not just due to their
spatial locality with their cache line neighbors).
- Added a new cpp macro constant, BLIS_CACHE_LINE_SIZE, to
bli_config_macro_defs. This new constant defines the size of a cache
line (in bytes) and defaults to 64.
- Special thanks to Leick Robinson for discovering this false sharing
issue and developing/submitting the patch.
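
Illustration of padding fields apart by a cache line. This is a sketch
with hypothetical names: MY_CACHE_LINE_SIZE mirrors the 64-byte default
mentioned above, and my_thrcomm_t is a simplified stand-in, not the
real thrcomm_t layout.

```c
#include <stddef.h>

/* Assumed cache line size in bytes (the commit's default is 64). */
#define MY_CACHE_LINE_SIZE 64

/* Hypothetical communicator with padding between the fields that
   different threads read and write during a barrier. */
typedef struct
{
    void* sent_object;
    char  pad0[MY_CACHE_LINE_SIZE];
    int   barrier_sense;
    char  pad1[MY_CACHE_LINE_SIZE];
    int   barrier_threads_arrived;
} my_thrcomm_t;

/* Distance in bytes between the two barrier fields. */
size_t my_barrier_field_gap(void)
{
    return offsetof(my_thrcomm_t, barrier_threads_arrived)
         - offsetof(my_thrcomm_t, barrier_sense);
}
```

Because the two barrier fields start at least one cache line apart, a
write to one cannot invalidate the line holding the other.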
commit 89b7863fc9a88903917deedc6a5ad9fd17f83713
Author: Devin Matthews <damatthews@smu.edu>
Date: Mon May 8 16:51:18 2023 -0500
Fix 1m enablement for herk/her2k/syrk/syr2k. (#743)
Details:
- Ever since 28b0982, herk, her2k, syrk, and syr2k have been implemented
in terms of the gemmt expert API. And since the decision of which
induced method to use (1m or native) is made *below* the level of the
expert API, executing any of {herk,her2k,syrk,syr2k} results in BLIS
checking the enablement status for gemmt.
- This commit applies a band-aid of sorts to this issue by modifying
bli_l3_ind_oper_get_enable() and bli_l3_ind_oper_set_enable() so that
any attempts to query or modify the internal enablement status for
herk, her2k, syrk, or syr2k instead does so for gemmt.
- This solution isn't perfect since, in theory, the user could enable 1m
for, say, herk but then disable it for syrk, and then be confused when
herk runs via native execution. But we don't anticipate that users
will modify 1m enablement at the operation level, and so in practice this
solution is likely fine for now.
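
Illustration of the redirection described above, including its caveat.
This is a sketch with hypothetical names: my_opid_t and the my_oper_*
functions stand in for the BLIS operation ids and for
bli_l3_ind_oper_get_enable()/bli_l3_ind_oper_set_enable(), whose real
signatures differ.

```c
/* Hypothetical operation ids; not the real BLIS opid_t values. */
typedef enum
{
    MY_GEMM, MY_GEMMT, MY_HERK, MY_HER2K, MY_SYRK, MY_SYR2K, MY_N_OPS
} my_opid_t;

/* Per-operation enablement status (0 = disabled, 1 = enabled). */
static int my_enabled[MY_N_OPS];

/* Operations implemented via gemmt share gemmt's status slot. */
static my_opid_t my_redirect(my_opid_t op)
{
    switch (op)
    {
        case MY_HERK:
        case MY_HER2K:
        case MY_SYRK:
        case MY_SYR2K:
            return MY_GEMMT;
        default:
            return op;
    }
}

void my_oper_set_enable(my_opid_t op, int status)
{
    my_enabled[my_redirect(op)] = status;
}

int my_oper_get_enable(my_opid_t op)
{
    return my_enabled[my_redirect(op)];
}
```

Enabling herk and then disabling syrk leaves herk disabled as well,
since both calls write the same gemmt slot.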
commit 138de3b3e88c5bf7d8718c45c88811771cf42db8
Author: Ajay Panyala <ajay.panyala@gmail.com>
Date: Sun May 7 13:01:38 2023 -0700
add nvhpc compiler support (#719)
Add detection of the NVIDIA nvhpc compiler (`nvc`) in `configure`, and adjust some warning options in `config.mk`. Currently, no specific options for `nvc` have been added in the relevant configurations so it may not be usable without further tweaks.
commit 0873c0f6ed03fea321d1631b3d1a385a306aa797
Author: Devin Matthews <damatthews@smu.edu>
Date: Sun May 7 14:03:19 2023 -0500
Consolidate INSERT_ macro sets via variadic macros. (#744)
Details:
- Consolidated INSERT_GENTFUNC_* (and corresponding GENTPROT) macro sets
using variadic macros (__VA_ARGS__), which means we no longer need a
different INSERT_ macro for each possible number of arguments the
macro might take. This change seems reasonable given that variadic
macros are a standard C99 feature and widely supported. I took care
not to use variadic macros where 0 variadic arguments are expected
since that is a non-standard extension.
- Added pre-typecast parentheses to arithmetic expressions in printf()
statements in bli_thread_range_tlb.c.
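
Illustration of consolidating per-arity macros with __VA_ARGS__. This
is a sketch with hypothetical names: MY_GENTFUNC and
MY_INSERT_GENTFUNC_REAL only mimic the shape of the GENTFUNC and
INSERT_GENTFUNC_* macros and are not the BLIS definitions.

```c
/* Hypothetical generator macro: defines ch_opname(x) = x * factor. */
#define MY_GENTFUNC(ctype, ch, opname, factor) \
    ctype ch ## _ ## opname(ctype x) { return x * (ctype)(factor); }

/* One variadic INSERT macro forwards however many trailing arguments
   the generator takes, so no per-arity variants are needed. It is
   always invoked with at least one variadic argument, since passing
   zero is not standard C99. */
#define MY_INSERT_GENTFUNC_REAL(...) \
    MY_GENTFUNC(float,  s, __VA_ARGS__) \
    MY_GENTFUNC(double, d, __VA_ARGS__)

/* Instantiates s_twice() and d_twice(). */
MY_INSERT_GENTFUNC_REAL(twice, 2)
```

A generator that takes more trailing arguments can reuse the same
INSERT macro unchanged.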
commit ef9d3e6675320a53e7cb477c16b01388e708b1da
Author: h-vetinari <h.vetinari@gmx.com>
Date: Sun May 7 04:59:35 2023 +1100
Added missing #include <io.h> for Windows. (#747)
Details:
- This commit fixes issue #746, in which the _access() function (called
from within blastest/f2c/open.c) is undeclared when compiling on
Windows with clang 16.
commit 6fd9aabb03d172a792a7eeb106c7d965cf038421
Author: Devin Matthews <damatthews@smu.edu>
Date: Fri May 5 14:22:52 2023 -0500
Fix bug in detecting Fortran compiler vendor (#745)
`FC` was used instead of `found_fc`.
commit 8215b02f99aa77ecc7d813508c247565115319d7
Author: Lee Killough <15950023+leekillough@users.noreply.github.com>
Date: Wed Apr 12 12:59:27 2023 -0500
Apply #738 to make_defs.mk of RISC-V subconfigs. (#740)
Details:
- PR #738 -- which moved -fPIC flag insertion responsibilities from
common.mk to the subconfigs' individual make_defs.mk files -- was
merged shortly before the introduction of new RISC-V subconfigs in
#693. This commit brings those RISC-V subconfigs up to date with the
new -fPIC conventions.
commit 6b38c5ac07a2a27738674784e58aa699bf895447
Author: angsch <17718454+angsch@users.noreply.github.com>
Date: Tue Apr 11 19:27:43 2023 +0200
Add RISC-V target (#693)
Details:
- There are four RISC-V base configurations: 'rv32i', 'rv32iv', 'rv64i',
and 'rv64iv', namely the 32-bit and 64-bit implementations with and
without the 'V' vector extension. Additional extensions such as 'M'
(multiplication), 'A' (atomics), 'F' ('float' hardware support), 'D'
('double' hardware support), and 'C' (compressed-length instructions),
are automatically used when available. If they are not available, then
software equivalents (e.g., softfloat and -latomic) are used.
- './configure auto' can be invoked on a RISC-V build platform, and will
automatically detect RISC-V CPU extensions through the RISC-V C API:
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md
- The assembly kernels assume the presence of the vector extension
RVV 1.0.
- It is possible to build 'rv[32,64]iv' for any value of VLEN.
However, if VLEN < 128, the targets will fall back to the generic
kernels and blocksizes.
- The vector microkernels are vector-length agnostic and work with
every VLEN >=128, but are expected to work best with smaller vector
lengths, i.e., VLEN <= 512.
- The assembly kernels cover column major storage (rs_c == 1).
- The blocksizes aim at being a good generic choice for out-of-order
cores. They are not tuned to a specific RISC-V HPC core.
- The vector kernels have been tested using vlen={128,256,512}.
- The single- and double-precision assembly code routines for 'sgemm'
and 'dgemm', or for 'cgemm' and 'zgemm', are combined in their RISC-V
vector assembly source code, and are differentiated only with macros.
- The XLEN=32 and XLEN=64 versions of the RISC-V assembly code are
identical, except that callee-saved registers are saved and restored
differently. There are RISC-V assembly code include files for
handling the saving and restoring of callee-saved registers, and they
are future-proof if ever XLEN=128.
- Multiplications, such as computing array strides and offsets, are
performed in C, and later passed to the RISC-V assembly kernels. This
is so that the compiler can determine whether the 'M' (multiply)
extension is available and use multiplication instructions, or call
library helper functions instead.
- A new macro called bli_static_assert() has been added to perform
static assertions at compile-time, regardless of the C/C++ dialect of
the compiler. The original motivation of this was to ensure that
calling RISC-V assembly kernels would not silently truncate arguments
of type 'dim_t' or 'inc_t' (so-called "narrowing conversions").
- RISC-V CI tests have been added to Travis CI, using the
riscv-gnu-toolchain cross-compiler, and qemu simulator.
- Thanks to Lee Killough for collaborating on this commit.
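
Illustration of a dialect-independent static assertion guarding against
narrowing. This is a sketch, not the definition of bli_static_assert():
my_static_assert, the MY_CAT helpers, and my_dim_t are hypothetical,
and the pre-C11 fallback shown is one common technique among several.

```c
#include <stdint.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  /* C11 and later: use the built-in keyword. */
  #define my_static_assert(cond) _Static_assert(cond, #cond)
#else
  /* Older dialects: a negative array size forces a compile error. */
  #define MY_CAT_(a, b) a ## b
  #define MY_CAT(a, b)  MY_CAT_(a, b)
  #define my_static_assert(cond) \
      typedef char MY_CAT(my_static_assert_line_, __LINE__)[(cond) ? 1 : -1]
#endif

/* Hypothetical dimension type standing in for dim_t. */
typedef int64_t my_dim_t;

/* Fail the build if a dimension would not fit in the 64-bit register
   argument that a hand-written kernel expects. */
my_static_assert(sizeof(my_dim_t) <= sizeof(int64_t));

/* Helper so the checked property can also be observed at run time. */
int my_dim_fits_in_64_bits(void)
{
    return sizeof(my_dim_t) <= sizeof(int64_t);
}
```

If the condition were false, compilation would stop at the assertion
instead of silently truncating arguments at the call site.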
commit 593d01761910af6a9a16ee0ac097142732f73c29
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Sat Apr 8 16:44:16 2023 -0500
CREDITS file update.
commit 259f68479671bbaf9c5986759aaa0004f9b05a24
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Apr 7 16:11:34 2023 -0500
CREDITS file update.
Details:
- Added attributions associated with commits:
- 98d4678 9b1beec: bartoldeman
- 2b05948 059f151: ct-clmsn
- Reordered attribution for decandia50.
commit aea8e1d9243631635ca788d5e14f0f29328e637d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Apr 3 12:17:51 2023 -0500
Optionally disable thread-local storage. (#735)
Details:
- Implemented a new configure option, --disable-tls, which allows the
user to optionally disable the use of thread-local storage qualifiers
on static variables in BLIS. This option will rarely be needed, but
in some situations may allow BLIS to compile when TLS is unavailable.
Thanks to Nick Knight for suggesting this option.
- Unlike the --disable-system option, --disable-tls does not forcibly
disable threading. Instead, warnings of the possible consequences of
using threading with TLS disabled are added to:
- the output of './configure --help';
- the output of 'configure' when the --disable-tls option is parsed;
- the informational header output by the testsuite.
Thanks to Minh Quan Ho for suggesting these warnings.
- Modified frame/include/bli_lang_defs.h so that BLIS_THREAD_LOCAL is
defined to nothing when BLIS_ENABLE_TLS is not defined.
- Defined bli_info_get_enable_tls(), which returns whether the cpp macro
BLIS_ENABLE_TLS was defined.
- Edited --disable-system configure status output for clarity.
- Whitespace updates.
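
Illustration of a thread-local qualifier macro that compiles away when
TLS is disabled. This is a sketch with hypothetical names:
MY_ENABLE_TLS, MY_THREAD_LOCAL, and the my_* functions mirror
BLIS_ENABLE_TLS, BLIS_THREAD_LOCAL, and bli_info_get_enable_tls() in
spirit only, and the use of C11 _Thread_local is an assumption.

```c
/* The qualifier expands to nothing when TLS is not enabled. */
#ifdef MY_ENABLE_TLS
  #define MY_THREAD_LOCAL _Thread_local
#else
  #define MY_THREAD_LOCAL
#endif

/* Per-thread when TLS is enabled; shared by all threads otherwise. */
static MY_THREAD_LOCAL int my_num_threads = 1;

void my_set_num_threads(int n) { my_num_threads = n; }
int  my_get_num_threads(void)  { return my_num_threads; }

/* Reports whether the library was built with TLS enabled. */
int my_info_get_enable_tls(void)
{
#ifdef MY_ENABLE_TLS
    return 1;
#else
    return 0;
#endif
}
```

With the macro undefined, the variable becomes ordinary shared state,
which is why threading with TLS disabled warrants the warnings listed
above.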
commit 3f1432abe75cc306ef90a04381d7e0d8739fded8
Author: Lee Killough <15950023+leekillough@users.noreply.github.com>
Date: Mon Apr 3 12:10:59 2023 -0500
Add output.testsuite to .gitignore (#736)
Details:
- Added `output.testsuite` to .gitignore since it was previously not
being matched by `output.testsuite.*`.
commit 38fc5237520a2f20914a9de8bb14d5999009b3fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Thu Mar 30 17:30:07 2023 -0500
Added mm_algorithm pdf files (bp and pb).
Details:
- Added PDF versions of the PowerPoint files added in 17cd260.
commit 17cd260cb504b2f3997c32daec77f4c828fbb32b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Mar 29 21:47:12 2023 -0500
Added mm_algorithm pptx files (bp and pb).
Details:
- Added two PowerPoint files that contain slides depicting the classic
Goto algorithm for matrix multiplication as well as its sister
"panel-block" algorithm. These files reside in docs/diagrams.
commit 9d778e0f7c94d8752dd578101e4fc6893a1f54ef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Mar 29 17:36:49 2023 -0500
Move -fPIC insertion to subconfigs' make_defs.mk. (#738)
Details:
- Previously, common.mk was appending -fPIC to the CPICFLAGS variables
set within the various subconfigurations' make_defs.mk files. This
seemed somewhat unintuitive, and so now the -fPIC flag is assigned to
the various subconfigs' CPICFLAGS variables in the respective
make_defs.mk files.
- This commit also changes the logic in common.mk so that instead of
appending, the variable is overwritten, but now *only* in the case
of Windows (since apparently -fPIC needs to be omitted there). Thanks
to Nick Knight for catching and reporting this weirdness.
commit 04090df01175477394d1e73af2e5769751d47cd6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Mon Mar 27 14:13:10 2023 -0500
Fixed compile errors with `BLIS_DISABLE_BLAS_DEFS`. (#730)
Details:
- This commit fixes a compile-time error related to the type definition
(prototype) of dsdot_() when BLIS_DISABLE_BLAS_DEFS is defined by the
application (or the configuration), which is actually a symptom of a
larger design issue when disabling BLAS prototypes. The macro was
intended to allow applications to bring their own BLAS prototypes and
suppress the inclusion of duplicate (or possibly conflicting)
prototypes within blis.h. However, prototypes are still needed during
compilation even if they are ultimately omitted from blis.h. The
problem is that almost every source file in BLIS--including the BLAS
compatibility layer--only includes one header (blis.h), and if we
were to include a new header in the BLAS source files (to isolate
only the BLAS prototypes), we would also have to make the build system
aware of the location of those headers. Thanks to Edward Smyth of AMD
for reporting this issue.
- The solution I settled upon was to remove all cpp guards from all BLAS
headers (by changing them to #if 1, for easy search-and-replace
anchoring in the future if we ever need to re-insert guards) and
modifying bli_blas.h so that the BLAS prototypes are included if
either (a) BLIS_ENABLE_BLAS_DEFS is defined, or (b)
BLIS_ENABLE_BLAS_DEFS is *not* defined but BLIS_IS_BUILDING_LIBRARY
*is* defined. (Thanks to Devin Matthews for steering me away from an
inferior solution.)
- This commit also spins off the actual BLAS prototypes/definitions to
a separate file, bli_blas_defs.h.
- CREDITS file update.
commit 5f841307f668f65b7ed5a479bd8374d2581208cf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Fri Mar 24 20:05:13 2023 -0500
Omit -fPIC if shared library build is disabled. (#732)
Details:
- Updated common.mk so that when --disable-shared option is given to
configure:
1. The -fPIC compiler flag is omitted from the individual
configuration family members' CPICFLAGS variables (which are
initialized in each subconfig's make_defs.mk file); and
2. The BUILD_SYMFLAGS variable, which contains compiler flags needed
to control the symbol export behavior, is left blank.
- The net result of these changes is that flags specific to shared
library builds are only used when a shared library is actually
scheduled to be built. Thanks to Nick Knight for reporting this issue.
- CREDITS file update.
commit 72c37eb80f964b7840377076e5009aec5b29d320 (origin/riscv)
Author: Lee Killough <15950023+leekillough@users.noreply.github.com>
Date: Thu Mar 23 16:01:55 2023 -0500
Updated configure to pass all shellcheck checks. (#729)
Details:
- Modified configure so that it passes all 'shellcheck' checks,
disabling ones which we violate but which are just stylistic, or are
special cases in our code.
- Miscellaneous other minor changes, such as rearranged redirections in
long sed/perl pipes to look more natural.
- Whitespace tweaks.
commit 60f36347c16e6336215cd52b4e5f3c0f96e7c253
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Feb 22 20:37:30 2023 -0600
Fixed bugs in scal2v ref kernel when alpha == 1. (#728)
Details:
- Fixed a typo bug in ref_kernels/1/bli_scal2v_ref.c where the
conditional that was supposed to be checking for cases when alpha is
equal to 1.0 (so that copyv could be used instead of scal2v) was
instead erroneously comparing alpha against 0.0.
- Fixed another bug in the same function whereby BLIS_NO_CONJUGATE was
erroneously being passed into copyv instead of the kernel's conjx
parameter. This second bug was inert, however, due to the first bug
since the "alpha == 0.0" case was already being handled, resulting in
the code block never executing.
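
Illustration of the corrected control flow. This is a sketch, not the
reference kernel: my_zcomplex, my_conj_t, my_zcopyv(), and my_zscal2v()
are hypothetical, unit-stride simplifications that compute
y := alpha * conjx(x).

```c
/* Hypothetical complex type and conjugation flag. */
typedef struct { double re, im; } my_zcomplex;
typedef enum { MY_NO_CONJUGATE, MY_CONJUGATE } my_conj_t;

/* y := conjx(x) */
void my_zcopyv(my_conj_t conjx, int n, const my_zcomplex* x, my_zcomplex* y)
{
    for (int i = 0; i < n; ++i)
    {
        y[i].re = x[i].re;
        y[i].im = (conjx == MY_CONJUGATE) ? -x[i].im : x[i].im;
    }
}

/* y := alpha * conjx(x) */
void my_zscal2v(my_conj_t conjx, int n, my_zcomplex alpha,
                const my_zcomplex* x, my_zcomplex* y)
{
    /* alpha == 0: the result is zero regardless of x. */
    if (alpha.re == 0.0 && alpha.im == 0.0)
    {
        for (int i = 0; i < n; ++i) { y[i].re = 0.0; y[i].im = 0.0; }
        return;
    }

    /* alpha == 1: a plain copy suffices. Note the comparison against
       one (not zero) and the forwarding of the caller's conjx. */
    if (alpha.re == 1.0 && alpha.im == 0.0)
    {
        my_zcopyv(conjx, n, x, y);
        return;
    }

    /* General case: complex multiply by alpha. */
    for (int i = 0; i < n; ++i)
    {
        double xr = x[i].re;
        double xi = (conjx == MY_CONJUGATE) ? -x[i].im : x[i].im;
        y[i].re = alpha.re * xr - alpha.im * xi;
        y[i].im = alpha.re * xi + alpha.im * xr;
    }
}
```

Passing a hard-coded no-conjugate flag to the copy would drop the
conjugation whenever alpha is one, which is the second bug described
above.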
commit fab18dca46618799bb0b4f652820b33d36a5d4d4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date: Wed Feb 22 16:50:00 2023 -0600
Use 'void*' datatypes in kernel APIs. (#727)
Details:
- Migrated all kernel APIs to use void* pointers instead of float*,
double*, scomplex*, and dcomplex* pointers. This allows us to define
many fewer kernel function pointer types, which also makes it much
easier to know which function pointer type to use at any given time.
(For example, whereas before there was ?axpyv_ker_ft, ?axpyv_ker_vft,
and axpyv_ker_vft, now there is just axpyv_ker_ft, which is equivalent
to what axpyv_ker_vft used to be.)
- Refactored how kernel function prototypes and kernel function types
are defined so as to reduce redundant code. Specifically, the
function signatures (excluding cntx_t* and, in the case of level-3
microkernels, auxinfo_t*) are defined in new headers named, for
example, bli_l1v_ker_params.h. Those signatures are reused via macro
instantiation when defining both kernel prototypes and kernel function
types. This will hopefully make it a little easier to update, add, and
manage kernel APIs going forward.
- Updated all reference kernels according to the aforementioned switch
to void* pointers.
- Updated all optimized kernels according to the aforementioned switch
to void* pointers. This sometimes required renaming variables,
inserting typecasting so that pointer arithmetic could continue to
function as intended, and related tweaks.
- Updated sandbox/gemmlike according to the aforementioned switch to
void* pointers.
- Renamed:
- frame/1/bli_l1v_ft_ker.h -> frame/1/bli_l1v_ker_ft.h
- frame/1f/bli_l1f_ft_ker.h -> frame/1f/bli_l1f_ker_ft.h
- frame/1m/bli_l1m_ft_ker.h -> frame/1m/bli_l1m_ker_ft.h
- frame/3/bli_l1m_ft_ukr.h -> frame/3/bli_l1m_ukr_ft.h
- frame/3/bli_l3_sup_ft_ker.h -> frame/3/bli_l3_sup_ker_ft.h
to better align with naming of neighboring files.
- Added the missing "void* params" argument to bli_?packm_struc_cxk() in
frame/1m/packm/bli_packm_struc_cxk.c. This argument is being passed
into the function from bli_packm_blk_var1(), but wasn't being "caught"
by the function definition itself. The function prototype for
bli_?packm_struc_cxk() also needed updating.
- Reordered the last two parameters in bli_?packm_struc_cxk().
(Previously, the "void* params" was passed in after the
"const cntx_t* cntx", although because of the above bug the params
argument wasn't actually present in the function definition.)
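The idea behind the void* migration can be sketched as follows. This is a simplified illustration under stated assumptions: the signature is reduced to (n, alpha, x, y), and the names axpyv_ker_ft_sketch, daxpyv_sketch, and saxpyv_sketch are hypothetical rather than the actual BLIS declarations.

```c
#include <stddef.h>

/* One function pointer type can describe the kernel for every datatype
   once the data pointers are void*. */
typedef void (*axpyv_ker_ft_sketch)(size_t n, const void *alpha,
                                    const void *x, void *y);

/* y := y + alpha * x, double precision. */
static void daxpyv_sketch(size_t n, const void *alpha,
                          const void *x, void *y)
{
    /* Cast to the concrete type first so that pointer arithmetic and
       indexing behave as intended. */
    const double *alpha_d = alpha;
    const double *x_d     = x;
    double       *y_d     = y;

    for (size_t i = 0; i < n; ++i)
        y_d[i] += (*alpha_d) * x_d[i];
}

/* y := y + alpha * x, single precision; same function type as above. */
static void saxpyv_sketch(size_t n, const void *alpha,
                          const void *x, void *y)
{
    const float *alpha_s = alpha;
    const float *x_s     = x;
    float       *y_s     = y;

    for (size_t i = 0; i < n; ++i)
        y_s[i] += (*alpha_s) * x_s[i];
}
```

Both kernels can be stored in a variable of type axpyv_ker_ft_sketch, which is the property that reduces the number of function pointer types needed.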
commit 93c63d1f469c4650df082d0fa2f29c46db0e25f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 20 11:14:23 2023 -0600
Use 'const' pointers in kernel APIs. (722)
Details:
- Qualified all input-only data pointers in the various kernel APIs with
the 'const' keyword while also removing 'restrict' from those kernel
APIs. (Use of 'restrict' was maintained in kernel implementations,
where appropriate.) This affected the function pointer types defined
for all of the kernels, their prototypes, and the reference and
optimized kernel definitions' signatures.
- Templatized the definitions of copys_mxn and xpbys_mxn static inline
functions.
- Minor whitespace and style changes (e.g. combining local variable
declaration and initialization into a single statement).
- Removed some unused kernel code left in 'old' directories.
- Thanks to Nisanth M P for helping to validate changes to the power10
microkernels.
commit 4e18cd34f909c5045597f411340ede3a5e0bc5e1
Author: RuQing Xu <ruqing.xuphys.s.u-tokyo.ac.jp>
Date: Sun Feb 19 04:18:41 2023 +0900
Restored ArmSVE general storage case. (708)
Details:
- Restored general storage case in armsve kernels.
- Reason for doing this: Though real `g`-storage is difficult to
speed up, the `g`-codepath here can provide good support for
transposed storage, i.e. it is at least good for `GEMM_UKR_SETUP_CT_AMBI`.
- By experience, this solution is only *a little* slower than in-reg
transpose. Plus in-reg transpose is only possible for a fixed VL in
our case.
commit 0ba6e9eafb1e667373d9dbc2aa045557921f33e2
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Sat Feb 18 13:15:42 2023 -0600
Refined emacs handling of indentation. (717)
Details:
- This refines the emacs autoformatting to be better in line with
contribution guidelines.
- Removed a stray shebang in a .mk file which confuses emacs about the
file mode, which should be makefile-mode. (emacs also removes stray
whitespace at the ends of lines.)
commit 059f15105b1643fe56084f883c22b3cadf368b39
Author: ct-clmsn <ct.clmsngmail.com>
Date: Sat Feb 18 14:13:23 2023 -0500
Updated hpx namespace for make_count_shape. (725)
Details:
- The hpx namespace for *counting_shape changed. This PR updates the use
of counting_shape in blis to comply with the change in hpx.
- Co-authored-by: ctaylor <ctaylortactcomplabs.com>
commit 0b421eff130b5c896edcc09e7358d18564d177e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 18 13:11:41 2023 -0600
Added an 'arm64' entry to `.travis.yml`. (726)
Details:
- Added a new 'arm64' entry to the .travis.yml file in an attempt to get
Travis CI to compile both NEON and SVE kernels, even if only NEON
kernels are exercised in the testing. With this new 'arm64' entry, the
'cortexa57' entry becomes redundant and may be removed. Thanks to
RuQing Xu for this suggestion.
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
bli_kernels_arm64.h, which meant that the default value of 64 was
being used. This caused a runtime consistency check to fail in
bli_gks.c (in Travis CI), one which requires that
mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE
for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is
defined as
BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2
This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'
configuration, thus overriding the default and (hopefully) avoiding
the aforementioned consistency check failures.
- Appended '|| cat ./output.testsuite' to all 'make' commands in
travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
- Whitespace changes.
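The stack buffer consistency check described above amounts to verifying that a microtile fits within the temporary buffer. A minimal sketch follows, assuming hypothetical helper names; the register count and microtile dimensions used in the usage note are illustrative values, not taken from any BLIS configuration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Size of the stack buffer, following the definition quoted above:
   number of SIMD registers * SIMD register size in bytes * 2. */
static size_t stack_buf_max_size_sketch(size_t simd_max_num_registers,
                                        size_t simd_max_size)
{
    return simd_max_num_registers * simd_max_size * 2;
}

/* Returns true if an mr x nr microtile of elements that are dt_size
   bytes wide fits in the stack buffer. */
static bool microtile_fits_sketch(size_t mr, size_t nr, size_t dt_size,
                                  size_t simd_max_num_registers,
                                  size_t simd_max_size)
{
    return mr * nr * dt_size
           <= stack_buf_max_size_sketch(simd_max_num_registers,
                                        simd_max_size);
}
```

For example, with 32 registers assumed, a hypothetical 32 x 10 microtile of 16-byte elements needs 5120 bytes: that exceeds the 4096-byte buffer obtained with a SIMD size of 64, but fits in the 8192-byte buffer obtained with 128.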
commit b1d3fc7e5b0927086e336a23f16ea59aa3611ccb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 10 15:34:47 2023 -0600
Redirect grep stderr to /dev/null. (723)
Details:
- In common.mk, added a redirection of stderr to /dev/null for the grep
command being used to gather a list of header files included from
bli_cntx_ref.c. The redirection is desirable because as of grep 3.8,
regular expressions with "stray" backslashes trigger warnings [1].
But removing the backslash seems to break the BLIS build system when
using pre-3.8 versions of grep, so this seems to be easiest way to
satisfy the BLIS build system for both pre- and post-3.8 grep
environments.
[1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html
commit e3d352f1fcc93e6a46fde1aa4a7f0a18fb27bd42
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Feb 8 06:11:41 2023 +0530
Added runtime selection of 'power' config family. (718)
Details:
- Created a 'power' umbrella configuration family, which, when targeted
at configure-time, will build both 'power9' and 'power10' subconfigs.
(With this feature, a BLIS shared library could be compiled on a
power9 system and run on power10 and vice-versa. Unoptimised code
will execute if it is linked and run on any other generic system.)
- This new configuration family will only work with gcc, since that is
the only compiler supported by both power9 and power10 subconfigs in
BLIS.
- Documented power9 and power10 as supported microarchitectures in the
docs/HardwareSupport.md document.
commit e730c685d09336b3bd09e86c94330c4eba967f3e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 6 15:31:54 2023 -0600
Define `BLIS_VERSION_STRING` in `blis.h`. (720)
Details:
- Previously, the version string was communicated from configure to
config.mk (via the config.mk.in template), where it was included via
the top-level Makefile, where it was then used to define the
preprocessor macro BLIS_VERSION_STRING via a command line argument to
the compiler (via -D). This macro is then used within bli_info.c to
initialize a static string which can then be queried via the
bli_info_get_version_str() function. However, there are some
applications that may find utility in being able to access the version
string by inspecting the monolithic (flattened) blis.h header file
that is created at compile time and installed alongside the library.
This commit moves the definition of BLIS_VERSION_STRING into
bli_config.h (via the bli_config.h.in template) so that it is
embedded in blis.h. The version string is now available in three
places:
- the static/shared library, which is installed in the 'lib'
subdirectory of the install prefix (query-able via the
bli_info_get_version_str() function);
- the config.mk makefile fragment, which is installed in the 'share'
subdirectory of the install prefix (in the VERSION variable);
- the blis.h header file, which is installed in the 'include'
subdirectory of the install prefix (via the BLIS_VERSION_STRING
macro constant).
Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this
change.
- CREDITS file update.
commit dc5d00a6ce0350cd82859d8c24f23d98f205d8db
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Fri Jan 27 17:36:47 2023 -0600
Typecast printf() args to avoid compiler warnings. (716)
Details:
- In bli_thread_range_tlb.c, typecast integer arguments passed to
printf() -- which are typically disabled unless debugging -- to type
"long" to guarantee a match to the "%ld" format specifiers used in
those calls. This avoids spurious warnings with certain compilers in
certain toolchain environments, such as 32-bit RISC-V (rv32iv).
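The pattern is sketched below. The integer typedef and the function name are hypothetical; the point is only that an integer type whose width depends on the build may not be "long", so the argument is cast explicitly to match "%ld".

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical dimension type; its width may differ from 'long' on
   some targets (e.g. 32-bit platforms). */
typedef int64_t dim_sketch_t;

/* Formats a thread range into buf. The casts guarantee that the
   argument types match the "%ld" conversion specifiers. */
static int format_range_sketch(char *buf, size_t len,
                               dim_sketch_t start, dim_sketch_t end)
{
    return snprintf(buf, len, "range: %ld to %ld",
                    (long)start, (long)end);
}
```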
commit ecbcf4008815035c695822fcaf106477debff89a
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Wed Jan 18 20:35:50 2023 -0600
Use here-document for 'configure --help' output. (714)
Details:
- Changed the configure script function that outputs "--help" text to do
so via so-called "here-document" syntax for improved readability and
maintainability. The change eliminates hundreds of echo statements and
makes it easier to change existing configure options' help text, along
with other benefits such as eliminating the need to escape double-
quote characters (").
commit c334ec278f5e2a101625629b2e13bbf1b38dede5
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jan 18 13:10:19 2023 -0600
Merge tlb- and slab/rr-specific gemm macrokernels. (711)
Details:
- Merged the tlb-specific gemm macrokernel (_var2b) with the slab/rr-
specific one (var2) so that a single function can be compiled with
either tlb or slab/rr support, depending on which of the
BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR macros is defined. This is done by incorporating
information from both approaches: the start/end/inc for the JR and IR
loops from slab or rr partitioning; and the number of assigned
microtiles, plus the starting IR dimension offset for all iterations
after the first (ir_next). With these changes, slab, rr, and tlb can
all be parameterized by initializing a similar set of variables prior
to the jr loop.
- Removed the wrap-around logic that sets the "b_next" field of the
auxinfo_t struct, which executes during the last IR iteration of the
last JR iteration. The potential benefit of this code is so minor
(and hinges on the microkernel making use of the b_next field) that
it's arguably not worth including. The code also does the wrong
thing for some threads whenever JR_NT > 1, since only thread 0 (in the
JR group) would even compute with the first micropanel of B.
- Re-expressed the definition of bli_is_last_iter_slrr so that slab and
tlb use the same code rather than rr and tlb.
- Adjusted the initialization of the gemm control tree accordingly.
commit 5793a77937aee9847a5692c8e44b36a6380800a1
Author: HarshDave12 <122850830+HarshDave12users.noreply.github.com>
Date: Tue Jan 17 21:55:02 2023 +0530
Fixed mis-mapped instruction for VEXTRACTF64X2. (713)
Details:
- This commit fixes a typo in the macro definition for the extended
inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. The macro
was previously defined (incorrectly) in terms of the vextractf64x4
instruction rather than vextractf64x2.
- CREDITS file update.
commit 16d2e9ea9ca0853197b416eba701b840a8587bca
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 13 20:03:01 2023 -0600
Defined lt, lte, gt, gte + misc. other updates. (712)
Details:
- Changed invertsc operation to be a non-destructive operation; that is,
it now takes separate input and output operands. This change applies
to both the object and typed APIs.
- Defined an alternative square root operation, sqrtrsc, which, when
operating on complex scalars, assumes the imaginary part of the input
to be zero.
- Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym
so that when the source matrix has an implicit unit diagonal, the
operation leaves the diagonal of the destination matrix untouched.
Previously, the operations would interpret an implicit unit diagonal
on the source matrix as a request to manifest the unit diagonal
*explicitly* on output (either as something to copy in the case of
copym, or something to compute with in the cases of addm, subm, axpym,
scal2m, and xpbym). It turns out that this behavior was too cute by
half and could cause unintended headaches for practical use cases.
(This change in behavior also required small modifications to the trmv
and trsv testsuite modules so that they would properly test matrices
with unit diagonals.)
- Added missing dependencies for copym to gemv, ger, hemv, trmv, and
trsv testsuite modules.
- Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in
frame/util, which use lt, lte, gt, and gte level-0 scalar macros.
- Trivial variable rename in bli_part.c to harmonize with other
variable naming conventions.
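The new unit-diagonal semantics can be illustrated with a copy of a lower-triangular matrix. This is a sketch only: copym_lower_sketch is a hypothetical name, the matrices are square and column-major, and none of the BLIS object or typed API is used.

```c
#include <stdbool.h>
#include <stddef.h>

/* Copies the lower triangle of the n x n column-major matrix a into b.
   If unit_diag is true, a is treated as having an implicit unit
   diagonal and the diagonal of b is left untouched. */
static void copym_lower_sketch(size_t n, bool unit_diag,
                               const double *a, double *b)
{
    for (size_t j = 0; j < n; ++j) {
        /* Start below the diagonal when it is implicit, so that b(j,j)
           is never written. */
        size_t i_first = unit_diag ? j + 1 : j;

        for (size_t i = i_first; i < n; ++i)
            b[i + j * n] = a[i + j * n];
    }
}
```

Under the previous semantics, the unit-diagonal case would instead have written 1.0 to each b(j,j).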
commit 9a366b14fe52c469f4664ef5dd93d85be8d97baa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 12 13:07:22 2023 -0600
Implement cntx_t pointer caching in gks. (709)
Details:
- Refactored the gks cntx_t query functions so that: (1) there is a
clearer pattern of similarity between functions that query a native
context and those that query its induced (1m) counterpart; and (2)
queried cntx_t pointers (for both native and induced cntx_t pointers)
are cached (by default), or deep-queried upon each invocation,
depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined.
- Refactored query-related functions in bli_arch.c to cache the queried
arch_t value (by default), or deep-query the arch_t value upon each
invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is
defined.
- Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named
bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is
repopulated each time the function is called. (It is still only
allocated once on first call.) This was mostly done in preparation for
some future in which the arch_t value might change at runtime. In such
a scenario, the induced method context would need to be recalculated
any time the native context changes.
- Added preprocessor logic to bli_config_macro_defs.h to handle enabling
or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING).
- For now, cntx_t pointer caching is enabled by default and does not
correspond to any official configure option. Disabling can be done
by inserting a define for BLIS_DISABLE_GKS_CACHING into the
appropriate bli_family_*.h header file within the configuration of
interest.
- Thanks to Harihara Sudhan S (AMD) for suggesting that cntx_t pointers
(and not just arch_t values) be cached.
- Comment updates.
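The caching behavior described above can be sketched as follows. All names here (SKETCH_ENABLE_CACHING, cntx_sketch_t, deep_query_sketch, query_cntx_sketch) are hypothetical, and the sketch ignores thread safety, which a real implementation would need to consider.

```c
#include <stddef.h>

/* Comment this out to deep-query on every call instead of caching. */
#define SKETCH_ENABLE_CACHING

/* Hypothetical stand-in for a context structure. */
typedef struct { int arch_id; } cntx_sketch_t;

/* Counts how often the (notionally expensive) deep query runs. */
static int deep_query_count = 0;

/* Simulates the full lookup of the context for the current hardware. */
static const cntx_sketch_t *deep_query_sketch(void)
{
    static const cntx_sketch_t table[1] = { { 7 } };
    ++deep_query_count;
    return &table[0];
}

/* Returns the context pointer, caching it after the first call when
   caching is enabled. */
static const cntx_sketch_t *query_cntx_sketch(void)
{
#ifdef SKETCH_ENABLE_CACHING
    static const cntx_sketch_t *cached = NULL;
    if (cached == NULL)
        cached = deep_query_sketch();
    return cached;
#else
    return deep_query_sketch();
#endif
}
```

With caching enabled, repeated calls return the same pointer while the deep query runs only once.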
commit b895ec9f1f66fb93972589c06bff171337153a31
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 09:02:32 2023 +0530
Fixing type-mismatch errors in power10 sandbox (701)
Details:
- This commit fixes a mismatch between the function type signature of
bli_gemm_ex() required by BLIS and the version of the function defined
within the power10 sandbox. It also performs typecasting upon calling
bli_gemm_front() to attain type consistency with the type signature
defined by BLIS for bli_gemm_front().
commit 38d88d5c131253066cad4f98eea06fa9299cae3b
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jan 10 21:24:58 2023 -0600
Define new global scalar (obj_t) constants. (703)
Details:
- This commit defines the following new global scalar constants:
- BLIS_ONE_I: This constant encodes the imaginary unit.
- BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit.
- BLIS_NAN: This constant encodes a not-a-number value. Both real and
imaginary parts are set to NaN for complex datatypes.
commit cdb22b8ffa5b31a0c16ac1a7bcecefeb5216f669
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 08:50:57 2023 +0530
Disable power10 kernels other than sgemm, dgemm. (705)
Details:
- There is a power10 sandbox which uses microkernels for datatypes other
than float and double (or scomplex/dcomplex). In a regular power10-
configured build (that is, with the sandbox disabled), there were
compile errors for some of these other non-sgemm/non-dgemm
microkernels. This commit protects those kernels with a new cpp macro
guard (which is defined in sandbox/power10/bli_sandbox.h) that
prevents that kernel code from being compiled for normal, non-sandbox
power10 builds.
commit d220f9c436c0dae409974724d42ab6c52f12a726
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 08:43:03 2023 +0530
Fix k = 0 edge case in power10 microkernels (706)
Details:
- When power10 sgemm and dgemm microkernels are called with k = 0, they
become caught in infinite loops and segfault. This is fixed now via an
early exit in the case of k = 0.
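A sketch of why an early exit helps, assuming the k loop is written as a count-down do/while (the actual power10 kernel loop structure is not shown here, so that is an assumption). The function name and the 2 x 2 tile size are hypothetical, and the sketch only accumulates into C, so returning immediately is correct for k = 0.

```c
#include <stddef.h>

/* c := c + a * b for a 2 x 2 column-major microtile, where a holds k
   packed columns of 2 elements and b holds k packed rows of 2 elements. */
static void gemm_ukr_2x2_sketch(size_t k, const double *a,
                                const double *b, double *c)
{
    /* Without this check, the do/while below would execute its body
       once with no valid data, and the unsigned counter would wrap
       around instead of terminating. */
    if (k == 0)
        return;

    do {
        c[0] += a[0] * b[0];
        c[1] += a[1] * b[0];
        c[2] += a[0] * b[1];
        c[3] += a[1] * b[1];
        a += 2;
        b += 2;
    } while (--k != 0);
}
```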
commit 2e1ba9d13c23a06a7b6f8bd326af428f7ea68c31
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 10 21:05:54 2023 -0600
Tile-level partitioning in jr/ir loops (ex-trsm). (695)
Details:
- Reimplemented parallelization of the JR loop in gemmt (which is
recycled for herk, her2k, syrk, and syr2k). Previously, the
rectangular region of the current MC x NC panel of C would be
parallelized separately from the diagonal region of that same
submatrix, with the rectangular portion being assigned to threads via
slab or round-robin (rr) partitioning (as determined at configure-
time) and the diagonal region being assigned via round-robin. This
approach did not work well when extracting lots of parallelism from
the JR loop and was often suboptimal even for smaller degrees of
parallelism. This commit implements tile-level load balancing (tlb) in
which the IR loop is effectively subjugated in service of more
equitably dividing work in the JR loop. This approach is especially
potent for certain situations where the diagonal region of the MC x NR
panel of C is significant relative to the entire region. However, it
also seems to benefit many problem sizes of other level-3 operations
(excluding trsm, which has an inherent algorithmic dependency in the
IR loop that prevents the application of tlb). For now, tlb is
implemented as _var2b.c macrokernels for gemm (which forms the basis
for gemm, hemm, and symm), gemmt (which forms the basis of herk,
her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and
trmm3). Which function pointers (_var2() or _var2b()) are embedded in
the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp
macro is defined, which is controlled by the value passed to the
existing --thread-part-jrir=METHOD (or -r METHOD) configure option.
This commit adds 'tlb' as a valid option alongside the previously
supported values of 'slab' and 'rr'. ('slab' is still the default.)
Thanks to Leick Robinson for abstractly inspiring this work, and to
Minh Quan Ho for inquiring (in PR 562, and before that in Issue 437)
about the possibility of improved load balance in macrokernel loops,
and even prototyping what it might look like, long before I fully
understood the problem.
- In bli_thread_range_weighted_sub(), tweaked the way we compute the
area of the current MC x NC trapezoidal panel of C by better taking
into account the microtile structure along the diagonal. Previously,
it was an underestimate, as it assumed MR = NR = 1 (that is, it
assumed that the microtile column of C that overlapped with microtiles
exactly coincided with the diagonal). Now, we only assume MR = NR.
This is still a slight underestimate when MR != NR, so the additional
area is scaled by 1.5 in a hackish attempt to compensate for this, as
well as other additional effects that are difficult to model (such as
the increased cost of writing to temporary tiles before finally
updating C). The net effect of this better estimation of the
trapezoidal area should be (on average) slightly larger regions
assigned to threads that have little or no overlap with the diagonal
region (and correspondingly slightly smaller regions in the diagonal
region), which we expect will lead to slightly better load balancing
in most situations.
- Spun off the contents of bli_thread.[ch] that relate to computing
thread ranges into one of three source/header file pairs:
- bli_thread_range.[ch], which define functions that are not specific
to the jr/ir loops;
- bli_thread_range_slab_rr.[ch], which define functions that implement
slab or round-robin partitioning for the jr/ir loops;
- bli_thread_range_tlb.[ch], which define functions that implement
tlb for the jr/ir loops.
- Fixed the computation of a_next in the last iteration of the IR loop
in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around
to the first micropanel of the current MC x KC packed block of A.
However, this is almost never actually the micropanel that is used
next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next
correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use
in the upper-stored case (which *does* actually always choose the
first micropanel of A as its a_next at the end of the IR loop).
- Removed adjustments for a_next/b_next (a2/b2) for the diagonal-
intersecting case of gemmt_l_ker_var2() and the above-diagonal case
of gemmt_u_ker_var2() since these cases will only coincide with the
last iteration of the IR loop in very small problems.
- Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of
which explicitly considers whether the current microtile is the last
tile that intersects the diagonal. (The former does the same, but the
computation coincides with the original bli_is_last_iter().) These
functions are now used in gemmt to test when a_next (or a2) should
"wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l()
and bli_is_last_iter_tlb_u(), which are similar to the aforementioned
functions but are used when employing tlb in gemmt.
- Redefined macros in bli_packm_thrinfo.h, which test whether an
iteration of work is assigned to a thread, as static inline functions
in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).
In the process of redefining these macros, I also renamed them from
bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().
- Renamed
bli_thread_range_jrir_rr() -> bli_thread_range_rr()
bli_thread_range_jrir_sl() -> bli_thread_range_sl()
bli_thread_range_jrir() -> bli_thread_range_slrr()
- Renamed
bli_is_last_iter() -> bli_is_last_iter_slrr()
- Defined
bli_info_get_thread_jrir_tlb()
and renamed:
- bli_info_get_thread_part_jrir_slab() ->
bli_info_get_thread_jrir_slab()
- bli_info_get_thread_part_jrir_rr() ->
bli_info_get_thread_jrir_rr()
- Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism
into the JR loop when tlb is enabled for non-trsm level-3 operations.
- Added a sanity check to prevent bli_prune_unref_mparts() from being
used on packed objects. This prohibition is necessary because the
current implementation does not take into account the atomicity of
packed micropanel widths relative to the diagonal of structured
matrices. That is, the function prunes greedily without regard to
whether doing so would prune off part of a micropanel *which has
already been packed* and assigned to a thread for inclusion in the
computation.
- Further restricted early returns in bli_prune_unref_mparts() to
situations where the primary matrix is not only of general structure
but also dense (in terms of its uplo_t value). The addition of the
matrix's dense-ness to the conditional is required because gemmt is
somewhat unusual in that its C matrix has general structure but is
marked as lower- or upper-stored via its uplo_t. By only checking
for general structure, attempts to prune gemmt C matrices would
incorrectly result in early returns, even though that operation
effectively treats the matrix as symmetric (and stored in only one
triangle).
- Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges
were computed when 1 < bf. Thankfully, this bug was not yet
manifesting since all current invocations used bf == 1.
- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2()
that would perform incorrect pruning of unreferenced regions above
where the diagonal of a lower-stored matrix intersects the right edge.
Thankfully, the bug was not harming anything since those unreferenced
regions were being pruned prior to the macrokernel.
- Rewrote slab/rr-based gemmt macrokernels so that they no longer carved
C into rectangular and diagonal regions prior to parallelizing each
separately. The new macrokernels use a unified loop structure where
quadratic (slab) partitioning is used.
- Updated all level-3 macrokernels to have a more uniform coding style,
such as wrt combining variable declarations with initializations as
well as the use of const.
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
bli_thrinfo_thread_id(), respectively. This change probably should
have been included in aeb5f0c.
- Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that
corresponded to functions that were removed in aeb5f0c.
- Other very minor cleanups.
- Comment updates.
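The three partitioning strategies named in this commit can be contrasted with a small sketch. This is not the BLIS implementation: the function names and range struct are hypothetical, the way remainders are distributed is an assumption, and the tlb variant shown covers only a rectangular (gemm-like) grid of microtiles, not the triangular regions of gemmt.

```c
#include <stddef.h>

/* Iterations assigned to one thread: start, start+inc, ... < end. */
typedef struct { size_t start, end, inc; } range_sketch_t;

/* slab: each thread gets one contiguous chunk of the n iterations; the
   first (n % nt) threads receive one extra iteration (assumed policy). */
static range_sketch_t range_slab_sketch(size_t tid, size_t nt, size_t n)
{
    size_t base  = n / nt;
    size_t rem   = n % nt;
    size_t start = tid * base + (tid < rem ? tid : rem);
    size_t len   = base + (tid < rem ? 1 : 0);
    range_sketch_t r = { start, start + len, 1 };
    return r;
}

/* rr: thread tid takes iterations tid, tid + nt, tid + 2*nt, ... */
static range_sketch_t range_rr_sketch(size_t tid, size_t nt, size_t n)
{
    range_sketch_t r = { tid, n, nt };
    return r;
}

/* tlb: flatten the n_jr x n_ir grid of microtiles and give each thread
   a contiguous run of tiles, so that work is balanced at the tile level
   rather than at the level of whole JR iterations. */
static void range_tlb_sketch(size_t tid, size_t nt,
                             size_t n_jr, size_t n_ir,
                             size_t *first_tile, size_t *n_tiles)
{
    range_sketch_t r = range_slab_sketch(tid, nt, n_jr * n_ir);
    *first_tile = r.start;
    *n_tiles    = r.end - r.start;
}
```

For example, with 4 threads and a 3 x 5 grid of microtiles, slab partitioning of the 3 JR iterations leaves the fourth thread idle, whereas the tile-level split assigns 4, 4, 4, and 3 tiles.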
commit b6735ca26b9d459d9253795dc5841ae8de9e84c9
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jan 6 14:10:01 2023 -0600
Refactor structure awareness in packm_blk_var1.c. (707)
Details:
- Factored some of the structure awareness out of the loop in
bli_packm_blk_var1(). So instead of having a single loop with
conditionals in the body to handle various kinds of structure (and
stored/unstored submatrix placement), we now have a conditional branch
to handle various structure/storage scenarios with a loop in each
section. This change was originally motivated to choose slab or round-
robin partitioning (in the context of triangular matrices) based on
the structure of the entire block (or panel) being packed rather than
each micropanel individually. Previously, the code would attempt to
limit rr to the portion of the block that intersects the diagonal and
use slab for the remainder. However, that approach was not well thought
out, and in many situations it would lead to inferior load balancing
when compared to using round-robin for the entire block (or panel).
This commit has the added benefit of incurring less overhead during
the packing process now that each of the new loops is simpler.
commit f956b79922da412791e4c8b8b846b3aafc0a5ee0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 31 20:18:08 2022 -0600
Switch to l3 sup decorator in gemmlike sandbox. (704)
Details:
- Modified the gemmlike sandbox to call bli_l3_sup_thread_decorator()
rather than a local analogue of that code. This reduces redundant
logic and makes it easier for the sandbox to inherit future
improvements to the framework's threading code.
- Moved addon/gemmd to addon/old/gemmd. This code has fallen out of date
and is taking too much effort to maintain. We will very likely
reimplement it completely once future changes are made to the
framework proper.
commit 538150c5845ad903773ca797c740048174116aa4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 25 22:28:09 2022 -0600
Applied race condition fix to sup thread decorator.
Details:
- Applied the race condition bugfix in commit 7d23dc2 to the
corresponding sup code in bli_l3_sup_decor.c. Note that in the case
of sup, the race condition would have only manifested when optional
packing was enabled at runtime (typically via setting BLIS_PACK_A
and/or BLIS_PACK_B environment variables).
- Both the fix in this commit and the fix in 7d23dc2 address bugs
that were introduced when the thrinfo_t trees/communicators were
restructured in the October omnibus commit (aeb5f0c).
commit 7d23dc2a064a371dc9883e2c2c7236a70912428c
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Dec 25 19:09:14 2022 -0600
Fix a race condition which manifested as incorrect results (rarely). (702)
The problem occurs when there are at least two teams of threads packing different parts of a matrix, and where each team has at least two threads; call them team A and team B. The problematic sequence is:
1. The chief of team A checks out a block B and broadcasts the pointer to its teammates.
2. Team A completely packs their data and performs a barrier amongst themselves.
3. Team A commences computing with the packed data.
4. The chief of team A finishes computing before its teammates, then calls bli_thrinfo_free on its thrinfo_t struct (which contains the mem_t object referencing the buffer B). This causes buffer B to be checked back in to the pba.
5. The chief of team B checks out the *same* block B that was just checked back in and broadcasts the pointer to its teammates.
6. DATA RACE: now the remaining threads of team A are reading *while* team B are writing to the same buffer B. If team B writes new data before team A is done computing then an incorrect result is generated.
The solution is to place a global barrier before the call to bli_thrinfo_free at the end of the computation.
Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>
commit 3accacf57d11e9b109339754f91bf22329b6cb6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 16 10:26:33 2022 -0600
Skip 1m optimization when forcing hemm_l/symm_l. (697)
Details:
- Fixed a bug in right-sided hemm when:
- using the 1m method,
- defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration,
and
- the storage of C matches the gemm microkernel IO preference PRIOR to
the right-sidedness being detected and recast in terms of the left-
side code path.
It turns out that bli_gemm_ind_recast_1m_params() was applying its
optimization (recasting a complex-domain macrokernel calling a 1m
virtual microkernel to a real-domain macrokernel calling the real-
domain microkernel) in situations in which it should not have. The
optimization was silently assuming that the storage of C always
matched that of the microkernel preference, since the front-end (in
this case, bli_hemm_front()) would have already had a chance to
transpose the operation to bring the two into agreement. However, by
disabling right-sided hemm, we deprive BLIS of that flexibility (as a
transposed left-sided case would necessarily have to become a right-
sided case), and thus the assumption was no longer holding in all
cases. Thanks to Nisanth M P for reporting this bug in Issue 621.
- The aforementioned bug, and its bugfix, also apply to symm when
BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.
commit 4833ba224eba54df3f349bcb7e188bcc53442449
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 12 20:26:02 2022 -0600
Fixed perf of mt sup with packing, and mt gemmlike. (696)
Details:
- Brought the gemmsup code path up to date relative to the latest
thrinfo_t semantics introduced in the October Omnibus commit
(aeb5f0c). This was done by passing the prenode (instead of the
current node) into the packm variant within bli_l3_sup_packm.c as well
as creating the prenodes and attaching them to the thrinfo_t tree in
bli_l3_sup_thrinfo_create(). These changes erase the performance
degradation introduced in the omnibus when running multithreaded sup
with optional packing enabled. Special thanks to Devin Matthews for
sussing out this fix in short order.
- Fixed the gemmlike sandbox in a manner similar to that of sup with
packing, described above. This also involved passing the prenode into
the local gemmlike packm variant. (Recall that gemmlike recycles the
use of bli_l3_sup_thrinfo_create(), so it automatically inherits that
part of the sup fix described above.)
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
bli_thrinfo_thread_id(), respectively.
commit db10dd8e11a12d85017f84455558a82c0093b1da
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 29 19:10:31 2022 -0600
Fixed _gemm_small() prototype; disabled gemm_small.
Details:
- Fixed a mismatch between the prototype for bli_gemm_small() in
bli_gemm_front.h and the actual definition of bli_gemm_small() in
kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring
the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting
this issue.
- Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM
macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix
implementation should probably remain disabled in vanilla BLIS, at
least for now.
commit f0337b784d164ae505ca0e11277a1155680500d1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Nov 13 21:36:47 2022 -0600
Trivial whitespace/comment tweaks.
Details:
- Trivial whitespace and comment changes, most of which ideally would
have been part of the previous commit pertaining to HPX (2b05948).
commit 2b05948ad2c9785bc53f376d53a7141cbc917447
Author: ct-clmsn <ct.clmsngmail.com>
Date: Sun Nov 13 17:40:22 2022 -0500
blis support for hpx (682)
Implement threading backend via HPX.
HPX is an asynchronous many-task runtime system used in high-performance computing applications. The runtime implements the ISO C++ parallelism specification and provides a user-space thread implementation.
This PR provides BLIS a thread backend implementation using HPX and resolves feature request 681. The configuration script, makefiles, and testsuite have been updated to support an HPX build option. The addition of HPX support provides other developers an exemplar for integrating other C++ threading backends into BLIS.
Co-authored-by: ctaylor <ctaylorpennywise.cm.cluster>
Co-authored-by: Devin Matthews <damatthewssmu.edu>
commit e1ea25da43508925e33d4e57e420cfc0a9de793f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 11 12:07:51 2022 -0600
Fixed subtle barrier_fpa bug in bli_thrcomm.c. (690)
Details:
- In bli_thrcomm.c, correctly initialize the BLIS_OPENMP element of the
barrier function pointer array (barrier_fpa) to NULL when
BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the
BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is
not enabled. This bug was introduced in a1a5a9b and was likely the
result of an incomplete edit. The effects of the bug would have
likely manifested when querying a thrcomm_t that was initialized with
a timpl_t value corresponding to a threading implementation that was
omitted from the -t option at configure-time.
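
The dispatch pattern behind this fix can be sketched roughly as follows. This is only an illustration, not the BLIS source: the impl_t enum, the ENABLE_OPENMP and ENABLE_PTHREADS macros, and the barrier_*() functions are hypothetical stand-ins for timpl_t, the BLIS_ENABLE_* macros, and the real barrier functions. The point is that every slot of the function pointer array receives an explicit value, with NULL marking an implementation that was not compiled in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for timpl_t. */
typedef enum { IMPL_SINGLE = 0, IMPL_OPENMP, IMPL_POSIX, IMPL_COUNT } impl_t;

typedef int (*barrier_ft)(void);

/* Placeholder barriers; each just reports which backend it belongs to. */
static int barrier_single(void) { return IMPL_SINGLE; }
#ifdef ENABLE_OPENMP
static int barrier_openmp(void) { return IMPL_OPENMP; }
#endif
#ifdef ENABLE_PTHREADS
static int barrier_posix(void) { return IMPL_POSIX; }
#endif

/* Every element is initialized explicitly; disabled backends map to NULL. */
static barrier_ft barrier_fpa[IMPL_COUNT] =
{
    [IMPL_SINGLE] = barrier_single,
#ifdef ENABLE_OPENMP
    [IMPL_OPENMP] = barrier_openmp,
#else
    [IMPL_OPENMP] = NULL,
#endif
#ifdef ENABLE_PTHREADS
    [IMPL_POSIX]  = barrier_posix,
#else
    [IMPL_POSIX]  = NULL,
#endif
};

/* Returns nonzero if the requested implementation was compiled in. */
int impl_is_available(impl_t ti)
{
    assert((int)ti >= 0 && (int)ti < IMPL_COUNT);
    return barrier_fpa[ti] != NULL;
}
```

With neither hypothetical macro defined, only the IMPL_SINGLE slot is non-NULL, so a caller can test a slot before dispatching through it.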
commit dc6e5f3f5770074ba38554541b8b64711a68c084
Author: leekillough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Nov 3 18:33:08 2022 -0500
Enhance emacs formatting of C files to remove trailing whitespace and ensure a newline at the end of file
commit 713d078075a4a563a43d83fd0880ab5091c2e4a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 3 20:00:11 2022 -0500
Delete mpi_test garbage. (689)
Details:
- tlrmchlsmth: "What even is this? No comments, no commit message, not
used by anything. Trash."
commit 8d813f7f12732d52c95570ae884d5defbfd19234
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 3 19:10:47 2022 -0500
Some decluttering of the top-level directory.
Details:
- Relocated 'mpi_test' directory to test/mpi_test.
- Relocated 'so_version' and 'version' files from top-level directory to
'build' directory.
- Updated build/bump-version.sh script to accommodate relocation of
'version' file to 'build' directory.
- Updated configure script to accommodate relocation of 'so_version'
file to 'build' directory.
- Updated INSTALL file to replace pointers to blis-devel mailing list
with a pointer to docs/Discord.md.
- Updated RELEASING file to contain a reminder to consider whether the
so_version file should be updated prior to the release.
commit 6774bf08c92fc6983706a91bbb93b960e8eef285
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Nov 3 15:20:47 2022 -0500
Fix typo in configure --help text. (686)
Details:
- Fixed a misspelling in the --help description for the --int-size (-i)
configure option.
commit 872898d817f35702e7678ff7f3eeff0f12e641f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 21:53:22 2022 -0500
Fixed trmm[3]/trsm performance bug in cf7d616. (685)
Details:
- Fixed a performance bug in the packing of micropanels that intersect
the diagonal of triangular matrices (i.e., those found in trmm, trmm3,
and trsm). This bug was introduced in cf7d616 and stemmed from an
ill-formed boolean conditional expression in bli_packm_blk_var1().
This conditional would choose when to use round-robin parallel work
allocation, but checked for the triangularity of the submatrix being
packed while failing also to check for whether the current micropanel
actually intersected the diagonal. The net result of this bug was that
*all* micropanels of a triangular matrix, no matter where the upanels
resided within the matrix, were assigned to threads via a round-robin
policy. This affected some microarchitectures and threading
configurations much worse than others, but it seems that overall the
effect was universally negative, likely because of the reduced spatial
locality during the packing with round-robin. Thanks to Leick Robinson
for his tireless efforts in helping track down this issue.
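
As a rough sketch of the corrected condition (not the actual code in bli_packm_blk_var1()), the round-robin decision needs both tests: the matrix is triangular, and the current micropanel really overlaps the diagonal. The function name and the index convention below (a panel covering rows [i0, i0+mr) and columns [j0, j0+k), main diagonal only) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: decide whether a micropanel should be assigned to
   threads round-robin. The panel covers rows [i0, i0+mr) and columns
   [j0, j0+k). It touches the main diagonal when some index i lies in both
   the row range and the column range. */
bool use_round_robin(bool is_triangular, long i0, long mr, long j0, long k)
{
    assert(mr > 0 && k > 0);

    long lo = (i0 > j0) ? i0 : j0;                     /* max of the starts */
    long hi = (i0 + mr < j0 + k) ? i0 + mr : j0 + k;   /* min of the ends   */
    bool intersects_diag = lo < hi;

    /* The buggy form effectively returned is_triangular alone. */
    return is_triangular && intersects_diag;
}
```

For example, a triangular matrix's panel at rows 0..5 and columns 50..99 does not touch the diagonal, so under this sketch it would not use round-robin assignment.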
commit edcc2f9940449f7d9cefcfc02159d27b013e7995
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 19:04:49 2022 -0500
Support --nosup, --sup configure options. (684)
Details:
- Added --nosup and --sup as alternative ways of requesting that sup be
disabled or enabled. These are analogous to --disable-sup-handling and
--enable-sup-handling, respectively. (I got tired of typing out
--disable-sup-handling and needed a shorthand notation.)
- Tweaked message output by configure when sup is enabled/disabled for
clarity and specificity.
- Whitespace changes.
commit 5eea6ad9eb25f37685d1ae4ae08c73cd1daca297
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 17:07:54 2022 -0500
Add mention of Wilkinson Prize to README.md. (683)
Details:
- Added blurbs and links to Wilkinson Prize to README.md.
- Added mention of both Best Paper and Wilkinson Prizes to the top of
README.md.
- Other minor tweaks.
commit 29f79f030e939969d4f3876c4fdaac7b0c5daa63
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 31 18:57:45 2022 -0500
Fixed performance bug caused by redundant packing. (680)
Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
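
To illustrate why the two field pairs must not be mixed, here is a minimal sketch of dividing n micropanels among n_way workers by work_id. The function and its contiguous-slice policy are assumptions for illustration, not the BLIS partitioning code. If one part of the code computed a slice from (num_threads, thread_id) and another from (n_way, work_id), workers could end up with identical slices and pack the same micropanels.

```c
#include <assert.h>

/* Hypothetical helper: give worker work_id (0 <= work_id < n_way) a
   contiguous slice [*start, *end) of n items. The slices are disjoint and
   together cover [0, n). */
void work_range(long n, long n_way, long work_id, long *start, long *end)
{
    assert(n >= 0 && n_way > 0 && work_id >= 0 && work_id < n_way);

    long base = n / n_way;   /* items every worker receives          */
    long rem  = n % n_way;   /* first 'rem' workers get one extra    */

    *start = work_id * base + (work_id < rem ? work_id : rem);
    *end   = *start + base + (work_id < rem ? 1 : 0);
}
```

With n = 10 and n_way = 4, the four workers receive [0,3), [3,6), [6,8), and [8,10).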
commit aeb5f0cc19665456e990a7ffccdb09da2e3f504b
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 27 12:39:11 2022 -0500
Omnibus PR - Oct 2022 (678)
Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.
commit c803b03e52a7a6997a8d304a8cfa9acf7c1c555b
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 26 18:20:00 2022 -0500
Add check to disable armsve on Apple M1.
commit 2dd692b710b6a9889f7ebdd7934a2108be5c5530
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 26 18:10:26 2022 -0500
Fix auto-detection of firestorm (Apple M1).
commit 88105dbecf0f9dfbfa30215743346e8bd6afb971
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 21 15:16:12 2022 -0500
Added Discord documentation (677)
Details:
- Added a docs/Discord.md markdown document that walks the reader
through creating a Discord account, obtaining the invite link, and
using the link to join the BLIS Discord server.
- Updated README.md to reference the new Discord.md document in multiple
places, including via the official Discord logo (used with explicit
permission from representatives at Discord Inc.).
commit 23f5b8df3e802a27bacd92571184ec57bbdfa646
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 17 20:21:21 2022 -0500
Shuffled checked properties in bli_l3_check.c. (676)
Details:
- Added certain checks for matrix structure to the level-3 operations'
_check() functions, and slightly reorganized existing checks.
commit 9453e0f163503f64a290256b4be53d8882224863
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 3 19:46:20 2022 -0500
CREDITS file update.
Details:
- This attribution was intended to go in PR 647.
commit 76a23bd8c33e161221891935a489df9a9fb9c8c0
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 3 15:55:07 2022 -0500
Reinstate sanity check in bli_pool_finalize. (671)
Details:
- Added a reinit argument to bli_pool_finalize(). This bool will signal
whether or not the function is being called from bli_pool_reinit(). If
it is not being called from _reinit(), we can safely check to confirm
that .top_index == 0 (i.e., all blocks have been checked in). But if
it *is* being called from _reinit(), then that check will be skipped
since one of the predicted use cases for bli_pool_reinit() anticipates
that some blocks are (probably) checked out when the pool_t is
reinitialized.
- Updated existing invocations of bli_pool_finalize() to pass in either
FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or
TRUE (from bli_pool_reinit()) for the new reinit argument.
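
The logic of the reinstated check can be sketched as below. The struct is a reduced, hypothetical stand-in for pool_t that keeps only the field the check needs, and the function merely reports whether the check passes; how the real bli_pool_finalize() reacts to a failed check is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced pool: top_index counts blocks currently checked out. */
typedef struct { int top_index; } pool_sketch_t;

/* Returns true when the sanity check passes (or is skipped). */
bool pool_finalize_check(const pool_sketch_t *pool, bool reinit)
{
    assert(pool != NULL);

    /* During reinitialization some blocks may legitimately be checked out,
       so the check is skipped. */
    if (reinit) return true;

    /* Otherwise every block must have been checked back in. */
    return pool->top_index == 0;
}
```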
commit 63470b49e3b9b15e00a8f666e86ccd70c6005fe9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 29 18:52:08 2022 -0500
Fix some bugs in bli_pool.c (670)
Details:
- Add a check for premature pool exhaustion when checking in blocks via
bli_pool_checkin_block(). This detects "double-free" and other bad
conditions that don't necessarily result in a segfault.
- Make sure to copy all block pointers when growing the pool size.
Previously, checked-out block pointers (which are guaranteed to be set
to NULL) were not being copied, leading to the presence of
uninitialized data.
commit 42d0e66318b186d25eeb215b40ce26115401ed8b
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 29 17:38:02 2022 -0500
Add AddressSanitizer (-fsanitize=address) option. (669)
Details:
- Added support for AddressSanitizer (ASan), a compiler-integrated
memory error detector. The option (disabled by default) enables
compiling and linking with the -fsanitize=address flag supported by
clang, gcc, and probably others. This flag is employed during
compilation of all BLIS source files *except* for optimized kernels,
which are exempted because ASan usually requires an extra register,
which violates the constraints for many gemm microkernels.
- Minor whitespace, comment, ordering, and configure help text updates.
commit b861c71b50c6d48cb07282f44aa9dddffc1f1b3f
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 23 13:22:27 2022 -0500
Add consistent NaN/Inf handling in sumsqv. (668)
Details:
- Changed sumsqv implementation as follows:
- If there is a NaN (either real or imaginary), then return a sum of
NaN and unit scale.
- Else, if there is an Inf (either real or imaginary), then return a
sum of +Inf and unit scale.
- Otherwise behave as normal.
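
The three cases above can be sketched for a real, unit-stride vector as follows. This is an illustration rather than the BLIS kernel: the function name is hypothetical, and the "normal" branch assumes a LAPACK lassq-style update in which scale and sumsq are in/out arguments satisfying scale^2 * sumsq = sum of squares.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of sumsqv on a real vector x of length n. */
void sumsqv_sketch(size_t n, const double *x, double *scale, double *sumsq)
{
    assert(scale != NULL && sumsq != NULL);
    assert(x != NULL || n == 0);

    bool has_inf = false;
    for (size_t i = 0; i < n; ++i)
    {
        /* NaN takes priority: return a sum of NaN and unit scale. */
        if (isnan(x[i])) { *scale = 1.0; *sumsq = NAN; return; }
        if (isinf(x[i])) has_inf = true;
    }

    /* No NaN, but at least one Inf: return a sum of +Inf and unit scale. */
    if (has_inf) { *scale = 1.0; *sumsq = INFINITY; return; }

    /* Normal case: scaled accumulation to avoid overflow/underflow. */
    double s = *scale, ssq = *sumsq;
    for (size_t i = 0; i < n; ++i)
    {
        double a = (x[i] < 0.0) ? -x[i] : x[i];
        if (a > 0.0)
        {
            if (s < a) { ssq = 1.0 + ssq * (s / a) * (s / a); s = a; }
            else       { ssq += (a / s) * (a / s); }
        }
    }
    *scale = s;
    *sumsq = ssq;
}
```

Starting from scale = 0 and sumsq = 1, the vector (3, 4) yields scale = 4 and sumsq = 1.5625, so scale^2 * sumsq = 25.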
commit ee81efc7887374c974a78bfb3e0865776b2f97a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 22 19:15:07 2022 -0500
Parameterized test/3 drivers via command line args. (667)
Details:
- Rewrote the drivers in test/3, the Makefile, and the runme.sh script
so that most of the important parameters, including parameter combo,
datatype, storage combo, induced method, problem size range, dimension
bindings, number of repeats, and alpha/beta values can be passed in
via command line arguments. (Previously, most of these parameters were
hard-coded into the driver source, except a few that were hard-coded
into the Makefile.) If no argument is given for any particular option,
it will be assigned a sane default. Either way, the values employed at
runtime will be printed to stdout before the performance data in a
section that is commented out with '%' characters (which is used by
matlab and octave for comments), unless the -q option is given, in
which case the driver will proceed quietly and output only performance
data. Each driver also provides extensive help via the -h option, with
the help text tailored for the operation in question (e.g. gemm, hemm,
herk, etc.). In this help text, the driver reminds the user which
implementation it was linked to (e.g. blis, openblas, vendor, eigen).
Thanks to Jeff Diamond for suggesting this CLI-based reimagining of
the test/3 drivers.
- In the test/3 drivers: converted cpp macro string constants, as well
as two string literals (for the opname and pc_str) used in each test
driver, to global (or static) const char* strings, and replaced the
use of strncpy() for storing the results of the command line argument
parsing with pointer copies from the corresponding strings in argv.
This works because the argv array is guaranteed by the C99 standard
to persist throughout the life of the program. This new approach uses
less storage and executes faster. Thanks to Minh Quan Ho for
recommending this change.
- Renamed the IMP_STR cpp macro that gets defined on the command line,
via the test/3/Makefile, to IMPL_STR.
- Updated runme.sh to set the problem size ranges for single-threaded
and multithreaded execution independently from one another, as well as
on a per-system basis.
- Added a 'quiet' variable to runme.sh that can easily toggle quiet mode
for the test drivers' output.
- Very minor typecast fix in call to bli_getopt() in bli_utils.c.
- In bli_getopt(), changed the nextchar variable from being a local
static variable to a field of the getopt_t state struct. (Not sure why
it was ever declared static to begin with.)
- Other minor changes to bli_getopt() to accommodate the rewritten test
drivers' command line parsing needs.
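
The switch from strncpy() to pointer copies can be sketched as below. The struct, the default strings, and the "-d"/"-s" option letters are hypothetical and do not reflect the actual test/3 drivers or bli_getopt(); the sketch only shows that storing a pointer into argv avoids both the buffer and the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter struct: holds pointers, not fixed-size buffers. */
typedef struct
{
    const char *dt_str;   /* datatype string      */
    const char *sc_str;   /* storage combo string */
} params_sketch_t;

/* Defaults live in static storage, so pointing at them is always safe. */
static const char *const DEFAULT_DT = "d";
static const char *const DEFAULT_SC = "ccc";

void parse_args_sketch(int argc, char **argv, params_sketch_t *p)
{
    assert(argv != NULL && p != NULL);

    p->dt_str = DEFAULT_DT;
    p->sc_str = DEFAULT_SC;

    /* Options come in "-x value" pairs. Because the argv strings persist
       for the life of the program, we store the pointer itself. */
    for (int i = 1; i + 1 < argc; i += 2)
    {
        if      (strcmp(argv[i], "-d") == 0) p->dt_str = argv[i + 1];
        else if (strcmp(argv[i], "-s") == 0) p->sc_str = argv[i + 1];
    }
}
```

After parsing, a field either aliases one of the argv strings or points at a static default, so nothing needs to be freed.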
commit 036a4f9d822df25a76a653e70be76fb02284d3d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 22 18:36:50 2022 -0500
Refactored some rntm_t management code. (666)
Details:
- Separated the "sanitizing" code from the auto-factorization code
in bli_rntm_set_ways_from_rntm() and _rntm_set_ways_from_rntm_sup().
The sanitizing code now resides in bli_rntm_sanitize() while the
factorization code resides in bli_rntm_factorize() and
bli_rntm_factorize_sup(). (There are two different functions because
the conventional and sup factorization codes are currently somewhat
different.) Also note that the factorization code now relies on the
.auto_factor field to have already been set, either during
rntm_t initialization or when the rntm_t was previously updated and
sanitized. So rather than locally determining whether to
auto-factorize, those functions just read the .auto_factor field and
proceed accordingly.
- Refactored and removed most code from bli_thread_init_rntm_from_env().
This function now reads the environment variables needed to set nt,
jc, pc, ic, jr, and ir; sets them into the global rntm_t; and then
calls bli_rntm_sanitize() in order to make sure that the contents are
in a "good" state. Thanks to Devin Matthews for suggesting this
refactoring.
- Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() such that
if multithreading is disabled at compile time (that is, if the cpp
macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the
caller's request and instead clear the nt and ways fields.
- Redefined bli_thread_set_num_threads() and bli_thread_set_ways() such
that if multithreading is disabled at compile time (that is, if the
cpp macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the
caller's request and do nothing.
- Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() as true
functions rather than static inline functions.
- In bli_rntm.c, statically initialize the global_rntm global variable
via the BLIS_RNTM_INITIALIZER macro.
- In bli_rntm.h, defined bli_rntm_clear_auto_factor(), which sets the
.auto_factor field of the rntm_t to FALSE.
- Reorganized order of some inline function definitions in bli_rntm.h.
- Changed the default value given to the .auto_factor field by the
BLIS_RNTM_INITIALIZER macro from TRUE to FALSE.
- Call bli_rntm_clear_auto_factor() instead of
bli_rntm_set_auto_factor_only() in bli_rntm_init().
- Comment/whitespace updates.
commit a1a5a9b4cbef9208da494c45a2f933a8e82559ac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 21 18:31:01 2022 -0500
Implemented support for fat multithreading. (665)
Details:
- Allow the user to configure BLIS in such a way that multiple threading
implementations get compiled into the library, with one of those
implementations chosen at runtime. For now, there are only three
implementations available: OpenMP, pthreads, and single. (Here,
'single' merely refers to single-threaded mode.) The configure script
now allows the user to give the -t option with a comma-separated list
of values, such as '-t openmp,pthreads'. The first value in the list
will always be the default at library initialization time, and
'single' is always silently appended to the end of the list. The user
can specify which implementation should execute in one of three ways:
by setting the BLIS_THREAD_IMPL environment variable prior to launch;
by calling the bli_thread_set_thread_impl() global runtime API; or by
encoding their choice into a rntm_t that is passed into one of the
expert interfaces. Any of these three choices overrides the
initialization-time default (i.e., the first value listed to the -t
configure option). Requesting an implementation that was not compiled
into the library will result in an error message followed by
bli_abort().
- Relocated the 'auto' logic for the -t option from the top-level
Makefile to the configure script. (Currently, this logic is pretty
dumb, choosing 'openmp' for gcc and icc, and 'pthreads' for clang.)
- Defined a new 'timpl_t' enum in bli_type_defs.h, with three valid
values: BLIS_SINGLE, BLIS_OPENMP, BLIS_POSIX.
- Reorganized the thrcomm_t struct into a single definition with two
preprocessor blocks, one each for additional fields needed by OpenMP
and pthreads.
- Added timpl_t argument to bli_thrcomm_bcast(), bli_thrcomm_barrier(),
bli_thrcomm_init(), and bli_thrcomm_cleanup(), which these functions
need since they are now wrappers that choose the implementation-
specific function corresponding to the currently enabled threading
implementation.
- Added rntm_t* to bli_thread_broadcast(), bli_thread_barrier() so that
those functions can pass the timpl_t value into bli_thrcomm_bcast()
and bli_thrcomm_barrier(), respectively.
- Defined bli_env_get_str() in bli_env.c to allow the querying of
BLIS_THREAD_IMPL (which, unlike BLIS_NUM_THREADS and friends, is
expected to be a string).
- Defined bli_thread_get_thread_impl(), bli_thread_set_thread_impl() to
get and set the current threading implementation at runtime.
- Defined bli_rntm_thread_impl() and bli_rntm_set_thread_impl() to query
and set the threading implementation within a rntm_t. Also choose
BLIS_SINGLE as the default value when initializing rntm_t structs.
- Added bli_info_get_*() functions to query whether OpenMP or pthreads
would be chosen as the default at init-time. Note that this only
tests whether OpenMP or pthreads is the first implementation in the
list passed to the threading configure option (-t) and is *not* the
same as querying which implementation is currently selected, since
that can be influenced by BLIS_THREAD_IMPL and/or
bli_thread_set_thread_impl().
- Changed l3int_t to l3int_ft.
- Updated docs/Multithreading.md to document the new behavior.
- Updated sandbox/gemmlike and addon/gemmd to work with the new fat
threading feature. This included a few bugfixes to bring the codes up
to date, as necessary.
- Comment, whitespace updates.
commit 89df7b8fa3a3e47ab2fc10ac4d65d0b9fde16942
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Sep 18 18:46:57 2022 -0500
De-templatized _sup_var1n2m.c; unified _sup_packm_a/b(). (659)
Details:
- Re-expressed the two variants in frame/3/bli_l3_sup_var1n2m.c as a
single function each that performs char* pointer arithmetic rather
than four datatype-specific functions. Did the same for the functions
in bli_l3_sup_packm_a.c and _sup_packm_b.c, and then unified the two
into a single set of functions for packing either A or B, which now
resides in bli_l3_sup_packm.c.
- Pre-grow the cntl_t tree in both bli_l3_sup_var1n2m.c variants rather
than grow them incrementally.
- Relocated empty-matrix and scale-by-beta early return handling from
bli_gemm_front() and bli_gemmt_front() to their _ex() counterparts.
- Comment, whitespace updates.
commit fb91337eff1ee2098f315a83888f6667b3a56f86
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 19:08:10 2022 -0500
Fixed a harmless pc_nt bug in 05a811e.
Details:
- Added missing curly braces around some statements in bli_rntm.c, one
of which needed them in order for the relevant code to be executed in
the intended way. The consequence of 05a811e omitting those braces was
that a statement (pc_nt = 1;) was executed more often than it needed
to be.
- Also adjusted the analogous code in bli_thread.c to match that of
bli_rntm.c.
commit e86076bf4461d1a78186fb21ba8320cfb430f62c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 14:22:59 2022 -0500
Test the 'gemmlike' sandbox via AppVeyor. (664)
Details:
- Added a fifth test to our .appveyor.yml that enables the 'gemmlike'
sandbox with OpenMP enabled (via clang, the 'auto' configuration
target, and building to a static library). Thanks to Jeff Diamond
for pointing out that this test would be useful.
commit 63177dca48cb7d066576d884da4a7a599ececebf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 11:21:26 2022 -0500
Fixed gemmlike sandbox bug introduced in 7c07b47.
Details:
- Fixed a bug in the 'gemmlike' sandbox that was introduced in 7c07b47.
This bug was the result of the fact that the gemmlike implementation
uses bli_thrinfo_sup_grow() to grow its thrinfo_t tree, but the
aforementioned commit added an optimization that kicks in when the
rntm_t .pack_a and .pack_b fields are both FALSE. Those fields were
originally added only for sup execution; for the large code path, they
are intended to be ignored. But the default initial state of a rntm_t
has those fields set to FALSE, which was inadvertently activating the
optimization (which targeted single-threaded cases only) and would
cause multithreaded use cases of 'gemmlike' to segfault. The fix took
the form of setting the .pack_a and .pack_b fields to TRUE in
bls_gemm_ex().
- Added minimal 'const' and 'const'-casting to 'gemmlike' so that gcc
stays quiet.
commit 05a811e898b371a76581abd4afa416980cce7db9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 19:24:05 2022 -0500
Initialize rntm_t nt/ways fields with 1 (not -1). (663)
Details:
- Changed the way that rntm_t structs are initialized, mainly so that
the global rntm_t that is set via environment variables at runtime
may be queried by the application prior to any computation taking
place. (Strictly speaking, the application may already query these
fields, but they do not always contain valid values and often contain
-1 when they are unset.) These changes also served to clarify how
these parameters are treated, and homogenized the implementations of
bli_rntm_set_ways_from_rntm(), bli_rntm_set_ways_from_rntm_sup(), and
bli_thread_init_rntm_from_env(). Special thanks to Jeff Diamond,
Leick Robinson, and Devin Matthews for pointing out that the previous
behavior was needlessly confusing and could be improved.
- The aforementioned modifications also included subtle changes as to
what counts as "setting" a loop's ways of parallelism for the purposes
of deciding whether to use the ways or the total number of threads.
Previously, setting any loop's ways, even to 1, counted in favor of
using the ways. Now, only values greater than 1 will count as
"setting", and all other values will silently be mapped to 1, with
those parameters treated as if they were untouched all along.
- Updated bli_rntm.h and bli_thread.c so that any attempt to set the
PC_NT variable (or pc_nt field of a rntm_t) will either ignore the
request or reassert the value as 1.
- Updated bli_rntm_set_ways() so that rather than clear the
num_threads field, it is set to the product of all of the per-loop
ways of parallelism.
- Removed code from test_libblis.c that handled the possibility of unset
environment variables when printing out their values.
- Removed bli_rntm_equals() inline function from bli_rntm.h, which has
long been disabled.
- Updates to docs/Multithreading.md related to the aforementioned
changes.
- Comment updates.
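
The rules described in this entry can be summarized in a small sketch. The struct and function names are hypothetical and far simpler than the real rntm_t code; the sketch assumes only what is stated above: values not greater than 1 are mapped to 1, pc is reasserted as 1, num_threads becomes the product of the per-loop ways, and only values greater than 1 count as "setting" the ways.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced view of the nt/ways fields. */
typedef struct { long nt, jc, pc, ic, jr, ir; } ways_sketch_t;

static long clamp_way(long w) { return (w > 1) ? w : 1; }

void set_ways_sketch(ways_sketch_t *r, long jc, long pc, long ic, long jr, long ir)
{
    assert(r != NULL);
    (void)pc;                 /* requests to set pc are ignored ...     */

    r->jc = clamp_way(jc);
    r->pc = 1;                /* ... and the value is reasserted as 1   */
    r->ic = clamp_way(ic);
    r->jr = clamp_way(jr);
    r->ir = clamp_way(ir);

    /* num_threads is the product of all per-loop ways of parallelism. */
    r->nt = r->jc * r->pc * r->ic * r->jr * r->ir;
}

/* Only a value greater than 1 counts as having "set" the ways. */
bool ways_were_set(const ways_sketch_t *r)
{
    assert(r != NULL);
    return r->jc > 1 || r->pc > 1 || r->ic > 1 || r->jr > 1 || r->ir > 1;
}
```

For instance, requesting jc=2, pc=4, ic=3, jr=-1, ir=0 produces ways of 2, 1, 3, 1, 1 and a thread count of 6 under this sketch.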
commit fd885cf98f4fe1d3bc46468e567776c37c670fcc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 11:50:23 2022 -0500
Use kernel CFLAGS for 'kernels' subdirs in addons. (658)
Details:
- Updated Makefile and common.mk so that the targeted configuration's
kernel CFLAGS are applied to source files that are found in a
'kernels' subdirectory within an enabled addon. For now, this
behavior only applies when the 'kernels' directory is at the top
level of the addon directory structure. For example, if there is an
addon named 'foobar', the source code must be located in
addon/foobar/kernels/ in order for it to be compiled with the target
configuration's kernel CFLAGS. Any other source code within
addon/foobar/ will be compiled with general-purpose CFLAGS (the same
ones that were used on all addon code prior to this commit). Thanks
to AMD (esp. Mithun Mohan) for suggesting this change and catching an
intermediate bug in the PR.
- Comment/whitespace updates.
commit cb74202db39dc8cb81fdd06f8a445f8837e27853
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 11:46:24 2022 -0500
Fixed incorrect sizeof(type) in edge case macros. (662)
Details:
- In bli_edge_case_macro_defs.h, the GEMM_UKR_SETUP_CT_PRE() and
GEMMTRSM_UKR_SETUP_CT_PRE() macros previously declared their temporary
ct microtiles as:
PASTEMAC(ch,ctype) \
_ct[ BLIS_STACK_BUF_MAX_SIZE / sizeof( PASTEMAC(ch,type) ) ] \
__attribute__((aligned(alignment))); \
The problem here is that sizeof( PASTEMAC(ch,type) ) evaluates to
things like sizeof( BLIS_DOUBLE ), not sizeof( double ), and since
BLIS_DOUBLE is an enum, it is typically an int, which means the
sizeof() expression is evaluating to the wrong value. This was likely
a benign bug, though, since BLIS does not support any computational
datatypes that are smaller than sizeof( int ), which means the ct
array would be *over*-allocated rather than underallocated. Thanks
to moon-chilled for identifying and reporting this bug in 624.
- CREDITS file update.
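
The effect of taking sizeof() of an enum constant instead of a type can be reproduced in isolation. The enum, the buffer-size macro, and the function names below are hypothetical stand-ins (for BLIS_DOUBLE, BLIS_STACK_BUF_MAX_SIZE, and the macro bodies), chosen only to show why the element count came out too large rather than too small.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the BLIS datatype enum and stack buffer size. */
typedef enum { NUM_FLOAT_SKETCH, NUM_DOUBLE_SKETCH } num_sketch_t;
#define STACK_BUF_MAX_SIZE_SKETCH 512

/* Buggy form: in C an enumeration constant has type int, so this divides
   by sizeof(int), not sizeof(double). */
size_t ct_len_buggy(void)
{
    return STACK_BUF_MAX_SIZE_SKETCH / sizeof(NUM_DOUBLE_SKETCH);
}

/* Fixed form: divide by the size of the actual element type. */
size_t ct_len_fixed(void)
{
    return STACK_BUF_MAX_SIZE_SKETCH / sizeof(double);
}

/* The bug is benign only while int is no larger than the element type. */
int ct_is_overallocated(void)
{
    assert(sizeof(int) <= sizeof(double));
    return ct_len_buggy() >= ct_len_fixed();
}
```

On a platform where int is 4 bytes and double is 8 bytes, the buggy form yields 128 elements where 64 would suffice.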
commit 6e5431e8494b06bd80efcab3abf0a6456d6c0381
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Sep 10 15:16:58 2022 -0500
Fix line number issue in flattened blis.h. (660)
Details:
- Updated the top-level Makefile so that it invokes flatten-headers.py
without the -c option, which was requesting that comments be stripped
(since comment stripping is disabled by default).
- Updated flatten-headers.py to accept a new option (-l) to enable
insertion of line directives into the output file. This new option
is enabled by default.
- Also added logic to flatten-headers.py that outputs a warning if both
comment stripping and line numbers are requested since the comment
stripping will cause the line numbers to become inaccurate.
commit 4afe0cfdab0e069e027f97920ea604249e34df47
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 8 18:33:20 2022 -0500
Defined invscalv, invscalm, invscald operations. (661)
Details:
- Defined invert-scale (invscal) operation on vectors (level-1v),
matrices (level-1m), and diagonals (level-1d).
- Added test modules for invscalv and invscalm to the testsuite.
- Updated BLISObjectAPI.md and BLISTypedAPI.md API documentation to
reflect the new operations. Also updated KernelsHowTo.md accordingly.
- Renamed 'beta' to 'alpha' in scalv and scalm testsuite modules (and
input.operations files) so that the parameter name matches the
parameter used in the documentation.
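
As a minimal sketch of what invert-scale means for a real vector, each element is divided by alpha. The function below is hypothetical: its name and signature are not those of the BLIS typed API, it ignores conjugation, and treating alpha == 0 as a precondition violation is an assumption made here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: x := x / alpha for a real vector with stride incx. */
void invscalv_sketch(size_t n, double alpha, double *x, size_t incx)
{
    assert(alpha != 0.0);
    assert(x != NULL || n == 0);

    for (size_t i = 0; i < n; ++i)
        x[i * incx] /= alpha;
}
```

Applying it with alpha = 2 to the vector (2, 4, 6) gives (1, 2, 3).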
commit a87eae2b11408b556e562f1b04e673c6cd1612bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 6 18:04:09 2022 -0500
Added '-q' quiet mode option to testsuite. (657)
Details:
- Added support for a '-q' command line option to the testsuite. This
option suppresses most informational output that would normally
clutter up the screen. By default, verbose mode (the previous
status quo) will be operative, and so quiet mode must be requested.
commit dfa54139664a42d29774e140ec9e5597af869a76
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Aug 30 08:07:50 2022 +0800
Arm64 dgemmsup with extended MR&NR (655)
Details:
- Since the number of registers in NEON is large but their lengths are
short, I'm here extending both MR and NR.
- The approach is to represent the C microtile in registers optionally
in columns, so for sizes like 6x7m, the 'crr' kernel is the default
with 'rrr' supported through an in-register transpose.
- A few asm kernels are crafted for 'rv' to complete this extended size
support.
- For 'rd' I'm still relying heavily on C99 intrinsic kernels with
branching so the performance might not be optimal. (Sorry for that.)
- So far, these changes only affect the 'firestorm' subconfig.
- This commit also contains row-preferential s12x8 and d8x6 gemm
ukernels. These microkernels are templatized versions of the existing
s8x12 and d6x8 ukernels defined in bli_gemm_armv8a_asm_d6x8.c.
commit 9e5594ad5fc41df8ef2825a025d7844ac2275c27
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 11 14:36:38 2022 -0500
Temporarily disabled line directives from 6826c1c.
Details:
- Commented out the inclusion of line preprocessor directives in the
flattened header output provided by build/flatten-headers.py. This
output was added recently in 6826c1c, but was later found to have
thrown off the line numbering referenced by compiler warnings and
errors (possibly due to license comment blocks, which are stripped
from source headers as they are inlined into the monolithic header).
commit 775148bcdbb1014b4881a76306f35f5d0fedecbe
Author: jdiamondGitHub <jeff_diamondfastmail.com>
Date: Fri Aug 5 12:01:24 2022 -0500
Updated ARMv8a kernels to fix 2 prefetching issues. (649)
Details:
- The ARMv8a dgemm/sgemm microkernels had 2 prefetching issues that
impacted performance on modern ARM platforms. The most significant
issue was that only a single prefetch per C tile column was issued.
When a column of C was not cache aligned, the second cache line would
not be prefetched at all, forcing the kernel to wait for an entire
load to update elements of C. This happened with roughly 50% of the
C prefetches. The fix was to have two prefetches per column, spaced
64 bytes (1 cache line) apart.
- A secondary performance issue was that all the C prefetch instructions
were issued sequentially at the beginning of the kernel call. This
caused a noticeable performance slowdown. Interleaving the prefetch
calls every 2-3 instructions in the prologue code solved the issue.
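The two-prefetches-per-column idea above can be sketched in C as follows. This is only an illustration, not the assembly used in the ARMv8a kernels: it assumes a 64-byte cache line, a column-major tile of doubles, and the GCC/Clang builtin __builtin_prefetch. The name demo_prefetch_c_tile is hypothetical, and the interleaving of prefetches with other instructions is not shown.

```c
#include <stddef.h>

#define DEMO_CACHE_LINE 64  /* assumed cache line size in bytes */

/* Hypothetical helper: issue two prefetches per column of a column-major
   C tile, 64 bytes apart, so that a column straddling a cache-line
   boundary has both of its lines requested. Returns the number of
   prefetches issued. */
size_t demo_prefetch_c_tile(const double *c, size_t n_cols, size_t ldc)
{
    size_t issued = 0;
    for (size_t j = 0; j < n_cols; ++j) {
        const char *col = (const char *)(c + j * ldc);
        __builtin_prefetch(col, 1, 3);                    /* first line */
        __builtin_prefetch(col + DEMO_CACHE_LINE, 1, 3);  /* next line  */
        issued += 2;
    }
    return issued;
}
```

A prefetch is only a hint, so the second request costs little when the column happens to fit in a single line.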
commit bbaf29abd942de47a3a99a80a67d12bab41b27db
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 4 17:51:37 2022 -0500
Very minor variable updates to common.mk.
Details:
- Fixed a harmless bug that would have allowed C++ headers into the list
of header suffixes specifically reserved for C99 headers. In practice,
this would have had no substantive effect on anything since the core
BLIS framework does not use C++ headers.
commit a48e29d799091a833213efeafaf2d342ebdafde9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 28 10:11:07 2022 -0500
CREDITS file update.
Details:
- Thanks to Kihiro Bando for assisting with issue 644.
commit 5b298935de7f20462bfad1893ed34ecd691cec5a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 19:14:15 2022 -0500
Removed buggy cruft from power10 subconfig.
Details:
- Removed defines for BLIS_BBN_s and BLIS_BBN_d from
bli_kernel_defs_power10.h. These were inadvertently set in ae10d949
because the power10 subconfig was registering bb packm ukernels, but
only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because
the original author (probably) copy-pasted from power9 when getting
started. That 6xk packm registration was effectively "dead code"
prior to ae10d949, but was then mistaken as not-dead code during the
ae10d949 refactor. These improper bb factors may have been causing
bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind
me what the power10 subconfig was supposed to look like.
- Removed extraneous microkernel preference registrations from power10
subconfig. Preferences for single and double complex gemm were being
registered despite there being no complex gemm ukernels registered to
go with them. Similarly, there were trsm preferences registered
without any trsm ukernels registered (and BLIS doesn't actually use a
preference for the trsm ukernel anyway). These extraneous
registrations were almost surely not hurting anything, even if they
were quite misleading.
commit 56de31b00fa0f1ba866321817cd1e5d83000ff11
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 27 13:54:17 2022 -0500
Disable modification of KC in the gemmsup kernels. (648)
Modifying KC led to a ~50% performance reduction for certain gemm operations (but not others?). See 644 for an example.
commit 4dde947e2ec9e139c162801320c94e6a01a39708
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 26 17:29:32 2022 -0500
Fixed out-of-bounds bug in sup s6x16m haswell kernel.
Details:
- Fixed another out-of-bounds read access bug in the haswell sup
assembly kernels. This bug is similar to the one fixed in 17b0caa
and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
Kannan for reporting this bug (and a suitable fix) in 635.
- CREDITS file update.
commit 6826c1cdfba855513786d9e3d606681316453398
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 25 18:21:05 2022 -0500
Add `line` directives to flattened `blis.h`. (643)
Details:
- Modified flatten-headers.py so that line directives are inserted into
the flattened blis.h file. This facilitates easier debugging when
something is amiss in the flattened blis.h because the compiler will
be able to refer to the line number within the original constituent
header file (which is where the fix would go) rather than the line
number within the flattened header (which is not as helpful).
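For readers unfamiliar with the mechanism, the effect of a line directive can be sketched in a few lines of C. The file name "constituent_header.h" and the function names are made up for illustration; the actual directives emitted by flatten-headers.py are not reproduced here.

```c
#include <string.h>

/* Everything below this directive is reported as if it came from
   "constituent_header.h" (a made-up name), starting at line 100. */
#line 100 "constituent_header.h"
int demo_reported_line(void) { return __LINE__; }
int demo_reported_file_matches(void) { return strcmp(__FILE__, "constituent_header.h") == 0; }
```

Compiler diagnostics use the same overridden file name and line number, which is what lets an error in the flattened header point back at the original header.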
commit af3a41e02534befdae026377592ce437bab83023
Author: Alexander Grund <Flamefireusers.noreply.github.com>
Date: Thu Jul 21 18:05:48 2022 +0200
Add autodetection for POWER7, POWER9 & POWER10 (647)
Read from `/proc/cpuinfo` as done for ARM.
Fixes 501
commit 17b0caa2b2bff439feb6d2b39cfa16e7591882b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 14 17:55:34 2022 -0500
Fixed out-of-bounds read in haswell gemmsup kernels.
Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
single-precision elements of C, via instructions such as:
vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)
in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:
vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)
Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
Thanks to Daniël de Kok for reporting these bugs in 635, and to
Bhaskar Nallani for proposing the fix.
- CREDITS file update.
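The essence of the fix above (read only the two elements that are guaranteed to exist, then operate on a register copy) can be sketched in portable C. This is an analogy rather than the kernel code: the four-float array stands in for an xmm register, and demo_load2_zero_upper is a hypothetical name.

```c
#include <string.h>

/* Hypothetical helper: copy exactly two floats from src into a four-float
   buffer that stands in for an xmm register, zeroing the upper two lanes.
   Reads 8 bytes from src and never touches src[2] or src[3]. */
void demo_load2_zero_upper(const float *src, float lanes[4])
{
    memset(lanes, 0, 4 * sizeof(float));
    memcpy(lanes, src, 2 * sizeof(float));
}
```

Any later arithmetic then uses lanes[] instead of a 16-byte memory operand, so no read extends past the end of a two-element row of C.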
commit cc260fd7068f0fe449d818435aa11adb14c17fed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 13 16:16:01 2022 -0500
Allow uniform max problem sizes in test/3/runme.sh.
Details:
- Tweaked test/3/runme.sh so that the test driver binaries for single-
threaded (st), single-socket (1s), and dual-socket (2s) execution can
be built using identical problem size ranges. Previously, this was not
possible because runme.sh used the maximum problem size, which was
embedded into the binary filename, to tell the three classes of
binaries apart from one another. Now, runme.sh uses the binary suffix
("st", "1s", or "2s") to tell them apart. This required only a few
changes to the logic, but it also required a change in format to the
threading config strings themselves (replacing the max problem size
with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this
improvement.
- Comment updates.
commit 9b1beec60be31c6ea20b85806d61551497b699e4
Author: bartoldeman <bartoldemanusers.noreply.github.com>
Date: Mon Jul 11 20:15:12 2022 -0400
Use BLIS_ENABLE_COMPLEX_RETURN_INTEL in blastest files (636)
Details:
- Fixed a crash that occurs when either cblat1 or zblat1 is linked
with a build of BLIS that was compiled with '--complex-return=intel'.
This fix involved inserting preprocessor macro guards based on
BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and
blastest/src/zblat1.c to correctly handle situations where BLIS is
compiled with Intel/f2c-style calling conventions for complex numbers.
- Updated blastest/src/fortran/run-f2c.sh so that future executions
will insert the aforementioned cpp macro conditional where
appropriate.
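The two return conventions at issue in the entry above can be sketched in C as follows. The type and function names are hypothetical and this is not the BLAS or BLIS interface; the sketch only contrasts returning a complex value directly with writing it through a leading result pointer, which is how the Intel/f2c-style convention is commonly described.

```c
typedef struct { float real; float imag; } demo_scomplex;

/* Convention A: the complex result is the function's return value. */
demo_scomplex demo_cdotu_by_value(int n, const demo_scomplex *x,
                                  const demo_scomplex *y)
{
    demo_scomplex s = { 0.0f, 0.0f };
    for (int i = 0; i < n; ++i) {
        s.real += x[i].real * y[i].real - x[i].imag * y[i].imag;
        s.imag += x[i].real * y[i].imag + x[i].imag * y[i].real;
    }
    return s;
}

/* Convention B: the result is written through an extra first argument
   and the function itself returns nothing. */
void demo_cdotu_hidden_arg(demo_scomplex *result, int n,
                           const demo_scomplex *x, const demo_scomplex *y)
{
    *result = demo_cdotu_by_value(n, x, y);
}
```

A caller compiled for one convention and linked against a routine built for the other passes mismatched arguments, which is consistent with the crash described above.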
commit 98d467891b74021ace7f248cb0856bec734e39b6
Author: bartoldeman <bartoldemanusers.noreply.github.com>
Date: Mon Jul 11 19:40:53 2022 -0400
Change complex_return='intel' for ifx. (637)
Details:
- When checking the version string of the Fortran compiler for the
purposes of determining a default return convention for complex
domain values, grep for "IFORT" instead of "ifort" since that string
is common to both the 'ifx' and 'ifort' binaries provided by Intel:
$ ifx --version
ifx (IFORT) 2022.1.0 20220316
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
$ ifort --version
ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
commit ffde54cc5c334aca8eff4d6072ba49496bf3104c
Author: jdiamondGitHub <jeff_diamondfastmail.com>
Date: Mon Jul 11 16:47:30 2022 -0500
Minor changes to .gitignore and LICENSE files. (642)
Details:
- Macs create .DS_Store files in every directory visited. Updated
.gitignore file so these files won't be reported as untracked by
'git status'.
- Added Oracle Corporation to the LICENSE file.
- Updated UT copyright on behalf of SHPC.
commit 7cba7ce3dd1533fcc4ca96ac902bdf218686139a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 8 11:15:18 2022 -0500
Minor cleanups, comment updates to bli_gks.c.
Details:
- Removed a redundant registration of 'a64fx' subconfig in
bli_gks_init().
- Reordered registration of 'armsve', 'a64fx', and 'firestorm'
subconfigs. Thanks to Jeff Diamond for his input on this reordering.
- Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h.
commit 667f201b7871da68622027d02bd6b7da3262f8e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 7 16:44:21 2022 -0500
Fixed type bug in bli_cntx_set_ukr_prefs().
Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
num_t value read from va_arg() down to a bool before being stored
within the cntx_t. This bug was introduced on April 6th 2022, in
ae10d94. This caused the ukernel preferences for double real and
double complex to go unchanged while the preferences for single real
and single complex were corrupted by the former datatypes'
preference values. The bug manifested as degraded performance for
subconfigurations that registered column-preferential ukernels. The
reason is that the erroneous preferences trigger unnecessary
transpositions in the operation, which forces the gemm ukernel to
compute on matrices that are not stored according to its preference.
Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
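The class of bug described above, narrowing a datatype id to bool before using it, can be reproduced with a small C sketch. The enum values and all names below are assumptions chosen for illustration and are not taken from the BLIS headers, so the exact slots affected in BLIS may differ from what this toy shows.

```c
#include <stdbool.h>

/* Illustrative datatype ids; assumed values, not taken from BLIS. */
typedef enum {
    DEMO_FLOAT = 0, DEMO_SCOMPLEX = 1, DEMO_DOUBLE = 2, DEMO_DCOMPLEX = 3
} demo_num_t;

/* Buggy pattern: converting the id to bool collapses every nonzero id to 1. */
int demo_slot_buggy(demo_num_t dt) { return (int)(bool)dt; }

/* Fixed pattern: keep the full enum value as the slot index. */
int demo_slot_fixed(demo_num_t dt) { return (int)dt; }

/* Store one preference per datatype through the chosen slot function. */
void demo_set_prefs(int (*slot)(demo_num_t), const bool in[4], bool out[4])
{
    for (int dt = 0; dt < 4; ++dt)
        out[slot((demo_num_t)dt)] = in[dt];
}
```

With the buggy slot function, ids 2 and 3 both land in slot 1, so their own slots are never written and slot 1 ends up holding another datatype's preference.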
commit d429b6bfced21a63bf711224ac402f93f0080b52
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Jun 28 15:34:10 2022 -0500
Support clang targeting MinGW (639)
* Support clang targeting MinGW
* Fix pthread linking
commit d93df023348144e091f7b3e3053995648f348aa7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 15 14:09:49 2022 -0500
Removed unused dt arg in bli_gks_query_ind_cntx().
Details:
- Removed the num_t datatype argument from bli_gks_query_ind_cntx().
This argument stopped being needed by the function in commit e9da642.
Its only use in bli_gks_query_ind_cntx() was to be passed through to
the context initialization function for the chosen induced method,
but even then, commit log notes from e9da642 indicate that I could not
recall why the datatype argument was ever needed by the context init
function to begin with.
- Updated all invocations of bli_gks_query_ind_cntx() to omit the dt
argument. Most of these invocations resided in various standalone test
drivers (and the testsuite).
commit 56772892450cc92b3fbd6a9d0460153a43fc47ab
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 1 10:49:33 2022 -0500
Added SMU citation to README.md intro.
Details:
- Added a citation to SMU and the Matthews Research Group to the general
attribution of maintainership and development in the Introduction of
the README.md file. Thanks to Robert van de Geijn and Devin Matthews
for suggesting this change.
commit 4603324eb090dfceaad3693a70b2d60544036aa8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 19 14:07:03 2022 -0500
Init/finalize via bli_pthread_switch_t API (634).
Details:
- Defined and implemented a new pthread-like abstract datatype and API
in bli_pthread.c. The new type, bli_pthread_switch_t, is similar to
bli_pthread_once_t in some respects. The idea is that like a switch in
your home that controls a light or ceiling fan, it can either be on or
off. The switch starts in the off state. Moving from one state to the
other (on to off; off to on) causes some action (i.e., a startup or
shutdown function) to be executed. Trying to move from one state to
the same state (on to on; off to off) is safe in that it results in
no action. Unlike bli_pthread_once(), the API for bli_pthread_switch_t
contains both _on() and _off() interfaces. Also, unlike the _once()
function, the _on() and _off() functions return error codes so that
the 'int' error code returned from the startup or shutdown functions
may be passed back to the caller. Thanks to Devin Matthews for his
input and feedback on this feature.
- Replaced the previous implementation of bli_init_once() and
bli_finalize_once() -- both of which used bli_pthread_once() -- with
ones that rely upon bli_pthread_switch_on() and _switch_off(),
respectively. This also required updating the return types of
_init_apis() and _finalize_apis() to match the function pointer type
required by bli_pthread_switch_on()/_switch_off().
- Comment updates.
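The on/off semantics described above can be sketched in C as follows. This is a single-threaded illustration only: it has no locking, the demo_ names are hypothetical, and leaving the state unchanged when the startup or shutdown function reports an error is an assumption of the sketch, not a statement about bli_pthread_switch_t.

```c
typedef int (*demo_switch_fn)(void);

/* Zero-initialized means "off". */
typedef struct { int is_on; } demo_switch_t;

/* off -> on runs init() and passes back its error code; on -> on is a no-op. */
int demo_switch_on(demo_switch_t *sw, demo_switch_fn init)
{
    if (sw->is_on) return 0;
    int r = init();
    if (r == 0) sw->is_on = 1;   /* assumed: stay off if init() fails */
    return r;
}

/* on -> off runs fin() and passes back its error code; off -> off is a no-op. */
int demo_switch_off(demo_switch_t *sw, demo_switch_fn fin)
{
    if (!sw->is_on) return 0;
    int r = fin();
    if (r == 0) sw->is_on = 0;   /* assumed: stay on if fin() fails */
    return r;
}

/* Example startup/shutdown functions that count how often they run. */
int demo_startup_calls = 0;
int demo_shutdown_calls = 0;
int demo_startup(void)  { ++demo_startup_calls;  return 0; }
int demo_shutdown(void) { ++demo_shutdown_calls; return 0; }
```

Calling demo_switch_on() twice in a row runs demo_startup() once, and likewise for demo_switch_off() and demo_shutdown().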
commit 64a9b061f6032e2b59613aecdbe7bb52161605c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 10 14:54:22 2022 -0500
Fixed misspelling of 'xpbys' in gemm macrokernel.
Details:
- Fixed a functionally harmless typo in bli_gemm_ker_var2.c where a few
instances of the substring "xpbys" were misspelled as "xbpys". The
misspellings were harmless because they were consistent, and because
they referenced only local symbols.
commit 1c733402a95ab08b20f3332c2397fd52a2627cf6
Author: Jed Brown <jedjedbrown.org>
Date: Thu Apr 28 11:58:44 2022 -0600
Fix version check for znver3, which needs gcc >= 10.3 (628)
Apple's clang-12 lacks znver3 support, unlike upstream clang-12.
commit 6431c9e13b86e4442b6aacba18a0ace12288c955
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 13:01:24 2022 -0500
Added missing 'const' to zen bli_gemm_small.c.
Details:
- Added missing 'const' qualifiers to signatures of functions defined in
kernels/zen/3/bli_gemm_small.c. This fixes compile-time errors when
targeting 'zen3' subconfig (which apparently is enabling AMD's
gemm_small code path by default). Thanks to Devin Matthews for
reporting this error.
commit 9fea633748ed27ef3853bba7cd955690c61092b4
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 13 15:59:06 2022 -0500
Partial addition of 'const' to all interfaces above the (micro)kernels. (625)
Details:
- Added 'const' qualifier to applicable function arguments wherever the
pointed-to object is not internally modified. This change affects
all interfaces that reside above the level of the (micro)kernels.
- Typecast certain function return values to discard 'const' qualifier.
- Removed 'restrict' from various arguments, including cntx_t*,
auxinfo_t*, rntm_t*, thrinfo_t*, mem_t*, and others.
- Removed parts of some APIs, such as bli_cntx_*(), due to limited use.
- Merged some variable declarations with their corresponding
initialization statements.
- Whitespace changes.
commit ae10d9495486f589ed0320f0151b2d195574f1cf (origin/amd)
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 6 20:31:11 2022 -0500
Simplify and rewrite reference packm kernels. (610)
Details:
- Reorganized the way kernels are stored within the cntx_t structure so
that rather than having a function pointer for every supported size of
unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
kernels per datatype: one to pack MRxk micropanels and one to pack
NRxk micropanels.
- NOTE: The "bb" (broadcast B) reference kernels have been merged into
the "standard" kernels (packm [including 1er and unpackm], gemm,
trsm, gemmtrsm). This replication factor is controlled by
BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
replication factor of 1 has been tested. armsve also needs testing
since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
and friends) in bli_kernel_macro_defs.h, but only when the macro
BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
- Still need to update documentation, including:
- docs/ConfigurationHowTo.md
- docs/KernelsHowTo.md
to reflect changes made in this commit.
commit b3e674db3c05ca586b159a71deb1b61d701ae5c9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 4 17:31:02 2022 -0500
README.md update to link to releases page.
commit 69fa915464c52f09a5971a60f521900d31a34e69
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:47:46 2022 -0500
Fixed broken "tagged releases" link in README.md.
commit 88cab8383ca90ddbb4cf13e69b7d44a1663a4425
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:12:06 2022 -0500
CHANGELOG update (0.9.0)