Blis

Latest version: v1.2.0

10.0rc0

Author: Srinivas Yadav <43375352+srinivasyadav18users.noreply.github.com>
Date: Sat Oct 14 02:05:41 2023 -0500

Fixed HPX barrier synchronization (783)

Details:
- Fixed hpx barrier synchronization. HPX was hanging at larger core
counts because blis was using non-hpx synchronization primitives, but
under the hpx runtime only hpx synchronization primitives should be
used. Hence, a C-style wrapper, hpx_barrier_t, is introduced to perform
hpx barrier operations.
- Replaced hpx::for_loop with hpx::futures. Using hpx::for_loop with
hpx::barrier when n_threads exceeds the actual hardware thread count
causes synchronization issues that make hpx hang. This can be avoided
by using hpx::futures, which are relatively lightweight, robust,
and scalable.

commit 8fff1e31da1c87e46cacec112b0ac280ab47cd8b
Author: Field G. Van Zee <fgvanzeegmail.com>
Date: Thu Oct 12 15:51:41 2023 -0500

Fixed bug in sup threshold registration. (782)

Details:
- Fixed a bug that resulted in BLIS non-deterministically calling the
gemmsup handler, irrespective of the thresholds that are registered
via bli_cntx_set_blkszs().
- Deep dive: In bli_cntx_init_ref.c, the default values for the gemmsup
thresholds (BLIS_[MNK]T blocksizes) were being set to zero so that no
operation ever matched the criteria for gemmsup (unless specific sup
thresholds are registered). HOWEVER, these thresholds are set via
bli_cntx_set_blkszs() which calls bli_blksz_copy_if_pos(), which was
only copying the thresholds into the gks' cntx_t if the values were
strictly positive. Thus, the zero values passed into
bli_cntx_set_blkszs() were being ignored and those threshold slots
within the gks were left uninitialized. The upshot of this is that the
reference gemmsup handler was being called for gemm problems
essentially at random (and as it turns out, very rarely the reference
gemmsup implementation would encounter a divide-by-zero error).
- The problem was fixed by changing bli_blksz_copy_if_pos() so that it
copies values that are non-negative (values >= 0 instead of > 0). The
function was also renamed to bli_blksz_copy_if_nonneg().
- Also needed to standardize use of -1 as the sole value to embed into
blksz_t structs as a signal to bli_cntx_set_blkszs() to *not* register
a value for that slot (and instead let whatever existing values
remain). This required updates to the bli_cntx_init_*() functions for
bgq, cortexa9, knc, penryn, power7, and template subconfigs, as some
of these codes were using 0 instead of -1.
- Fixes 781. Thanks to Devin Matthews for identifying, diagnosing, and
proposing a fix for this issue.
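
The difference between the old and new copy semantics can be sketched in isolation. This is only an illustration under assumptions, not the BLIS source: toy_blksz_t and both toy_ functions are hypothetical stand-ins for blksz_t, bli_blksz_copy_if_pos(), and bli_blksz_copy_if_nonneg().

```c
/* Hypothetical stand-in for a single blocksize/threshold slot. */
typedef struct { long v; } toy_blksz_t;

/* Old behavior: only strictly positive values are copied, so a 0
   threshold is silently dropped and dst keeps whatever it held. */
static void toy_copy_if_pos(const toy_blksz_t* src, toy_blksz_t* dst)
{
    if (src->v > 0) dst->v = src->v;
}

/* New behavior: any non-negative value is copied; -1 is the sole
   signal meaning "do not register a value for this slot". */
static void toy_copy_if_nonneg(const toy_blksz_t* src, toy_blksz_t* dst)
{
    if (src->v >= 0) dst->v = src->v;
}
```

With the old function, a destination slot that starts out holding garbage still holds garbage after a 0 is "registered", which matches the random gemmsup dispatch described above.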

commit 1e264a42474b535431768ef925bbd518412d392e
Author: Abhishek Bagusetty <59661409+abagusettyusers.noreply.github.com>
Date: Mon Oct 2 18:29:46 2023 -0500

Update zen3 subconfig to support NVHPC compilers. (779)

Details:
- Parse $(CC_VENDOR) values of "nvc" in 'zen3' make_defs.mk file.
- Minor refactor to accommodate above edit.
- CREDITS file update.

commit c2099ed2519dcac8ee421faf999b36e1c2260be7
Author: Field G. Van Zee <fgvanzeegmail.com>
Date: Mon Oct 2 14:56:48 2023 -0500

Fixed brokenness when sba is disabled. (777)

Details:
- Previously, disabling the sba via --disable-sba-pools resulted in a
segfault due to a sanity-check-triggering abort(). The problem was
that the sba, as currently used in the l3 thread decorators, did not
yet (fully) support pools being disabled. The solution entailed
creating a wrapper function, bli_sba_array_elem(), which either calls
bli_apool_array_elem() (when sba pools are enabled at configure time)
or returns a NULL sba_pool pointer (when sba pools are disabled), and
calling bli_sba_array_elem() in place of bli_apool_array_elem(). Note
that the NULL pointer returned by bli_sba_array_elem() when the sba
pools are disabled does no harm since in that situation the pointer
goes unreferenced when acquiring and releasing small blocks. Thanks to
John Mather for reporting this bug.
- Guarded the bodies of bli_sba_init() and bli_sba_finalize() with
#ifdef BLIS_ENABLE_SBA_POOLS. I don't think this was actually necessary
to fix the aforementioned bug, but it seems like good practice.
- Moved the code in bli_l3_thrinfo_create() that checked that the array*
pointer is non-NULL before calling bli_sba_array_elem() (previously
bli_apool_array_elem()) into the definition of bli_sba_array_elem().
- Renamed various instances of 'pool' variables and function parameters
to 'sba_pool' to emphasize what kind of pool it represents.
- Whitespace changes.
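
A rough sketch of the wrapper pattern described above, assuming a compile-time macro selects between the two behaviors. All names here (toy_pool_t, toy_array_t, toy_sba_array_elem, TOY_ENABLE_SBA_POOLS) are hypothetical and do not reflect the real BLIS signatures.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the pool and the array of pools. */
typedef struct { int id; } toy_pool_t;
typedef struct { toy_pool_t* elems; size_t n; } toy_array_t;

/* Define TOY_ENABLE_SBA_POOLS to mimic a build with sba pools enabled. */
static toy_pool_t* toy_sba_array_elem(size_t i, toy_array_t* array)
{
#ifdef TOY_ENABLE_SBA_POOLS
    /* The non-NULL check on the array lives inside the wrapper. */
    if (array == NULL) return NULL;
    return &array->elems[i];
#else
    /* Pools disabled: return NULL, which callers never dereference. */
    (void)i; (void)array;
    return NULL;
#endif
}
```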

commit 37ca4fd168525a71937d16aaf6a13c0de5b4daef
Author: Field G. Van Zee <fgvanzeegmail.com>
Date: Thu Sep 28 16:37:57 2023 -0500

Implemented [cz]symv_(), [cz]syr_(), [cz]rot_(). (778)

Details:
- Expanded existing BLAS compatibility APIs to provide interfaces to
[cz]symv_(), [cz]syr_(). This was easy since those operations were
already implemented natively in BLIS; the APIs were previously
omitted only because they were not formally part of the BLAS.
- Implemented [cz]rot_() by feeding code from LAPACK 3.11 through
f2c.
- Thanks to James Foster for pointing out that LAPACK contains these
additional symbols, which prompted these additions, as well as for
testing the [cz]rot_() functions from Julia's test infrastructure.
- CREDITS file update.

commit 6f412204004666abac266409a203cb635efbabf3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 26 18:00:54 2023 -0500

Added 'altra', 'altramax' subconfigs. (775)

Details:
- Forward-ported 'altra' and 'altramax' subconfigurations from the
older 'stable' branch lineage [1]. These subconfigs primarily target
the Ampere Altra and AltraMax (ARM) processors. They also contain
"QuickStart" directories with information and scripts to help
use BLIS on these microarchitectures. Thanks to Jeff Diamond and
Leick Robinson for developing these subconfigs and resources.
- Updated kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c according to
changes in the 'stable' lineage, mostly related to re-enabling of
assembly code branches that target general stride IO.

[1] Note that the 'stable' branch is being used to make sure that more
recent commits do not introduce unreasonable performance
regressions. As such, the name should be interpreted as shorthand
for "performance stable," not "API stable."

commit a4a63295b96ed5b32f4df6477d24db07bf431202
Author: Srinivas Yadav <43375352+srinivasyadav18users.noreply.github.com>
Date: Tue Sep 26 17:58:38 2023 -0500

Fixes to HPX runtime code path. (773)

Details:
- Fixed the hpx::for_each invocation and replaced it with hpx::for_loop.
The HPX runtime was initialized using hpx::start, but the hpx::for_each
function was being called on a non-hpx runtime (i.e., the standard BLIS
runtime with a single main thread). To run hpx::for_each on the HPX
runtime correctly, the code now uses hpx::run_as_hpx_thread(func, args...).
- Replaced hpx::for_each with hpx::for_loop, which eliminates use of
hpx::util::counting_iterator.
- Employ hpx::execution::chunk_size(1) to make sure that a thread
resides on a particular core.
- Replaced hpx::apply() with updated version hpx::post().
- Initialize tdata->id to 0 in libblis.c, as it is the main thread
and is needed for writing results to the output file.
- By default, if not specified, the HPX runtime uses all N threads/cores
available in the system. But, if we want to use only n_threads out of
the N threads, we use hpx::execution::experimental::num_cores(n_threads).

commit c6546c1131b1ddd45ef13f9f2b620ce2e955dbf8
Author: John Mather <54645798+jmather-sesiusers.noreply.github.com>
Date: Wed Sep 20 13:41:07 2023 -0400

Fixed broken link in Multithreading.md. (774)

Details:
- Replaced 404'd link in docs/Multithreading.md with an archive from
The Wayback Machine.
- CREDITS file update.

commit 6dcf7666eff14348e82fbc2750be4b199321e1b9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 27 14:18:57 2023 -0500

Revamped bli_init() to use TLS where feasible. (767)

Details:
- Revamped bli_init_apis() and bli_finalize_apis() to use separate
bli_pthread_switch_t objects for each of the five sub-API init
functions, with the objects for the 'ind' and 'rntm' sub-APIs being
declared with BLIS_THREAD_LOCAL. This allows some APIs to be treated
as thread-local and the rest as thread-shared. Thanks to Edward Smyth
for requesting application thread-specific rntm_t structs, which
inspired these changes.
- Combined bli_thread_init_from_env() and bli_pack_init_from_env() into
a new function, bli_rntm_init_rntm_from_env(), and placed the combined
code in bli_rntm.c inside of a new bli_rntm_init() function. Then
removed the (now empty) bli_pack_init() and _finalize() function defs.
- Deprecated bli_rntm_init() for the purposes of initializing a rntm_t
(temporarily preserving it as bli_rntm_clear() in a cpp-undefined code
block) so that the function name could be used for the aforementioned
bli_rntm_init() function.
- Updated libblis_test_pobj_create() in test_libblis.c to use a static
rntm_t initializer instead of the deprecated bli_rntm_init()
function-based option.
- Minor updates to docs/Multithreading.md, including removal of
bli_rntm_init() in the example of how to initialize rntm_t structs.
- Changed the return value of bli_gks_init(), bli_ind_init(),
bli_memsys_init(), bli_thread_init(), and bli_rntm_init() (and their
finalize() counterparts) from 'void' to 'int' so that those functions
match the function type expected by bli_pthread_switch_on()/_off().
Those init/finalize functions now return 0 to indicate success, which
is needed so that the switch actually changes state from off to on
and vice versa.
- Defined bli_thread_reset(), which copies the contents of the
global_rntm_at_init() struct into the global_rntm struct (for the
current application thread).
- Guard calls to bli_pthread_mutex_lock()/_unlock() in
- bli_pack_set_pack_a() and _pack_b()
- bli_rntm_init_from_global()
- bli_thread_set_ways()
- bli_thread_set_num_threads()
- bli_thread_set_thread_impl()
- bli_thread_reset()
- bli_l3_ind_oper_set_enable()
with #ifdef BLIS_DISABLE_TLS (since TLS precludes the possibility of
race conditions).
- In frame/base/bli_rntm.c, declare global_rntm, global_rntm_at_init,
and global_rntm_mutex as BLIS_THREAD_LOCAL so that separate
application threads can change the number of ways of BLIS parallelism
independently from one another.
- Access global_rntm only via a new private (not exported) function,
bli_global_rntm(). Defined a similar function for a rntm_t new to
this commit, global_rntm_at_init, which preserves the state of the
global rntm at initialization-time.
- In frame/3/bli_l3_ind.c, added a guard to the declaration of the
static variable oper_st_mutex with #ifdef BLIS_DISABLE_TLS so that the
mutex is omitted altogether when TLS is enabled (which prevents the
compiler from warning about an unused variable).
- Removed redundant code from bli_thread.c:
#ifdef BLIS_ENABLE_HPX
#include "bli_thread_hpx.h"
#endif
since this code is already present in bli_thread.h.
- Thanks to Minh Quan Ho for his review of and feedback on this commit.
- Comment updates.
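
Why the init functions had to return int can be seen in a small sketch of an on/off switch. This is a hypothetical simplification: toy_switch_t, toy_switch_on(), and toy_rntm_init() are illustrative names, and any locking performed by the real bli_pthread_switch_t is omitted.

```c
/* Hypothetical simplified switch: 0 = off, 1 = on. */
typedef struct { int status; } toy_switch_t;

static int toy_init_calls = 0;

/* Init functions return int to match the callback type; 0 means success. */
static int toy_rntm_init(void) { toy_init_calls++; return 0; }

/* Runs init only while the switch is off, and turns the switch on
   only if init reports success. */
static int toy_switch_on(toy_switch_t* sw, int (*init)(void))
{
    if (sw->status != 0) return 0;
    int r = init();
    if (r == 0) sw->status = 1;
    return r;
}
```

Declaring such a switch object thread-local would give each application thread its own on/off state, which is the distinction the commit draws between the 'ind'/'rntm' sub-APIs and the rest.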

commit fa6a9b24ae2ddbd5f30f657d46004843581c768c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 19 12:44:34 2023 -0500

Fixed error when using common.mk from testsuite. (768)

Details:
- Commit 2db31e0 (755) inserted logic into common.mk that attempts to
preprocess build/detect/android/bionic.h to determine whether the
__BIONIC__ macro is defined (in which case -lrt should not be included
in LDFLAGS). However, the path to bionic.h was encoded without regard
to DIST_PATH, and so utilizing common.mk anywhere that isn't the top-
level directory (such as in the testsuite directory) resulted in a
compiler error:

gcc: error: build/detect/android/bionic.h: No such file or directory
gcc: fatal error: no input files
compilation terminated.

This commit adds a $(DIST_PATH) prefix to the path to bionic.h so that
it can be located from other applications' Makefiles that use BLIS's
makefile fragments.

commit 634e532c8dcce7383d96ba33276df65c656b2198
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 9 21:54:49 2023 -0500

Set thrcomm timpl_t id inside init functions. (766)

Details:
- Previously, the timpl_t id being used when a thrcomm_t is being
initialized was set within the bli_thrcomm_init() dispatch function
after the timpl_t-specific bli_thrcomm_init_*() function returned. But
it just occurred to me that each bli_thrcomm_init_*() function already
intrinsically knows its own timpl_t value. This commit shifts the
setting of the thrcomm_t.ti field into the corresponding
bli_thrcomm_init_*() function for each timpl_t type (e.g. single,
openmp, pthreads, hpx).
- Removed long-deprecated code dating back nearly 10 years.
- Whitespace changes.
- Comment updates.
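
The restructuring can be pictured roughly as follows. The types and functions are hypothetical stand-ins (only two implementations are shown), not the actual thrcomm_t code.

```c
/* Hypothetical stand-ins for timpl_t and thrcomm_t. */
typedef enum { TOY_SINGLE, TOY_PTHREADS } toy_timpl_t;
typedef struct { toy_timpl_t ti; int n_threads; } toy_thrcomm_t;

/* Each implementation-specific init function records its own id. */
static void toy_thrcomm_init_single(int nt, toy_thrcomm_t* comm)
{
    comm->n_threads = nt;
    comm->ti = TOY_SINGLE;
}

static void toy_thrcomm_init_pthreads(int nt, toy_thrcomm_t* comm)
{
    comm->n_threads = nt;
    comm->ti = TOY_PTHREADS;
}

/* The dispatcher only selects the init function; it does not set comm->ti. */
static void toy_thrcomm_init(toy_timpl_t ti, int nt, toy_thrcomm_t* comm)
{
    if (ti == TOY_PTHREADS) toy_thrcomm_init_pthreads(nt, comm);
    else                    toy_thrcomm_init_single(nt, comm);
}
```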

commit 3cf17b4a91232709bc6a205b0e4d7ecc96579aa9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 7 13:46:20 2023 -0500

Small fixes/improvements to docs/Multithreading.md. (764)

Details:
- Added reminders that #include "blis.h" must be added to source files
in order to access BLIS API function prototypes. Thanks to Barry Smith
for suggesting this improvement.
- Fixed pre-existing typos.
- CREDITS file update.

commit dbc79812c390f812c7bf030bfcf87e947a1443c4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 28 18:16:38 2023 -0500

CREDITS file update.

Details:
- Thanks to Igor Zhuravlov for PR 753 (commit 915daaa).

commit 915daaa43cd189c86d93d72cd249714f126e9425
Author: Igor Zhuravlov <zhuravlov.ipya.ru>
Date: Thu Jul 27 20:33:59 2023 +0000

Fix typos in docs + example code comments. (753)

Details:
- Fixed various typos in API documentation in docs/BLIS*API.md and
comments in the source code examples within examples/?api/*.c.

commit 2db31e057e7e9c97fc60021b5ae72a01a48d7588
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Jul 27 15:27:21 2023 -0500

Exclude -lrt on Android with Bionic libraries. (755)

Details:
- Added build/detect/android/bionic.h header to test whether the
__BIONIC__ cpp macro is defined.
- In common.mk, only add -lrt to LDFLAGS when Bionic is not present.
- CREDITS file update.

commit 22ad8c1b752364784f320168b31995945ad84a59
Author: ct-clmsn <ct.clmsngmail.com>
Date: Thu Jul 27 16:23:29 2023 -0400

Small fixes to support hpx in the testsuite (759)

Details:
- Minor changes to test_libblis.c to support hpx.

commit c91b41d022e33da82b3b06c82be047a29873d9b6
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Wed Jul 26 14:37:08 2023 -0500

Auto-detect the RISC-V ABI of the compiler and use -mabi= during RISC-V Builds (750)

Details:
- Generate a build error if there is a 32/64-bit mismatch between the
RISC-V ABI or architecture and the BLIS configuration selected.
- Handle Q, Zicsr, ZiFencei, Zba, Zbb, Zbc, Zbs and Zfh extensions in
the RISC-V architecture auto-detection. ZiFencei and Zicsr are not
detectable with built-in RISC-V macros right now.
- ZiFencei is not important for BLIS because it doesn't have
Just-In-Time compilation or self-modifying code, and Zicsr is implied
by the floating-point extensions, which are required for good
performance in BLIS.
- Move RISC-V autodetect header files to build/detect/riscv/.

commit a0b04e3c007f1207e5678bf20c07752906742fb7 (origin/aocl-blas, aocl-blas)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 26 17:59:21 2023 -0500

Rewrote regen-symbols.sh (gen-libblis-symbols.sh). (751)

Details:
- Wrote an alternative to regen-symbols.sh, gen-libblis-symbols.sh,
that generates a list of exported symbols from the monolithic blis.h
file rather than peeking inside of the shared object via nm. (This new
script lives in the 'build' directory and the older script has been
retired to build/old.) Special thanks to Devin Matthews for authoring
gen-libblis-symbols.sh.
- Added a 'symbols' target to the top-level Makefile which will refresh
build/libblis-symbols.def, with supporting changes to common.mk.
- Updates to build/libblis-symbols.def using the new symbol-generating
script.

commit 6b894c30b9bb2c2518848d74e4c8d96844f77f24
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 12 17:22:44 2023 -0500

Rewrote/fixed broken tree barrier implementation.

Details:
- Rewrote the definition of bli_thrcomm_tree_barrier() so that it (a)
actually worked again, and (b) used atomics instead of a basic C99
spin loop. (Note that the conventional barrier implementation is
still enabled by default; the tree barrier must be toggled on
manually within the configuration.)
- Added an early return to the definition of bli_thrcomm_barrier() in
the cases where comm == NULL or comm->n_threads == 1.
- Reordered thread-related and thread-dependent header include
directives in blis.h so that the BLIS_TREE_BARRIER and
BLIS_TREE_BARRIER_ARITY macros, which would be defined in the target
configuration's bli_family_*.h file, would be included prior
to the inclusion of the thrcomm_t header that uses them.
- Changed the type of barrier_t.count from 'int' to 'dim_t'.
- Changed the type of barrier_t.signal from 'volatile int' to 'gint_t'.
- Special thanks to Leick Robinson for contributing these changes.
- Whitespace changes.
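
For readers unfamiliar with atomics-based barriers, here is a minimal sketch using C11 <stdatomic.h>. It is a flat sense-reversing barrier, not the tree barrier from this commit, and toy_comm_t and toy_barrier() are hypothetical names; only the early-return condition is taken from the description above.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical communicator for a flat sense-reversing barrier. */
typedef struct {
    int        n_threads;
    atomic_int arrived;   /* threads that have reached the barrier */
    atomic_int sense;     /* flips each time the barrier releases  */
} toy_comm_t;

static void toy_barrier(toy_comm_t* comm)
{
    /* Early return when there is nothing to synchronize. */
    if (comm == NULL || comm->n_threads == 1) return;

    int my_sense = atomic_load(&comm->sense);

    if (atomic_fetch_add(&comm->arrived, 1) == comm->n_threads - 1) {
        /* Last arrival: reset the count, then release the waiters. */
        atomic_store(&comm->arrived, 0);
        atomic_fetch_xor(&comm->sense, 1);
    } else {
        /* Spin on an atomic load until the last arrival flips the sense. */
        while (atomic_load(&comm->sense) == my_sense) { }
    }
}
```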

commit d639554894b6252a86bd3164921bce6fbb9e3b5e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 7 16:11:14 2023 -0500

Pad thrcomm_t fields to avoid false sharing.

Details:
- Inserted a cache line of padding between various fields of the
thrcomm_t and, in the case of the (presently defunct) tree barrier,
fields of the barrier_t. This additional padding ensures that these
fields, which both serve different purposes when performing a thread
barrier, are only accessed when needed (and not just due to their
spatial locality with their cache line neighbors).
- Added a new cpp macro constant, BLIS_CACHE_LINE_SIZE, to
bli_config_macro_defs. This new constant defines the size of a cache
line (in bytes) and defaults to 64.
- Special thanks to Leick Robinson for discovering this false sharing
issue and developing/submitting the patch.
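
The padding idea can be illustrated with a small struct. The field names and toy_padded_comm_t are hypothetical; only the 64-byte default comes from the description above.

```c
#include <stddef.h>

/* Mirrors the stated default of BLIS_CACHE_LINE_SIZE (bytes). */
#define TOY_CACHE_LINE_SIZE 64

/* Hypothetical layout: a full cache line of padding separates fields
   that are touched at different phases of a barrier, so they cannot
   share a cache line. */
typedef struct {
    void* sent_object;
    char  pad0[TOY_CACHE_LINE_SIZE];
    int   barrier_sense;
    char  pad1[TOY_CACHE_LINE_SIZE];
    int   barrier_threads_arrived;
} toy_padded_comm_t;
```

The cost is a larger struct; the benefit is that a write to one field does not invalidate the cache line holding its neighbor on another core.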

commit 89b7863fc9a88903917deedc6a5ad9fd17f83713
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon May 8 16:51:18 2023 -0500

Fix 1m enablement for herk/her2k/syrk/syr2k. (743)

Details:
- Ever since 28b0982, herk, her2k, syrk, and syr2k have been implemented
in terms of the gemmt expert API. And since the decision of which
induced method to use (1m or native) is made *below* the level of the
expert API, executing any of {herk,her2k,syrk,syr2k} results in BLIS
checking the enablement status for gemmt.
- This commit applies a band-aid of sorts to this issue by modifying
bli_l3_ind_oper_get_enable() and bli_l3_ind_oper_set_enable() so that
any attempts to query or modify the internal enablement status for
herk, her2k, syrk, or syr2k instead does so for gemmt.
- This solution isn't perfect since, in theory, the user could enable 1m
for, say, herk but then disable it for syrk, and then be confused when
herk runs via native execution. But we don't anticipate that users
will modify 1m enablement at the operation level, and so in practice this
solution is likely fine for now.
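
A sketch of the redirection, including the caveat mentioned above. The enum, table, and function names are hypothetical stand-ins for the internals behind bli_l3_ind_oper_get_enable() and bli_l3_ind_oper_set_enable().

```c
#include <stdbool.h>

/* Hypothetical operation ids. */
typedef enum {
    TOY_GEMM, TOY_GEMMT, TOY_HERK, TOY_HER2K, TOY_SYRK, TOY_SYR2K,
    TOY_NUM_OPS
} toy_opid_t;

/* Hypothetical per-operation enablement table (all false initially). */
static bool toy_enabled[TOY_NUM_OPS];

/* Queries and updates for herk/her2k/syrk/syr2k are rerouted to gemmt. */
static toy_opid_t toy_redirect(toy_opid_t op)
{
    switch (op) {
    case TOY_HERK: case TOY_HER2K: case TOY_SYRK: case TOY_SYR2K:
        return TOY_GEMMT;
    default:
        return op;
    }
}

static void toy_set_enable(toy_opid_t op, bool status)
{
    toy_enabled[toy_redirect(op)] = status;
}

static bool toy_get_enable(toy_opid_t op)
{
    return toy_enabled[toy_redirect(op)];
}
```

Because all four operations share the gemmt slot, enabling herk and then disabling syrk leaves herk disabled as well.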

commit 138de3b3e88c5bf7d8718c45c88811771cf42db8
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Sun May 7 13:01:38 2023 -0700

add nvhpc compiler support (719)

Add detection of the NVIDIA nvhpc compiler (`nvc`) in `configure`, and adjust some warning options in `config.mk`. Currently, no specific options for `nvc` have been added in the relevant configurations so it may not be usable without further tweaks.

commit 0873c0f6ed03fea321d1631b3d1a385a306aa797
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 7 14:03:19 2023 -0500

Consolidate INSERT_ macro sets via variadic macros. (744)

Details:
- Consolidated INSERT_GENTFUNC_* (and corresponding GENTPROT) macro sets
using variadic macros (__VA_ARGS__), which means we no longer need a
different INSERT_ macro for each possible number of arguments the
macro might take. This change seems reasonable given that variadic
macros are a standard C99 feature and widely supported. I took care
not to use variadic macros where 0 variadic arguments are expected
since that is a non-standard extension.
- Added pre-typecast parentheses to arithmetic expressions in printf()
statements in bli_thread_range_tlb.c.
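
The consolidation can be sketched with a pair of toy macros. TOY_GENTFUNC and TOY_INSERT_GENTFUNC_REAL are hypothetical and far simpler than the real INSERT_GENTFUNC_* family; the point is only that one variadic inserter can forward any nonzero number of extra arguments.

```c
/* Hypothetical generator: defines <ch><opname>(x), which adds 'inc'
   to its argument in the given C type. */
#define TOY_GENTFUNC(ctype, ch, opname, inc) \
    static ctype ch##opname(ctype x) { return x + (ctype)(inc); }

/* One variadic inserter replaces a family of fixed-arity inserters.
   At least one variadic argument is always passed, since zero
   variadic arguments is a non-standard extension in C99. */
#define TOY_INSERT_GENTFUNC_REAL(macro, ...) \
    macro(float,  s, __VA_ARGS__) \
    macro(double, d, __VA_ARGS__)

/* Expands to definitions of saddone(float) and daddone(double). */
TOY_INSERT_GENTFUNC_REAL(TOY_GENTFUNC, addone, 1)
```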

commit ef9d3e6675320a53e7cb477c16b01388e708b1da
Author: h-vetinari <h.vetinarigmx.com>
Date: Sun May 7 04:59:35 2023 +1100

Added missing include <io.h> for Windows. (747)

Details:
- This commit fixes issue 746, in which the _access() function (called
from within blastest/f2c/open.c) is undeclared when compiling on
Windows with clang 16.

commit 6fd9aabb03d172a792a7eeb106c7d965cf038421
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri May 5 14:22:52 2023 -0500

Fix bug in detecting Fortran compiler vendor (745)

`FC` was used instead of `found_fc`.

commit 8215b02f99aa77ecc7d813508c247565115319d7
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Wed Apr 12 12:59:27 2023 -0500

Apply 738 to make_defs.mk of RISC-V subconfigs. (740)

Details:
- PR 738 -- which moved -fPIC flag insertion responsibilities from
common.mk to the subconfigs' individual make_defs.mk files -- was
merged shortly before the introduction of new RISC-V subconfigs in
693. This commit brings those RISC-V subconfigs up to date with the
new -fPIC conventions.

commit 6b38c5ac07a2a27738674784e58aa699bf895447
Author: angsch <17718454+angschusers.noreply.github.com>
Date: Tue Apr 11 19:27:43 2023 +0200

Add RISC-V target (693)

Details:
- There are four RISC-V base configurations: 'rv32i', 'rv32iv', 'rv64i',
and 'rv64iv', namely the 32-bit and 64-bit implementations with and
without the 'V' vector extension. Additional extensions such as 'M'
(multiplication), 'A' (atomics), 'F' ('float' hardware support), 'D'
('double' hardware support), and 'C' (compressed-length instructions),
are automatically used when available. If they are not available, then
software equivalents (e.g., softfloat and -latomic) are used.
- './configure auto' can be invoked on a RISC-V build platform, and will
automatically detect RISC-V CPU extensions through the RISC-V C API:
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/master/riscv-c-api.md
- The assembly kernels assume the presence of the vector extension
RVV 1.0.
- It is possible to build 'rv[32,64]iv' for any value of VLEN.
However, if VLEN < 128, the targets will fall back to the generic
kernels and blocksizes.
- The vector microkernels are vector-length agnostic and work with
every VLEN >=128, but are expected to work best with smaller vector
lengths, i.e., VLEN <= 512.
- The assembly kernels cover column major storage (rs_c == 1).
- The blocksizes aim at being a good generic choice for out-of-order
cores. They are not tuned to a specific RISC-V HPC core.
- The vector kernels have been tested using vlen={128,256,512}.
- The single- and double-precision assembly code routines for 'sgemm'
and 'dgemm', or for 'cgemm' and 'zgemm', are combined in their RISC-V
vector assembly source code, and are differentiated only with macros.
- The XLEN=32 and XLEN=64 versions of the RISC-V assembly code are
identical, except that callee-saved registers are saved and restored
differently. There are RISC-V assembly code include files for
handling the saving and restoring of callee-saved registers, and they
are future-proof if ever XLEN=128.
- Multiplications, such as computing array strides and offsets, are
performed in C, and later passed to the RISC-V assembly kernels. This
is so that the compiler can determine whether the 'M' (multiply)
extension is available and use multiplication instructions, or call
library helper functions instead.
- A new macro called bli_static_assert() has been added to perform
static assertions at compile-time, regardless of the C/C++ dialect of
the compiler. The original motivation of this was to ensure that
calling RISC-V assembly kernels would not silently truncate arguments
of type 'dim_t' or 'inc_t' (so-called "narrowing conversions").
- RISC-V CI tests have been added to Travis CI, using the
riscv-gnu-toolchain cross-compiler, and qemu simulator.
- Thanks to Lee Killough for collaborating on this commit.
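
One dialect-independent way to build a compile-time assertion is the negative-array-size trick, sketched below. This is an assumption about how such a macro could be written, not the definition of bli_static_assert(); toy_static_assert and toy_dim_t are hypothetical names.

```c
/* Two-level concatenation so that __LINE__ is expanded before pasting. */
#define TOY_CONCAT_(a, b) a##b
#define TOY_CONCAT(a, b)  TOY_CONCAT_(a, b)

/* If cond is false the array size is -1, which is a compile error in
   any C or C++ dialect; if true, this is just a harmless typedef. */
#define toy_static_assert(cond) \
    typedef char TOY_CONCAT(toy_static_assert_line_, __LINE__)[(cond) ? 1 : -1]

/* Hypothetical dimension type, checked against the widest type an
   assembly kernel is assumed to accept, guarding against narrowing. */
typedef long toy_dim_t;
toy_static_assert(sizeof(toy_dim_t) <= sizeof(long long));
```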

commit 593d01761910af6a9a16ee0ac097142732f73c29
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 8 16:44:16 2023 -0500

CREDITS file update.

commit 259f68479671bbaf9c5986759aaa0004f9b05a24
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 7 16:11:34 2023 -0500

CREDITS file update.

Details:
- Added attributions associated with commits:
- 98d4678 9b1beec: bartoldeman
- 2b05948 059f151: ct-clmsn
- Reordered attribution for decandia50.

commit aea8e1d9243631635ca788d5e14f0f29328e637d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 3 12:17:51 2023 -0500

Optionally disable thread-local storage. (735)

Details:
- Implemented a new configure option, --disable-tls, which allows the
user to optionally disable the use of thread-local storage qualifiers
on static variables in BLIS. This option will rarely be needed, but
in some situations may allow BLIS to compile when TLS is unavailable.
Thanks to Nick Knight for suggesting this option.
- Unlike the --disable-system option, --disable-tls does not forcibly
disable threading. Instead, warnings of the possible consequences of
using threading with TLS disabled are added to:
- the output of './configure --help';
- the output of 'configure' when the --disable-tls option is parsed;
- the informational header output by the testsuite.
Thanks to Minh Quan Ho for suggesting these warnings.
- Modified frame/include/bli_lang_defs.h so that BLIS_THREAD_LOCAL is
defined to nothing when BLIS_ENABLE_TLS is not defined.
- Defined bli_info_get_enable_tls(), which returns whether the cpp macro
BLIS_ENABLE_TLS was defined.
- Edited --disable-system configure status output for clarity.
- Whitespace updates.
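
The effect of defining the qualifier macro to nothing can be sketched as below. The TOY_ names are hypothetical stand-ins for BLIS_ENABLE_TLS, BLIS_THREAD_LOCAL, and bli_info_get_enable_tls(), and the use of C11 _Thread_local is an assumption made for this sketch.

```c
/* Define TOY_ENABLE_TLS to mimic a build with TLS enabled. */
#ifdef TOY_ENABLE_TLS
  #define TOY_THREAD_LOCAL _Thread_local   /* C11 storage-class specifier */
#else
  #define TOY_THREAD_LOCAL                  /* expands to nothing */
#endif

/* With TLS enabled, each thread gets its own copy of this variable;
   with TLS disabled, all threads share a single copy. */
static TOY_THREAD_LOCAL int toy_num_threads = 1;

/* Reports whether the enabling macro was defined at compile time. */
static int toy_info_get_enable_tls(void)
{
#ifdef TOY_ENABLE_TLS
    return 1;
#else
    return 0;
#endif
}
```

The shared-copy case is why threading with TLS disabled carries the warnings listed above.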

commit 3f1432abe75cc306ef90a04381d7e0d8739fded8
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Mon Apr 3 12:10:59 2023 -0500

Add output.testsuite to .gitignore (736)

Details:
- Added `output.testsuite` to .gitignore since it was previously not
being matched by `output.testsuite.*`.

commit 38fc5237520a2f20914a9de8bb14d5999009b3fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 30 17:30:07 2023 -0500

Added mm_algorithm pdf files (bp and pb).

Details:
- Added PDF versions of the PowerPoint files added in 17cd260.

commit 17cd260cb504b2f3997c32daec77f4c828fbb32b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 29 21:47:12 2023 -0500

Added mm_algorithm pptx files (bp and pb).

Details:
- Added two PowerPoint files that contain slides depicting the classic
Goto algorithm for matrix multiplication as well as its sister
"panel-block" algorithm. These files reside in docs/diagrams.

commit 9d778e0f7c94d8752dd578101e4fc6893a1f54ef
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 29 17:36:49 2023 -0500

Move -fPIC insertion to subconfigs' make_defs.mk. (738)

* Move -fPIC insertion to subconfigs' make_defs.mk.

Details:
- Previously, common.mk was appending -fPIC to the CPICFLAGS variables
set within the various subconfigurations' make_defs.mk files. This
seemed somewhat unintuitive, and so now the -fPIC flag is assigned to
the various subconfigs' CPICFLAGS variables in the respective
make_defs.mk files.
- This commit also changes the logic in common.mk so that instead of
appending, the variable is overwritten, but now *only* in the case
of Windows (since apparently -fPIC needs to be omitted there). Thanks
to Nick Knight for catching and reporting this weirdness.

commit 04090df01175477394d1e73af2e5769751d47cd6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 27 14:13:10 2023 -0500

Fixed compile errors with `BLIS_DISABLE_BLAS_DEFS`. (730)

* Fixed compile errors with BLIS_DISABLE_BLAS_DEFS.

Details:
- This commit fixes a compile-time error related to the type definition
(prototype) of dsdot_() when BLIS_DISABLE_BLAS_DEFS is defined by the
application (or the configuration), which is actually a symptom of a
larger design issue when disabling BLAS prototypes. The macro was
intended to allow applications to bring their own BLAS prototypes and
suppress the inclusion of duplicate (or possibly conflicting)
prototypes within blis.h. However, prototypes are still needed during
compilation even if they are ultimately omitted from blis.h. The
problem is that almost every source file in BLIS--including the BLAS
compatibility layer--only includes one header (blis.h), and if we
were to include a new header in the BLAS source files (to isolate
only the BLAS prototypes), we would also have to make the build system
aware of the location of those headers. Thanks to Edward Smyth of AMD
for reporting this issue.
- The solution I settled upon was to remove all cpp guards from all BLAS
headers (by changing them to #if 1, for easy search-and-replace
anchoring in the future if we ever need to re-insert guards) and
modifying bli_blas.h so that the BLAS prototypes are included if
either (a) BLIS_ENABLE_BLAS_DEFS is defined, or (b)
BLIS_ENABLE_BLAS_DEFS is *not* defined but BLIS_IS_BUILDING_LIBRARY
*is* defined. (Thanks to Devin Matthews for steering me away from an
inferior solution.)
- This commit also spins off the actual BLAS prototypes/definitions to
a separate file, bli_blas_defs.h.
- CREDITS file update.
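
Conditions (a) and (b) above reduce to a single preprocessor test, sketched here with hypothetical TOY_ macro names and a placeholder prototype rather than the real contents of bli_blas.h.

```c
/* (a) defs enabled, or (b) defs not enabled but the library itself is
   being built; together: either macro being defined is sufficient. */
#if defined(TOY_ENABLE_BLAS_DEFS) || defined(TOY_IS_BUILDING_LIBRARY)
  #define TOY_BLAS_PROTOTYPES_VISIBLE 1
  /* Placeholder prototype standing in for the BLAS declarations. */
  double toy_dsdot_(const int* n, const float* x, const int* incx,
                    const float* y, const int* incy);
#else
  #define TOY_BLAS_PROTOTYPES_VISIBLE 0
#endif
```

An application that disables the definitions and is not building the library therefore sees no prototypes, while the library's own source files always do.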

commit 5f841307f668f65b7ed5a479bd8374d2581208cf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 24 20:05:13 2023 -0500

Omit -fPIC if shared library build is disabled. (732)

Details:
- Updated common.mk so that when --disable-shared option is given to
configure:
1. The -fPIC compiler flag is omitted from the individual
configuration family members' CPICFLAGS variables (which are
initialized in each subconfig's make_defs.mk file); and
2. The BUILD_SYMFLAGS variable, which contains compiler flags needed
to control the symbol export behavior, is left blank.
- The net result of these changes is that flags specific to shared
library builds are only used when a shared library is actually
scheduled to be built. Thanks to Nick Knight for reporting this issue.
- CREDITS file update.

commit 72c37eb80f964b7840377076e5009aec5b29d320 (origin/riscv)
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Mar 23 16:01:55 2023 -0500

Updated configure to pass all shellcheck checks. (729)

Details:
- Modified configure so that it passes all 'shellcheck' checks,
disabling ones which we violate but which are just stylistic, or are
special cases in our code.
- Miscellaneous other minor changes, such as rearranged redirections in
long sed/perl pipes to look more natural.
- Whitespace tweaks.

commit 60f36347c16e6336215cd52b4e5f3c0f96e7c253
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 22 20:37:30 2023 -0600

Fixed bugs in scal2v ref kernel when alpha == 1. (728)

Details:
- Fixed a typo bug in ref_kernels/1/bli_scal2v_ref.c where the
conditional that was supposed to be checking for cases when alpha is
equal to 1.0 (so that copyv could be used instead of scal2v) was
instead erroneously comparing alpha against 0.0.
- Fixed another bug in the same function whereby BLIS_NO_CONJUGATE was
erroneously being passed into copyv instead of the kernel's conjx
parameter. This second bug was inert, however, due to the first bug
since the "alpha == 0.0" case was already being handled, resulting in
the code block never executing.
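
The corrected control flow can be sketched for the real double-precision case. toy_scal2v() is hypothetical, omits conjugation, and returns a code identifying the path taken purely for illustration.

```c
#include <stddef.h>

/* y := x */
static void toy_copyv(size_t n, const double* x, double* y)
{
    for (size_t i = 0; i < n; i++) y[i] = x[i];
}

/* y := alpha * x. Returns 0 for the zero-fill path, 1 for the copy
   shortcut, and 2 for the general path. */
static int toy_scal2v(size_t n, double alpha, const double* x, double* y)
{
    if (alpha == 0.0) {
        for (size_t i = 0; i < n; i++) y[i] = 0.0;
        return 0;
    }
    /* The buggy version compared against 0.0 here, so this branch was
       unreachable because alpha == 0.0 had already been handled. */
    if (alpha == 1.0) {
        toy_copyv(n, x, y);
        return 1;
    }
    for (size_t i = 0; i < n; i++) y[i] = alpha * x[i];
    return 2;
}
```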

commit fab18dca46618799bb0b4f652820b33d36a5d4d4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 22 16:50:00 2023 -0600

Use 'void*' datatypes in kernel APIs. (727)

Details:
- Migrated all kernel APIs to use void* pointers instead of float*,
double*, scomplex*, and dcomplex* pointers. This allows us to define
many fewer kernel function pointer types, which also makes it much
easier to know which function pointer type to use at any given time.
(For example, whereas before there was ?axpyv_ker_ft, ?axpyv_ker_vft,
and axpyv_ker_vft, now there is just axpyv_ker_ft, which is equivalent
to what axpyv_ker_vft used to be.)
- Refactored how kernel function prototypes and kernel function types
are defined so as to reduce redundant code. Specifically, the
function signatures (excluding cntx_t* and, in the case of level-3
microkernels, auxinfo_t*) are defined in new headers named, for
example, bli_l1v_ker_params.h. Those signatures are reused via macro
instantiation when defining both kernel prototypes and kernel function
types. This will hopefully make it a little easier to update, add, and
manage kernel APIs going forward.
- Updated all reference kernels according to the aforementioned switch
to void* pointers.
- Updated all optimized kernels according to the aforementioned switch
to void* pointers. This sometimes required renaming variables,
inserting typecasting so that pointer arithmetic could continue to
function as intended, and related tweaks.
- Updated sandbox/gemmlike according to the aforementioned switch to
void* pointers.
- Renamed:
- frame/1/bli_l1v_ft_ker.h -> frame/1/bli_l1v_ker_ft.h
- frame/1f/bli_l1f_ft_ker.h -> frame/1f/bli_l1f_ker_ft.h
- frame/1m/bli_l1m_ft_ker.h -> frame/1m/bli_l1m_ker_ft.h
- frame/3/bli_l3_ft_ukr.h -> frame/3/bli_l3_ukr_ft.h
- frame/3/bli_l3_sup_ft_ker.h -> frame/3/bli_l3_sup_ker_ft.h
to better align with naming of neighboring files.
- Added the missing "void* params" argument to bli_?packm_struc_cxk() in
frame/1m/packm/bli_packm_struc_cxk.c. This argument is being passed
into the function from bli_packm_blk_var1(), but wasn't being "caught"
by the function definition itself. The function prototype for
bli_?packm_struc_cxk() also needed updating.
- Reordered the last two parameters in bli_?packm_struc_cxk().
(Previously, the "void* params" was passed in after the
"const cntx_t* cntx", although because of the above bug the params
argument wasn't actually present in the function definition.)
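
As a rough illustration of the void* migration, the sketch below uses a deliberately simplified axpyv signature. The real kernel APIs also take conjugation, stride, and cntx_t arguments, and the names ending in _sketch are hypothetical. It shows one function pointer type serving two datatypes, with the typecasting moved inside each kernel.

```c
#include <assert.h>
#include <stddef.h>

/* A single function-pointer type covers every datatype once the data
   pointers are void*. (Hypothetical, simplified signature.) */
typedef void (*axpyv_ker_ft_sketch)(size_t n, const void *alpha,
                                    const void *x, void *y);

/* Double-precision instance: y := y + alpha * x. */
void daxpyv_sketch(size_t n, const void *alpha, const void *x, void *y)
{
    /* Cast to typed pointers so that indexing and pointer arithmetic
       operate on elements rather than on void*. */
    const double *alpha_d = alpha;
    const double *x_d     = x;
    double       *y_d     = y;

    assert(alpha_d != NULL);
    for (size_t i = 0; i < n; ++i)
        y_d[i] += *alpha_d * x_d[i];
}

/* Single-precision instance with the same void*-based signature. */
void saxpyv_sketch(size_t n, const void *alpha, const void *x, void *y)
{
    const float *alpha_s = alpha;
    const float *x_s     = x;
    float       *y_s     = y;

    assert(alpha_s != NULL);
    for (size_t i = 0; i < n; ++i)
        y_s[i] += *alpha_s * x_s[i];
}
```

Either kernel can be stored in a variable of type axpyv_ker_ft_sketch and called through it, which is the property that lets the number of function pointer types shrink.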

commit 93c63d1f469c4650df082d0fa2f29c46db0e25f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 20 11:14:23 2023 -0600

Use 'const' pointers in kernel APIs. (722)

Details:
- Qualified all input-only data pointers in the various kernel APIs with
the 'const' keyword while also removing 'restrict' from those kernel
APIs. (Use of 'restrict' was maintained in kernel implementations,
where appropriate.) This affected the function pointer types defined
for all of the kernels, their prototypes, and the reference and
optimized kernel definitions' signatures.
- Templatized the definitions of copys_mxn and xpbys_mxn static inline
functions.
- Minor whitespace and style changes (e.g. combining local variable
declaration and initialization into a single statement).
- Removed some unused kernel code left in 'old' directories.
- Thanks to Nisanth M P for helping to validate changes to the power10
microkernels.

commit 4e18cd34f909c5045597f411340ede3a5e0bc5e1
Author: RuQing Xu <ruqing.xuphys.s.u-tokyo.ac.jp>
Date: Sun Feb 19 04:18:41 2023 +0900

Restored ArmSVE general storage case. (708)

Details:
- Restored general storage case in armsve kernels.
- Reason for doing this: Though real `g`-storage is difficult to
speed up, the `g`-codepath here can provide good support for
transposed storage, i.e., it is at least good for `GEMM_UKR_SETUP_CT_AMBI`.
- By experience, this solution is only *a little* slower than in-reg
transpose. Plus in-reg transpose is only possible for a fixed VL in
our case.

commit 0ba6e9eafb1e667373d9dbc2aa045557921f33e2
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Sat Feb 18 13:15:42 2023 -0600

Refined emacs handling of indentation. (717)

Details:
- This refines the emacs autoformatting to be better in line with
contribution guidelines.
- Removed a stray shebang in a .mk file which confuses emacs about the
file mode, which should be makefile-mode. (emacs also removes stray
whitespace at the ends of lines.)

commit 059f15105b1643fe56084f883c22b3cadf368b39
Author: ct-clmsn <ct.clmsngmail.com>
Date: Sat Feb 18 14:13:23 2023 -0500

Updated hpx namespace for make_count_shape. (725)

Details:
- The hpx namespace for *counting_shape changed. This PR updates the use
of counting_shape in blis to comply with the change in hpx.
- Co-authored-by: ctaylor <ctaylortactcomplabs.com>

commit 0b421eff130b5c896edcc09e7358d18564d177e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 18 13:11:41 2023 -0600

Added an 'arm64' entry to `.travis.yml`. (726)

Details:
- Added a new 'arm64' entry to the .travis.yml file in an attempt to get
Travis CI to compile both NEON and SVE kernels, even if only NEON
kernels are exercised in the testing. With this new 'arm64' entry, the
'cortexa57' entry becomes redundant and may be removed. Thanks to
RuQing Xu for this suggestion.
- Previously, the macro BLIS_SIMD_MAX_SIZE was *not* being set in
bli_kernels_arm64.h, which meant that the default value of 64 was
being used. This caused a runtime consistency check to fail in
bli_gks.c (in Travis CI), one which requires that

mr * nr * dt_size <= BLIS_STACK_BUF_MAX_SIZE

for all datatype sizes dt_size, where BLIS_STACK_BUF_MAX_SIZE is
defined as

BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2

This commit increases BLIS_SIMD_MAX_SIZE to 128 for the 'arm64'
configuration, thus overriding the default and (hopefully) avoiding
the aforementioned consistency check failures.
- Appended '|| cat ./output.testsuite' to all 'make' commands in
travis/do_testsuite.sh. Thanks to RuQing Xu for this suggestion.
- Whitespace changes.
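
The stack buffer arithmetic described above can be sketched as follows. This assumes the check passes when a microtile fits within the stack buffer, which is consistent with raising BLIS_SIMD_MAX_SIZE being the remedy. The function names are hypothetical, and the register count of 32 and the 32 x 10 dcomplex microtile used in the note below are illustrative assumptions, not values taken from this commit.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors BLIS_SIMD_MAX_NUM_REGISTERS * BLIS_SIMD_MAX_SIZE * 2. */
size_t stack_buf_max_size_sketch(size_t simd_max_num_registers,
                                 size_t simd_max_size)
{
    return simd_max_num_registers * simd_max_size * 2;
}

/* Returns nonzero if an mr x nr microtile of elements that are dt_size
   bytes wide fits in a stack buffer of the given size. */
int microtile_fits_sketch(size_t mr, size_t nr, size_t dt_size,
                          size_t stack_buf_max_size)
{
    assert(dt_size > 0);
    return mr * nr * dt_size <= stack_buf_max_size;
}
```

Under those assumptions, a SIMD size of 64 gives a 4096-byte buffer, which a 32 x 10 microtile of 16-byte elements (5120 bytes) does not fit into, while a SIMD size of 128 gives 8192 bytes, which it does.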

commit b1d3fc7e5b0927086e336a23f16ea59aa3611ccb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 10 15:34:47 2023 -0600

Redirect grep stderr to /dev/null. (723)

Details:
- In common.mk, added a redirection of stderr to /dev/null for the grep
command being used to gather a list of header files included from
bli_cntx_ref.c. The redirection is desirable because as of grep 3.8,
regular expressions with "stray" backslashes trigger warnings [1].
But removing the backslash seems to break the BLIS build system when
using pre-3.8 versions of grep, so this seems to be the easiest way to
satisfy the BLIS build system for both pre- and post-3.8 grep
environments.

[1] https://lists.gnu.org/archive/html/info-gnu/2022-09/msg00001.html

commit e3d352f1fcc93e6a46fde1aa4a7f0a18fb27bd42
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Feb 8 06:11:41 2023 +0530

Added runtime selection of 'power' config family. (718)

Details:
- Created a 'power' umbrella configuration family, which, when targeted
at configure-time, will build both 'power9' and 'power10' subconfigs.
(With this feature, a BLIS shared library could be compiled on a
power9 system and run on power10 and vice-versa. Unoptimised code
will execute if it is linked and run on any other generic system.)
- This new configuration family will only work with gcc, since that is
the only compiler supported by both power9 and power10 subconfigs in
BLIS.
- Documented power9 and power10 as supported microarchitectures in the
docs/HardwareSupport.md document.

commit e730c685d09336b3bd09e86c94330c4eba967f3e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 6 15:31:54 2023 -0600

Define `BLIS_VERSION_STRING` in `blis.h`. (720)

Details:
- Previously, the version string was communicated from configure to
config.mk (via the config.mk.in template), where it was included via
the top-level Makefile, where it was then used to define the
preprocessor macro BLIS_VERSION_STRING via a command line argument to
the compiler (via -D). This macro is then used within bli_info.c to
initialize a static string which can then be queried via the
bli_info_get_version_str() function. However, there are some
applications that may find utility in being able to access the version
string by inspecting the monolithic (flattened) blis.h header file
that is created at compile time and installed alongside the library.
This commit moves the definition of BLIS_VERSION_STRING into
bli_config.h (via the bli_config.h.in template) so that it is
embedded in blis.h. The version string is now available in three
places:
- the static/shared library, which is installed in the 'lib'
subdirectory of the install prefix (query-able via the
bli_info_get_version_str() function);
- the config.mk makefile fragment, which is installed in the 'share'
subdirectory of the install prefix (in the VERSION variable);
- the blis.h header file, which is installed in the 'include'
subdirectory of the install prefix (via the BLIS_VERSION_STRING
macro constant).
Thanks to Mohsen Aznaveh and Tim Davis for providing the idea for this
change.
- CREDITS file update.

commit dc5d00a6ce0350cd82859d8c24f23d98f205d8db
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Fri Jan 27 17:36:47 2023 -0600

Typecast printf() args to avoid compiler warnings. (716)

Details:
- In bli_thread_range_tlb.c, typecast integer arguments passed to
printf() -- which are typically disabled unless debugging -- to type
"long" to guarantee a match to the "%ld" format specifiers used in
those calls. This avoids spurious warnings with certain compilers in
certain toolchain environments, such as 32-bit RISC-V (rv32iv).
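
A minimal sketch of the pattern, assuming a hypothetical integer type dim_sketch_t that is not necessarily "long" on every target (the actual BLIS integer types and the debug messages in bli_thread_range_tlb.c are not reproduced here):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical integer type whose width may differ from long. */
typedef int32_t dim_sketch_t;

/* Formats a range into buf. The explicit casts guarantee that the
   arguments match the "%ld" specifiers regardless of how dim_sketch_t
   is defined on the target. Returns what snprintf() returns. */
int format_range_sketch(char *buf, size_t len,
                        dim_sketch_t start, dim_sketch_t end)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "start=%ld end=%ld", (long)start, (long)end);
}
```

Without the casts, passing a 32-bit int where "%ld" expects a long is a format mismatch that compilers may warn about and that is undefined behavior where the two types differ in size.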

commit ecbcf4008815035c695822fcaf106477debff89a
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Wed Jan 18 20:35:50 2023 -0600

Use here-document for 'configure --help' output. (714)

Details:
- Changed the configure script function that outputs "--help" text to do
so via so-called "here-document" syntax for improved readability and
maintainability. The change eliminates hundreds of echo statements and
makes it easier to change existing configure options' help text, along
with other benefits such as eliminating the need to escape double-
quote characters (").

commit c334ec278f5e2a101625629b2e13bbf1b38dede5
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jan 18 13:10:19 2023 -0600

Merge tlb- and slab/rr-specific gemm macrokernels. (711)

Details:
- Merged the tlb-specific gemm macrokernel (_var2b) with the slab/rr-
specific one (_var2) so that a single function can be compiled with
either tlb or slab/rr support, depending on which of the
BLIS_ENABLE_JRIR_TLB, _SLAB, and _RR macros is defined. This is done by incorporating
information from both approaches: the start/end/inc for the JR and IR
loops from slab or rr partitioning; and the number of assigned
microtiles, plus the starting IR dimension offset for all iterations
after the first (ir_next). With these changes, slab, rr, and tlb can
all be parameterized by initializing a similar set of variables prior
to the jr loop.
- Removed the wrap-around logic that sets the "b_next" field of the
auxinfo_t struct, which executes during the last IR iteration of the
last JR iteration. The potential benefit of this code is so minor
(and hinges on the microkernel making use of the b_next field) that
it's arguably not worth including. The code also does the wrong
thing for some threads whenever JR_NT > 1, since only thread 0 (in the
JR group) would even compute with the first micropanel of B.
- Re-expressed the definition of bli_is_last_iter_slrr so that slab and
tlb use the same code rather than rr and tlb.
- Adjusted the initialization of the gemm control tree accordingly.

commit 5793a77937aee9847a5692c8e44b36a6380800a1
Author: HarshDave12 <122850830+HarshDave12users.noreply.github.com>
Date: Tue Jan 17 21:55:02 2023 +0530

Fixed mis-mapped instruction for VEXTRACTF64X2. (713)

Details:
- This commit fixes a typo in the macro definition for the extended
inline assembly macro VEXTRACTF64X2 in bli_x86_asm_macros.h. The macro
was previously defined (incorrectly) in terms of the vextractf64x4
instruction rather than vextractf64x2.
- CREDITS file update.

commit 16d2e9ea9ca0853197b416eba701b840a8587bca
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 13 20:03:01 2023 -0600

Defined lt, lte, gt, gte + misc. other updates. (712)

Details:
- Changed invertsc operation to be a non-destructive operation; that is,
it now takes separate input and output operands. This change applies
to both the object and typed APIs.
- Defined an alternative square root operation, sqrtrsc, which, when
operating on complex scalars, assumes the imaginary part of the input
to be zero.
- Changed the semantics of addm, subm, copym, axpym, scal2m, and xpbym
so that when the source matrix has an implicit unit diagonal, the
operation leaves the diagonal of the destination matrix untouched.
Previously, the operations would interpret an implicit unit diagonal
on the source matrix as a request to manifest the unit diagonal
*explicitly* on output (either as something to copy in the case of
copym, or something to compute with in the cases of addm, subm, axpym,
scal2m, and xpbym). It turns out that this behavior was too cute by
half and could cause unintended headaches for practical use cases.
(This change in behavior also required small modifications to the trmv
and trsv testsuite modules so that they would properly test matrices
with unit diagonals.)
- Added missing dependencies for copym to gemv, ger, hemv, trmv, and
trsv testsuite modules.
- Implemented level-0-like ltsc, ltesc, gtsc, gtesc operations in
frame/util, which use lt, lte, gt, and gte level-0 scalar macros.
- Trivial variable rename in bli_part.c to harmonize with other
variable naming conventions.
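
The new unit-diagonal semantics can be sketched for a copym-like operation on a lower-stored, column-major matrix. The function name and signature are hypothetical and much simpler than the BLIS typed API. Only the strictly lower triangle is copied, so the diagonal of the destination is left untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Copies the strictly lower triangle of the m x m column-major matrix a
   into b. Because a is treated as having an implicit unit diagonal, the
   diagonal of b is not written (and neither is its upper triangle). */
void copym_lower_unit_sketch(size_t m, const double *a, size_t lda,
                             double *b, size_t ldb)
{
    assert(lda >= m && ldb >= m);
    for (size_t j = 0; j < m; ++j)
        for (size_t i = j + 1; i < m; ++i)
            b[i + j * ldb] = a[i + j * lda];
}
```

Under the previous behavior described above, the same call would also have written ones onto the diagonal of b.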

commit 9a366b14fe52c469f4664ef5dd93d85be8d97baa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 12 13:07:22 2023 -0600

Implement cntx_t pointer caching in gks. (709)

Details:
- Refactored the gks cntx_t query functions so that: (1) there is a
clearer pattern of similarity between functions that query a native
context and those that query its induced (1m) counterpart; and (2)
queried cntx_t pointers (for both native and induced cntx_t pointers)
are cached (by default), or deep-queried upon each invocation,
depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is defined.
- Refactored query-related functions in bli_arch.c to cache the queried
arch_t value (by default), or deep-query the arch_t value upon each
invocation, depending on whether cpp macro BLIS_ENABLE_GKS_CACHING is
defined.
- Tweaked the behavior of bli_gks_query_ind_cntx_impl() (formerly named
bli_gks_query_ind_cntx()) so that the induced method cntx_t struct is
repopulated each time the function is called. (It is still only
allocated once on first call.) This was mostly done in preparation for
some future in which the arch_t value might change at runtime. In such
a scenario, the induced method context would need to be recalculated
any time the native context changes.
- Added preprocessor logic to bli_config_macro_defs.h to handle enabling
or disabling of cntx_t pointer caching (via BLIS_ENABLE_GKS_CACHING).
- For now, cntx_t pointer caching is enabled by default and does not
correspond to any official configure option. Disabling can be done
by inserting a define for BLIS_DISABLE_GKS_CACHING into the
appropriate bli_family_*.h header file within the configuration of
interest.
- Thanks to Harihara Sudhan S (AMD) for suggesting that cntx_t pointers
(and not just arch_t values) be cached.
- Comment updates.
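
The caching pattern can be sketched as below. All names are hypothetical (the SKETCH_* macros merely imitate the BLIS_ENABLE_GKS_CACHING / BLIS_DISABLE_GKS_CACHING pair), the "deep query" is reduced to returning a pointer to a static struct, and thread safety is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cntx_t. */
typedef struct { int arch_id; } cntx_sketch_t;

static cntx_sketch_t native_cntx_sketch = { 7 };
static int deep_queries_sketch = 0;   /* counts the expensive lookups */

/* Stand-in for the full (uncached) query path. */
const cntx_sketch_t *deep_query_cntx_sketch(void)
{
    ++deep_queries_sketch;
    return &native_cntx_sketch;
}

/* Caching is on by default and can be turned off with a DISABLE macro. */
#ifndef SKETCH_DISABLE_GKS_CACHING
#define SKETCH_ENABLE_GKS_CACHING
#endif

const cntx_sketch_t *query_cntx_sketch(void)
{
#ifdef SKETCH_ENABLE_GKS_CACHING
    /* Deep-query once, then reuse the cached pointer.
       Note: this sketch is not thread-safe. */
    static const cntx_sketch_t *cached = NULL;
    if (cached == NULL)
        cached = deep_query_cntx_sketch();
    assert(cached != NULL);
    return cached;
#else
    /* Deep-query on every invocation. */
    return deep_query_cntx_sketch();
#endif
}
```

With caching enabled, repeated calls return the same pointer while the deep query runs only once.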

commit b895ec9f1f66fb93972589c06bff171337153a31
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 09:02:32 2023 +0530

Fixing type-mismatch errors in power10 sandbox (701)

Details:
- This commit fixes a mismatch between the function type signature of
bli_gemm_ex() required by BLIS and the version of the function defined
within the power10 sandbox. It also performs typecasting upon calling
bli_gemm_front() to attain type consistency with the type signature
defined by BLIS for bli_gemm_front().

commit 38d88d5c131253066cad4f98eea06fa9299cae3b
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jan 10 21:24:58 2023 -0600

Define new global scalar (obj_t) constants. (703)

Details:
- This commit defines the following new global scalar constants:
- BLIS_ONE_I: This constant encodes the imaginary unit.
- BLIS_MINUS_ONE_I: This constant encodes the negative imaginary unit.
- BLIS_NAN: This constant encodes a not-a-number value. Both real and
imaginary parts are set to NaN for complex datatypes.

commit cdb22b8ffa5b31a0c16ac1a7bcecefeb5216f669
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 08:50:57 2023 +0530

Disable power10 kernels other than sgemm, dgemm. (705)

Details:
- There is a power10 sandbox which uses microkernels for datatypes other
than float and double (or scomplex/dcomplex). In a regular power10-
configured build (that is, with the sandbox disabled), there were
compile errors for some of these other non-sgemm/non-dgemm
microkernels. This commit protects those kernels with a new cpp macro
guard (which is defined in sandbox/power10/bli_sandbox.h) that
prevents that kernel code from being compiled for normal, non-sandbox
power10 builds.

commit d220f9c436c0dae409974724d42ab6c52f12a726
Author: Nisanth M P <nisanthmp.01gmail.com>
Date: Wed Jan 11 08:43:03 2023 +0530

Fix k = 0 edge case in power10 microkernels (706)

Details:
- When power10 sgemm and dgemm microkernels are called with k = 0, they
become caught in infinite loops and segfault. This is fixed now via an
early exit in the case of k = 0.
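
A sketch of why an early exit helps, under the assumption that the k loop is written in a decrement-and-test form that always executes at least once. The actual power10 microkernels are not reproduced here, and this 2 x 2 kernel, which computes only C := C + A * B without alpha or beta scaling, is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2 x 2 microkernel: c := c + a * b, where a and b are packed
   micropanels holding 2 elements per iteration of k, and c is a 2 x 2
   column-major microtile. */
void gemm_ukr_2x2_sketch(size_t k, const double *a, const double *b,
                         double *c)
{
    assert(c != NULL);

    /* Early exit: the do-while loop below assumes at least one iteration.
       Without this guard, k == 0 would read past the panels and then wrap
       the unsigned counter around, looping (practically) forever. */
    if (k == 0)
        return;

    double ab[4] = { 0.0, 0.0, 0.0, 0.0 };
    do {
        ab[0] += a[0] * b[0];
        ab[1] += a[1] * b[0];
        ab[2] += a[0] * b[1];
        ab[3] += a[1] * b[1];
        a += 2;
        b += 2;
    } while (--k != 0);

    for (size_t i = 0; i < 4; ++i)
        c[i] += ab[i];
}
```

Because this sketch has no beta scaling, returning immediately for k = 0 leaves C unchanged, which is the correct result for C := C + A * B when the inner dimension is empty.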

commit 2e1ba9d13c23a06a7b6f8bd326af428f7ea68c31
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 10 21:05:54 2023 -0600

Tile-level partitioning in jr/ir loops (ex-trsm). (695)

Details:
- Reimplemented parallelization of the JR loop in gemmt (which is
recycled for herk, her2k, syrk, and syr2k). Previously, the
rectangular region of the current MC x NC panel of C would be
parallelized separately from the diagonal region of that same
submatrix, with the rectangular portion being assigned to threads via
slab or round-robin (rr) partitioning (as determined at configure-
time) and the diagonal region being assigned via round-robin. This
approach did not work well when extracting lots of parallelism from
the JR loop and was often suboptimal even for smaller degrees of
parallelism. This commit implements tile-level load balancing (tlb) in
which the IR loop is effectively subjugated in service of more
equitably dividing work in the JR loop. This approach is especially
potent for certain situations where the diagonal region of the MC x NR
panel of C is significant relative to the entire region. However, it
also seems to benefit many problem sizes of other level-3 operations
(excluding trsm, which has an inherent algorithmic dependency in the
IR loop that prevents the application of tlb). For now, tlb is
implemented as _var2b.c macrokernels for gemm (which forms the basis
for gemm, hemm, and symm), gemmt (which forms the basis of herk,
her2k, syrk, and syr2k), and trmm (which forms the basis of trmm and
trmm3). Which function pointers (_var2() or _var2b()) are embedded in
the control tree will depend on whether the BLIS_ENABLE_JRIR_TLB cpp
macro is defined, which is controlled by the value passed to the
existing --thread-part-jrir=METHOD (or -r METHOD) configure option.
This commit adds 'tlb' as a valid option alongside the previously
supported values of 'slab' and 'rr'. ('slab' is still the default.)
Thanks to Leick Robinson for abstractly inspiring this work, and to
Minh Quan Ho for inquiring (in PR 562, and before that in Issue 437)
about the possibility of improved load balance in macrokernel loops,
and even prototyping what it might look like, long before I fully
understood the problem.
- In bli_thread_range_weighted_sub(), tweaked the way we compute the
area of the current MC x NC trapezoidal panel of C by better taking
into account the microtile structure along the diagonal. Previously,
it was an underestimate, as it assumed MR = NR = 1 (that is, it
assumed that the microtile column of C that overlapped with microtiles
exactly coincided with the diagonal). Now, we only assume MR = NR.
This is still a slight underestimate when MR != NR, so the additional
area is scaled by 1.5 in a hackish attempt to compensate for this, as
well as other additional effects that are difficult to model (such as
the increased cost of writing to temporary tiles before finally
updating C). The net effect of this better estimation of the
trapezoidal area should be (on average) slightly larger regions
assigned to threads that have little or no overlap with the diagonal
region (and correspondingly slightly smaller regions in the diagonal
region), which we expect will lead to slightly better load balancing
in most situations.
- Spun off the contents of bli_thread.[ch] that relate to computing
thread ranges into one of three source/header file pairs:
- bli_thread_range.[ch], which define functions that are not specific
to the jr/ir loops;
- bli_thread_range_slab_rr.[ch], which define functions that implement
slab or round-robin partitioning for the jr/ir loops;
- bli_thread_range_tlb.[ch], which define functions that implement
tlb for the jr/ir loops.
- Fixed the computation of a_next in the last iteration of the IR loop
in bli_gemmt_l_ker_var2(). Previously, it always "wrapped" back around
to the first micropanel of the current MC x KC packed block of A.
However, this is almost never actually the micropanel that is used
next. A new macro, bli_gemmt_l_wrap_a_upanel(), computes a_next
correctly, with a similarly named bli_gemmt_u_wrap_a_upanel() for use
in the upper-stored case (which *does* actually always choose the
first micropanel of A as its a_next at the end of the IR loop).
- Removed adjustments for a_next/b_next (a2/b2) for the diagonal-
intersecting case of gemmt_l_ker_var2() and the above-diagonal case
of gemmt_u_ker_var2() since these cases will only coincide with the
last iteration of the IR loop in very small problems.
- Defined bli_is_last_iter_l() and bli_is_last_iter_u(), the latter of
which explicitly considers whether the current microtile is the last
tile that intersects the diagonal. (The former does the same, but the
computation coincides with the original bli_is_last_iter().) These
functions are now used in gemmt to test when a_next (or a2) should
"wrap" (as discussed above). Also defined bli_is_last_iter_tlb_l()
and bli_is_last_iter_tlb_u(), which are similar to the aforementioned
functions but are used when employing tlb in gemmt.
- Redefined macros in bli_packm_thrinfo.h, which test whether an
iteration of work is assigned to a thread, as static inline functions
in bli_param_macro_defs.h (and then deleted bli_packm_thrinfo.h).
In the process of redefining these macros, I also renamed them from
bli_packm_my_iter_rr/sl() to bli_is_my_iter_rr/sl().
- Renamed
bli_thread_range_jrir_rr() -> bli_thread_range_rr()
bli_thread_range_jrir_sl() -> bli_thread_range_sl()
bli_thread_range_jrir() -> bli_thread_range_slrr()
- Renamed
bli_is_last_iter() -> bli_is_last_iter_slrr()
- Defined
bli_info_get_thread_jrir_tlb()
and renamed:
- bli_info_get_thread_part_jrir_slab() ->
bli_info_get_thread_jrir_slab()
- bli_info_get_thread_part_jrir_rr() ->
bli_info_get_thread_jrir_rr()
- Modified bli_rntm_set_ways_for_op() to redirect IR loop parallelism
into the JR loop when tlb is enabled for non-trsm level-3 operations.
- Added a sanity check to prevent bli_prune_unref_mparts() from being
used on packed objects. This prohibition is necessary because the
current implementation does not take into account the atomicity of
packed micropanel widths relative to the diagonal of structured
matrices. That is, the function prunes greedily without regard to
whether doing so would prune off part of a micropanel *which has
already been packed* and assigned to a thread for inclusion in the
computation.
- Further restricted early returns in bli_prune_unref_mparts() to
situations where the primary matrix is not only of general structure
but also dense (in terms of its uplo_t value). The addition of the
matrix's dense-ness to the conditional is required because gemmt is
somewhat unusual in that its C matrix has general structure but is
marked as lower- or upper-stored via its uplo_t. By only checking
for general structure, attempts to prune gemmt C matrices would
incorrectly result in early returns, even though that operation
effectively treats the matrix as symmetric (and stored in only one
triangle).
- Fixed a latent bug in bli_thread_range_rr() wherein incorrect ranges
were computed when 1 < bf. Thankfully, this bug was not yet
manifesting since all current invocations used bf == 1.
- Fixed a latent bug in some unexercised code in bli_?gemmt_l_ker_var2()
that would perform incorrect pruning of unreferenced regions above
where the diagonal of a lower-stored matrix intersects the right edge.
Thankfully, the bug was not harming anything since those unreferenced
regions were being pruned prior to the macrokernel.
- Rewrote slab/rr-based gemmt macrokernels so that they no longer carved
C into rectangular and diagonal regions prior to parallelizing each
separately. The new macrokernels use a unified loop structure where
quadratic (slab) partitioning is used.
- Updated all level-3 macrokernels to have a more uniform coding style,
such as wrt combining variable declarations with initializations as
well as the use of const.
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
bli_thrinfo_thread_id(), respectively. This change probably should
have been included in aeb5f0c.
- Removed old prototypes in bli_gemmt_var.h and bli_trmm_var.h that
corresponded to functions that were removed in aeb5f0c.
- Other very minor cleanups.
- Comment updates.
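
The difference between the three partitioning methods can be sketched for a dense (gemm-like) panel with n_iter JR iterations and m_iter IR iterations. The names below are hypothetical, the way leftover iterations are spread across threads is an assumption of the sketch, and the diagonal-aware counting that gemmt needs is not modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start, end, inc; } range_sketch_t;
typedef struct { size_t n_tiles, jr_start, ir_start; } tlb_sketch_t;

/* First index of thread tid's contiguous share of n items, giving one
   extra item to each of the first (n % nt) threads. */
size_t share_start_sketch(size_t nt, size_t tid, size_t n)
{
    size_t lo = n / nt, rem = n % nt;
    return tid * lo + (tid < rem ? tid : rem);
}

/* slab: each thread gets one contiguous block of JR iterations. */
range_sketch_t range_slab_sketch(size_t nt, size_t tid, size_t n_iter)
{
    assert(nt > 0 && tid < nt);
    range_sketch_t r = { share_start_sketch(nt, tid, n_iter),
                         share_start_sketch(nt, tid + 1, n_iter), 1 };
    return r;
}

/* rr: thread tid takes iterations tid, tid + nt, tid + 2*nt, ... */
range_sketch_t range_rr_sketch(size_t nt, size_t tid, size_t n_iter)
{
    assert(nt > 0 && tid < nt);
    range_sketch_t r = { tid, n_iter, nt };
    return r;
}

/* tlb: the m_iter * n_iter microtiles are divided among threads directly.
   Each thread gets a tile count plus the (jr, ir) position of its first
   tile, with tiles ordered down each JR column before moving to the next. */
tlb_sketch_t range_tlb_sketch(size_t nt, size_t tid,
                              size_t m_iter, size_t n_iter)
{
    assert(nt > 0 && tid < nt && m_iter > 0);
    size_t total = m_iter * n_iter;
    size_t first = share_start_sketch(nt, tid, total);
    size_t next  = share_start_sketch(nt, tid + 1, total);
    tlb_sketch_t t = { next - first, first / m_iter, first % m_iter };
    return t;
}
```

With 4 threads, 3 JR iterations, and 2 IR iterations, slab partitioning leaves the fourth thread with an empty range, whereas the tile-level split assigns 2, 2, 1, and 1 microtiles, so every thread has work.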

commit b6735ca26b9d459d9253795dc5841ae8de9e84c9
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jan 6 14:10:01 2023 -0600

Refactor structure awareness in packm_blk_var1.c. (707)

Details:
- Factored some of the structure awareness out of the loop in
bli_packm_blk_var1(). So instead of having a single loop with
conditionals in the body to handle various kinds of structure (and
stored/unstored submatrix placement), we now have a conditional branch
to handle various structure/storage scenarios with a loop in each
section. This change was originally motivated to choose slab or round-
robin partitioning (in the context of triangular matrices) based on
the structure of the entire block (or panel) being packed rather than
each micropanel individually. Previously, the code would attempt to
limit rr to the portion of the block that intersects the diagonal and
use slab for the remainder. However, that approach was not well-thought
out and in many situations this would lead to inferior load balancing
when compared to using round-robin for the entire block (or panel).
This commit has the added benefit of incurring less overhead during
the packing process now that each of the new loops is simpler.

commit f956b79922da412791e4c8b8b846b3aafc0a5ee0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 31 20:18:08 2022 -0600

Switch to l3 sup decorator in gemmlike sandbox. (704)

Details:
- Modified the gemmlike sandbox to call bli_l3_sup_thread_decorator()
rather than a local analogue of that code. This reduces redundant
logic and makes it easier for the sandbox to inherit future
improvements to the framework's threading code.
- Moved addon/gemmd to addon/old/gemmd. This code has fallen out of date
and is taking too much effort to maintain. We will very likely
reimplement it completely once future changes are made to the
framework proper.

commit 538150c5845ad903773ca797c740048174116aa4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 25 22:28:09 2022 -0600

Applied race condition fix to sup thread decorator.

Details:
- Applied the race condition bugfix in commit 7d23dc2 to the
corresponding sup code in bli_l3_sup_decor.c. Note that in the case
of sup, the race condition would have only manifested when optional
packing was enabled at runtime (typically via setting BLIS_PACK_A
and/or BLIS_PACK_B environment variables).
- Both the fix in this commit and the fix in 7d23dc2 address bugs
that were introduced when the thrinfo_t trees/communicators were
restructured in the October omnibus commit (aeb5f0c).

commit 7d23dc2a064a371dc9883e2c2c7236a70912428c
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Dec 25 19:09:14 2022 -0600

Fix a race condition which manifested as incorrect results (rarely). (702)

The problem occurs when there are at least two teams of threads packing different parts of a matrix, and where each team has at least two threads; call them team A and team B. The problematic sequence is:

1. The chief of team A checks out a block B and broadcasts the pointer to its teammates.
2. Team A completely packs their data and performs a barrier amongst themselves.
3. Team A commences computing with the packed data.
4. The chief of team A finishes computing before its teammates, then calls bli_thrinfo_free on its thrinfo_t struct (which contains the mem_t object referencing the buffer B). This causes buffer B to be checked back in to the pba.
5. The chief of team B checks out the *same* block B that was just checked back in and broadcasts the pointer to its teammates.
6. DATA RACE: now the remaining threads of team A are reading *while* team B are writing to the same buffer B. If team B write new data before team A are done computing then an incorrect result is generated.

The solution is to place a global barrier before the call to bli_thrinfo_free at the end of the computation.
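
Steps 4 and 5 hinge on the pool handing a just-released block straight back out. The single-threaded sketch below shows only that reuse, using a hypothetical first-fit pool (pool_sketch_t and its functions are not the BLIS pba API). The race itself and the barrier that fixes it are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS_SKETCH 2
#define BLOCK_ELEMS_SKETCH 8

/* Hypothetical block pool with first-fit reuse. */
typedef struct {
    double storage[POOL_BLOCKS_SKETCH][BLOCK_ELEMS_SKETCH];
    int    in_use[POOL_BLOCKS_SKETCH];
} pool_sketch_t;

/* Checks out the first free block, or returns NULL if none is free. */
double *pool_checkout_sketch(pool_sketch_t *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < POOL_BLOCKS_SKETCH; ++i) {
        if (!pool->in_use[i]) {
            pool->in_use[i] = 1;
            return pool->storage[i];
        }
    }
    return NULL;
}

/* Checks a block back in, making it available to the next caller. */
void pool_checkin_sketch(pool_sketch_t *pool, double *block)
{
    assert(pool != NULL);
    for (size_t i = 0; i < POOL_BLOCKS_SKETCH; ++i)
        if (pool->storage[i] == block)
            pool->in_use[i] = 0;
}
```

If the chief of team A checks its block in while teammates are still reading from it, the next checkout (by the chief of team B) receives the same address, which is why the check-in has to wait until every thread has finished computing.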

Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit 3accacf57d11e9b109339754f91bf22329b6cb6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 16 10:26:33 2022 -0600

Skip 1m optimization when forcing hemm_l/symm_l. (697)

Details:
- Fixed a bug in right-sided hemm when:
- using the 1m method,
- defining BLIS_DISABLE_HEMM_RIGHT in the active subconfiguration,
and
- the storage of C matches the gemm microkernel IO preference PRIOR to
the right-sidedness being detected and recast in terms of the left-
side code path.
It turns out that bli_gemm_ind_recast_1m_params() was applying its
optimization (recasting a complex-domain macrokernel calling a 1m
virtual microkernel to a real-domain macrokernel calling the real-
domain microkernel) in situations in which it should not have. The
optimization was silently assuming that the storage of C always
matched that of the microkernel preference, since the front-end (in
this case, bli_hemm_front()) would have already had a chance to
transpose the operation to bring the two into agreement. However, by
disabling right-sided hemm, we deprive BLIS of that flexibility (as a
transposed left-sided case would necessarily have to become a right-
sided case), and thus the assumption was no longer holding in all
cases. Thanks to Nisanth M P for reporting this bug in Issue 621.
- The aforementioned bug, and its bugfix, also apply to symm when
BLIS_DISABLE_SYMM_RIGHT is defined.
- Comment updates.
- CREDITS file update.

commit 4833ba224eba54df3f349bcb7e188bcc53442449
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 12 20:26:02 2022 -0600

Fixed perf of mt sup with packing, and mt gemmlike. (696)

Details:
- Brought the gemmsup code path up to date relative to the latest
thrinfo_t semantics introduced in the October Omnibus commit
(aeb5f0c). This was done by passing the prenode (instead of the
current node) into the packm variant within bli_l3_sup_packm.c as well
as creating the prenodes and attaching them to the thrinfo_t tree in
bli_l3_sup_thrinfo_create(). These changes erase the performance
degradation introduced in the omnibus when running multithreaded sup
with optional packing enabled. Special thanks to Devin Matthews for
sussing out this fix in short order.
- Fixed the gemmlike sandbox in a manner similar to that of sup with
packing, described above. This also involved passing the prenode into
the local gemmlike packm variant. (Recall that gemmlike recycles the
use of bli_l3_sup_thrinfo_create(), so it automatically inherits that
part of the sup fix described above.)
- Updated bls_l3_packm_var[123].c to use bli_thrinfo_n_way() and
bli_thrinfo_work_id() instead of bli_thrinfo_num_threads() and
bli_thrinfo_thread_id(), respectively.

commit db10dd8e11a12d85017f84455558a82c0093b1da
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 29 19:10:31 2022 -0600

Fixed _gemm_small() prototype; disabled gemm_small.

Details:
- Fixed a mismatch between the prototype for bli_gemm_small() in
bli_gemm_front.h and the actual definition of bli_gemm_small() in
kernels/zen/3/bli_gemm_small.c. The former was erroneously declaring
the cntl_t* argument as 'const'. Thanks to Jeff Diamond for reporting
this issue.
- Commented out BLIS_ENABLE_SMALL_MATRIX, BLIS_ENABLE_SMALL_MATRIX_TRSM
macro definitions in config/zen3/bli_family_zen3.h. AMD's small matrix
implementation should probably remain disabled in vanilla BLIS, at
least for now.

commit f0337b784d164ae505ca0e11277a1155680500d1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Nov 13 21:36:47 2022 -0600

Trivial whitespace/comment tweaks.

Details:
- Trivial whitespace and comment changes, most of which ideally would
have been part of the previous commit pertaining to HPX (2b05948).

commit 2b05948ad2c9785bc53f376d53a7141cbc917447
Author: ct-clmsn <ct.clmsngmail.com>
Date: Sun Nov 13 17:40:22 2022 -0500

blis support for hpx (682)

Implement threading backend via HPX.

HPX is an asynchronous many task runtime system used in high performance computing applications. The runtime implements the ISO C++ parallelism specification and provides a user-space thread implementation.

This PR provides BLIS a thread backend implementation using HPX and resolves feature request 681. The configuration script, makefiles, and testsuite have been updated to support an HPX build option. The addition of HPX support provides other developers an exemplar for integrating other C++ threading backends into BLIS.

Co-authored-by: ctaylor <ctaylorpennywise.cm.cluster>
Co-authored-by: Devin Matthews <damatthewssmu.edu>

commit e1ea25da43508925e33d4e57e420cfc0a9de793f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 11 12:07:51 2022 -0600

Fixed subtle barrier_fpa bug in bli_thrcomm.c. (690)

Details:
- In bli_thrcomm.c, correctly initialize the BLIS_OPENMP element of the
barrier function pointer array (barrier_fpa) to NULL when
BLIS_ENABLE_OPENMP is *not* defined. Similarly, initialize the
BLIS_POSIX element of barrier_fpa to NULL when BLIS_ENABLE_PTHREADS is
not enabled. This bug was introduced in a1a5a9b and was likely the
result of an incomplete edit. The effects of the bug would have
likely manifested when querying a thrcomm_t that was initialized with
a timpl_t value corresponding to a threading implementation that was
omitted from the -t option at configure-time.
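
As a rough illustration of the pattern this fix concerns, the sketch below
builds a function pointer array indexed by a threading-implementation enum
and stores NULL in the slot of every backend that was not compiled in. All
names (toy_timpl_t, TOY_ENABLE_OPENMP, toy_barrier(), and so on) are
hypothetical stand-ins and are not the BLIS definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a threading-implementation enum. */
typedef enum
{
    TOY_SINGLE = 0,
    TOY_OPENMP = 1,
    TOY_POSIX  = 2,
    TOY_NUM_IMPLS = 3
} toy_timpl_t;

typedef int (*toy_barrier_ft)( int thread_id );

static int toy_barrier_single( int thread_id ) { (void)thread_id; return 0; }

#ifdef TOY_ENABLE_OPENMP
static int toy_barrier_openmp( int thread_id ) { (void)thread_id; return 1; }
#endif

#ifdef TOY_ENABLE_PTHREADS
static int toy_barrier_pthreads( int thread_id ) { (void)thread_id; return 2; }
#endif

/* Every slot is initialized explicitly, so a disabled backend is always
   represented by NULL rather than by whatever an incomplete edit left. */
static toy_barrier_ft toy_barrier_fpa[ TOY_NUM_IMPLS ] =
{
    [TOY_SINGLE] = toy_barrier_single,
#ifdef TOY_ENABLE_OPENMP
    [TOY_OPENMP] = toy_barrier_openmp,
#else
    [TOY_OPENMP] = NULL,
#endif
#ifdef TOY_ENABLE_PTHREADS
    [TOY_POSIX]  = toy_barrier_pthreads,
#else
    [TOY_POSIX]  = NULL,
#endif
};

/* Dispatch through the array; return -1 if the requested backend was
   not compiled in (the slot is NULL). */
int toy_barrier( toy_timpl_t ti, int thread_id )
{
    toy_barrier_ft f = toy_barrier_fpa[ ti ];

    if ( f == NULL ) return -1;

    return f( thread_id );
}
```

With neither TOY_ENABLE_* macro defined, only the TOY_SINGLE slot is
callable and the other two requests are rejected instead of jumping
through an arbitrary pointer.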

commit dc6e5f3f5770074ba38554541b8b64711a68c084
Author: leekillough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Nov 3 18:33:08 2022 -0500

Enhance emacs formatting of C files to remove trailing whitespace and ensure a newline at the end of file

commit 713d078075a4a563a43d83fd0880ab5091c2e4a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 3 20:00:11 2022 -0500

Delete mpi_test garbage. (689)

Details:
- tlrmchlsmth: "What even is this? No comments, no commit message, not
used by anything. Trash."

commit 8d813f7f12732d52c95570ae884d5defbfd19234
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 3 19:10:47 2022 -0500

Some decluttering of the top-level directory.

Details:
- Relocated 'mpi_test' directory to test/mpi_test.
- Relocated 'so_version' and 'version' files from top-level directory to
'build' directory.
- Updated build/bump-version.sh script to accommodate relocation of
'version' file to 'build' directory.
- Updated configure script to accommodate relocation of 'so_version'
file to 'build' directory.
- Updated INSTALL file to replace pointers to blis-devel mailing list
with a pointer to docs/Discord.md.
- Updated RELEASING file to contain a reminder to consider whether the
so_version file should be updated prior to the release.

commit 6774bf08c92fc6983706a91bbb93b960e8eef285
Author: Lee Killough <15950023+leekilloughusers.noreply.github.com>
Date: Thu Nov 3 15:20:47 2022 -0500

Fix typo in configure --help text. (686)

Details:
- Fixed a misspelling in the --help description for the --int-size (-i)
configure option.

commit 872898d817f35702e7678ff7f3eeff0f12e641f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 21:53:22 2022 -0500

Fixed trmm[3]/trsm performance bug in cf7d616. (685)

Details:
- Fixed a performance bug in the packing of micropanels that intersect
the diagonal of triangular matrices (i.e., those found in trmm, trmm3,
and trsm). This bug was introduced in cf7d616 and stemmed from an
ill-formed boolean conditional expression in bli_packm_blk_var1().
This conditional would choose when to use round-robin parallel work
allocation, but checked for the triangularity of the submatrix being
packed while failing also to check for whether the current micropanel
actually intersected the diagonal. The net result of this bug was that
*all* micropanels of a triangular matrix, no matter where the upanels
resided within the matrix, were assigned to threads via a round-robin
policy. This affected some microarchitectures and threading
configurations much worse than others, but it seems that overall the
effect was universally negative, likely because of the reduced spatial
locality during the packing with round-robin. Thanks to Leick Robinson
for his tireless efforts in helping track down this issue.
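
The conditional described above can be sketched as follows. This is a toy
model, not the code in bli_packm_blk_var1(): it assumes a block with k
columns whose diagonal consists of the elements (i,j) with j - i == diagoff,
and a row micropanel covering rows [r0, r1). The function names are
hypothetical.

```c
#include <stdbool.h>

/* True if the micropanel covering rows [r0, r1) and columns [0, k)
   contains at least one element (i,j) with j - i == diagoff. */
static bool toy_panel_intersects_diag( long diagoff, long r0, long r1, long k )
{
    return ( r0 + diagoff <= k - 1 ) && ( r1 - 1 + diagoff >= 0 );
}

/* Shape of the ill-formed test: triangularity alone selects round-robin. */
bool toy_use_round_robin_old( bool is_triangular )
{
    return is_triangular;
}

/* Shape of the corrected test: round-robin only for micropanels of a
   triangular matrix that actually intersect the diagonal. */
bool toy_use_round_robin( bool is_triangular, long diagoff,
                          long r0, long r1, long k )
{
    return is_triangular &&
           toy_panel_intersects_diag( diagoff, r0, r1, k );
}
```

In this model a micropanel that lies entirely off the diagonal falls back
to the non-round-robin assignment, which is the case that the old test
mishandled.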

commit edcc2f9940449f7d9cefcfc02159d27b013e7995
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 19:04:49 2022 -0500

Support --nosup, --sup configure options. (684)

Details:
- Added --nosup and --sup as alternative ways of requesting that sup be
disabled or enabled. These are analogous to --disable-sup-handling and
--enable-sup-handling, respectively. (I got tired of typing out
--disable-sup-handling and needed a shorthand notation.)
- Tweaked message output by configure when sup is enabled/disabled for
clarity and specificity.
- Whitespace changes.

commit 5eea6ad9eb25f37685d1ae4ae08c73cd1daca297
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 2 17:07:54 2022 -0500

Add mention of Wilkinson Prize to README.md. (683)

Details:
- Added blurbs and links to Wilkinson Prize to README.md.
- Added mention of both Best Paper and Wilkinson Prizes to the top of
README.md.
- Other minor tweaks.

commit 29f79f030e939969d4f3876c4fdaac7b0c5daa63
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 31 18:57:45 2022 -0500

Fixed performance bug caused by redundant packing. (680)

Details:
- Fixed a performance bug whereby multiple threads were redundantly
packing the same (rather than separate) micropanels. This bug was
caused by different parts of the code using the num_threads/thread_id
field of the thrinfo_t vs. the n_way/work_id fields. The fix was to
standardize on the latter and provide a "fake" thrinfo_t sub-prenode
in the thrinfo tree which consists of single-member thread teams. The
single team with multiple threads node is still required since it and
only it can be used to perform barriers and broadcasts (e.g. of the
packed buffer pointer).
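
To make the redundancy concrete, here is a small model (an assumption-laden
sketch, not the BLIS thrinfo_t code) in which each thread derives its share
of micropanels from an (n_way, work_id) pair. When every thread uses a
consistent pair, each panel is packed once; when a thread instead sees a
single-way partition, every thread packs every panel. The helper names are
hypothetical.

```c
/* Compute the half-open range [*start, *end) of panels assigned to worker
   work_id out of n_way workers; any remainder goes to the first workers. */
void toy_partition( int n, int n_way, int work_id, int *start, int *end )
{
    int base = n / n_way;
    int rem  = n % n_way;

    *start = work_id * base + ( work_id < rem ? work_id : rem );
    *end   = *start + base + ( work_id < rem ? 1 : 0 );
}

/* Count how many threads pack panel p. If 'consistent' is nonzero, thread t
   uses (n_way, work_id) = (n_threads, t); otherwise every thread behaves as
   if it were the only worker, which models the mismatched fields. */
int toy_times_packed( int n, int n_threads, int consistent, int p )
{
    int count = 0;

    for ( int t = 0; t < n_threads; ++t )
    {
        int s, e;

        if ( consistent ) toy_partition( n, n_threads, t, &s, &e );
        else              toy_partition( n, 1, 0, &s, &e );

        if ( p >= s && p < e ) ++count;
    }

    return count;
}
```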

commit aeb5f0cc19665456e990a7ffccdb09da2e3f504b
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 27 12:39:11 2022 -0500

Omnibus PR - Oct 2022 (678)

Details:
- This is an "omnibus" commit, consisting of multiple medium-sized
commits that affect non-trivial aspects of BLIS. The major highlights:
- Relocated the pba, sba pool (from the rntm_t), and mem_t (from the
cntl_t) to the thrinfo_t object. This allows the rntm_t to be
effectively const (although it is sometimes copied internally and
modified to reflect different ways of parallelism). Moving the mem_t
sets the stage for sharing a global control tree amongst all
threads.
- De-templatized the macrokernels for gemmt, trmm, and trsm to match
the macrokernel for gemm, which has been de-templatized since
54fa28b.
- Reimplemented bli_l3_determine_kc() by separating out the logic for
adjusting KC based on MR/NR for triangular A and/or B into a new
function, bli_l3_adjust_kc(). For now, this function is still called
from bli_l3_determine_kc(), but in the future we plan to have it
called once when constructing the control tree.
- Refactored the level-3 thread decorator into two parts:
- One part deals only with launching threads, each one calling a
generic thread entry function. This code resides in frame/thread
and constitutes the definition of bli_thread_launch(). Note that
it is specific to the threading implementation (OpenMP, pthreads,
single, etc.)
- The other part deals with passing the matrix operands and related
information into bli_thread_launch(). This is the "l3 decorator"
and now resides in frame/3. It is agnostic to the threading
implementation.
- Modified the "level" of the thread control tree passed in at each
operation. Previously, each operation (e.g. bli_gemm_blk_var1()) was
passed in a communicator representing the active thread teams which
would share the available work. Now, the *parent* thread comm is
passed in. The operation then grabs the child comm and uses it to
partition the work. The difference is in bli_trsm_blk_var1(), where
there are now two children nodes for this single operation (i.e. the
thread control tree is split one level above where the control tree
is). The sub-prenode is used for the trsm subproblem while the
normal sub-node is used for the gemm part. Importantly, the parent
comm is used for the barrier between them.
- Removed cntl_t* arguments from bli_*_front() functions. These will be
added back in the future when the control tree's creation is moved so
that it happens much sooner (provided that bli_*_front() have not been
absorbed into their respective bli_*_ex() functions).
- Renamed various bli_thread_*() query functions to bli_thrinfo_*(),
for consistency. This includes _num_threads(), _thread_id(), _n_way(),
_work_id(), _sba_pool(), _pba(), _mem(), _barrier(), _broadcast(), and
_am_chief().
- Removed extraneous barrier from _blk_var3() of gemm and trsm.
- Fixed a typo in bli_type_defs.h where BLIS_BLAS_INT_TYPE_SIZE was
misspelled.

commit c803b03e52a7a6997a8d304a8cfa9acf7c1c555b
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 26 18:20:00 2022 -0500

Add check to disable armsve on Apple M1.

commit 2dd692b710b6a9889f7ebdd7934a2108be5c5530
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 26 18:10:26 2022 -0500

Fix auto-detection of firestorm (Apple M1).

commit 88105dbecf0f9dfbfa30215743346e8bd6afb971
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 21 15:16:12 2022 -0500

Added Discord documentation (677)

Details:
- Added a docs/Discord.md markdown document that walks the reader
through creating a Discord account, obtaining the invite link, and
using the link to join the BLIS Discord server.
- Updated README.md to reference the new Discord.md document in multiple
places, including via the official Discord logo (used with explicit
permission from representatives at Discord Inc.).

commit 23f5b8df3e802a27bacd92571184ec57bbdfa646
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 17 20:21:21 2022 -0500

Shuffled checked properties in bli_l3_check.c. (676)

Details:
- Added certain checks for matrix structure to the level-3 operations'
_check() functions, and slightly reorganized existing checks.

commit 9453e0f163503f64a290256b4be53d8882224863
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 3 19:46:20 2022 -0500

CREDITS file update.

Details:
- This attribution was intended to go in PR 647.

commit 76a23bd8c33e161221891935a489df9a9fb9c8c0
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 3 15:55:07 2022 -0500

Reinstate sanity check in bli_pool_finalize. (671)

Details:
- Added a reinit argument to bli_pool_finalize(). This bool will signal
whether or not the function is being called from bli_pool_reinit(). If
it is not being called from _reinit(), we can safely check to confirm
that .top_index == 0 (i.e., all blocks have been checked in). But if
it *is* being called from _reinit(), then that check will be skipped
since one of the predicted use cases for bli_pool_reinit() anticipates
that some blocks are (probably) checked out when the pool_t is
reinitialized.
- Updated existing invocations of bli_pool_finalize() to pass in either
FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or
TRUE (from bli_pool_reinit()) for the new reinit argument.
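
A minimal sketch of the idea, assuming a simplified pool in which
top_index counts the blocks currently checked out. The toy_pool_t type and
its functions are hypothetical, and the sketch returns false where the real
sanity check would presumably report an error.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    void **blocks;     /* block pointers; slots below top_index are NULL */
    int    num_blocks;
    int    top_index;  /* next available block; 0 = all blocks checked in */
} toy_pool_t;

bool toy_pool_init( toy_pool_t *pool, int num_blocks, size_t block_size )
{
    pool->blocks = malloc( sizeof( void * ) * (size_t)num_blocks );
    if ( pool->blocks == NULL ) return false;

    for ( int i = 0; i < num_blocks; ++i )
        pool->blocks[ i ] = malloc( block_size );

    pool->num_blocks = num_blocks;
    pool->top_index  = 0;
    return true;
}

void *toy_pool_checkout( toy_pool_t *pool )
{
    if ( pool->top_index == pool->num_blocks ) return NULL;

    void *block = pool->blocks[ pool->top_index ];
    pool->blocks[ pool->top_index++ ] = NULL;
    return block;
}

/* Finalize the pool. The "all blocks checked in" check is skipped when the
   caller is a reinitialization, where outstanding blocks are anticipated. */
bool toy_pool_finalize( toy_pool_t *pool, bool reinit )
{
    if ( !reinit && pool->top_index != 0 ) return false;

    for ( int i = pool->top_index; i < pool->num_blocks; ++i )
        free( pool->blocks[ i ] );
    free( pool->blocks );

    pool->blocks     = NULL;
    pool->num_blocks = 0;
    pool->top_index  = 0;
    return true;
}
```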

commit 63470b49e3b9b15e00a8f666e86ccd70c6005fe9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 29 18:52:08 2022 -0500

Fix some bugs in bli_pool.c (670)

Details:
- Add a check for premature pool exhaustion when checking in blocks via
bli_pool_checkin_block(). This detects "double-free" and other bad
conditions that don't necessarily result in a segfault.
- Make sure to copy all block pointers when growing the pool size.
Previously, checked-out block pointers (which are guaranteed to be set
to NULL) were not being copied, leading to the presence of
uninitialized data.
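
Both fixes can be pictured with a toy pool (hypothetical names, not the
bli_pool.c implementation). Check-in refuses to proceed when no block is
outstanding, and growing the pointer array copies every slot starting at
index 0, so the NULL entries of checked-out blocks carry over too.

```c
#include <stdbool.h>
#include <stdlib.h>

typedef struct
{
    void **blocks;     /* slots below top_index are checked out (NULL) */
    int    ptrs_len;   /* capacity of the blocks array */
    int    num_blocks; /* number of slots in use */
    int    top_index;  /* number of blocks currently checked out */
} toy_pool_t;

/* Check a block back in. A top_index of 0 means nothing is checked out,
   so this check-in is a double free or some other misuse. */
bool toy_pool_checkin( toy_pool_t *pool, void *block )
{
    if ( pool->top_index == 0 ) return false;

    pool->top_index--;
    pool->blocks[ pool->top_index ] = block;
    return true;
}

/* Enlarge the pointer array, copying *all* existing slots. */
bool toy_pool_grow_ptrs( toy_pool_t *pool, int new_len )
{
    if ( new_len <= pool->ptrs_len ) return true;

    void **new_blocks = malloc( sizeof( void * ) * (size_t)new_len );
    if ( new_blocks == NULL ) return false;

    /* Start at 0, not at top_index, so NULL slots are copied as well. */
    for ( int i = 0; i < pool->num_blocks; ++i )
        new_blocks[ i ] = pool->blocks[ i ];

    /* Give the unused tail a defined value. */
    for ( int i = pool->num_blocks; i < new_len; ++i )
        new_blocks[ i ] = NULL;

    free( pool->blocks );
    pool->blocks   = new_blocks;
    pool->ptrs_len = new_len;
    return true;
}
```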

commit 42d0e66318b186d25eeb215b40ce26115401ed8b
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 29 17:38:02 2022 -0500

Add AddressSanitizer (-fsanitize=address) option. (669)

Details:
- Added support for AddressSanitizer (ASan), a compiler-integrated
memory error detector. The option (disabled by default) enables
compiling and linking with the -fsanitize=address flag supported by
clang, gcc, and probably others. This flag is employed during
compilation of all BLIS source files *except* for optimized kernels,
which are exempted because ASan usually requires an extra register,
which violates the constraints for many gemm microkernels.
- Minor whitespace, comment, ordering, and configure help text updates.

commit b861c71b50c6d48cb07282f44aa9dddffc1f1b3f
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 23 13:22:27 2022 -0500

Add consistent NaN/Inf handling in sumsqv. (668)

Details:
- Changed sumsqv implementation as follows:
- If there is a NaN (either real or imaginary), then return a sum of
NaN and unit scale.
- Else, if there is an Inf (either real or imaginary), then return a
sum of +Inf and unit scale.
- Otherwise behave as normal.
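
The three rules can be sketched for a real-valued vector as below. This is
a simplified model: it starts from scale = 0 and sumsq = 1 rather than
updating caller-supplied values, handles only doubles, and uses a
LAPACK-style scaled accumulation for the "normal" case. The function name
toy_dsumsqv() is hypothetical.

```c
#include <math.h>
#include <stdbool.h>

/* On return, scale^2 * sumsq equals the sum of squares of x[0..n-1],
   unless a NaN or Inf was found, in which case scale is 1 and sumsq is
   NaN or +Inf, respectively. */
void toy_dsumsqv( int n, const double *x, double *scale, double *sumsq )
{
    bool has_nan = false;
    bool has_inf = false;

    for ( int i = 0; i < n; ++i )
    {
        if      ( isnan( x[ i ] ) ) has_nan = true;
        else if ( isinf( x[ i ] ) ) has_inf = true;
    }

    /* NaN takes precedence over Inf. */
    if ( has_nan ) { *scale = 1.0; *sumsq = NAN;      return; }
    if ( has_inf ) { *scale = 1.0; *sumsq = INFINITY; return; }

    double s = 0.0;
    double q = 1.0;

    for ( int i = 0; i < n; ++i )
    {
        double a = x[ i ] < 0.0 ? -x[ i ] : x[ i ];

        if ( a == 0.0 ) continue;

        if ( s < a ) { q = 1.0 + q * ( s / a ) * ( s / a ); s = a; }
        else         { q += ( a / s ) * ( a / s ); }
    }

    *scale = s;
    *sumsq = q;
}
```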

commit ee81efc7887374c974a78bfb3e0865776b2f97a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 22 19:15:07 2022 -0500

Parameterized test/3 drivers via command line args. (667)

Details:
- Rewrote the drivers in test/3, the Makefile, and the runme.sh script
so that most of the important parameters, including parameter combo,
datatype, storage combo, induced method, problem size range, dimension
bindings, number of repeats, and alpha/beta values can be passed in
via command line arguments. (Previously, most of these parameters were
hard-coded into the driver source, except a few that were hard-coded
into the Makefile.) If no argument is given for any particular option,
it will be assigned a sane default. Either way, the values employed at
runtime will be printed to stdout before the performance data in a
section that is commented out with '%' characters (which is used by
matlab and octave for comments), unless the -q option is given, in
which case the driver will proceed quietly and output only performance
data. Each driver also provides extensive help via the -h option, with
the help text tailored for the operation in question (e.g. gemm, hemm,
herk, etc.). In this help text, the driver reminds the user which
implementation it was linked to (e.g. blis, openblas, vendor, eigen).
Thanks to Jeff Diamond for suggesting this CLI-based reimagining of
the test/3 drivers.
- In the test/3 drivers: converted cpp macro string constants, as well
as two string literals (for the opname and pc_str) used in each test
driver, to global (or static) const char* strings, and replaced the
use of strncpy() for storing the results of the command line argument
parsing with pointer copies from the corresponding strings in argv.
This works because the argv array is guaranteed by the C99 standard
to persist throughout the life of the program. This new approach uses
less storage and executes faster. Thanks to Minh Quan Ho for
recommending this change.
- Renamed the IMP_STR cpp macro that gets defined on the command line,
via the test/3/Makefile, to IMPL_STR.
- Updated runme.sh to set the problem size ranges for single-threaded
and multithreaded execution independently from one another, as well as
on a per-system basis.
- Added a 'quiet' variable to runme.sh that can easily toggle quiet mode
for the test drivers' output.
- Very minor typecast fix in call to bli_getopt() in bli_utils.c.
- In bli_getopt(), changed the nextchar variable from being a local
static variable to a field of the getopt_t state struct. (Not sure why
it was ever declared static to begin with.)
- Other minor changes to bli_getopt() to accommodate the rewritten test
drivers' command line parsing needs.
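
The switch from strncpy() to pointer copies noted above can be sketched as
follows. The option letters (-d, -s), the default values, and all toy_*
names are hypothetical and only illustrate the technique of storing argv
pointers directly instead of copying into fixed-size buffers.

```c
#include <string.h>

/* Parsed parameters hold pointers only; no per-string storage is needed. */
typedef struct
{
    const char *dt_str; /* datatype string */
    const char *sc_str; /* storage combo string */
} toy_params_t;

/* Defaults are const char* strings rather than cpp macro constants. */
static const char *toy_default_dt = "d";
static const char *toy_default_sc = "ccc";

void toy_parse_args( int argc, char **argv, toy_params_t *params )
{
    params->dt_str = toy_default_dt;
    params->sc_str = toy_default_sc;

    for ( int i = 1; i + 1 < argc; i += 2 )
    {
        /* argv strings persist for the life of the program, so keeping the
           pointer is sufficient; nothing is copied. */
        if      ( strcmp( argv[ i ], "-d" ) == 0 ) params->dt_str = argv[ i + 1 ];
        else if ( strcmp( argv[ i ], "-s" ) == 0 ) params->sc_str = argv[ i + 1 ];
    }
}
```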

commit 036a4f9d822df25a76a653e70be76fb02284d3d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 22 18:36:50 2022 -0500

Refactored some rntm_t management code. (666)

Details:
- Separated the "sanitizing" code from the auto-factorization code
in bli_rntm_set_ways_from_rntm() and _rntm_set_ways_from_rntm_sup().
The sanitizing code now resides in bli_rntm_sanitize() while the
factorization code resides in bli_rntm_factorize() and
bli_rntm_factorize_sup(). (There are two different functions because
the conventional and sup factorization codes are currently somewhat
different.) Also note that the factorization code now relies on the
.auto_factor field to have already been set, either during
rntm_t initialization or when the rntm_t was previously updated and
sanitized. So rather than locally determining whether to auto-
factorize, those functions just read the .auto_factor field and
proceed accordingly.
- Refactored and removed most code from bli_thread_init_rntm_from_env().
This function now reads the environment variables needed to set nt,
jc, pc, ic, jr, and ir; sets them into the global rntm_t; and then
calls bli_rntm_sanitize() in order to make sure that the contents are
in a "good" state. Thanks to Devin Matthews for suggesting this
refactoring.
- Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() such that
if multithreading is disabled at compile time (that is, if the cpp
macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the
caller's request and instead clear the nt and ways fields.
- Redefined bli_thread_set_num_threads() and bli_thread_set_ways() such
that if multithreading is disabled at compile time (that is, if the
cpp macro BLIS_ENABLE_MULTITHREADING is undefined), they ignore the
caller's request and do nothing.
- Redefined bli_rntm_set_num_threads() and bli_rntm_set_ways() as true
functions rather than static inline functions.
- In bli_rntm.c, statically initialize the global_rntm global variable
via the BLIS_RNTM_INITIALIZER macro.
- In bli_rntm.h, defined bli_rntm_clear_auto_factor(), which sets the
.auto_factor field of the rntm_t to FALSE.
- Reorganized order of some inline function definitions in bli_rntm.h.
- Changed the default value given to the .auto_factor field by the
BLIS_RNTM_INITIALIZER macro from TRUE to FALSE.
- Call bli_rntm_clear_auto_factor() instead of
bli_rntm_set_auto_factor_only() in bli_rntm_init().
- Comment/whitespace updates.

commit a1a5a9b4cbef9208da494c45a2f933a8e82559ac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 21 18:31:01 2022 -0500

Implemented support for fat multithreading. (665)

Details:
- Allow the user to configure BLIS in such a way that multiple threading
implementations get compiled into the library, with one of those
implementations chosen at runtime. For now, there are only three
implementations available: OpenMP, pthreads, and single. (Here,
'single' merely refers to single-threaded mode.) The configure script
now allows the user to give the -t option with a comma-separated list
of values, such as '-t openmp,pthreads'. The first value in the list
will always be the default at library initialization time, and
'single' is always silently appended to the end of the list. The user
can specify which implementation should execute in one of three ways:
by setting the BLIS_THREAD_IMPL environment variable prior to launch;
by calling the bli_thread_set_thread_impl() global runtime API; or by
encoding their choice into a rntm_t that is passed into one of the
expert interfaces. Any of these three choices overrides the
initialization-time default (i.e., the first value listed to the -t
configure option). Requesting an implementation that was not compiled
into the library will result in an error message followed by
bli_abort().
- Relocated the 'auto' logic for the -t option from the top-level
Makefile to the configure script. (Currently, this logic is pretty
dumb, choosing 'openmp' for gcc and icc, and 'pthreads' for clang.)
- Defined a new 'timpl_t' enum in bli_type_defs.h, with three valid
values: BLIS_SINGLE, BLIS_OPENMP, BLIS_POSIX.
- Reorganized the thrcomm_t struct into a single definition with two
preprocessor blocks, one each for additional fields needed by OpenMP
and pthreads.
- Added timpl_t argument to bli_thrcomm_bcast(), bli_thrcomm_barrier(),
bli_thrcomm_init(), and bli_thrcomm_cleanup(), which these functions
need since they are now wrappers that choose the implementation-
specific function corresponding to the currently enabled threading
implementation.
- Added rntm_t* to bli_thread_broadcast(), bli_thread_barrier() so that
those functions can pass the timpl_t value into bli_thrcomm_bcast()
and bli_thrcomm_barrier(), respectively.
- Defined bli_env_get_str() in bli_env.c to allow the querying of
BLIS_THREAD_IMPL (which, unlike BLIS_NUM_THREADS and friends, is
expected to be a string).
- Defined bli_thread_get_thread_impl(), bli_thread_set_thread_impl() to
get and set the current threading implementation at runtime.
- Defined bli_rntm_thread_impl() and bli_rntm_set_thread_impl() to query
and set the threading implementation within a rntm_t. Also choose
BLIS_SINGLE as the default value when initializing rntm_t structs.
- Added bli_info_get_*() functions to query whether OpenMP or pthreads
would be chosen as the default at init-time. Note that this only
tests whether OpenMP or pthreads is the first implementation in the
list passed to the threading configure option (-t) and is *not* the
same as querying which implementation is currently selected, since
that can be influenced by BLIS_THREAD_IMPL and/or
bli_thread_set_thread_impl().
- Changed l3int_t to l3int_ft.
- Updated docs/Multithreading.md to document the new behavior.
- Updated sandbox/gemmlike and addon/gemmd to work with the new fat
threading feature. This included a few bugfixes to bring the codes up
to date, as necessary.
- Comment, whitespace updates.
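
The runtime selection logic might be modeled roughly as below. The accepted
strings ("single", "openmp", "pthreads") mirror the configure values quoted
above, but whether the environment variable accepts exactly these spellings
is an assumption, as are all toy_* names. The sketch returns false where
the library is described as printing an error and aborting.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
    TOY_SINGLE = 0,
    TOY_OPENMP = 1,
    TOY_POSIX  = 2,
    TOY_INVALID = 3
} toy_timpl_t;

/* Map a string (e.g. the value of an environment variable) to an enum. */
toy_timpl_t toy_parse_impl( const char *str )
{
    if ( str == NULL )                    return TOY_INVALID;
    if ( strcmp( str, "single" ) == 0 )   return TOY_SINGLE;
    if ( strcmp( str, "openmp" ) == 0 )   return TOY_OPENMP;
    if ( strcmp( str, "pthreads" ) == 0 ) return TOY_POSIX;
    return TOY_INVALID;
}

/* Pick the implementation to use. env_value overrides the configure-time
   default when it is non-NULL. compiled_mask has bit (1u << impl) set for
   each implementation that was built into the library. */
bool toy_select_impl( const char *env_value, toy_timpl_t dflt,
                      unsigned compiled_mask, toy_timpl_t *chosen )
{
    toy_timpl_t ti = ( env_value != NULL ) ? toy_parse_impl( env_value )
                                           : dflt;

    if ( ti == TOY_INVALID )                  return false;
    if ( !( compiled_mask & ( 1u << ti ) ) )  return false;

    *chosen = ti;
    return true;
}
```

For a library configured with the equivalent of '-t openmp' (so that
'single' is appended implicitly), compiled_mask would contain the
TOY_OPENMP and TOY_SINGLE bits, and a request for "pthreads" would be
rejected.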

commit 89df7b8fa3a3e47ab2fc10ac4d65d0b9fde16942
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Sep 18 18:46:57 2022 -0500

De-templatized _sup_var1n2m.c; unified _sup_packm_a/b(). (659)

Details:
- Re-expressed the two variants in frame/3/bli_l3_sup_var1n2m.c as a
single function each that performs char* pointer arithmetic rather
than four datatype-specific functions. Did the same for the functions
in bli_l3_sup_packm_a.c and _sup_packm_b.c, and then unified the two
into a single set of functions for packing either A or B, which now
resides in bli_l3_sup_packm.c.
- Pre-grow the cntl_t tree in both bli_l3_sup_var1n2m.c variants rather
than grow them incrementally.
- Relocated empty-matrix and scale-by-beta early return handling from
bli_gemm_front() and bli_gemmt_front() to their _ex() counterparts.
- Comment, whitespace updates.
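
The de-templatization technique (one function using char* pointer
arithmetic in place of one function per datatype) can be illustrated with
a small packing routine. This is a generic sketch under the assumption
that only the element size differs between datatypes; toy_pack() is a
hypothetical name and not one of the BLIS sup functions.

```c
#include <stddef.h>
#include <string.h>

/* Copy an m x n matrix with row stride rs_a and column stride cs_a
   (both in units of elements) into a contiguous column-major buffer p.
   The element type is described only by its size in bytes. */
void toy_pack( size_t elem_size, int m, int n,
               const void *a, ptrdiff_t rs_a, ptrdiff_t cs_a,
               void *p )
{
    const char *a_bytes = a;
    char       *p_bytes = p;

    for ( int j = 0; j < n; ++j )
    {
        for ( int i = 0; i < m; ++i )
        {
            /* Element offset scaled to a byte offset. */
            const char *src = a_bytes
                            + ( i * rs_a + j * cs_a ) * (ptrdiff_t)elem_size;

            memcpy( p_bytes, src, elem_size );
            p_bytes += elem_size;
        }
    }
}
```

The same function then serves float, double, and complex operands by
passing sizeof( float ), sizeof( double ), and so on as elem_size.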

commit fb91337eff1ee2098f315a83888f6667b3a56f86
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 19:08:10 2022 -0500

Fixed a harmless pc_nt bug in 05a811e.

Details:
- Added missing curly braces around some statements in bli_rntm.c, one
of which needed them in order for the relevant code to be executed in
the intended way. The consequence of 05a811e omitting those braces was
that a statement (pc_nt = 1;) was executed more often than it needed
to be.
- Also adjusted the analogous code in bli_thread.c to match that of
bli_rntm.c.
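
The class of bug is easy to reproduce in isolation. In the hypothetical
pair of functions below (not the code from bli_rntm.c), the version
without braces performs the assignment on every call, while the braced
version performs it only when the condition holds. Both return the same
value, which is why the omission is harmless here.

```c
/* Without braces, only the first statement belongs to the if. */
int toy_clamp_unbraced( int pc_nt, int *num_warnings, int *num_assignments )
{
    if ( pc_nt != 1 )
        ++*num_warnings;

    /* Not guarded by the if above: these run on every call. */
    pc_nt = 1;
    ++*num_assignments;

    return pc_nt;
}

/* With braces, the assignment runs only when it is needed. */
int toy_clamp_braced( int pc_nt, int *num_warnings, int *num_assignments )
{
    if ( pc_nt != 1 )
    {
        ++*num_warnings;
        pc_nt = 1;
        ++*num_assignments;
    }

    return pc_nt;
}
```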

commit e86076bf4461d1a78186fb21ba8320cfb430f62c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 14:22:59 2022 -0500

Test the 'gemmlike' sandbox via AppVeyor. (664)

Details:
- Added a fifth test to our .appveyor.yml that enables the 'gemmlike'
sandbox with OpenMP enabled (via clang, the 'auto' configuration
target, and building to a static library). Thanks to Jeff Diamond
for pointing out that this test would be useful.

commit 63177dca48cb7d066576d884da4a7a599ececebf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 15 11:21:26 2022 -0500

Fixed gemmlike sandbox bug introduced in 7c07b47.

Details:
- Fixed a bug in the 'gemmlike' sandbox that was introduced in 7c07b47.
This bug was the result of the fact that the gemmlike implementation
uses bli_thrinfo_sup_grow() to grow its thrinfo_t tree, but the
aforementioned commit added an optimization that kicks in when the
rntm_t .pack_a and .pack_b fields are both FALSE. Those fields were
originally added only for sup execution; for the large code path, they
are intended to be ignored. But the default initial state of a rntm_t
has those fields set to FALSE, which was inadvertently activating the
optimization (which targeted single-threaded cases only) and would
cause multithreaded use cases of 'gemmlike' to segfault. The fix took
the form of setting the .pack_a and .pack_b fields to TRUE in
bls_gemm_ex().
- Added minimal 'const' and 'const'-casting to 'gemmlike' so that gcc
stays quiet.

commit 05a811e898b371a76581abd4afa416980cce7db9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 19:24:05 2022 -0500

Initialize rntm_t nt/ways fields with 1 (not -1). (663)

Details:
- Changed the way that rntm_t structs are initialized, mainly so that
the global rntm_t that is set via environment variables at runtime
may be queried by the application prior to any computation taking
place. (Strictly speaking, the application may already query these
fields, but they do not always contain valid values and often contain
-1 when they are unset.) These changes also served to clarify how
these parameters are treated, and homogenized the implementations of
bli_rntm_set_ways_from_rntm(), bli_rntm_set_ways_from_rntm_sup(), and
bli_thread_init_rntm_from_env(). Special thanks to Jeff Diamond,
Leick Robinson, and Devin Matthews for pointing out that the previous
behavior was needlessly confusing and could be improved.
- The aforementioned modifications also included subtle changes as to
what counts as "setting" a loop's ways of parallelism for the purposes
of deciding whether to use the ways or the total number of threads.
Previously, setting any loop's ways, even to 1, counted in favor of
using the ways. Now, only values greater than 1 will count as
"setting", and all other values will silently be mapped to 1, with
those parameters treated as if they were untouched all along.
- Updated bli_rntm.h and bli_thread.c so that any attempt to set the
PC_NT variable (or pc_nt field of a rntm_t) will either ignore the
request or reassert the value as 1.
- Updated bli_rntm_set_ways() so that rather than clear the
num_threads field, it is set to the product of all of the per-loop
ways of parallelism.
- Removed code from test_libblis.c that handled the possibility of unset
environment variables when printing out their values.
- Removed bli_rntm_equals() inline function from bli_rntm.h, which has
long been disabled.
- Updates to docs/Multithreading.md related to the aforementioned
changes.
- Comment updates.
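
One possible reading of the rules above, expressed as a sketch: ways that
are not greater than 1 are mapped to 1, pc is always reasserted as 1, and
when any way was "set" the thread count becomes the product of the ways.
The toy_rntm_t struct and toy_rntm_sanitize() are hypothetical, and the
handling of the thread count when no way is set is an assumption.

```c
#include <stdbool.h>

typedef struct
{
    int nt;                 /* total number of threads */
    int jc, pc, ic, jr, ir; /* per-loop ways of parallelism */
} toy_rntm_t;

/* Values that are not greater than 1 are treated as untouched. */
static int toy_clamp_way( int w ) { return w > 1 ? w : 1; }

/* Returns true if any loop's ways counted as "set". */
bool toy_rntm_sanitize( toy_rntm_t *r )
{
    r->jc = toy_clamp_way( r->jc );
    r->pc = 1; /* requests to change pc are overridden */
    r->ic = toy_clamp_way( r->ic );
    r->jr = toy_clamp_way( r->jr );
    r->ir = toy_clamp_way( r->ir );

    bool ways_set = ( r->jc > 1 ) || ( r->ic > 1 ) ||
                    ( r->jr > 1 ) || ( r->ir > 1 );

    if ( ways_set ) r->nt = r->jc * r->pc * r->ic * r->jr * r->ir;
    else            r->nt = toy_clamp_way( r->nt );

    return ways_set;
}
```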

commit fd885cf98f4fe1d3bc46468e567776c37c670fcc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 11:50:23 2022 -0500

Use kernel CFLAGS for 'kernels' subdirs in addons. (658)

Details:
- Updated Makefile and common.mk so that the targeted configuration's
kernel CFLAGS are applied to source files that are found in a
'kernels' subdirectory within an enabled addon. For now, this
behavior only applies when the 'kernels' directory is at the top
level of the addon directory structure. For example, if there is an
addon named 'foobar', the source code must be located in
addon/foobar/kernels/ in order for it to be compiled with the target
configuration's kernel CFLAGS. Any other source code within
addon/foobar/ will be compiled with general-purpose CFLAGS (the same
ones that were used on all addon code prior to this commit). Thanks
to AMD (esp. Mithun Mohan) for suggesting this change and catching an
intermediate bug in the PR.
- Comment/whitespace updates.

commit cb74202db39dc8cb81fdd06f8a445f8837e27853
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 13 11:46:24 2022 -0500

Fixed incorrect sizeof(type) in edge case macros. (662)

Details:
- In bli_edge_case_macro_defs.h, the GEMM_UKR_SETUP_CT_PRE() and
GEMMTRSM_UKR_SETUP_CT_PRE() macros previously declared their temporary
ct microtiles as:

PASTEMAC(ch,ctype)
_ct[ BLIS_STACK_BUF_MAX_SIZE / sizeof( PASTEMAC(ch,type) ) ] \
__attribute__((aligned(alignment))); \

The problem here is that sizeof( PASTEMAC(ch,type) ) evaluates to
things like sizeof( BLIS_DOUBLE ), not sizeof( double ), and since
BLIS_DOUBLE is an enum, it is typically an int, which means the
sizeof() expression is evaluating to the wrong value. This was likely
a benign bug, though, since BLIS does not support any computational
datatypes that are smaller than sizeof( int ), which means the ct
array would be *over*-allocated rather than underallocated. Thanks
to moon-chilled for identifying and reporting this bug in 624.
- CREDITS file update.
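
The effect of taking sizeof() of an enumeration constant instead of a type
can be reproduced with a few lines. The names below (toy_num_t,
TOY_STACK_BUF_MAX_SIZE) are hypothetical analogues of the BLIS ones. In C,
an enumeration constant has type int, so the first function divides by
sizeof( int ).

```c
#include <stddef.h>

typedef enum { TOY_FLOAT, TOY_DOUBLE } toy_num_t;

#define TOY_STACK_BUF_MAX_SIZE 1024

/* Mistaken form: TOY_DOUBLE is an enumeration constant, not the type
   double, so this is TOY_STACK_BUF_MAX_SIZE / sizeof( int ). */
size_t toy_ct_elems_buggy( void )
{
    return TOY_STACK_BUF_MAX_SIZE / sizeof( TOY_DOUBLE );
}

/* Intended form: divide by the size of the actual element type. */
size_t toy_ct_elems_fixed( void )
{
    return TOY_STACK_BUF_MAX_SIZE / sizeof( double );
}
```

On platforms where sizeof( int ) is smaller than sizeof( double ), the
mistaken form yields more elements than intended, which matches the
over-allocation described above.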

commit 6e5431e8494b06bd80efcab3abf0a6456d6c0381
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Sep 10 15:16:58 2022 -0500

Fix line number issue in flattened blis.h. (660)

Details:
- Updated the top-level Makefile so that it invokes flatten-headers.py
without the -c option, which was requesting that comments be stripped
(since comment stripping is disabled by default).
- Updated flatten-headers.py to accept a new option (-l) to enable
insertion of line directives into the output file. This new option
is enabled by default.
- Also added logic to flatten-headers.py that outputs a warning if both
comment stripping and line numbers are requested since the comment
stripping will cause the line numbers to become inaccurate.
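
For readers unfamiliar with line directives, the sketch below shows what
one does in C. The file name and line number are made up for illustration.
After the directive, the compiler reports positions as if the following
source line were line 100 of toy_original.h, which is how a flattened
header can point diagnostics back at the original files.

```c
#include <string.h>

/* Pretend the lines below were inlined from toy_original.h, line 100. */
#line 100 "toy_original.h"
int toy_reported_line( void ) { return __LINE__; }

/* Returns nonzero if the compiler now reports 'expected' as the file. */
int toy_reported_file_is( const char *expected )
{
    return strcmp( __FILE__, expected ) == 0;
}
```

If lines are removed from the inlined text after the directives were
computed (for example, by stripping comments), the reported numbers drift
away from the original file, which is the inaccuracy the warning guards
against.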

commit 4afe0cfdab0e069e027f97920ea604249e34df47
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 8 18:33:20 2022 -0500

Defined invscalv, invscalm, invscald operations. (661)

Details:
- Defined invert-scale (invscal) operation on vectors (level-1v),
matrices (level-1m), and diagonals (level-1d).
- Added test modules for invscalv and invscalm to the testsuite.
- Updated BLISObjectAPI.md and BLISTypedAPI.md API documentation to
reflect the new operations. Also updated KernelsHowTo.md accordingly.
- Renamed 'beta' to 'alpha' in scalv and scalm testsuite modules (and
input.operations files) so that the parameter name matches the
parameter used in the documentation.
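
As a reference point, an invert-scale on a real vector amounts to dividing
each element by alpha. The function below is a simplified sketch with a
hypothetical name and signature; it omits the conjugation parameter and
the complex datatypes, and it divides directly rather than multiplying by
a precomputed reciprocal.

```c
/* x := x / alpha for a real double-precision vector with stride incx. */
void toy_dinvscalv( int n, double alpha, double *x, int incx )
{
    for ( int i = 0; i < n; ++i )
        x[ i * incx ] /= alpha;
}
```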

commit a87eae2b11408b556e562f1b04e673c6cd1612bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 6 18:04:09 2022 -0500

Added '-q' quiet mode option to testsuite. (657)

Details:
- Added support for a '-q' command line option to the testsuite. This
option suppresses most informational output that would normally
clutter up the screen. By default, verbose mode (the previous
status quo) will be operative, and so quiet mode must be requested.

commit dfa54139664a42d29774e140ec9e5597af869a76
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Aug 30 08:07:50 2022 +0800

Arm64 dgemmsup with extended MR&NR (655)

Details:
- Since the number of registers in NEON is large but their lengths are
short, I'm here extending both MR and NR.
- The approach is to represent the C microtile in registers optionally
in columns, so for sizes like 6x7m, the 'crr' kernel is the default
with 'rrr' supported through an in-register transpose.
- A few asm kernels are crafted for 'rv' to complete this extended size
support.
- For 'rd' I'm still relying heavily on C99 intrinsic kernels with
branching so the performance might not be optimal. (Sorry for that.)
- So far, these changes only affect the 'firestorm' subconfig.
- This commit also contains row-preferential s12x8 and d6x8 gemm
ukernels. These microkernels are templatized versions of the existing
s8x12 and d6x8 ukernels defined in bli_gemm_armv8a_asm_d6x8.c.

commit 9e5594ad5fc41df8ef2825a025d7844ac2275c27
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 11 14:36:38 2022 -0500

Temporarily disabled line directives from 6826c1c.

Details:
- Commented out the inclusion of line preprocessor directives in the
flattened header output provided by build/flatten-headers.py. This
output was added recently in 6826c1c, but was later found to have
thrown off the line numbering referenced by compiler warnings and
errors (possibly due to license comment blocks, which are stripped
from source headers as they are inlined into the monolithic header).

commit 775148bcdbb1014b4881a76306f35f5d0fedecbe
Author: jdiamondGitHub <jeff_diamondfastmail.com>
Date: Fri Aug 5 12:01:24 2022 -0500

Updated ARMv8a kernels to fix 2 prefetching issues. (649)

Details:
- The ARMv8a dgemm/sgemm microkernels had 2 prefetching issues that
impacted performance on modern ARM platforms. The most significant
issue was that only a single prefetch per C tile column was issued.
When a column of C was not cache aligned, the second cache line would
not be prefetched at all, forcing the kernel to wait for an entire
load to update elements of C. This happened with roughly 50% of the
C prefetches. The fix was to have two prefetches per column, spaced
64 bytes (1 cache line) apart.
- A secondary performance issue was that all the C prefetch instructions
were issued sequentially at the beginning of the kernel call. This
caused a noticeable performance slowdown. Interleaving the prefetch
calls every 2-3 instructions in the prologue code solved the issue.
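
The first fix (two prefetches per column of C, 64 bytes apart) can be sketched in C. This is only an illustration: the real kernels are written in assembly, the function name is hypothetical, and __builtin_prefetch is a GCC/Clang builtin assumed to be available. The interleaving described in the second item is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* cache line size quoted in the commit text */

/* Hypothetical sketch: prefetch n columns of a column-major C tile whose
   leading dimension is ldc (in elements). Two prefetches are issued per
   column so that a column straddling a cache-line boundary is covered.
   Returns the number of prefetches issued. */
static int prefetch_c_tile_sketch(const double *c, size_t n, size_t ldc)
{
    int issued = 0;
    for (size_t j = 0; j < n; ++j) {
        uintptr_t col = (uintptr_t)(c + j * ldc);
        __builtin_prefetch((const void *)col, 1, 3);
        __builtin_prefetch((const void *)(col + CACHE_LINE_BYTES), 1, 3);
        issued += 2;
    }
    return issued;
}
```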

commit bbaf29abd942de47a3a99a80a67d12bab41b27db
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 4 17:51:37 2022 -0500

Very minor variable updates to common.mk.

Details:
- Fixed a harmless bug that would have allowed C++ headers into the list
of header suffixes specifically reserved for C99 headers. In practice,
this would have had no substantive effect on anything since the core
BLIS framework does not use C++ headers.

commit a48e29d799091a833213efeafaf2d342ebdafde9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 28 10:11:07 2022 -0500

CREDITS file update.

Details:
- Thanks to Kihiro Bando for assisting with issue 644.

commit 5b298935de7f20462bfad1893ed34ecd691cec5a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 19:14:15 2022 -0500

Removed buggy cruft from power10 subconfig.

Details:
- Removed defines for BLIS_BBN_s and BLIS_BBN_d from
bli_kernel_defs_power10.h. These were inadvertently set in ae10d949
because the power10 subconfig was registering bb packm ukernels, but
only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because
the original author (probably) copy-pasted from power9 when getting
started. That 6xk packm registration was effectively "dead code"
prior to ae10d949, but was then mistaken as not-dead code during the
ae10d949 refactor. These improper bb factors may have been causing
bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind
me what the power10 subconfig was supposed to look like.
- Removed extraneous microkernel preference registrations from power10
subconfig. Preferences for single and double complex gemm were being
registered despite there being no complex gemm ukernels registered to
go with them. Similarly, there were trsm preferences registered
without any trsm ukernels registered (and BLIS doesn't actually use a
preference for the trsm ukernel anyway). These extraneous
registrations were almost surely not hurting anything, even if they
were quite misleading.

commit 56de31b00fa0f1ba866321817cd1e5d83000ff11
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 27 13:54:17 2022 -0500

Disable modification of KC in the gemmsup kernels. (648)

Modifying KC led to a ~50% performance reduction for certain gemm operations (but not others?). See 644 for an example.

commit 4dde947e2ec9e139c162801320c94e6a01a39708
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 26 17:29:32 2022 -0500

Fixed out-of-bounds bug in sup s6x16m haswell kernel.

Details:
- Fixed another out-of-bounds read access bug in the haswell sup
assembly kernels. This bug is similar to the one fixed in 17b0caa
and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh
Kannan for reporting this bug (and a suitable fix) in 635.
- CREDITS file update.

commit 6826c1cdfba855513786d9e3d606681316453398
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 25 18:21:05 2022 -0500

Add `line` directives to flattened `blis.h`. (643)

Details:
- Modified flatten-headers.py so that line directives are inserted into
the flattened blis.h file. This facilitates easier debugging when
something is amiss in the flattened blis.h because the compiler will
be able to refer to the line number within the original constituent
header file (which is where the fix would go) rather than the line
number within the flattened header (which is not as helpful).
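
The effect of a line directive can be seen in a small C fragment. The file name and line number below are made up for illustration; the actual output of flatten-headers.py is not reproduced here.

```c
/* Hypothetical fragment of a flattened header. The directive tells the
   compiler to treat the next source line as line 42 of the named file. */
#line 42 "frame/hypothetical_header.h"
static int reported_line(void) { return __LINE__; }          /* seen as line 42 */
static const char *reported_file(void) { return __FILE__; }  /* seen as the named file */
```

Any warning or error the compiler emits for these lines would then point at frame/hypothetical_header.h rather than at the flattened file.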

commit af3a41e02534befdae026377592ce437bab83023
Author: Alexander Grund <Flamefireusers.noreply.github.com>
Date: Thu Jul 21 18:05:48 2022 +0200

Add autodetection for POWER7, POWER9 & POWER10 (647)

Read from `/proc/cpuinfo` as done for ARM.
Fixes 501

commit 17b0caa2b2bff439feb6d2b39cfa16e7591882b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 14 17:55:34 2022 -0500

Fixed out-of-bounds read in haswell gemmsup kernels.

Details:
- Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2()
kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four
single-precision elements of C, via instructions such as:

vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4)

in situations where only two elements are guaranteed to exist. (These
bugs may not have manifested in earlier tests due to the leading
dimension alignment that BLIS employs by default.) The issue was fixed
by replacing lines like the one above with:

vmovsd(mem(rcx), xmm0)
vfmadd231ps(xmm0, xmm3, xmm4)

Thus, we use vmovsd to explicitly load only two elements of C into
registers, and then operate on those values using register addressing.
Thanks to Daniël de Kok for reporting these bugs in 635, and to
Bhaskar Nallani for proposing the fix.
- CREDITS file update.
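
The idea behind the fix above (load only the two elements known to exist, then operate on registers) can be modeled in portable C. The vec4f type and both function names are hypothetical stand-ins for a 4-lane register and the vmovsd/vfmadd231ps pair; this is a sketch of the access pattern, not the kernel code.

```c
#include <string.h>

/* Hypothetical stand-in for a 4-lane single-precision register. */
typedef struct { float lane[4]; } vec4f;

/* Reads exactly two floats from c into the low lanes and zeroes the
   upper lanes, so memory past c[1] is never touched. */
static vec4f load2_sketch(const float *c)
{
    vec4f v = { {0.0f, 0.0f, 0.0f, 0.0f} };
    memcpy(v.lane, c, 2 * sizeof(float));
    return v;
}

/* Lane-wise acc += a * b using register operands only. */
static vec4f fmadd_sketch(vec4f a, vec4f b, vec4f acc)
{
    for (int i = 0; i < 4; ++i)
        acc.lane[i] += a.lane[i] * b.lane[i];
    return acc;
}
```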

commit cc260fd7068f0fe449d818435aa11adb14c17fed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 13 16:16:01 2022 -0500

Allow uniform max problem sizes in test/3/runme.sh.

Details:
- Tweaked test/3/runme.sh so that the test driver binaries for single-
threaded (st), single-socket (1s), and dual-socket (2s) execution can
be built using identical problem size ranges. Previously, this was not
possible because runme.sh used the maximum problem size, which was
embedded into the binary filename, to tell the three classes of
binaries apart from one another. Now, runme.sh uses the binary suffix
("st", "1s", or "2s") to tell them apart. This required only a few
changes to the logic, but it also required a change in format to the
threading config strings themselves (replacing the max problem size
with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this
improvement.
- Comment updates.

commit 9b1beec60be31c6ea20b85806d61551497b699e4
Author: bartoldeman <bartoldemanusers.noreply.github.com>
Date: Mon Jul 11 20:15:12 2022 -0400

Use BLIS_ENABLE_COMPLEX_RETURN_INTEL in blastest files (636)

Details:
- Fixed a crash that occurs when either cblat1 or zblat1 are linked
with a build of BLIS that was compiled with '--complex-return=intel'.
This fix involved inserting preprocessor macro guards based on
BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and
blastest/src/zblat1.c to correctly handle situations where BLIS is
compiled with Intel/f2c-style calling conventions for complex numbers.
- Updated blastest/src/fortran/run-f2c.sh so that future executions
will insert the aforementioned cpp macro conditional where
appropriate.

commit 98d467891b74021ace7f248cb0856bec734e39b6
Author: bartoldeman <bartoldemanusers.noreply.github.com>
Date: Mon Jul 11 19:40:53 2022 -0400

Change complex_return='intel' for ifx. (637)

Details:
- When checking the version string of the Fortran compiler for the
purposes of determining a default return convention for complex
domain values, grep for "IFORT" instead of "ifort" since that string
is common to both the 'ifx' and 'ifort' binaries provided by Intel:

$ ifx --version
ifx (IFORT) 2022.1.0 20220316
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

$ ifort --version
ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.
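
The same check can be expressed as a case-sensitive substring test. The sketch below is in C purely for illustration (the configure script itself uses grep), and the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the compiler's version output
   contains "IFORT", which appears for both ifx and ifort above. */
static int is_intel_fortran_sketch(const char *version_output)
{
    return strstr(version_output, "IFORT") != NULL;
}
```

Matching on lowercase "ifort" instead would miss the ifx banner, since that string contains only the uppercase form.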

commit ffde54cc5c334aca8eff4d6072ba49496bf3104c
Author: jdiamondGitHub <jeff_diamondfastmail.com>
Date: Mon Jul 11 16:47:30 2022 -0500

Minor changes to .gitignore and LICENSE files. (642)

Details:
- Macs create .DS_Store files in every directory visited. Updated
.gitignore file so these files won't be reported as untracked by
'git status'.
- Added Oracle Corporation to the LICENSE file.
- Updated UT copyright on behalf of SHPC.

commit 7cba7ce3dd1533fcc4ca96ac902bdf218686139a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 8 11:15:18 2022 -0500

Minor cleanups, comment updates to bli_gks.c.

Details:
- Removed a redundant registration of 'a64fx' subconfig in
bli_gks_init().
- Reordered registration of 'armsve', 'a64fx', and 'firestorm'
subconfigs. Thanks to Jeff Diamond for his input on this reordering.
- Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h.

commit 667f201b7871da68622027d02bd6b7da3262f8e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 7 16:44:21 2022 -0500

Fixed type bug in bli_cntx_set_ukr_prefs().

Details:
- Fixed a bug in bli_cntx_set_ukr_prefs() which erroneously typecast the
num_t value read from va_arg() down to a bool before being stored
within the cntx_t. This bug was introduced on April 6th 2022, in
ae10d94. This caused the ukernel preferences for double real and
double complex to go unchanged while the preferences for single real
and single complex were corrupted by the former datatypes'
preference values. The bug manifested as degraded performance for
subconfigurations that registered column-preferential ukernels. The
reason is that the erroneous preferences trigger unnecessary
transpositions in the operation, which forces the gemm ukernel to
compute on matrices that are not stored according to its preference.
Thanks to Devin Matthews, Jeff Diamond, and Leick Robinson for their
extensive efforts and assistance in tracking down this issue.
- Augmented the informational header that is output by the testsuite to
include ukernel preferences for gemm, gemmtrsm_[lu], and trsm_[lu].
- CREDITS file update.
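
How narrowing an enum value to bool can misroute a store is sketched below. The enum values, the array layout, and the function names are illustrative assumptions, not the BLIS definitions, and the actual code path in bli_cntx_set_ukr_prefs() may differ.

```c
#include <stdbool.h>

/* Illustrative datatype ids (assumed values, not taken from BLIS headers). */
enum { DT_S = 0, DT_C = 1, DT_D = 2, DT_Z = 3 };

/* Buggy variant: the datatype id is narrowed to bool, so every nonzero
   id collapses to 1 and the preference lands in the wrong slot. */
static void set_pref_buggy(bool prefs[4], int dt, bool pref)
{
    bool narrowed = (bool)dt;
    prefs[narrowed] = pref;
}

/* Fixed variant: the full datatype id is used as the index. */
static void set_pref_fixed(bool prefs[4], int dt, bool pref)
{
    prefs[dt] = pref;
}
```

With the buggy variant, setting the DT_D preference leaves slot 2 untouched and overwrites slot 1 instead.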

commit d429b6bfced21a63bf711224ac402f93f0080b52
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Jun 28 15:34:10 2022 -0500

Support clang targeting MinGW (639)

* Support clang targeting MinGW

* Fix pthread linking

commit d93df023348144e091f7b3e3053995648f348aa7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 15 14:09:49 2022 -0500

Removed unused dt arg in bli_gks_query_ind_cntx().

Details:
- Removed the num_t datatype argument from bli_gks_query_ind_cntx().
This argument stopped being needed by the function in commit e9da642.
Its only use in bli_gks_query_ind_cntx() was to be passed through to
the context initialization function for the chosen induced method,
but even then, commit log notes from e9da642 indicate that I could not
recall why the datatype argument was ever needed by the context init
function to begin with.
- Updated all invocations of bli_gks_query_ind_cntx() to omit the dt
argument. Most of these invocations resided in various standalone test
drivers (and the testsuite).

commit 56772892450cc92b3fbd6a9d0460153a43fc47ab
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 1 10:49:33 2022 -0500

Added SMU citation to README.md intro.

Details:
- Added a citation to SMU and the Matthews Research Group to the general
attribution of maintainership and development in the Introduction of
the README.md file. Thanks to Robert van de Geijn and Devin Matthews
for suggesting this change.

commit 4603324eb090dfceaad3693a70b2d60544036aa8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 19 14:07:03 2022 -0500

Init/finalize via bli_pthread_switch_t API (634).

Details:
- Defined and implemented a new pthread-like abstract datatype and API
in bli_pthread.c. The new type, bli_pthread_switch_t, is similar to
bli_pthread_once_t in some respects. The idea is that like a switch in
your home that controls a light or ceiling fan, it can either be on or
off. The switch starts in the off state. Moving from one state to the
other (on to off; off to on) causes some action (i.e., a startup or
shutdown function) to be executed. Trying to move from one state to
the same state (on to on; off to off) is safe in that it results in
no action. Unlike bli_pthread_once(), the API for bli_pthread_switch_t
contains both _on() and _off() interfaces. Also, unlike the _once()
function, the _on() and _off() functions return error codes so that
the 'int' error code returned from the startup or shutdown functions
may be passed back to the caller. Thanks to Devin Matthews for his
input and feedback on this feature.
- Replaced the previous implementation of bli_init_once() and
bli_finalize_once() -- both of which used bli_pthread_once() -- with
ones that rely upon bli_pthread_switch_on() and _switch_off(),
respectively. This also required updating the return types of
_init_apis() and _finalize_apis() to match the function pointer type
required by bli_pthread_switch_on()/_switch_off().
- Comment updates.
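
The on/off switch semantics described above can be sketched with a mutex and a flag. The type, macro, and function names are hypothetical and are not the bli_pthread_switch_t implementation; leaving the state unchanged when the callback fails is an assumption of this sketch.

```c
#include <pthread.h>

/* Hypothetical switch type: starts in the off state. */
typedef struct {
    pthread_mutex_t mutex;
    int             is_on;
} switch_sketch_t;

#define SWITCH_SKETCH_INIT { PTHREAD_MUTEX_INITIALIZER, 0 }

/* Runs init() only on an off-to-on transition; returns its error code. */
static int switch_on_sketch(switch_sketch_t *sw, int (*init)(void))
{
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (!sw->is_on) {
        r = init();
        if (r == 0) sw->is_on = 1;   /* assumption: stay off on failure */
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* Runs finalize() only on an on-to-off transition; returns its error code. */
static int switch_off_sketch(switch_sketch_t *sw, int (*finalize)(void))
{
    int r = 0;
    pthread_mutex_lock(&sw->mutex);
    if (sw->is_on) {
        r = finalize();
        if (r == 0) sw->is_on = 0;   /* assumption: stay on on failure */
    }
    pthread_mutex_unlock(&sw->mutex);
    return r;
}

/* Demo callbacks that count how often they are invoked. */
static int demo_startups  = 0;
static int demo_shutdowns = 0;
static int demo_startup(void)  { ++demo_startups;  return 0; }
static int demo_shutdown(void) { ++demo_shutdowns; return 0; }
```

Calling switch_on_sketch() twice in a row with demo_startup would run the callback once; the second call is a no-op that returns 0.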

commit 64a9b061f6032e2b59613aecdbe7bb52161605c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 10 14:54:22 2022 -0500

Fixed misspelling of 'xpbys' in gemm macrokernel.

Details:
- Fixed a functionally harmless typo in bli_gemm_ker_var2.c where a few
instances of the substring "xpbys" were misspelled as "xbpys". The
misspellings were harmless because they were consistent, and because
they referenced only local symbols.

commit 1c733402a95ab08b20f3332c2397fd52a2627cf6
Author: Jed Brown <jedjedbrown.org>
Date: Thu Apr 28 11:58:44 2022 -0600

Fix version check for znver3, which needs gcc >= 10.3 (628)

Apple's clang-12 lacks znver3 support, unlike upstream clang-12.

commit 6431c9e13b86e4442b6aacba18a0ace12288c955
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 13:01:24 2022 -0500

Added missing 'const' to zen bli_gemm_small.c.

Details:
- Added missing 'const' qualifiers to signatures of functions defined in
kernels/zen/3/bli_gemm_small.c. This fixes compile-time errors when
targeting 'zen3' subconfig (which apparently is enabling AMD's
gemm_small code path by default). Thanks to Devin Matthews for
reporting this error.

commit 9fea633748ed27ef3853bba7cd955690c61092b4
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 13 15:59:06 2022 -0500

Partial addition of 'const' to all interfaces above the (micro)kernels. (625)

Details:
- Added 'const' qualifier to applicable function arguments wherever the
pointed-to object is not internally modified. This change affects
all interfaces that reside above the level of the (micro)kernels.
- Typecast certain function return values to discard 'const' qualifier.
- Removed 'restrict' from various arguments, including cntx_t*,
auxinfo_t*, rntm_t*, thrinfo_t*, mem_t*, and others.
- Removed parts of some APIs, such as bli_cntx_*(), due to limited use.
- Merged some variable declarations with their corresponding
initialization statements.
- Whitespace changes.

commit ae10d9495486f589ed0320f0151b2d195574f1cf (origin/amd)
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 6 20:31:11 2022 -0500

Simplify and rewrite reference packm kernels. (610)

Details:
- Reorganized the way kernels are stored within the cntx_t structure so
that rather than having a function pointer for every supported size of
unrolled packm kernel (2xk, 3xk, 4xk, etc.), we store only two packm
kernels per datatype: one to pack MRxk micropanels and one to pack
NRxk micropanels.
- NOTE: The "bb" (broadcast B) reference kernels have been merged into
the "standard" kernels (packm [including 1er and unpackm], gemm,
trsm, gemmtrsm). This replication factor is controlled by
BLIS_BB[MN]_[sdcz] etc. Power9/10 needs testing since only a
replication factor of 1 has been tested. armsve also needs testing
since the MR value isn't available as a macro.
- Simplified the bli_cntx_*() APIs to conform to the new unified kernel
array within the cntx_t. Updated existing bli_cntx_init_<subconfig>()
function definitions for all subconfigurations.
- Consolidated all kernel id types (e.g. l1vkr_t, l1mkr_t, l3ukr_t,
etc.) into one kernel id type: ukr_t.
- Various edits, updates, and rewrites of reference kernels pursuant to
the aforementioned changes.
- Define compile-time macro constants (BLIS_MR_[sdcz], BLIS_NR_[sdcz],
and friends) in bli_kernel_macro_defs.h, but only when the macro
BLIS_IN_REF_KERNEL is defined by the build system.
- Loose ends:
- Still need to update documentation, including:
- docs/ConfigurationHowTo.md
- docs/KernelsHowTo.md
to reflect changes made in this commit.

commit b3e674db3c05ca586b159a71deb1b61d701ae5c9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 4 17:31:02 2022 -0500

README.md update to link to releases page.

commit 69fa915464c52f09a5971a60f521900d31a34e69
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:47:46 2022 -0500

Fixed broken "tagged releases" link in README.md.

commit 88cab8383ca90ddbb4cf13e69b7d44a1663a4425
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:12:06 2022 -0500

CHANGELOG update (0.9.0)

1.3

Modified config/zen/make_defs.mk, now CKVECFLAGS := -mavx2 -mfpmath=sse -mfma -march=znver1

Change-Id: Ia0942d285a21447cd0c470de1bc021fe63e80d81

commit 3bdab823fa93342895bf45d812439324a37db77c
Merge: 70f12f20 e2a02ebd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 28 14:07:24 2019 -0600

Merge branch 'master' into dev

commit e2a02ebd005503c63138d48a2b7d18978ee29205
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 28 13:58:59 2019 -0600

Updates (from ls5) to test/3m4m/runme.sh.

Details:
- Lonestar5-specific updates to runme.sh.

commit f0dcc8944fa379d53770f5cae5d670140918f00c
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Feb 27 17:27:23 2019 -0600

Add symbol export macro for all functions (302)

* initial export of blis functions

* Regenerate def file for master

* restore bli_extern_defs exporting for now

commit 540ec1b479712d5e1da637a718927249c15d867f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Feb 24 19:09:10 2019 -0600

Updated level-3 BLAS to call object API directly.

Details:
- Updated the BLAS compatibility layer for level-3 operations so that
the corresponding BLIS object API is called directly rather than first
calling the typed BLIS API. The previous code based on the typed BLIS
API calls is still available in a deactivated cpp macro branch, which
may be re-activated by defining BLIS_BLAS3_CALLS_TAPI. (This does not
yet correspond to a configure option. If it seems like people might
want to toggle this behavior more regularly, a configure option can be
added in the future.)
- Updated the BLIS typed API to statically "pre-initialize" objects via
new initializer macros. Initialization is then finished via calls to
static functions bli_obj_init_finish_1x1() and bli_obj_init_finish(),
which are similar to the previously-called functions,
bli_obj_create_1x1_with_attached_buffer() and
bli_obj_create_with_attached_buffer(), respectively. (The BLAS
compatibility layer updates mentioned above employ this new technique
as well.)
- Transformed certain routines in bli_param_map.c--specifically, the
ones that convert netlib-style parameters to BLIS equivalents--into
static functions, now in bli_param_map.h. (The remaining three classes
of conversion routines were left unchanged.)
- Added the aforementioned pre-initializer macros to bli_type_defs.h.
- Relocated bli_obj_init_const() and bli_obj_init_constdata() from
bli_obj_macro_defs.h to bli_type_defs.h.
- Added a few macros to bli_param_macro_defs.h for testing domains for
real/complexness and precisions for single/double-ness.

commit 8e023bc914e9b4ac1f13614feb360b105fbe44d2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 22 16:55:30 2019 -0600

Updates to 3m4m/matlab scripts.

Details:
- Minor updates to matlab graph-generating scripts.
- Added a plot_all.m script that is more of a scratchpad for copying and
pasting function invocations into matlab to generate plots that are
presently of interest to us.

commit b06244d98cc468346eb1a8eb931bc05f35ff280c
Merge: e938ff08 4c7e6680
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:56:15 2019 +0530

Merge branch 'ut-austin-amd' of ssh://git.amd.com:29418/cpulibraries/er/blis into ut-austin-amd

commit e938ff08cea3d108c84524eb129d9e89d701ea90
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:44:38 2019 +0530

deleted test.txt

Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit ed13ad465dcba350ad3d5e16c9cc7542e33f3760
Author: mkv <Mallikarjuna-Reddy.K-Vamd.com>
Date: Thu Feb 21 01:04:16 2019 -0500

added test file for initial commit

commit 4c7e6680832b497468cf50c2399e3ac4de0e3450
Author: praveeng <praveen.gamd.com>
Date: Thu Feb 21 12:44:38 2019 +0530

deleted test.txt

Change-Id: I3871f5fe76e548bc29ec2733745b29964e829dd3

commit 95e070581c54ed2edc211874faec56055ea298c8
Author: mkv <Mallikarjuna-Reddy.K-Vamd.com>
Date: Thu Feb 21 01:04:16 2019 -0500

added test file for initial commit

commit 70f12f209bc1901b5205902503707134cf2991a0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 20 16:10:10 2019 -0600

Changed unsafe-loop to unsafe-math optimizations.

Details:
- Changed -funsafe-loop-optimizations (re-)introduced in 7690855 for
make_defs.mk files' CRVECFLAGS to -funsafe-math-optimizations (to
account for a miscommunication in issue 300). Thanks to Dave Love
for this suggestion and Jeff Hammond for his feedback on the topic.

commit 7690855c5106a56e5b341a350f8db1c78caacd89
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 19:16:01 2019 -0600

Restored -funsafe-loop-optimizations to subconfigs.

Details:
- Restored use of -funsafe-loop-optimizations in the definitions of
CRVECFLAGS (when using gcc), but only for sub-configurations (and
not configuration families such as amd64, intel64, and x86_64).
This more or less reverts 5190d05 and 6cf1550.

commit 44994d1490897b08cde52a615a2e37ddae8b2061
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 18:35:30 2019 -0600

Disable TBM, XOP, LWP instructions in AMD configs.

Details:
- Added -mno-tbm -mno-xop -mno-lwp to CKVECFLAGS in bulldozer,
piledriver, steamroller, and excavator configurations to explicitly
disable AMD's bulldozer-era TBM, XOP, and LWP instruction sets in an
attempt to fix the invalid instruction error that has plagued Travis
CI builds since 6a014a3. Thanks to Devin Matthews for pointing out
that the offending instruction was part of TBM (issue 300).
- Restored -O3 to piledriver configuration's COPTFLAGS.

commit 1e5b530744c1906140d47f43c5cad235eaa619cf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 18:04:38 2019 -0600

Reverted piledriver COPTFLAGS from -O3 to -O2.

Details:
- Debugging continues; changing COPTFLAGS for piledriver subconfig from
-O3 to -O2, its original value prior to 6a014a3.

commit 6cf155049168652c512aefdd16d74e7ff39b98df
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 17:29:51 2019 -0600

Removed -funsafe-loop-optimizations from all configs.

Details:
- Error persists. Removed -funsafe-loop-optimizations from all remaining
sub-configurations.

commit 5190d05a27c5fa4c7942e20094f76eb9a9785c3e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 17:07:35 2019 -0600

Removed -funsafe-loop-optimizations from piledriver.

Details:
- Error persists; continuing debugging from bf0fb78c by removing
-funsafe-loop-optimizations from piledriver configuration.

commit bf0fb78c5e575372060d22f5ceeb5b332e8978ec
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 16:51:38 2019 -0600

Removed -funsafe-loop-optimizations from families.

Details:
- Removed -funsafe-loop-optimizations from the configuration families
affected by 6a014a3, specifically: intel64, amd64, and x86_64.
This is part of an attempt to debug why the sde, as executed by
Travis CI, is crashing via the following error:

TID 0 SDE-ERROR: Executed instruction not valid for specified chip
(ICELAKE): 0x9172a5: bextr_xop rax, rcx, 0x103

commit 6a014a3377a2e829dbc294b814ca257a2bfcb763
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 14:52:29 2019 -0600

Standardized optimization flags in make_defs.mk.

Details:
- Per Dave Love's recommendation in issue 300, this commit defines
COPTFLAGS := -O3
and
CRVECFLAGS := $(CKVECFLAGS) -funsafe-loop-optimizations
in the make_defs.mk for all Intel- and AMD-based configurations.

commit 565fa3853b381051ac92cff764625909d105644d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 18 11:43:58 2019 -0600

Redirect trsm pc, ir parallelism to ic, jr loops.

Details:
- trsm parallelization was temporarily simplified in 075143d to entirely
ignore any parallelism specified via the pc or ir loops. Now, any
parallelism specified to the pc loop will be redirected to the ic
loop, and any parallelism specified to the ir loop will be redirected
to the jr loop. (Note that because of inter-iteration dependencies,
trsm cannot parallelize the ir loop. Parallelism via the pc loop is
at least somewhat feasible in theory, but it would require tracking
dependencies between blocks--something for which BLIS currently lacks
the necessary supporting infrastructure.)
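
One plausible reading of this redirection is sketched below: the ways of parallelism requested for pc are folded into ic, and those requested for ir into jr. The struct, the function name, and the choice to multiply the counts are assumptions of this sketch, not taken from the BLIS source.

```c
/* Hypothetical per-loop ways of parallelism. */
typedef struct { int jc, pc, ic, jr, ir; } ways_sketch_t;

/* Folds pc into ic and ir into jr, leaving pc and ir sequential. */
static ways_sketch_t redirect_trsm_ways_sketch(ways_sketch_t w)
{
    w.ic *= w.pc;  w.pc = 1;   /* pc parallelism redirected to the ic loop */
    w.jr *= w.ir;  w.ir = 1;   /* ir parallelism redirected to the jr loop */
    return w;
}
```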

commit a023c643f25222593f4c98c2166212561d030621
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 14 20:18:55 2019 -0600

Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'

commit 075143dfd92194647da9022c1a58511b20fc11f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 14 18:52:45 2019 -0600

Added support for IC loop parallelism to trsm.

Details:
- Parallelism within the IC loop (3rd loop around the microkernel) is
now supported within the trsm operation. This is done via a new branch
on each of the control and thread trees, which guide execution of a
new trsm-only subproblem from within bli_trsm_blk_var1(). This trsm
subproblem corresponds to the macrokernel computation on only the
block of A that contains the diagonal (labeled as A11 in algorithms
with FLAME-like partitioning), and the corresponding row panel of C.
During the trsm subproblem, all threads within the JC communicator
participate and parallelize along the JR loop, including any
parallelism that was specified for the IC loop. (IR loop parallelism
is not supported for trsm due to inter-iteration dependencies.) After
this trsm subproblem is complete, a barrier synchronizes all
participating threads and then they proceed to apply the prescribed
BLIS_IC_NT (or equivalent) ways of parallelism (and any BLIS_JR_NT
parallelism specified within) to the remaining gemm subproblem (the
rank-k update that is performed using the newly updated row-panel of
B). Thus, trsm now supports JC, IC, and JR loop parallelism.
- Modified bli_trsm_l_cntl_create() to create the new "prenode" branch
of the trsm_l cntl_t tree. The trsm_r tree was left unchanged, for
now, since it is not currently used. (All trsm problems are cast in
terms of left-side trsm.)
- Updated bli_cntl_free_w_thrinfo() to be able to free the newly shaped
trsm cntl_t trees. Fixed a potentially latent bug whereby a cntl_t
subnode is only recursed upon if there existed a corresponding
thrinfo_t node, which may not always exist (for problems too small
to employ full parallelization due to the minimum granularity imposed
by micropanels).
- Updated other functions in frame/base/bli_cntl.c, such as
bli_cntl_copy() and bli_cntl_mark_family(), to recurse on sub-prenodes
if they exist.
- Updated bli_thrinfo_free() to recurse into sub-nodes and prenodes
when they exist, and added support for growing a prenode branch to
bli_thrinfo_grow() via a corresponding set of help functions named
with the _prenode() suffix.
- Added a bszid_t field to thrinfo_t nodes. This field comes in handy when
debugging the allocation/release of thrinfo_t nodes, as it helps trace
the "identity" of each node as it is created/destroyed.
- Renamed
bli_l3_thrinfo_print_paths() -> bli_l3_thrinfo_print_gemm_paths()
and created a separate bli_l3_thrinfo_print_trsm_paths() function to
print out the newly reconfigured thrinfo_t trees for the trsm
operation.
- Trivial changes to bli_gemm_blk_var?.c and bli_trsm_blk_var?.c
regarding variable declarations.
- Removed subpart_t enum values BLIS_SUBPART1T, BLIS_SUBPART1B,
BLIS_SUBPART1L, BLIS_SUBPART1R. Then added support for two new labels
(semantically speaking): BLIS_SUBPART1A and BLIS_SUBPART1B, which
represent the subpartition ahead of and behind, respectively,
BLIS_SUBPART1. Updated check functions in bli_check.c accordingly.
- Shuffled layering/APIs for bli_acquire_mpart_[mn]dim() and
bli_acquire_mpart_t2b/b2t(), _l2r/r2l().
- Deprecated old functions in frame/3/bli_l3_thrinfo.c.

commit 78bc0bc8b6b528c79b11f81ea19250a1db7450ed
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Thu Feb 14 13:29:02 2019 -0600

Power9 sub-configuration (298)

Formally registered power9 sub-configuration.

Details:
- Added and registered power9 sub-configuration into the build system.
Thanks to Nicholai Tukanov and Devangi Parikh for these contributions.
- Note: The sub-configuration does not yet have a corresponding
architecture-specific kernel set registered, and so for now the
sub-config is using the generic kernel set.

commit 6b832731261f9e7ad003a9ea4682e9ca973ef844
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 12 16:01:28 2019 -0600

Generalized ref kernels' pragma omp simd usage.

Details:
- Replaced direct usage of _Pragma( "omp simd" ) in reference kernels
with PRAGMA_SIMD, which is defined as a function of the compiler being
used in a new bli_pragma_macro_defs.h file. That definition is cleared
when BLIS detects that the -fopenmp-simd command line option is
unsupported. Thanks to Devin Matthews and Jeff Hammond for suggestions
that guided this commit.
- Updated configure and bli_config.h.in so that the appropriate anchor
is substituted in (when the corresponding pragma omp simd support is
present).
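
The macro indirection described above might look roughly like the following. The guard macro SKETCH_HAVE_OMP_SIMD and the loop function are hypothetical; only the PRAGMA_SIMD name and the _Pragma( "omp simd" ) form come from the commit text.

```c
#include <stddef.h>

/* Hypothetical guard: assumed to be defined by the build system only
   when the compiler accepts -fopenmp-simd. */
#ifdef SKETCH_HAVE_OMP_SIMD
  #define PRAGMA_SIMD _Pragma("omp simd")
#else
  #define PRAGMA_SIMD   /* expands to nothing when unsupported */
#endif

/* Example reference-style loop: y := y + alpha * x. */
static void axpyv_sketch(size_t n, double alpha, const double *x, double *y)
{
    PRAGMA_SIMD
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}
```

Because the pragma is hidden behind one macro, kernels do not need per-compiler conditionals at each loop.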

commit b1f5ce8622b682b79f956fed83f04a60daa8e0fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 5 17:38:50 2019 -0600

Minor updates to scripts in test/mixeddt/matlab.

commit 38203ecd15b1fa50897d733daeac6850d254e581
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Mon Feb 4 15:28:28 2019 -0500

Added thunderx2 system in the mixeddt test scripts

Details:
- Added thunderx2 (tx2) as a system in the runme.sh in test/mixeddt

commit dfc91843ea52297bf636147793029a0c1345be04
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Mon Feb 4 15:23:40 2019 -0500

Fixed gcc flags for thunderx2 subconfiguration

Details:
- Fixed -march flag. Thunderx2 is an armv8.1a architecture not armv8a.

commit c665eb9b888ec7e41bd0a28c4c8ac4094d0a01b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 28 16:22:23 2019 -0600

Minor updates to docs, Makefiles.

Details:
- Changed all occurrences of
micro-kernel -> microkernel
macro-kernel -> macrokernel
micro-panel -> micropanel
in all markdown documents in 'docs' directory. This change is being
made since we've reached the point in adoption and acceptance of
BLIS's insights where words such as "microkernel" are no longer new,
and therefore now merit being unhyphenated.
- Updated "Implementation Notes" sections of KernelsHowTo.md, which
still contained references to nonexistent cpp macros such as
BLIS_DEFAULT_MR_? and BLIS_PACKDIM_MR_?.
- Added 'run-fast' and 'check-fast' targets to testsuite/Makefile.
- Minor updates to Testsuite.md, including suggesting use of
'make check' and 'make check-fast' when running from the local
testsuite directory.
- Added a comment to top-level Makefile explaining the purpose behind
the TESTSUITE_WRAPPER variable, which at first glance appears to serve
no purpose.

commit 1aa280d0520ed5eaea3b119b4e92b789ecad78a4
Author: M. Zhou <5723047+cdluminateusers.noreply.github.com>
Date: Sun Jan 27 21:40:48 2019 +0000

Amend OS detection for kFreeBSD. (295)

commit fffc23bb35d117a433886eb52ee684ff5cf6997f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 25 13:35:31 2019 -0600

CREDITS file update.

commit 26c5cf495ce22521af5a36a1012491213d5a4551
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 18:49:31 2019 -0600

Fixed bug in skx subconfig related to bdd46f9.

Details:
- Fixed code in the skx subconfiguration that became a bug after
committing bdd46f9. Specifically, the bli_cntx_init_skx() function
was overwriting default blocksizes for the scomplex and dcomplex
microkernels despite the fact that only single and double real
microkernels were being registered. This was not a problem prior to
bdd46f9 since all microkernels used dynamically-queried (at runtime)
register blocksizes for loop bounds. However, post-bdd46f9, this
became a bug because the reference ukernels for scomplex and dcomplex
were written with their register blocksizes hard-coded as constant
loop bounds, which conflicted with the erroneous scomplex and dcomplex
values that bli_cntx_init_skx() was setting in the context. The
lesson here is that going forward, all subconfigurations must not set
any blocksizes for datatypes corresponding to default/reference
microkernels. (Note that a blocksize is left unchanged by the
bli_cntx_set_blkszs() function if it was set to -1.)
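
Illustration (hypothetical types and names, not the BLIS API): a minimal sketch of the "-1 means leave unchanged" convention mentioned in the note above. A subconfiguration that registers only real-domain microkernels would pass -1 for the complex datatypes so that the reference defaults survive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical datatype indices and blocksize container. */
enum { DEMO_FLOAT, DEMO_DOUBLE, DEMO_SCOMPLEX, DEMO_DCOMPLEX, DEMO_NUM_DT };

typedef struct { long v[ DEMO_NUM_DT ]; } demo_blksz_t;

/* Copy each per-datatype value from src into dst, skipping entries equal
   to -1 so that the value already stored in dst is preserved. */
void demo_blksz_update( demo_blksz_t* dst, const demo_blksz_t* src )
{
    assert( dst != NULL && src != NULL );

    for ( int dt = 0; dt < DEMO_NUM_DT; ++dt )
        if ( src->v[ dt ] != -1 )
            dst->v[ dt ] = src->v[ dt ];
}
```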

commit 180f8e42e167b83a757340ad4bd4a5c7a1d6437b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 18:01:15 2019 -0600

Fixed undefined behavior trsm ukr bug in bdd46f9.

Details:
- Fixed a bug that manifested anytime a configuration was used in which
optimized microkernels were registered and the trsm operation (or
kernel) was invoked. The bug resulted from the optimized microkernels'
register blocksizes conflicting with the hard-coded values--expressed
in the form of constant loop bounds--used in the new reference trsm
ukernels that were introduced in bdd46f9. The fix was easy: reverting
back to the implementation that uses variable-bound loops, which
amounted to changing an #if 0 to #if 1 (since I preserved the older
implementation in the file alongside the new code based on constant-
bound loops). It should be noted that this fix must be permanent,
since the trsm kernel code with constant-bound loops can never work
with gemm ukernels that use different register blocksizes.
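
Illustration (a simplified stand-in, not the reference trsm microkernel): a sketch of two kernel variants kept side by side under a preprocessor switch. The constants DEMO_REF_MR and DEMO_REF_NR and the function demo_scale_tile are hypothetical. The constant-bound variant is correct only when the caller's tile dimensions equal the baked-in constants, which is the kind of conflict described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register blocksizes baked into the constant-bound variant. */
enum { DEMO_REF_MR = 4, DEMO_REF_NR = 4 };

#if 1
/* Variable-bound variant: loop bounds are supplied at run time, so any
   mr x nr tile (stored row-major, contiguously) is handled correctly. */
void demo_scale_tile( size_t mr, size_t nr, double alpha, double* c )
{
    assert( c != NULL );

    for ( size_t i = 0; i < mr; ++i )
        for ( size_t j = 0; j < nr; ++j )
            c[ i*nr + j ] *= alpha;
}
#else
/* Constant-bound variant: easier to vectorize, but it ignores mr and nr
   and is wrong whenever they differ from DEMO_REF_MR and DEMO_REF_NR. */
void demo_scale_tile( size_t mr, size_t nr, double alpha, double* c )
{
    ( void )mr; ( void )nr;

    for ( size_t i = 0; i < DEMO_REF_MR; ++i )
        for ( size_t j = 0; j < DEMO_REF_NR; ++j )
            c[ i*DEMO_REF_NR + j ] *= alpha;
}
#endif
```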

commit bdd46f9ee88057d52610161966a11c224e5a026c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 24 17:23:18 2019 -0600

Rewrote reference kernels to use pragma omp simd.

Details:
- Rewrote level-1v, -1f, and -3 reference kernels in terms of simplified
indexing annotated by the pragma omp simd directive, which a compiler
can use to vectorize certain constant-bounded loops. (The new kernels
actually use _Pragma("omp simd") since the kernels are defined via
templatizing macros.) Modest speedup was observed in most cases using
gcc 5.4.0, which may improve with newer versions. Thanks to Devin
Matthews for suggesting this via issue 286 and 259.
- Updated default blocksizes defined in ref_kernels/bli_cntx_ref.c to
be 4x16, 4x8, 4x8, and 4x4 for single, double, scomplex and dcomplex,
respectively, with a default row preference for the gemm ukernel. Also
updated axpyf, dotxf, and dotxaxpyf fusing factors to 8, 6, and 4,
respectively, for all datatypes.
- Modified configure to verify that -fopenmp-simd is a valid compiler
option (via a new detect/omp_simd/omp_simd_detect.c file).
- Added a new header in which prefetch macros are defined according to
which compiler is detected (via macros such as __GNUC__). These
prefetch macros are not yet employed anywhere, though.
- Updated the year in copyrights of template license headers in
build/templates and removed AMD as a default copyright holder.

commit 63de2b0090829677755eb5cdb27e73bc738da32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 23 12:16:27 2019 -0600

Prevent redef of ftnlen in blastest f2c_types.h.

Details:
- Guard typedef of ftnlen in f2c_types.h with an #ifndef HAVE_BLIS_H
directive to prevent the redefinition of that type. Thanks to Jeff
Diamond for reporting this compiler warning (and apologies for the
delay in committing a fix).

commit eec2e183a7b7d67702dbd1f39c153f38148b2446
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 21 12:12:18 2019 -0600

Added escaping to '/' in os_name in configure.

Details:
- Add os_name to the list of variables into which the '/' character is
escaped. This is meant to address (or at least make progress toward
addressing) 293. Thanks to Isuru Fernando for spotting this as the
potential fix, and also thanks to M. Zhou for the original report.

commit adf5c17f0839fdbc1f4a1780f637928b1e78e389
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jan 18 15:14:45 2019 -0600

Formally registered thunderx2 subconfiguration.

Details:
- Added a separate subconfiguration for thunderx2, which now uses
different optimization flags than cortexa57/cortexa53.

commit 094cfdf7df6c2764c25fcbfce686ba29b933942c
Author: M. Zhou <5723047+cdluminateusers.noreply.github.com>
Date: Fri Jan 18 18:46:13 2019 +0000

Port BLIS to GNU Hurd OS. (294)

Prevent blis.h from misidentifying Hurd as OSX.

commit 5d7d616e8e591c2f3c7c2d73220eb27ea484f9c9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 15 20:52:51 2019 -0600

README.md update re: mixeddt TOMS paper.

commit 58c7fb4788177487f73a3964b7a910fe4dc75941
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 8 17:00:27 2019 -0600

Added more matlab scripts for mixeddt paper.

Details:
- Added a variant set of matlab scripts geared to producing plots that
reflect performance data gathered with and without extra memory
optimizations enabled. These scripts reside (for now) in
test/mixeddt/matlab/wawoxmem.

commit 34286eb914b48b56cdda4dfce192608b9f86d053
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 8 11:41:20 2019 -0600

Minor update to docs/HardwareSupport.md.

commit 108b04dc5b1b1288db95f24088d1e40407d7bc88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 20:16:31 2019 -0600

Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto' to reflect removal of
bli_malloc_pool() and bli_free_pool().

commit 706cbd9d5622f4690e6332a89cf41ab5c8771899
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 18:28:19 2019 -0600

Minor tweaks/cleanups to bli_malloc.c, _apool.c.

Details:
- Removed malloc_ft and free_ft function pointer arguments from the
interface to bli_apool_init() after deciding that there is no need to
specify the malloc()/free() for blocks within the apool. (The apool
blocks are actually just array_t structs.) Instead, we simply call
bli_malloc_intl()/_free_intl() directly. This has the added benefit
of allowing additional output when memory tracing is enabled via
--enable-mem-tracing. Also made corresponding changes elsewhere in
the apool API.
- Changed the inner pools (elements of the array_t within the apool_t)
to use BLIS_MALLOC_POOL and BLIS_FREE_POOL instead of BLIS_MALLOC_INTL
and BLIS_FREE_INTL.
- Disabled definitions of bli_malloc_pool() and bli_free_pool() since
there are no longer any consumers of these functions.
- Very minor comment / printf() updates.

commit 579145039d945adbcad1177b1d53fb2d3f2e6573
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Mon Jan 7 23:00:15 2019 +0100

Initialize error messages at compile time (289)

* Initialize error messages at compile time

- Assigning strings directly to the bli_error_string array, instead of
snprintf() at execution-time.

* Retired bli_error_init(), _finalize().

Details:
- Removed functions obviated by changes in 80e8dc6: bli_error_init(),
bli_error_finalize(), and bli_error_init_msgs(), as well as calls to
the former two in bli_init.c.

* Regenerated symbols in build/libblis-symbols.def.

Details:
- Reran ./build/regen-symbols.sh after running
'configure --enable-cblas auto'.
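
Illustration (hypothetical codes and messages, not the contents of bli_error.c): a minimal sketch of initializing an error-string table at compile time with designated initializers, so that no snprintf() calls or init/finalize functions are needed at run time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes; the real library's codes and values differ. */
enum
{
    DEMO_SUCCESS = 0,
    DEMO_NULL_POINTER,
    DEMO_MALLOC_RETURNED_NULL,
    DEMO_NUM_ERROR_CODES
};

/* The table is fully populated by the compiler; nothing runs at startup. */
static const char* const demo_error_string[ DEMO_NUM_ERROR_CODES ] =
{
    [ DEMO_SUCCESS ]              = "No error.",
    [ DEMO_NULL_POINTER ]         = "Encountered unexpected null pointer.",
    [ DEMO_MALLOC_RETURNED_NULL ] = "malloc() returned NULL.",
};

/* Look up the message for a code, with a fallback for unknown codes. */
const char* demo_error_string_for_code( int code )
{
    if ( code < 0 || code >= DEMO_NUM_ERROR_CODES )
        return "Unknown error code.";

    assert( demo_error_string[ code ] != NULL );
    return demo_error_string[ code ];
}
```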

commit aafbca086e36b6727d7be67e21fef5bd9ff7bfd9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 12:38:21 2019 -0600

Updated external package language in README.md.

Details:
- Updated/added comments about Fedora, OpenSUSE, and GNU Guix under the
newly-renamed "External GNU/Linux packages" section. Thanks to Dave
Love for providing these revisions.

commit daacfe68404c9cc8078e5e7ba49a8c7d93e8cda3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 7 12:12:47 2019 -0600

Allow running configure with python 3.4.

Details:
- Relax version blacklisting of python3 to allow 3.4 or later instead
of 3.5 or later. Thanks to Dave Love for pointing out that 3.4 was
sufficient for the purpose of BLIS's build system. (It should be
noted that we're not sure which, if any, python3 versions prior to
3.4 are insufficient, and that the only thing stopping us from
determining this is the fact that these earlier versions of python3
are not readily available for us to test with.)
- Updated docs/BuildSystem.md to be explicit about current python2 vs
python3 version requirements.

commit cdbf16aa93234e0d6a80f0d0e385ec81e7b75465
Author: prangana <pradeep.raoamd.com>
Date: Fri Jan 4 15:59:21 2019 +0530

Update version 1.3

Change-Id: I32a7d24af860e87a60396614075236afb65a28a9

commit cf9c1150515b8e9cc4f12e0d4787b3471b12ba4a
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu Jan 3 09:51:46 2019 +0530

This commit adds a macro, which is to be enabled when BLIS is working on single instance mode

Change-Id: I7f3fd654b78e64c4e6e24e9f0e245b1a30c492b0

commit ad8d9adb09a7dd267bbdeb2bd1fbbf9daf64ee76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 3 16:08:24 2019 -0600

README.md, CREDITS update.

Details:
- Added "What's New" and "What People Are Saying About BLIS" sections to
README.md.
- Added missing github handles to various individuals' entries in the
CREDITS file.

commit 7052fca5aef430241278b67d24cef6fe33106904
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:48:40 2019 -0600

Apply f272c289 to bli_fmalloc_noalign().

Details:
- Perform the same check for NULL return values and error message output
in bli_fmalloc_noalign() as is performed by bli_fmalloc_align(). (This
change was intended for f272c289.)

commit 528e3ad16a42311a852a8376101959b4ccd801a5
Merge: 3126c52e f272c289
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:39:19 2019 -0600

Merge branch 'amd'

commit 3126c52ea795ffb7d30b16b7f7ccc2a288a6158d
Merge: 61441b24 8091998b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 13:37:37 2019 -0600

Merge branch 'amd'

commit f272c2899a6764eedbe05cea874ee3bd258dbff3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 2 12:34:15 2019 -0600

Add error message to malloc() check for NULL.

Details:
- Output an error message if and when the malloc()-equivalent called by
bli_fmalloc_align() ever returns NULL. Everything was already in place
for this to happen, including the error return code, the error string
sprintf(), the error checking function bli_check_valid_malloc_buf()
definition, and its prototype. Thanks to Minh Quan Ho for pointing out
the missing error message.
- Increased the default block_ptrs_len for each inner pool stored in the
small block allocator from 10 to 25. Under normal execution, each
thread uses only 21 blocks, so this change will prevent the sba from
needing to resize the block_ptrs array of any given inner pool as
threads initially populate the pool with small blocks upon first
execution of a level-3 operation.
- Nix stray newline echo in configure.

commit eb97f778a1e13ee8d3b3aade05e479c4dfcfa7c0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 25 20:17:09 2018 -0600

Added missing AMD copyrights to previous commit.

Details:
- Forgot to add AMD copyrights to several touched files that did not
already have them in 2f31743.

commit 2f3174330fb29164097d664b7c84e05c7ced7d95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 25 19:35:01 2018 -0600

Implemented a pool-based small block allocator.

Details:
- Implemented a sophisticated data structure and set of APIs that track
the small blocks of memory (around 80-100 bytes each) used when
creating nodes for control and thread trees (cntl_t and thrinfo_t) as
well as thread communicators (thrcomm_t). The purpose of the small
block allocator, or sba, is to allow the library to transition into a
runtime state in which it does not perform any calls to malloc() or
free() during normal execution of level-3 operations, regardless of
the threading environment (potentially multiple application threads
as well as multiple BLIS threads). The functionality relies on a new
data structure, apool_t, which is (roughly speaking) a pool of
arrays, where each array element is a pool of small blocks. The outer
pool, which is protected by a mutex, provides separate arrays for each
application thread while the arrays each handle multiple BLIS threads
for any given application thread. The design minimizes the potential
for lock contention, as only concurrent application threads would
need to fight for the apool_t lock, and only if they happen to begin
their level-3 operations at precisely the same time. Thanks to Kiran
Varaganti and AMD for requesting this feature.
- Added a configure option to disable the sba pools, which are enabled
by default; renamed the --[dis|en]able-packbuf-pools option to
--[dis|en]able-pba-pools; and rewrote the --help text associated with
this new option and consolidated it with the --help text for the
option associated with the sba (--[dis|en]able-sba-pools).
- Moved the membrk field from the cntx_t to the rntm_t. We now pass in
a rntm_t* to the bli_membrk_acquire() and _release() APIs, just as we
do for bli_sba_acquire() and _release().
- Replaced all calls to bli_malloc_intl() and bli_free_intl() that are
used for small blocks with calls to bli_sba_acquire(), which takes a
rntm (in addition to the bytes requested), and bli_sba_release().
These latter two functions reduce to the former two when the sba pools
are disabled at configure-time.
- Added rntm_t* arguments to various cntl_t and thrinfo_t functions, as
required by the new usage of bli_sba_acquire() and _release().
- Moved the freeing of "old" blocks (those allocated prior to a change
in the block_size) from bli_membrk_acquire_m() to the implementation
of the pool_t checkout function.
- Miscellaneous improvements to the pool_t API.
- Added a block_size field to the pblk_t.
- Harmonized the way that the trsm_ukr testsuite module performs packing
relative to that of gemmtrsm_ukr, in part to avoid the need to create
a packm control tree node, which now requires a rntm_t that has been
initialized with an sba and membrk.
- Re-enabled the explicit call to bli_finalize() in testsuite so that users who
run the testsuite with memory tracing enabled can check for memory
leaks.
- Manually imported the compact/minor changes from 61441b24 that cause
the rntm to be copied locally when it is passed in via one of the
expert APIs.
- Reordered parameters to various bli_thrcomm_*() functions so that the
thrcomm_t* to the comm being modified is last, not first.
- Added more descriptive tracing for allocating/freeing small blocks and
formalized via a new configure option: --[dis|en]able-mem-tracing.
- Moved some unused scalm code and headers into frame/1m/other.
- Whitespace changes to bli_pthread.c.
- Regenerated build/libblis-symbols.def.
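
Illustration (a deliberately simplified, single-threaded sketch, not the sba or apool_t implementation): the core idea behind a pool of small blocks is that blocks are checked out and checked back in rather than freed, so that after a warm-up phase no further malloc() calls occur. All names below are hypothetical, and the mutex-protected outer pool of per-thread arrays described above is omitted.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct
{
    void** blocks;      /* stack of blocks currently available */
    size_t num_avail;   /* number of blocks on the stack */
    size_t capacity;    /* length of the blocks array */
    size_t block_size;  /* size in bytes of every block */
    size_t num_mallocs; /* how many blocks were obtained via malloc() */
} demo_pool_t;

/* Returns 0 on success, -1 if the pointer array could not be allocated. */
int demo_pool_init( demo_pool_t* p, size_t block_size, size_t capacity )
{
    assert( p != NULL && block_size > 0 && capacity > 0 );

    p->blocks = malloc( capacity * sizeof( void* ) );
    if ( p->blocks == NULL ) return -1;

    p->num_avail   = 0;
    p->capacity    = capacity;
    p->block_size  = block_size;
    p->num_mallocs = 0;
    return 0;
}

/* Reuse an available block if there is one; otherwise allocate a new one. */
void* demo_pool_checkout( demo_pool_t* p )
{
    if ( p->num_avail > 0 )
        return p->blocks[ --p->num_avail ];

    p->num_mallocs++;
    return malloc( p->block_size );
}

/* Return a block to the pool, growing the pointer array if it is full. */
void demo_pool_checkin( demo_pool_t* p, void* block )
{
    if ( p->num_avail == p->capacity )
    {
        size_t new_cap = 2 * p->capacity;
        void** grown   = realloc( p->blocks, new_cap * sizeof( void* ) );
        if ( grown == NULL ) { free( block ); return; }
        p->blocks   = grown;
        p->capacity = new_cap;
    }
    p->blocks[ p->num_avail++ ] = block;
}

/* Free every pooled block and the pointer array itself. */
void demo_pool_finalize( demo_pool_t* p )
{
    while ( p->num_avail > 0 )
        free( p->blocks[ --p->num_avail ] );
    free( p->blocks );
    p->blocks   = NULL;
    p->capacity = 0;
}
```

In this sketch an initial capacity larger than the expected number of blocks per thread avoids the realloc() path entirely.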

commit 61441b24f3244a4b202c29611a4899dd5c51d3a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 20 19:38:11 2018 -0600

Make local copy of user's rntm_t in level-3 ops.

Details:
- In the case that the caller passes in a non-NULL rntm_t pointer into
one of the expert APIs for a level-3 operation (e.g. bli_gemm_ex()),
make a local copy of the rntm_t and use the address of that local copy
in all subsequent execution (which may change the contents of the
rntm_t). This prevents a potentially confusing situation whereby a
user-initialized rntm_t is used once (in, say, gemm), and then found
by the user to be in a different state before it is used a second
time.
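
Illustration (hypothetical struct and function, not rntm_t or bli_gemm_ex()): a minimal sketch of copying a caller-supplied runtime object into a local variable so that internal adjustments never leak back to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a runtime-parameters object. */
typedef struct { int num_threads; } demo_rntm_t;

/* Returns the thread count actually used. The caller's object is read
   but never modified; all normalization happens on the local copy. */
int demo_op_ex( const demo_rntm_t* rntm )
{
    demo_rntm_t rntm_l;

    if ( rntm != NULL ) rntm_l = *rntm;        /* copy the user's settings */
    else                rntm_l.num_threads = 1; /* fall back to a default */

    /* Internal normalization touches only the local copy. */
    if ( rntm_l.num_threads < 1 ) rntm_l.num_threads = 1;

    assert( rntm_l.num_threads >= 1 );
    return rntm_l.num_threads;
}
```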

commit e809b5d2f1023b4249969e2f516291c9a3a00b80
Merge: 76016691 0476f706
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 20 16:27:26 2018 -0600

Merge branch 'master' into amd

commit 1f4eeee5175a8fc9ac312847c796ce6db5fe75b9
Author: sraut <Biplab.Rautamd.com>
Date: Wed Dec 19 21:21:10 2018 +0530

Fixed BLAS test failures of small matrix SYRK for single and double precision.

Details:
- SYRK for small matrix was implemented by reusing small GEMM routine. This was
resulting in output written to the full C matrix, and C being symmetric the
lower and upper triangles of C matrix contained same results. BLAS SYRK API
spec demands either lower or upper triangle of C matrix to be written with
results. So, this was resulting in BLAS test failures, even though testsuite
of BLIS was passing small SYRK operation.
- To fix BLAS test failures of small matrix SYRK, separate kernel routines are
implemented for small SYRK for both single and double precision. The newly
added small SYRK routines are in file kernels/zen/3/bli_syrk_small.c.
Now the intermediate results of matrix C are written to a scratch buffer.
Final results are written from scratch buffer to matrix C using SIMD
copy to either lower or upper triangle part of matrix C.
- Source and header files frame/3/syrk/bli_syrk_front.c and
frame/3/syrk/bli_syrk_front.h are changed to invoke new small SYRK routines.

Change-Id: I9cfb1116c93d150aefac673fca033952ecac97cb
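
Illustration (a scalar sketch under simplifying assumptions, not the code in bli_syrk_small.c): the fix above can be pictured as computing the full product into a scratch buffer and then copying only the requested triangle into C. The sketch assumes row-major storage, alpha = 1, and beta = 0, uses plain loops instead of SIMD, and the name demo_syrk_small is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compute A * A^T (A is n x k, row-major) into scratch, then write only
   the lower (lower != 0) or upper (lower == 0) triangle, diagonal
   included, into the n x n row-major matrix c. Other entries of c are
   left untouched. */
void demo_syrk_small( int lower, size_t n, size_t k,
                      const double* a, double* c, double* scratch )
{
    assert( a != NULL && c != NULL && scratch != NULL );

    /* Step 1: full n x n result goes to the scratch buffer. */
    for ( size_t i = 0; i < n; ++i )
        for ( size_t j = 0; j < n; ++j )
        {
            double sum = 0.0;
            for ( size_t p = 0; p < k; ++p )
                sum += a[ i*k + p ] * a[ j*k + p ];
            scratch[ i*n + j ] = sum;
        }

    /* Step 2: copy only the requested triangle into C. */
    for ( size_t i = 0; i < n; ++i )
        for ( size_t j = 0; j < n; ++j )
        {
            int in_triangle = lower ? ( j <= i ) : ( j >= i );
            if ( in_triangle )
                c[ i*n + j ] = scratch[ i*n + j ];
        }
}
```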

commit 6d267375c3a0543f20604d74cc678ad91db3b6f1
Author: sraut <Biplab.Rautamd.com>
Date: Wed Dec 19 14:22:21 2018 +0530

This commit improves the performance of multi-instance DGEMM when these multiple threads are bound to a CCX.
Multi-Instance: Each thread runs a sequential DGEMM.
Change-Id: I306920c8061b6dad61efac1dae68727f4ac27df6

commit 0476f706b93e83f6b74a3d7b7e6e9cc9a1a52c3b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 18 14:56:20 2018 -0600

CHANGELOG update (0.5.1)

1.0rc0

Author: Field G. Van Zee <fgvanzeegmail.com>
Date: Fri Nov 3 15:52:57 2023 -0500

CREDITS file update.

commit 05388ddb66f8bf2d62009b162d64bf2d99226b83
Author: Aaron Hutchinson <113382047+Aaron-Hutchinsonusers.noreply.github.com>
Date: Fri Nov 3 13:30:31 2023 -0700

Added 'sifive_x280' subconfig, kernel set. (737)

Details:
- Added a new 'sifive_x280' subconfiguration for SiFive's x280 RISC-V
instruction set architecture. The subconfig registers kernels from a
correspondingly new kernel set, also named 'sifive_x280'.
- Added the aforementioned kernel set, which includes intrinsics- and
assembly-based implementations of most level-1v kernels along with
level-1f kernels axpy2v and dotaxpyv, packm kernels, and level-3 gemm,
gemmtrsm_l, and gemmtrsm_u microkernels (plus supporting files).
- Registered the 'sifive_x280' subconfig as belonging to a singleton
family by the same name.
- Added an entry to '.travis.yml' to test the new subconfig via qemu.
- Updates to 'travis/do_riscv.sh' script to support the 'sifive_x280'
subconfig and to reflect updated tarball names.
- Special thanks to Lee Killough, Devin Matthews, and Angelika Schwarz
for their engagement on this commit.

0.9.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:12:06 2022 -0500

Version file update (0.9.0)

commit 99bb9002f1aff598d347eae2821a3f7bdd1f48e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 1 08:10:59 2022 -0500

ReleaseNotes.md update in advance of next version.

commit bee7678b2558a691ac850819dbe33fefe4fdbee3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 14:09:39 2022 -0500

CREDITS file update.

commit cf06364327bd2d21d606392371ff3c5962bee5ba
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 16:18:25 2022 -0500

Fixed typo in BLAS gemm3m call to _check().

Details:
- Fixed an unresolved symbol issue leftover from 590 whereby ?gemm3m_()
as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which
does not exist. It should have simply called the _check() function for
gemm.

commit 1ec020b33ece1681c0041e2549eed2bd4c6cf356
Author: Dipal M Zambare <71366780+dzambareusers.noreply.github.com>
Date: Wed Mar 30 02:45:36 2022 +0530

AMD kernel updates; frame-specific AMD updates. (597)

Details:
- Allow building BLIS with certain framework files (each with the '_amd'
suffix) that have been customized by AMD for Zen-based hardware. These
customized files were derived from portable versions of the same files
(i.e., those without the '_amd' suffix). Whether the portable or AMD-
specific files are compiled is now controlled by a new configure
option, --[en|dis]able-amd-frame-tweaks. This option is disabled by
default in vanilla BLIS, though AMD may choose to enable it by default
in their fork. For now, the added AMD-specific files are:
- bli_gemv_unf_var2_amd.c
- bla_copy_amd.c
- bla_gemv_amd.c
These files reside in 'amd' subdirectories found within the directory
housing their generic counterparts.
- Register optimized real-domain copyv, setv, and swapv kernels in
bli_cntx_init_zen.c.
- Various minor updates to level-1v kernels in 'zen' kernel set.
- Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to
the 'zen' kernel set
- If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim,
call gemv instead and return early.
- Combined variable declarations with their initialization in various
level-2 and level-3 BLAS compatibility files, and also inserted
'const' qualifier in those same declaration statements.
- Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ .
- Added copyv and swapv test drivers to 'test' directory.
- Whitespace, comment changes.

commit 0db2bd5341c5c3ed5f1cc2bffa90952735efa45f
Author: Bhaskar Nallani <Nallani.Bhaskaramd.com>
Date: Fri Mar 25 05:11:55 2022 +0530

Added BLAS/CBLAS APIs for gemm3m. (590)

Details:
- Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply
invoke the 1m implementation unconditionally. (Note that these APIs
bypass sup handling.)
- Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h.
- Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h.
- Relocated:
frame/compat/cblas/src/cblas_?gemmt.c
files into
frame/compat/cblas/src/extra/
- Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ .
- Minor reorganization of prototypes and cpp macro directives in
bli_blas.h, cblas.h, and cblas_f77.h.
- Trivial whitespace change to cblas_zgemm.c.

commit d6810000e961fe807dc5a7db81180a8355f3eac0
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Mar 14 10:29:54 2022 -0500

Update Multithreading.md

Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]

commit f1dbb0e514f53a3240d3a6cbdc3306b01a2206f5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 11 13:38:28 2022 -0600

Trivial whitespace change; commit log addendum.

Details:
- A co-attribution to Mithun Mohan was inadvertently omitted from the
commit log for headline change in the previous commit, 7c07b47.

commit 7c07b477e432adbbce5812ed9341ba3092b03976
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 11 13:28:50 2022 -0600

Avoid gemmsup barriers when not packing A or B. (622)

Details:
- Implemented a multithreaded optimization for the special (and common)
case of employing the gemmsup code path when the user requests
(implicitly or explicitly) that neither A nor B be packed during
computation. This optimization takes the form of a greatly reduced
code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a
broadcast and two barriers, and results in higher performance when
obtaining two-way or higher parallelism within BLIS. Thanks to
Bhaskar Nallani of AMD for proposing this change via issue 605.
- Added an early return branch to bli_thrinfo_create_for_cntl() that
detects and quickly handles cases where no parallelism is being
obtained within BLIS (i.e., single-threaded execution). Note that
this special case handling was/is already present in
bli_thrinfo_sup_create_for_cntl().
- CREDITS file update.

commit cad10410b2305bc0e328c5f2517ab02593b53428
Author: Ivan Korostelev <ivan23korgmail.com>
Date: Thu Mar 10 09:58:14 2022 -0600

POWER10: edge cases in microkernel (620)

Use new API for POWER10 gemm microkernel

commit 71851a0549276b17db18a0a0c8ab4f54493bf033
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 8 17:38:09 2022 -0600

Fixed level-3 performance bug in haswell ukernels.

Details:
- Fixed a performance regression affecting nearly all level-3 operations
that use the 'haswell' sgemm and dgemm microkernels. This regression
was introduced in 54fa28b, caused by an ill-formed conditional
expression in the assembly code that controls whether cache lines of C
should be prefetched as rows or as columns. Essentially, the two
branches were reversed, causing incomplete prefetching to occur for
both row- and column-stored instances of matrix C. Thanks to Devin
Matthews for his help finding and fixing this bug.

commit 84732bf95634ac606c5f2661d9474318e366c386
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 28 12:19:31 2022 -0600

Revamp how tools are handled/checked by configure.

Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC,
PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
- If the user specifies a tool via an environment variable (e.g.
CC=gcc) and that tool does not seem valid, print an error message
and abort configure, unless the tool is optional (e.g. CXX or FC),
in which case a warning message is printed instead.
- The definition of "seems valid" above amounts to:
- responding to at least one of a basic set of command line options
(e.g. --version, -V, -h) if the os_name is Linux (since GNU tools
tend to respond to flags such as --version) or if the tool in
question is CC, CXX, FC, or PYTHON (which tend to respond to the
expected flags regardless of OS)
- the binary merely existing for AR and RANLIB on Darwin/OSX/BSD.
(These OSes tend to have non-GNU versions of ar and ranlib, which
typically do not respond to --version and friends.)
- This PR addresses 584. Thanks to Devin Matthews for suggesting some
of the changes in this commit.

commit d5146582b1f1bcdccefe23925d3b114d40cd7e31
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Feb 23 03:35:46 2022 +0900

ArmSVE Ensure Non-zero Block Size (615)

Fixes 613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

commit 4d8352309784403ed6719528968531ffb4483947
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Feb 23 01:03:47 2022 +0900

Add armsve to arm64 Metaconfig (614)

Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes 612.

commit c9700f369aa84fc00f36c4b817ffb7dab72b865d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 15 15:36:52 2022 -0600

Renamed SIMD-related macro constants for clarity.

Details:
- Renamed the following macros defined in bli_kernel_macro_defs.h:

BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS
BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE

Also updated all instances of these macros elsewhere, including
subconfigurations, source code, and documentation. Thanks to Devin
Matthews for suggesting this change.

commit ee9ff988c49f16696679d4c6cd3dcfcac7295be7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 15 15:01:51 2022 -0600

Move edge cases to gemmtrsm ukrs; doc updates.

Details:
- Moved edge-case handling into the gemmtrsm microkernel. This required
changing the microkernel API to take m and n dimension parameters as
well as updating all existing gemmtrsm microkernel function pointer
types, function signatures, and related definitions to take m and n
dimensions. Also updated all existing gemmtrsm kernels in the
'kernels' directory (which for now is limited to haswell and penryn
kernel sets, plus native and 1m-based reference kernels in
'ref_kernels') to take m and n dimensions, and implemented edge-case
handling within those microkernels via a collection of new C
preprocessor macros defined within bli_edge_case_macro_defs.h. Note
that the edge-case handling for gemm-like operations had already
been relocated into the gemm microkernel in 54fa28b.
- Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in
bli_edge_case_macro_defs.h to allow for easier reading.
- Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up
the bullet under "Implementation Notes for gemm" that covers alignment
issues. (Thanks to Ivan Korostelev for pointing out the confusing and
outdated language in issue 591.)
- Other minor tweaks to KernelsHowTo.md.
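
Illustration (a plain-C sketch, not the macros in bli_edge_case_macro_defs.h): one way a microkernel can handle edge cases itself once it receives m and n is to compute a full register tile into a temporary buffer and then update only the m x n part of C. The blocksizes DEMO_MR and DEMO_NR, the packed layouts, and the function name are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical register blocksizes for this sketch. */
enum { DEMO_MR = 4, DEMO_NR = 4 };

/* C(0:m,0:n) += A * B, where a is a packed k x DEMO_MR micropanel (the
   DEMO_MR elements for each k are contiguous) and b is a packed
   k x DEMO_NR micropanel. rs_c and cs_c are the row and column strides
   of C. Requires m <= DEMO_MR and n <= DEMO_NR. */
void demo_gemm_ukr( size_t m, size_t n, size_t k,
                    const double* a, const double* b,
                    double* c, size_t rs_c, size_t cs_c )
{
    double ct[ DEMO_MR * DEMO_NR ] = { 0.0 };

    assert( m <= DEMO_MR && n <= DEMO_NR );

    /* Always compute the full DEMO_MR x DEMO_NR tile into the temporary. */
    for ( size_t p = 0; p < k; ++p )
        for ( size_t i = 0; i < DEMO_MR; ++i )
            for ( size_t j = 0; j < DEMO_NR; ++j )
                ct[ i*DEMO_NR + j ] += a[ p*DEMO_MR + i ] * b[ p*DEMO_NR + j ];

    /* Update only the m x n portion of C that actually exists. */
    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
            c[ i*rs_c + j*cs_c ] += ct[ i*DEMO_NR + j ];
}
```

With this arrangement the caller no longer needs a separate edge-case path; it passes the true m and n for every tile, including partial ones.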

commit 25061593460767221e1066f9d720fa6676bbed8f
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 20:11:55 2022 -0600

Don't use `-Wl,-flat-namespace`.

Flat namespaces can cause problems due to conflicting system libraries,
etc., so just mark `xerbla_` as a weak symbol on macOS instead.

commit 5a4d3f5208d3d8cc1827f8cc90414c764b7ebab3
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 17:28:30 2022 -0600

Use -flat_namespace option to link on macOS

Fixes 611.

commit 26742910a087947780a089360e2baf82ea109e01
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Feb 13 16:53:45 2022 -0600

Update CC_VENDOR logic

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

commit 2f3872e01d51545c687ae2c8b2650e00552111a7
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Feb 7 17:14:49 2022 +0900

ArmSVE Adopts Label Wrapper

For clang (& armclang?) compilation.

Hopefully solves 609 .

commit 72089bb2917b78d99cf4f27c69125bf213ee54e6
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Feb 5 16:56:04 2022 +0900

ArmSVE Use Predicate in M-Direction

No need to query MR during kernel runtime.

commit 9cc897f37455d52fbba752e3801f1a9d4a5bfdc1
Author: Ruqing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Feb 3 16:40:02 2022 +0000

Fix SVE Compil.

commit b5df1811f1bc8212b2cda6bb97b79819afe236a8
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Feb 3 02:31:29 2022 +0900

Armv8a, ArmSVE: Simplify Gen-C

commit 35195bb5cea5d99eb3eaf41e3815137d14ceb52d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 31 10:29:50 2022 -0600

Add armclang detection to configure.

armclang is treated as regular clang. Fixes 606. [ci skip]

commit 0be9282cdccf73342d8571d3f7971a9b0af72363
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 26 17:46:24 2022 -0600

Updated zen3 macro constant names.

Details:
- In config/zen3/bli_family_zen3.h, renamed:
BLIS_SMALL_MATRIX_A_THRES_M_GEMMT -> _M_SYRK
BLIS_SMALL_MATRIX_A_THRES_N_GEMMT -> _N_SYRK
Thanks to Jeff Diamond for helping spot the stale _GEMMT naming.

commit 0ab20c0e72402ba0b17fe2c3ed3e16bf2ace0fd3
Author: Jeff Hammond <jehammondnvidia.com>
Date: Thu Jan 13 07:29:56 2022 -0800

the Apple local label thing is required by Clang in general

egaudry and I both saw this issue on Linux with Clang 10.

Compiling obj/thunderx2/kernels/armv8a/3/sup/bli_gemmsup_rv_armv8a_asm_d4x8m.o ('thunderx2' CFLAGS for kernels)
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c:171:49: fatal error: invalid symbol redefinition
" \n\t"
^
<inline asm>:90:5: note: instantiated into assembly here
.SLOOPKITER:
^
1 error generated.

Signed-off-by: Jeff Hammond <jehammondnvidia.com>

commit 81f93be0561c705ae6823d19e40849facc40bef7
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 10 10:19:47 2022 -0600

Fix row-/column-major pref. in 16x8 haswell sgemm ukr (unused)

commit 268ce1f29a717d18304713ecc25a2eafe41838c7
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 10 10:17:17 2022 -0600

Relax alignment constraints

Remove alignment of temporary AB buffer in edge case handling macros unless alignment is specifically requested (e.g. Core2, SDB/IVB). Fixes 595.

commit 3f2440b0226d5e23a43d12105d74aa917cd6c610
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 6 14:57:36 2022 -0600

Added m, n dims to gemmd/gemmlike ukernel calls.

Details:
- Updated the gemmd addon and the gemmlike sandbox code to use the new
microkernel calling sequence, which now includes m and n dimensions so
that the microkernel has all the information necessary to handle edge
cases. Thanks to Jeff Diamond for catching this, which ideally would
have been included in commit 54fa28b.
- Retired var2 of both gemmd and gemmlike to 'attic' directories and
removed their corresponding prototypes. In both cases, var2 was a
variant of the block-panel algorithm where edge-case handling was
abstracted away to a microkernel wrapper. (Since this is now the
official behavior of BLIS microkernels, I saw no need to have it
included as a separate code path.)
- Comment updates.

commit 864bfab4486ac910ef9a366e9ade4b45a39747fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 4 15:10:34 2022 -0600

CREDITS file update.

commit 466b68a3ad118342dc49a8130b7b02f5e7748521
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun Jan 2 14:59:41 2022 -0600

Add unique tag to branch labels for Apple ARM64.

Add `%=` tag to branch labels, which expands to a unique identifier for each inline assembly block. This prevents duplicate symbol errors on Apple Silicon (594). Fixes 594. [ci skip] since we can't test Apple Silicon anyways...
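
To illustrate the `%=` mechanism described above, here is a minimal sketch (not the BLIS kernel code). The names `toy_mark`, `toy_mark_twice`, and the label prefix `toy_label_` are hypothetical. It assumes a GCC- or clang-compatible compiler; `%=` is only expanded in extended asm, which is why the statement carries a clobber list.

```c
#include <assert.h>

/* Hypothetical helper: emits one assembler label per expansion.
   "%=" expands to a number that is unique to each instance of the
   asm statement, so inlining this function more than once into the
   same translation unit does not define the same label twice. */
static inline int toy_mark(int x)
{
    __asm__ volatile ("toy_label_%=:" : : : "memory");
    return x + 1;
}

/* Two call sites. If both are inlined, a fixed label name such as
   "toy_label:" would be emitted twice and the assembler would reject
   the redefinition. */
int toy_mark_twice(int x)
{
    return toy_mark(toy_mark(x));
}
```

With a fixed label in place of `toy_label_%=`, an optimizing build that inlines both calls would be expected to fail with a symbol redefinition error similar to the one quoted in the later Clang-on-Linux commit.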

commit 08174a2f6ebbd8ed5aa2bc4edc45da80962f06bb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jan 1 21:35:19 2022 +0900

Evict <arm_sve.h> Requirement for SVE GEMM

For 8 <= GCC < 10 compatibility.

commit 54fa28bd847b389215cffb57a83dc9b3dce79c86
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Dec 24 08:00:33 2021 -0600

Move edge cases to gemm ukr; more user-custom mods. (583)

Details:
- Moved edge-case handling into the gemm microkernel. This required
changing the microkernel API to take m and n dimension parameters.
This required updating all existing gemm microkernel function pointer
types, function signatures, and related definitions to take m and n
dimensions. We also updated all existing kernels in the 'kernels'
directory to take m and n dimensions, and implemented edge-case
handling within those microkernels via a collection of new C
preprocessor macros defined within bli_edge_case_macro_defs.h. Also
removed the assembly code that formerly would handle general stride
IO on the microtile, since this can now be handled by the same code
that does edge cases.
- Pass the obj_t.ker_fn (of matrix C) into bli_gemm_cntl_create() and
bli_trsm_cntl_create(), where this function pointer is used in lieu of
the default macrokernel when it is non-NULL, and ignored when it is
NULL.
- Re-implemented macrokernel in bli_gemm_ker_var2.c to be a single
function using byte pointers rather than one function for each
floating-point datatype. Also, obtain the microkernel function pointer
from the .ukr field of the params struct embedded within the obj_t
for matrix C (assuming params is non-NULL and contains a non-NULL
value in the .ukr field). Communicate both the gemm microkernel
pointer to use as well as the params struct to the microkernel via
the auxinfo_t struct.
- Defined gemm_ker_params_t type (for the aforementioned obj_t.params
struct) in bli_gemm_var.h.
- Retired the separate _md macrokernel for mixed datatype computation.
We now use the reimplemented bli_gemm_ker_var2() instead.
- Updated gemmt macrokernels to pass m and n dimensions into microkernel
calls.
- Removed edge-case handling from trmm and trsm macrokernels.
- Moved most of bli_packm_alloc() code into a new helper function,
bli_packm_alloc_ex().
- Fixed a typo bug in bli_gemmtrsm_u_template_noopt_mxn.c.
- Added test/syrk_diagonal and test/tensor_contraction directories with
associated code to test those operations.
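
The idea of a microkernel that receives m and n and handles edge cases itself can be sketched as follows. This is a simplified illustration under stated assumptions, not the BLIS microkernel API: the function name `toy_dgemm_ukr`, the blocksizes `TOY_MR` and `TOY_NR`, and the argument order are all hypothetical, and the real implementation uses the macros in bli_edge_case_macro_defs.h.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_MR = 4, TOY_NR = 4 };  /* hypothetical register blocksizes */

/* Computes C := beta*C + alpha*A*B on an m x n tile, where m <= TOY_MR
   and n <= TOY_NR. a is a packed TOY_MR x k micropanel stored by
   columns; b is a packed k x TOY_NR micropanel stored by rows. */
void toy_dgemm_ukr(size_t m, size_t n, size_t k,
                   double alpha, const double *a, const double *b,
                   double beta, double *c, size_t rs_c, size_t cs_c)
{
    double ab[TOY_MR * TOY_NR] = { 0.0 };

    /* Always compute the full TOY_MR x TOY_NR product into a local tile. */
    for (size_t p = 0; p < k; ++p)
        for (size_t j = 0; j < TOY_NR; ++j)
            for (size_t i = 0; i < TOY_MR; ++i)
                ab[i + j * TOY_MR] += a[p * TOY_MR + i] * b[p * TOY_NR + j];

    /* Edge-case handling: write back only the m x n part that exists
       in C. General strides (rs_c, cs_c) go through the same loop. */
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i) {
            double *cij = &c[i * rs_c + j * cs_c];
            double  t   = alpha * ab[i + j * TOY_MR];
            *cij = (beta == 0.0) ? t : beta * (*cij) + t;
        }
}
```

Because the write-back loop is bounded by m and n, the caller no longer needs a separate temporary buffer and copy step for partial tiles.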

commit 961d9d509dd94f3a66f7095057e3dc8eb6d89839
Author: Kiran <kiran.varagantiamd.com>
Date: Wed Dec 8 03:00:38 2021 +0530

Re-add BLIS_ENABLE_ZEN_BLOCK_SIZES macro for 'zen'.

Details:
- Added previously-deleted cpp macro block to bli_cntx_init_zen.c
targeting the Naples microarchitecture that enabled different cache
blocksizes when the number of threads exceeds 16. This commit
represents PR 573.

commit cf7d616a2fd58e293b496770654040818bf5609c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Dec 2 17:10:03 2021 -0600

Enable user-customized packm ukernel/variant. (549)

Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
.ker_params. These fields store pointers to functions and data that
will allow the user to more flexibly create custom operations while
recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
replace the diagonal offset with panel offsets, and changed strides
of both C and P to inc/ldim semantics. Updated object API to the packm
variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
definition since it has been replaced by the .pack_fn pointer in the
obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
instead of typed pointers, allowing a single function to be used
regardless of datatype. This obviated having a separate implementation
in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
always packed as column-stored row panels -- that is, identically to
that of left-hand matrix operands ("A"). This means that while we pack
matrix A normally, we actually pack B in a transposed state. This
allowed us to simplify a lot of code throughout the framework, and
also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
described above. bli_packm_init()--which is now called from within
bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
a bool that indicates whether packing should be performed (or
skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
which, among other things, defaults the new .pack_fn field of the
obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
refocuses the view of an object so that it "forgets" any offsets from
its original pointer. This function also sets the object's root field
to itself. Calls to bli_obj_reset_origin() for each matrix operand
appear in the _front() functions, after the obj_t's are aliased. This
resetting of the underlying matrices' origins is needed in preparation
for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
libblis_test_pobj_create() to create local packed objects. Previously,
these packed objects were created by calling lower-level functions.
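
The byte-pointer approach mentioned above (one function regardless of datatype) can be sketched like this. The function `toy_pack_bytes` and its signature are hypothetical and only illustrate the technique; the actual bli_packm_blk_var1() does considerably more (panel partitioning, scalars, schemas).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies an m x n matrix with element strides (rs, cs) into a
   contiguous column-stored buffer. Elements are moved as opaque
   blocks of elem_size bytes, so the same function serves float,
   double, and complex types alike. */
void toy_pack_bytes(size_t m, size_t n, size_t elem_size,
                    const void *src, size_t rs, size_t cs, void *dst)
{
    const char *s = (const char *)src;
    char       *d = (char *)dst;

    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < m; ++i)
            memcpy(d + (i + j * m) * elem_size,
                   s + (i * rs + j * cs) * elem_size,
                   elem_size);
}
```

Since the strides are arguments, packing an operand in a transposed state only requires swapping rs and cs (and m and n) at the call site.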

commit e229e049ca08dfbd45794669df08a71dba892925
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 1 17:36:22 2021 -0600

Added recu-sed.sh script to 'build' directory.

Details:
- Added a recursive sed script to the 'build' directory.

commit 12c66a4acc77bf4927b01e2358e2ac10b61e0a53
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 14:43:53 2021 -0600

Minor updates to README.md, docs/Addons.md.

Details:
- Add additional mentions of addons to README.md, including in the
"What's New" section.
- Removed mention of sandboxes from the long list of advantages
provided by BLIS.
- Very minor description update to opening line of Addons.md.

commit a4bc03b990fe0572001eb6409efd12cd70677dcf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 13:29:00 2021 -0600

Brief mention/link to Addons.md in README.md.

Details:
- Add a blurb about the new addons feature to the "Documentation for
BLIS developers" section of the README.md, which also links to the
Addons.md document.

commit b727645eb7a8df39dee74068f734da66322fe0b3
Merge: 9be97c15 7bde468c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 19 13:22:09 2021 -0600

Merge branch 'dev'

commit 9be97c150e19fa58bca30cb993a6509ae21e2025
Author: Madan mohan Manokar <86282872+madanm3users.noreply.github.com>
Date: Thu Nov 18 00:46:46 2021 +0530

Support all four dts in test/test_her[2][k].c (578)

Details:
- Replaced the hard-coded calls to double-precision real syr, syr2,
syrk, and syr2k in the corresponding standalone test drivers in the
'test' directory with conditional branches that will call the
appropriate BLAS interface depending on which datatype is enabled.
Thanks to Madan mohan Manokar for this improvement.
- CREDITS file update.

commit 26e4b6b29312b472c3cadf95ccdf5240764777f4
Author: Dipal M Zambare <71366780+dzambareusers.noreply.github.com>
Date: Thu Nov 18 00:32:00 2021 +0530

Added support for AMD's Zen3 microarchitecture.

Details:
- Added a new 'zen3' subconfiguration targeting support for the AMD Zen3
microarchitecture (561). Thanks to AMD for this contribution.
- Restructured clang and AOCC support for zen, zen2, and zen3
make_defs.mk files. The clang and AOCC version detection now happens
in configure, not in the subconfigurations' makefile fragments. That
is, we've added logic to configure that detects the version of
clang/AOCC, outputs an appropriate variable to config.mk
(ie: CLANG_OT_*, AOCC_OT_*), and then checks for it within the
makefile fragment (as is currently done for the GCC_OT_* variables).
- Added configure support for a GCC_OT_10_1_0 variable (and associated
substitution anchor) to communicate whether the gcc version is older
than 10.1.0, and use this variable to check for recent enough versions
of gcc to use -march=znver3 in the zen3 subconfig.
- Inlined the contents of config/zen/amd_config.mk into the zen and zen2
make_defs.mk so that the files are self-contained, harmonizing the
format of all three Zen-based subconfigurations' make_defs.mk files.
- Added indenting (with spaces) of GNU make conditionals for easier
reading in zen, zen2, and zen3 make_defs.mk files.
- Adjusted the range of models checked by bli_cpuid_is_zen() (which was
previously 0x00 ~ 0xff and is now 0x00 ~ 0x2f) so that it is
completely disjoint from the models checked by bli_cpuid_is_zen2()
(0x30 ~ 0xff). This is normally necessary because Zen and Zen2
microarchitectures share the same family (23, or 0x17), and so the
model code is the only way to differentiate the two. But in our case,
fixing the model range for zen *wasn't* actually necessary since we
checked for zen2 first, and therefore the wide zen range acted like
the 'else' of an 'if-else' statement. That said, the change helps
improve clarity for the reader by encoding useful knowledge, which
was obtained from https://en.wikichip.org/wiki/amd/cpuid .
- Added zen2.def and zen3.def files to the collection in travis/cpuid.
Note that support for zen, zen2, and zen3 is now present, and while
all the three microarchitectures have identical instruction sets from
the perspective of BLIS microkernels, they each correspond to
different subconfigurations and therefore merit separate testing.
Thanks to Devin Matthews for his guidance in hacking these files as
slight modifications of zen.def.
- Enabled testing of zen2 and zen3 via the SDE in travis/do_sde.sh.
Now, zen, zen2, and zen3 are tested through the SDE via Travis CI
builds.
- Updated travis/do_sde.sh to grab the SDE tarball from a new ci-utils
repository on GitHub rather than on Intel's website. This change was
made in an attempt to circumvent recent troubles with Travis CI not
being able to download the SDE directly from Intel's website via curl.
Thanks to Devin Matthews for suggesting the idea.
- Updated travis/do_sde.sh to grab the latest version (8.69.1) of the
Intel SDE from the flame/ci-utils repository.
- Updated .travis.yml to use gcc 9. The file was previously using gcc 8,
which did not support -march=znver2.
- Created amd64_legacy umbrella family in config_registry for targeting
older (bulldozer, piledriver, steamroller, and excavator)
microarchitectures and moved those same subconfigs out of the amd64
umbrella family. However, x86_64 retains amd64_legacy as a constituent
member.
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs, and
kernel sets. Two of those lists are built via substitution of
umbrella families with their subconfig members, and one of those
lists was improperly performing the substitution in a way that would
erroneously match on partial umbrella family names. That code was
changed to match the code that was already doing the substitution
properly, via substitute_words(). Also added comments noting the
importance of using substitute_words() in both instances.
- Comment updates.
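
The disjoint model ranges described above can be sketched as a small classifier. The family value (0x17) and the model ranges come from the commit message; the enum and function names are hypothetical, and the real bli_cpuid_is_zen() and bli_cpuid_is_zen2() also check other properties that are omitted here.

```c
#include <assert.h>

typedef enum { TOY_ARCH_OTHER, TOY_ARCH_ZEN, TOY_ARCH_ZEN2 } toy_arch_t;

/* Zen and Zen2 share family 0x17, so only the model separates them.
   With disjoint ranges the result does not depend on which range is
   tested first. */
toy_arch_t toy_classify_family17(unsigned family, unsigned model)
{
    if (family != 0x17) return TOY_ARCH_OTHER;

    if (model <= 0x2f)                  return TOY_ARCH_ZEN;   /* 0x00 ~ 0x2f */
    if (model >= 0x30 && model <= 0xff) return TOY_ARCH_ZEN2;  /* 0x30 ~ 0xff */

    return TOY_ARCH_OTHER;
}
```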

commit 74c0c622216aba0c24aa2c3a923811366a160cf5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 16:06:33 2021 -0600

Reverted cbc88fe.

Details:
- Reverted the annotation of some markdown code blocks with 'bash'
after realizing that the in-browser syntax highlighting was not
worthwhile.

commit cbc88feb51b949ce562d044cf9f99c4e46bb8a39
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 16:02:39 2021 -0600

Marked some markdown shell code blocks as 'bash'.

Details:
- Annotated the code blocks that represent shell commands and output as
'bash' in README.md and BuildSystem.md.

commit 78cd1b045155ddf0b9ec6e2ab815f2b216ad9a9e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 16 15:53:40 2021 -0600

Added 'Example Code' section to README.md.

Details:
- Inserted a new 'Example Code' section into the README.md immediately
after the 'Getting Started' section. Thanks to Devin Matthews for
recommending this addition.
- Moved the 'Performance' section of the README down slightly so that it
appears after the 'Documentation' section.

commit 7bde468c6f7ecc4b5322d2ade1ae9c0b88e6b9f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 13 16:39:37 2021 -0600

Added support for addons.

Details:
- Implemented a new feature called addons, which are similar to
sandboxes except that there is no requirement to define gemm or any
other particular operation.
- Updated configure to accept --enable-addon=<name> or -a <name> syntax
for requesting an addon be included within a BLIS build. configure now
outputs the list of enabled addons into config.mk. It also outputs the
corresponding include directives for the addons' headers to a new
companion to the bli_config.h header file named bli_addon.h. Because
addons may wish to make use of existing BLIS types within their own
definitions, the addons' headers must be included sometime after that
of bli_config.h (which currently is included before bli_type_defs.h).
This is why the include directives needed to go into a new top-level
header file rather than the existing bli_config.h file.
- Added a markdown document, docs/Addons.md, to explain addons, how to
build with them, and what assumptions their authors should keep in
mind as they create them.
- Added a gemmlike-like implementation of sandwich gemm called 'gemmd'
as an addon in addon/gemmd. The code uses a 'bao_' prefix for local
functions, including the user-level object and typed APIs.
- Updated .gitignore so that git ignores bli_addon.h files.

commit 7bc8ab485e89cfc6032932e57929e208a28f4be5
Author: Meghana-vankadari <74656386+Meghana-vankadariusers.noreply.github.com>
Date: Fri Nov 12 04:16:14 2021 +0530

Added BLAS/CBLAS APIs for axpby, gemm_batch. (566)

Details:
- Expanded the BLAS compatibility layer to include support for
?axpby_() and ?gemm_batch_(). The former is a straightforward
BLAS-like interface into the axpbyv operation while the latter
implements a batched gemm via loops over bli_?gemm(). Also
expanded the CBLAS compatibility layer to include support for
cblas_?axpby() and cblas_?gemm_batch(), which serve as wrappers to
the corresponding (new) BLAS-like APIs. Thanks to Meghana Vankadari
for submitting these new APIs via 566.
- Fixed a long-standing bug in common.mk that for some reason never
manifested until now. Previously, CBLAS source files were compiled
*without* the location of cblas.h being specified via a -I flag.
I'm not sure why this worked, but it may be due to the fact that
the cblas.h file resided in the same directory as all of the CBLAS
source, and perhaps compilers implicitly add a -I flag for the
directory that corresponds to the location of the source file being
compiled. This bug only showed up because some CBLAS-like source code
was moved into an 'extra' subdirectory of that frame/compat/cblas/src
directory. After moving the code, compilation for those files failed
(because the cblas.h header file, presumably, could not be found in
the same location). This bug was fixed within common.mk by explicitly
adding the cblas.h directory to the list of -I flags passed to the
compiler.
- Added test_axpbyv.c and test_gemm_batch.c files to 'test' directory,
and updated test/Makefile to build those drivers.
- Fixed typo in error message string in cblas_sgemm.c.
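
For readers unfamiliar with axpby, the operation computes y := alpha * x + beta * y. A minimal reference loop is sketched below; `toy_daxpby` is a hypothetical stand-in for illustration and is not the signature of the new ?axpby_() or cblas_?axpby() interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * x + beta * y over n elements, with positive element
   strides incx and incy. */
void toy_daxpby(size_t n, double alpha, const double *x, size_t incx,
                double beta, double *y, size_t incy)
{
    for (size_t i = 0; i < n; ++i)
        y[i * incy] = alpha * x[i * incx] + beta * y[i * incy];
}
```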

commit 28b0982ea70c21841fb23802d38f6b424f8200e1
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Nov 10 12:34:50 2021 -0600

Refactored her[2]k/syr[2]k in terms of gemmt. (531)

Details:
- Renamed herk macrokernels and supporting files and functions to gemmt,
which is possible since at the macrokernel level they are identical.
Then recast herk/her2k/syrk/syr2k in terms of gemmt within the expert
level-3 oapi (bli_l3_oapi_ex.c) while also redefining them as literal
functions rather than cpp macros that instantiate multiple functions.
Thanks to Devin Matthews for his efforts on this issue (531).
- Check that the maximum stack buffer size is sufficiently large
relative to the register blocksizes for each datatype, and do so when
the context is initialized rather than when an operation is called.
Note that with this change, users who pass in their own contexts into
the expert interfaces currently will *not* have any checks performed.
Thanks to Devin Matthews for suggesting this change.

commit cfa3db3f3465dc58dbbd842f4462e4b49e7768b4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 3 18:13:56 2021 -0500

Fixed bug in mixed-dt gemm introduced in e9da642.

Details:
- Fixed a bug that broke certain mixed-datatype gemm behavior. This
bug was introduced recently in e9da642 when the code that performs
the operation transposition (for microkernel IO preference purposes)
was moved up so that it occurred sooner. However, when I moved that
code, I failed to notice that there was a cpp-protected "if"
conditional that applied to the entire code block that was moved. Once
the code block was relocated, the orphaned if-statement was now
(erroneously) glomming on to the next thing that happened to be in the
function, which happened to be the call to bli_rntm_set_ways_for_op(),
causing a rather odd memory exhaustion error in the sba due to the
num_threads field of the rntm_t still being -1 (because the rntm_t
fields were never processed as they should have been). Thanks to
ArcadioN09 (Snehith) for reporting this error and helpfully including
relevant memory trace output.
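
The class of bug described above, a preprocessor-guarded "if" left behind after its body was moved, can be reproduced in a few lines. This is an illustrative sketch only: the macro `TOY_ENABLE_MIXED_DT` and both functions are hypothetical and do not mirror the actual BLIS code.

```c
#include <assert.h>

#define TOY_ENABLE_MIXED_DT 1  /* hypothetical feature macro */

/* Buggy shape: the guarded block was moved away, but the "if" stayed
   behind and now governs whatever statement happens to come next. */
int toy_setup_buggy(int is_mixed)
{
    int ways_set = 0;
#if TOY_ENABLE_MIXED_DT
    if (is_mixed)
#endif
    ways_set = 1;   /* stands in for an unrelated setup call */
    return ways_set;
}

/* Fixed shape: the conditional and its body travel together in
   braces, so the setup statement always runs. */
int toy_setup_fixed(int is_mixed)
{
    int transposed = 0;
    int ways_set   = 0;
#if TOY_ENABLE_MIXED_DT
    if (is_mixed) { transposed = 1; }
#endif
    (void)transposed;
    ways_set = 1;
    return ways_set;
}
```

In the buggy variant the setup step is silently skipped whenever the condition is false, which is analogous to the thread-count field being left unprocessed.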

commit f065a8070f187739ec2b34417b8ab864a7de5d7e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 28 16:05:43 2021 -0500

Removed support for 3m, 4m induced methods.

Details:
- Removed support for all induced methods except for 1m. This included
removing code related to 3mh, 3m1, 4mh, 4m1a, and 4m1b as well as any
code that existed only to support those implementations. These
implementations were rarely used and posed code maintenance challenges
for BLIS's maintainers going forward.
- Removed reference kernels for packm that pack 3m and 4m micropanels,
and removed 3m/4m-related code from bli_cntx_ref.c.
- Removed support for 3m/4m from the code in frame/ind, then reorganized
and streamlined the remaining code in that directory. The *ind(),
*nat(), and *1m() APIs were all removed. (These additional API layers
no longer made as much sense with only one induced method (1m) being
supported.) The bli_ind.c file (and header) were moved to frame/base
and bli_l3_ind.c (and header) and bli_l3_ind_tapi.h were moved to
frame/3.
- Removed 3m/4m support from the code in frame/1m/packm.
- Removed 3m/4m support from trmm/trsm macrokernels and simplified some
pointer arithmetic that was previously expressed in terms of the
bli_ptr_inc_by_frac() static inline function (whose definition was
also removed).
- Removed the following subdirectories of level-0 macro headers from
frame/include/level0: ri3, rih, ri, ro, rpi. The level-0 scalar macros
defined in these directories were used exclusively for 3m and 4m
method codes.
- Simplified bli_cntx_set_blkszs() and bli_cntx_set_ind_blkszs() in
light of 1m being the only induced method left within BLIS.
- Removed dt_on_output field within auxinfo_t and its associated
accessor functions.
- Re-indexed the 1e/1r pack schemas after removing those associated with
variants of the 3m and 4m methods. This leaves two bits unused within
the pack format portion of the schema bitfield. (See bli_type_defs.h
for more info.)
- Spun off the basic and expert interfaces to the object and typed APIs
into separate files: bli_l3_oapi.c and bli_l3_oapi_ex.c; bli_l3_tapi.c
and bli_l3_tapi_ex.c.
- Moved the level-3 operation-specific _check function calls from the
operations' _front() functions to the corresponding _ex() function of
the object API. (This change roughly maintains where the _check()
functions are called in the call stack but lays the groundwork for
future changes that may come to the level-3 object APIs.) Minor
modifications to bli_l3_check.c to allow the check() functions to be
called from the expert interface APIs.
- Removed support within the testsuite for testing the aforementioned
induced methods, and updated the standalone test drivers in the 'test'
directory to reflect the retirement of those induced methods.
- Modified the sandbox contract so that the user is obliged to define
bli_gemm_ex() instead of bli_gemmnat(). (This change was made in light
of the *nat() functions no longer existing.) Also updated the existing
'power10' and 'gemmlike' sandboxes to come into compliance with the
new sandbox rules.
- Updated BLISObjectAPI.md, BLISTypedAPI.md, Testsuite.md documentation
to reflect the retirement of 3m/4m, and also modified Sandboxes.md to
bring the document into alignment with new conventions.
- Updated various comments; removed segments of commented-out code.

commit e8caf200a908859fa5f5ea2049911a9bdaa3d270
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 18 13:04:15 2021 -0500

Updated do_sde.sh to get SDE from GitHub.

Details:
- Updated travis/do_sde.sh so that the script downloads the SDE tarball
from a new ci-utils repository on GitHub rather than from Intel's
website. This change is being made in an attempt to circumvent Travis
CI's recent troubles with downloading the SDE from Intel's website via
curl. Thanks to Devin Matthews for suggesting the idea.

commit 290ff4b1c26737b074d5abbf76966bc22af8c562
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 14 16:09:43 2021 -0500

Disable SDE testing of old AMD microarchitectures.

Details:
- Skip testing on piledriver, steamroller, and excavator platforms
in travis/do_sde.sh.

commit 514fd101742dee557e5eb43d0023a221ae8a7172
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 14 13:50:28 2021 -0500

Fixed substitution bug in configure.

Details:
- Fixed a bug in configure related to the building of the so-called
config list. When processing the contents of config_registry,
configure creates a series of structures and lists that allow for
various mappings related to configuration families, subconfigs,
and kernel sets. Two of those lists are built via substitution
of umbrella families with their subconfig members, and one of
those lists was improperly performing the substitution in a way
that would erroneously match on partial umbrella family names.
That code was changed to match the code that was already doing
the substitution properly, via substitute_words().
- Added comments noting the importance of using substitute_words()
in both instances.
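
The difference between whole-word substitution and partial matching can be sketched in C as follows. The real substitute_words() lives in the configure shell script; `toy_substitute_words` below is a hypothetical rendering of the same idea, and the member list used in the usage note is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replaces every space-delimited word in list that equals word with
   repl, writing the result to out. Returns 0 on success or -1 if out
   is too small. Words that merely contain word are left untouched. */
int toy_substitute_words(const char *list, const char *word,
                         const char *repl, char *out, size_t outsz)
{
    size_t wlen = strlen(word);
    size_t rlen = strlen(repl);
    size_t used = 0;

    if (outsz == 0) return -1;
    out[0] = '\0';

    for (const char *p = list; *p != '\0'; ) {
        if (*p == ' ') { ++p; continue; }            /* skip separators */

        size_t      len    = strcspn(p, " ");        /* length of this word */
        int         match  = (len == wlen && strncmp(p, word, len) == 0);
        const char *tok    = match ? repl : p;
        size_t      toklen = match ? rlen : len;
        size_t      sep    = (used > 0 && toklen > 0) ? 1 : 0;

        if (used + sep + toklen + 1 > outsz) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, tok, toklen);
        used += toklen;
        out[used] = '\0';
        p += len;
    }
    return 0;
}
```

Given the list "amd64 amd64_legacy intel64", replacing the word amd64 with the made-up members "zen zen2" leaves amd64_legacy intact, whereas a plain substring replacement would corrupt it into "zen zen2_legacy".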

commit e9da6425e27a9d63c9fef92afc2dd750c601ccd7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 13 14:15:38 2021 -0500

Allow use of 1m with mixing of row/col-pref ukrs.

Details:
- Fixed a bug that broke the use of 1m for dcomplex when the single-
precision real and double-precision real ukernels had opposing I/O
preferences (row-preferential sgemm ukernel + column-preferential
dgemm ukernel, or vice versa). The fix involved adjusting the API
to bli_cntx_set_ind_blkszs() so that the induced method context init
function (e.g., bli_cntx_init_<subconfig>_ind()) could call that
function for only one datatype at a time. This allowed the blocksize
scaling (which varies depending on whether we're doing 1m_r or 1m_c)
to happen on a per-datatype basis. This fixes issue 557. Thanks to
Devin Matthews and RuQing Xu for helping discover and report this bug.
- The aforementioned 1m fix required moving the 1m_r/1m_c logic from
bli_cntx_ref.c into a new function, bli_l3_set_schemas(), which is
called from each level-3 _front() function. The pack_t schemas in the
cntx_t were also removed entirely, along with the associated accessor
functions. This in turn required updating the trsm1m-related virtual
ukernels to read the pack schema for B from the auxinfo_t struct
rather than the context. This also required slight tweaks to
bli_gemm_md.c.
- Repositioned the logic for transposing the operation to accommodate
the microkernel IO preference. This mostly only affects gemm. Thanks
to Devin Matthews for his help with this.
- Updated dpackm pack ukernels in the 'armsve' kernel set to avoid
querying pack_t schemas from the context.
- Removed the num_t dt argument from the ind_cntx_init_ft type defined
in bli_gks.c. The context initialization functions for induced methods
were previously passed a dt argument, but I can no longer figure out
*why* they were passed this value. To reduce confusion, I've removed
the dt argument (including also from the function definition +
prototype).
- Commented out setting of cntx_t schemas in bli_cntx_ind_stage.c. This
breaks high-level implementations of 3m and 4m, but this is okay since
those implementations will be removed very soon.
- Removed some older blocks of preprocessor-disabled code.
- Comment update to test_libblis.c.

commit 81e103463214d589071ccbe2d90b8d7c19a186e4
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Wed Oct 13 20:28:02 2021 +0200

Alloc at least 1 elem in pool_t block_ptrs. (560)

Details:
- Previously, the block_ptrs field of the pool_t was allowed to be
initialized as any unsigned integer, including 0. However, a length of
0 could be problematic given that the result of malloc(0) is
implementation-defined and therefore
variable across implementations. As a safety measure, we check for
block_ptrs array lengths of 0 and, in that case, increase them to 1.
- Co-authored-by: Minh Quan Ho <minh-quan.hokalray.eu>
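
The safety measure above amounts to clamping the requested length before allocating. A minimal sketch follows, with a hypothetical helper name `toy_alloc_block_ptrs` that is not part of bli_pool.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocates an array of block pointers, never requesting zero bytes.
   On return, *len_out holds the length actually allocated, or 0 if
   the allocation failed. */
void **toy_alloc_block_ptrs(size_t requested_len, size_t *len_out)
{
    /* The result of a zero-size allocation varies across
       implementations, so bump a length of 0 up to 1. */
    size_t len  = (requested_len == 0) ? 1 : requested_len;
    void **ptrs = calloc(len, sizeof *ptrs);

    *len_out = (ptrs != NULL) ? len : 0;
    return ptrs;
}
```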

commit 327481a4b0acf485d0cbdd8635dd9b886ba3f2a7
Author: Minh Quan Ho <1337056+hominhquanusers.noreply.github.com>
Date: Tue Oct 12 19:53:04 2021 +0200

Fix insufficient pool-growing logic in bli_pool.c. (559)

Details:
- The current mechanism for growing a pool_t doubles the length of the
block_ptrs array every time the array length needs to be increased
due to new blocks being added. However, that logic did not take into
account the new total number of blocks, and the fact that the caller
may be requesting more blocks than would fit even after doubling the
current length of block_ptrs. The code comments now contain two
illustrating examples that show why, even after doubling, we must
always have at least enough room to fit all of the old blocks plus
the newly requested blocks.
- This commit also happens to fix a memory corruption issue that stems
from growing any pool_t that is initialized with a block_ptrs length
of 0. (Previously, the memory pool for packed buffers of C was
initialized with a block_ptrs length of 0, but because it is unused
this bug did not manifest by default.)
- Co-authored-by: Minh Quan Ho <minh-quan.hokalray.eu>
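
The corrected growth rule can be summarized as "double, but never end up with less room than old blocks plus new blocks". A sketch under that reading, using a hypothetical helper `toy_grown_len` rather than the actual bli_pool.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the new block_ptrs length when num_new blocks are added to
   a pool that currently tracks num_blocks blocks in an array of
   length cur_len. */
size_t toy_grown_len(size_t cur_len, size_t num_blocks, size_t num_new)
{
    size_t doubled  = 2 * cur_len;
    size_t required = num_blocks + num_new;

    return (doubled > required) ? doubled : required;
}
```

Two cases show why doubling alone is insufficient: with cur_len = 4, num_blocks = 4, and num_new = 10, doubling gives 8 but 14 slots are needed; with cur_len = 0, doubling gives 0 no matter how many blocks are requested.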

commit 32a6d93ef6e2af5e486dfd5e46f8272153d3d53d
Merge: 408906fd 2604f407
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 9 15:53:54 2021 -0500

Merge pull request 543 from xrq-phys/armsve-packm-fix

ARMSVE Block SVE-Intrinsic Kernels for GCC 8-9

commit 408906fdd8892032aa11bd061b7971128f453bef
Merge: 4277fec0 ccf16289
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 9 15:50:25 2021 -0500

Merge pull request 542 from xrq-phys/armsve-zgemm

Arm SVE CGEMM / ZGEMM Natural Kernels

commit ccf16289d2e71fd9511ccf2d13dcebbfa29deabc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:34:14 2021 +0900

Arm SVE C/ZGEMM Fix FMOV 0 Mistake

FMOV [hsd]M, imm does not allow zero immediate.
Use wzr, xzr instead.

commit 82b61283b2005f900101056e6df2a108258db602
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:17:29 2021 +0900

SH Kernel Unused Either

commit 1749dfa493054abd2e4ddba7cb21278d337e4f74
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 12:11:53 2021 +0900

Arm SVE C/ZGEMM Support *beta==0

commit 4b648e47daad256ab8ab698173a97f71ab9f75eb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 22 16:42:09 2021 +0900

Arm SVE Config armsve Use ZGEMM/CGEMM

commit f76ea905e216cf640975e6319c6d2f54aeafed2e
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Sep 21 20:38:44 2021 +0900

Arm SVE: Update Perf. Graph

Pic. size seems a bit different from upstream.
Generated w/ MATLAB. Open to any change.

commit 66a018e6ad00d9e8967b67e1aa3e23b20a7efdfe
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Sep 20 00:16:11 2021 +0900

Arm SVE CGEMM 2Vx10 Unindex Process Alpha=1.0

commit 9e1e781cb59f8fadb2a10a02376d3feac17ce38d
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Sep 19 23:30:42 2021 +0900

Arm SVE ZGEMM 2Vx10 Unindex Process Alpha=1.0

commit f7c6c2b119423e7ba7a24ae2156790e076071cba
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:47:42 2021 +0900

A64FX Config Use ZGEMM/CGEMM

commit e4cabb977d038688688aca39b366f98f9c36b7eb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:34:26 2021 +0900

Arm SVE Typo Fix ZGEMM/CGEMM C Prefetch Reg

commit b677e0d61b23f26d9536e5c363fd6bbab6ee1540
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:18:54 2021 +0900

Arm SVE Add SGEMM 2Vx10 Unindexed

commit 3f68e8309f2c5b31e25c0964395a180a80014d36
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 01:00:54 2021 +0900

Arm SVE ZGEMM Support Gather Load / Scatt. St.

commit c19db2ff826e2ea6ac54569e8aa37e91bdf7cabe
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 15 23:39:53 2021 +0900

Arm SVE Add ZGEMM 2Vx10 Unindexed

commit e13abde30b9e0e381c730c496e74bc7ae062a674
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Sep 15 04:19:45 2021 +0900

Arm SVE Add ZGEMM 2Vx7 Unindexed

commit 49b9d7998eb86f340ae7b26af3e5a135d6a8feee
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Sep 14 04:02:47 2021 +0900

Arm SVE Add ZGEMM 2Vx8 Unindexed

commit 4277fec0d0293400497ae8bcfc32be5e62319ae9
Merge: 2329d990 f44149f7
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 13:47:22 2021 -0500

Merge pull request 533 from xrq-phys/arm64-hi-bw

ARMv8 PACKM and GEMMSUP Kernels + Apple Firestorm Subconfig

commit 2329d99016fe1aeb86da4552295f497543cea311 (origin/1m_row_col_problem)
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 12:37:58 2021 -0500

Update Travis CI badge

[ci skip]

commit f44149f787ae3d4b53d9c4d8e6f23b2818b7770d
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Oct 8 02:35:58 2021 +0900

Armv8 Trash New Bulk Kernels

- They didn't bring much improvement.
- Can't register row-preferring and column-preferring ukrs at the same time.
Doing so would break 1m.

commit 70b52cadc5ef4c16431e1876b407019e6286614e
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 7 12:34:35 2021 -0500

Enable testing 1m in `make check`.

commit 2604f4071300d109f28c8438be845aeaf3ec44e4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:39:00 2021 +0900

Config ArmSVE Unregister 12xk. Move 12xk to Old

commit 1e3200326be9109eb0f8c7b9e4f952e45700cbba
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:37:14 2021 +0900

Revert __has_include(). Distinguish w/ BLIS_FAMILY_**

commit a4066f278a5c06f73b16ded25f115ca4b7728ecb
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:26:05 2021 +0900

Register firestorm into arm64 Metaconfig

commit d7a3372247c37568d142110a1537632b34b8f2ff
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:25:14 2021 +0900

Armv8 DGEMMSUP Fix Edge 6x4 Switch Case Typo

commit 2920dde5ac52e09f84aa42990aab8340421522ce
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 02:01:45 2021 +0900

Armv8 DGEMMSUP Fix 8x4m Store Inst. Typo

commit 14b13583f1802c002e195b3b48874b3ebadbeb20
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Oct 6 10:22:34 2021 -0500

Add test for Apple M1 (firestorm)

This test will run on Linux, but all the kernels should run just fine. This does not test autodetection but then none of the other ARM tests do either.

commit a024715065532400da6257b8b3124ca5aecda405
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Oct 7 00:15:54 2021 +0900

Firestorm CPUID Dispatcher

Commenting out <sys/sysctl.h> due to what is possibly an Xcode bug.

commit b9da6d55fec447d05c8b67f34ce83617123d8357
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 12:25:54 2021 +0900

Armv8 GEMMSUP Edge Cases Require Signed Ints

Fix a bug in bli_gemmsup_rd_armv8a_asm_d6x8m.c.
To guard against similar problems in the future,
change all [mn]_[iter/left] into signed ints.
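
The commit does not say what the bug was. As a minimal sketch of one way unsigned edge counters can go wrong, the C below compares a subtract-then-test check written with signed and with unsigned integers. The function names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Signed version: the difference can go negative, so the test is false
   once fewer than mr rows remain. */
static int more_rows_signed(int64_t m_left, int64_t mr)
{
    assert(mr > 0);
    return (m_left - mr) > 0;
}

/* Unsigned version: when m_left < mr the subtraction wraps modulo 2^64,
   so the test is true even though nothing is left to process. */
static int more_rows_unsigned(uint64_t m_left, uint64_t mr)
{
    assert(mr > 0);
    return (m_left - mr) > 0;
}
```

With 4 rows left and a block of 6, the signed check reports that no further block fits, while the unsigned check reports that one does.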

commit 34919de3df5dda7a06fc09dcec12ca46dc8b26f4
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 18:48:50 2021 -0500

Make error checking level a thread-local variable.

Previously, this was a global variable. Setting the value was synchronized via a mutex but reading the value was not. Of course, these accesses are almost certainly atomic, but there is still the possibility of one thread attempting to set the value and then reading the value set by another thread. For correct operation under user threading (e.g. pthreads), this should probably be thread-local with no mutex.
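
A minimal sketch of the thread-local approach described above, assuming a C11 compiler. The type and function names are hypothetical stand-ins, not the framework's actual API.

```c
#include <assert.h>

/* Hypothetical stand-ins for the framework's error-checking levels. */
typedef enum { CHECK_NONE = 0, CHECK_FULL = 1 } check_level_t;

/* One copy per thread (C11 _Thread_local), so no mutex is needed and a
   thread can only ever read the value it set itself. */
static _Thread_local check_level_t tl_check_level = CHECK_FULL;

/* Returns the calling thread's current level. */
static check_level_t check_level_get(void)
{
    return tl_check_level;
}

/* Sets the calling thread's level and returns the previous one, so the
   caller can restore it later. */
static check_level_t check_level_set(check_level_t new_level)
{
    assert(new_level == CHECK_NONE || new_level == CHECK_FULL);
    check_level_t old = tl_check_level;
    tl_check_level = new_level;
    return old;
}
```

Returning the old value from the setter lets a caller disable checking around one call and then restore it without touching any other thread's setting.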

commit c3024993c3d50236fad112822215f066496c5831
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 5 15:20:27 2021 -0500

Fix data race in testsuite.

commit 353a0d82572f26e78102cee25693130ce6e0ea5b
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 5 14:24:17 2021 -0500

Update .appveyor.yml

[ci skip]

commit 4bfadf9b561d4ebe0bbaf8b6d332f07ff531d618
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 01:51:26 2021 +0900

Firestorm Block Size Fixes

commit 40baf83f0ea2749199b93b5a8ac45c01794b008c
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Oct 6 01:00:52 2021 +0900

Armv8 Handle *beta == 0 for GEMMSUP ??r Case.

commit 079fbd42ce8cf7ea67a939b0f80f488de5821319
Merge: f5c03e9f 9905f443
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 17:21:48 2021 -0500

Merge branch 'master' into arm64-hi-bw

commit 9905f44347eea4c57ef4927b81f1c63e76a92739
Merge: 6d3036e3 64a421f6
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:58:59 2021 -0500

Merge pull request 553 from flame/rpath-fix

Add an option to use an rpath-dependent install_name on macOS

commit 6d3036e31d8a2c1acbc1260489eeb8f535a8f97a
Merge: 53377fcc eaa554aa
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:58:43 2021 -0500

Merge pull request 545 from hominhquan/clean_error

bli_error: more cleanup on the error strings array

commit 53377fcca91e595787b38e2a47780ac0c35a7e7c
Merge: d0a0b4b8 80c5366e
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:45:53 2021 -0500

Merge pull request 554 from flame/armsve-cleanup

Move unused ARM SVE kernels to "old" directory.

commit 80c5366e4a9b8b72d97fba1eab89bab8989c44f4
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 15:40:28 2021 -0500

Move unused ARM SVE kernels to "old" directory.

commit 64a421f6983ab5bc0b55df30a2ddcfff5bfd73be
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 13:40:43 2021 -0500

Add an option to control whether or not to use rpath.

Adds `--enable-rpath/--disable-rpath` (default disabled) to use an install_name starting with @rpath/. Otherwise, set the install_name to the absolute path of the installed library, which was the previous behavior.

commit c4a31683dd6f4da3065d86c11dd998da5192740a
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 13:27:10 2021 -0500

Fix $ORIGIN usage on linux.

commit d0a0b4b841fce56b7b2d3c03c5d93ad173ce2b97
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Mon Oct 4 18:03:04 2021 +0000

Arm micro-architecture dispatch (344)

Details:
- Reworked support for ARM hardware detection in bli_cpuid.c to parse
the result of a CPUID-like instruction.
- Added a64fx support to bli_gks.c.
- include arm64 and arm32 family headers from bli_arch_config.h.
- Fix the ordering of the "armsve" and "a64fx" strings in the
config_name string array in bli_arch.c. The ordering did not match
the ordering of the corresponding arch_t values in bli_type_defs.h,
as it should have all along.
- Added clang support to make_defs.mk in arm64, cortexa53, cortexa57
subconfigs.
- Updated arm64 and arm32 families in config_registry.
- Updated docs/HardwareSupport.md to reflect added ARM support.
- Thanks to Dave Love, RuQing Xu, and Devin Matthews for their
contributions in this PR (344).
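
The commit only mentions "a CPUID-like instruction". The sketch below assumes this refers to reading the AArch64 Main ID Register (MIDR_EL1) and shows how its bit fields could be decoded and matched. The type names, function names, and the short part-number table are illustrative and are not copied from bli_cpuid.c.

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields of a MIDR_EL1 value. */
typedef struct {
    uint32_t implementer;   /* bits [31:24] */
    uint32_t variant;       /* bits [23:20] */
    uint32_t architecture;  /* bits [19:16] */
    uint32_t part;          /* bits [15:4]  */
    uint32_t revision;      /* bits [3:0]   */
} midr_fields_t;

static midr_fields_t midr_decode(uint32_t midr)
{
    midr_fields_t f;
    f.implementer  = (midr >> 24) & 0xFF;
    f.variant      = (midr >> 20) & 0xF;
    f.architecture = (midr >> 16) & 0xF;
    f.part         = (midr >> 4)  & 0xFFF;
    f.revision     =  midr        & 0xF;
    return f;
}

/* Hypothetical classification; a real dispatcher covers many more parts. */
typedef enum { CPU_UNKNOWN, CPU_CORTEX_A53, CPU_CORTEX_A57, CPU_A64FX } cpu_kind_t;

static cpu_kind_t cpu_classify(uint32_t midr)
{
    midr_fields_t f = midr_decode(midr);
    if (f.implementer == 0x41) {            /* Arm Ltd. */
        if (f.part == 0xD03) return CPU_CORTEX_A53;
        if (f.part == 0xD07) return CPU_CORTEX_A57;
    } else if (f.implementer == 0x46) {     /* Fujitsu */
        if (f.part == 0x001) return CPU_A64FX;
    }
    return CPU_UNKNOWN;
}
```

The sketch works on a plain integer, so it can be exercised on any host. Obtaining the register value on real hardware is a separate, platform-specific step.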

commit 91408d161a2b80871463ffb6f34c455bdfb72492
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Oct 4 11:37:48 2021 -0500

Use path-based install name on macOS and use relocatable RPATH entries for testsuite binaries.

- RPATH entries (and DYLD_LIBRARY_PATH) do nothing on macOS unless the install_name of the library starts with @rpath/. While the install_name can be set to the absolute install path, this makes the installation non-relocatable. When using @rpath in the install_name, install paths within the normal DYLD_LIBRARY_PATH work with no changes on the user side, but for install paths off the beaten track, users must specify an RPATH entry when linking (or modify DYLD_LIBRARY_PATH at runtime). Perhaps this could be made into a configure-time option.
- Having relocatable testsuite binaries is not necessarily a priority but it is easy to do with @executable_path (macOS) or $ORIGIN (linux/BSD).

commit f5c03e9fe808f9bd8a3e0c62786334e13c46b0fc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Oct 3 16:51:51 2021 +0900

Armv8 Handle *beta == 0 for GEMMSUP ?rc Case.

commit abc648352c591e26ceee436bd3a45400115b70c5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun Oct 3 13:14:19 2021 +0900

Armv8 Fix 6x8 Row-Maj Ukr

- Fixed for 6x8 only, 4x4 & 4x8 pending;
- Installed to config firestorm as benchmark seems to show better perf:
Old:
blis_dgemm_ukr_c 6 8 320 36.87 2.43e-17 PASS
blis_dgemm_ukr_c 6 8 352 40.55 1.04e-17 PASS
blis_dgemm_ukr_c 6 8 384 44.24 5.68e-17 PASS
blis_dgemm_ukr_c 6 8 416 41.67 3.51e-17 PASS
blis_dgemm_ukr_c 6 8 448 34.41 2.94e-17 PASS
blis_dgemm_ukr_c 6 8 480 42.53 2.35e-17 PASS

New:
blis_dgemm_ukr_r 6 8 352 50.69 1.59e-17 PASS
blis_dgemm_ukr_r 6 8 384 49.15 5.55e-17 PASS
blis_dgemm_ukr_r 6 8 416 50.44 2.86e-17 PASS
blis_dgemm_ukr_r 6 8 448 46.92 3.12e-17 PASS
blis_dgemm_ukr_r 6 8 480 48.08 4.08e-17 PASS

commit 0a45bc0fbc7aee3876c315ed567fc37f19cdc57f
Merge: 5013a6cb 13dbd5b5
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 18:59:43 2021 -0500

Merge pull request 552 from flame/armsve_beta_0

Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.

commit 13dbd5b5d3dbf27e33ecf0e98d43c97019a6339d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Oct 2 20:40:25 2021 +0000

Apply patch from xrq-phys.

commit ae0eeeaf77c77892db17027cef10b95ec97c904f
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Sep 29 16:42:33 2021 -0500

Add explicit handling for beta == 0 in armsve sd and armv7a d gemm ukrs.

commit 5013a6cb7110746c417da96e4a1308ef681b0b88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 29 10:38:50 2021 -0500

More edits and fixes to docs/FAQ.md.

commit b36fb0fbc5fda13d9a52cc64953341d3d53067ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:47:45 2021 -0500

Fixed newly broken link to CREDITS in FAQ.md.

commit 3442d4002b3bfffd8848f72103b30691df2b19b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:43:23 2021 -0500

More minor fixes to FAQ.md and Sandboxes.md.

commit 89aaf00650d6cc19b83af2aea6c8d04ddd3769cb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 28 18:34:33 2021 -0500

Updates to FAQ.md, Sandboxes.md, and README.md.

Details:
- Updated FAQ.md to include two new questions, reordered an existing
question, and also removed an outdated and redundant question about
BLIS vs. AMD BLIS.
- Updated Sandboxes.md to use 'gemmlike' as its main example, along with
other smaller details.
- Added ARM as a funder to README.md.

commit c52c43115ec2264fda9380c48d9e6bb1e1ea2ead
Merge: 1fc23d21 1f527a93
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Sep 26 15:56:54 2021 -0500

Merge branch 'dev'

commit 1fc23d2141189c7b583a5bff2cffd87fd5261444
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 21 14:54:20 2021 -0500

Safelist 'master', 'dev', 'amd' branches.

Details:
- Modified .travis.yml so that only commits to 'master', 'dev', and
'amd' branches get built by Travis CI. Thanks to Devin Matthews for
helping to track down the syntax for this change.

commit 1f527a93b996093e06ef7a8e94fb47ee7e690ce0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 17:56:36 2021 -0500

Re-enable and fix fb93d24.

Details:
- Re-enabled the changes made in fb93d24.
- Defined BLIS_ENABLE_SYSTEM in bli_arch.c, bli_cpuid.c, and bli_env.c,
all of which needed the definition (in addition to config_detect.c) in
order for the configure-time hardware detection binary to be compiled
properly. Thanks to Minh Quan Ho for helping identify these additional
files as needing to be updated.
- Added additional comments to all four source files, most notably to
prompt the reader to remember to update all of the files when updating
any of the files. Also made the cpp code in each of the files as
consistent/similar as possible.
- Refer to issue 532 and PR 546 for more history.

commit 7b39c1492067de941f81b49a3b6c1583290336fd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 16:13:50 2021 -0500

Reverted fb93d24.

Details:
- The latest changes in fb93d24 are still causing problems. Reverting
and preparing to move them to a branch.

commit fb93d242a4fef4694ce2680436da23087bbdd5fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 20 15:42:08 2021 -0500

Re-enable and fix 8e0c425 (BLIS_ENABLE_SYSTEM).

Details:
- Re-enable the changes originally made in 8e0c425 but quickly reverted
in 2be78fc.
- Moved the #include of bli_config.h so that it occurs before the
#include of bli_system.h. This allows the #define BLIS_ENABLE_SYSTEM
or #define BLIS_DISABLE_SYSTEM in bli_config.h to be processed by the
time it is needed in bli_system.h. This change should have been
in the original 8e0c425, but was accidentally omitted. Thanks to Minh
Quan Ho for catching this.
- Added #define BLIS_ENABLE_SYSTEM to config_detect.c so that the proper
cpp conditional branch executes in bli_system.h when compiling the
hardware detection binary. The changes made in 8e0c425 were an attempt
to support the definition of BLIS_OS_NONE when configuring with
--disable-system (in issue 532). That commit failed because, aside
from the required but omitted header reordering (second bullet above),
AppVeyor was unable to compile the hardware detection binary as a
result of missing Windows headers. This commit, which builds on PR
546, should help fix that issue. Thanks to Minh Quan Ho for his
assistance and patience on this matter.

commit eaa554aa52b879d181fdc87ba0bfad3ab6131517
Author: Minh Quan HO <minh-quan.hokalray.eu>
Date: Wed Sep 15 15:39:36 2021 +0200

bli_error: more cleanup on the error strings array

- There was redundancy between the macro BLIS_MAX_NUM_ERR_MSGS (=200) and
the enum BLIS_ERROR_CODE_MAX (-170), while they both mean the same thing:
the maximal number of error codes/messages.
- The previous initialization of error messages at compile time ignored that
the 'bli_error_string' array still occupies useless memory due to 2D char[][]
declaration. Instead, it should be just an array of pointers, pointing at
strings in .rodata section.
- This commit makes two modifications:
* retires the macros BLIS_MAX_NUM_ERR_MSGS and BLIS_MAX_ERR_MSG_LENGTH everywhere
* switches bli_error_string from char[][] to char *[] to reduce its footprint
from 40KB (200*200) to 1.3KB (170*sizeof(char*)).
(There is no problem using the enum BLIS_ERROR_CODE_MAX at compile time,
since the compiler is smart enough to determine its value is 170.)
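
The footprint arithmetic above can be checked with a small sketch. The array and macro names below are hypothetical stand-ins for the real ones, the two messages are invented placeholders, and the pointer-table size depends on the platform's pointer width (about 1.3KB with 8-byte pointers).

```c
#include <assert.h>
#include <stddef.h>

#define OLD_MAX_MSGS    200  /* value the commit gives for BLIS_MAX_NUM_ERR_MSGS */
#define OLD_MAX_MSG_LEN 200  /* per-message length implied by "200*200" */
#define NUM_CODES       170  /* magnitude of BLIS_ERROR_CODE_MAX per the commit */

/* Old layout: every slot reserves OLD_MAX_MSG_LEN bytes, used or not. */
static char old_table[OLD_MAX_MSGS][OLD_MAX_MSG_LEN];

/* New layout: one pointer per code; the string literals themselves live
   in read-only data. Unlisted slots are NULL. */
static const char *new_table[NUM_CODES] = {
    [0] = "placeholder message A",
    [1] = "placeholder message B",
};

static size_t old_footprint(void) { return sizeof(old_table); }
static size_t new_footprint(void) { return sizeof(new_table); }

/* Bounds-checked lookup with a fallback for unset slots. */
static const char *lookup(size_t idx)
{
    assert(idx < NUM_CODES);
    return new_table[idx] ? new_table[idx] : "(no message)";
}
```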

commit 52f29f739dbbb878c4cde36dbe26b82847acd4e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 17 08:38:29 2021 -0500

Removed last vestige of #define BLIS_NUM_ARCHS.

Details:
- Removed the commented-out #define BLIS_NUM_ARCHS in bli_type_defs.h
and its associated (now outdated) comments. BLIS_NUM_ARCHS has been
part of the arch_t enum for some time now, and so this change is
mostly about removing any opportunity for confusion for people who
may be reading the code. Thanks to Minh Quan Ho for prompting this
cleanup.

commit 849aae09f4fbf8d7abf11f4df1471f1d057e874b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 16 14:47:45 2021 -0500

Added new packm var3 to 'gemmlike'.

Details:
- Defined a new packm variant for the 'gemmlike' sandbox. This new
variant (bls_l3_packm_var3.c) parallelizes the packing operation over
the k dimension rather than the m or n dimensions. Note that the
gemmlike implementation still uses var1 by default, and use of the new
code would require changing bls_l3_packm_a.c and/or bls_l3_packm_b.c
so that var3 is called instead. Thanks to Jeff Diamond for proposing
this (perhaps NUMA-friendly) solution.
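
As a rough illustration of parallelizing a packing operation over the k dimension, the sketch below splits the k range into contiguous per-thread slices and has each thread copy only its own columns. The names and the plain column-major copy are assumptions made for the example. This is not the code in bls_l3_packm_var3.c.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { size_t start, end; } range_t;

/* Splits [0, k) into n_threads contiguous slices. When k is not a
   multiple of n_threads, the first (k % n_threads) threads get one
   extra column each. */
static range_t k_range_for_thread(size_t k, size_t n_threads, size_t tid)
{
    assert(n_threads > 0 && tid < n_threads);
    size_t base  = k / n_threads;
    size_t extra = k % n_threads;
    range_t r;
    r.start = tid * base + (tid < extra ? tid : extra);
    r.end   = r.start + base + (tid < extra ? 1 : 0);
    return r;
}

/* Copies columns [r.start, r.end) of an m x k column-major matrix a
   (leading dimension lda) into the packed buffer p (leading dimension m).
   Slices from different threads write disjoint parts of p. */
static void pack_k_slice(const double *a, size_t lda, size_t m,
                         range_t r, double *p)
{
    for (size_t j = r.start; j < r.end; ++j)
        for (size_t i = 0; i < m; ++i)
            p[j * m + i] = a[j * lda + i];
}
```

Because each thread touches a contiguous block of columns in both the source and the packed buffer, no synchronization is needed inside the copy itself.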

commit b6f71fd378b7cd0cdc5c780e0b8c975a7abde998
Merge: 9293a68e e3dc1954
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 12:24:33 2021 -0500

Merge pull request 544 from flame/haswell-gemmsup-fpe

Fix more copy-paste errors in the haswell gemmsup code.

commit e3dc1954ffb5eee2a8b41fce85ba589f75770eea
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 10:59:37 2021 -0500

Fix problem where uninitialized registers are included in vhaddpd in the Mx1 gemmsup kernels for haswell.

The fix is to use the same (valid) source register twice in the horizontal addition.
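
To see why passing the same register twice is safe, here is a scalar C model of the 256-bit VHADDPD lane layout. The vec4d type and function names are made up for illustration; the real kernels are inline assembly.

```c
#include <assert.h>

typedef struct { double v[4]; } vec4d;

/* Scalar model of 256-bit VHADDPD: pairwise sums are taken within each
   128-bit half, alternating between the two sources. */
static vec4d hadd_pd(vec4d a, vec4d b)
{
    vec4d r;
    r.v[0] = a.v[0] + a.v[1];
    r.v[1] = b.v[0] + b.v[1];
    r.v[2] = a.v[2] + a.v[3];
    r.v[3] = b.v[2] + b.v[3];
    return r;
}

/* Mx1 case: only one accumulator holds valid data. Using it as both
   sources means every output lane depends on valid data only. */
static double reduce_one(vec4d acc)
{
    vec4d h = hadd_pd(acc, acc);
    return h.v[0] + h.v[2];
}
```

If a second, uninitialized register were used as the other source, lanes 1 and 3 of the result would hold sums of arbitrary values, which matches the problem the commit describes.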

commit 5191c43faccf45975f577c60b9089abee25722c9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Sep 16 10:16:17 2021 -0500

Fix more copy-paste errors in the haswell gemmsup code.

Fixes 486.

commit 30c29b256ef13f0141ca9e9169cbdc7a45ce3a61
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 05:01:03 2021 +0900

Arm SVE Exclude SVE-Intrinsic Kernels for GCC 8-9

Affected configs: a64fx.

commit bffa85be59dece8e756b9444e762f18892c06ee1
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Sep 16 04:31:45 2021 +0900

Arm SVE: Correct PACKM Ker Name: Intrinsic Kers

SVE-Intrinsic-based kernels ought not to use asm in their names.

commit 9293a68eb6557a9ea43a846435908c3d52d4218b
Merge: ade10f42 98ce6e8b
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 14:13:29 2021 -0500

Merge pull request 534 from flame/cxx_test

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible

commit 98ce6e8bc916e952510872caa60d818d62a31e69
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 14:12:13 2021 -0500

Do a fast test on OSX. [ci skip]

commit c76fcad0c2836e7140b6bef3942e0a632a5f2cda
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:57:02 2021 -0500

Fix AArch64 tests and consolidate some other tests.

commit e486d666ffefee790d5e39895222b575886ac1ea
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:50:16 2021 -0500

Use C++ cross-compiler for ARM tests.

commit fbb3560cb8e2aeab205c47c2b096d4fa306d93db
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 13:38:27 2021 -0500

Attempt to fix cxx-test for OOT builds.

commit 9c0064f3f67d59263c62d57ae19605562bb87cc2
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Sep 10 10:39:04 2021 -0500

Fix config_name in bli_arch.c

commit ade10f427835d5274411cafc9618ac12966eb1e7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 27 12:47:12 2021 -0500

Updated travis-ci.org link in README.md to .com.

commit 2be78fc97777148c83d20b8509e38aa1fc1b4540
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 27 12:17:26 2021 -0500

Disabled (at least temporarily) commit 8e0c425.

Details:
- Reverted changes in 8e0c425 due to AppVeyor build failures that we do
not yet understand.

commit 820f11a4694aee5f234e24277aecca40885ae9d4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Aug 27 13:40:26 2021 +0900

Arm Whole GEMMSUP Call Route is Asm/Int Optimized

- `ref2` call in `bli_gemmsup_rv_armv8a_asm_d6x8m.c` is commented out.
- `bli_gemmsup_rv_armv8a_asm_d4x8m.c` contains a tail `ref2` call but
it's not called by any upper routine.

commit 8e0c4255de52a0a5cffecbebf6314aa52120ebe4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 26 15:29:18 2021 -0500

Define BLIS_OS_NONE when using --disable-system.

Details:
- Modified bli_system.h so that the cpp macro BLIS_OS_NONE is defined
when BLIS_DISABLE_SYSTEM is defined. Otherwise, the previous OS-
detecting macro conditionals are considered. This change is to
accommodate a solution to a cross-compilation issue described in
532.
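
A single-file sketch of the conditional described above. BLIS_DISABLE_SYSTEM and BLIS_OS_NONE are the macro names from the commit; the DEMO_* macros and demo_os_name() are hypothetical, and the real bli_system.h tests more platforms.

```c
#include <assert.h>

/* Stand-in for the setting that normally arrives via the
   configure-generated header. It must be defined before the
   conditional below is evaluated. */
#define BLIS_DISABLE_SYSTEM

/* Stand-in for the OS-selection logic: with the system disabled, skip
   OS detection entirely and define BLIS_OS_NONE. */
#if defined(BLIS_DISABLE_SYSTEM)
  #define BLIS_OS_NONE 1
#elif defined(_WIN32)
  #define DEMO_OS_WINDOWS 1
#elif defined(__linux__)
  #define DEMO_OS_LINUX 1
#else
  #define DEMO_OS_OTHER 1
#endif

/* Reports which branch the preprocessor took. */
static const char *demo_os_name(void)
{
#if defined(BLIS_OS_NONE)
    return "none";
#elif defined(DEMO_OS_WINDOWS)
    return "windows";
#elif defined(DEMO_OS_LINUX)
    return "linux";
#else
    return "other";
#endif
}
```

If the definition of BLIS_DISABLE_SYSTEM appeared after the conditional, the preprocessor would fall through to the OS-detection branches instead.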

commit d6eb70fbc382ad7732dedb4afa01cf9f53e3e027
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 26 13:12:39 2021 -0500

Updated stale calls to malloc_intl() in gemmlike.

Details:
- Updated two out-of-date calls to bli_malloc_intl() within the gemmlike
sandbox. These calls to malloc_intl(), which resided in
bls_l3_decor_pthreads.c, were missing the err_t argument that the
function uses to report errors. Thanks to Jeff Diamond for helping
isolate this issue.

commit 2f7325b2b770a15ff8aaaecc087b22238f0c67b7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 23 15:04:05 2021 -0500

Blacklist clang10/gcc9 and older for 'armsve'.

Details:
- Prohibit use of clang 10.x and older or gcc 9.x and older for the
'armsve' subconfiguration. Addresses issue 535.

commit 7e2951e61fda1c325d6a76ca9956253482d84924
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 17:06:44 2021 +0900

Arm: DGEMMSUP `Macro' Edge Cases Stop Calling Ref

Ref cannot handle panel strides (packed cases) and thus cannot be called
from the beginning of `gemmsup` (i.e. it cannot be the dispatch target of
gemmsup for other sizes).

commit 4fd82b0e9348553d83e258bd4969e49a81f8fcf0
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 05:18:32 2021 +0900

Header Typo

commit 35409ebe67557c0e7cf5ced138c8166c9c1c909f
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 04:51:47 2021 +0900

Arm: DGEMMSUP ??r(rv) Invoke Edge Size

Plus some fix at edges.

TODO: Should ensure that no ref kernel appears at the beginning of gemmsup
kernels, as ref does not recognise panel stride.

commit a361492c24fdd919ee037763fc6523e8d7d2967a
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Mon Aug 23 01:13:39 2021 +0900

Arm: DGEMMSUP ?rc(rd) Invoke Edge Size

commit eaea67401c2ab31f2e51eede59725f64c1a21785
Merge: 5fc65cdd e320ec6d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 21 16:09:31 2021 -0500

Merge branch 'master' into cxx_test

commit 5fc65cdd9e4134c5dcb16d21cd4a79ff426ca9f3
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 21 15:59:27 2021 -0500

Add test to Travis using C++ compiler to make sure blis.h is C++-compatible.

commit e320ec6d5cd44e03cb2e2faa1d7625e84f76d668
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 20 17:15:20 2021 -0500

Moved lang defs from _macro_def.h to _lang_defs.h.

Details:
- Moved miscellaneous language-related definitions, including defs
related to the handling of the 'restrict' keyword, from the top half
of bli_macro_defs.h into a new file, bli_lang_defs.h, which is now
included immediately after "bli_system.h" in blis.h. This change is
an attempt to fix a report of recent breakage of C++ compilers due
to the recent introduction of 'restrict' in bli_type_defs.h (which
previously was being included *before* bli_macro_defs.h and its
restrict handling therein). Thanks to Ivan Korostelev for reporting
this issue in 527.
- CREDITS file update.

commit e6799b26a6ecf1e80661a77d857d1c9e9adf50dc
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Aug 21 02:39:38 2021 +0900

Arm: Implement GEMMSUP Fallback Method

bli_dgemmsup_rv_armv8a_int_6x4mn

commit 7d5903d8d7570090eb37c592094424d1c64805d1
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Aug 21 01:55:50 2021 +0900

Arm64 Fix: Support Alpha/Beta in GEMMSUP Intrin

Forgot to support `alpha`/`beta` in gemmsup_armv8a_int.

commit 3b275f810b2479eb5d6cf2296e97a658cf1bb769
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 19 16:06:46 2021 -0500

Minor tweaks to gemmlike sandbox.

Details:
- In the gemmlike sandbox, changed the loop index variable of inner
loop of packm_cxk() from 'd' to 'i' (and likewise for the
corresponding inlined code within packm_var2()).
- Pack matrices A and B using packm_var1() instead of packm_var2().

commit 3eccfd456e7e84052c9a429dcde1183a7ecfaa48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 19 13:22:10 2021 -0500

Added local _check() code to gemmlike sandbox.

Details:
- Added code to the gemmlike sandbox that handles parameter checking.
Previously, the gemmlike implementation called bli_gemm_check(), which
resides within the BLIS framework proper. Certain modifications that a
user may wish to perform on the sandbox, such as adding a new matrix
or vector operand, would have required additional checks, and so these
changes make it easier for such a person to implement those checks for
their custom gemm-like operation.

commit 7144230cdb0653b70035ddd91f7f41e06ad8d011
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 18 13:25:39 2021 -0500

README.md citation updates (e.g. BLIS7 bibtex).

commit 4a955e939044cfd2048cf9f3e33024e3ad1fbe00
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 16 13:49:27 2021 -0500

Tweaks to gemmlike to facilitate 3rd party mods.

Details:
- Changed the implementation in the 'gemmlike' sandbox to more easily
allow others to provide custom implementations of packm. These changes
include:
- Calling a local version of packm_cxk() that can be modified. This
version of packm_cxk() uses inlined loops in packm_cxk() rather
than querying the context for packm kernels (or even using scal2m).
- Providing two variants of packm, one of which calls the
aforementioned packm_cxk(), the other of which inlines the contents
of packm_cxk() into the variant itself, making it self-contained.
To switch from one to the other, simply change which function gets
called within bls_packm_a() and bls_packm_b().
- Simplified and cleaned up some variant names in both variants of
packm, relative to their parent code.
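
For readers unfamiliar with what a packm_cxk()-style routine does, here is a minimal sketch with inlined loops. The function name, argument list, and zero-padding of the unused rows are assumptions for the example, not the sandbox's actual signature or behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Packs a c x k micropanel of matrix a (row stride inca, column stride
   lda) into the contiguous buffer p, whose columns are c_max elements
   long. Rows c..c_max-1 of each packed column are set to zero. */
static void demo_packm_cxk(size_t c, size_t c_max, size_t k,
                           const double *a, size_t inca, size_t lda,
                           double *p)
{
    assert(c <= c_max);
    for (size_t l = 0; l < k; ++l) {
        /* Copy the c valid elements of column l. */
        for (size_t i = 0; i < c; ++i)
            p[l * c_max + i] = a[i * inca + l * lda];
        /* Pad the remainder of the packed column. */
        for (size_t i = c; i < c_max; ++i)
            p[l * c_max + i] = 0.0;
    }
}
```

Writing the loops out like this, rather than dispatching to a kernel looked up at run time, is what makes the routine easy for a third party to modify.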

commit 2c0b4150e40c83ea814f69ca766da74c19ed0a58
Merge: c99fae50 4b8ed99d
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Aug 14 18:41:35 2021 -0500

Merge pull request 527 from flame/obj_t_makeover

Implement proposed new function pointer fields for obj_t.

commit 4b8ed99d926876fbf54c15468feae4637268eb6b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 13 15:31:10 2021 -0500

Whitespace tweaks.

commit c99fae50ac3de0b5380a085aeebebfe67a645407
Merge: e6d68bc4 4f70eb79
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:48:00 2021 -0500

Merge pull request 530 from flame/fix_clang_warnings

Clean up some warnings that show up on clang/OSX.

commit e6d68bc4fd0981bea90d7f045779cacfe53f6ae8
Merge: 20a1c401 ec06b6a5
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:47:46 2021 -0500

Merge pull request 529 from flame/fix_make_check_dependencies

Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.

commit 1772db029e10e0075b5a59d3fb098487b1ad542a
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 14:46:35 2021 -0500

Add row- and column-strides for A/B in obj_ukr_fn_t.

commit 4f70eb7913ad3ded193870361b6da62b20ec3823
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 13 11:12:43 2021 -0500

Clean up some warnings that show up on clang/OSX.

commit 3cddce1e2a021be6064b90af30022b99cbfea986
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 12 22:32:34 2021 -0500

Remove schema field on obj_t (redundant) and add new API functions.

commit ec06b6a503a203fa0cdb23273af3c0e3afeae7fa
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 12 19:27:31 2021 -0500

Add dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects.

This fixes a bug where "make -j<N> check" may fail after a change to one or more header files, or where testsuite code doesn't get properly recompiled after internal changes.

commit 20a1c4014c999063e6bc1cfa605b152454c5cbf4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 12 14:44:04 2021 -0500

Disabled sanity check in bli_pool_finalize().

Details:
- Disabled a sanity check in bli_pool_finalize() that was meant to alert
the user if a pool_t was being finalized while some blocks were still
checked out. However, this is exactly the situation that might happen
when a pool_t is re-initialized for a larger blocksize, and currently
bli_pool_reinit() is implemented as _finalize() followed by _init().
So, this sanity check is not universally appropriate. Thanks to
AMD-India for reporting this issue.

commit e366665cd2b5ae8d7683f5ba2de345df0a41096f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 12 14:06:53 2021 -0500

Fixed stale API calls to membrk API in gemmlike.

Details:
- Updated stale calls to the bli_membrk API within the 'gemmlike'
sandbox. This API is now called bli_pba (packed block allocator).
Ideally, this forgotten update would have been included as part of
21911d6, which is when the branch that introduced the membrk->pba
changes was merged into 'master'.
- Comment updates.

commit e38ca28689f31c5e5bd2347704dc33042e5ea176
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Aug 13 03:21:19 2021 +0900

Added Apple Firestorm (A14/M1) Subconfig

- Use the same bulk kernel as Cortex-A53 / ThunderX2;
- Larger block size;
- Use gemmsup kernels for double precision.

commit 3df0e9b653fbb1293cad93010273eea579e753d9
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jul 17 04:21:53 2021 +0900

Arm64 8x4 Kernel Use Less Regs

commit 4e7e225057a05b9722ce65ddf75a9c31af9fbf36
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 9 15:46:36 2021 +0900

Armv8-A Supplementary GEMMSUP Sizes for RD

commit c792d506ba09530395c439051727631fd164f59a
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 04:20:24 2021 +0900

Armv8-A Fix GEMMSUP-RD Kernels on GNU Asm

Suffixed NEON opcode is not supported by GNU assembler

commit ce4473520975c2c8790c82c65a69d75f8ad758ea
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 04:08:14 2021 +0900

Armv8-A Adjust Types for PACKM Kernels

GCC does not have full NEON intrinsics support.

commit 8a32d19af85b61af92fcab1c316fb3be1a8d42ce
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 03:31:30 2021 +0900

Armv8-A GEMMSUP-RD 6x8m

Armv8-A now has a complete set of GEMMSUP kernels.

commit afd0fa6ad1889ed073f781c8aa8635f99e76b601
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat Jun 5 01:19:01 2021 +0900

Armv8-A GEMMSUP-RD 6x8n

commit 3c5f7405148ab142dee565d00da331d95a7a07b9
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Jun 4 21:50:51 2021 +0900

Armv8-A s/d Packing Kernels Fix Typo

For GCC.

commit 49b05df7929ec3abc0d27b475d2d406116fe2682
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Fri Jun 4 18:04:59 2021 +0900

Armv8-A Introduced s/d Packing Kernels

Sizes according to the 2014 kernels.

commit c3faf93168c3371ff48a2d40d597bdb27021cad4
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 23:09:05 2021 +0900

Armv8-A DGEMMSUP 6x8m Kernel

Recommended kernels set:
...
BLIS_RRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_RCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CRR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8m, TRUE,
BLIS_CCR, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
BLIS_CCC, BLIS_DOUBLE, bli_dgemmsup_rv_armv8a_asm_6x8n, TRUE,
...
bli_blksz_init ( &blkszs[ BLIS_MR ], -1, 6, -1, -1,
-1, 8, -1, -1 );
bli_blksz_init_easy( &blkszs[ BLIS_NR ], -1, 8, -1, -1 );
...

commit 3efe707b5500954941061d4c2363d6ed41d17233
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 17:20:57 2021 +0900

Armv8-A DGEMMSUP Adjustments

commit 8ed8f5e625de9b77a0f14883283effe79af01771
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Jun 3 16:37:37 2021 +0900

Armv8-A Add More DGEMMSUP

- Add 6x8 GEMMSUP.
- Adjust prefetching.
- Workaround for Clang's inability to handle reg clobbering.
- Subproduct 6x8 row-major GEMM <- incomplete.

commit a9ba79ea14de3b5a271e5970cb473d3c52e2fa5f
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 2 15:04:29 2021 +0900

Armv8-A Add GEMMSUP 4x8n Kernel

- Compile w/ both GCC & Clang.
- Edge cases use ref-kernels.
- Can give performance boost in some contexts.

commit df40efe8fbfd399d76c6000ec03791a9b76ffbdf
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed Jun 2 00:04:20 2021 +0900

Armv8-A Add Part of GEMMSUP 8x4m Kernel

- Compile w/ both GCC & Clang
- Only the block part is implemented. Edge cases WIP
- Not an optimal kernel scheme. Should do 4x8 instead

commit 66399992881316514f64d68ec9eb60a87d53f674
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 05:52:05 2021 +0900

Armv8A DGEMM 4x4 Kernel WIP. Slow

Quite slow.

commit a29c16394ccef02d29141c79b71fb408e20073e6
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 04:58:45 2021 +0900

Armv8-A Add 8x4 Kernel WIP

Test result: a bit lower GFLOPs than 6x8.

commit 64a1f786d58001284aa4f7faf9fae17f0be7a018
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Aug 11 17:53:12 2021 -0500

Implement proposed new function pointer fields for obj_t.

The added fields:
1. `pack_t schema`: storing the pack schema on the object allows the macrokernel to act accordingly without side-channel information from the rntm_t and cntx_t. The pack schema and "pack_[ab]" fields could be removed from those structs.
2. `void* user_data`: this field can be used to store any sort of additional information provided by the user. The pointer is propagated to submatrix objects and copies, but is otherwise ignored by the framework and the default implementations of the following three fields. User-specified pack, kernel, or ukr functions can do whatever they want with the data, and the user is 100% responsible for allocating, assigning, and freeing this buffer.
3. `obj_pack_fn_t pack`: the function called when a matrix is packed. This function receives the expected arguments, as well as a mdim_t and mem_t* as memory must be allocated inside this function, and behavior may differ based on which matrix is being packed (i.e. transposition for B). This could also be achieved by passing a desired pack schema, but this would require additional information to travel down the control tree.
4. `obj_ker_fn_t ker`: the function called when we get to the "second loop", or the macro-kernel. Behavior may depend on the pack schemas of the input matrices. The default implementation would perform the inner two loops around the ukr, and then call either the default ukr or a user-supplied one (next field).
5. `obj_ukr_fn_t ukr`: the function called by the default macrokernel. This would replace the various current "virtual" microkernels, and could also be used to supply user-defined behavior. Users could supply both a custom kernel (above) and microkernel, although the user-specified kernel does **not** necessarily have to call the ukr function specified on the obj_t.

Note that no macros or functions for accessing these new fields have been defined yet. That is next once these are finalized. Addresses https://github.com/flame/blis/projects/1#card-62357687.
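
The five proposed fields can be pictured with a rough C sketch. The log gives only the field names and type names, so every `sketch_` type and function signature below is a hypothetical stand-in rather than the actual BLIS definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: the log names the fields but not their signatures. */
typedef int sketch_pack_t;
struct sketch_obj;
typedef void (*sketch_pack_fn)(struct sketch_obj* obj);
typedef void (*sketch_ker_fn)(struct sketch_obj* a, struct sketch_obj* b,
                              struct sketch_obj* c);
typedef void (*sketch_ukr_fn)(struct sketch_obj* c);

typedef struct sketch_obj
{
    sketch_pack_t  schema;    /* pack schema stored on the object itself */
    void*          user_data; /* opaque; allocated and freed by the user */
    sketch_pack_fn pack;      /* called when the matrix is packed */
    sketch_ker_fn  ker;       /* called at the macro-kernel level */
    sketch_ukr_fn  ukr;       /* called by the default macro-kernel */
} sketch_obj;

/* Propagate the fields (including user_data) from a parent object to a
   submatrix or copy, as the commit message describes for user_data. */
void sketch_obj_alias_fields(const sketch_obj* parent, sketch_obj* child)
{
    assert(parent != NULL && child != NULL);
    *child = *parent;
}
```

The framework would treat `user_data` as opaque; only user-supplied pack, kernel, or ukr functions would interpret it.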

commit a32257eeab2e9946e71546a05a1847a39341ec6b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 5 16:23:02 2021 -0500

Fixed bli_init.c compile-time error on OSX clang.

Details:
- Fixed a compile-time error in bli_init.c when compiling with OSX's
clang. This error was introduced in 868b901, which introduced a
post-declaration struct assignment where the RHS was a struct
initialization expression (i.e. { ... }). This use of struct
initializer expressions apparently works with gcc despite it not
being strict C99. The fix included in this commit declares a temporary
variable for the purposes of being initialized to the desired value,
via the struct initializer, and then copies the temporary struct (via
'=' struct assignment) to the persistent struct. Thanks to Devin
Matthews for his help with this.
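
The fix described above can be sketched in a few lines of C. The struct and its initializer macro are hypothetical stand-ins (the actual variable in bli_init.c is not shown in this log); the point is only the temporary-then-assign pattern.

```c
#include <assert.h>

/* Hypothetical control struct whose initializer is a brace list. */
typedef struct { int state; long signature; } sketch_control_t;
#define SKETCH_CONTROL_INIT { 0, 0x1234L }

/* Fine: a brace list is allowed when initializing at declaration. */
static sketch_control_t sketch_control = SKETCH_CONTROL_INIT;

void sketch_control_mark_done(void) { sketch_control.state = 1; }
int  sketch_control_state(void)     { return sketch_control.state; }

void sketch_control_reset(void)
{
    /* sketch_control = SKETCH_CONTROL_INIT;
       A brace list is not an expression, so this post-declaration
       assignment is not portable (the commit reports OSX clang rejects it). */
    const sketch_control_t fresh = SKETCH_CONTROL_INIT; /* initialize a temporary */
    sketch_control = fresh;                             /* plain struct assignment */
    assert(sketch_control.state == 0);
}
```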

commit c8728cfbd19ecde9d43af05829e00bcfe7d86eed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 5 15:17:09 2021 -0500

Fixed configure breakage on OSX clang.

Details:
- Accept either 'clang' or 'LLVM' in vendor string when grepping for
the version number (after determining that we're working with clang).
Thanks to Devin Matthews for this fix.

commit 868b90138e64c873c780d9df14150d2a370a7a42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 4 18:31:01 2021 -0500

Fixed one-time use property of bli_init() (525).

Details:
- Fixes a rather obvious bug that resulted in segmentation fault
whenever the calling application tried to re-initialize BLIS after
its first init/finalize cycle. The bug resulted from the fact that
the bli_init.c APIs made no effort to allow bli_init() to be called
subsequent times at all due to it, and bli_finalize(), being
implemented in terms of pthread_once(). This has been fixed by
resetting the pthread_once_t control variable for initialization
at the end of bli_finalize_apis(), and by resetting the control
variable for finalization at the end of bli_init_apis(). Thanks to
lschork2 for reporting this issue (525), and to Minh Quan Ho and
Devin Matthews for suggesting the chosen solution.
- CREDITS file update.
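
A minimal sketch of the re-arming scheme described above, assuming two `pthread_once_t` control variables. All `sketch_` names are hypothetical; the real functions are bli_init_apis() and bli_finalize_apis(), whose bodies are not shown in this log. Resetting a control variable by assignment is not something POSIX guarantees and is not thread-safe on its own, so treat this as an illustration of the idea only.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t sketch_once_init     = PTHREAD_ONCE_INIT;
static pthread_once_t sketch_once_finalize = PTHREAD_ONCE_INIT;
static int sketch_init_calls = 0;

static void sketch_init_apis(void)
{
    sketch_init_calls++;
    /* Re-arm finalization so the next sketch_finalize() does real work. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    sketch_once_finalize = fresh;
}

static void sketch_finalize_apis(void)
{
    assert(sketch_init_calls > 0); /* finalize only makes sense after init */
    /* Re-arm initialization so a later sketch_init() runs again. */
    const pthread_once_t fresh = PTHREAD_ONCE_INIT;
    sketch_once_init = fresh;
}

void sketch_init(void)       { pthread_once(&sketch_once_init, sketch_init_apis); }
void sketch_finalize(void)   { pthread_once(&sketch_once_finalize, sketch_finalize_apis); }
int  sketch_init_count(void) { return sketch_init_calls; }
```

Repeated calls to `sketch_init()` without an intervening `sketch_finalize()` remain no-ops, which keeps the original "once" behavior within each cycle.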

commit 8dba1e752c6846a85dea50907135bbc5cbc54ee5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 27 12:38:24 2021 -0500

CREDITS file update.

commit cc9206df667b7c710b57b190b8ad351176de53b8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 16 15:48:37 2021 -0500

Added Graviton2 Neoverse N1 performance results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on a Graviton2
Neoverse N1 server. Special thanks to Nicholai Tukanov for
collecting these results via the Arm-HPC/AWS hackathon.
- Corrected what was supposed to be a temporary tweak to the legend
labels in test/3/octave/plot_l3_perf.m.

commit fab5c86d68137b59800715efb69214c0a7e458a7
Merge: 84f9dcd4 d073fc9a
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 13 16:46:21 2021 -0500

Merge pull request 516 from nicholaiTukanov/p10-sandbox-rework

P10 sandbox rework

commit 84f9dcd449fa7a4cf4087fca8ec4ca0d10e9b801
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 13 16:45:44 2021 -0500

Remove unnecessary windows/zen2 directory.

commit 21911d6ed3438ca4ba942d05851ba5d7e9835586
Merge: 17729cf4 689fa0f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 9 18:10:46 2021 -0500

Merge branch 'dev'

commit 17729cf449919d1db9777cea5b65d2efc77e2692
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jul 9 14:59:48 2021 -0500

Add vzeroupper to Haswell microkernels. (524)

Details:
- Added vzeroupper instruction to the end of all 'gemm' and 'gemmtrsm'
microkernels so as to avoid a performance penalty when mixing AVX
and SSE instructions. These vzeroupper instructions were once part
of the haswell kernels, but were inadvertently removed during a source
code shuffle some time ago when we were managing duplicate 'haswell'
and 'zen' kernel sets. Thanks to Devin Matthews for tracking this down
and re-inserting the missing instructions.

commit c9a7f59aa84daa54d8f8c771f1f1ef2bd8730da2
Merge: 75f03907 9a8e649c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Jul 8 14:00:38 2021 -0500

Merge pull request 522 from flame/windows-avx512

Fix Win64 AVX512 bug.

commit 9a8e649c5ac89eba951bbee7136ca28aeb24d731
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 15:23:57 2021 -0500

Fix Win64 AVX512 bug.

Use `-march=haswell` for kernels. Fixes 514.

commit 75f03907c58385b656c8bd35d111db245814a9f3
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 15:44:11 2021 -0500

Add comment about make checkblas on Windows

[ci skip]

commit 4651583b1204a965e4aa672c7ad6de60f3ab1600
Merge: 69205ac2 174f7fc9
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jul 7 01:11:20 2021 -0500

Merge pull request 520 from flame/travis-ci-install

Test installation in Travis CI

commit 69205ac266947723ad4d7bb028b7521fe5c76991
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 6 20:39:22 2021 -0500

CREDITS file update.

Details:
- Thanks to Chengguo Sun for submitting 515 (5ef7f68).
- Thanks to Andrew Wildman for submitting 519 (551c6b4).
- Whitespace update to configure (spaces to tabs).

commit 174f7fc9a11712c7bd1a61510bdc5c262b3e8e1f
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 19:35:55 2021 -0500

Test installation in Travis CI

commit 551c6b4ee8cd9dd2e1d1b46c8dde09eb50b91b2c
Merge: 78eac6a0 f648df4e
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 19:32:53 2021 -0500

Merge pull request 519 from awild82/oot_build_bugfix

Fix installation from out-of-tree builds

commit f648df4e5588f069b2db96f8be320ead0c1967ef
Author: Andrew Wildman <apw4uw.edu>
Date: Tue Jul 6 16:35:12 2021 -0700

Add symlink to blis.pc.in for out-of-tree builds

commit 78eac6a0ab78c995c3f4e46a9e87388b5c3e1af6
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jul 6 11:05:43 2021 -0500

Revert "Always run `make check`."

This reverts commit a201a53440c51244739aaee20e3309b50121cc68.

commit a201a53440c51244739aaee20e3309b50121cc68
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 5 21:39:18 2021 -0500

Always run `make check`.

I'm concerned that problems may lurk for `x86_64` builds on Windows which may be uncovered by a fuller `make check`.

commit 5ef7f684dc75fc707c82f919e0836615f90a2627
Merge: aaa10c87 ad6231cc
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 5 21:35:07 2021 -0500

Merge pull request 515 from chengguosun/bug-fix

Fixed configure script bug.

commit ad6231cca3fc1e477752ecd31b1ee2323398a642
Author: sunchengguo <sunchengguohigon.com>
Date: Tue Jul 6 07:30:00 2021 -0400

Fixed configure script bug.
Details:
- Fixed a kernel list string substitution error by adding the function substitute_words to the configure script.
Previously, if the string contained both zen and zen2, and zen needed to be replaced with another string, then zen2
was also incorrectly replaced.
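
The difference between substring and whole-word replacement can be illustrated in C. The actual fix is a shell function in the configure script, which is not reproduced here; `substitute_word_sketch` and the replacement name `amd64` are hypothetical and only model the idea of comparing complete space-separated tokens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every whole word equal to `word` in the space-separated `list`
   with `repl`, writing the result to `out`. Returns 0 on success, or -1
   if `out` is too small. Tokens such as "zen2" are left alone when
   replacing "zen" because whole tokens are compared, not substrings. */
int substitute_word_sketch(const char* list, const char* word,
                           const char* repl, char* out, size_t out_size)
{
    assert(list && word && repl && out && out_size > 0);
    size_t used = 0;
    size_t word_len = strlen(word);
    out[0] = '\0';

    const char* p = list + strspn(list, " ");     /* skip leading spaces */
    while (*p != '\0')
    {
        size_t tok_len = strcspn(p, " ");         /* length of current token */
        const char* piece = p;
        size_t piece_len = tok_len;
        if (tok_len == word_len && strncmp(p, word, word_len) == 0)
        {
            piece = repl;
            piece_len = strlen(repl);
        }
        size_t sep = (used > 0) ? 1 : 0;          /* one space between tokens */
        if (used + sep + piece_len + 1 > out_size) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, piece, piece_len);
        used += piece_len;
        out[used] = '\0';

        p += tok_len;
        p += strspn(p, " ");                      /* skip separating spaces */
    }
    return 0;
}
```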

commit d073fc9acac9d702556cab9fbbb3a253eeb1f998
Author: nicholaiTukanov <nicholaitukanovgmail.com>
Date: Fri Jul 2 19:54:33 2021 -0500

Update POWER10.md

commit 907226c0af4afb6323b4e02be4f73f5fb89cddaf
Author: nicholaiTukanov <nicholaitukanovgmail.com>
Date: Fri Jul 2 19:47:18 2021 -0500

Rework POWER10 sandbox

- Add a testsuite for gathering performance (in GFLOPs) and measuring correctness for the POWER10 GEMM reduced precision/integer kernels.
- Reworked GENERIC_GEMM template to hardcode the cache parameters.
- Remove the kernel wrapper that checked that the matrices were not transposed or conjugated. However, the kernels still assume the matrices are not transposed. This wrapper was removed for performance reasons.
- Renamed and restructured files and functions for clarity.
- Edited the POWER10 document to reflect new changes.

commit aaa10c87e19449674a4ca30fa3b6392bb22c3a66
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 21 17:53:52 2021 -0500

Skip clearing temp microtile in gemmlike sandbox.

Details:
- Removed code from gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. This code, introduced recently in 7f7d726, did
not actually fix any bug (despite that commit's log entry). The
microtile does not need to be initialized because it is completely
overwritten by a "beta = 0" invocation of gemm prior to it being
read. Any NaNs or Infs present at the outset would have no impact
on the output matrix C. Thanks to Devin Matthews for reminding me
of this.
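
The reasoning above relies on "beta = 0" being treated as an overwrite rather than a multiplication. A minimal sketch of that convention follows; the function names are hypothetical and this is not the sandbox code itself.

```c
#include <assert.h>
#include <math.h>

/* Update an m x n tile stored contiguously: C := beta*C + alpha*AB,
   where `ab` already holds the product A*B. When beta is zero the old
   contents of C are never read, so NaN or Inf there cannot propagate. */
void sketch_update_tile(int m, int n, double alpha, const double* ab,
                        double beta, double* c)
{
    assert(m >= 0 && n >= 0);
    for (int i = 0; i < m * n; ++i)
    {
        if (beta == 0.0)
            c[i] = alpha * ab[i];               /* overwrite */
        else
            c[i] = beta * c[i] + alpha * ab[i]; /* accumulate */
    }
}

/* Returns 1 if a tile pre-filled with NaN/Inf is clean after a beta = 0 update. */
int sketch_nan_is_overwritten(void)
{
    double ab[2] = { 1.0, 2.0 };
    double c[2]  = { NAN, INFINITY };
    sketch_update_tile(1, 2, 1.0, ab, 0.0, c);
    return !isnan(c[0]) && !isinf(c[1]);
}
```

Had the update been computed as `0.0 * c[i] + ...`, a NaN in `c[i]` would survive, which is why the overwrite branch matters.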

commit bc10a3f2ff518360c32bea825b3eb62a9e4c8a77
Merge: bf727636 6548ceba
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jun 18 19:01:08 2021 -0500

Merge pull request 492 from flame/thunderx2-clang

Allow clang for ThunderX2 config

commit bf727636632a368f3247dc8ab1d4b6119e9c511a
Merge: e28f2a2d 5fc93e28
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jun 18 18:59:43 2021 -0500

Merge pull request 506 from xrq-phys/arm64-mac

BLIS on Darwin_Aarch64

commit e28f2a2dfcff14e7094fce0b279b3a917b3ab98c
Merge: d10e05bb 56ffca6a
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Jun 15 19:35:07 2021 -0500

Merge pull request 513 from nicholaiTukanov/asm_warning_p9_fix

Fix assembler warning in POWER9 DGEMM

commit 56ffca6a9bc67432a7894298739895f406e5f467
Author: nicholai <nicholaiibm.com>
Date: Tue Jun 15 18:17:39 2021 -0500

Fix asm warning

commit 689fa0f40399bde1acc5367d6dd4e8fc4eb6f3ea
Merge: b683d01b d10e05bb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jun 13 19:44:14 2021 -0500

Merge branch 'master' into dev

commit d10e05bbd1ce45ce2c0dfe5c64daae2633357b3f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jun 13 19:36:16 2021 -0500

Sandbox header edits trigger full library rebuild.

Details:
- Adjusted the top-level Makefile so that any change to a sandbox header
file will result in blis.h being regenerated along with a full
recompilation of the library. Previously, sandbox files were omitted
from the list of header files that, when touched, could trigger a full
rebuild. Why was it like that previously? Because originally we only
envisioned using sandboxes to *replace* gemm, not augment the library
with new functionality. When replacing gemm, blis.h does not need to
contain any local sandbox definitions in order for the user to be able
to (indirectly) use that sandbox. But if you are adding functions to
the library, those functions need to be prototyped so the compiler
can perform type checking against the user's invocation of those new
functions. Thanks to Jeff Diamond for helping us discover this
deficiency in the build system.

commit 7c3eb44efaa762088c190bb820ef6a3c87db8f65
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Jun 2 11:28:22 2021 -0500

Add vhsubpd/vhsubpd.

Horizontal subtraction instructions added to bli_x86_asm_macros.h, currently unused [ci skip].

commit 7f7d72610c25f511ba8cd2a53be7b59bdb80f3f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon May 31 16:50:18 2021 -0500

Fixed bugs in cpackm kernels, gemmlike code.

Details:
- Fixed intermittent bugs in bli_packm_haswell_asm_c3xk.c and
bli_packm_haswell_asm_c8xk.c whereby the imaginary component of the
kappa scalar was incorrectly loaded at an offset of 8 bytes (instead
of 4 bytes) from the real component. This was almost certainly a
copy-paste bug carried over from the corresponding zpackm kernels. Thanks to
Devin Matthews for bringing this to my attention.
- Added missing code to gemmlike sandbox files bls_gemm_bp_var1.c and
bls_gemm_bp_var2.c that initializes the elements of the temporary
microtile to zero. (This bug was never observed in output but rather
noticed analytically. It probably would have also manifested as
intermittent failures, this time involving edge cases.)
- Minor commented-out/disabled changes to testsuite/src/test_gemm.c
relating to debugging.
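
The 4-byte versus 8-byte distinction in the first item follows from the element size of each complex type. The sketch below uses hypothetical struct names and assumes the usual layout (real part first, no padding, 4-byte float and 8-byte double); it is not the assembly from the kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical complex layouts: real part first, imaginary part second. */
typedef struct { float  real; float  imag; } sketch_scomplex;
typedef struct { double real; double imag; } sketch_dcomplex;

/* Byte offset of the imaginary part for each precision. */
size_t sketch_imag_offset_c(void) { return offsetof(sketch_scomplex, imag); }
size_t sketch_imag_offset_z(void) { return offsetof(sketch_dcomplex, imag); }

/* Load the imaginary part of a single-precision kappa by byte offset,
   mimicking what an assembly kernel does with a base address. */
float sketch_load_kappa_imag(const sketch_scomplex* kappa)
{
    assert(kappa != NULL);
    float imag;
    memcpy(&imag, (const char*)kappa + sketch_imag_offset_c(), sizeof imag);
    return imag;
}
```

Using the double-precision offset with a single-precision scalar would read past the end of the 8-byte `sketch_scomplex`, which is consistent with the intermittent failures described.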

commit 5fc93e280614b4a21a9cff36cf873b4b9407285b
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 18:44:47 2021 +0900

Armv8A Rename Regs for Safe Darwin Compile

Avoid x18 use in FP32 kernel:
- C address lines x[18-26] renamed to x[19-27] (reg index +1)
- Original role of x27 fulfilled by x5 which is free after k-loop pert.

FP64 does not require changing since x18 is not used there.

commit 9f4a4a3cfb2244e4024445e127dafd2a11f39fc5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 17:21:28 2021 +0900

Armv8A Rename Regs for Clang Compile: FP32 Part

Roughly the same as 916e1fa , additionally with x15 clobbering removed.
- x15: Not used at all.

Compilation w/ Clang shows warning about x18 reservation, but
compilation itself is OK and all tests passed.

commit 916e1fa8be3cea0e3e2a4a7e8b00027ac2ee7780
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 16:46:52 2021 +0900

Armv8A Rename Regs for Clang Compile: FP64 Part

- x7, x8: Used to store addresses for Alpha and Beta.
As Alpha & Beta are not used in k-loops, use x0, x1 to load
Alpha & Beta's addresses after k-loops are completed, since A & B's
addresses are no longer needed there.
This "ldr [addr]; -> ldr val, [addr]" would not cause much performance
drawback since it is done outside k-loops and there are plenty of
instructions between Alpha & Beta's loading and usage.
- x9: Used to store cs_c. x9 is multiplied by 8 into x10 and not used
any longer. Directly loading cs_c into x10 and scaling by 8 frees
x9 straightforwardly.
- x11, x12: Not used at all. Simply remove from clobber list.
- x13: Like x9, loaded and scaled by 8 into x14, except that x13 is
also used in a conditional branch so that "cmp x13, 1" needs to be
modified into "cmp x14, 8" to completely free x13.
- x3, x4: Used to store next_a & next_b. Untouched in k-loops. Load
these addresses into x0 and x1 after Alpha & Beta are both loaded,
since then neither address of A/B nor address of Alpha/Beta is needed.

commit 7fabd896af773623ed01820a71bbff432e8a7d25
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 29 16:28:03 2021 +0900

Asm Flag Mingling for Darwin_Aarch64

Apple+Arm64 requires additional "tagging" of local symbols.

commit 213dce32d2eed8b7a38c6a3f6112072b0a89ecd0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 28 14:49:57 2021 -0500

Added a new 'gemmlike' sandbox.

Details:
- Added a new sandbox called 'gemmlike', which implements sequential and
multithreaded gemm in the style of gemmsup but also unconditionally
employs packing. The purpose of this sandbox is to
(1) avoid select abstractions, such as objects and control trees, in
order to allow readers to better understand how a real-world
implementation of high-performance gemm can be constructed;
(2) provide a starting point for expert users who wish to build
something that is gemm-like without "reinventing the wheel."
Thanks to Jeff Diamond, Tze Meng Low, Nicholai Tukanov, and Devangi
Parikh for requesting and inspiring this work.
- The functions defined in this sandbox currently use the "bls_" prefix
instead of "bli_" in order to avoid any symbol collisions in the main
library.
- The sandbox contains two variants, each of which implements gemm via a
block-panel algorithm. The only difference between the two is that
variant 1 calls the microkernel directly while variant 2 calls the
microkernel indirectly, via a function wrapper, which allows the edge
case handling to be abstracted away from the classic five loops.
- This sandbox implementation utilizes the conventional gemm microkernel
(not the skinny/unpacked gemmsup kernels).
- Updated some typos in the comments of a few files in the main
framework.
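
The "variant 2" idea, calling the microkernel through a wrapper that hides edge cases, can be sketched as follows. The block sizes, packing layout, and all `sketch_` names are assumptions for illustration; the sandbox's own wrapper is not shown in this log.

```c
#include <assert.h>

enum { SKETCH_MR = 2, SKETCH_NR = 2 }; /* hypothetical register block sizes */

/* Full-tile microkernel: C(MR x NR) := beta*C + A*B, with A packed as
   k columns of MR elements and B packed as k rows of NR elements. */
void sketch_ukr(int k, const double* a, const double* b,
                double beta, double* c, int rs_c, int cs_c)
{
    for (int i = 0; i < SKETCH_MR; ++i)
        for (int j = 0; j < SKETCH_NR; ++j)
        {
            double ab = 0.0;
            for (int p = 0; p < k; ++p)
                ab += a[p * SKETCH_MR + i] * b[p * SKETCH_NR + j];
            double* cij = &c[i * rs_c + j * cs_c];
            *cij = (beta == 0.0) ? ab : beta * (*cij) + ab;
        }
}

/* Wrapper: full tiles go straight to the microkernel; edge tiles
   (m < MR or n < NR) are computed into a temporary microtile with
   beta = 0 and then only the valid m x n part is accumulated into C. */
void sketch_ukr_wrap(int m, int n, int k, const double* a, const double* b,
                     double beta, double* c, int rs_c, int cs_c)
{
    assert(0 < m && m <= SKETCH_MR && 0 < n && n <= SKETCH_NR);
    if (m == SKETCH_MR && n == SKETCH_NR)
    {
        sketch_ukr(k, a, b, beta, c, rs_c, cs_c);
        return;
    }
    double ct[SKETCH_MR * SKETCH_NR];           /* temporary microtile */
    sketch_ukr(k, a, b, 0.0, ct, SKETCH_NR, 1); /* fully overwrites ct */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
        {
            double* cij = &c[i * rs_c + j * cs_c];
            double  t   = ct[i * SKETCH_NR + j];
            *cij = (beta == 0.0) ? t : beta * (*cij) + t;
        }
}
```

With a wrapper like this, the loops around the microkernel can call one function for every tile without special-casing the matrix edges.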

commit 82af05f54c34526a60fd2ec46656f13e1ac8f719
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 25 15:25:08 2021 -0500

Updated Fugaku (a64fx) performance results.

Details:
- Updated the performance graphs (pdfs and pngs) for the Fugaku/a64fx
entry within Performance.md, and also updated the experiment details
accordingly. Thanks to RuQing Xu for re-running the BLIS and SSL2
experiments reflected in this commit.
- In Performance.md, added an English translation of the project name
under which the Fugaku results were gathered, courtesy of RuQing Xu.

commit e5c85da3763f73854ecd739ba3008bb467ed77c3
Merge: cbd8d393 5feb04e2
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon May 24 16:56:22 2021 -0500

Merge pull request 503 from flame/windows-compiler-check

Add explicit compiler check for Windows.

commit cbd8d3932599485727204479fded66ac19186db4
Merge: 6d4ab022 932dfe6a
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon May 24 16:32:42 2021 -0500

Merge pull request 500 from xrq-phys/armsve+travis

Upgrade Travis CI for Arm SVE

commit 5feb04e233e1e6f81c727578ad9eae1367a2562f
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:46:56 2021 -0500

Add explicit compiler check for Windows.

Check the C compiler for a predefined macro `_WIN32` to indicate (cross-)compilation for Windows. Fixes 463.

commit 6d4ab0223d9014ac2a66d66759536aa305be5867
Merge: 61584ded 859fb77a
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:39:53 2021 -0500

Merge pull request 502 from flame/rm-rm-dupls

Remove `rm-dupls` function in common.mk.

commit 859fb77a320a3ace71d25a8885c23639b097a1b6
Author: Devin Matthews <damatthewssmu.edu>
Date: Sun May 23 18:15:23 2021 -0500

Remove `rm-dupls` function in common.mk.

AMD requested removal due to unclear licensing terms; original code was from stackoverflow. The function is unused but could easily be replaced by new implementation.

commit 932dfe6abb9617223bd26a249e53447169033f8c
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu May 20 02:07:31 2021 +0900

Travis CI Revert Unnecessary Extras from 91d3636

- Removed `V=1` in make line
- Removed `CFLAGS` in configure line
- Restored `pwd` surrounding OOT line

commit bd156a210d347a073a6939cc4adab3d9256c2e2b
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sun May 16 02:56:14 2021 +0900

Adjust TravisCI

- ArmSVE: don't test gemmt (seems to be a Qemu-only problem);
- Clang: use the TravisCI-provided version instead of pinning clang-8,
since clang-8 seems to conflict with TravisCI's clang-7.

commit 91d3636031021af3712d14c9fcb1eb34b6fe2a31
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Sat May 15 17:05:16 2021 +0900

Travis Support Arm SVE

- Updated distro to 20.04 focal aarch64-gcc-10.
This is the minimal version required by aarch64-gcc-10.
SVE intrinsics would not compile without GCC >=10.
- x86 toolchains use official repo instead of ubuntu-toolchain-r/test.
20.04 focal is not supported by that PPA at the moment.
- Add extra configuration-time options to .travis.yml.
- Add Arm SVE entry to .travis.yml.

commit 61584deddf9b3af6d11a811e6e04328d22390202
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Wed May 19 23:52:29 2021 +0900

Added 512b SVE-based a64fx subconfig + SVE kernels.

Details:
- Added 512-bit specific 'a64fx' subconfiguration that uses empirically
tuned block size by Stepan Nassyr. This subconfig also sets the sector
cache size and enables memory-tagging code in SVE gemm kernels. This
subconfig utilizes (16, k) and (10, k) DPACKM kernels.
- Added a vector-length agnostic 'armsve' subconfiguration that computes
blocksizes according to the analytical model. This part is ported from
Stepan Nassyr's repository.
- Implemented vector-length-agnostic [d/s/sh] gemm kernels for Arm SVE
at size (2*VL, 10). These kernels use unindexed FMLA instructions
because indexed FMLA takes 2 FMA units in many implementations.
PS: There are indexed-FMLA kernels in Stepan Nassyr's repository.
- Implemented 512-bit SVE dpackm kernels with in-register transpose
support for sizes (16, k) and (10, k).
- Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for
size (12, k). This dpackm kernel is not currently used by any
subconfiguration.
- Implemented several experimental dgemmsup kernels which would
improve performance in a few cases. However, those dgemmsup kernels
generally underperform hence they are not currently used in any
subconfig.
- Note: This commit squashes several commits submitted by RuQing Xu via
PR 424.

commit b683d01b9c4ea5f64c8031bda816beccfbf806a0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 13 15:23:22 2021 -0500

Use extra undef when including ba/ex API headers.

Details:
- Inserted a "include bli_xapi_undef.h" after each usage of the basic
and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
the previous status quo, in which each header made minimal undef
prior to its own definitions and then a single instance of
"include bli_xapi_undef.h" cleaned up any remaining macro defs after
all other headers were used. This commit will guarantee that macro
defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
the definitions made in a subsequent header. As with the previous
commit, this change does not fix any issue but rather attempts to
avoid creating orphaned macro definitions that are only needed within
a very limited scope.
- Removed minimal undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.

commit d4427a5b2f5cab5d2a64c58d87416628867c2b4a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 13 13:55:11 2021 -0500

Minor preprocessor/header cleanup.

Details:
- Added frame/include/bli_xapi_undef.h, which explicitly undefines all
macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
bli_tapi_ex.h. (This is for safety and good cpp coding practice, not
because it fixes anything.)
- Added include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h,
bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h.
- Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
bli_tapi_ex.h.
- Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing
that nothing in BLIS used those function pointer types. Also commented
out the "include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.

commit 5aa63cd927b22a04e581b07d0b68ef391f4f9b1f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 12 19:53:35 2021 -0500

Fixed typo in cpp guard in bli_util_ft.h.

Details:
- Changed ifdef BLIS_OAPI_BASIC to ifdef BLIS_TAPI_BASIC in
bli_util_ft.h. This typo was causing some types to be redefined when
they weren't supposed to be.

commit f0e8634775094584e89f1b03811ee192f2aaf67f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 12 18:45:32 2021 -0500

Defined eqsc, eqv, eqm to test object equality.

Details:
- Defined eqsc, eqv, and eqm operations, which set a bool depending on
whether the two scalars, two vectors, or two matrix operands are equal
(element-wise). eqsc and eqv support implicit conjugation and eqm
supports diagonal offset, diag, uplo, and trans parameters (in a
manner consistent with other level-1m operations). These operations
are currently housed under frame/util, at least for now, because they
are not computational in nature.
- Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm.
- Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md.
Also:
- Documented getsc and setsc in both docs.
- Reordered entry for setijv in BLISTypedAPI.md, and added separator
bars to both docs.
- Added missing "Observed object properties" clauses to various
level-1v entries in BLISObjectAPI.md.
- Defined bli_apply_trans() in bli_param_macro_defs.h.
- Defined supporting _check() function, bli_l0_xxbsc_check(), in
bli_l0_check.c for eqsc.
- Programming style and whitespace updates to bli_l1m_unb_var1.c.
- Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c
- Consolidated redundant macro redefinition for copym function pointer
type in bli_l1m_ft.h.
- Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that
allow oapi and tapi source files to forego defining certain expert
functions. (Certain operations such as printv and printm do not need
to have both basic and expert interfaces. This also includes eqsc, eqv,
and eqm.)
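
As a rough picture of what an eqv-style operation does, the sketch below compares two strided real vectors element-wise and sets a bool through a pointer. The name `sketch_deqv` and its parameter order are hypothetical, and conjugation is omitted because it has no effect in the real domain; see BLISTypedAPI.md for the actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Set *is_eq to true if x[i*incx] == y[i*incy] for all i < n,
   and to false otherwise. Zero-length vectors compare equal. */
void sketch_deqv(size_t n, const double* x, ptrdiff_t incx,
                 const double* y, ptrdiff_t incy, bool* is_eq)
{
    assert(is_eq != NULL);
    *is_eq = true;
    for (size_t i = 0; i < n; ++i)
    {
        if (x[(ptrdiff_t)i * incx] != y[(ptrdiff_t)i * incy])
        {
            *is_eq = false;
            return;
        }
    }
}
```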

commit 5d46dbee4a06ba5a422e19817836976f8574cb4f
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed May 12 18:42:09 2021 -0500

Replace bli_dlamch with something less archaic (498)

Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
constants from the standard C library in lieu of dynamically-computed
values (via code inherited from netlib). The previous implementation
is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is
defined by the subconfiguration at compile-time. Thanks to Devin
Matthews for providing this patch, and to Stefano Zampini for
reporting the issue (497) that prompted Devin to propose the patch.
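
To illustrate the general approach of answering lamch-style queries from `<float.h>` constants, here is a small sketch. The function name, the set of query characters handled, and the exact value returned for each are assumptions; in particular, LAPACK-style conventions may scale the epsilon differently, and the actual bli_dlamch() mapping is not shown in this log.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical lamch-like query answered from compile-time constants
   instead of values computed at run time. */
double sketch_dlamch(char cmach)
{
    switch (cmach)
    {
        case 'E': case 'e': return DBL_EPSILON;          /* machine epsilon   */
        case 'B': case 'b': return (double)FLT_RADIX;    /* base              */
        case 'N': case 'n': return (double)DBL_MANT_DIG; /* mantissa digits   */
        case 'U': case 'u': return DBL_MIN;              /* smallest normal   */
        case 'O': case 'o': return DBL_MAX;              /* largest finite    */
        default:            return 0.0;                  /* unknown query     */
    }
}
```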

commit 6a89c7d8f9ac3f51b5b4d8ccb2630d908d951e6f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat May 1 18:54:48 2021 -0500

Defined setijv, getijv to set/get vector elements.

Details:
- Defined getijv, setijv operations to get and set elements of a vector,
in bli_setgetijv.c and .h.
- Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively.
- Added additional bounds checking to getijm and setijm to prevent
actions with negative indices.
- Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv
and setijv.
- Added documentation to BLISTypedAPI.md for getijm and setijm, which
were inadvertently missing.
- Added a new entry to the FAQ titled "Why does BLIS have vector
(level-1v) and matrix (level-1m) variations of most level-1
operations?"
- Comment updates.
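
A hedged sketch of element get/set on a strided vector with the kind of index checking mentioned above. The names, argument order, and error codes are hypothetical; BLISTypedAPI.md documents the real getijv and setijv interfaces.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_SUCCESS = 0, SKETCH_FAILURE = -1 } sketch_err_t;

/* Store `value` in element i of a length-n vector with stride incx.
   Negative or too-large indices are rejected instead of written. */
sketch_err_t sketch_dsetijv(double value, ptrdiff_t i,
                            double* x, size_t n, ptrdiff_t incx)
{
    assert(x != NULL);
    if (i < 0 || (size_t)i >= n) return SKETCH_FAILURE;
    x[i * incx] = value;
    return SKETCH_SUCCESS;
}

/* Read element i of a length-n vector with stride incx into *value. */
sketch_err_t sketch_dgetijv(ptrdiff_t i, const double* x, size_t n,
                            ptrdiff_t incx, double* value)
{
    assert(x != NULL && value != NULL);
    if (i < 0 || (size_t)i >= n) return SKETCH_FAILURE;
    *value = x[i * incx];
    return SKETCH_SUCCESS;
}
```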

commit 4534daffd13ed7a8983c681d3f5e9de17c9f0b96
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 27 18:16:44 2021 -0500

Minor API breakage in bli_pack API.

Details:
- Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that
instead of returning a bool, they set a bool that is passed in by
address. This does break the public exported API, but I expect very
few users actually use this function. (This change is being made in
preparation for a much more extensive commit relating to error
checking.)
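
The shape of this change, returning a value through a pointer argument instead of the return value, looks roughly like the following. The `sketch_` names and the global flag are hypothetical; presumably the freed-up return value is what the later error-checking work would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool sketch_pack_a_flag = false; /* hypothetical stored setting */

void sketch_set_pack_a(bool pack_a) { sketch_pack_a_flag = pack_a; }

/* Old style (per the log): the getter returned the bool directly.
   New style: the caller passes the address of a bool to be set. */
void sketch_get_pack_a(bool* pack_a)
{
    assert(pack_a != NULL);
    *pack_a = sketch_pack_a_flag;
}
```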

commit 6a4aa986ffc060d3e64ed230afe318b82630f8b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 23 13:10:01 2021 -0500

Fixed typo in Table of Contents.

commit f6424b5b82160d346a09a0fbb526981ecf66cdb3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 23 13:08:06 2021 -0500

Added dedicated Performance section to README.md.

Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
Documentation section into a new Performance section dedicated to
those two links. (The previous entries remain redundantly listed
within Documentation section.) Thanks to Robert van de Geijn for
suggesting this change.

commit 40ce5fd241b9ad140bf57278d440f0598d7f15d8
Merge: 6280757b 1f3461a5
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 21 09:54:25 2021 -0500

Merge pull request 493 from cassiersg/patch-1

Fix typo in FAQ.md

commit 1f3461a5a5a88510f913451a93e3190ec1556f39
Author: Gaëtan Cassiers <cassiersgusers.noreply.github.com>
Date: Wed Apr 21 16:49:05 2021 +0200

Fix typo in FAQ.md

commit 6548cebaf55a1f9bdb8417cc89dd0444d8f9c2e4
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Apr 14 13:00:42 2021 -0500

Allow clang for ThunderX2 config

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

commit 6280757be32f90fd77d8dd9357b07d9306e6f80d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 7 13:03:56 2021 -0500

Minor updates to a64fx section of Performance.md.

commit 1e6ed823c6cd11f9b671779f3c8bdbd2bbb40f34
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Thu Apr 8 02:59:26 2021 +0900

Additional A64fx Comments (490)

* Performance.md Update A64fx Comments

- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.

* Include Another Fix in armsve-cfg-vendor

A prototype was forgotten, so the void* pointer was not returned in full.

commit 2688f21a5b073950f6f187c95917fdbb5aac234a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 6 19:02:37 2021 -0500

Added Fujitsu A64fx (512-bit SVE) perf results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on the "Fugaku"
Fujitsu A64fx supercomputer at the RIKEN Center for Computational
Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
Nassyr for their work in developing and optimizing A64fx support in
BLIS and RuQing for gathering the performance data that is reflected
in these new graphs.

commit ba3ba8da83d48397162139e11337c036a631ba79
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 6 18:39:58 2021 -0500

Minor updates and fixes to test/3/octave scripts.

Details:
- Fixed an issue where the wrong string was being passed in for the
vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.

commit 09bd4f4f12311131938baa9f75d27e92b664d681
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 31 17:09:36 2021 -0500

Add err_t* "return" parameter to malloc functions.

Details:
- Added an err_t* parameter to memory allocation functions including
bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
already use the return value to return the allocated memory address,
they can't communicate errors to the caller through the return value.
This commit does not employ any error checking within these functions
or their callers, but this sets up BLIS for a more comprehensive
commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
bli_type_defs.h. This was done so that what remains of bli_malloc.h
can be included after the definition of the err_t enum. (This ordering
was needed because bli_malloc.h now contains function prototypes that
use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
bli_param_macro_defs.h. These functions provide easy checks for error
codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
the API for bli_malloc_user(), which is an exported symbol in the
shared library. However, it's quite possible that the only application
that calls bli_malloc_user()--indeed, the reason it was marked for
symbol exporting to begin with--is the BLIS testsuite. And if that's
the case, this breakage won't affect anyone. Nonetheless, the "major"
part of the so_version file has been updated accordingly to 4.0.0.
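
The out-parameter pattern described above can be sketched as follows. This is only an illustration: my_err_t, my_malloc(), my_is_success(), and my_is_failure() are hypothetical stand-ins, not the BLIS definitions, and the error codes are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical error codes; the real err_t enum is not reproduced here. */
typedef enum { MY_SUCCESS = 0, MY_MALLOC_FAILED = -1 } my_err_t;

/* Stand-ins for the kind of check bli_is_success()/bli_is_failure() provide. */
static inline bool my_is_success(my_err_t r) { return r == MY_SUCCESS; }
static inline bool my_is_failure(my_err_t r) { return r != MY_SUCCESS; }

/* The return value already carries the address, so the status is
   reported through the r_val out-parameter instead. */
void* my_malloc(size_t size, my_err_t* r_val)
{
    assert(r_val != NULL);
    void* p = malloc(size);
    *r_val = (p != NULL) ? MY_SUCCESS : MY_MALLOC_FAILED;
    return p;
}
```

A caller would declare a my_err_t, pass its address, and test it with my_is_failure() before touching the returned pointer.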

commit f9ad55ce7e12f59930605753959fcfd41a218d8d
Merge: 04502492 90508192
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 31 14:20:19 2021 -0500

Merge branch 'master' into dev

commit 90508192f2d6ae95adc2a3ba9f4e5bad2c8d6fd2
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Mar 30 21:16:44 2021 -0500

Update do_sde.sh (489)

Update to a newer version of SDE, and do a direct download as it seems you don't have to click-through the license anymore.

commit 22c6b5dc4c9cc21942f8ccc30891f9b4385a9504
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Tue Mar 30 19:07:42 2021 -0500

Fixed bug in power10 microkernel I/O. (488)

Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
not store the microtile result correctly due to incorrect index
calculations. (The error was introduced when I reorganized the
'kernels/power10/3' directory.)

commit 04502492671456b94bcdee60b9de347b6763a32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Mar 28 19:11:43 2021 -0500

Always stay initialized after BLAS compat calls.

Details:
- Removed the option to finalize BLIS after every BLAS call, which also
means that BLIS would initialize at the beginning of every BLAS call.
This option never really made sense and wasn't even implemented
properly to begin with. (Because bli_init_auto() and _finalize_auto()
were implemented in terms of bli_init_once() and _finalize_once(),
respectively, the application would have only been able to call one
BLAS routine before BLIS would find itself in an unusable, permanently
uninitialized state.) Because this option was never meant for regular
use, it never made it into configure as an actual configure-time
option, and therefore this commit only removes parts of the code
affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.
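
Why a finalize-after-every-call scheme built on "once" routines breaks after the first call can be shown with a small model. This is a sketch under assumptions: lib_state_t and the functions below are hypothetical and use plain flags, whereas the real code relies on once-style threading primitives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the library's init state. */
typedef struct
{
    bool init_fired;      /* the one-time init has already run     */
    bool finalize_fired;  /* the one-time finalize has already run */
    bool initialized;     /* is the library currently usable?      */
} lib_state_t;

static void init_once(lib_state_t* s)
{
    if (s->init_fired) return;       /* a "once" routine never reruns */
    s->init_fired  = true;
    s->initialized = true;
}

static void finalize_once(lib_state_t* s)
{
    if (s->finalize_fired) return;
    s->finalize_fired = true;
    s->initialized    = false;
}

/* Removed behavior: init, do the work, finalize on every call.
   Returns whether the library was usable during the call. */
bool call_with_finalize(lib_state_t* s)
{
    assert(s != NULL);
    init_once(s);
    bool usable = s->initialized;
    finalize_once(s);
    return usable;
}

/* Behavior after this commit: initialize once and stay initialized. */
bool call_stay_initialized(lib_state_t* s)
{
    assert(s != NULL);
    init_once(s);
    return s->initialized;
}
```

In this model the first call_with_finalize() succeeds, and every later one finds the state uninitialized because init_once() refuses to run again.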

commit 3a6f41afb8197e831b6ce2f1ae7f63735685fa0a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 27 17:22:14 2021 -0500

Renamed membrk files/vars/functions to pba.

Details:
- Renamed the files, variables, and functions relating to the packing
block allocator from its legacy name (membrk) to its current name
(pba). This more clearly contrasts the packing block allocator with
the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
caused the function to erroneously change the value of the pack_a
field of the global rntm_t instead of the pack_b field. (Apparently
nobody has used this API yet.)
- Comment updates.

commit 36cb4116d15cfef2d42ec4a834efd4a958f261b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 27 15:15:09 2021 -0500

Switch allocator mutexes to static initialization.

Details:
- Switched the small block allocator (sba), as defined in bli_sba.c and
bli_apool.c, to static initialization of its internal mutex. Did a
similar thing for the packing block allocator (pba), which appears as
global_membrk in bli_membrk.c.
- Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex()
to ensure they won't be used in the future.
- In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp
blocks guarded by BLIS_USE_PTHREAD_MUTEX.

commit 159ca6f01a5f91b93513134c9470b69ff78f5354
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 24 15:57:32 2021 -0500

Made test/3/octave scripts robust to missing data.

Details:
- Modified the octave scripts in test/3 so that the script does not
choke when one or more of the expected OpenBLAS, Eigen, or vendor data
files is missing. (The BLIS data set, however, must be complete.) When
a file is missing, that data series is simply not included on that
particular graph. Also factored out a lot of the redundant logic from
plot_panel_4x5.m into a separate function in read_data.m.

commit 545e6c2f6d09d023b353002a9a43b11aa0c1d701
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:42:33 2021 -0500

CHANGELOG update (0.8.1)

0.8.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:42:33 2021 -0500

Version file update (0.8.1)

commit e56d9f2d94ed247696dda2cbf94d2ca05c7fc089
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:40:50 2021 -0500

ReleaseNotes.md update in advance of next version.

commit ca83f955d45814b7d84f53933cdb73323c0dea2c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 22 17:21:21 2021 -0500

CREDITS file update.

commit 57ef61f6cdb86957f67212aa59407f2f8e7f3d1a
Merge: bf1b578e e7a4a8ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 19 13:05:43 2021 -0500

Merge branch 'master' of github.com:flame/blis

commit bf1b578ea32ea1c9dbf7cb3586969e8ae89aa5ef
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 19 13:03:17 2021 -0500

Reduced KC on skx from 384 to 256.

Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
from 384 to 256. The maximum (extended) KC was also reduced
accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
this change.

commit e7a4a8edc940942357e8e4c4594383a29a962f93
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Wed Mar 17 19:43:31 2021 -0500

Fix calculation of new pb size (487)

Details:
- Added missing parentheses to the i8 and i4 instantiations of the
GENERIC_GEMM macro in sandbox/power10/generic_gemm.c.
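
The commit message does not show the affected code, so the following is only a generic sketch of how missing parentheses around macro arguments can change a size calculation. PB_SIZE_BUGGY and PB_SIZE_FIXED are hypothetical and are not taken from GENERIC_GEMM.

```c
#include <assert.h>

/* Buggy: arguments are pasted in bare, so operator precedence applies
   to whatever expression the caller passes. */
#define PB_SIZE_BUGGY(k, pack) (k / pack)

/* Fixed: every argument use is parenthesized. */
#define PB_SIZE_FIXED(k, pack) ((k) / (pack))

/* Expands to (k + 3 / 4), i.e. k + 0 in integer arithmetic. */
int pb_size_buggy(int k)
{
    assert(k >= 0);
    return PB_SIZE_BUGGY(k + 3, 4);
}

/* Expands to ((k + 3) / (4)), the intended rounded-up quotient. */
int pb_size_fixed(int k)
{
    assert(k >= 0);
    return PB_SIZE_FIXED(k + 3, 4);
}
```

For k = 10 the buggy form yields 10 while the fixed form yields 3.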

commit 4493cf516e01aba82642a43abe350943ba458fe2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 15 13:12:49 2021 -0500

Redefined BLIS_NUM_ARCHS to update automatically.

Details:
- Changed BLIS_NUM_ARCHS from a cpp macro definition to the last enum
value in the arch_t enum. This means that it no longer needs to get
updated manually whenever new subconfigurations are added to BLIS.
Also removed the explicit initial index assignment of 0 from the
first enum value, which was unnecessary due to how the C language
standard mandates indexing of enum values. Thanks to Devin Matthews
for originally submitting this as a PR in 446.
- Updated docs/ConfigurationHowTo.md to reflect the aforementioned
change.
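
The sentinel-enumerator idiom described above looks roughly like this. The names are hypothetical (my_arch_t and MY_ARCH_* are not the actual arch_t list); the point is only that the last enumerator tracks the count automatically.

```c
#include <assert.h>

/* Hypothetical subconfiguration ids. */
typedef enum
{
    MY_ARCH_SKX,       /* first enumerator is 0 by default; no "= 0" needed */
    MY_ARCH_HASWELL,
    MY_ARCH_GENERIC,

    MY_NUM_ARCHS       /* always last: equals the number of ids above */
} my_arch_t;

/* A table sized by the sentinel grows with the enum. */
static const char* const my_arch_names[MY_NUM_ARCHS] =
{
    [MY_ARCH_SKX]     = "skx",
    [MY_ARCH_HASWELL] = "haswell",
    [MY_ARCH_GENERIC] = "generic",
};

const char* my_arch_name(my_arch_t id)
{
    assert((int)id >= 0 && (int)id < MY_NUM_ARCHS);
    return my_arch_names[id];
}
```

Adding a new id before MY_NUM_ARCHS bumps the count without any other edit, which is the maintenance benefit the commit is after.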

commit a4b73de84cdffcbe5cf71969a0f7f0f8202b3510
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 12 17:12:27 2021 -0600

Disabled _self() and _equal() in bli_pthread API.

Details:
- Disabled the _self() and _equal() extensions to the bli_pthread API
introduced in d479654. These functions were disabled after I realized
that they aren't actually needed yet. Thanks to Devin Matthews for
helping me reason through the appropriate consumer code that will
appear in BLIS (eventually) in a future commit. (Also, I could never
get the Windows branch to link properly in clang builds in AppVeyor.
See the comment I left in the code, and 485, for more info.)

commit f9d604679d8715bc3e79a8630268446889b51388
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 16:57:55 2021 -0600

Added _self() and _equal() to bli_pthread API.

Details:
- Expanded the bli_pthread API to include equivalents to pthread_self()
and pthread_equal(). Implemented these two functions for all three cpp
branches present within bli_pthread.c: systemless, Windows, and
Linux/BSD.

commit fa9b3c8f6b3d5717f19832362104413e1a86dfb0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 15:13:51 2021 -0600

Shuffled code in Windows branch of bli_pthreads.c.

Details:
- Reordered the definitions in the cpp branch in bli_pthreads.c that
defines the bli_pthreads API in terms of Windows API calls. Also added
missing comments that mark sections of the API, which brings the code
into harmony with other cpp branches (as well as bli_pthread.h).

commit 95d4f3934d806b3563f6648d57a4e381d747caf5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 11 13:50:40 2021 -0600

Moved cpp macro redef of strerror_r to bli_env.c.

Details:
- Relocated the _MSC_VER-guarded cpp macro re-definition of strerror_r
(in terms of strerror_s) from bli_thread.h to bli_env.c. It was
likely left behind in bli_thread.h in a previous commit, when code
that now resides in bli_env.c was moved from bli_thread.c. (I couldn't
find any other instance of strerror_r being used in BLIS, so I moved
the define directly to bli_env.c rather than place it in bli_env.h.)
The code that uses strerror_r is currently disabled, though, so this
commit should have no effect on BLIS.

commit 8a3066c315358d45d4f5b710c54594455f9e8fc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 9 17:52:59 2021 -0600

Relocated gemmsup_ref general stride handling.

Details:
- Moved the logic that checks for general stridedness in any of the
matrix operands in a gemmsup problem. The logic previously resided
near the top of bli_gemmsup_int(), which is the thread entry point
for the parallel region of the current gemmsup implementation. The
problem with this setup was that the code would attempt to reject
problems with any general-strided operands by returning BLIS_FAILURE,
and that return value was then being ignored by the l3_sup thread
decorator, which unconditionally returns BLIS_SUCCESS. To solve this
issue, rather than try to manage n return values, one from each of n
threads, I simply moved the logic into bli_gemmsup_ref(). I didn't
move it any higher (e.g. bli_gemmsup()) because I still want the
logic to be part of the current gemmsup handler implementation. That
is, perhaps someone else will create a different handler, and that
author wants to handle general stride differently. (We don't want to
force them into a particular way of handling general stride.)
- Removed the general stride handling from bli_gemmtsup_int(), even
though this function is inoperative for now.
- This commit addresses issue 484. Thanks to RuQing Xu for reporting
this issue.

commit 670bc7b60f6065893e8ec1bebd2fc9e5ba710dff
Author: Nicholai Tukanov <nicholaitukanovgmail.com>
Date: Fri Mar 5 13:53:43 2021 -0600

Add low-precision POWER10 gemm kernels (467)

Details:
- This commit adds a new BLIS sandbox that (1) provides implementations
based on low-precision gemm kernels, and (2) extends the BLIS typed
API for those new implementations. Currently, these new kernels can
only be used for the POWER10 microarchitecture; however, they may
provide a template for developing similar kernels for other
microarchitectures (even those beyond POWER), as changes would likely
be limited to select places in the microkernel and possibly the
packing routines. The new low-precision operations that are now
supported include: shgemm, sbgemm, i16gemm, i8gemm, i4gemm. For more
information, refer to the POWER10.md document that is included in
'sandbox/power10'.

commit b8dcc5bc75a746807d6f8fa22dc2123c98396bf5
Author: RuQing Xu <r-xug.ecc.u-tokyo.ac.jp>
Date: Tue Mar 2 06:58:24 2021 +0800

Fixed typed API definition for gemmt (476)

Details:
- Fixed incorrect definition and prototype of bli_?gemmt() in
frame/3/bli_l3_tapi.c and .h, respectively. gemmt was previously
defined identically to gemm, which was wrong because it did not
take into account the uplo property of C.
- Fixed incorrect API documentation for her2k/syr2k in BLISTypedAPI.md.
Specifically, the document erroneously listed only a single transab
parameter instead of transa and transb.
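
To make the role of the uplo property concrete, here is a naive reference sketch of a gemmt-like update that touches only one triangle of C. It is not the BLIS implementation or its API: ref_dgemmt() is a hypothetical name, storage is assumed row-major, and transposition is not supported.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* C := beta*C + alpha*A*B on one triangle of the n-by-n matrix C.
   A is n-by-k and B is k-by-n, all row-major. */
void ref_dgemmt(bool lower, size_t n, size_t k,
                double alpha, const double* a, const double* b,
                double beta, double* c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
    {
        /* Column range of row i that lies inside the chosen triangle. */
        size_t j_begin = lower ? 0     : i;
        size_t j_end   = lower ? i + 1 : n;

        for (size_t j = j_begin; j < j_end; ++j)
        {
            double ab = 0.0;
            for (size_t p = 0; p < k; ++p)
                ab += a[i * k + p] * b[p * n + j];
            c[i * n + j] = beta * c[i * n + j] + alpha * ab;
        }
    }
}
```

A plain gemm signature has no way to express the lower/upper choice, which is why defining gemmt identically to gemm was incorrect.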

commit a0e4fe2340a93521e1b1a835a96d0f26dec8406a
Author: Ilknur <ilknuri607gmail.com>
Date: Tue Mar 2 02:06:56 2021 +0400

Fixed double free() in level1v example (482)

Details:
- In examples/tapi/00level1v.c, pointer 'z' was being freed twice and
pointer 'a' was not being freed at all. This commit correctly frees
each pointer exactly once.

commit f5871c7e06a75799251d6b55a8a5fbfa1a92cf95
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Feb 28 17:03:57 2021 -0600

Added complex asm packm kernels for 'haswell' set.

Details:
- Implemented assembly-based packm kernels for single- and double-
precision complex domain (c and z) and housed them in the 'haswell'
kernel set. This means c3xk, c8xk, z3xk, and z4xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Minor modifications to the corresponding s and d packm kernels that
were introduced in 426ad67.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), upon which these complex kernels are
partially based.

commit 426ad679f55264e381eb57a372632b774320fb85
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 27 18:39:56 2021 -0600

Added assembly packm kernels for 'haswell' set.

Details:
- Implemented assembly-based packm kernels for single- and double-
precision real domain (s and d) and housed them in the 'haswell'
kernel set. This means s6xk, s16xk, d6xk, and d8xk are now all
optimized.
- Registered the aforementioned packm kernels in the haswell, zen,
and zen2 subconfigs.
- Thanks to AMD, who originally contributed the double-precision real
packm kernels (d6xk and d8xk), which I have now tweaked and used to
create comparable single-precision real kernels (s6xk and s16xk).

commit f50c1b7e5886d29efe134e1994d05af9949cd4b6
Merge: 8f39aea1 b3953b93
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Feb 1 11:55:51 2021 -0600

Merge pull request 473 from ajaypanyala/pkgconfig

build: generate pkgconfig file

commit 8f39aea11f80a805b66cff4b4dc5e72727ea461d
Merge: f8db9fb3 2a815d5b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jan 30 17:59:56 2021 -0600

Merge branch 'dev'

commit f8db9fb33b48844d6b47fdef699625bd9197745a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 28 08:04:52 2021 -0600

Fixed missing parentheses in README.md Citations.

commit b3953b938eee59f79b4a4162ba583a5cb59fa34e
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Tue Jan 12 17:07:04 2021 -0800

drop CFLAGS in the generated pkgconfig file

commit b02d9376bac31c1a1c7916f44c4946277a1425e2
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Mon Jan 11 20:50:01 2021 -0800

add datadir

commit d8d8deeb6d8b84adb7ae5fdb88c6dd4f06624a76
Author: Ajay Panyala <ajay.panyalagmail.com>
Date: Mon Jan 11 17:47:50 2021 -0800

generate pkgconfig file

commit 8c65411c7c8737248a6f054ffa0ce008c95cb515
Merge: 328b4f88 874c3f04
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jan 11 16:01:45 2021 -0600

Merge pull request 471 from flame/fix-470

Fix kernel-to-config mapping for intel64

commit 874c3f04ece9af4d8fdf0e2713e21a259c117656
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jan 8 13:56:30 2021 -0600

Update configure

Choose last sub-config in the kernel-to-config map if the config list doesn't contain the name of the kernel set. E.g. for "zen: skx knl haswell" pick "haswell" instead of "skx" which was chosen previously. Fixes 470.
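
The selection rule can be sketched in C, although the real logic lives in the configure shell script. pick_config() is a hypothetical helper, and returning the matching name when the list does contain the kernel set is an assumption drawn from the wording above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the sub-config to associate with a kernel set: the entry whose
   name matches the kernel set if there is one, otherwise the LAST entry. */
const char* pick_config(const char* kernel,
                        const char* const* configs, size_t n_configs)
{
    assert(kernel != NULL && configs != NULL && n_configs > 0);
    for (size_t i = 0; i < n_configs; ++i)
        if (strcmp(configs[i], kernel) == 0)
            return configs[i];
    return configs[n_configs - 1];
}
```

With the list { "skx", "knl", "haswell" } and kernel set "zen", this returns "haswell", matching the example in the commit message.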

commit 2a815d5b365d934cb351b2f2a8cd1366e997b2e1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 18:03:39 2021 -0600

Support trsm pre-inversion in 1m, bb, ref kernels.

Details:
- Expanded support for disabling trsm diagonal pre-inversion to other
microkernel types, including the reference microkernel as well as the
kernel implementations for 1m and the pre-broadcast B (bb) format used
by the power9 subconfig. This builds on the 'haswell' and 'penryn'
kernel support added in 7038bba. Thanks to Bhaskar Nallani for
reminding me, in 461 (post-closure), that 1m support was missing from
that commit.
- Removed cpp branch of ref_kernels/3/bli_trsm_ref.c that contained the
omp simd implementation after making a stripped-down copy in 'old'.
This code has been disabled for some time and it seemed better suited
to rot away out of sight rather than clutter up a file that is already
cluttered by the presence of lower and upper versions.
- Minor comment update to bli_ind_init().

commit c3ed2cbb9f60100fc9beb2a9d75476de9f711dc5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 16:16:32 2021 -0600

Enable 1m only if real domain ukr is not reference.

Details:
- Previously, BLIS would automatically enable use of the 1m method
for a given precision if the complex domain microkernel was a
reference kernel. This commit adds an additional constraint so that
1m is only enabled if the corresponding real domain microkernel is
NOT reference. That is, BLIS now forgoes use of 1m if both the real and
complex domain kernels are reference implementations. Note that this
does not prevent 1m from being enabled manually under those
conditions; it only means that 1m will not be enabled automatically
at initialization-time.
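
The old and new auto-enable rules reduce to a small predicate. The struct and function names below are hypothetical; only the boolean logic is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one precision's gemm microkernels. */
typedef struct
{
    bool complex_ukr_is_ref;  /* complex-domain ukernel is a reference kernel */
    bool real_ukr_is_ref;     /* real-domain ukernel is a reference kernel    */
} ukr_info_t;

/* Rule before this commit: only the complex-domain kernel was consulted. */
bool auto_enable_1m_old(const ukr_info_t* u)
{
    assert(u != NULL);
    return u->complex_ukr_is_ref;
}

/* Rule after this commit: 1m also requires a non-reference real kernel. */
bool auto_enable_1m_new(const ukr_info_t* u)
{
    assert(u != NULL);
    return u->complex_ukr_is_ref && !u->real_ukr_is_ref;
}
```

The two rules differ only when both kernels are reference implementations, in which case the new rule leaves 1m off.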

commit ed50c947385ba3b0b5d550015f38f7f0a31755c0
Merge: 0cef09aa 328b4f88
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 4 14:31:44 2021 -0600

Merge branch 'master' into dev

commit 328b4f8872b4bca9a53d2de8c6e285f3eb13d196
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Dec 30 17:54:18 2020 -0600

Shared object (dylib) was not built correctly for partial build.

The SO build rule used $? instead of $^. Observed on macOS, not sure if it affected Linux or not.

commit ae6ef66ef824da9bc6348bf9d1b588cd4f2ded9b
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Dec 30 17:34:55 2020 -0600

bli_diag_offset_with_trans had wrong return type. Fixes 468.

commit ebcf197fb86fdd0a864ea928140752bc2462e8c6
Merge: 472f138c 21aa67e1
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Dec 5 22:26:27 2020 -0600

Merge pull request 466 from isuruf/patch-3

fix cc_vendor for crosstool-ng toolchains

commit 21aa67e11cebbc5a6dd7c6353154256294df3c33
Author: Isuru Fernando <isurufgmail.com>
Date: Sat Dec 5 21:59:13 2020 -0600

fix cc_vendor for crosstool-ng toolchains

commit 472f138cb927b7259126ebb9c68919cfcc7a4ea3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Dec 5 14:13:52 2020 -0600

Fixed typo in README.md to CodingConventions.md.

commit 0cef09aa92208441a656bf097f197ea8e22b533b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 4 16:40:59 2020 -0600

Consolidated code in level-3 _front() functions.

Details:
- Reduced a code segment that appears in all of the bli_*_front()
functions except for bli_gemm_front(). Previously, the code looked
like this (taken from bli_herk_front()):

    if ( bli_cntx_method( cntx ) == BLIS_NAT )
    {
        bli_obj_set_pack_schema( BLIS_PACKED_ROW_PANELS, &a_local );
        bli_obj_set_pack_schema( BLIS_PACKED_COL_PANELS, &ah_local );
    }
    else // if ( bli_cntx_method( cntx ) != BLIS_NAT )
    {
        pack_t schema_a = bli_cntx_schema_a_block( cntx );
        pack_t schema_b = bli_cntx_schema_b_panel( cntx );

        bli_obj_set_pack_schema( schema_a, &a_local );
        bli_obj_set_pack_schema( schema_b, &ah_local );
    }

This code segment is part of a sort-of-hack that allows us to
communicate the pack schemas into the level-3 thread decorator, which
needs them so that they can be passed into bli_l3_cntl_create_if(),
where the control tree is created. However, the first conditional case
above is unnecessary because the second case is fully generalized.
That is, even in the native case, the context contains correct,
queryable schemas. Thus, these code segments were reduced to something
like:

    pack_t schema_a = bli_cntx_schema_a_block( cntx );
    pack_t schema_b = bli_cntx_schema_b_panel( cntx );

    bli_obj_set_pack_schema( schema_a, &a_local );
    bli_obj_set_pack_schema( schema_b, &ah_local );

There's always a small chance that the seemingly unnecessary code
in the first branch case has some special use that is not apparent to
me, but the testsuite's default input parameters seem to think this
commit will be fine.

commit 7038bbaa05484141195822291cf3ba88cbce4980
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 4 16:08:15 2020 -0600

Optionally disable trsm diagonal pre-inversion.

Details:
- Implemented a configure-time option, --disable-trsm-preinversion, that
optionally disables the pre-inversion of diagonal elements of the
triangular matrix in the trsm operation and instead uses division
instructions within the gemmtrsm microkernels. Pre-inversion is
enabled by default. When it is disabled, performance may suffer
slightly, but numerical robustness should improve for certain
pathological cases involving denormal (subnormal) numbers that would
otherwise result in overflow in the pre-inverted value. Thanks to
Bhaskar Nallani for reporting this issue via 461.
- Added preprocessor macro guards to bli_trsm_cntl.c as well as the
gemmtrsm microkernels for 'haswell' and 'penryn' kernel sets pursuant
to the aforementioned feature.
- Added macros to frame/include/bli_x86_asm_macros.h related to division
instructions.
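
The numerical issue behind this option can be reproduced with scalars. The sketch below is not the microkernel code: solve_preinverted(), solve_with_division(), and tiny_subnormal() are hypothetical helpers that contrast multiplying by a pre-computed reciprocal with dividing directly, assuming IEEE 754 doubles with default floating-point settings.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Solve diag * x = b using a reciprocal computed up front, as
   pre-inversion does for each diagonal element. */
double solve_preinverted(double b, double diag)
{
    assert(diag != 0.0);
    double inv_diag = 1.0 / diag;   /* overflows to +inf for tiny subnormals */
    return b * inv_diag;
}

/* Solve the same equation with a division instead. */
double solve_with_division(double b, double diag)
{
    assert(diag != 0.0);
    return b / diag;
}

/* A subnormal value: 2^-1032, well below DBL_MIN (2^-1022). */
double tiny_subnormal(void)
{
    return DBL_MIN / 1024.0;
}
```

With b and diag both equal to tiny_subnormal(), the pre-inverted path returns infinity because the reciprocal overflows, while the division path returns exactly 1.0.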

commit 78aee79452cce2691c40f05b3632bdfc122300af
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 2 13:02:36 2020 -0600

Allow amaxv testsuite module to run with dim = 0.

Details:
- Exit early from libblis_test_amaxv_check() when the vector dimension
(length) of x is 0. This allows the module to run when the testsuite
driver passes in a problem size of 0. Thanks to Meghana Vankadari for
alerting us to this issue via 459.
- Note: All other testsuite modules appear to work with problem sizes
of 0, except for the microkernel modules. I chose not to "fix" those
modules because a failure (or segmentation fault, as happens in this
case) is actually meaningful in that it alerts the developer that some
microkernels cannot be used with k = 0. Specifically, the 'haswell'
kernel set contains microkernels that preload elements of B. Those
microkernels would need to be restructured to avoid preloading in
order to support usage when k = 0.

commit 92d2b12a44ee0990c22735472aeaf1c17deb2d9b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 2 13:02:00 2020 -0600

Fixed obscure testsuite gemmt dependency bug.

Details:
- Fixed a bug in the gemmt testsuite module that only manifested when
testing of gemmt is enabled but testing of gemv is disabled. The bug
was due to a copy-paste error dating back to the introduction of gemmt
in 88ad841.

commit b43dae9a5d2f078c9bbe07079031d6c00a68b7de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 1 16:44:38 2020 -0600

Fixed copy-paste bugs in edge-case sup kernels.

Details:
- Fixed bugs in two sup kernels, bli_dgemmsup_rv_haswell_asm_1x6() and
bli_dgemmsup_rd_haswell_asm_1x4(), which involved extraneous assembly
instructions that were left over from when the kernels were first
written. These instructions would cause segmentation faults in some
situations where extra memory was not allocated beyond the end of
the matrix buffers. Thanks to Kiran Varaganti for reporting these
bugs and to Bhaskar Nallani for identifying the cause and solution.

commit 11dfc176a3c422729f453f6c23204cf023e9954d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 1 19:51:27 2020 +0000

Reorganized thread auto-factorization logic.

Details:
- Reorganized logic of bli_thread_partition_2x2() so that the primary
guts were factored out into "fast" and "slow" variants. Then added
logic to the "fast" variant that allows for more optimal thread
factorizations in some situations where there is at least one factor
of 2.
- Changed BLIS_THREAD_RATIO_M from 2 to 1 in bli_kernel_macro_defs.h and
added comments to that file describing BLIS_THREAD_RATIO_? and
BLIS_THREAD_MAX_?R.
- In bli_family_zen.h and bli_family_zen2.h, preprocessed out several
macros not used in vanilla BLIS and removed the unused macro
BLIS_ENABLE_ZEN_BLOCK_SIZES from the former file.
- Disabled AMD's small matrix handling entry points in bli_syrk_front.c
and bli_trsm_front.c. (These branches of small matrix handling have
not been reviewed by vanilla BLIS developers.)
- Added commented-out calls to printf() in bli_rntm.c.
- Whitespace changes to bli_thread.c.

commit 6d3bafacd7aa7ad198762b39490876c172bfbbcb
Author: Devin Matthews <damatthewssmu.edu>
Date: Sat Nov 28 17:17:56 2020 -0600

Update BuildSystem.md

Add git version >= 1.8.5 requirement (see 462).

commit 64856ea5a61b01d585750815788b6a775f729647
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 23 16:54:51 2020 -0600

Auto-reduce (by default) prime numbers of threads.

Details:
- When requesting multithreaded parallelism by specifying the total
number of threads (whether it be via environment variable, globally at
runtime, or locally at runtime), reduce the number of threads actually
used by one if the original value (a) is prime and (b) exceeds a
minimum threshold defined by the macro BLIS_NT_MAX_PRIME, which is set
to 11 by default. If, when specifying the total number of threads (and
not the individual ways of parallelism for each loop), prime numbers
of threads are desired, this feature may be overridden by defining the
BLIS_ENABLE_AUTO_PRIME_NUM_THREADS macro in the bli_family_*.h that
corresponds to the configuration family targeted at configure-time.
(For now, there are no configure options to control this feature.)
Thanks to Jeff Diamond for suggesting this change.
- Defined a new function in bli_thread.c, bli_is_prime(), that returns a
bool that determines whether an integer is prime. This function is
implemented in terms of existing functions in bli_thread.c.
- Updated docs/Multithreading.md to document the above feature, along
with unrelated minor edits.
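
The reduction rule can be sketched as below. This is an assumption-laden illustration: MY_NT_MAX_PRIME mirrors the default of 11 quoted above, my_is_prime() uses plain trial division rather than the existing bli_thread.c helpers the commit mentions, and my_adjust_num_threads() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

#define MY_NT_MAX_PRIME 11   /* mirrors the default threshold quoted above */

/* Trial division; adequate for the small values thread counts take. */
bool my_is_prime(int n)
{
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

/* Drop one thread when the request is a prime above the threshold, so
   that the count can be factored across more than one loop. */
int my_adjust_num_threads(int nt)
{
    assert(nt >= 1);
    if (nt > MY_NT_MAX_PRIME && my_is_prime(nt)) return nt - 1;
    return nt;
}
```

For example, a request for 13 threads becomes 12, which factors as 4 x 3 or 6 x 2, whereas 13 could only be split as 13 x 1.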

commit 55933b6ff6b9b8a12041715f42bba06273d84b74
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 20 10:39:32 2020 -0600

Added missing attribution to docs/ReleaseNotes.md.

commit e310f57b4b29fbfee479e0f9fe2040851efdec4f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 19 13:33:37 2020 -0600

CHANGELOG update (0.8.0)

0.8.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 19 13:33:37 2020 -0600

Version file update (0.8.0)

commit 2928ec750d3a3e1e5d55de5b57ddc04e9d0bd796
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 18 18:31:35 2020 -0600

ReleaseNotes.md update in advance of next version.

Details:
- Updated docs/ReleaseNotes.md in preparation for next version.

commit b9899bedff6854639468daa7a973bb14ca131a74
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 18 16:52:41 2020 -0600

CREDITS file update.

commit 9bb23e6c2a44b77292a72093938ab1ee6e6cc26a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 16 15:55:45 2020 -0600

Added support for systemless build (no pthreads).

Details:
- Added a configure option, --[enable|disable]-system, which determines
whether the modest operating system dependencies in BLIS are included.
The most notable example of this on Linux and BSD/OSX is the use of
POSIX threads to ensure thread safety for when application-level
threads call BLIS. When --disable-system is given, the bli_pthreads
implementation is dummied out entirely, allowing the calling code
within BLIS to remain unchanged. Why would anyone want to build BLIS
like this? The motivating example was submitted via 454 in which a
user wanted to build BLIS for a simulator such as gem5 where thread
safety may not be a concern (and where the operating system is largely
absent anyway). Thanks to Stepan Nassyr for suggesting this feature.
- Another, more minor side effect of the --disable-system option is that
the implementation of bli_clock() unconditionally returns 0.0 instead
of the time elapsed since some fixed point in the past. The reasoning
for this is that if the operating system is truly minimal, the system
function call upon which bli_clock() would normally be implemented
(e.g. clock_gettime()) may not be available.
- Refactored preprocess-guarded code in bli_pthread.c and bli_pthread.h
to remove redundancies.
- Removed old comments and commented include of "bli_pthread_wrap.h"
from bli_system.h.
- Documented bli_clock() and bli_clock_min_diff() in BLISObjectAPI.md
and BLISTypedAPI.md, with a note that both are non-functional when
BLIS is configured with --disable-system.

commit 88ad84143414644df4c56733b1cf91a36bfacaf8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 14 09:39:48 2020 -0600

Squash-merge 'pr' into 'squash'. (457)

Merged contributions from AMD's AOCL BLIS (448).

Details:
- Added support for level-3 operation gemmt, which performs a gemm on
only the lower or upper triangle of a square matrix C. For now, only
the conventional/large code path will be supported (in vanilla BLIS).
This was accomplished by leveraging the existing variant logic for
herk. However, some of the infrastructure to support a gemmtsup is
included in this commit, including
  - A bli_gemmtsup() front-end, similar to bli_gemmsup().
  - A bli_gemmtsup_ref() reference handler function.
  - A bli_gemmtsup_int() variant chooser function (with variant calls
    commented out).
- Added support for inducing complex domain gemmt via the 1m method.
- Added gemmt APIs to the BLAS and CBLAS compatibility layers.
- Added gemmt test module to testsuite.
- Added standalone gemmt test driver to 'test' directory.
- Documented gemmt APIs in BLISObjectAPI.md and BLISTypedAPI.md.
- Added a C++ template header (blis.hh) containing a BLAS-inspired
wrapper to a set of polymorphic CBLAS-like function wrappers defined
in another header (cblas.hh). These two headers are installed if
running the 'install' target with INSTALL_HH is set to 'yes'. (Also
added a set of unit tests that exercise blis.hh, although they are
disabled for now because they aren't compatible with out-of-tree
builds.) These files now live in the 'vendor' top-level directory.
- Various updates to 'zen' and 'zen2' subconfigurations, particularly
within the context initialization functions.
- Added s and d copyv, setv, and swapv kernels to kernels/zen/1, and
various minor updates to dotv and scalv kernels. Also added various
sup kernels contributed by AMD to kernels/zen/3. However, these
kernels are (for now) not yet used, in part because they caused
AppVeyor clang failures, and also because I have not found time to
review and vet them.
- Output the python found during configure into the definition of PYTHON
in build/config.mk (via build/config.mk.in).
- Added early-return checks (A, B, or C with zero dimension; alpha = 0)
to bli_gemm_front.c.
- Implemented explicit beta = 0 handling for the sgemm ukernel in
bli_gemm_armv7a_int_d4x4.c, which was previously missing. This latent
bug surfaced because the gemmt module verifies its computation using
gemm with its beta parameter set to zero, which, on a cortexa15 system
caused the gemm kernel code to unconditionally multiply the
uninitialized C data by beta. The C matrix likely contained
non-numeric values such as NaN, which then would have resulted in a
false failure.
- Fixed a bug whereby the implementation for bli_herk_determine_kc(),
in bli_l3_blocksize.c, was inadvertently being defined in terms of
helper functions meant for trmm. This bug was probably harmless since
the trmm code should have also done the right thing for herk.
- Used cpp macros to neutralize the various AOCL_DTL_TRACE_ macros in
kernels/zen/3/bli_gemm_small.c since those macros are not used in
vanilla BLIS.
- Added cpp guard to definition of bli_mem_clear() in bli_mem.h to
accommodate C++'s stricter type checking.
- Added cpp guard to test/*.c drivers that facilitates compilation on
Windows systems.
- Various whitespace changes.
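
The explicit beta = 0 handling mentioned in the list above matters because 0 * NaN is NaN under IEEE 754. The following is a scalar sketch of the two behaviors, not the armv7a kernel itself; update_naive() and update_beta_aware() are hypothetical names, and ab stands for the already-scaled alpha*A*B contribution.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Naive update: always multiplies the existing contents of C by beta.
   If C holds NaN (e.g. uninitialized memory), the result stays NaN
   even when beta is 0. */
void update_naive(double* c, size_t n, double beta, const double* ab)
{
    assert(c != NULL && ab != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = beta * c[i] + ab[i];
}

/* Explicit beta == 0 branch: C is overwritten and never read. */
void update_beta_aware(double* c, size_t n, double beta, const double* ab)
{
    assert(c != NULL && ab != NULL);
    if (beta == 0.0)
    {
        for (size_t i = 0; i < n; ++i)
            c[i] = ab[i];
    }
    else
    {
        for (size_t i = 0; i < n; ++i)
            c[i] = beta * c[i] + ab[i];
    }
}
```

With C pre-filled with NaN and beta = 0, the naive version leaves NaN in the output, which is the kind of false failure the commit describes; the beta-aware version produces the expected product.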

commit 234b8b0cf48f1ee965bd7999b291fc7add3b9a54
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 19:11:16 2020 -0600

Increased dotxaxpyf testsuite thresholds.

Details:
- Increased the test thresholds used by the dotxaxpyf testsuite module
by a factor of five in order to avoid residuals that unnecessarily
fall in the MARGINAL range. This commit should fix 455. Thanks to
nagsingh for reporting this issue.

commit ed612dd82c50063cfd23576a6b2465213d31b14b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 7 13:09:42 2020 -0600

Updated README.md with sgemmsup blurb.

Details:
- Added an entry to the "What's New" section of the README.md to
announce the availability of sgemmsup.

commit e14424f55b15d67e8d18384aea45a11b9b772e02
Merge: 0cfe1aac eccdd75a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 7 13:02:50 2020 -0600

Merge branch 'dev'

commit 0cfe1aac222008a78dff3ee03ef5183413936706
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 30 17:10:36 2020 -0500

Relocated operation index to ToC in API docs.

Details:
- Moved the "Operation index" section of both the BLISObjectAPI.md and
BLISTypedAPI.md docs to appear immediately after the table of contents
of each document. This allows the reader to quickly jump to the
documentation for any operation without having to scroll through much
of the document (when rendered via a web browser).
- Fixed a mistake in the BLISObjectAPI.md for the setd operation, which
does *not* observe the diag property of its matrix argument. Thanks to
Jeff Diamond for reporting this.

commit 2a0682f8e5998be536da313525292f0da6193147
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 18 18:04:03 2020 -0500

Implemented runtime subconfig selection (451).

Details:
- Implemented support for the user manually overriding the automatic
subconfiguration selection that happens at runtime. This override
can be requested by setting the BLIS_ARCH_TYPE environment variable.
The variable must be set to the arch_t id (as enumerated in
bli_type_defs.h) corresponding to the desired subconfiguration. If a
value outside this enumerated range is given, BLIS will abort with an
error message. If the value is in the valid range but corresponds to a
subconfiguration that was not activated at configure-time/compile-time,
BLIS will abort with a (different) error message. Thanks to decandia50
for suggesting this feature via issue 451.
- Defined a new function bli_gks_lookup_id to return the address of an
internal data structure within the gks. If this address is NULL, then
it indicates that the subconfig corresponding to the arch_t id passed
into the function was not compiled into BLIS. This function is used
in the second of the two abort scenarios described above.
- Defined the enumerated error code BLIS_UNINITIALIZED_GKS_CNTX, which
is returned for the latter of the two abort scenarios mentioned above,
along with a corresponding error message and a function to perform
the error check.
- Added cpp macro branching to bli_env.c to support compilation of the
auto-detect.x executable during configure-time. This cpp branch is
similar to the cpp code already found in bli_arch.c and bli_cpuid.c.
- Cleaned up the auto_detect() function to facilitate easier maintenance
going forward. Also added a convenient debug switch that outputs the
compilation command for the auto-detect.x executable and exits.
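
The two abort scenarios described above can be sketched as a two-step validation. Everything in this sketch is hypothetical (the enum values, the table, the status codes, and the function names); the real ids are enumerated in bli_type_defs.h, and the real library aborts with an error message rather than returning a status. Parsing the variable with strtol() is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical ids standing in for the arch_t enumeration. */
typedef enum { ARCH_A = 0, ARCH_B, ARCH_C, NUM_ARCHS } demo_arch_t;

typedef enum {
    SELECT_OK = 0,
    SELECT_OUT_OF_RANGE,  /* id lies outside the enumerated range     */
    SELECT_NOT_COMPILED   /* id is valid, but its subconfig is absent */
} select_status_t;

/* Stand-in for the per-subconfig table: NULL means "not compiled in". */
static const char *demo_table[NUM_ARCHS] = { "cntx-a", NULL, "cntx-c" };

static const char *demo_lookup_id(demo_arch_t id)
{
    return demo_table[id];
}

/* Range check first, then the "was it compiled in?" check. */
static select_status_t select_arch(long requested, demo_arch_t *out)
{
    assert(out != NULL);
    if (requested < 0 || requested >= NUM_ARCHS)
        return SELECT_OUT_OF_RANGE;
    if (demo_lookup_id((demo_arch_t)requested) == NULL)
        return SELECT_NOT_COMPILED;
    *out = (demo_arch_t)requested;
    return SELECT_OK;
}

/* Reads an integer override from the environment. Returns -1 when the
   variable is unset, in which case a caller would keep auto-detection. */
static long read_arch_override(const char *name)
{
    const char *s = getenv(name);
    return (s == NULL) ? -1 : strtol(s, NULL, 10);
}
```

A caller would pass the result of read_arch_override("BLIS_ARCH_TYPE") to select_arch() only when the variable is set.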

commit eccdd75a2d8a0c46e91e94036179c49aa5fa601c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:44:16 2020 -0500

Whitespace tweak in docs/PerformanceSmall.md.

commit 7677e9ba60ac27496e3421c2acc7c239e3f860e9
Merge: addcd46b a0849d39
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:41:25 2020 -0500

Merge branch 'dev' of github.com:flame/blis into dev

commit addcd46b0559d401aa7d33d4c7e6f63f5313a8e0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 15:41:09 2020 -0500

Added Epyc 7742 Zen2 ("Rome") sup perf results.

Details:
- Added single-threaded and multithreaded sup performance results to
docs/PerformanceSmall.md for both sgemm and dgemm. These results were
gathered on an Epyc 7742 "Rome" server featuring AMD's Zen2
microarchitecture. Special thanks to Jeff Diamond for facilitating
access to the system via the Oracle Cloud.
- Updates to octave scripts in test/sup/octave for use with Octave 5.2
and for use with subplot_tight().
- Minor updates to octave scripts in test/3/octave.
- Renamed files containing the previous Zen performance results for
consistency with the new results.
- Decreased line thickness slightly in large/conventional Zen2 graphs.
I'm done tweaking those this time. Really.
- Added missing line regarding eigen header installation for each
microarchitecture section.

commit a0849d390d04067b82af937cda8191b049b98915
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 9 20:22:17 2020 +0000

Register l3 sup kernels in zen2 subconfig.

Details:
- Registered full suite of sgemm and dgemm sup millikernels, blocksizes,
and crossover thresholds in bli_cntx_init_zen2.c.
- Minor updates to test/sup/runme.sh for running on Zen2 Epyc 7742
system.

commit d98368c32d5fbfaab8966ee331d9bcb5c4fe7a59
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 19:05:51 2020 -0500

Another tweak to line thickness of Zen2 graphs.

commit 1855dfbdaafa37892b36c97fd317fd5d8da76676
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 19:01:00 2020 -0500

Tweaked line thickness in Zen2 graphs once more.

Details:
- Decreased (relative to previous commit) line thickness in recent Zen2
graphs.

commit 0991611e7ed82889c53a5c3f1ef1d49552c50d61
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 8 18:54:49 2020 -0500

Increased line thickness in recent Zen2 graphs.

Details:
- Increased the width of the lines in the graphs introduced in 74ec6b8.

commit 8273cbacd7799e9af59e5320d66055f2f5d9cb31
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 7 14:51:33 2020 -0500

README.md, docs/FAQ.md updates.

Details:
- Added a frequently asked question to docs/FAQ.md regarding the
difference between upstream (vanilla) BLIS and AMD BLIS.
- Updated the name of ICES in the README.md to reflect the Oden
rebranding.

commit a178a822ad3d5021489a0e61f909d8550ae12a8f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 16:00:52 2020 -0500

Added Zen2 links to docs/Performance.md Contents.

commit 74ec6b8f457cabe37d2382aaab35ba04fc737948
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 15:54:18 2020 -0500

Added Epyc 7742 Zen2 ("Rome") performance results.

Details:
- Added single-threaded and multithreaded performance results to
docs/Performance.md. These results were gathered on an Epyc 7742
"Rome" server with AMD's Zen2 microarchitecture. Special thanks
to Jeff Diamond for facilitating access to the system via the
Oracle Cloud.
- Renamed files containing the previous Zen performance results for
consistency with the new results.

commit bc4a213a2c3dcf8bbfcbb3a1ef3e9fc9e3226c34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 15:28:20 2020 -0500

Updated matlab (now octave) plot code in test/3.

Details:
- Renamed test/3/matlab to test/3/octave.
- Within test/3, updated and tuned plot_l3_perf.m and plot_panel_4x5.m
files for use with octave (which is free and doesn't crash on me
mid-way through my use of subplot).
- Updated runthese.m scratchpad for zen2 invocations.
- Added Nikolay S.'s subplot_tight() function, along with its license.

commit c77ddc418187e1884fa6bcfe570eee295b9cb8bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 30 20:15:43 2020 +0000

Added optional numactl usage to test/3/runme.sh.

commit 2d8ec164e7ae4f0c461c27309dc1f5d1966eb003
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Tue Sep 29 16:52:18 2020 -0500

Add POWER10 support to BLIS (450)

commit 4fd8d9fec2052257bf2a5c6e0d48ae619ff6c3e4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 28 23:39:05 2020 +0000

Tweaked zen2 subconfig's MC cache blocksizes.

Details:
- Updated the MC cache blocksizes registered by the 'zen2' subconfig.
- Minor updates to test/3/Makefile and test/3/runme.sh.

commit 5efcdeffd58af621476d179afc0c19c0f912baa8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 14:25:24 2020 -0500

More minor README.md updates.

commit 9e940f8aad6f065ea1689e791b9a4e1fb7900c40
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 13:53:35 2020 -0500

Added 1m SISC bibtex to README.md.

Details:
- Added final citation info to 1m bibtex in README.md file.
- Updated draft 1m paper link.
- Changed some http to https.

commit e293cae2d1b9067261f613f25eaa0e871356b317
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 15 16:09:11 2020 -0500

Implemented sgemmsup assembly kernels.

Details:
- Created a set of single-precision real millikernels and microkernels
comparable to the dgemmsup kernels that already exist within BLIS.
- Added prototypes for all kernels within bli_kernels_haswell.h.
- Registered entry-point millikernels in bli_cntx_init_haswell.c and
bli_cntx_init_zen.c.
- Added sgemmsup support to the Makefile, runme.sh script, and source
file in test/sup. This included edits that allow for separate "small"
dimensions for single- and double-precision as well as for single-
vs. multithreaded execution.

commit 2765c6f37c11cb7f71cd4b81c64cea6130636c68
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:48:15 2020 -0500

Type saga continues; fixed sgemm ukernel signature.

Details:
- Changed double* pointers in sgemm function signature to float*. At
this point I've lost track of whether this was my fault or another
dormant bug like the one described in ece9f6a, but at this point I
no longer care. It's one of those days (aka I didn't ask for this).

commit 0779559509e0a1af077530d09ed151dac54f32ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:37:21 2020 -0500

Fixed missing restrict in knl sgemm prototype.

Details:
- Added a missing 'restrict' qualifier in the sgemm ukernel prototype
for knl. (Not sure how that code was ever compiling before now.)

commit ece9f6a3ef1b26b53ecf968cd069df7a85b139fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:22:42 2020 -0500

Fixed dormant type bugs in bli_kernels_knl.h.

Details:
- Fixed dormant type mismatches in the use of the prototype-generating
macros in bli_kernels_knl.h. Specifically, some float prototypes
were incorrectly using double as their ctype. This didn't actually
matter until the type changes in 645d771, as previously those types
were not used since packm was prototyped with void* pointers.

commit 8ebb3b60e1c4c045ddb48e02de6e246cecde24a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 17:00:47 2020 -0500

Fixed accidental breakage in 645d771.

Details:
- In trying to clean up kappa_cast variables in the reference packm
kernels, which I initially believed to be redundant given the other
void* -> ctype* changes in 645d771, I accidentally ended up violating
restrict semantics for 1e/1r packing and possibly other packm kernels.
(Normally, my pre-commit testsuite run would have caught this, but I
was unknowingly using an edited input.operations file in which I'd
disabled most tests as part of unrelated work.) This commit reverts
the kappa_cast changes in 645d771.

commit 645d771a14ae89aa7131d6f8f4f4a8090329d05e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 12 15:31:56 2020 -0500

Minor packm kernel type cleanup (void* -> ctype*).

Details:
- Changed all void* function arguments in reference packm kernels to
those of the native type (ctype*). These pointers no longer need to
be void* and are better represented by their native types anyway.
(See below for details.) Updated knl packm kernels accordingly.
- In the definition of the PACKM_KER_PROT prototype macro template in
frame/1m/bli_l1m_ker_prot.h, changed the pointer types for kappa, a,
and p from void* to ctype*. They were originally void* because these
function signatures had to share the same type so they could all be
stored in a single array of that shared type, from which they were
queried and called by packm_cxk(). This is no longer how the function
pointers are stored, and so it no longer makes sense to force the
caller of packm kernels to use void*, only so that the implementor
of the packm kernels can typecast back to the native datatype within
the kernel definition. This change has no effect internally within
BLIS because currently all packm kernels are called after querying
the function addresses from the context and then typecasting to the
appropriate function pointer type, which is based upon type-specific
function pointers like float* and double*.
- Removed a comment in frame/1m/bli_l1m_ft_ker.h that was outdated and
misleading due to changes to the handling of packm kernels since
moving them into the context.
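
The void* -> ctype* change described above can be sketched with a toy scale-and-copy kernel. The typedef and function names are hypothetical and much simpler than the real packm kernel signatures; the sketch only contrasts a shared void* signature, which forces casts inside the kernel, with a type-specific one.

```c
#include <assert.h>
#include <stddef.h>

/* Old style: one shared signature, so that kernels for every datatype
   could be stored in a single array of function pointers. */
typedef void (*scal_copy_void_ft)(size_t n, void *kappa, void *a, void *p);

/* New style: a type-specific signature for the float kernel. */
typedef void (*scal_copy_float_ft)(size_t n, float *kappa, float *a, float *p);

/* With void* arguments the kernel must cast back to the native type. */
static void s_scal_copy_void(size_t n, void *kappa, void *a, void *p)
{
    float *kappa_cast = kappa;
    float *a_cast     = a;
    float *p_cast     = p;
    for (size_t i = 0; i < n; ++i) p_cast[i] = *kappa_cast * a_cast[i];
}

/* With native pointer types no casts are needed, and the compiler can
   check that the caller passes float data. */
static void s_scal_copy_typed(size_t n, float *kappa, float *a, float *p)
{
    for (size_t i = 0; i < n; ++i) p[i] = *kappa * a[i];
}
```

Both kernels compute the same result; the difference is only in how much type information the prototype carries.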

commit 54bf6c35542a297e25bc8efec6067a6df80536f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 10 15:42:01 2020 -0500

Minor README.md update.

Details:
- Added a new entry to the "What people are saying about BLIS" section.

commit e50b4d40462714ae33df284655a2faf7fa35f37c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 9 14:12:53 2020 -0500

Minor update to README.md (SIAM Best Paper Prize).

commit a8efb72074691e2610372108becd88b4b392299e
Merge: b0c4da17 97e87f2c
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Sep 7 16:18:19 2020 -0500

Merge pull request 434 from flame/intel-zdot

Add an option to change the complex return type.

commit 97e87f2c9f3878a05e1b7c6ec237ee88d9a72a42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 7 15:56:42 2020 -0500

Whitespace/comment updates to 434 PR.

commit b0c4da1732b6c6a9ff66f70c36e4722e0f9645ae
Merge: 810e90ee b1b5870d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Sep 7 15:47:54 2020 -0500

Merge pull request 436 from flame/s390x

Add checks so that s390x is detected as 64-bit.

commit 810e90ee806510c57504f0cf8eeaf608d38bd9dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 1 16:11:40 2020 -0500

Minor README.md update.

Details:
- Added HPE to list of funders.
- Changed http to https in funders' website links.

commit 7d411282196e036991c26e52cb5e5f85769c8059
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 13 17:50:58 2020 -0500

Use -O2 for all framework code. (435)

It seems that -O3 might be causing intermittent problems with the f2c'ed packed and banded code. -O3 is retained for kernel code. Fixes 341 and fixes 342.

commit 9c5b485d356367b0a1288761cd623f52036e7344
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Fri Aug 7 20:11:18 2020 +0000

Don't override -mcpu with -march on ARM (353)

* Use -mcpu for ARM
See the GCC doc about -march, -mtune, and -mcpu and maybe
https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu

* Fix typo in flags

* Fix typo in cortexa9 flags

* Modify cortexa53 compilation flags to fix failing BLAS check (341)

commit c253d14a72a746b670b3ffbb6e81bcafc73d1133
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 7 09:39:04 2020 -0500

Also handle Intel-style complex return in CBLAS interface.

commit 5d653a11a0cc71305d0995507b1733995856f475
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 17:58:26 2020 -0500

Update Multithreading.md

Addresses the issue raised in 426.

commit b1b5870dd3f9b1c78cf5f58a53514d73f001fc4c
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 17:34:20 2020 -0500

Add checks so that s390x is detected as 64-bit.

commit 882dcb11bfc9ea50aa2f9044621833efd90d42be
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:28:14 2020 -0500

Mention example code at top of documentation docs.

Details:
- Steer the reader towards the example code section of each
documentation doc (object and typed).
- Trivial update to examples/oapi/README, examples/tapi/README.

commit f4894512e5bf56ff83701c07dd02972e300741a5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:20:00 2020 -0500

Very minor updates to previous commit.

commit adedb893ae8dfacd1dc54035979e15c44d589dbb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 6 17:14:01 2020 -0500

Documented mutator functions in BLISObjectAPI.md.

Details:
- Added documentation for commonly-used object mutator functions in
BLISObjectAPI.md. Previously, only accessor functions were documented.
Thanks to Jeff Diamond for pointing out this omission.
- Explicitly set the 'diag' property of objects in oapi example modules
(08level2.c and 09level3.c).

commit 5b5278ff494888509543a79c09ea82089f6c95d9
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 14:19:37 2020 -0500

Use #ifdef instead of #if as macro may be undefined.

commit 7fdc0fc893d0c6727b725ea842053b65be2c20ba
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Aug 6 14:03:55 2020 -0500

Add an option to change the complex return type.

ifort apparently does not return complex numbers in registers as in C/C++ (or gfortran), but instead creates a "hidden" first parameter for the return value. The option --complex-return=gnu|intel has been added, as well as a guess based on a provided FC if not specified (otherwise default to gnu). This option affects the signatures of cdotc, cdotu, zdotc, and zdotu, and a single library cannot be used with both GNU and Intel Fortran compilers. Fixes 433.

commit 6e522e5823b762d4be09b6acdca30faafba56758
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 30 19:31:37 2020 -0500

Mention disabling of sup in docs/Sandboxes.md.

Details:
- Added language to remind the reader to disable sup if the intended
behavior is for the sandbox implementation to handle all problem
sizes, even the smaller ones that would normally be handled by the
sup code path.

commit 00e14cb6d849e963a2e1ac35e7dbbe186af00a58
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 29 14:24:34 2020 -0500

Replaced use of bool_t type with C99 bool.

Details:
- Textually replaced nearly all non-comment instances of bool_t with the
C99 bool type. A few remaining instances, such as those in the files
bli_herk_x_ker_var2.c, bli_trmm_xx_ker_var2.c, and
bli_trsm_xx_ker_var2.c, were promoted to dim_t since they were being
used not for boolean purposes but to index into an array.
- This commit constitutes the third phase of a transition toward using
C99's bool instead of bool_t, which was raised in issue 420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7. The second phase, which redefined the bool_t typedef
in terms of bool (from gint_t), was implemented by commit 2c554c2.

commit 2c554c2fce885f965a425e727a0314d3ba66c06d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 24 15:57:19 2020 -0500

Redefined bool_t typedef in terms of C99 bool.

Details:
- Changed the typedef that defines bool_t from:

typedef gint_t bool_t;

where gint_t is a signed integer that forms the basis of most other
integers in BLIS, to:

typedef bool bool_t;

- Changed BLIS's TRUE and FALSE macro definitions from being in terms of
integer literals:

#define TRUE 1
#define FALSE 0

to being in terms of C99 boolean constants:

#define TRUE true
#define FALSE false

which are provided by stdbool.h.
- This commit constitutes the second phase of a transition toward using
C99's bool instead of bool_t, which will address issue 420. The first
phase, which cleaned up various typecasts in preparation for using
bool as the basis for bool_t (instead of gint_t), was implemented by
commit a69a4d7.
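
One observable consequence of basing a boolean type on C99 bool rather than on an integer is sketched below. The typedef name int_bool_t is hypothetical (it merely resembles the old integer-based definition): a cast to an integer type preserves the value, while conversion to bool yields 0 or 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical integer-based boolean, in the spirit of the old typedef. */
typedef int64_t int_bool_t;

/* An integer cast keeps the original value (4 stays 4). */
static int_bool_t as_int_bool(int64_t v)
{
    return (int_bool_t)v;
}

/* Conversion to C99 bool normalizes any non-zero value to 1. */
static bool as_c99_bool(int64_t v)
{
    return (bool)v;
}
```

With the integer-based type, a comparison such as flag == 1 can fail for a non-zero flag; with bool it cannot.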

commit e01dd125581cec87f61e15590922de0dc938ec42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 24 15:41:46 2020 -0500

Fail-safe updates to Makefiles in 'test' dir.

Details:
- Updated Makefiles in test, test/3, and test/sup so that running any of
the usual targets without having first built BLIS results in a helpful
error message. For example, if BLIS is not yet configured, make will
output:

Makefile:327: *** Cannot proceed: config.mk not detected! Run
configure first. Stop.

Similarly, if BLIS is configured but not yet built, make will output:

Makefile:340: *** Cannot proceed: BLIS library not yet built! Run
make first. Stop.

In previous commits, these actions would result in a rather cryptic
make error such as:

make: *** No rule to make target 'test_sgemm_2400_asm_blis_st.x',
needed by 'blis-nat-st'. Stop.

commit b4f47f7540062da3463e2cb91083c12fdda0d30a
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Jul 24 13:56:13 2020 -0500

Add BLIS_EXPORT_BLIS to bli_abort. (429)

Fixes 428.

commit a69a4d7e2f4607c919db30b14535234ce169c789
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 22 16:13:09 2020 -0500

Cleaned up bool_t usage and various typecasts.

Details:
- Fixed various typecasts in

frame/base/bli_cntx.h
frame/base/bli_mbool.h
frame/base/bli_rntm.h
frame/include/bli_misc_macro_defs.h
frame/include/bli_obj_macro_defs.h
frame/include/bli_param_macro_defs.h

that were missing or being done improperly/incompletely. For example,
many return values were being typecast as
(bool_t)x && y
rather than
(bool_t)(x && y)
Thankfully, none of these deficiencies had manifested as actual bugs
at the time of this commit.
- Changed the return type of bli_env_get_var() from dim_t to gint_t.
This reflects the fact that bli_env_get_var() needs to be able to
return a signed integer, and even though dim_t is currently defined
as a signed integer, it does not intuitively appear to necessarily be
signed by inspection (i.e., an integer named "dim_t" for matrix
"dimension"). Also, updated use of bli_env_get_var() within
bli_pack.c to reflect the changed return type.
- Redefined type of thrcomm_t.barrier_sense field from bool_t to gint_t
and added comments to the bli_thrcomm_*.h files that will explain a
planned replacement of bool_t with C99's bool type.
- Note: These changes are being made to facilitate the substitution of
'bool' for 'bool_t', which will eliminate the namespace conflict with
arm_sve.h as reported in issue 420. This commit implements the first
phase of that transition. Thanks to RuQing Xu for reporting this
issue.
- CREDITS file update.
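
The typecast item in the list above, (bool_t)x && y versus (bool_t)(x && y), comes down to operator precedence: the cast binds more tightly than &&. The sketch below uses a hypothetical 8-bit type so that the difference becomes observable; as the commit notes, no actual bug had manifested with the real typedef.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical narrow boolean type, chosen only so that truncation
   is visible. It is not the type BLIS used. */
typedef uint8_t narrow_bool_t;

/* The cast applies to x alone: x is truncated first, and the &&
   expression itself has type int. */
static int cast_then_and(int x, int y)
{
    return (narrow_bool_t)x && y;
}

/* The cast applies to the whole logical expression, as intended. */
static narrow_bool_t and_then_cast(int x, int y)
{
    return (narrow_bool_t)(x && y);
}
```

For x = 256 the first form truncates x to 0 before the && is evaluated, while the second form evaluates 256 && y on the full-width value.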

commit a6437a5c11d364c6c88af527294d29734d7cc7d6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 20 19:21:07 2020 -0500

Replaced broken ref99 sandbox w/ simpler version.

Details:
- The 'ref99' sandbox was broken by multiple refactorings and internal
API changes over the last two years. Rather than try to fix it, I've
replaced it with a much simpler version based on var2 of gemmsup.
Why not fix the previous implementation? It occurred to me that the
old implementation was trying to be a lightly simplified duplication
of what exists in the framework. Duplication aside, this sandbox
would have worked fine if it had been completely independent of the
framework code. The problem was that it was only partially
independent, with many function calls calling a function in BLIS
rather than a duplicated/simplified version within the sandbox. (And
the reason I didn't make it fully independent to begin with was that
it seemed unnecessarily duplicative at the time.) Maintaining two
versions of the same implementation is problematic for obvious
reasons, especially when it wasn't even done properly to begin with.
This explains the reimplementation in this commit. The only catch is
that the newer implementation is single-threaded only and does not
perform any packing on either input matrix (A or B). Basically, it's
only meant to be a simple placeholder that shows how you could plug
in your own implementation. Thanks to Francisco Igual for reporting
this brokenness.
- Updated the three reference gemmsup kernels (defined in
ref_kernels/3/bli_gemmsup_ref.c) so that they properly handle
conjugation of conja and/or conjb. The general storage kernel, which
is currently identical to the column-storage kernel, is used in the
new ref99 sandbox to provide basic support for all datatypes
(including scomplex and dcomplex).
- Minor updates to docs/Sandboxes.md, including adding the threading
and packing limitations to the Caveats section.
- Fixed a comment typo in bli_l3_sup_var1n2m.c (upon which the new
sandbox implementation is based).

commit bca040be9da542dd9c75d91890fa7731841d733d
Merge: 2605eb4d 171ecc1d
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Jul 20 09:27:30 2020 -0500

Merge pull request 425 from gmargari/patch-1

Update Multithreading.md

commit 171ecc1dc6f055ea39da30e508f711b49a734359
Author: Giorgos Margaritis <gmargariprotonmail.com>
Date: Mon Jul 20 12:24:06 2020 +0300

Update Multithreading.md

commit 2605eb4d99d3813c37a624c011aa2459324a6d89
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 15 15:25:19 2020 -0500

Added missing rv_d?x6 edge cases to sup kernel.

Details:
- Added support to bli_gemmsup_rv_haswell_asm_d6x8n.c for handling
various n = 6 edge cases with a single sup kernel call. Previously,
only n = {4,2,1} were handled explicitly as single kernel calls;
that is, cases where n = 6 were previously being executed via two
kernel calls (n = 4 and n = 2).
- Added commented debug line to testsuite's test_libblis.c.
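
The effect of adding an explicit n = 6 case can be sketched as a decomposition of the leftover columns into kernel calls. The greedy scheme and the function name below are assumptions made for illustration; the real kernel dispatches to hand-written assembly cases rather than running a loop like this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: splits n_left edge columns into calls, always
   taking the widest supported width that still fits. The widths must be
   listed in descending order. Returns the number of calls written. */
static size_t split_edge(size_t n_left, const size_t *widths, size_t n_widths,
                         size_t *calls, size_t cap)
{
    size_t count = 0;
    assert(widths != NULL && calls != NULL);
    for (size_t w = 0; w < n_widths; ++w) {
        assert(widths[w] > 0);
        while (n_left >= widths[w] && count < cap) {
            calls[count++] = widths[w];
            n_left -= widths[w];
        }
    }
    return count;
}
```

With supported widths {4, 2, 1}, an edge of 6 becomes two calls (4 and 2); adding 6 to the list turns it into a single call.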

commit 72f6ed0637dfcb021de04ac7d214d5c87e55d799
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 3 17:55:54 2020 -0500

Declare/define static functions via BLIS_INLINE.

Details:
- Updated all static function definitions to use the cpp macro
BLIS_INLINE instead of the static keyword. This allows blis.h to
use a different keyword (inline) to define these functions when
compiling with C++, which might otherwise trigger "defined but
not used" warning messages. Thanks to Giorgos Margaritis for
reporting this issue and Devin Matthews for suggesting the fix.
- Updated the following files, which are used by configure's
hardware auto-detection facility, to unconditionally define
BLIS_INLINE to the static keyword (since we know BLIS will be
compiled with C, not C++):
build/detect/config/config_detect.c
frame/base/bli_arch.c
frame/base/bli_cpuid.c
- CREDITS file update.
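
The idea behind the macro can be sketched as below. DEMO_INLINE is a hypothetical stand-in, and the actual definition of BLIS_INLINE in the source tree may differ in detail; the sketch only shows one macro selecting a different keyword depending on whether the header is compiled as C or as C++.

```c
#include <assert.h>

/* Hypothetical stand-in for the macro described above. */
#ifdef __cplusplus
  #define DEMO_INLINE inline
#else
  #define DEMO_INLINE static
#endif

/* A small header-style helper defined through the macro. */
DEMO_INLINE int demo_min(int a, int b)
{
    return (a < b) ? a : b;
}
```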

commit 5fc701ac5f94c6300febbb2f24e731aa34f0f34a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 1 15:48:58 2020 -0500

Added -fomit-frame-pointer option to CKOPTFLAGS.

Details:
- Added the -fomit-frame-pointer compiler option to the CKOPTFLAGS
variable in the following make_defs.mk files:
config/haswell/make_defs.mk
config/skx/make_defs.mk
as well as comments that mention why the compiler option is needed.
This option is needed to prevent the compiler from using the rbp
frame register (in the very early portion of kernel code, typically
where k_iter and k_left are defined and computed), which, as of
1c719c9, is used explicitly by the gemmsup millikernels. Thanks to
Devin Matthews for identifying this missing option and to Jeff
Diamond for reporting the original bug in 417.
- The file
config/zen/amd_config.mk
which feeds into the make_defs.mk for both zen and zen2 subconfigs,
was also touched, but only to add a commented-out compiler option
(and the aforementioned explanatory comment) since that file already
uses -fomit-frame-pointer in COPTFLAGS, which forms the basis of
CKOPTFLAGS.

commit 6af59b705782dada47e45df6634b479fe781d4fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 1 14:54:23 2020 -0500

Fixed disabled edge case optimization in gemmsup.

Details:
- Fixed an inadvertently disabled edge case optimization in the two
gemmsup variants in bli_l3_sup_var1n2m.c. Background: These edge case
optimizations allow the last millikernel operation in the jr loop to
be executed with an inflated register blocksize if it is the last
(or only) iteration. For example, suppose mr=6 and nr=8 and the gemmsup
problem is m=8, n=100, k=100. (In this case, the panel-block variant
(var1n) is executed, which places the jr loop in the m dimension.)
In principle, this problem could be executed as two millikernels: one
with dimensions 6x100x100, and one as 2x100x100. However, with the
support for inflated blocksizes in the kernel, the entire 8x100x100
problem can be passed to the millikernel function, which will then
execute it more favorably as two 4x100x100 millikernel sub-calls.
Now, this optimization is disabled under certain circumstances, such
as when multithreading. Previously, the is_mt predicate was being set
incorrectly such that it was non-zero even when running
single-threaded.
- Upon fixing the is_mt issue above, another bit of code needed to be
moved so that the result of the optimization could have an impact on
the assignment of loop bounds ranges to threads.
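
The m=8, mr=6 example above can be sketched as a plan of jr-loop iterations. The function name and the merge rule (fold a partial edge into the final full iteration whenever inflation is allowed) are assumptions for illustration only; the real variants also account for threading, which is why allow_inflate is a parameter here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: splits m rows into iterations of mr rows each.
   If allow_inflate is non-zero and a partial edge would follow the last
   full iteration, the two are merged into one inflated iteration.
   Returns the number of iteration sizes written to sizes[]. */
static size_t plan_iterations(size_t m, size_t mr, int allow_inflate,
                              size_t *sizes, size_t cap)
{
    assert(mr > 0 && sizes != NULL);
    size_t n_full  = m / mr;
    size_t left    = m % mr;
    size_t count   = 0;
    int    merge   = allow_inflate && n_full > 0 && left > 0;
    size_t n_plain = merge ? n_full - 1 : n_full;

    assert(cap >= n_full + 1);
    for (size_t i = 0; i < n_plain; ++i) sizes[count++] = mr;
    if (merge)         sizes[count++] = mr + left;  /* inflated last call */
    else if (left > 0) sizes[count++] = left;       /* separate edge call */
    return count;
}
```

For m=8 and mr=6 the plan is {6, 2} without inflation and {8} with it. In the latter case the millikernel decides internally how to execute the 8 rows, for example as the two 4-row sub-calls mentioned above.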

commit b37634540fab0f9b8d4751b8356ee2e17c9e3b00
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 25 16:05:12 2020 -0500

Support ldims, packing in sup/test drivers.

Details:
- Updated the test/sup source file (test_gemm.c) and Makefile to support
building matrices with small or large leading dimensions, and updated
runme.sh to support executing both kinds of test drivers.
- Updated runme.sh to allow for executing sup drivers with unpacked (the
default) or packed matrices (via setting BLIS_PACK_A, BLIS_PACK_B
environment variables), and for capturing output to files that encode
both the leading dimension (small or large) and packing status into
the filenames.
- Consolidated octave scripts in test/sup/octave_st, test/sup/octave_mt
into test/sup/octave and updated the octave code in that consolidated
directory to read the new output filename format (encoding ldim and
packing). Also added comments and streamlined code, particularly in
plot_panel_trxsh.m. Tested the octave scripts with octave 5.2.0.
- Moved old octave_st, octave_mt directories to test/sup/old.

commit ceb9b95a96cc3844ecb43d9af48ab289584e76b6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 18 17:15:25 2020 -0500

Fixed incorrect link to shiftd in BLISTypedAPI.md.

Details:
- Previously, the entry for shiftd in the Operation index section of
BLISTypedAPI.md was incorrectly linking to the shiftd operation entry
in BLISObjectAPI.md. This has been fixed. Thanks to Jeff Diamond for
helping find this incorrect link.

commit b3c42016818797f79e55b32c8b7d090f9d0aa0ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 18 14:00:56 2020 -0500

CREDITS file update.

commit 31af73c11abae03248d959da0f81eacea015b57a
Author: Isuru Fernando <isurufgmail.com>
Date: Thu Jun 18 13:35:54 2020 -0500

Expand windows instructions (414)

* Expand windows instructions

* Windows: both static and shared don't work at the same time

commit b5b604e106076028279e6d94dc0e51b8ad48e802
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 17 16:42:24 2020 -0500

Ensure random objects' 1-norms are non-zero.

Details:
- Fixed an innocuous bug that manifested when running the testsuite on
extremely small matrices with randomization via the "powers of 2 in
narrow precision range" option enabled. When the randomization
function emits a perfect 0.0 to fill a 1x1 matrix, the testsuite will
then compute 0.0/0.0 during the normalization process, which leads to
NaN residuals. The solution entails smarter implementations of randv,
randnv, randm, and randnm, each of which will compute the 1-norm of
the vector or matrix in question. If the object has a 1-norm of 0.0,
the object is re-randomized until the 1-norm is not 0.0. Thanks to
Kiran Varaganti for reporting this issue (413).
- Updated the implementation of randm_unb_var1() so that it loops over
a call to the randv_unb_var1() implementation directly rather than
calling it indirectly via randv(). This was done to avoid the overhead
of multiple calls to norm1v() when randomizing the rows/columns of a
matrix.
- Updated comments.
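
The re-randomization strategy described above can be sketched as follows. The names are hypothetical, and the random source is passed in as a callback so that the retry behavior can be exercised deterministically; the testsuite's real randv/randm implementations are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical random source: returns the next value, given some state. */
typedef double (*rand_source_ft)(void *state);

/* 1-norm of a vector: the sum of absolute values. */
static double norm1(size_t n, const double *x)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) sum += (x[i] < 0.0) ? -x[i] : x[i];
    return sum;
}

/* Fills x with random values and repeats until its 1-norm is non-zero,
   so that a later normalization never computes 0.0/0.0.
   Returns the number of passes that were needed. */
static unsigned randv_nonzero(size_t n, double *x,
                              rand_source_ft next, void *state)
{
    unsigned passes = 0;
    assert(x != NULL && next != NULL);
    if (n == 0) return 0;  /* an empty vector always has norm zero */
    do {
        for (size_t i = 0; i < n; ++i) x[i] = next(state);
        ++passes;
    } while (norm1(n, x) == 0.0);
    return passes;
}

/* Deterministic stand-in generator for demonstration: emits 0.0 on its
   first two calls and 0.5 afterwards. state points to a call counter. */
static double demo_source(void *state)
{
    int *calls = state;
    return ((*calls)++ < 2) ? 0.0 : 0.5;
}
```

With a 1x1 object and demo_source, the first two passes produce an all-zero object and are discarded; the third pass is kept.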

commit 35e38fb693e7cbf2f3d7e0505a63b2c05d3f158d
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Jun 16 10:59:41 2020 -0500

Fix typo in FAQ

commit 1c719c91a3ef0be29a918097652beef35647d4b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 4 17:21:08 2020 -0500

Bugfixes, cleanup of sup dgemm ukernels.

Details:
- Fixed a few not-really-bugs:
- Previously, the d6x8m kernels were still prefetching the next upanel
of A using MR*rs_a instead of ps_a (same for prefetching of next
upanel of B in d6x8n kernels using NR*cs_b instead of ps_b). Given
that the upanels might be packed, using ps_a or ps_b is the correct
way to compute the prefetch address.
- Fixed an obscure bug in the rd_d6x8m kernel that, by dumb luck,
executed as intended even though it was based on faulty pointer
management. Basically, in the rd_d6x8m kernel, the pointer for B
(stored in rdx) was loaded only once, outside of the jj loop, and in
the second iteration its new position was calculated by incrementing
rdx by the *absolute* offset (four columns), which happened to be the
same as the relative offset (also four columns) that was needed. It
worked only because that loop only executed twice. A similar issue
was fixed in the rd_d6x8n kernels.
- Various cleanups and additions, including:
- Factored out the loading of rs_c into rdi in rd_d6x8[mn] kernels so
that it is loaded only once outside of the loops rather than
multiple times inside the loops.
- Changed outer loop in rd kernels so that the jump/comparison and
loop bounds more closely mimic what you'd see in higher-level source
code. That is, something like:
for( i = 0; i < 6; i+=3 )
rather than something like:
for( i = 0; i <= 3; i+=3 )
- Switched row-based IO to use byte offsets instead of byte column
strides (e.g. via rsi register), which were known to be 8 anyway
since otherwise that conditional branch wouldn't have executed.
- Cleaned up and homogenized prefetching a bit.
- Updated the comments that show the before and after of the
in-register transpositions.
- Added comments to column-based IO cases to indicate which columns
are being accessed/updated.
- Added rbp register to clobber lists.
- Removed some dead (commented out) code.
- Fixed some copy-paste typos in comments in the rv_6x8n kernels.
- Cleaned up whitespace (including leading ws -> tabs).
- Moved edge case (non-milli) kernels to their own directory, d6x8,
and split them into separate files based on the "NR" value of the
kernels (Mx8, Mx4, Mx2, etc.).
- Moved config-specific reference Mx1 kernels into their own file
(e.g. bli_gemmsup_r_haswell_ref_dMx1.c) inside the d6x8 directory.
- Added rd_dMx1 assembly kernels, which seem marginally faster than
the corresponding reference kernels.
- Updated comments in ref_kernels/bli_cntx_ref.c and changed to using
the row-oriented reference kernels for all storage combos.
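
The rd_d6x8m pointer issue described above can be modeled in plain C. This is a simplified sketch under stated assumptions, not the assembly itself: the function names are hypothetical, positions are counted in columns, and "step" plays the role of the four-column offset.

```c
#include <stddef.h>

/* Robust pattern: the base position is reloaded on every iteration and
   the absolute offset (jj * step) is applied to that fresh base. */
static void demo_offsets_reload(size_t n_iter, size_t step, size_t *out)
{
    for (size_t jj = 0; jj < n_iter; ++jj) {
        size_t pos = 0;          /* base reloaded inside the loop */
        pos += jj * step;        /* absolute offset from the base */
        out[jj] = pos;
    }
}

/* Faulty pattern: the base is loaded once, outside the loop, and the
   absolute offset is added to the running position. */
static void demo_offsets_faulty(size_t n_iter, size_t step, size_t *out)
{
    size_t pos = 0;              /* loaded once, outside the loop */
    for (size_t jj = 0; jj < n_iter; ++jj) {
        pos += jj * step;        /* absolute offset added to a moving pointer */
        out[jj] = pos;
    }
}
```

With two iterations and a step of 4, both versions visit columns 0 and 4, which is why the original code happened to work. With three iterations the faulty version would visit 0, 4, 12 instead of 0, 4, 8.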

commit 943a21def0bedc1732c0a2453afe7c90d7f62e95
Author: Isuru Fernando <isurufgmail.com>
Date: Thu May 21 14:09:21 2020 -0500

Add build instructions for Windows (404)

commit fbef422f0d968df10e598668b427af230cfe07e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 21 10:30:41 2020 -0500

Separate OS X and Windows into separate FAQs.

Details:
- Separated the unified Mac OS X / Windows frequently asked question
into two separate questions, one for each OS.

commit 28be1a4265ea67e3f177c391aba3dbbcf840bd52
Author: Guodong Xu <guodong.xulinaro.org>
Date: Thu May 21 02:22:22 2020 +0800

avoid loading twice in armv8a gemm kernel (403)

This bug happens in a corner case, when k_iter == 0 and we jump to
CONSIDERKLEFT.

In the current design, the first row/col. of a and b are loaded twice.

The fix is to rearrange the loading instructions for a and b (first row/col.).

Signed-off-by: Guodong Xu <guodong.xulinaro.org>

commit d51245e58b0beff2717156b980007c90337150d8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 8 18:00:54 2020 -0500

Add support for Intel oneAPI in configure.

Details:
- Properly select cc_vendor based on the output of invoking CC with the
--version option, including cases where CC is the variant of clang
that is included with Intel oneAPI. (However, we continue to treat
the compiler as clang for other purposes, not icc.) Thanks to Ajay
Panyala and Devin Matthews for reporting on this issue via 402.

commit 787adad73bd5eb65c12c39d732723a1ac0448748
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 8 16:18:20 2020 -0500

Defined netlib equivalent of xerbla_array().

Details:
- Added a function definition for xerbla_array_(), which largely mirrors
its netlib implementation. Thanks to Isuru Fernando for suggesting the
addition of this function.
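
For readers unfamiliar with this routine, here is a rough sketch of what such a wrapper does, assuming the netlib behavior of copying at most 32 characters of a character array into a blank-padded name and then forwarding it to the error handler. The names demo_xerbla and demo_xerbla_array are hypothetical stand-ins, not the BLIS definitions.

```c
#include <string.h>

#define DEMO_SRNAME_LEN 32   /* assumed maximum routine-name length */

/* State recorded by the stand-in error handler, for inspection. */
static char demo_last_name[DEMO_SRNAME_LEN + 1];
static int  demo_last_info;

/* Hypothetical stand-in for xerbla: records what it was given instead
   of printing a message and stopping. */
static void demo_xerbla(const char *srname, int info)
{
    strcpy(demo_last_name, srname);
    demo_last_info = info;
}

/* Copy up to DEMO_SRNAME_LEN characters from a (not necessarily
   NUL-terminated) array into a blank-padded name, then forward it. */
static void demo_xerbla_array(const char *srname_array, int srname_len, int info)
{
    char srname[DEMO_SRNAME_LEN + 1];
    memset(srname, ' ', DEMO_SRNAME_LEN);
    srname[DEMO_SRNAME_LEN] = '\0';

    int n = (srname_len < DEMO_SRNAME_LEN) ? srname_len : DEMO_SRNAME_LEN;
    for (int i = 0; i < n; ++i)
        srname[i] = srname_array[i];

    demo_xerbla(srname, info);
}
```

The point of the array form is that the caller passes a length explicitly, so a name that is not terminated or padded the way the handler expects can still be reported.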

commit c53b5153bee585685bf95ce22e058a7af72ecef0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 5 12:39:12 2020 -0500

Documented Perl prerequisite for build system.

Details:
- Added Perl to list of prerequisites for building BLIS. This is in part
(and perhaps completely?) due to some substitution commands used at
the end of configure that include '\n' characters that are not
properly interpreted by the version of sed included on some versions
of OS X. This new documentation addresses issue 398.

commit f032d5d4a6ed34c8c3e5ba1ed0b14d1956d0097c
Author: Guodong Xu <guodong.xulinaro.org>
Date: Thu Apr 30 01:08:46 2020 +0800

New kernel set for Arm SVE using assembly (396)

This commit adds two kernels for the Arm SVE vector extension:
1. a gemm kernel for double at size 8x8.
2. a packm kernel for double at dimension 8xk.

To achieve the best performance, vector-length-agnostic programming
is not used. A vector length (VL) of 256 bits is mandated in both kernels.
Kernels to support other VLs can be added later.

"SVE is a vector extension for AArch64 execution mode for the A64
instruction set of the Armv8 architecture. Unlike other SIMD architectures,
SVE does not define the size of the vector registers, but constrains into
a range of possible values, from a minimum of 128 bits up to a maximum of
2048 in 128-bit wide units. Therefore, any CPU vendor can implement the
extension by choosing the vector register size that better suits the
workloads the CPU is targeting. Instructions are provided specifically
to query an implementation for its register size, to guarantee that
the applications can run on different implementations of the ISA without
the need to recompile the code." [1]

[1] https://developer.arm.com/solutions/hpc/resources/hpc-white-papers/arm-scalable-vector-extensions-and-application-to-machine-learning

Signed-off-by: Guodong Xu <guodong.xulinaro.org>
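
The arithmetic behind the fixed 256-bit vector length can be worked through in a few lines. This is only an illustrative sketch with hypothetical helper names. It uses the constraints quoted above (128 to 2048 bits, in 128-bit units) and assumes 64-bit doubles.

```c
#include <stdbool.h>

/* A vector length is permitted if it is a multiple of 128 bits in the
   range 128..2048, per the description quoted above. */
static bool demo_sve_vl_is_valid(unsigned bits)
{
    return bits >= 128 && bits <= 2048 && bits % 128 == 0;
}

/* Number of 64-bit doubles that fit in one vector of the given length. */
static unsigned demo_doubles_per_vector(unsigned bits)
{
    return bits / 64;
}

/* Number of vectors needed to hold an m x n tile of doubles
   (rounded up when the tile does not fill the last vector). */
static unsigned demo_vectors_for_tile(unsigned m, unsigned n, unsigned bits)
{
    unsigned per_vec = demo_doubles_per_vector(bits);
    return (m * n + per_vec - 1) / per_vec;
}
```

At VL = 256 bits, each vector holds 4 doubles, so an 8x8 tile of doubles occupies 16 vectors. A kernel written around those fixed counts would need to be revisited for other vector lengths, which is consistent with the note that kernels for other VLs can be added later.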

commit 4d87eb24e8e1f5a21e04586f6df4f427bae0091b
Author: Yingbo Ma <mayingbo5gmail.com>
Date: Mon Apr 27 17:02:47 2020 -0400

Update KernelsHowTo.md (395)

commit 477ce91c5281df2bbfaddc4d86312fb8c8f879e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 22 14:26:49 2020 -0500

Moved #include "cpuid.h" to bli_cpuid.c.

Details:
- Relocated the #include "cpuid.h" directive from bli_cpuid.h to
bli_cpuid.c. This was done because cpuid.h (which is pulled into
the post-build blis.h developer header) doesn't protect its
definitions with a preprocessor guard of the form:

#ifndef FOOBAR_H
#define FOOBAR_H
// header contents.
#endif

and as a result, applications (previously) could not include both
blis.h and cpuid.h (since the former was already including the
latter). Thanks to Bhaskar Nallani for raising this issue via 393
and to Devin Matthews for suggesting this fix.
- CREDITS file update.
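
To illustrate why the missing guard matters, the following sketch simulates including the same header contents twice within one translation unit. The macro and type names are hypothetical and have nothing to do with the real cpuid.h. With the guard, the second copy is skipped. Without it, the repeated type definition would be rejected by the compiler.

```c
/* First "inclusion" of a hypothetical header's contents. */
#ifndef DEMO_CPUID_H
#define DEMO_CPUID_H
typedef struct { unsigned int eax, ebx, ecx, edx; } demo_cpuid_regs_t;
#endif

/* Second "inclusion": skipped entirely because DEMO_CPUID_H is now
   defined, so demo_cpuid_regs_t is not defined a second time. */
#ifndef DEMO_CPUID_H
#define DEMO_CPUID_H
typedef struct { unsigned int eax, ebx, ecx, edx; } demo_cpuid_regs_t;
#endif

/* Small function using the type, to show the unit compiles and works. */
static unsigned int demo_sum_regs(const demo_cpuid_regs_t *r)
{
    return r->eax + r->ebx + r->ecx + r->edx;
}
```

Moving the include into the .c file sidesteps the problem from the other direction: the public header no longer pulls in the unguarded header, so an application is free to include it on its own.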

commit 8bde63ffd7474a97c3a3b0b0dc1eae45be0ab889
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 18 12:50:12 2020 -0500

Adding missing conjy to her2/syr2 in typed API doc.

Details:
- Fixed a missing argument (conjy) in the function signatures of
bli_?her2() and bli_?syr2() in docs/BLISTypedAPI.md. Thanks to Robert
van de Geijn for reporting this omission.

commit 976902406b610afdbacb2d80a7a2b4b43ff30321
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 17 15:11:10 2020 -0500

Disable packing by default in expert rntm_t init.

Details:
- Changed the behavior of bli_rntm_init() as well as the static
initializer, BLIS_RNTM_INITIALIZER, so that user-initialized rntm_t
objects by default specify the disabling of packing for A and B.
Packing of A/B was already disabled by default when calling non-expert
APIs (and enabled only when the user set environment variables
BLIS_PACK_A or BLIS_PACK_B). With this commit, the default behavior of
using user-initialized rntm_t objects with expert APIs comes into line
with the default behavior of non-expert APIs--that is, they now both
lead to the avoidance of packing in the sup code path. (Note: The
conventional code path is unaffected by the environment variables
BLIS_PACK_A/BLIS_PACK_B and/or the disabling of packing in a rntm_t
object when calling an expert API.) This addresses issue 392. Thanks
to Kiran Varaganti for bringing this inconsistency to our attention.
- The above change was accomplished by changing the definitions of
static functions bli_rntm_clear_pack_a() and bli_rntm_clear_pack_b()
in bli_rntm.h, which are both for internal use only.

commit 5f2aee7c5fa5d562acaf8fbde3df0e2a04e1dd1b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:55:15 2020 -0500

README.md update to promote supmt dgemm.

Details:
- Updated the sup entry in the "What's New" section of the README.md
file to promote the multithreaded dgemm sup feature introduced in
c0558fd.

commit f5923cd9ff5fbd91190277dea8e52027174a1d57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:41:45 2020 -0500

CHANGELOG update (0.7.0)
