Blis

Latest version: v1.3.0

0.7.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:41:44 2020 -0500

Version file update (0.7.0)

commit b04de636c1702e4cb8e7ad82bab3cf43d2dbdfc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 7 14:37:43 2020 -0500

ReleaseNotes.md update in advance of next version.

Details:
- Updated docs/ReleaseNotes.md in preparation for next version.

commit 2cb604ba472049ad498df72d4a2dc47a161d4c3c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 16:42:14 2020 -0500

Rename more bli_thread_obarrier(), _obroadcast().

Details:
- Renamed instances of bli_thread_obarrier() and bli_thread_obroadcast()
that were made in the supmt-specific code committed to the 'amd'
branch, which has now been merged with 'master'. Prior to the merge,
'master' received commit c01d249, which applied these renamings to
the existing, non-sup codebase.

commit efb12bc895de451067649d5dceb059b7827a025f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 15:01:53 2020 -0500

Minor updates/elaborations to RELEASING file.

commit 2e3b3782cfb7a2fd0d1a325844983639756def7d
Merge: 9f3a8d4d da0c086f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 6 14:55:35 2020 -0500

Merge branch 'master' into amd

commit da0c086f4643772e111318f95a712831b0f981a8
Author: Satish Balay <balaymcs.anl.gov>
Date: Tue Mar 31 17:09:41 2020 -0500

OSX: specify the full path to the location of libblis.dylib (390)

* OSX: specify the full path to the location of libblis.dylib so that it can be found at runtime

Before this change:

Application gives runtime error [when linked with blis]
dyld: Library not loaded: libblis.3.dylib

balaykpro lib % otool -L libblis.dylib
libblis.dylib:
libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

After this change:
balaykpro lib % otool -L libblis.dylib
libblis.dylib:
/Users/balay/petsc/arch-darwin-c-debug/lib/libblis.3.dylib (compatibility version 0.0.0, current version 0.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1281.0.0)

* INSTALL_LIBDIR -> libdir as INSTALL_LIBDIR has DESTDIR

Co-Authored-By: Jed Brown <jedjedbrown.org>

* CREDITS file update.

Co-authored-by: Jed Brown <jedjedbrown.org>
Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit 2bca03ea9d87c0da829031a5332545d05e352211
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 28 22:10:00 2020 +0000

Updates, tweaks to runme.sh in test/1m4m.

Details:
- Made several updates to test/1m4m/runme.sh, including:
- Added missing handling for 1m and 4m1a implementations when setting
the BLIS_??_NT environment variables.
- Added support for using numactl to run the test executables.
- Several other cleanups.

commit c40a33190b94af5d5c201be63366594859b1233f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 26 16:55:00 2020 -0500

Warn user when auto-detection returns 'generic'.

Details:
- Added logic to configure that causes the script to output a warning
to the user if/when "./configure auto" is run and the underlying
hardware feature detection code is unable to identify the hardware.
In these cases, the auto-detect code will return 'generic', which
is likely not what the user expected, and a flag will be set so that
a message is printed at the end of the configure output. (Thankfully,
we don't expect this scenario to play out very often.) Thanks to
Devin Matthews for suggesting this fix 384.

commit 492a736fab5b9c882996ca024b64646877f22a89
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Mar 24 17:28:47 2020 -0500

Fix vectorized version of bli_amaxv (382)

* Fix vectorized version of bli_amaxv

To match Netlib, i?amax should return:
- the lowest index among equal values
- the first NaN if one is encountered

* Fix typos.

* And another one...

* Update ref. amaxv kernel too.

* Re-enabled optimized amaxv kernels.

Details:
- Re-enabled the optimized, intrinsics-based amaxv kernels in the 'zen'
kernel set for use in haswell, zen, zen2, knl, and skx subconfigs.
These two kernels (for s and d datatypes) were temporarily disabled in
e186d71 as part of issue 380. However, the key missing semantic
properties that prompted the disabling of these kernels--returning the
index of the *first* rather than of the last element with largest
absolute value, and returning the index of the first NaN if one is
encountered--were added as part of 382 thanks to Devin Matthews.
Thus, now that the kernels are working as expected once more, this
commit causes these kernels to once again be registered for the
affected subconfigs, which effectively reverts all code changes
included in e186d71.
- Whitespace/formatting updates to new macros in bli_amaxv_zen_int.c.

Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>
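
The two Netlib rules quoted in this commit (lowest index among equal values, first NaN if one is encountered) can be illustrated with a short scalar loop. This is only a sketch of the semantics, not the BLIS reference or vectorized kernel: the name amax_index is hypothetical, the index is zero-based, and unit stride is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a BLIS function: returns the zero-based index of
   the element of x with the largest absolute value, following the two rules
   quoted in the commit message. Assumes unit stride and n > 0. */
static size_t amax_index(size_t n, const double *x)
{
    assert(n > 0 && x != NULL);

    size_t best = 0;
    double best_abs = -1.0; /* smaller than any absolute value */

    for (size_t i = 0; i < n; ++i) {
        double v = x[i];

        /* Only NaN compares unequal to itself: report the first NaN. */
        if (v != v) {
            return i;
        }

        double a = (v < 0.0) ? -v : v;

        /* Strict '>' keeps the lowest index among equal values. */
        if (a > best_abs) {
            best_abs = a;
            best = i;
        }
    }
    return best;
}
```

Using '>=' in the comparison would instead return the last of several equal elements, which is the behavior that led to the kernels being disabled in e186d71.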

commit e186d7141a51f2d7196c580e24e7b7db8f209db9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 21 18:40:36 2020 -0500

Disabled optimized amaxv kernels.

Details:
- Disabled use of optimized amaxv kernels, which use vector intrinsics
for both 's' and 'd' datatypes. We disable these kernels because the
current implementations fail to observe a semantic property of the
BLAS i?amax_() subroutine, which is to return the index of the
*first* element containing the maximum absolute value (that is, the
first element if there exist two or more elements that contain the
same value). With the optimized kernels disabled, the affected
subconfigurations (haswell, zen, zen2, knl, and skx) will use the
default reference implementations. Thanks to Mat Cross for reporting
this issue via 380.
- CREDITS file update.

commit 9f3a8d4d851725436b617297231a417aa9ce8c6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 14 17:48:43 2020 -0500

Added missing return to bli_thread_partition_2x2().

Details:
- Added a missing return statement to the body of an early case handling
branch in bli_thread_partition_2x2(). This bug only affected cases
where n_threads < 4, and even then, the code meant to handle cases
where n_threads >= 4 executes and does the right thing, albeit using
more CPU cycles than needed. Nonetheless, thanks to Kiran Varaganti
for reporting this bug via issue 377.
- Whitespace changes to bli_thread.c (spaces -> tabs).

commit 8c3d9b9eeb6f816ec8c32a944f632a5ad3637593
Merge: 71249fe8 0f9e0399
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 10 14:03:33 2020 -0500

Merge branch 'amd' of github.com:flame/blis into amd

commit 71249fe8ddaa772616698f1e3814d40e012909ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 10 13:55:29 2020 -0500

Merged test/sup, test/supmt into test/sup.

Details:
- Updated the Makefile, test_gemm.c, and runme.sh in test/sup to be able
to compile and run both single-threaded and multithreaded experiments.
This should help with maintenance going forward.
- Created a test/sup/octave_st directory of scripts (based on the
previous test/sup/octave scripts) as well as a test/sup/octave_mt
directory (based on the previous test/supmt/octave scripts). The
octave scripts are slightly different and not easily mergeable, and
thus for now I'll maintain them separately.
- Preserved the previous test/sup directory as test/sup/old/supst and
the previous test/supmt directory as test/sup/old/supmt.

commit 0f9e0399e16e96da2620faf2c0c3c21274bb2ebd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 5 17:03:21 2020 -0600

Updated sup performance graphs; added mt results.

Details:
- Reran all existing single-threaded performance experiments comparing
BLIS sup to other implementations (including the conventional code
path within BLIS), using the latest versions (where appropriate).
- Added multithreaded results for the three existing hardware types
showcased in docs/PerformanceSmall.md: Kaby Lake, Haswell, and Epyc
(Zen1).
- Various minor updates to the text in docs/PerformanceSmall.md.
- Updates to the octave scripts in test/sup/octave, test/supmt/octave.

commit 90db88e5729732628c1f3acc96eeefab49f2da41
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 2 15:06:48 2020 -0600

Updated sup[mt] Makefiles for variable dim ranges.

Details:
- Updated test/sup/Makefile and test/supmt/Makefile to allow specifying
different problem size ranges for the drivers where one, two, or three
matrix dimensions is large. This will facilitate the generation of
more meaningful graphs, particularly when two dimensions are tiny.

commit 31f11a06ea9501724feec0d2fc5e4644d7dd34fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 27 14:33:20 2020 -0600

Updates to octave scripts in test/sup[mt]/octave.

Details:
- Optimized scripts in test/sup/octave and test/supmt/octave for use
with octave 5.2.0 on Ubuntu 18.04.
- Fixed stray 'end' keywords in gen_opsupnames.m and plot_l3sup_perf.m,
which were not only unnecessary but also causing issues with versions
5.x.

commit c01d249d7c546fe2e3cee3fe071cd4c4c88b9115
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 25 14:50:53 2020 -0600

Renamed bli_thread_obarrier(), _obroadcast().

Details:
- Renamed two bli_thread_*() APIs:
bli_thread_obarrier() -> bli_thread_barrier()
bli_thread_obroadcast() -> bli_thread_broadcast()
The 'o' was a leftover from when thrcomm_t objects tracked both
"inner" and "outer" communicators. They have long since been
simplified to only support the latter, and thus the 'o' is
superfluous.

commit f6e6bf73e695226c8b23fe7900da0e0ef37030c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 24 17:52:23 2020 -0600

List Gentoo under supported external packages.

Details:
- Add mention of Gentoo Linux under the list of external packages in
the README.md file. Thanks to M. Zhou for maintaining this package.

commit 9e5f7296ccf9b3f7b7041fe1df20b927cd0e914b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 18 15:16:03 2020 -0600

Skip building thrinfo_t tree when mt is disabled.

Details:
- Return early from bli_thrinfo_sup_grow() if the thrinfo_t object
address is equal to either &BLIS_GEMM_SINGLE_THREADED or
&BLIS_PACKM_SINGLE_THREADED.
- Added preprocessor logic to bli_l3_sup_thread_decorator() in
bli_l3_sup_decor_single.c that (by default) disables code that
creates and frees the thrinfo_t tree and instead passes
&BLIS_GEMM_SINGLE_THREADED as the thrinfo_t pointer into the
sup implementation.
- The net effect of the above changes is that a small amount of
thrinfo_t overhead is avoided when running small/skinny dgemm
problems when BLIS is compiled with multithreading disabled.

commit 90081e6a64b5ccea9211bdef193c2d332c68492f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 17 14:57:25 2020 -0600

Fixed bug(s) in mt sup when single-threaded.

Details:
- Fixed a syntax bug in bli_l3_sup_decor_single.c as a result of
changing function interface for the thread entry point function
(of type l3supint_t).
- Unfortunately, fixing the interface was not enough, as it caused
a memory leak in the sba at bli_finalize() time. It turns out that,
due to the new multithreading-capable variant code using thrinfo_t
objects--specifically, their calling of bli_thrinfo_grow()--we
have to pass in a real thrinfo_t object rather than the global
objects &BLIS_PACKM_SINGLE_THREADED or &BLIS_GEMM_SINGLE_THREADED.
Thus, I inserted the appropriate logic from the OpenMP and pthreads
versions so that single-threaded execution would work as intended
with the newly upgraded variants.

commit c0558fde4511557c8f08867b035ee57dd2669dc6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 17 14:08:08 2020 -0600

Support multithreading within the sup framework.

Details:
- Added multithreading support to the sup framework (via either OpenMP
or pthreads). Both variants 1n and 2m now have the appropriate
threading infrastructure, including data partitioning logic, to
parallelize computation. This support handles all four combinations
of packing on matrices A and B (neither, A only, B only, or both).
This implementation tries to be a little smarter when automatic
threading is requested (e.g. via BLIS_NUM_THREADS) in that it will
recalculate the factorization in units of micropanels (rather than
using the raw dimensions) in bli_l3_sup_int.c, when the final
problem shape is known and after threads have already been spawned.
- Implemented bli_?packm_sup_var2(), which packs to conventional row-
or column-stored matrices. (This is used for the rrc and crc storage
cases.) Previously, copym was used, but that would no longer suffice
because it could not be parallelized.
- Minor reorganization of packing-related sup functions. Specifically,
bli_packm_sup_init_mem_[ab]() are called from within packm_sup_[ab]()
instead of from the variant functions. This has the effect of making
the variant functions more readable.
- Added additional bli_thrinfo_set_*() static functions to bli_thrinfo.h
and inserted usage of these functions within bli_thrinfo_init(), which
previously was accessing thrinfo_t fields via the -> operator.
- Renamed bli_partition_2x2() to bli_thread_partition_2x2().
- Added an auto_factor field to the rntm_t struct in order to track
whether automatic thread factorization was originally requested.
- Added new test drivers in test/supmt that perform multithreaded sup
tests, as well as appropriate octave/matlab scripts to plot the
resulting output files.
- Added additional language to docs/Multithreading.md to make it clear
that specifying any BLIS_*_NT variable, even if it is set to 1, will
be considered manual specification for the purposes of determining
whether to auto-factorize via BLIS_NUM_THREADS.
- Minor comment updates.

commit d7a7679182d72a7eaecef4cd9b9a103ee0a7b42b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 7 17:37:03 2020 -0600

Fixed int-to-packbuf_t conversion error (C++ only).

Details:
- Fixed an error that manifests only when using C++ (specifically,
modern versions of g++) to compile drivers in 'test' (and likely most
other application code that includes blis.h). Thanks to Ajay Panyala
for reporting this issue (374).

commit d626112b8d5302f9585fb37a8e37849747a2a317
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jan 15 13:27:02 2020 -0600

Removed sorting on LDFLAGS in common.mk (373).

Details:
- Removed a line of code in common.mk that passed LDFLAGS through the
sort function. The purpose was not to sort the contents, but rather
to remove duplicates. However, there is valid syntax in a string of
linker flags that, when sorted, yields different/broken behavior.
So I've removed the line in common.mk that sorts LDFLAGS. Also, for
future use, I've added a new function, rm-dupls, that removes
duplicates without sorting. (This function was based on code from a
stackoverflow thread that is linked to in the comments for that
code.) Thanks to Isuru Fernando for reporting this issue (373).

commit e67deb22aaeab5ed6794364520190936748ef272
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 16:01:34 2020 -0600

CHANGELOG update (0.6.1)

0.6.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 16:01:33 2020 -0600

Version file update (0.6.1)

commit 5db8e710a2baff121cba9c63b61ca254a2ec097a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:59:59 2020 -0600

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.

commit cde4d9d7a26eb51dcc5a59943361dfb8fda45dea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:19:25 2020 -0600

Removed 'attic/windows' (to prevent confusion).

Details:
- Finally removed 'attic/windows' and its contents. This directory once
contained "proto" Windows support for BLIS, but we've since moved on
to (thanks to Isuru Fernando) providing Windows DLL support via
AppVeyor's build artifacts. Furthermore, since 'windows' was the only
subdirectory within 'attic', the directory path would show up in
GitHub's listing at https://github.com/flame/blis, which probably led
to someone being confused about how BLIS provides Windows support. I
assume (but don't know for sure) that nobody is using these files, so
this is admittedly a case of shoot first and ask questions later.

commit 7d3407d4681c6449f4bbb8ec681983700ab968f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 14 15:17:53 2020 -0600

CREDITS file update.

commit f391b3e2e7d11a37300d4c8d3f6a584022a599f5
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Mon Jan 6 20:15:48 2020 +0000

Fix parsing in vpu_count on workstation SKX (351)

* Fix parsing in vpu_count on workstation SKX

* Document Skylake-X as Haswell for single FMA

* Update vpu_count for Skylake and Cascade Lake models

* Support printing the configuration selected, controlled by the environment

Intended particularly for diagnosing mis-selection of SKX through
unknown, or incorrect, number of VPUs.

* Move bli_log outside the cpp condition, and use it where intended

* Add Fixme comment (Skylake D)

* Mostly superficial edits to commits towards 351.

Details:
- Moved architecture/sub-config logging-related code from bli_cpuid.c
to bli_arch.c, tweaked names, and added more set/get layering.
- Tweaked log messages output from bli_cpuid_is_skx() in bli_cpuid.c.
- Content, whitespace changes to new bullet in HardwareSupport.md that
relates to single-VPU Skylake-Xs.

* Fix comment typos

Co-authored-by: Field G. Van Zee <fieldcs.utexas.edu>

commit 5ca1a3cfc1c1cc4dd9da6a67aa072ed90f07e867
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 6 12:29:12 2020 -0600

Fixed 'configure' breakage introduced in 6433831.

Details:
- Added a missing 'fi' (endif) keyword to a conditional block added in
the configure script in commit 6433831.

commit e7431b4a834ef4f165c143f288585ce8e2272a23
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jan 6 12:01:41 2020 -0600

Updated 1m draft article link in README.md.

commit 6433831cc3988ad205637ebdebcd6d8f7cfcf148
Author: Jeff Hammond <jeff.r.hammondintel.com>
Date: Fri Jan 3 17:52:49 2020 -0800

blacklist ICC 18 for knl/skx due to test failures

Signed-off-by: Jeff Hammond <jeff.r.hammondintel.com>

commit af3589f1f98781e3a94a8f9cea8d5ea6f155f7d2
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Fri Jan 3 13:23:24 2020 -0800

blacklist Intel 19+

Signed-off-by: Jeff Hammond <jeff.r.hammondintel.com>

commit 60de939debafb233e57fd4e804ef21b6de198caf
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Wed Jan 1 21:30:38 2020 -0800

fix link to docs

the comment contains an incorrect link, which is trivially fixed here.

fgvanzee I hope you don't mind that I committed directly to master but this cannot break anything.

commit 52711073789b6b84eb99bb0d6883f457ed3fcf80
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 16 16:30:26 2019 -0600

Fixed bugs in cblas_sdsdot(), sdsdot_().

Details:
- Fixed a bug in sdsdot_sub() that redundantly added the "alpha" scalar,
named 'sb'. This value was already being added by the underlying
sdsdot_() function. Thus, we no longer add 'sb' within sdsdot_sub().
Thanks to Simon Lukas Märtens for reporting this bug via 367.
- Fixed a second bug in order of typecasting intermediate products in
sdsdot_(). Previously, the "alpha" scalar was being added after the
"outer" typecast to float. However, the operation is supposed to first
add the dot product to the (promoted) scalar and THEN downcast the sum
to float. Thanks to Devin Matthews for catching this bug.
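
The order of operations described in the second bullet can be sketched as follows. This is an illustration of the arithmetic only, not the BLIS source: the function name sdsdot_sketch is hypothetical and unit strides are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the arithmetic described above; not the BLIS
   implementation. Assumes unit stride for both x and y. */
static float sdsdot_sketch(size_t n, float sb, const float *x, const float *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Promote the scalar first, so the whole sum is carried in double. */
    double acc = (double)sb;

    for (size_t i = 0; i < n; ++i) {
        acc += (double)x[i] * (double)y[i];
    }

    /* Downcast exactly once, after sb has been added. */
    return (float)acc;
}
```

A wrapper in the style of sdsdot_sub() would then store this result as-is; adding 'sb' a second time in the wrapper is the first bug described above.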

commit fe2560a4b1d8ef8d0a446df6002b1e7decc826e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 6 17:12:44 2019 -0600

Annotated missing thread-related symbols for export.

Details:
- Added BLIS_EXPORT_BLIS annotation to function prototypes for

bli_thrcomm_bcast()
bli_thrcomm_barrier()
bli_thread_range_sub()

so that these functions are exported to shared libraries by default.
This (hopefully) fixes issue 366. Thanks to Kyungmin Lee for
reporting this bug.
- CREDITS file update.

commit 2853825234001af8f175ad47cef5d6ff9b7a5982
Merge: efa61a6c 61b1f0b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 6 16:06:46 2019 -0600

Merge branch 'master' into amd

commit 61b1f0b0602faa978d9912fe58c6c952a33af0ac
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Wed Dec 4 14:18:47 2019 -0600

Add prototypes for POWER9 reference kernels (365)

Updates and fixes to power9 subconfig.

Details:
- Register s,c,z reference gemm and trsm ukernels that assume elements
of B have been broadcast.
- Added prototypes for level-3 ukernels that assume elements of B have
been broadcast. Also added prototype for an spackm function that
employs a duplication/broadcast factor of 4.
- Register virtual gemmtrsm ukernels that work with broadcasting of B.
- Disable right-side hemm, symm, trmm, and trmm3 in bli_family_power9.h.
- Thanks to Nicholai Tukanov for providing these updates.

commit efa61a6c8b1cfa48781fc2e4799ff32e1b7f8f77
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 29 16:17:04 2019 -0600

Added missing bli_l3_sup_thread_decorator() symbol.

Details:
- Defined dummy versions of bli_l3_sup_thread_decorator() for OpenMP
and pthreads so that those builds don't fail when performing shared
library linking (especially for Windows DLLs via AppVeyor). For now,
these dummy implementations of bli_l3_sup_thread_decorator() are
merely carbon-copies of the implementation provided for single-
threaded execution (ie: the one found in bli_l3_sup_decor_single.c).
Thus, an OpenMP or pthreads build will be able to use the gemmsup
code (including the new selective packing functionality), as it did
before 39fa7136, even though it will not actually employ any
multithreaded parallelism.

commit 39fa7136f4a4e55ccd9796fb79ad5f121b872ad9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 29 15:27:07 2019 -0600

Added support for selective packing to gemmsup.

Details:
- Implemented optional packing for A or B (or both) within the sup
framework (which currently only supports gemm). The request for
packing either matrix A or matrix B can be made via setting
environment variables BLIS_PACK_A or BLIS_PACK_B (to any
non-zero value; if set, zero means "disable packing"). It can also
be made globally at runtime via bli_pack_set_pack_a() and
bli_pack_set_pack_b() or with individual rntm_t objects via
bli_rntm_set_pack_a() and bli_rntm_set_pack_b() if using the expert
interface of either the BLIS typed or object APIs. (If using the
BLAS API, environment variables are the only way to communicate the
packing request.)
- One caveat (for now) with the current implementation of selective
packing is that any blocksize extension registered in the _cntx_init
function (such as is currently used by haswell and zen subconfigs)
will be ignored if the affected matrix is packed. The reason is
simply that I didn't get around to implementing the necessary logic
to pack a larger edge-case micropanel, though this is entirely
possible and should be done in the future.
- Spun off the variant-choosing portion of bli_gemmsup_ref() into
bli_gemmsup_int(), in bli_l3_sup_int.c.
- Added new files, bli_l3_sup_packm_a.c, bli_l3_sup_packm_b.c, along
with corresponding headers, in which higher-level packm-related
functions are defined for use within the sup framework. The actual
packm variant code resides in bli_l3_sup_packm_var.c.
- Pass the following new parameters into var1n and var2m: packa, packb
bool_t's, pointer to a rntm_t, pointer to a cntl_t (which is for now
always NULL), and pointer to a thrinfo_t* (which for now is the address
of the global single-threaded packm thread control node).
- Added panel strides ps_a and ps_b to the auxinfo_t structure so that
the millikernel can query the panel stride of the packed matrix and
step through it accordingly. If the matrix isn't packed, the panel
stride of interest for the given millikernel will be set to the
appropriate value so that the mkernel may step through the unpacked
matrix as it normally would.
- Modified the rv_6x8m and rv_6x8n millikernels to read the appropriate
panel strides (ps_a and ps_b, respectively) instead of computing them
on the fly.
- Spun off the environment variable getting and setting functions into
a new file, bli_env.c (with a corresponding prototype header). These
functions are now used by the threading infrastructure (e.g.
BLIS_NUM_THREADS, BLIS_JC_NT, etc.) as well as the selective packing
infrastructure (e.g. BLIS_PACK_A, BLIS_PACK_B).
- Added a static initializer for mem_t objects, BLIS_MEM_INITIALIZER.
- Added a static initializer for pblk_t objects, BLIS_PBLK_INITIALIZER,
for use within the definition of BLIS_MEM_INITIALIZER.
- Moved the global_rntm object to bli_rntm.c and extern it where needed.
This means that the function bli_thread_init_rntm() was renamed to
bli_rntm_init_from_global() and relocated accordingly.
- Added a new file, bli_pack.c, which serves as the home for
functions that manage the pack_a and pack_b fields of the global
rntm_t, including from environment variables, just as we have
functions to manage the threading fields of the global rntm_t in
bli_thread.c.
- Reorganized naming for files in frame/thread, which mostly involved
spinning off the bli_l3_thread_decorator() functions into their own
files. This change makes more sense when considering the further
addition of bli_l3_sup_thread_decorator() functions (for now limited
only to the single-threaded form found in the _single.c file).
- Explicitly initialize the reference sup handlers in both
bli_cntx_init_haswell.c and bli_cntx_init_zen.c so that it's more
obvious how to customize to a different handler, if desired.
- Removed various snippets of disabled code.
- Various comment updates.
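
The environment-variable convention described in the first bullet of this commit (any non-zero value requests packing; zero disables it) can be sketched as below. This is not the code in bli_env.c: the helper names are hypothetical, and treating a non-numeric value as zero is an assumption of the sketch, not documented behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: interprets the text of a variable such as
   BLIS_PACK_A. A NULL value (variable not set) yields the fallback.
   Assumption: non-numeric text parses as 0 and therefore disables packing. */
static bool parse_pack_flag(const char *value, bool fallback)
{
    if (value == NULL) {
        return fallback;
    }
    return strtol(value, NULL, 10) != 0;
}

/* Hypothetical wrapper that reads the variable from the environment. */
static bool env_pack_flag(const char *name, bool fallback)
{
    assert(name != NULL);
    return parse_pack_flag(getenv(name), fallback);
}
```

For example, env_pack_flag("BLIS_PACK_A", false) would return true when the process is started with BLIS_PACK_A=1 and false when the variable is unset or set to 0.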

commit bbb21fd0a9be8c5644bec37c75f9396eeeb69e48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:15:16 2019 -0600

Tweaked SIAM/SC Best Prize language in README.md.

commit 043366f92d5f5f651d5e3371ac3adb36baf4adce
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:13:51 2019 -0600

Fixed typo in previous commit (SIAM/SC prize).

commit 05a4d583e65a46ff2a1100ab4433975d905d91f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 18:12:24 2019 -0600

Added SIAM/SC prize to "What's New" in README.md.

commit 881b05ecd40c7bc0422d3479a02a28b1cb48383f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 21 16:34:27 2019 -0600

Fixed blastest failure for 'generic' subconfig.

Details:
- Fixed a subtle and complicated bug that only manifested via the BLAS
test drivers in the generic subconfiguration, and possibly any other
subconfiguration that did not register complex-domain gemm ukernels,
or registered ONLY real-domain ukernels as row-preferential. This is
a long story, but it boils down to an exception to the "transpose the
operation to bring storage of C into agreement with ukernel pref"
optimization in bli_hemm_front.c and bli_symm_front.c sabotaging the
proper functioning of the 1m method, but only when the imaginary
component of beta is zero. See the comments in issue 342 for more
details. Thanks to Dave Love for identifying the commit in which this
bug was introduced, and other feedback related to this bug.

commit 0c7165fb01cdebbc31ec00124d446161b289942f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 14 16:48:14 2019 -0600

Fixed obscure bug in bli_acquire_mpart_[mn]dim().

Details:
- Fixed a bug in bli_acquire_mpart_mdim(), bli_acquire_mpart_ndim(),
and bli_acquire_mpart_mndim() that allowed the use of a blocksize b
that is too large given the current row/column index (i.e., the i/j
argument) and the size of the dimension being partitioned (i.e., the
m/n argument). This bug only affected backwards partitioning/motion
through the dimension and was the result of a misplaced conditional
check-and-redirect to the backwards code path. It should be noted
that this bug was discovered not because it manifested the way it
could (thanks to the callers in BLIS making sure to always pass in
the "correct" blocksize b), but could have manifested if the
functions were used by 3rd party callers. Thanks to Minh Quan Ho for
reporting the bug via issue 363.
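
The commit message does not show the corrected code, so the following is only a sketch of the idea: clamp the blocksize against the remaining extent before computing the backward offset. The type block_t and the function next_block are hypothetical, and the sketch assumes that i counts the elements already traversed in the direction of motion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical type: a sub-range of a dimension of size m. */
typedef struct {
    size_t offset;
    size_t length;
} block_t;

/* Hypothetical helper, not the BLIS implementation.
   m: size of the dimension, i: elements already traversed,
   b: requested blocksize, backward: direction of motion. */
static block_t next_block(size_t m, size_t i, size_t b, bool backward)
{
    assert(i < m && b > 0);

    /* Clamp first, for both directions, so that b never exceeds what is
       left of the dimension. */
    size_t remaining = m - i;
    if (b > remaining) {
        b = remaining;
    }

    block_t blk;
    blk.length = b;
    /* Moving backward, the block ends where the traversed part begins. */
    blk.offset = backward ? (m - i - b) : i;
    return blk;
}
```

If the clamp were applied only on the forward path, a backward call with m = 10, i = 8, b = 4 would compute the offset 10 - 8 - 4, which underflows.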

commit fb8bef9982171ee0f60bc39e41a33c4d31fd59a9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 14 13:05:28 2019 -0600

Fixed copy-paste bug in bli_spackm_6xk_bb4_ref().

Details:
- Fixed a copy-paste bug in the new bli_spackm_6xk_bb4_ref() that
manifested as failures in single-precision real level-3 operations.
Also replaced the duplication factor constants with a const-qualified
variable, dfac, so that this won't happen again.
- Changed NC for single-precision real from 4080 to 8160 so that the
packed matrix B will have the same byte footprint in both single
and double real.

commit 8f399c89403d5824ba767df1426706cf2d19d0a7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 12 15:32:57 2019 -0600

Tweaked/added notes to docs/Multithreading.md.

Details:
- Added language to docs/Multithreading.md cautioning the reader about
the nuances of setting multithreading parameters via the manual and
automatic ways simultaneously, and also about how these parameters
behave when multithreading is disabled at configure-time. These
changes are an attempt to address the issues that arose in issue 362.
Thanks to Jérémie du Boisberranger for his feedback on this topic.
- CREDITS file update.

commit bdc7ee3394500d8e5b626af6ff37c048398bb27e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 11 15:47:17 2019 -0600

Various fixes to support packing duplication in B.

Details:
- Added cpp macros to trmm and trmm3 front-ends to optionally force
those operations to be cast so the structured matrix is on the left.
symm and hemm already had such macros, but these too were renamed so
that the macros were individual to the operation. We now have four
such macros:
#define BLIS_DISABLE_HEMM_RIGHT
#define BLIS_DISABLE_SYMM_RIGHT
#define BLIS_DISABLE_TRMM_RIGHT
#define BLIS_DISABLE_TRMM3_RIGHT
Also, updated the comments in the symm and hemm front-ends related to
the first two macro guards, and added corresponding comments to the
trmm and trmm3 front-ends for the latter two guards. (They all
functionally do the same thing, just for their specific operations.)
Thanks to Jeff Hammond for reporting the bugs that led me to this
change (via 359).
- Updated config/old/haswellbb subconfiguration (used to debug issues
related to duplicating B during packing) to register: a packing
kernel for single-precision real; gemmbb ukernels for s, c, and z;
trsmbb ukernels for s, c, and z; gemmtrsmbb virtual ukernels for s, c
and z; and to use non-default cache and register blocksizes for s, c,
and z datatypes. Also declared prototypes for all of the gemmbb,
trsmbb, and gemmtrsmbb ukernel functions within the
bli_cntx_init_haswellbb() function. This should, once applied to the
power9 configuration, fix the remaining issues in 359.
- Defined bli_spackm_6xk_bb4_ref(), which packs single reals with a
duplication factor of 4. This function is defined in the same file as
bli_dpackm_6xk_bb2_ref() (bli_packm_cxk_bb_ref.c).

commit 0eb79ca8503bd7b237994335b9687457227d3290
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 8 14:48:48 2019 -0600

Avoid unused variable warning in lread.c (356).

Details:
- Replaced the line

f = f;

with

( void )f;

for the unused variable 'f' in blastest/f2c/lread.c. (Hopefully)
addresses issue 356, but since we don't use xlc who knows. Thanks
to Jeff Hammond for reporting this.

commit f377bb448512f0b578263387eed7eaf8f2b72bb7
Author: Jérôme Duval <jerome.duvalgmail.com>
Date: Thu Nov 7 23:39:29 2019 +0100

Add Haiku to the known OS list (361)

commit e29b1f9706b6d9ed798b7f6325f275df4e6be973
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 5 17:15:19 2019 -0600

Fixed failing testsuite gemmtrsm_ukr for power9.

Details:
- Added code that fixes false failures in the gemmtrsm_ukr module of the
testsuite. The tests were failing because the computation (bli_gemv())
that performs the numerical check was not able to properly traverse
the matrix operands bx1 and b11 that are views into the micropanel of
B, which has duplicated/broadcast elements under the power9 subconfig.
(For example, a micropanel of B with duplication factor of 2 needs to
use a column stride of 2; previously, the column stride was being
interpreted as 1.)
- Defined separate bli_obj_set_row_stride() and bli_obj_set_col_stride()
static functions in bli_obj_macro_defs.h. (Previously, only the
function bli_obj_set_strides() was defined. Amazing to think that we
got this far without these former functions.)
- Updated/expounded upon comments.
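
A minimal sketch of the stride issue, assuming a row-by-row layout in which each logical element is stored dup times in a row. The helper bb_elem() and the layout are assumptions for illustration, not BLIS code.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to logical element (i,j) of a micropanel of B whose
   elements are each stored 'dup' times consecutively. Assumed layout
   (not taken from BLIS): row i starts at offset i*rs and logical column
   j starts at offset j*cs, so a correct view needs cs == dup. */
const double* bb_elem( const double* b, size_t rs, size_t cs,
                       size_t i, size_t j )
{
    assert( b != NULL && cs >= 1 );
    return b + i*rs + j*cs;
}
```

With a duplication factor of 2, a column stride of 1 lands on the duplicate of the previous element rather than on the next logical column, which is the kind of mis-traversal the commit describes.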

commit 49177a6b9afcccca5b39a21c6fd8e243525e1505
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 4 18:09:37 2019 -0600

Fixed latent testsuite ukr module bugs for power9.

Details:
- Fixed a latent bug in the testsuite ukernel modules (gemm, trsm, and
gemmtrsm) that only manifested once we began running with parameters
that mimic those of power9. The problem was rooted in the way those
modules were creating objects (and thus allocating memory) for the
micropanel operands to the microkernel being tested. Since power9
duplicates/broadcasts elements of B in memory, we needed an easy way
of asking for more than one storage element per logical element in
the matrix. I incorrectly expressed this as:

bli_obj_create( datatype, k, n, ldbp, 1, &bp );

The problem here is that bli_obj_create() is exceedingly efficient
at calculating the size it passes to malloc() and doesn't allocate a
full leading dimension's worth of elements for the last column (or
row, in this example). This would normally not bother anyone since
you're not supposed to access that memory anyway. But here, my
attempted "hack" for getting extra elements was insufficient, and
needed to be changed to:

bli_obj_create( datatype, k, ldbp, ldbp, 1, &bp );

That is, the extra elements needed to be baked into the dimensions of
the matrix object in order to have the intended effect on the number
of elements actually allocated. Thanks to Jeff Hammond for reporting
this bug.
- Fixed a typically harmless memory leak in the aforementioned test
modules (the objects for the packed micropanels were not being freed).
- Updated/expanded a common comment across all three ukr test modules.
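
The allocation shortfall can be illustrated with a small sketch. The sizing formula below (offset of the last element, plus one) is an assumption chosen to match the behavior described above; it is not copied from bli_obj_create(), and min_elems() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of elements that covers every addressable entry of an
   m x n matrix with row stride rs and column stride cs: the offset of
   the last element, plus one. Assumed formula, for illustration only. */
size_t min_elems( size_t m, size_t n, size_t rs, size_t cs )
{
    assert( m > 0 && n > 0 );
    return ( m - 1 )*rs + ( n - 1 )*cs + 1;
}
```

Under this assumption, with k = 4, n = 6, and ldbp = 12, a k x n object with row stride ldbp covers 42 elements, while a k x ldbp object covers 48, that is, k*ldbp.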

commit c84391314d4f1b3f73d868f72105324e649f2a72
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 4 13:57:12 2019 -0600

Reverted minor temp/wspace changes from b426f9e.

Details:
- Added missing license header to bli_pwr9_asm_macros_12x6.h.
- Reverted temporary changes to various files in 'test' and 'testsuite'
directories.
- Moved testsuite/jobscripts into testsuite/old.
- Minor whitespace/comment changes across various files.

commit 4870260f6b8c06d2cc01b7147d7433ddee213f7f
Author: Jeff Hammond <jeff.r.hammondintel.com>
Date: Mon Nov 4 11:55:47 2019 -0800

blacklist GCC 5 and older for POWER9 (360)

commit b426f9e04e5499c6f9c752e49c33800bfaadda4c
Author: Nicholai Tukanov <nicholaiutexas.edu>
Date: Fri Nov 1 17:57:03 2019 -0500

POWER9 DGEMM (355)

Implemented and registered power9 dgemm ukernel.

Details:
- Implemented 12x6 dgemm microkernel for power9. This microkernel
assumes that elements of B have been duplicated/broadcast during the
packing step. The microkernel uses a column orientation for its
microtile vector registers and thus implements column storage and
general stride IO cases. (A row storage IO case via in-register
transposition may be added at a future date.) It should be noted that
we recommend using this microkernel with gcc and *not* xlc, as issues
with the latter cropped up during development, including but not
limited to slightly incompatible vector register mnemonics in the GNU
extended inline assembly clobber list.

commit 58102aeaa282dc79554ed045e1b17a6eda292e15
Merge: 52059506 b9bc222b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 28 17:58:31 2019 -0500

Merge branch 'amd'

commit 52059506b2d5fd4c3738165195abeb356a134bd4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 23 15:26:42 2019 -0500

Added "How to Download BLIS" section to README.md.

Details:
- Added a new section to the README.md, just prior to the "Getting
Started" section, titled "How to Download BLIS". This section details
the user's options for obtaining BLIS and lays out four common ways
of downloading the library. Thanks to Jeff Diamond for his feedback
on this topic.

commit e6f0a96cc59aef728470f6850947ba856148c38a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 17:05:39 2019 -0500

Updated README.md to ack Facebook as funder.

commit b9bc222bfc3db4f9ae5d7b3321346eed70c2c3fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 16:38:15 2019 -0500

Call bli_syrk_small() before error checking.

Details:
- In bli_syrk_front(), moved the conditional call to bli_syrk_check()
(if error checking is enabled) and the conditional scaling of C by
beta (if alpha is zero) so that they occur after, instead of before,
the call to bli_syrk_small(). This sequencing now matches that of
bli_gemm_small() in bli_gemm_front() and bli_trsm_small() in
bli_trsm_front().

commit f0959a81dbcf30d8a1076d0a6348a9835079d31a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 14 15:46:28 2019 -0500

When manual config is blacklisted, output error.

Details:
- Fixed and adjusted the logic in configure so that a more informative
error message is output when a user runs './configure ... <conf>' and
<conf> is present in the configuration blacklist. Previously, this
particular set of conditions would result in the message:

'user-specified configuration '' is NOT registered!

That is, the error message mis-identified the targeted configuration
as the empty string, and (more importantly) mis-identified the
problem. Thanks to Tze Meng Low for reporting this issue.
- Fixed a nearby error message somewhat unrelated to the issue above.
Specifically, the wrong string was being printed when the error
message was identifying an auto-detected configuration that did not
appear to be registered.

commit 6218ac95a525eefa8921baf8d0d7057dfacebe9c
Merge: 0016d541 a617301f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 11:53:51 2019 -0500

Merge branch 'master' into amd

commit 0016d541e6b0da617b1fae6612d2b314901b7a75
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 11:09:44 2019 -0500

Changed -march=znver2 to =znver1 for clang on zen2.

Details:
- In config/zen2/make_defs.mk, changed the -march= flag so that
-march=znver1 is used instead of -march=znver2 when CC_VENDOR is
clang. (The gcc branch attempts to differentiate between various
versions, but the equivalent version cutoffs for clang are not
yet known by us, so we have to use a single flag for all versions
of clang. Hopefully -march=znver1 is new enough. If not, we'll
fall back to -march=bdver4 -mno-fma4 -mno-tbm -mno-xop -mno-lwp.)
This issue was discovered thanks to AppVeyor.

commit e94a0530e5ac4c78a18f09105f40003be2b517f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:48:27 2019 -0500

Corrected zen NC that was non-multiple of NR.

Details:
- Updated an incorrectly set cache blocksize NC for single real within
config/zen/bli_cntx_init_zen.c that was not a multiple of the
corresponding value of NR. This issue, which was caught by Travis CI,
was introduced in 29b0e1e.
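
The constraint can be checked with a one-line test. The helpers below are hypothetical, and the values in the usage note are made up rather than the actual zen blocksizes.

```c
#include <assert.h>

/* Nonzero if the cache blocksize nc is a whole multiple of the register
   blocksize nr. */
int is_multiple( int nc, int nr )
{
    assert( nr > 0 && nc >= 0 );
    return nc % nr == 0;
}

/* Round nc down to the nearest multiple of nr. */
int round_down_to_multiple( int nc, int nr )
{
    assert( nr > 0 && nc >= 0 );
    return nc - ( nc % nr );
}
```

For example, with nr = 16, a candidate nc of 4086 fails the check and rounds down to 4080.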

commit a2ffac752076bf55eb8c1fe2c5da8d9104f1f85b
Merge: 1cfe8e25 29b0e1ef
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:31:18 2019 -0500

Merge branch 'amd-master' into amd

commit 29b0e1ef4e8b84ce76888d73c090009b361f1306
Merge: 1cfe8e25 fdce1a56
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 11 10:24:24 2019 -0500

Code review + tweaks to AMD's AOCL 2.0 PR (349).

Details:
- NOTE: This is a merge commit of 'master' of git://github.com/amd/blis
into 'amd-master' of flame/blis.
- Fixed a bug in the downstream value of BLIS_NUM_ARCHS, which was
inadvertently not incremented when the Zen2 subconfiguration was
added.
- In bli_gemm_front(), added a missing conditional constraint around the
call to bli_gemm_small() that ensures that the computation precision
of C matches the storage precision of C.
- In bli_syrk_front(), reorganized and relocated the notrans/trans logic
that existed around the call to bli_syrk_small() into bli_syrk_small()
to minimize the calling code footprint and also to bring that code
into stylistic harmony with similar code in bli_gemm_front() and
bli_trsm_front(). Also, replaced direct accessing of obj_t fields with
proper accessor static functions (e.g. 'a->dim[0]' becomes
'bli_obj_length( a )').
- Added ifdef BLIS_ENABLE_SMALL_MATRIX guard around prototypes for
bli_gemm_small(), bli_syrk_small(), and bli_trsm_small(). This is
strictly speaking unnecessary, but it serves as a useful visual cue to
those who may be reading the files.
- Removed cpp macro-protected small matrix debugging code from
bli_trsm_front.c.
- Added a GCC_OT_9_1_0 variable to build/config.mk.in to facilitate gcc
version check for availability of -march=znver2, and added appropriate
support to configure script.
- Cleanups to compiler flags common to recent AMD microarchitectures in
config/zen/amd_config.mk, including: removal of -march=znver1 et al.
from CKVECFLAGS (since the -march flag is added within make_defs.mk);
setting CRVECFLAGS similarly to CKVECFLAGS.
- Cleanups to config/zen/bli_cntx_init_zen.c.
- Cleanups, added comments to config/zen/make_defs.mk.
- Cleanups to config/zen2/make_defs.mk, including making use of newly-
added GCC_OT_9_1_0 and existing GCC_OT_6_1_0 to choose the correct
set of compiler flags based on the version of gcc being used.
- Reverted downstream changes to test/test_gemm.c.
- Various whitespace/comment changes.

commit a617301f9365ac720ff286514105d1b78951368b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 8 17:14:05 2019 -0500

Updates to docs/CodingConventions.md.

commit 171f10069199f0cd280f18aac184546bd877c4fe
Merge: 702486b1 05d58edf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 4 11:18:23 2019 -0500

Merge remote-tracking branch 'loveshack/emacs'

commit 702486b12560b5c696ba06de9a73fc0d5107ca44
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 16:35:41 2019 -0500

Removed stray FAQ section introduced in 1907000.

commit 1907000ad6ea396970c010f07ae42980b7b14fa0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 16:31:54 2019 -0500

Updated to FAQ (AMD-related questions).

Details:
- Added a couple of potential frequently-asked questions/answers related
to AMD's fork of BLIS.
- Updated existing answers to other questions.

commit 834f30a0dad808931c9d80bd5831b636ed0e1098
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 2 12:45:56 2019 -0500

Mention mixeddt paper in docs/MixedDatatypes.md.

commit 05d58edfe0ea9279971d74f17a5f7a69c4672ed5
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Wed Oct 2 10:33:44 2019 +0100

Note .dir-locals.el in docs

commit 531110c339f199a4d165d707c988d89ab4f5bfe8
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Wed Oct 2 10:16:22 2019 +0100

Modify Emacs config
Confine it to cc-mode and add comment-start/end.

commit 4bab365cab98202259c70feba6ec87408cba28d8
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Tue Oct 1 19:22:47 2019 +0000

Add .dir-locals.el for Emacs (348)

A minimal version that could probably do with extending, but at least
gets the indentation roughly right.

commit 4ec8dad66b3d37b0a2b47d19b7144bb62d332622
Author: Dave Love <dave.lovemanchester.ac.uk>
Date: Thu Sep 26 16:27:53 2019 +0100

Add .dir-locals.el for Emacs

A minimal version that could probably do with extending, but at least
gets the indentation roughly right.

commit bc16ec7d1e2a30ce4a751255b70c9cbe87409e4f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 23 15:37:33 2019 -0500

Set execute bits of shared library at install-time.

Details:
- Modified the 0644 octal code used during installation of shared
libraries to 0755 (for Linux/OSX only). Thanks to Adam J. Stewart
for reporting this issue via 343.
- CREDITS file update.
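
For reference, the difference between the two octal codes is exactly the three execute bits. A small sketch (has_exec_bits() is a hypothetical helper, not part of the build system):

```c
#include <assert.h>

/* Nonzero if any execute bit (user, group, or other) is set in a
   Unix permission mode. */
int has_exec_bits( unsigned int mode )
{
    assert( mode <= 07777u );
    return ( mode & 0111u ) != 0;
}
```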

commit c60db26aee9e7b4e5d0b031b0881e58d23666b53
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 18:04:17 2019 -0500

Fixed bad loop counter in bli_[cz]scal2bbs_mxn().

Details:
- Fixed a typo in the loop counter for the 'd' (duplication) dimension
in the complex macros of frame/include/level0/bb/bli_scal2bbs_mxn.h.
They shouldn't be used by anyone yet, but thankfully clang via
AppVeyor spit out warnings that alerted me to the issue.
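
A sketch of a scale-and-broadcast loop nest shows where such a typo can hide. The function scal2_bb() and the output layout are assumptions for illustration and do not reproduce the macros in bli_scal2bbs_mxn.h.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * x, where x is m x n (column-major, leading dimension ldx)
   and each result is written 'dup' times consecutively into y. The
   layout of y (row i occupies n*dup consecutive elements) is an
   assumption made for this sketch. */
void scal2_bb( size_t m, size_t n, size_t dup, double alpha,
               const double* x, size_t ldx, double* y )
{
    assert( dup >= 1 && x != NULL && y != NULL );

    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
            /* The innermost bound must be dup, not m or n. */
            for ( size_t d = 0; d < dup; ++d )
                y[ ( i*n + j )*dup + d ] = alpha * x[ i + j*ldx ];
}
```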

commit c766c81d628f0451d8255bf5e4b8be0a4ef91978
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 18:00:29 2019 -0500

Added missing schema arg to knl packm kernels.

Details:
- Added the pack_t schema argument to the knl packm kernel functions.
This change was intended for inclusion in 31c8657. (Thank you SDE +
Travis CI.)

commit 31c8657f1d6d8f6efd8a73fd1995e995fc56748b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 17:42:10 2019 -0500

Added support for pre-broadcast when packing B.

Details:
- Added support for being able to duplicate (broadcast) elements in
memory when packing matrix B (i.e., the right-hand operand) in level-3
operations. This turns out to be advantageous for some architectures that
can afford the cost of the extra bandwidth and somehow benefit from
the pre-broadcast elements (and thus being able to avoid using
broadcast-style load instructions on micro-rows of B in the gemm
microkernel).
- Support optionally disabling right-side hemm and symm. If this occurs,
hemm_r is implemented in terms of hemm_l (and symm_r in terms of
symm_l). This is needed when broadcasting during packing because the
alternative--supporting the broadcast of B while also allowing matrix
B to be Hermitian/symmetric--would be an absolute mess.
- Support alignment factors for packed blocks of A, B, and C separately
(as well as for general-purpose buffers). In addition, we support
byte offsets from those alignment values (which is different from
aligning by align+offset bytes to begin with). The default alignment
values are BLIS_PAGE_SIZE in all four cases, with the offset values
defaulting to zero.
- Pass pack_t schema into bli_?packm_cxk() so that it can be then passed
into the packm kernel, where it will be needed by packm kernels that
perform broadcasts of B, since the idea is that we *only* want to
broadcast when packing micropanels of B and not A.
- Added definition for variadic bli_cntx_set_l3_vir_ukrs(), which can be
used to set custom virtual level-3 microkernels in the cntx_t, which
would typically be done in the bli_cntx_init_*() function defined in
the subconfiguration of interest.
- Added a "broadcast B" kernel function for use with NP/NR = 12/6,
defined in ref_kernels/1m/bli_packm_cxk_bb_ref.c.
- Added gemm, gemmtrsm, and trsm "broadcast B" reference kernels
defined in ref_kernels/3/bb. (These kernels have been tested with
double real with NP/NR = 12/6.)
- Added ifndef ... endif guards around several macro constants defined
in frame/include/bli_kernel_macro_defs.h.
- Defined a few "broadcast B" static functions in
frame/include/level0/bb for use by "broadcast B"-style packm reference
kernels. For now, only the real domain kernels are tested and fully
defined.
- Output the alignment and offset values for packed blocks of A and B
in the testsuite's "BLIS configuration info" section.
- Comment updates to various files.
- Bumped so_version to 3.0.0.
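
The distinction between "align, then offset" and "align to align+offset" mentioned above can be sketched as follows. Both helper names are hypothetical, and the code only illustrates the arithmetic, not the BLIS memory allocator.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of align (assumed to be a power of
   two), then add offset bytes. */
uintptr_t align_then_offset( uintptr_t addr, uintptr_t align,
                             uintptr_t offset )
{
    assert( align != 0 && ( align & ( align - 1 ) ) == 0 );
    uintptr_t aligned = ( addr + align - 1 ) & ~( align - 1 );
    return aligned + offset;
}

/* The alternative the text contrasts this with: round addr up to a
   multiple of (align + offset). Division is used because the sum need
   not be a power of two. */
uintptr_t align_to_sum( uintptr_t addr, uintptr_t align, uintptr_t offset )
{
    uintptr_t a = align + offset;
    assert( a != 0 );
    return ( ( addr + a - 1 ) / a ) * a;
}
```

With addr = 5000, align = 4096, and offset = 64, the first helper yields 8256 (exactly 64 bytes past a 4096-byte boundary) while the second yields 8320.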

commit fd9bf497cd4ff73ccdfc030ba037b3cb2f1c2fad
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 15:45:24 2019 -0500

CREDITS file update.

commit 6c8f2d1486ce31ad3c2083e5c2035acfd4409a43
Author: ShmuelLevine <shmuel.levinegmail.com>
Date: Tue Sep 17 16:43:46 2019 -0400

Fix description for function bli_*pxby2v (340)

Fix typo in BLISTypedAPI.md for bli_?axpy2v() description.

commit b5679c1520f8ae7637b3cc2313133461f62398dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 17 14:00:37 2019 -0500

Inserted Multithreading links into BuildSystem.md.

Details:
- Inserted brief disclaimers about default disabled multithreading
and default single-threadedness to BuildSystem.md along with links to
the Multithreading.md document. Thanks to Jeff Diamond for suggesting
these additions.
- Trivial reword of sentence regarding automatically-detected
architectures.

commit f4f5170f8482c94132832eb3033bc8796da5420b
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Sep 11 07:34:48 2019 -0500

Update README.md (338)

commit 1cfe8e2562e5e50769468382626ce36b734741c1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 5 16:08:30 2019 -0500

Reimplemented bli_cpuid_query() for ARM.

Details:
- Rewrote bli_cpuid_query() for ARM architectures to use stdio-based
functions such as fopen() and fgets() instead of popen(). The new code
does more or less the same thing as before--searches /proc/cpuinfo for
various strings, which are then parsed in order to determine the
model, part number, and features. Thanks to Dave Love for suggesting
this change in issue 335.
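
A minimal sketch of the stdio-based approach, assuming "key : value" lines like those found in /proc/cpuinfo. The helper names and matching rules are hypothetical and are not the actual bli_cpuid_query() code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If 'line' begins with 'key', copy the text after the first colon
   (minus leading blanks and the trailing newline) into 'out' and
   return 1. Otherwise return 0. */
int parse_field( const char* line, const char* key,
                 char* out, size_t out_len )
{
    assert( line != NULL && key != NULL && out != NULL && out_len > 0 );

    if ( strncmp( line, key, strlen( key ) ) != 0 ) return 0;

    const char* colon = strchr( line, ':' );
    if ( colon == NULL ) return 0;

    const char* val = colon + 1;
    while ( *val == ' ' || *val == '\t' ) ++val;

    /* Copy up to the newline, truncating if 'out' is too small. */
    size_t n = strcspn( val, "\n" );
    if ( n >= out_len ) n = out_len - 1;
    memcpy( out, val, n );
    out[ n ] = '\0';
    return 1;
}

/* Scan an open stream line by line with fgets() until a line matches. */
int find_field( FILE* fp, const char* key, char* out, size_t out_len )
{
    char line[ 256 ];

    assert( fp != NULL );
    while ( fgets( line, sizeof( line ), fp ) != NULL )
        if ( parse_field( line, key, out, out_len ) ) return 1;
    return 0;
}
```

A caller would fopen() "/proc/cpuinfo", pass the stream to find_field() with a key such as "CPU part", and fclose() it afterward.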

commit 7c7819145740e96929466a248d6375d40e397e19
Author: Devin Matthews <damatthewssmu.edu>
Date: Fri Aug 30 16:52:09 2019 -0500

Always use sqsumv to compute normfv. (334)

* Always use sqsumv to compute normfv on MacOS.

* Unconditionally disable the "dot trick" in normfv.

* Added explanatory comment to normfv definition.

Details:
- Added a comment above the unconditional disabling of the dotv-based
implementation to normfv. Thanks to Roman Yurchak, Devin Matthews,
and Isuru Fernando for helping with this improvement.
- CREDITS file update.
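
The commit message does not spell out the motivation. One well-known difference between a plain dot product and a scaled sum of squares is their behavior when individual squares overflow. The sketch below shows the classic scaled-accumulation scheme (as in LAPACK's ?lassq) next to a naive sum; it is a generic illustration and not the BLIS sqsumv source.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate (scale, sumsq) such that scale*scale*sumsq equals the sum
   of squares of x[0..n-1]. The norm is then scale * sqrt( sumsq ),
   which the caller can compute. */
void sumsq_scaled( size_t n, const double* x,
                   double* scale_out, double* sumsq_out )
{
    double scale = 0.0, sumsq = 1.0;

    assert( scale_out != NULL && sumsq_out != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        double a = x[ i ] < 0.0 ? -x[ i ] : x[ i ];
        if ( a == 0.0 ) continue;

        if ( scale < a )
        {
            /* New largest magnitude: rescale the running sum. */
            double r = scale / a;
            sumsq = 1.0 + sumsq * r * r;
            scale = a;
        }
        else
        {
            double r = a / scale;
            sumsq += r * r;
        }
    }
    *scale_out = scale;
    *sumsq_out = sumsq;
}

/* Naive sum of squares, i.e. the dot product of x with itself. */
double sumsq_naive( size_t n, const double* x )
{
    double s = 0.0;
    for ( size_t i = 0; i < n; ++i ) s += x[ i ] * x[ i ];
    return s;
}
```

For x = ( 3e200, 4e200 ), the naive sum overflows to infinity, while the scaled form keeps scale at 4e200 and sumsq near 1.5625.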

commit 80e6c10b72d50863b4b64d79f784df7befedfcd1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 29 12:12:08 2019 -0500

Added reproduction section to Performance docs.

Details:
- Added section titled "Reproduction" to both Performance.md and
PerformanceSmall.md that briefly nudges the motivated reader in the
right direction if he/she wishes to run the same performance
benchmarks used to produce the graphs shown in those documents.
Thanks to Dave Love for making this suggestion.

commit 14cb426414856024b9ae0f84ac21efcc1d329467
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 28 17:04:33 2019 -0500

Updated OpenBLAS, Eigen sup results.

Details:
- Updated the results shown in docs/PerformanceSmall.md for OpenBLAS and
Eigen.

commit b02e0aae8ce2705e91023b98ed416cd05430a78e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 27 14:37:46 2019 -0500

Updated test drivers to iterate backwards.

Details:
- Updated test driver source in test, test/3, test/1m4m, and
test/mixeddt to iterate through the problem space backwards. This
can help avoid certain situations where the CPU frequency does not
immediately throttle up to its maximum. Thanks to Robert van de
Geijn for recommending this fix (originally made to test/sup drivers
in 57e422a).
- Applied off-by-one matlab output bugfix from b6017e5 to test drivers
in test, test/3, test/1m4m, and test/mixeddt directories.

commit b6017e53f4b26c99b14cdaa408351f11322b1e80
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 27 14:18:14 2019 -0500

Bugfix of output text + tweaks to test/sup driver.

Details:
- Fixed an off-by-one bug in the output of matlab row indices in
test/sup/test_gemm.c that only manifested when the problem size
increment was equal to 1.
- Disabled the building of rrc, rcr, rcc, crr, crc, and ccr storage
combinations for blissup drivers in test/sup. This helps make the
building of drivers complete sooner.
- Trivial changes to test/sup/runme.sh.

commit 138d403b6bb15e687a3fe26d3d967b8ccd1ed97b
Author: Devin Matthews <damatthewssmu.edu>
Date: Mon Aug 26 18:11:27 2019 -0500

Use -funsafe-math-optimizations and -ffp-contract=fast for all reference kernels when using gcc or clang. (331)

commit d5a05a15a7fcc38fb2519031dcc62de8ea4a530c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:54:31 2019 -0500

Cropped whitespace from new sup graphs.

Details:
- Previously forgot to crop whitespace from the new .png graphs
added/updated in docs/graphs/sup.

commit a6c80171a353db709e43f9e6e7a3da87ce4d17ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:51:31 2019 -0500

Fixed contents links in docs/PerformanceSmall.md.

Details:
- Corrected links in contents section of docs/PerformanceSmall.md,
which were erroneously directing readers to the corresponding
sections of docs/Performance.md.

commit 40781774df56a912144ef19cc191ed626a89f0de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 26 16:47:37 2019 -0500

Updated sup performance graphs with libxsmm.

Details:
- Added libxsmm to column-stored sup graphs presented in
docs/PerformanceSmall.md.
- Updated sup results for BLASFEO.
- Added sup results for Lonestar5 (Haswell).
- Addresses issue 326.

commit bfddf671328e7e372ac7228f72ff2d9d8e03ae18
Author: figual <figualucm.es>
Date: Mon Aug 26 12:01:33 2019 +0200

Fixed context registration for Cortex A53 (329).

commit 4a0a6e89c568246d14de4cc30e3ff35aac23d774
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 24 15:25:16 2019 -0500

Changed test/sup alpha to 1; test libxsmm+netlib.

Details:
- Changed the value of alpha to 1.0 in test/sup/test_gemm.c. This is
needed because libxsmm currently only optimizes gemm operations where
alpha is unit (and beta is unit or zero).
- Adjusted the test/sup/Makefile to test libxsmm with netlib BLAS as its
fallback library. This is the library that will be called when the
problem dimensions are deemed too large, or any other criteria for
optimization are not met. (This was done not because it is realistic,
but rather so that it would be very clear when libxsmm ceased handling
gemm calls internally when the data are graphed.)

commit 7aa52b57832176c5c13a48e30a282e09ecdabf73
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 23 16:12:50 2019 -0500

Use libxsmm API in test/sup; add missing -ldl.

Details:
- Switch the driver source in test/sup so that libxsmm_?gemm() is called
instead of ?gemm_() when compiling for / linking against libxsmm.
libxsmm's documentation isn't clear on whether it is even *trying* to
provide BLAS API compatibility, and I got tired of trying to figure it
out.
- Added missing -ldl in LDFLAGS when linking against libxsmm.

commit 57e422aa168bee7416965265c93fcd4934cd7041
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 23 14:17:52 2019 -0500

Added libxsmm support to test/sup drivers.

Details:
- Modified test/sup/Makefile to build drivers that test the performance
of skinny/small problems via libxsmm.
- Modified test/sup/runme.sh to run aforementioned drivers.
- Modified test/sup/test_gemm.c so that problem sizes are tested in
reverse order (from largest to smallest). This can help avoid certain
situations where the CPU frequency does not immediately throttle up
to its maximum. Thanks to Robert van de Geijn for recommending this
fix.

commit 661681fe33978acce370255815c76348f83632bc
Merge: 2f387e32 ef0a1a0f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 22 14:29:50 2019 -0500

Merge branch 'master' of github.com:flame/blis

commit 2f387e32ef5f9a17bafb5076dc9f66c38b52b32d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 22 14:27:30 2019 -0500

Added Eigen -march=native hack to perf docs.

Details:
- Spell out the hack given to me by Sameer Agarwal in order to get Eigen
to build with -march=native (which is critically important for Eigen)
in docs/Performance.md and docs/PerformanceSmall.md.

commit ef0a1a0faf683fe205f85308a54a77ffd68a9a6c
Author: Devin Matthews <damatthewssmu.edu>
Date: Wed Aug 21 17:40:24 2019 -0500

Update do_sde.sh (330)

* Update do_sde.sh

Automatically accept SDE license and download directly from Intel

* Update .travis.yml

[ci skip]

* Update .travis.yml

Enable SDE testing for PRs.

commit 0cd383d53a8c4a6871892a0395591ef5630d4ac0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 13:39:05 2019 -0500

Corrected variable type and comment update.

Details:
- Forgot to save all changes from bli_gemmtrsm4m1_ref.c before commit
in 8122f59. Fixed type mismatch and referenced github issue in
comment.

commit 8122f59745db780987da6aa1e851e9e76aa985e0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 13:22:12 2019 -0500

Pacify 'restrict' warning in gemmtrsm4m1 ref ukr.

Details:
- Previously, some versions of gcc would complain that the same
pointer, one_r, is being passed in for both alpha and beta in the
fourth call to the real gemm ukernel in bli_gemmtrsm4m1_ref.c. This
is understandable since the compiler knows that the real gemm ukernel
qualifies all of its floating-point arguments (including alpha and
beta) with restrict. A small hack has been inserted into the file
that defines a new variable to store the value 1.0, which is now used
in lieu of one_r for beta in the fourth call to the real gemm ukernel,
which should pacify the compiler now. Thanks to Dave Love for
reporting this issue (328) and to Devin Matthews for offering his
'restrict' expertise.
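
The workaround can be sketched with a toy kernel. toy_ukr() is a hypothetical stand-in with restrict-qualified scalar arguments; it does not have the real gemm ukernel signature.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a kernel whose pointer arguments are all
   restrict-qualified: c := beta * c + alpha * a * b. */
void toy_ukr( const double* restrict alpha,
              const double* restrict a,
              const double* restrict b,
              const double* restrict beta,
              double* restrict       c )
{
    assert( alpha != NULL && a != NULL && b != NULL &&
            beta != NULL && c != NULL );
    *c = ( *beta ) * ( *c ) + ( *alpha ) * ( *a ) * ( *b );
}

double call_with_unit_scalars( double a, double b, double c )
{
    const double one = 1.0;

    /* A second object holding 1.0, so that alpha and beta are distinct
       pointers and the compiler has nothing to complain about. */
    const double one_local = 1.0;

    toy_ukr( &one, &a, &b, &one_local, &c );
    return c;
}
```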

commit e8c6281f139bdfc9bd68c3b36e5e89059b0ead2e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 21 12:38:53 2019 -0500

Add -march support for specific gcc version ranges.

Details:
- Added logic to configure that checks the version of the compiler
against known version ranges that could cause problems later in the
build process. For example, versions of gcc older than 4.9.0 use
different -march labels than version 4.9.0 or later
('-march=corei7-avx' vs '-march=sandybridge', respectively).
Similarly, before 6.1, compilation on Zen was possible, but you
need to start with -march=bdver4 and then disable instruction sets
that were discarded during the transition from Excavator to Zen. So
now, configure substitutes 'yes'/'no' values into anchors in
config.mk.in, which sets various make variables (e.g. GCC_OT_4_9_0),
which can be accessed and branched upon by the various
configurations' make_defs.mk files when setting their compiler flags.
- Updated config/haswell/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/sandybridge/make_defs.mk to branch on GCC_OT_4_9_0.
- Updated config/zen/make_defs.mk to branch on GCC_OT_6_1_0.
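
The actual logic lives in the configure script and the make_defs.mk files. The C sketch below only illustrates the "older than" comparison and the resulting flag choice, using the two sandybridge labels quoted above. Both function names are hypothetical.

```c
#include <assert.h>

/* Nonzero if version maj.min.pat is older than rmaj.rmin.rpat. */
int version_older_than( int maj, int min, int pat,
                        int rmaj, int rmin, int rpat )
{
    assert( maj >= 0 && min >= 0 && pat >= 0 );
    if ( maj != rmaj ) return maj < rmaj;
    if ( min != rmin ) return min < rmin;
    return pat < rpat;
}

/* Pick the sandybridge -march label based on the gcc version, in the
   spirit of branching on GCC_OT_4_9_0. */
const char* sandybridge_march( int maj, int min, int pat )
{
    return version_older_than( maj, min, pat, 4, 9, 0 )
           ? "-march=corei7-avx"
           : "-march=sandybridge";
}
```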

commit e6ac4ebcb6e6a372820e7f509c0af3342966b84a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 20 13:49:47 2019 -0500

Added page size, source location to perf docs.

Details:
- Added the page size, as returned via 'getconf -a | grep PAGE_SIZE',
and the location of the performance drivers to docs/Performance.md
(test/3) and docs/PerformanceSmall.md (test/sup). Thanks to Dave
Love for suggesting these additions in 325.

commit fdce1a5648d69034fab39943100289323011c36f
Author: Meghana <Meghana.Vankadariamd.com>
Date: Wed Jul 24 15:04:41 2019 +0530

changed gcc version check condition from 'ifeq' to 'if greater or equal'

Change-Id: Ie4c461867829bcc113210791bbefb9517e52c226

commit c9486e0c4f82cd9f58f5ceb71c0df039e9970a20
Author: Meghana <Meghana.Vankadariamd.com>
Date: Wed Jul 24 09:45:17 2019 +0530

code to detect version of gcc and set flags accordingly for zen2

Change-Id: I29b0311d0000dee1a2533ee29941acf53f9e9f34

commit 54afe3dfe6828a1aff65baabbf14c98d92e50692
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 23 16:54:28 2019 -0500

Added "Education and Learning" ToC entry to README.

commit 9f53b1ce7ac702e84e71801fe96986f6aa16040e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 23 16:50:35 2019 -0500

Added "Education and Learning" section to README.

Details:
- Added a short section after the Intro of the README.md file titled
"Education and Learning" that directs interested readers to the
"LAFF-On Programming for High-Performance" massive open online course
(MOOC) hosted via edX.

commit deda4ca8a094ee18d7c7c45e040e8ef180f33a48
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jul 22 13:59:05 2019 -0500

Added test/1m4m driver directory.

Details:
- Added a new standalone test driver directory named '1m4m' that can
build and run performance experiments for BLIS 1m, 4m1a, assembly,
OpenBLAS, and the vendor library (MKL). This new driver directory
was used to regenerate performance results for the 1m paper.
- Added alternate (commented-out) cache blocksizes to
config/haswell/bli_cntx_init_haswell.c. These blocksizes tend to
work well on a 12-core Intel Xeon E5-2650 v3.

commit dcc0ce12fde4c6dca2b4764a1922a2ab19725867
Author: Meghana <Meghana.Vankadariamd.com>
Date: Mon Jul 22 17:12:01 2019 +0530

Added a global Makefile for AMD architectures in config/zen folder
This Makefile (amd_config.mk) has all the flags that are common to EPYC series

Change-Id: Ic02c60a8293ccdd37f0f292e631acd198e6895de

commit af17bca26a8bd3dcbee8ca81c18d7b25de09c483
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 19 14:46:23 2019 -0500

Updated haswell MC cache blocksizes.

Details:
- Updated the default MC cache blocksizes used by the haswell subconfig
for both row-preferential (the default) and column-preferential
microkernels.

commit b5e9bce4dde5bf014dd9771ae741048e1f6c7748
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 19 14:42:37 2019 -0500

Updated -march flags for sandybridge, haswell.

Details:
- Updated the '-march=corei7-avx' flag in the sandybridge subconfig
to '-march=sandybridge' and the '-march=core-avx2' flag in the
haswell subconfig to '-march=haswell'. The older flags were used
by older versions of gcc and should have been updated to the newer
forms a long time ago. (The older flags were clearly working, even
though they are no longer documented in the gcc man page.)

commit c22b9dba5859a9fc94c8431eccc9e4eb9be02be1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 16 13:14:47 2019 -0500

More updates to comments in testsuite modules.

Details:
- Updated most comments in testsuite modules that describe how the
correctness test is performed so that it is clear whether the vector
(normfv) or matrix (normfm) form of Frobenius norm is used.

commit c4cc6fa702f444a05963db01db51bc7d6669e979
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 16 13:00:35 2019 -0500

New cntx_t blksz "set" functions + misc tweaks.

Details:
- Defined two new static functions in bli_cntx.h:
bli_cntx_set_blksz_def_dt()
bli_cntx_set_blksz_max_dt()
which developers may find convenient when experimenting with different
values of cache blocksizes.
- Updated one- and two-socket multithreaded problem size range and
increment values in test/3/Makefile.
- Changed default to column storage in test/3/test_gemm.c.
- Fixed typo in comment in testsuite/src/test_subm.c.

commit b84cee29f42855dc1f263e42b83b1a46ac8def87
Merge: 1f80858a c7dd6e6c
Author: Meghana Vankadari <Meghana.Vankadariamd.com>
Date: Mon Jul 8 02:03:07 2019 -0400

Merge "Added compiler flags for vanilla clang" into amd-staging-rome2.0

commit 1f80858abf5ca220b2998fbe6f9b06c32d3864c3
Author: kdevraje <kiran.Devrajegowdaamd.com>
Date: Fri Jul 5 16:05:11 2019 +0530

This check-in resolves the dgemm performance issue (Jira ticket CPUPL 458): an 'else' was missed during integration, so the code always followed the else path to get the block sizes.

Change-Id: I0084b5856c2513ab1066c08c15b5086db6532717

commit c7dd6e6cd2f910cbefcdc1e04a5adeb919a23de0
Author: Meghana <meghana.vankadariamd.com>
Date: Thu Jul 4 09:32:51 2019 +0530

Added compiler flags for vanilla clang

Change-Id: I13c00b4c0d65bbda4c929848fd48b0ab611952ab

commit 2acd49b76457635625a01e31c2abc8902b23cf51
Author: Meghana <meghana.vankadariamd.com>
Date: Mon Jul 1 15:42:38 2019 +0530

fix for test failures using AOCC 2.0

Change-Id: If44eaccc64bbe96bbbe1d32279b1b5773aba08d1

commit ceee2f973ebe115beca55ca77f9e3ce36b14c28a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 24 17:47:40 2019 -0500

Fixed thrinfo_t printing bug for small problems.

Details:
- Fixed a bug in bli_l3_thrinfo_print_gemm_paths() and
bli_l3_thrinfo_print_trsm_paths(), defined in bli_l3_thrinfo.c,
whereby subnodes of the thrinfo_t tree are "dereferenced" near the
beginning of the functions, which may lead to segfaults in certain
situations where the thread tree was not fully formed because the
matrix problem was too small for the level of parallelism specified.
(That is, too small because some problems were assigned no work due
to the smallest units in the m and n dimensions being defined by the
register blocksizes mr and nr.) The fix requires several nested levels
of if statements, and this is one of those few instances where use of
goto statements results in (mostly) prettier code, especially in the
case of _gemm_paths(). And while it wasn't necessary, I ported this
goto usage to the loop body that prints the thrinfo_t work_id and
comm_id values for each thread. Thanks to Nicholai Tukanov for helping
to find this bug.
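
The goto-based pattern described above can be sketched as follows. This is a minimal illustration, not the code in bli_l3_thrinfo.c: node_t, its fields, and collect_ways() are hypothetical stand-ins for the thrinfo_t tree and the printing functions. Each level is checked before it is touched, and a missing subnode jumps straight to the exit instead of adding another level of nested if statements.

```c
#include <stddef.h>

/* Hypothetical stand-in for a thrinfo_t-like node; not the BLIS type. */
typedef struct node_s
{
    int            n_way;    /* ways of parallelism at this level */
    struct node_s* sub_node; /* NULL if the tree was not formed below here */
} node_t;

/* Records n_way for up to three levels of the tree, stopping at the first
   missing subnode. Returns the number of levels actually visited. */
int collect_ways( const node_t* root, int ways[3] )
{
    int           depth = 0;
    const node_t* jc    = root;
    const node_t* ic;
    const node_t* jr;

    /* Check each level before dereferencing it, rather than dereferencing
       every subnode near the top of the function. */
    if ( jc == NULL ) goto done;
    ways[ depth++ ] = jc->n_way;

    ic = jc->sub_node;
    if ( ic == NULL ) goto done;
    ways[ depth++ ] = ic->n_way;

    jr = ic->sub_node;
    if ( jr == NULL ) goto done;
    ways[ depth++ ] = jr->n_way;

done:
    return depth;
}
```

A caller can compare the returned depth against the expected number of levels to tell whether the tree was fully formed for the given problem size.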

commit cac127182dd88ed0394ad81e6b91b897198e168a
Merge: 565fa385 3a45ecb1
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Mon Jun 24 13:01:27 2019 +0530

Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis
with public repo commit id 565fa3853b381051ac92cff764625909d105644d.

Change-Id: I68b9824b110cf14df248217a24a6191b3df79d42

commit c152109e9a3b1cd74760e8a3215a676d25c18d2e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 19 13:23:24 2019 -0500

Updated BLASFEO results in PerformanceSmall.md.

Details:
- Updated the BLASFEO performance graphs shown in PerformanceSmall.md
using a new commit of BLASFEO (2c9f312); updated PerformanceSmall.md
accordingly.
- Updated test/sup/octave/plot_l3sup_perf.m so that the .m files
containing the mpnpkp results do not need to be preprocessed in order
to plot half the problem size range (ie: up to 400 instead of the
800 range of the other shape cases).
- Trivial updates to runme.m.

commit 4d19c98110691d33ecef09d7e1b97bd1ccf4c420
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jun 8 11:02:03 2019 -0500

Trivial change to MixedDatatypes.md link text.

commit 24965beabe83e19acf62008366097a7f198d4841
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jun 8 11:00:22 2019 -0500

Fixed typo in README.md's MixedDatatypes.md link.

commit 50dc5d95760f41c5117c46f754245edc642b2179
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 7 13:10:16 2019 -0500

Adjust -fopenmp-simd for icc's preferred syntax.

Details:
- Use -qopenmp-simd instead of -fopenmp-simd when compiling with Intel
icc. Recall that this option is used for SIMD auto-vectorization in
reference kernels only. Support for the -f option has been completely
deprecated and removed in newer versions of icc in favor of -q. Thanks
to Victor Eijkhout for reporting this issue and suggesting the fix.

commit ad937db9507786874c801b41a4992aef42d924a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 7 11:34:08 2019 -0500

Added missing include "bli_family_thunderx2.h".

Details:
- Added a cpp-conditional directive block to bli_arch_config.h that
includes "bli_family_thunderx2.h". The code has been missing since
adf5c17f. However, this never manifested as an error because the file
is virtually empty and not needed for thunderx2 (or most subconfigs).
Thanks to Jeff Diamond for helping to spot this.

commit ce671917b2bc24895289247feef46f6fdd5020e7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 6 14:17:21 2019 -0500

Fixed formatting/typo in docs/PerformanceSmall.md.

commit 86c33a4eb284e2cf3282a1809be377785cdb3703
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 5 11:43:55 2019 -0500

Tweaked language in README.md related to sup/AMD.

commit cbaa22e1ca368d36a8510f2b4ecd6f1523d1e1f3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 4 16:06:58 2019 -0500

Added BLASFEO results to docs/PerformanceSmall.md.

Details:
- Updated the graphs linked in PerformanceSmall.md with BLASFEO results,
and added documenting language accordingly.
- Updated scripts in test/sup/octave to plot BLASFEO data.
- Minor tweak to language re: how OpenBLAS was configured for
docs/Performance.md.

commit 763fa39c3088c0e2c0155675a3ca868a58bffb30
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 4 14:46:45 2019 -0500

Minor tweaks to test/sup.

Details:
- Changed starting problem and increment from 16 to 4.
- Added 'lll' (square problems) to list of problem size shapes to
compile and run with.
- Defined BLASFEO location and added BLASFEO-related definitions.

commit 5e1e696003c9151b1879b910a1957b7bdd7b0deb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 18:37:20 2019 -0500

CHANGELOG update (0.6.0)

0.6.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 18:37:19 2019 -0500

Version file update (0.6.0)

commit 0f1b3bf49eb593ca7bb08b68a7209f7cd550f912
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 18:35:19 2019 -0500

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.
- CREDITS file update.

commit 27da2e8400d900855da0d834b5417d7e83f21de1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 17:14:56 2019 -0500

Minor edits to docs/PerformanceSmall.md.

Details:
- Added performance analysis to "Comments" section of both Kaby Lake and
Epyc sections.
- Added emphasis to certain passages.

commit 09ba05c6f87efbaadf085497dc137845f16ee9c5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 3 16:53:19 2019 -0500

Added sup performance graphs/document to 'docs'.

Details:
- Added a new markdown document, docs/PerformanceSmall.md, which
publishes new performance graphs for Kaby Lake and Epyc showcasing
the new BLIS sup (small/skinny/unpacked) framework logic and kernels.
For now, only single-threaded dgemm performance is shown.
- Reorganized graphs in docs/graphs into docs/graphs/large, with new
graphs being placed in docs/graphs/sup.
- Updates to scripts in test/sup/octave, mostly to allow decent output
in both GNU octave and Matlab.
- Updated README.md to mention and refer to the new PerformanceSmall.md
document.

commit 6bf449cc6941734748034de0e9af22b75f1d6ba1
Merge: abd8a9fa a4e8801d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 31 17:42:40 2019 -0500

Merge branch 'amd'

commit a4e8801d08d81fa42ebea6a05a990de8dcedc803
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 31 17:30:51 2019 -0500

Increased MT sup threshold for double to 201.

Details:
- Fine-tuned the double-precision real MT threshold (which controls
whether the sup implementation kicks in for smaller m dimension values)
from 180 to 201 for haswell and 180 to 256 for zen.
- Updated octave scripts in test/sup/octave to include a seventh column
to display performance for m = n = k.

commit 3a45ecb15456249c30ccccd60e42152f355615c1
Merge: 3f867c96 b69fb0b7
Author: Kiran Devrajegowda <Kiran.Devrajegowdaamd.com>
Date: Fri May 31 06:47:02 2019 -0400

Merge "Added back BLIS_ENABLE_ZEN_BLOCK_SIZES macro to zen configuration, this is same as release 1.3. This was added before to improve DGEMM Multithreaded scalability on Naples for when number of threads is greater than 16. By mistake this got deleted in many changes done for 2.0 release, now we are adding this change back. Also, code cleanup in bli_gemm_front.c" into amd-staging-rome2.0

commit b69fb0b74a4756168de270fc9b18f7cf7aa57f17
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri May 31 15:14:22 2019 +0530

Added back the BLIS_ENABLE_ZEN_BLOCK_SIZES macro to the zen configuration; this is the same as in release 1.3. The macro was originally added to improve DGEMM multithreaded scalability on Naples when the number of threads is greater than 16. It was deleted by mistake among the many changes made for the 2.0 release, and this commit adds it back. Also, code cleanup in bli_gemm_front.c.

Change-Id: I9f5d8225254676a99c6f2b09a0825e545206d0fc

commit 3f867c96caea3bbbbeeff1995d90f6cf8c9895fb
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Fri May 31 12:22:44 2019 +0530

When running HPL with pure MPI without DGEMM threading (single-threaded BLIS), setting this macro to 1 gives the best performance.

Change-Id: I24fd0bf99216f315e49f1c74c44c3feaffd7078d

commit abd8a9fa7df4569aa2711964c19888b8e248901f (origin/pfhp)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 28 12:49:44 2019 -0500

Inadvertently hidden xerbla_() in blastest (313).

Details:
- Attempted a fix to issue 313, which reports that when building only
a shared library (ie: static library build is disabled), running the
BLAS test drivers can fail because those drivers provide their own
local version of xerbla_() as a clever (albeit still rather hackish)
way of checking the error codes that result from the individual tests.
This local xerbla_() function is never found at link-time because the
BLAS test drivers' Makefile imports BLIS compilation flags via the
get-user-cflags-for() function, which currently conveys the
-fvisibility=hidden flag, which hides symbols unless they are
explicitly annotated for export. The -fvisibility=hidden flag was
only ever intended for use when building BLIS (not for applications),
and so the attempted solution here is to omit the symbol export
flag(s) from get-user-cflags-for() by storing the symbol export
flag(s) to a new BUILD_SYMFLAGS variable instead of appending it
to the subconfigurations' CMISCFLAGS variable (which is returned by
every get-*-cflags-for() function). Thanks to M. Zhou for reporting
this issue and also to Isuru Fernando for suggesting the fix.
- Renamed BUILD_FLAGS to BUILD_CPPFLAGS to harmonize with the newly
created BUILD_SYMFLAGS.
- Fixed typo in entry for --export-shared flag in 'configure --help'
text.

commit 13806ba3b01ca0dd341f4720fb930f97e46710b0
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Mon May 27 16:24:43 2019 +0530

This check-in updates the copyright information, which is changed to (start year) - 2019

Change-Id: Ide3c8f7172210b8d3538d3c36e88634ab1ba9041

commit ee123f535872510f77100d3d55a43d4ca56047d5
Author: Meghana <meghana.vankadariamd.com>
Date: Mon May 27 15:36:44 2019 +0530

Defined small matrix thresholds for TRSM for various cases for NAPLES and ROME
Updated copyright information for kernels/zen/bli_trsm_small.c file
Removed separate kernels for zen2 architecture
Instead added threshold conditions in zen kernels both for ROME and NAPLES

Change-Id: Ifd715731741d649b6ad16b123a86dbd6665d97e5

commit 9d93a4caa21402d3a90aac45d7a1603736c9fd63
Author: prangana <pradeep.raoamd.com>
Date: Fri May 24 17:59:13 2019 +0530

update version 2.0

commit 755730608d923538273a90c48bfdf77571f86519
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 23 17:34:36 2019 -0500

Minor rewording of language around mt env. vars.

commit ba31abe73c97c16c78fffc59a215761b8d9fd1f6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 23 14:59:53 2019 -0500

Added BLIS threading info to Performance.md.

Details:
- Documented the BLIS environment variables that were set
(e.g. BLIS_JC_NT, BLIS_IC_NT, BLIS_JR_NT) for each machine and
threading configuration in order to achieve the parallelism reported
on in docs/Performance.md.
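
As a rough illustration of how such variables combine, the sketch below reads the three variables named above and multiplies them. It assumes that the total thread count is the product of the per-loop values and that an unset or invalid variable counts as 1; parse_ways() and total_ways_from_env() are hypothetical helpers, not BLIS functions.

```c
#include <stdlib.h>

/* Parses a per-loop thread count such as the value of BLIS_JC_NT.
   Assumption for this sketch: NULL, empty, non-numeric, or values
   below 1 all fall back to 1 (no parallelism in that loop). */
long parse_ways( const char* s )
{
    char* end;
    long  v;

    if ( s == NULL || *s == '\0' ) return 1;

    v = strtol( s, &end, 10 );
    if ( end == s || v < 1 ) return 1;

    return v;
}

/* Multiplies the ways of parallelism requested for the JC, IC, and JR
   loops, as read from the environment. */
long total_ways_from_env( void )
{
    long jc = parse_ways( getenv( "BLIS_JC_NT" ) );
    long ic = parse_ways( getenv( "BLIS_IC_NT" ) );
    long jr = parse_ways( getenv( "BLIS_JR_NT" ) );

    return jc * ic * jr;
}
```

For instance, under these assumptions BLIS_JC_NT=2 with BLIS_IC_NT=4 and BLIS_JR_NT unset would correspond to eight threads.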

commit cb788ffc89cac03b44803620412a5e83450ca949
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 23 13:00:53 2019 -0500

Increased MT sup threshold for double to 180.

Details:
- Increased the double-precision real MT threshold (which controls
whether the sup implementation kicks in for smaller m dimension values)
from 80 to 180, and this change was made for both haswell and zen
subconfigurations. This is less about the m dimension in particular
and more about facilitating a smoother performance transition when
m = n = k.

commit 057f5f3d211e7513f457ee6ca6c9555d00ad1e57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu May 23 12:51:17 2019 -0500

Minor build system housekeeping.

Details:
- Commented out redundant setting of LIBBLIS_LINK within all driver-
level Makefiles. This variable is already set within common.mk, and
so the only time it should be overridden is if the user wants to link
to a different copy of libblis.
- Very minor changes to build/gen-make-frags/gen-make-frag.sh.
- Whitespace and inconsequential quoting change to configure.
- Moved top-level 'windows' directory into a new 'attic' directory.

commit e05171118c377f356f89c4daf8a0d5ddc5a4e4f7
Author: Meghana <meghana.vankadariamd.com>
Date: Thu May 23 16:15:27 2019 +0530

Implemented TRSM for small matrices for cases where A is on the right

Added separate kernels for zen and zen2

Change-Id: I6318ddc250cf82516c1aa4732718a35eae0c9134

commit 02920f5c480c42706b487e37b5ecc96c3555b851
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu May 23 15:29:59 2019 +0530

make checkblis fails for the matrix dimension check at the beginning, hence reverting it

Change-Id: Ibd2ee8c2d4914598b72003fbfc5845be9c9c1e87

commit 84215022f29fb3bfedd254d041635308d177e6c0
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu May 23 11:08:41 2019 +0530

Adding threshold condition to dgemm small matrix kernels, defining the constants in zen2 configuration

Change-Id: I53a58b5d734925a6fcb8d8bea5a02ddb8971fcd5

commit a3554eb1dcc1b5b94d81c60761b2f01c3d827ffa
Merge: ea082f83 17b878b6
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu May 23 11:51:07 2019 +0530

Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis to configure zen2

Change-Id: I97e17bca9716b80b862925f97bb513c07b4b0cae

commit ea082f839071dd9ec555062dc3851c31d12f00e4
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu May 23 10:38:29 2019 +0530

adding empty zen2 directory with .gitignore file

Change-Id: Ifa37cf54b2578aa19ad335372b44bca17043fe4b

commit b80bd5bcb2be8551a9a21fafc8e6c8b6336c99b5
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue May 21 15:11:47 2019 +0530

config/zen/bli_cntx_init_zen.c: removed BLIS_ENABLE_ZEN_BLOCK_SIZES macro. We have different configurations for both zen and zen2
config/zen/bli_family_zen.h: deleted macro BLIS_ENABLE_ZEN_BLOCK_SIZES
config/zen/make_defs.mk: removed compiler flag -mno-avx256-split-unaligned-store
frame/base/bli_cpuid.c: ROME family is 17H, but its model numbers start at 0x30.
test/test_gemm.c - commented out #define FILE_IN_OUT (some compilation error when BLIS is configured as amd64)
Now we can use a single configuration, ./configure amd64, which will work for both ROME and Naples

Change-Id: I91b4fc35380f8a35b4f4c345da040c6b5910b4a2
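
The family/model distinction noted for frame/base/bli_cpuid.c can be sketched as below. This is an assumption-laden illustration: is_zen2_like() is a hypothetical name, the commit message gives no upper bound on the model range, and a real detection routine would typically also check CPU feature flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the CPUID family/model pair looks like ROME (zen2)
   under the rule stated in the commit message: family 0x17 with model
   numbers starting at 0x30. Lower models in the same family are treated
   as not zen2. */
bool is_zen2_like( uint32_t family, uint32_t model )
{
    if ( family != 0x17 ) return false;

    return model >= 0x30;
}
```

Because both processor generations share family 0x17, the model number is the only thing in this sketch that separates them.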

commit a042db011df9a1c3e7c7ac546541f4746b176ea5
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon May 20 14:17:32 2019 +0530

Modified make_defs.mk for zen2 so that it compiles with gcc versions older than 9.0

Change-Id: I8fcac30538ee39534c296932639053b47b9a2d43

commit a23f92594cf3d530e5794307fe97afc877d853b7
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon May 20 10:48:06 2019 +0530

config_registry: New AMD zen2 architecture configuration added.
frame/base/bli_arch.c: #ifdef BLIS_FAMILY_ZEN2 id = BLIS_ARCH_ZEN2; #endif added. zen2 is added in config_name[BLIS_NUM_ARCHS]
frame/base/bli_cpuid.c : #ifdef BLIS_CONFIG_ZEN2 if ( bli_cpuid_is_zen2( family, model, features ) ) return BLIS_ARCH_ZEN2; #endif, defined new function bool bli_cpuid_is_zen2(...).
frame/base/bli_cpuid.h : declared bli_cpuid_is_zen2(..).
frame/base/bli_gks.c : #ifdef BLIS_CONFIG_ZEN2 bli_gks_register_cntx(BLIS_ARCH_ZEN2, bli_cntx_init_zen2, bli_cntx_init_zen2_ref, bli_cntx_init_zen2_ind); #endif
frame/include/bli_arch_config.h : #ifdef BLIS_CONFIG_ZEN2 CNTX_INIT_PROTS(zen2) #endif #ifdef BLIS_FAMILY_ZEN2 #include "bli_family_zen2.h" #endif
frame/include/bli_type_defs.h : added BLIS_ARCH_ZEN2 in arch_t enum. BLIS_NUM_ARCHS 20

Change-Id: I2a2d9b7266673e78a4f8543b1bfb5425b0aa7866

commit 17b878b66d917d50b6fe23721d8579e826cb3e8c
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Wed May 22 14:02:53 2019 +0530

adding license same as in ut-austin-amd-branch

Change-Id: I6790768d2bf5d42369d304ef93e34701f95fbaff

commit df755848b8a271323e007c7a628c64af63deab00
Merge: ca4b33c0 c72ae27a
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Wed May 22 13:30:07 2019 +0530

Merge branch 'amd-staging-rome2.0' of ssh://git.amd.com:29418/cpulibraries/er/blis into rome2.0

Change-Id: Ie8aad1ab810f0f3c0b90ec67f9dd3dfb8dcc74cc

commit c72ae27adee4726679ee004d02c972582b5285b4
Author: Nisanth M P <nisanth.padinharepattamd.com>
Date: Mon Mar 19 12:49:26 2018 +0530

Re-enabling the small matrix gemm optimization for target zen

Change-Id: I13872784586984634d728cd99a00f71c3f904395

commit ab0818af80f7f683080873f3fa24734b65267df2
Author: sraut <Biplab.Rautamd.com>
Date: Wed Oct 3 15:30:33 2018 +0530

Review comments incorporated for small TRSM.

Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9

commit 32392cfc72af7f42da817a129748349fb1951346
Author: Jeff Hammond <jeff.r.hammondintel.com>
Date: Tue May 14 15:52:30 2019 -0400

add info about CXX in configure (311)

commit fa7e6b182b8365465ade178b0e4cd344ff6f6460
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 1 19:13:00 2019 -0500

Define _POSIX_C_SOURCE in bli_system.h.

Details:
- Added
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
to bli_system.h so that an application that uses BLIS (specifically,
an application that includes blis.h) does not need to remember to
define the macro itself (either on the command line or in the code
that includes blis.h) in order to activate things like the pthreads.
Thanks to Christos Psarras for reporting this issue and suggesting
this fix.
- Commented out #include <sys/time.h> in bli_system.h, since I don't
think this header is used/needed anymore.
- Comment update to function macro for bli_?normiv_unb_var1() in
frame/util/bli_util_unb_var1.c.
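
A small sketch of why the guarded definition matters: POSIX-only declarations such as clock_gettime() are hidden by some compilers in strict ISO modes unless the feature-test macro is defined before the first system header. The two helper functions here are hypothetical and exist only to exercise the macro; they are not part of BLIS.

```c
/* Must appear before the first system header is included. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <time.h>

/* Reports the POSIX level this translation unit was compiled against. */
long posix_level( void )
{
    return ( long )_POSIX_C_SOURCE;
}

/* clock_gettime() and CLOCK_MONOTONIC come from POSIX rather than ISO C,
   so their visibility can depend on the macro above when compiling in a
   strict mode such as -std=c99. */
double monotonic_seconds( void )
{
    struct timespec ts;
    int             rc = clock_gettime( CLOCK_MONOTONIC, &ts );

    assert( rc == 0 );
    ( void )rc; /* silence unused-variable warnings when NDEBUG is set */

    return ( double )ts.tv_sec + ( double )ts.tv_nsec * 1e-9;
}
```

The #ifndef guard keeps the header from clashing with an application that already defines the macro on the command line or in its own code.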

commit 3df84f1b5d5e1146bb01bfc466ac20c60a9cc859
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 27 21:27:32 2019 -0500

Minor bugfixes in sup dgemm implementation.

Details:
- Fixed an obscure bug in the bli_dgemmsup_rv_haswell_asm_5x8n() kernel
that only affected the beta == 0, column-storage output case. Thanks
to the BLAS test drivers for catching this bug.
- Previously, bli_gemmsup_ref_var1n() and _var2m() were returning if
k = 0, when the correct action would be to scale by beta (and then
return). Thanks to the BLAS test drivers for catching this bug.
- Changed the sup threshold behavior such that the sup implementation
only kicks in if a matrix dimension is strictly less than (rather than
less than or equal to) the threshold in question.
- Initialize all thresholds to zero (instead of 10) by default in
ref_kernels/bli_cntx_ref.c. This, combined with the above change to
threshold testing means that calls to BLIS or BLAS with one or more
matrix dimensions of zero will no longer trigger the sup
implementation.
- Added disabled debugging output to frame/3/bli_l3_sup.c (for future
use, perhaps).
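
The revised threshold test can be sketched as a simple predicate. The type sup_thresh_t and the function sup_accepts() are hypothetical names for illustration; the only behavior taken from the notes above is that a dimension must be strictly less than its threshold, and that the default thresholds are zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the three per-dimension thresholds. */
typedef struct
{
    size_t mt;
    size_t nt;
    size_t kt;
} sup_thresh_t;

/* Returns true if at least one dimension is strictly less than its
   threshold, meaning the sup path would accept the problem. With all
   thresholds at zero, no dimension (not even 0) can satisfy the test,
   so the conventional implementation is always used. */
bool sup_accepts( size_t m, size_t n, size_t k, sup_thresh_t t )
{
    return m < t.mt || n < t.nt || k < t.kt;
}
```

With the earlier "less than or equal to" comparison, a call with a zero dimension would have matched a zero threshold, which is the case this change rules out.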

commit ecbdd1c42dcebfecd729fe351e6bb0076aba7d81
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 27 19:38:11 2019 -0500

Ceased use of BLIS_ENABLE_SUP_MR/NR_EXT macros.

Details:
- Removed already limited use of the BLIS_ENABLE_SUP_MR_EXT and
BLIS_ENABLE_SUP_NR_EXT macros in bli_gemmsup_ref_var1n() and
bli_gemmsup_ref_var2m(). Their purpose was merely to avoid a long
conditional that would determine whether to allow the last iteration
to be merged with the second-to-last iteration. Functionally, the
macros were not needed, and they ended up causing problems when
building configuration families such as intel64 and x86_64.

commit aa8a6bec3036a41e1bff2034f8ef6766a704ec49
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 27 18:53:33 2019 -0500

Fixed typo in --disable-sup-handling macro guard.

Details:
- Fixed an incorrectly-named macro guard that is intended to allow
disabling of the sup framework via the configure option
--disable-sup-handling. In this case, the preprocessor macro,
BLIS_DISABLE_SUP_HANDLING, was still named by its name from an older
uncommitted version of the code (BLIS_DISABLE_SM_HANDLING).

commit b9c9f03502c78a63cfcc21654b06e9089e2a3822
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 27 18:44:50 2019 -0500

Implemented gemm on skinny/unpacked matrices.

Details:
- Implemented a new sub-framework within BLIS to support the management
of code and kernels that specifically target matrix problems for which
at least one dimension is deemed to be small, which can result in long
and skinny matrix operands that are ill-suited for the conventional
level-3 implementations in BLIS. The new framework tackles the problem
in two ways. First, the stripped-down algorithmic loops forgo the
packing that is famously performed in the classic code path. That is,
the computation is performed by a new family of kernels tailored
specifically for operating on the source matrices as-is (unpacked).
Second, these new kernels will typically (and in the case of haswell
and zen, do in fact) include separate assembly sub-kernels for
handling of edge cases, which helps smooth performance when performing
problems whose m and n dimensions are not naturally multiples of the
register blocksizes. In a reference to the sub-framework's purpose of
supporting skinny/unpacked level-3 operations, the "sup" operation
suffix (e.g. gemmsup) is typically used to denote a separate namespace
for related code and kernels. NOTE: Since the sup framework does not
perform any packing, it targets row- and column-stored matrices A, B,
and C. For now, if any matrix has non-unit strides in both dimensions,
the problem is computed by the conventional implementation.
- Implemented the default sup handler as a front-end to two variants.
bli_gemmsup_ref_var2() provides a block-panel variant (in which the
2nd loop around the microkernel iterates over n and the 1st loop
iterates over m), while bli_gemmsup_ref_var1() provides a panel-block
variant (2nd loop over m and 1st loop over n). However, these variants
are not used by default and provided for reference only. Instead, the
default sup handler calls _var2m() and _var1n(), which are similar
to _var2() and _var1(), respectively, except that they defer to the
sup kernel itself to iterate over the m and n dimension, respectively.
In other words, these variants rely not on microkernels, but on
so-called "millikernels" that iterate along m and k, or n and k.
The benefit of using millikernels is a reduction of function call
and related (local integer typecast) overhead as well as the ability
for the kernel to know which micropanel (A or B) will change during
the next iteration of the 1st loop, which allows it to focus its
prefetching on that micropanel. (In _var2m()'s millikernel, the upanel
of A changes while the same upanel of B is reused. In _var1n()'s, the
upanel of B changes while the upanel of A is reused.)
- Added a new configure option, --[en|dis]able-sup-handling, which is
enabled by default. However, the default thresholds at which the
default sup handler is activated are set to zero for each of the m, n,
and k dimensions, which effectively disables the implementation. (The
default sup handler only accepts the problem if at least one dimension
is smaller than or equal to its corresponding threshold. If all
dimensions are larger than their thresholds, the problem is rejected
by the sup front-end and control is passed back to the conventional
implementation, which proceeds normally.)
- Added support to the cntx_t structure to track new fields related to
the sup framework, most notably:
- sup thresholds: the thresholds at which the sup handler is called.
- sup handlers: the address of the function to call to implement
the level-3 skinny/unpacked matrix implementation.
- sup blocksizes: the register and cache blocksizes used by the sup
implementation (which may be the same or different from those used
by the conventional packm-based approach).
- sup kernels: the kernels that the handler will use in implementing
the sup functionality.
- sup kernel prefs: the IO preference of the sup kernels, which may
differ from the preferences of the conventional gemm microkernels'
IO preferences.
- Added a bool_t to the rntm_t structure that indicates whether sup
handling should be enabled/disabled. This allows per-call control
of whether the sup implementation is used, which is useful for test
drivers that wish to switch between the conventional and sup codes
without having to link to different copies of BLIS. The corresponding
accessor functions for this new bool_t are defined in bli_rntm.h.
- Implemented several row-preferential gemmsup kernels in a new
directory, kernels/haswell/3/sup. These kernels include two general
implementation types--'rd' and 'rv'--for the 6x8 base shape, with
two specialized millikernels that embed the 1st loop within the kernel
itself.
- Added ref_kernels/3/bli_gemmsup_ref.c, which provides reference
gemmsup microkernels. NOTE: These microkernels, unlike the current
crop of conventional (pack-based) microkernels, do not use constant
loop bounds. Additionally, their inner loop iterates over the k
dimension.
- Defined new typedef enums:
- stor3_t: captures the effective storage combination of the level-3
problem. Valid values are BLIS_RRR, BLIS_RRC, BLIS_RCR, etc. A
special value of BLIS_XXX is used to denote an arbitrary combination
which, in practice, means that at least one of the operands is
stored according to general stride.
- threshid_t: captures each of the three dimension thresholds.
- Changed bli_adjust_strides() in bli_obj.c so that bli_obj_create()
can be passed "-1, -1" as a lazy request for row storage. (Note that
"0, 0" is still accepted as a lazy request for column storage.)
- Added support for various instructions to bli_x86_asm_macros.h,
including imul, vhaddps/pd, and other instructions related to integer
vectors.
- Disabled the older small matrix handling code inserted by AMD in
bli_gemm_front.c, since the sup framework introduced in this commit
is intended to provide a more generalized solution.
- Added test/sup directory, which contains standalone performance test
drivers, a Makefile, a runme.sh script, and an 'octave' directory
containing scripts compatible with GNU Octave. (They also may work
with matlab, but if not, they are probably close to working.)
- Reinterpret the storage combination string (sc_str) in the various
level-3 testsuite modules (e.g. src/test_gemm.c) so that the order
of each matrix storage char is "cab" rather than "abc".
- Comment updates in level-3 BLAS API wrappers in frame/compat.
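
To make the "unpacked" idea concrete, here is a minimal reference sketch of a gemm that reads A, B, and C in place through their row and column strides, with the innermost loop over k. It is not the BLIS kernel or its reference code: gemm_unpacked_ref() and its signature are hypothetical, and it omits the register blocking, edge-case sub-kernels, and millikernel structure described above.

```c
#include <stddef.h>

/* Computes C := beta * C + alpha * A * B for an m x k matrix A, a k x n
   matrix B, and an m x n matrix C. No operand is packed: element (i,j)
   of a matrix X is read directly from x[ i*rs_x + j*cs_x ], so row
   storage (cs = 1) and column storage (rs = 1) both work as-is. */
void gemm_unpacked_ref( size_t m, size_t n, size_t k,
                        double alpha,
                        const double* a, size_t rs_a, size_t cs_a,
                        const double* b, size_t rs_b, size_t cs_b,
                        double beta,
                        double* c, size_t rs_c, size_t cs_c )
{
    for ( size_t i = 0; i < m; ++i )
    {
        for ( size_t j = 0; j < n; ++j )
        {
            double  ab  = 0.0;
            double* cij = &c[ i*rs_c + j*cs_c ];

            /* The inner loop iterates over the k dimension. */
            for ( size_t p = 0; p < k; ++p )
                ab += a[ i*rs_a + p*cs_a ] * b[ p*rs_b + j*cs_b ];

            /* When beta is zero, overwrite C instead of scaling it, so
               that stale values in C cannot propagate. */
            if ( beta == 0.0 ) *cij = alpha * ab;
            else               *cij = beta * ( *cij ) + alpha * ab;
        }
    }
}
```

Note that when k is zero the inner loop does nothing and the routine still scales C by beta, rather than returning early.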

commit 0d549ceda822833bec192bbf80633599620c15d9
Author: Isuru Fernando <isurufgmail.com>
Date: Sat Apr 27 22:56:02 2019 +0000

make unix friendly archives on appveyor (310)

commit ca4b33c001f9e959c43b95a9a23f9df5adec7adf
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Apr 24 15:02:39 2019 +0530

Added compiler option (-mno-avx256-split-unaligned-store) in the file config/zen/make_defs.mk to improve performance of intrinsic codes, this flag ensures compiler generates 256-bit stores for the equivalent intrinsics code.

Change-Id: I8f8cd81a3604869df18d38bc42097a04f178d324

commit 945928c650051c04d6900c7f4e9e29cd0e5b299f
Merge: 663f6629 74e513eb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 17 15:58:56 2019 -0500

Merge branch 'amd' of github.com:flame/blis into amd

commit 74e513eb6a6787a925d43cd1500277d54d86ab8f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 17 13:34:44 2019 -0500

Support row storage in Eigen gemm test/3 driver.

Details:
- Added preprocessor branches to test/3/test_gemm.c to explicitly
support row-stored matrices. Column-stored matrices are also still
supported (and are the default for now). (This is mainly residual work
leftover from initial integration of Eigen into the test drivers, so
if we ever want to test Eigen with row-stored matrices, the code will
be ready to use, even if it is not yet integrated into the Makefile
in test/3.)

commit b5d457fae9bd75c4ca67f7bc7214e527aa248127
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 16 12:50:01 2019 -0500

Applied forgotten variable rename from 89a70cc.

Details:
- Somehow the variable name change (root_file_name -> root_inputname)
in flatten-headers.py mentioned in the commit log entry for 89a70cc
didn't make it into the actual commit. This commit applies that
change.

commit 89a70cccf869333147eb2559cdfa5a23dc915824
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 11 18:33:08 2019 -0500

GNU-like handling of installation prefix et al.

Details:
- Changed the default installation prefix from $HOME/lib to /usr/local.
- Modified the way configure internally handles the prefix, libdir,
includedir, and sharedir (and also added an --exec-prefix option).
The defaults to these variables are set as follows:
prefix: /usr/local
exec_prefix: ${prefix}
libdir: ${exec_prefix}/lib
includedir: ${prefix}/include
sharedir: ${prefix}/share
The key change, aside from the addition of exec_prefix and its use to
define the default to libdir, is that the variables are substituted
into config.mk with quoting that delays evaluation, meaning the
substituted values may contain unevaluated references to other
variables (namely, ${prefix} and ${exec_prefix}). This more closely
follows GNU conventions, including those used by GNU autoconf, and
also allows make to override any one of the variables *after*
configure has already been run (e.g. during 'make install').
- Updates to build/config.mk.in pursuant to above changes.
- Updates to output of 'configure --help' pursuant to above changes.
- Updated docs/BuildSystem.md to reflect the new default installation
prefix, as well as mention EXECPREFIX and SHAREDIR.
- Changed the definitions of the UNINSTALL_OLD_* variables in the
top-level Makefile to use $(wildcard ...) instead of 'find'. This
was motivated by the new way of handling prefix and friends, which
leads to the 'find' command being run on /usr/local (by default),
which can take a while and almost never yields any benefit (since the
user will very rarely use the uninstall-old targets).
- Removed periods from the end of descriptive output statements (i.e.,
non-verbose output) since those statements often end with file or
directory paths, which get confusing to read when punctuated by a
period.
- Trivial change to 'make showconfig' output.
- Removed my name from 'configure --help'. (Many have contributed to it
over the years.)
- In configure script, changed the default state of threading_model
variable from 'no' to 'off' to match that of debug_type, where there
are similarly more than two valid states. ('no' is still accepted
if given via the --enable-debug= option, though it will be
standardized to 'off' prior to config.mk being written out.)
- Minor variable name change in flatten-headers.py that was intended for
32812ff.
- CREDITS file update.

commit 9d76688ad90014a11ddc0c2f27253d62806216b1
Author: kdevraje <Kiran.Devrajegowdaamd.com>
Date: Thu Apr 11 10:22:48 2019 +0530

Fix for single-rank crash with the HPL application. The offset into the C buffer was computed from row and column indices held in integer variables, so the intermediate result could overflow and add a negative value to the buffer pointer. When that negative value was large enough, the access fell outside the buffer and caused a segmentation fault. Although the crash came from the dgemm kernel, similar code was added to the sgemm kernel as well.

Change-Id: I171119b0ec0dfbd8e63f1fcd6609a94384aabd27
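
The commit message does not show the fix itself; a common remedy for this kind of overflow is to widen the indices before multiplying, as in the sketch below. The function c_offset() and the column-major layout with leading dimension ldc are assumptions made for illustration.

```c
#include <stdint.h>

/* Returns the element offset of (row, col) in a column-major buffer with
   leading dimension ldc. Each operand is widened to 64 bits before the
   multiplication, so the product cannot wrap in 32-bit arithmetic even
   when col * ldc exceeds INT32_MAX. */
int64_t c_offset( int row, int col, int ldc )
{
    return ( int64_t )row + ( int64_t )col * ( int64_t )ldc;
}
```

For example, with col = 50000 and ldc = 50000 the product is 2,500,000,000, which does not fit in a signed 32-bit integer but is represented exactly here.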

commit 32812ff5aba05d34c421fe1024a61f3e2d5e7052
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 9 12:20:19 2019 -0500

Minor bugfix to flatten-headers.py.

Details:
- Fixed a minor bug in flatten-headers.py whereby the script, upon
encountering an #include directive for the root header file, would
erroneously recurse and inline the contents of that root header.
The script has been modified to avoid recursion into any headers
that share the same name as the root-level header that was passed
into the script. (Note: this bug didn't actually manifest in BLIS,
so it's merely a precaution for usage of flatten-headers.py in other
contexts.)

commit bec90e0b6aeb3c9b19589c2b700fda2d66f6ccdf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 2 17:45:13 2019 -0500

Minor update to docs/HardwareSupport.md document.

Details:
- Added more details and clarifying language to implications of 1m and
the recycling of microkernels between microarchitectures.

commit 89cd650e7be01b59aefaa85885a3ea78970351e4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 2 17:23:55 2019 -0500

Use void_fp for function pointers instead of void*.

Change void*-typed function pointers to void_fp.
- Updated all instances of void* variables that store function pointers
to variables of a new type, void_fp. Originally, I wanted to define
the type of void_fp as "void (*void_fp)( void )"--that is, a pointer
to a function with no return value and no arguments. However, once
I did this, I realized that gcc complains with incompatible pointer
type (-Wincompatible-pointer-types) warnings every time any such
pointer is being assigned to its final, type-accurate function
pointer type. That is, gcc will silently typecast a void* to
another defined function pointer type (e.g. dscalv_ker_ft) during
an assignment from the former to the latter, but the same statement
will trigger a warning when typecasting from a void_fp type. I suspect
an explicit typecast is needed in order to avoid the warning, which
I'm not willing to insert at this time.
- Added a typedef to bli_type_defs.h defining void_fp as void*, along
with a commented-out version of the aborted definition described
above. (Note that POSIX requires that void* and function pointers
be interchangeable; it is the C standard that does not provide this
guarantee.)
- Comment updates to various _oapi.c files.
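
A minimal sketch of the aborted alternative described above, assuming a generic function pointer type and a made-up kernel signature. All names here (demo_generic_fp, demo_scal_ft, and so on) are hypothetical and do not come from BLIS; the point is only that converting between function pointer types needs the explicit casts that the commit chose not to insert.

```c
#include <stddef.h>

/* Hypothetical generic type: pointer to a function with no arguments
   and no return value. */
typedef void (*demo_generic_fp)(void);

/* Hypothetical type-accurate kernel signature. */
typedef void (*demo_scal_ft)(double alpha, double* x, size_t n);

/* A kernel matching demo_scal_ft: scales x in place by alpha. */
static void demo_scal(double alpha, double* x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] *= alpha;
}

/* Store the kernel generically; the conversion is an explicit cast. */
static demo_generic_fp demo_lookup(void)
{
    return (demo_generic_fp)demo_scal;
}

/* Recover the type-accurate pointer, again with an explicit cast.
   The pointer must be cast back to its original type before calling. */
static demo_scal_ft demo_get_scal(void)
{
    return (demo_scal_ft)demo_lookup();
}
```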

commit ffce3d632b284eb52474036096815ec38ca8dd5f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 2 14:40:50 2019 -0500

Renamed armv8a gemm kernel filename.

Details:
- Renamed
kernels/armv8a/3/bli_gemm_armv8a_opt_4x4.c
to
kernels/armv8a/3/bli_gemm_armv8a_asm_d6x8.c.
This follows the naming convention used by other kernel sets, most
notably haswell.

commit 77867478af02144544b4e7b6df5d54d874f3f93b
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Apr 2 13:33:11 2019 -0500

Use pthreads on MinGW and Cygwin (307)

commit 7bc75882f02ce3470a357950878492e87e688cec
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 28 17:40:50 2019 -0500

Updated Eigen results in docs/graphs with 3.3.90.

Details:
- Updated the level-3 performance graphs in docs/graphs with new Eigen
results, this time using a development version cloned from their git
mirror on March 27, 2019 (version 3.3.90). Performance is improved
over 3.3.7, though still noticeably short of BLIS/MKL in most cases.
- Very minor updates to docs/Performance.md and matlab scripts in
test/3/matlab.

commit 20ea7a1217d3833db89a96158c42da2d6e968ed8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 27 18:09:17 2019 -0500

Minor text updates (Eigen) to docs/Performance.md.

Details:
- Added/updated a few more details, mostly regarding Eigen.

commit bfb7e1bc6af468e4ff22f7e27151ea400dcd318a
Merge: 044df950 2c85e1dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 27 17:58:19 2019 -0500

Merge branch 'dev'

commit 2c85e1dd9d5d84da7228ea4ae6deec56a89b3a8f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 27 16:29:51 2019 -0500

Added Eigen results to performance graphs.

Details:
- Updated the Haswell, SkylakeX, and Epyc performance graphs in
docs/graphs to report on Eigen implementations, where applicable.
Specifically, Eigen implements all level-3 operations sequentially;
however, of those operations, it only provides multithreaded gemm.
Thus, mt results for symm/hemm, syrk/herk, trmm, and trsm are
omitted. Thanks to Sameer Agarwal for his help configuring and
using Eigen.
- Updated docs/Performance.md to note the new implementation tested.
- CREDITS file update.

commit bfac7e385f8061f2e6591de208b0acf852f04580
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 27 16:04:48 2019 -0500

Added ability to plot with Eigen in test/3/matlab.

Details:
- Updated matlab scripts in test/3/matlab to optionally plot/display
Eigen performance curves. Whether Eigen is plotted is determined by
a new boolean function parameter, with_eigen.
- Updated runme.m scratchpad to reflect the latest invocations of the
plot_panel_4x5() function (with Eigen plotting enabled).

commit 67535317b9411c90de7fa4cb5b0fdb8f61fdcd79
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 27 13:32:18 2019 -0500

Fixed mislabeled eigen output from test/3 drivers.

Details:
- Fixed the Makefile in test/3 so that it no longer incorrectly labels
the matlab output variables from Eigen-linked hemm, herk, trmm, and
trsm driver output as "vendor". (The gemm drivers were already
correctly outputting matlab variables containing the "eigen" label.)

commit 044df9506f823643c0cdd53e81ad3c27a9f9d4ff
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Mar 27 12:39:31 2019 -0500

Test with shared on windows (306)

Export macros can't support both shared and static at the same time.
When blis is built with both shared and static, headers assume that
shared is used at link time and dllimports the symbols with __imp_
prefix.

To use the headers with static libraries a user can give
-DBLIS_EXPORT= to import the symbol without the __imp_ prefix

commit 5e6b160c8a85e5e23bab0f64958a8acf4918a4ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 26 19:10:59 2019 -0500

Link to Eigen BLAS for non-gemm drivers in test/3.

Details:
- Adjusted test/3/Makefile so that the test drivers are linked against
Eigen's BLAS library for hemm, herk, trmm, and trsm. We have to do
this since Eigen's headers don't define implementations to the
standard BLAS APIs.
- Simplified included headers in hemm, herk, trmm, and trsm source
driver files, since nothing specific to Eigen is needed at
compile-time for those operations.

commit e593221383aae19dfdc3f30539de80ed05cfec7f
Merge: 92fb9c87 c208b9dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 26 15:51:45 2019 -0500

Merge branch 'master' into dev

commit 92fb9c87bf88b9f9c401eeecd9aa9c3521bc2adb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 26 15:43:23 2019 -0500

Add more support for Eigen to drivers in test/3.

Details:
- Use compile-time implementations of Eigen in test_gemm.c via new
EIGEN cpp macro, defined on command line. (Linking to Eigen's BLAS
library is not necessary.) However, as of Eigen 3.3.7, Eigen only
parallelizes the gemm operation and not hemm, herk, trmm, trsm, or
any other level-3 operation.
- Fixed a bug in trmm and trsm drivers whereby the wrong function
(bli_does_trans()) was being called to determine whether the object
for matrix A should be created for a left- or right-side case. This
was corrected by changing the function to bli_is_left(), as is done
in the hemm driver.
- Added support for running Eigen test drivers from runme.sh.

commit c208b9dc46852c877197d53b6dd913a046b6ebb6
Author: Isuru Fernando <isurufgmail.com>
Date: Mon Mar 25 13:03:44 2019 -0500

Fix clang version detection (305)

clang -dumpversion gives 4.2.1 for all clang versions as clang was
originally compatible with gcc 4.2.1

Apple clang version and clang version are two different things
and the real clang version cannot be deduced from apple clang version
programmatically. Rely on Wikipedia to map apple clang to clang version

Also fixes assembly detection with clang

clang 3.8 can't build knl as it doesn't recognize zmm0

commit 53842c7e7d530cb2d5609d6d124ae350fc345c32
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Mar 22 13:57:14 2019 +0530

Removed printing alpha and beta values

Change-Id: I49102db510311a30f6a936f9d843f35838f50d23

commit 6805db45e343d83d1adaf9157cf0b841653e9ede
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Mar 22 12:55:35 2019 +0530

Corrected setting of alpha and beta values (alpha = -1 and beta = 1): bli_setc(-1.0, 0, &alpha) should be used rather than bli_setc(0.0, -1.0, &alpha). This is corrected now.

Change-Id: Ic1102dfd6b50ccf212386a1211c6f31e8d987ef9
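
A minimal sketch of why the argument order matters, assuming a setter that takes the real part first and the imaginary part second. demo_setc is a hypothetical stand-in written with C99 complex numbers, not the BLIS function itself.

```c
#include <complex.h>

/* Hypothetical setter: first argument is the real part, second is the
   imaginary part. */
static void demo_setc(double real, double imag, double complex* out)
{
    *out = real + imag * I;
}
```

With this ordering, (-1.0, 0.0) yields the real scalar -1, whereas (0.0, -1.0) yields the purely imaginary value -i.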

commit feefcab4427a75b0b55af215486b85abcda314f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 21 18:11:20 2019 -0500

Allow disabling of BLAS prototypes at compile-time.

Details:
- Modified bli_blas.h so that:
- By default, if the BLAS layer is enabled at configure-time, BLAS
prototypes are also enabled within blis.h;
- But if the user defines BLIS_DISABLE_BLAS_DEFS prior to including
blis.h, BLAS prototypes are skipped over entirely so that, for
example, the application or some other header pulled in by the
application may prototype the BLAS functions without causing any
duplication.
- Updated docs/BuildSystem.md to document the feature above, and
related text.
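
A minimal sketch of the opt-out pattern described above. In real use the application would define BLIS_DISABLE_BLAS_DEFS and then include blis.h; so that the sketch compiles without BLIS installed, a hypothetical DEMO_ macro and an inline "header" section stand in for them. The demo_dot function and both of its signatures are likewise invented for illustration.

```c
/* Application side: opt out before the (emulated) header is seen. */
#define DEMO_DISABLE_BLAS_DEFS

/* ---- begin emulated header section ---- */
#ifndef DEMO_DISABLE_BLAS_DEFS
/* Header-provided prototype; skipped entirely when the macro is set. */
double demo_dot(const int* n, const double* x, const double* y);
#define DEMO_HEADER_PROTOTYPES 1
#else
#define DEMO_HEADER_PROTOTYPES 0
#endif
/* ---- end emulated header section ---- */

/* The application (or another header it pulls in) supplies its own
   prototype. It uses a different integer type, so it would clash with
   the header's prototype if that one had not been skipped. */
double demo_dot(const long* n, const double* x, const double* y);

double demo_dot(const long* n, const double* x, const double* y)
{
    double sum = 0.0;
    for (long i = 0; i < *n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```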

commit 20153cd4b594bc34f860c381ec18de3a6cc743c7
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Thu Mar 21 16:23:53 2019 +0530

Modified test_gemm.c file in test folder
A macro 'FILE_IN_OUT' is defined to read input parameters from a csv file.
Format for input file:
Each line defines a gemm problem with the following parameters: m k n cs_a cs_b cs_c
The operation implemented is always C = C - A*B, in column-major format.
When the macro is disabled, the driver reverts to the original implementation.
Usage: ./test_gemm_<mkl/blis/openblas>.x input.csv output.csv
GEMM is called through the BLAS interface.
For BLIS, the test application also prints either 'S', indicating the small gemm routine, or 'N', indicating conventional BLIS gemm.
For MKL/OpenBLAS, ignore this character.

Change-Id: I0924ef2c1f7bdea48d4cdb230b888e2af2c86a36

commit 288843b06d91e1b4fade337959aef773090bd1c9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 20 17:52:23 2019 -0500

Added Eigen support to test/3 Makefile, runme.sh.

Details:
- Added targets to test/3/Makefile that link against a BLAS library
built by Eigen. It appears, however, that Eigen's BLAS library does
not support multithreading. (It may be that multithreading is only
available when using the native C++ APIs.)
- Updated runme.sh with a few Eigen-related tweaks.
- Minor tweaks to docs/Performance.md.

commit 153e0be21d9ff413e370511b68d553dd02abada9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 17:53:18 2019 -0500

More minor tweaks to docs/Performance.md.

Details:
- Defined GFLOPS as billions of floating-point operations per second,
and reworded the sentence after about normalization.

commit 05c4e42642cc0c8dbfa94a6c21e975ac30c0517a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 17:07:20 2019 -0500

CHANGELOG update (0.5.2)

0.5.2

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 17:07:18 2019 -0500

Version file update (0.5.2)

commit 64560cd9248ebf4c02c4a1eeef958e1ca434e510
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 17:04:20 2019 -0500

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.

commit ab5ad557ea69479d487c9a3cb516f43fa1089863
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 16:50:41 2019 -0500

Very minor tweaks to Performance.md.

commit 03c4a25e1aa8a6c21abbb789baa599ac419c3641
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 16:47:15 2019 -0500

Minor fixes to docs/Performance.md.

Details:
- Fixed some incorrect labels associated with the pdf/png graphs,
apparently the result of copy-pasting.

commit fe6dd8b132f39ecb8893d54cd8e75d4bbf6dab83
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 16:30:23 2019 -0500

Fixed broken section links in docs/Performance.md.

Details:
- Fixed a few broken section links in the Contents section.

commit 913cf97653f5f9a40aa89a5b79e2b0a8882dd509
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 16:15:24 2019 -0500

Added docs/Performance.md and docs/graphs subdir.

Details:
- Added a new markdown document, docs/Performance.md, which reports
performance of a representative set of level-3 operations across a
variety of hardware architectures, comparing BLIS to OpenBLAS and a
vendor library (MKL on Intel/AMD, ARMPL on ARM). Performance graphs,
in pdf and png formats, reside in docs/graphs.
- Updated README.md to link to new Performance.md document.
- Minor updates to CREDITS, docs/Multithreading.md.
- Minor updates to matlab scripts in test/3/matlab.

commit 9945ef24fd758396b698b19bb4e23e53b9d95725
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 19 15:28:44 2019 -0500

Adjusted cache blocksizes for zen subconfig.

Details:
- Adjusted the zen sub-configuration's cache blocksizes for float,
scomplex, and dcomplex based on the existing values for double.
(The previous values were taken directly from the haswell subconfig,
which targets Intel Haswell/Broadwell/Skylake systems.)

commit d202d008d51251609d08d3c278bb6f4ca9caf8e4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 18 18:18:25 2019 -0500

Renamed --enable-export-all to --export-shared=[].

Details:
- Replaced the existing --enable-export-all / --disable-export-all
configure option with --export-shared=[public|all], with the 'public'
instance of the latter corresponding to --disable-export-all and the
'all' instance corresponding to --enable-export-all. Nothing else
semantically about the option, or its default, has changed.

commit ff78089870f714663026a7136e696603b5259560
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 18 13:22:55 2019 -0500

Updates to docs/Multithreading.md.

Details:
- Made extra explicit the fact that: (a) multithreading in BLIS is
disabled by default; and (b) even with multithreading enabled, the
user must specify multithreading at runtime in order to observe
parallelism. Thanks to M. Zhou for suggesting these clarifications
in 292.
- Also made explicit that only the environment variable and global
runtime API methods are available when using the BLAS API. If the
user wishes to use the local runtime API (specify multithreading on
a per-call basis), one of the native BLIS APIs must be used.

commit 3a929a3d0ba0353159a6d4cd188f01b7a390ccfc
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Mon Mar 18 10:51:41 2019 +0530

Fixed code merging: bli_gemm_small.c - missed conditional checks for L!=0 && K!=0. Now they are added. This fix is done to pass blastest

Change-Id: Idc9c9a04d2015a68a19553c437ecaf8f1584026c

commit 663f662932c3f182fefc3c77daa1bf8c3394bb8b
Merge: 938c05ef 6bfe3812
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 16 16:17:12 2019 -0500

Merge branch 'amd' of github.com:flame/blis into amd

commit 938c05ef8654e2fc013d39a57f51d91d40cc40fb
Merge: 4ed39c09 5a5f494e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 16 16:01:43 2019 -0500

Merge branch 'amd' of github.com:flame/blis into amd

commit 6bfe3812e29b86c95b828822e4e5473b48891167
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 15 13:57:49 2019 -0500

Use -fvisibility=[...] with clang on Linux/BSD/OSX.

Details:
- Modified common.mk to use the -fvisibility=[hidden|default] option
when compiling with clang on non-Windows platforms (Linux, BSD, OS X,
etc.). Thanks to Isuru Fernando for pointing out this option works
with clang on these OSes.

commit 809395649c5bbf48778ede4c03c1df705dd49566
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 13 18:21:35 2019 -0500

Annotated additional symbols for export.

Details:
- Added export annotations to additional function prototypes in order to
accommodate the testsuite.
- Disabled calling bli_amaxv_check() from within the testsuite's
test_amaxv.c.

commit e095926c643fd9c9c2220ebecd749caae0f71d42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 13 17:35:18 2019 -0500

Support shared lib export of only public symbols.

Details:
- Introduced a new configure option, --enable-export-all, which will
cause all shared library symbols to be exported by default, or,
alternatively, --disable-export-all, which will cause all symbols to
be hidden by default, with only those symbols that are annotated for
visibility, via BLIS_EXPORT_BLIS (and BLIS_EXPORT_BLAS for BLAS
symbols), to be exported. The default for this configure option is
--disable-export-all. Thanks to Isuru Fernando for consulting on
this commit.
- Removed BLIS_EXPORT_BLIS annotations from frame/1m/bli_l1m_unb_var1.h,
which was intended for 5a5f494.
- Relocated BLIS_EXPORT-related cpp logic from bli_config.h.in to
frame/include/bli_config_macro_defs.h.
- Provided appropriate logic within common.mk to implement variable
symbol visibility for gcc, clang, and icc (to the extent that each of
these compilers allow).
- Relocated --help text associated with debug option (-d) to configure
slightly further down in the list.
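
A minimal sketch of annotation-based symbol export, assuming a gcc- or clang-style visibility attribute. DEMO_EXPORT and both functions are hypothetical; the actual BLIS_EXPORT_BLIS definition may differ. When a shared library is compiled with -fvisibility=hidden, only the annotated symbol would remain visible to users of the library.

```c
/* Hypothetical export annotation. */
#if defined(_WIN32)
  #define DEMO_EXPORT __declspec(dllexport)
#elif defined(__GNUC__) || defined(__clang__)
  #define DEMO_EXPORT __attribute__((visibility("default")))
#else
  #define DEMO_EXPORT
#endif

/* Public entry point: annotated, so it is exported even when symbols
   are hidden by default. */
DEMO_EXPORT int demo_public_api(int x);

/* Internal helper: not annotated, so it takes the default visibility
   chosen on the compiler command line. */
int demo_internal_helper(int x);

int demo_internal_helper(int x)
{
    return x + 1;
}

DEMO_EXPORT int demo_public_api(int x)
{
    return 2 * demo_internal_helper(x);
}
```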

commit 5a5f494e428372c7c27ed1f14802e15a83221e87
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 12 18:45:09 2019 -0500

Removed export macros from all internal prototypes.

Details:
- After merging PR 303, at Isuru's request, I removed the use of
BLIS_EXPORT_BLIS from all function prototypes *except* those that we
potentially wish to be exported in shared/dynamic libraries. In other
words, I removed the use of BLIS_EXPORT_BLIS from all prototypes of
functions that can be considered private or for internal use only.
This is likely the last big modification along the path towards
implementing the functionality spelled out in issue 248. Thanks
again to Isuru Fernando for his initial efforts of sprinkling the
export macros throughout BLIS, which made removing them where
necessary relatively painless. Also, I'd like to thank Tony Kelman,
Nathaniel Smith, Ian Henriksen, Marat Dukhan, and Matthew Brett for
participating in the initial discussion in issue 37 that was later
summarized and restated in issue 248.
- CREDITS file update.

commit 3dc18920b6226026406f1d2a8b2c2b405a2649d5
Merge: b938c16b 766769ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 12 11:20:25 2019 -0500

Merge branch 'master' into dev

commit 766769eeb944bd28641a6f72c49a734da20da755
Author: Isuru Fernando <isurufgmail.com>
Date: Mon Mar 11 19:05:32 2019 -0500

Export functions without def file (303)

* Revert "restore bli_extern_defs exporting for now"

This reverts commit 09fb07c350b2acee17645e8e9e1b8d829c73dca8.

* Remove symbols not intended to be public

* No need of def file anymore

* Fix whitespace

* No need of configure option

* Remove export macro from definitions

* Remove blas export macro from definitions

commit 4ed39c0971c7917e2675cf5449f563b1f4751ccc
Merge: 540ec1b4 b938c16b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 8 11:56:58 2019 -0600

Merge branch 'amd' of github.com:flame/blis into amd

commit b938c16b0c9e839335ac2c14944b82890143d02f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 7 16:40:39 2019 -0600

Renamed test/3m4m to test/3.

Details:
- Renamed '3m4m' directory to '3', which captures the directory nicely
since it builds test drivers to test level-3 operations.
- These test drivers ceased to be used to test the 3m and 4m (or even
1m) induced methods long ago, hence the name change.

commit ab89a40582ec7acf802e59b0763bed099a02edd8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 7 16:26:12 2019 -0600

More minor updates and edits to test/3m4m.

Details:
- Further updates to matlab scripts, mostly for compatibility with
GNU Octave.
- More tweaks to runme.sh.
- Updates to runme.m that allow copy-paste into matlab interactive
session to generate graphs.

commit f0e70dfbf3fee4c4e382c2c4e87c25454cbc79a1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 7 01:04:05 2019 +0000

Very minor updates to test/3m4m for ul252.

Details:
- Very minor updates to the newly revamped test/3m4m drivers when used
on a Xeon Platinum (SkylakeX).

commit 7fe44748383071f1cbbc77d904f4ae5538e13065
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Wed Mar 6 16:23:31 2019 +0530

Disabled BLIS_ENABLE_ZEN_BLOCK_SIZES in bli_family_zen.h for ROME tuning

Change-Id: Iec47fcf51f4d4396afef1ce3958e58cf02c59a57

commit 9f1dbe572b1fd5e7dd30d5649bdf59259ad770d5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 5 17:47:55 2019 -0600

Overhauled test/3m4m Makefile and scripts.

Details:
- Rewrote much of Makefile to generate executables for single- and dual-
socket multithreading as well as single-threaded. Each of the three
can also use a different problem size range/increment, as is often
appropriate when doubling/halving the number of threads.
- Rewrote runme.sh script to flexibly execute as many threading
parameter scenarios as are given in the input parameter string
(currently set within the script itself). The string also encodes
the maximum problem size for each threading scenario, which is used
to identify the executable to run. Also improved the "progress" output
of the script to reduce redundant info and improve readability in
terminals that are not especially wide.
- Minor updates to test_*.c source files.
- Updated matlab scripts according to changes made to the Makefile,
test drivers, and runme.sh script, and renamed 'plot_all.m' to
'runme.m'.

commit f5ed95ecd7d5eb4a63e1333ad5cc6765fc8df9fe
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Tue Mar 5 15:01:57 2019 +0530

0.5.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 18 14:56:16 2018 -0600

Version file update (0.5.1)

commit 3ab231afc9f69d14493908c53c85a84c5fba58aa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 18 14:53:37 2018 -0600

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.

commit d1aa87164e1e82347d62aa98793963c5265ef7e7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 18 14:52:40 2018 -0600

README.md update (External packages section).

Details:
- Updated External packages section in anticipation of introducing BLIS
into Debian package universe. Thanks to M. Zhou for sponsoring BLIS in
Debian.

commit 7bf901e9265a1acd78e44c06f7178c8152c7e267
Author: sraut <Biplab.Rautamd.com>
Date: Tue Dec 18 14:39:16 2018 +0530

Fix on EPYC machine for multi-instance performance issue.
Issue: With the default values of mc, kc, and nc in multi-instance mode, performance across the cores dips drastically.
Fix: After experimentation, found a different set of values (mc, kc, and nc) that fits in the cache, so that performance remains the same across all the cores.

Change-Id: I98265e3b7e61cd7602a0cc5596240e86c08c03fe

commit d2b2a0819a2fccad9165bc48c0e172d79a87542c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 17 19:26:35 2018 -0600

Removed stray sections from Multithreading.md.

Details:
- Removed unintended section headers from before table of contents.

commit 93d56319f2953cf0e9df1ff2cda90b8e41351b2c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 17 19:17:30 2018 -0600

Added missing bli_init_once() in bli_thread API.

Details:
- Fixed an issue with specifying threading globally at runtime via
bli_thread_set_num_threads() (the automatic way) or via
bli_thread_set_ways() (the manual way), with bli_thread_init_rntm()
also affected. These functions were not calling bli_init_once() prior
to acting, and therefore their effects on the global rntm_t structure
were being wiped out by the eventual call to bli_init_once(), by some
other BLIS function. Thanks to Ali Emre Gülcü for reporting the
behavior associated with this bug.
- Added additional content to docs/Multithreading.md covering topics of
choosing between OpenMP and pthreads, and specifying affinity via
OpenMP.
- CREDITS file update.
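
A minimal sketch of the ordering bug described above, assuming a library whose one-time initialization resets global state to defaults. All names are hypothetical, and the plain flag used here is not thread-safe; a real library would rely on a proper once-primitive.

```c
/* Hypothetical global runtime state. */
static int demo_num_threads = 0;
static int demo_initialized = 0;

/* One-time initialization: sets the global state to its defaults. */
static void demo_init_once(void)
{
    if (demo_initialized) return;
    demo_num_threads = 1;   /* default: single-threaded */
    demo_initialized = 1;
}

/* Buggy setter: skips initialization, so the first later call to
   demo_init_once() overwrites the requested value. */
static void demo_set_num_threads_buggy(int n)
{
    demo_num_threads = n;
}

/* Fixed setter: initialize first, then apply the setting. */
static void demo_set_num_threads(int n)
{
    demo_init_once();
    demo_num_threads = n;
}

/* Any other entry point initializes before reading the state. */
static int demo_get_num_threads(void)
{
    demo_init_once();
    return demo_num_threads;
}

/* Helper for experiments: return to the "not yet initialized" state. */
static void demo_reset(void)
{
    demo_initialized = 0;
    demo_num_threads = 0;
}
```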

commit 76016691e2c514fcb59f940c092475eda968daa2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Dec 13 17:23:09 2018 -0600

Improvements to bli_pool; malloc()/free() tracing.

Details:
- Added malloc_ft and free_ft fields to pool_t, which are provided when
the pool is initialized, to allow bli_pool_alloc_block() and
bli_pool_free_block() to call bli_fmalloc_align()/bli_ffree_align()
with arbitrary align_size values (according to how the pool_t was
initialized).
- Added a block_ptrs_len argument to bli_pool_init(), which allows the
caller to specify an initial length for the block_ptrs array, which
previously suffered the cost of being reallocated, copied, and freed
each time a new block was added to the pool.
- Consolidated the "buf_sys" and "buf_align" pointer fields in pblk_t
into a single "buf" field. Consolidated the bli_pblk API accordingly
and also updated the bli_mem API implementation. This was done
because I'd previously already implemented opaque alignment via
bli_malloc_align(), which allocates extra space and stores the
original pointer returned by malloc() one element before the element
whose address is aligned.
- Tweaked bli_membrk_acquire_m() and bli_membrk_release() to call
bli_fmalloc_align() and bli_ffree_align(), which required adding an
align_size field to the membrk_t struct.
- Pass the pack schemas directly into bli_l3_cntl_create_if() rather
than transmit them via objects for A and B.
- Simplified bli_l3_cntl_free_if() and renamed to bli_l3_cntl_free().
The function had not been conditionally freeing control trees for
quite some time. Also, removed obj_t* parameters since they aren't
needed anymore (or never were).
- Spun-off OpenMP nesting code in bli_l3_thread_decorator() to a
separate function, bli_l3_thread_decorator_thread_check().
- Renamed:
bli_malloc_align() -> bli_fmalloc_align()
bli_free_align() -> bli_ffree_align()
bli_malloc_noalign() -> bli_fmalloc_noalign()
bli_free_noalign() -> bli_ffree_noalign()
The 'f' is for "function" since they each take a malloc_ft or free_ft
function pointer argument.
- Inserted various printf() calls for the purposes of tracing memory
allocation and freeing, guarded by cpp macro ENABLE_MEM_DEBUG, which,
for now, is intended to be a "hidden" feature rather than one hooked
up to a configure-time option.
- Defined bli_rntm_equals(), which compares two rntm_t for equality.
(There are no use cases for this function yet, but there may be soon.)
- Whitespace changes to function parameter lists in bli_pool.c, .h.
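
A minimal sketch of the "opaque alignment" technique mentioned above: over-allocate, round the address up, and stash the pointer returned by the allocator in the slot just before the aligned address. The demo_ names and signatures are hypothetical (not the BLIS functions), and the sketch assumes align is a power of two no smaller than sizeof(void*).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical function pointer types for the underlying allocator. */
typedef void* (*demo_malloc_ft)(size_t size);
typedef void  (*demo_free_ft)(void* p);

/* Allocate size bytes aligned to align, using the given allocator. */
static void* demo_fmalloc_align(demo_malloc_ft malloc_fp,
                                size_t size, size_t align)
{
    /* Extra room for the alignment padding and the saved pointer. */
    void* orig = malloc_fp(size + align + sizeof(void*));
    if (orig == NULL) return NULL;

    /* Leave space for one pointer, then round up to the alignment. */
    uintptr_t addr = (uintptr_t)orig + sizeof(void*);
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* Save the original pointer one element before the aligned one. */
    ((void**)addr)[-1] = orig;
    return (void*)addr;
}

/* Free a block obtained from demo_fmalloc_align(). */
static void demo_ffree_align(demo_free_ft free_fp, void* p)
{
    if (p == NULL) return;
    free_fp(((void**)p)[-1]);
}
```

Because the original pointer travels with the block, the caller only needs to keep the aligned address, which is consistent with merging two pointer fields into one.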

commit f808d829c58dc4194cc3ebc3825fbdde12cd3f93
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 12 15:22:59 2018 -0600

Handle edge cases, zero-filling in packm kernels.

Details:
- Updated the API and semantics of packm kernels such that they must now
handle edge cases, meaning that a c-by-k packm kernel must be able to
pack edge cases that are fewer than c rows/columns and be able to
zero-fill the remaining elements. They must also be able to zero-fill
the equivalent region when copying fewer than k columns/rows (which is
needed by trsm). The new packm kernel API is generally:

void packm_kernel
(
conj_t conja,
dim_t cdim,
dim_t n,
dim_t n_max,
ctype* restrict kappa,
ctype* restrict a, inc_t inca, inc_t lda,
ctype* restrict p, inc_t ldp,
cntx_t* restrict cntx
);

where cdim and n are the dimensions (short and long, respectively) of
the submatrix being copied from the source matrix A, and n_max is the
"full" long dimension (corresponding to the k dimension in gemm) of
the micropanel. The "full" short dimension (corresponding to the
register blocksize MR or NR) is not part of the API because it is
known intrinsically by the packm kernel implementation. Thanks to
Devin Matthews for prompting us to make this change (282).
- Updated all reference packm kernels in ref_kernels/1m according to
above changes, as well as all optimized packm kernels (which only
consisted of those for knl).
- Bumped the major soname version number in 'so_version' to 2. At first
I was considering leaving it unchanged, but I couldn't escape the
reality that the packm kernel API is much closer to an expert API
than it is some obscure helper function interface within the framework
that nobody would ever notice.
- Removed reference packm kernels for mr/nr = 30. The only sub-config
that would have been using those kernels is knc, which is likely no
longer being used by very many people (if any). (This also mostly
offset the larger object code footprint incurred by moving the edge-
case handling into the individual packm kernels.)
- Fixed an obscure race condition for 3mh and 4mh induced methods in
which those implementations were modifying the contexts stored in the
gks rather than a local copy.
- Fixed a minor bug in the testsuite that prevented non-1m-based induced
method implementations of trsm from executing.
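
A minimal sketch of the edge-case semantics described above, for a hypothetical 4-by-k packing routine on doubles. It is a simplification under stated assumptions: the conjugation and context arguments are omitted, size_t replaces the dim_t/inc_t types, and the name demo_packm_4xk is invented. Elements outside the cdim-by-n source region are zero-filled up to the full 4-by-n_max micropanel.

```c
#include <stddef.h>

/* "Full" short dimension, known intrinsically by the kernel. */
#define DEMO_MR ((size_t)4)

/* Pack a cdim-by-n submatrix of A, scaled by kappa, into the
   micropanel P, whose columns are ldp elements apart. */
static void demo_packm_4xk(size_t cdim, size_t n, size_t n_max,
                           const double* kappa,
                           const double* a, size_t inca, size_t lda,
                           double* p, size_t ldp)
{
    for (size_t j = 0; j < n_max; ++j) {
        for (size_t i = 0; i < DEMO_MR; ++i) {
            if (i < cdim && j < n)
                p[i + j * ldp] = (*kappa) * a[i * inca + j * lda];
            else
                p[i + j * ldp] = 0.0;   /* zero-fill the edge region */
        }
    }
}
```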

commit 02ec0be3ba0b0d6b4186386ae140906a96de919b
Merge: e275def3 c534da62
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 5 19:33:53 2018 -0600

Merge branch 'master' into amd

commit c534da62c0015f91391983da5376c9e091378010
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 5 15:51:05 2018 -0600

Disabled ARM configuration families in registry.

Details:
- Disabled (commented out) the arm32 and arm64 configuration families
in the config_registry file. Having a configuration family registered
only makes sense if BLIS is currently outfitted with runtime hardware
detection logic to choose the appropriate sub-configuration. That
logic is currently missing for ARM architectures, and thus having the
ARM configuration families in the configuration registry only serves
to confuse people. Thanks to Devangi Parikh for suggesting this
change.

commit 6885051a164628904fad0d8a3b39c82f9a7b193c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 5 14:45:39 2018 -0600

Generalizations/cleanup to mixeddt matlab scripts.

Details:
- Parameterized, reorganized, and added comments to matlab scripts in
test/mixeddt/matlab.
- Reordered some lines of code and added comments to plot_l3_perf.m in
test/3m4m/matlab.

commit cbdb0566bf3201a495bbdcb8cb50342fa0098649
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Dec 5 20:06:32 2018 +0000

Updates to 3m4m, mixeddt test driver files.

Details:
- Updated 3m4m and mixeddt Makefiles and runme.sh scripts, mostly to
port recent changes to the former to the latter.
- Disabled (for now) code in 3m4m/test_*.c files that disables all
induced methods except for the one that is requested from the
Makefile via the IND macro. This is done because usually, we want to
test whatever method is enabled automatically for complex datatypes.
(That is, when native complex microkernels are missing, we usually
want to test performance of 1m.)

commit 0645f239fbdf37ee9d2096ee3bb0e76b3302cfff
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 4 14:31:06 2018 -0600

Remove UT-Austin from copyright headers' clause 3.

Details:
- Removed explicit reference to The University of Texas at Austin in the
third clause of the license comment blocks of all relevant files and
replaced it with a more all-encompassing "copyright holder(s)".
- Removed duplicate words ("derived") from a few kernels' license
comment blocks.
- Homogenized license comment block in kernels/zen/3/bli_gemm_small.c
with format of all other comment blocks.

commit 9b688a2d69dd420f4d2582827c5ac87e422cd3bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 4 13:30:25 2018 -0600

Refer to color mm algorithm in Multithreading.md.

commit 22384fd2b749aa8cfdfad1084ce5e7dbd4ad2d64
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 4 13:09:04 2018 -0600

Minor updates to test_gemm.c in test/mixeddt.

commit 2ba3b1780cbca58e43a3948d67bd07e637036125
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 3 19:40:39 2018 -0600

Removed symbols from libblis-symbols.def.

Details:
- Removed bli_gemm_md_front() and bli_gemm_md_zgemm() symbols from
build/libblis-symbols.def, which will hopefully appease AppVeyor.

commit dcb38c4e59c3395c258799e69bfe2104c578c528
Merge: dc184095 375eb30b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 3 18:06:19 2018 -0600

Merge branch 'dev'

commit 375eb30b0a63ac06a363a5f75f283584258db48b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Dec 3 17:49:52 2018 -0600

Added mixed-precision support to 1m method.

Details:
- Lifted the constraint that 1m only be used when all operands' storage
datatypes (along with the computation datatype) are equal. Now, 1m may
be used as long as all operands are stored in the complex domain. This
change largely consisted of adding the ability to pack to 1e and 1r
formats from one precision to another. It also required adding logic
for handling complex values of alpha to bli_packm_blk_var1_md()
(similar to the logic in bli_packm_blk_var1()).
- Fixed a bug in several virtual microkernels (bli_gemm_md_c2r_ref.c,
bli_gemm1m_ref.c, and bli_gemmtrsm1m_ref.c) that resulted in the wrong
ukernel output preference field being read. Previously, the preference
for the native complex ukernel was being read instead of the pref for
the native real domain ukernel. This bug would not manifest if the
preference for the native complex ukernel happened to be equal to that
of the native real ukernel.
- Added support for testing mixed-precision 1m execution via the gemm
module of the testsuite.
- Tweaked/simplified bli_gemm_front() and bli_gemm_md.c so that pack
schemas are always read from the context, rather than trying to
sometimes embed them directly to the A and B objects. (They are still
embedded, but now uniformly only after reading the schemas from the
context.)
- Redefined cpp macro bli_l3_ind_recast_1m_params() as a static function
and renamed to bli_gemm_ind_recast_1m_params() (since gemm is the only
consumer).
- Added 1m optimization logic (via bli_gemm_ind_recast_1m_params()) to
bli_gemm_ker_var2_md().
- Added explicit handling for beta == 1 and beta == 0 in the reference
gemm1m virtual microkernel in ref_kernels/ind/bli_gemm1m_ref.c.
- Rewrote various level-0 macro defs, including axpyris, axpbyris,
scal2ris, and xpbyris (and their conjugating counterparts) to
explicitly support three operand types and updated invocations to
xpbyris in bli_gemmtrsm1m_ref.c.
- Query and use the storage datatype of the packed object instead of the
storage datatype of the source object in bli_packm_blk_var1().
- Relocated and renamed frame/ind/misc/bli_l3_ind_opt.h to
frame/3/gemm/ind/bli_gemm_ind_opt.h.
- Various whitespace/comment updates.

commit e275def30ac41cadce296560fa67282704f20a02
Merge: 8091998b dc184095
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 30 15:39:50 2018 -0600

Merge branch 'master' into amd

commit dc18409551f341125169fe8d4d43ac45e81bdf28
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 28 11:58:40 2018 -0600

CREDITS file update.

commit ee4d2712963816f84d7e3fdd39d93424e1aaf63d
Merge: e81c4b56 3d7e8bc3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 28 11:52:57 2018 -0600

Merge pull request 287 from SuperFluffy/fix_configuration_links

Fix configuration links

commit 3d7e8bc3b8e77693152138e75676f71573e5e6cd
Author: Richard Janis Goldschmidt <janis.beckertgmail.com>
Date: Wed Nov 28 15:56:37 2018 +0100

Fix configuration links

commit 6a4885f8be9ecd81423ebf2eb6da75d7981c979b
Merge: 1d8aae22 e81c4b56
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 27 13:22:59 2018 -0600

Merge branch 'master' into dev

commit e81c4b56660b25a39f8fdc09fbe07459c5bd8e8e
Merge: 757043ea cfbdb58d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 21 17:00:49 2018 -0600

Merge pull request 285 from isuruf/pthread

Move LDFLAGS to the end

commit cfbdb58de2e44f2e3a3d8b14fceece7aef4b3006
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 14:23:39 2018 -0600

Move LDFLAGS to the end

Otherwise the linker will drop flags like -lpthread

commit 757043eae8630c0a76e9bb04f2cb0bd72439a86a
Merge: e769bf46 7af8fa01
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 21 13:07:26 2018 -0600

Merge pull request 283 from isuruf/patch-3

Fix MinGW and Cygwin build failures

commit 7af8fa01373b7bb30fa3b1fd110fd201c87ea225
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 02:10:05 2018 -0600

Fix blis dll path

commit 2acd8dcd23805203a6821358c5e3e09d521fecdf
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 02:02:18 2018 -0600

Fix install path of dll.a

commit b7b0ad22b151e89e2a6c7782cf4d8d47b4e60734
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:54:44 2018 -0600

Test mingw

commit bafe521ed0012b7b8814404b78a6c576d8386370
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:54:36 2018 -0600

Fixes for mingw

commit be831879bd03edcddff8a345161f749ad92215af
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:39:32 2018 -0600

test gcc shared

commit f6b924648c79c4b1c3d3c7fbf85372680aff8362
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:39:19 2018 -0600

Don't use .def for gcc

commit ce6e4eae6d5e977e6f699acc9cf239be8ac53771
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:34:56 2018 -0600

test no threading

commit c9169b4685bfe81bc562cf9128b35a6a9884799b
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:17:36 2018 -0600

Add mingw64 path

commit 0f753090eaf4264b743a49ce15de97514bcbe112
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:14:52 2018 -0600

Fix PATH

commit d424470b1f2fa8717fa54c0245b21341504665f6
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 01:04:26 2018 -0600

Check openmp and pthreads threading

commit c73e7601e58239e2dedec6c9f1b752e949254a42
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:50:33 2018 -0600

Revert "enable rdp"

This reverts commit 368274bcbd0c9232521d14fa28304f35ced0e6d7.

commit 6209b2e6060b89e65f3405c31333af8952dd63c0
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:50:22 2018 -0600

Remove conda

commit 0b1b344447b8a2fcd635a48f0ce7ce89b2107dc4
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:42:39 2018 -0600

Fix make name

commit 7a9838983ba8dd32ac9f87712255721542ff561f
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:35:27 2018 -0600

Use m2w64-make

commit 4c1dedd6a90087807f16353a5d0bcaaade35a7a5
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:28:20 2018 -0600

No activate on gcc

commit 368274bcbd0c9232521d14fa28304f35ced0e6d7
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Nov 20 23:40:26 2018 -0600

enable rdp

commit 707a5e7f9b07f554e1e9289dd0ce3b7dc4fded6e
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Nov 20 23:39:31 2018 -0600

No conda for mingw build

commit 65b0565c0ad9162d4474bd84eabde491fa971538
Author: Isuru Fernando <isurufgmail.com>
Date: Tue Nov 20 23:19:38 2018 -0600

Check MinGW-w64

commit 9ddffba5847080e0d77d9e6059d05dc4b1d89ba5
Author: Isuru Fernando <isurufgmail.com>
Date: Wed Nov 21 00:23:34 2018 -0600

Fix MinGW build failure

Fixes https://github.com/flame/blis/issues/278

commit 1d8aae220bc52ce8e3a8afaa64b57e5d83480bdc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 20 18:42:07 2018 -0600

Track internal scalar datatypes.

Details:
- Added a num_t datatype bitfield to the obj_t in the form of a new
info2 field in the obj_t. This change was made primarily so that in
the case of mixed-datatype gemm, the alpha scalar would not need to
be cast to the storage datatype of B (or A) before then being cast to
the computation datatype just before the macrokernel is called. This
double-casting regime could result in loss of precision if the storage
precision of B (or A) is lower than the computation precision. In
practice, it was likely not going to be a big deal since most usage of
alpha is for -1.0, 0.0, and 1.0 (or integer multiples thereof), which
can all be represented exactly in single or double precision.
- The type of objbits_t was changed to uint32_t, so the new format
potentially takes up the same space as the previous obj_t definition,
assuming no padding inserted by the compiler. Shrinking info to 32
bits and spilling over into a second field was chosen over using the
high 32 bits of a single 64-bit objbits_t info field because many of
the bitwise operations are performed with enums such as num_t, dom_t,
and prec_t, which may take on the type of 32-bit ints. It's easier to
just keep all of those bitwise operations in 32 bits than perform a
million typecasts throughout bli_type_defs.h and bli_obj_macro_defs.h
to ensure that the integers are treated as 64-bit for the purposes of
the ANDs, ORs, and bitshifts.
- Many comment updates.
- Thanks to Devin Matthews and Devangi Parikh for their feedback and
involvement during this commit cycle.
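
The double-casting concern described above can be reproduced in a few lines. This is a sketch under stated assumptions: the helper names cast_direct and cast_via_float are hypothetical and do not appear in BLIS, and the example fixes the storage precision to float and the computation precision to double.

```c
/* Hypothetical helpers, not BLIS functions. */

/* Cast alpha directly to the (double) computation precision. */
static double cast_direct(double alpha)
{
    return alpha;
}

/* Cast alpha to a single-precision storage type first, then to the
   double computation precision, mimicking the double-casting regime. */
static double cast_via_float(double alpha)
{
    float stored = (float)alpha;   /* rounds to single precision */
    return (double)stored;         /* widening cannot restore lost bits */
}
```

For values such as -1.0, 0.0, and 1.0 the two paths agree exactly, which matches the commit's remark that the issue rarely matters in practice. For a value such as 0.1, which is not exactly representable in either precision, cast_via_float() returns a different double than cast_direct().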

commit e769bf46b0931d68031af212110484ec98e16908
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 20 16:16:53 2018 -0600

Tweak testsuite to issue FAIL for NaN, Inf (279).

Details:
- Adjusted the definition for libblis_test_get_string_for_result() in
testsuite/src/test_libblis.c so that the "FAIL" string is returned if
the computed residual contains either NaN or Inf. Previously, a
residual containing NaN would result in the selection of the "PASS"
string. Thanks to Devin Matthews for reporting this issue (279).
- Expounded on comment for the macro definitions of bli_isnan() and
bli_isinf() in bli_misc_macro_defs.h to make it more obvious why they
must remain macros.
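
Why a NaN residual could previously be reported as "PASS" can be seen in a small sketch. The functions below are hypothetical and are not the testsuite's libblis_test_get_string_for_result(); they assume a simple scheme in which a residual fails only when it exceeds a threshold. Because every ordered comparison with NaN is false, such a scheme lets NaN through unless it is checked explicitly.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical checks; names and the threshold scheme are illustrative. */

/* Naive check: the result fails only if resid > thresh. A NaN residual
   compares false against anything, so it is treated as a pass. */
static bool passes_naive(double resid, double thresh)
{
    return !(resid > thresh);
}

/* Guarded check: NaN and Inf residuals fail before any comparison. */
static bool passes_guarded(double resid, double thresh)
{
    if (isnan(resid) || isinf(resid)) return false;
    return !(resid > thresh);
}
```

Note that this sketch relies on IEEE 754 comparison semantics, so it should not be compiled with options such as -ffast-math that allow the compiler to assume NaN never occurs.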

commit 279deae18fb8b8106161863b46fcb38232314de4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 16 11:34:19 2018 -0600

Added 4x5 matlab plotting scripts to test/3m4m.

Details:
- Added a new directory, test/3m4m/matlab, containing matlab scripts for
plotting 4x5 panels of performance graphs (using the subplot()
function) for gemm, hemm, herk, trmm, and trsm across all four
floating-point datatypes. I expect to further refine these scripts as
time goes on, but their current state constitutes a good start.

commit 7b02c726650336c12286c8ba166d1d0fdf7601a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 14 13:49:55 2018 -0600

CREDITS file update.

commit 84dd298a27033945fa2d3b6e5dce1fe625cd2a0a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 14 13:47:45 2018 -0600

Patch to fix msys2/Windows build failure (277).

Details:
- Expanded cpp guard in frame/include/bli_x86_asm_macros.h to also check
__MINGW32__ in addition to _WIN32, __clang__, and __MIC__. Thanks to
Isuru Fernando for suggesting this fix, and also to Costas Yamin for
originally reporting the issue (277).

commit 8091998b6500e343c2024561c2b1aa73c3bafb0b
Merge: 333d8562 7b5ba731
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 14 12:36:35 2018 -0600

Merge branch 'master' into amd

commit 7b5ba7319b3901ad0e6c6b4fa3c1d96b579efbe9
Merge: ce719f81 52392932
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 14 12:32:01 2018 -0600

Merge branch 'dev' of github.com:flame/blis into dev

commit 52392932dc1ea3c16220cc4e6978efcb2f5f0616
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 13 22:23:38 2018 +0000

Minor fixes to test/3m4m drivers.

Details:
- Cleanups to Makefile to allow all test drivers to be built for
OpenBLAS and MKL in addition to BLIS.
- Fixed copy-paste typos in test_hemm in calls to ssymm_() and dsymm_().
- Fixed incorrect types for betap in BLAS cpp macro branch of
test_herk.c.

commit 4f12e36a0d0e6df146314b4e50e36c5e7a1af3d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 13 14:23:12 2018 -0600

Fixed number of columns in first output line.

Details:
- In previous commit, forgot to remove output column corresponding to
the k dimension.

commit a2e0cdd7debf8109198536d55af05d5631072fb2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 13 14:15:11 2018 -0600

Added hemm test driver to test/3m4m.

Details:
- Added a new test_hemm.c test driver to test/3m4m, which was modeled
after the driver by the similar name in test. Also updated Makefile
so that blis-nat-[sm]t would trigger builds for the new driver.

commit 0f9b53e84b48d8d73a56cc9889eae3595ca58a78
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 13 13:03:15 2018 -0600

Fixed a bug in high-level mixeddt conditional.

Details:
- Fixed a bug in frame/3/bli_l3_oapi.c in the conditional that divides
use of induced method (1m) execution from native execution. The former
was intended to only be used in cases where all storage datatypes are
complex and the datatype of C is equal to the computation datatype.
(If mixed datatypes are detected, native execution would be used.)
However, the code in bli_gemm() was erroneously checking the execution
datatype instead of the computation datatype, which at that point is
guaranteed to be equal to the storage datatype even if the computation
datatype contains a different value. Thanks to Devangi Parikh for
helping in isolating this bug.

commit 333d8562f04eea0676139a10cb80a97f107b45b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Nov 11 14:28:53 2018 -0600

Added debug output to bli_malloc.c.

Details:
- Added debug output to bli_malloc.c in order to debug certain kinds of
memory behavior in BLIS. The printf() statements are disabled and must
be enabled manually.
- Whitespace/comment updates in bli_membrk.c.

commit ce719f816d1237f5277527d7f61123e77180be54
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 10 14:48:43 2018 -0600

More edits to mixeddt matlab scripts.

Details:
- Renamed scripts in test/mixeddt/matlab:
plot_case_all.m -> plot_dom_all.m
plot_case_md.m -> plot_dom_case.m
plot_all_md.m -> plot_dt_all.m
- Added plot_dt_select.m in order to plot select graphs for the main
body of the mixeddt paper, and added additional related legend
handling in plot_gemm_perf.m.
- Added test/mixeddt/matlab/output and a .gitkeep file within in order
to force git to recognize the directory.

commit bf99e7c14baf45725b698d06ad043b531e3a2763
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 8 18:47:17 2018 -0600

Minor updates to test/mixeddt driver.

Details:
- Cleaned up test/mixeddt Makefile in preparation for gathering new
data for mixeddt paper, including renaming implementations to
"internal" and "ad-hoc" to match the terminology to be used in the
paper.
- Added new matlab scripts for generating 8 figures, each covering all
mixed-precision cases for each mixed-domain case.
- Updated the runme.sh script according to changes to Makefile.
- Fixed a minor bug in test_gemm.c that may have given incorrect
performance in complex, homogeneous storage datatype cases where
the computation precision was equal to the storage precisions.
(Examples: zzzd, cccs.)

commit 4bbb454bf3c361af9e97bfa394a73d610cd9002a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 3 19:11:01 2018 -0500

Testsuite docs update for mixed-datatype gemm.

Details:
- Updated docs/Testsuite.md to include mention of the new mixed-domain
and mixed-precision settings, including descriptions.
- Updated docs/MixedDatatypes.md to include a brief section on running
the testsuite to exercise mixed-datatype functionality, which mostly
amounts to a link to the Testsuite.md document.
- Minor verbiage change to testsuite output to correct a misleading
label associated with the value returned by the query function
bli_info_get_simd_num_registers(). (The function does not return the
number of SIMD registers present in the hardware, but rather a maximum
assumed value for the purposes of allocating temporary microtile
workspace on the function stack.)

commit 16401ae922b1285437cf5f6867b2764650a95fb0
Merge: f19c33af 2d403a15
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Nov 3 19:09:43 2018 -0500

Merge branch 'dev'

commit 2d403a1535380a2ebe2ae2c0f5ac54ba7564fbeb
Merge: e90e7f30 4a12979f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 1 20:18:53 2018 -0500

Merge pull request 275 from RhysU/patch-1

Spelling in FAQ

commit 4a12979f65697ed79ba290efd59f4b994ac9429b
Author: Rhys Ulerich <rhys.ulerichgmail.com>
Date: Thu Nov 1 20:20:59 2018 -0400

Spelling in FAQ

commit f19c33af4cbe6f5705b96fbf2b8799c3c2bd75c3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 26 17:07:15 2018 -0500

Disallow 64b BLAS integers + 32b BLIS integers.

Details:
- Print an error message from configure if the user attempts to
explicitly configure BLIS for simultaneous use of 64-bit integers in
the BLAS API with 32-bit integers in the BLIS API.
- Added cpp macro conditional to bli_type_defs.h to mandate that BLIS
integers be 64 bits if the BLAS integers are 64 bits. This and the
above item take care of issue 274. Thanks to Devin Matthews and
Jeff Hammond for suggesting these safeguards.
- Slight reorganization and relabeling (for clarity) of BLAS/CBLAS
sections and BLIS integer size line of the testsuite configuration
output.
- Very minor edits to docs/MixedDatatypes.md.
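
A preprocessor conditional of the kind described above might look like the following sketch. The macro and type names (EX_BLAS_INT_SIZE, EX_INT_SIZE, ex_int_t) are hypothetical stand-ins and are not claimed to match the names used in bli_type_defs.h. The sketch assumes the "mandate" is implemented by overriding the library integer size rather than by aborting compilation.

```c
#include <stdint.h>

/* Hypothetical configuration values, standing in for what a configure
   script might emit. Here the user requested 64-bit BLAS integers but
   32-bit library integers. */
#define EX_BLAS_INT_SIZE 64
#define EX_INT_SIZE      32

/* Mandate 64-bit library integers whenever the BLAS integers are 64-bit. */
#if EX_BLAS_INT_SIZE == 64
#undef  EX_INT_SIZE
#define EX_INT_SIZE 64
#endif

/* Select the integer type from the (possibly overridden) size. */
#if EX_INT_SIZE == 64
typedef int64_t ex_int_t;
#else
typedef int32_t ex_int_t;
#endif
```

With the values shown, ex_int_t ends up as a 64-bit type even though 32 bits were requested, so a 64-bit BLAS integer can never be narrowed when passed into the library.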

commit e90e7f309b3f2760a01e8e09a29bf702754fa2b5 (origin/win-pthreads)
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 25 14:09:43 2018 -0500

CHANGELOG update (0.5.0)

0.5.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 25 14:09:40 2018 -0500

Version file update (0.5.0)

commit 75da7f2a208ad7d26ed9c6d3e10d08b2a1caf9d6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 25 14:02:41 2018 -0500

ReleaseNotes.md update in advance of next version.

Details:
- Updated ReleaseNotes.md in preparation for next version.
- Updated docs/FAQ.md to reflect recent developments, and other edits.
- Minor updates to RELEASING.

commit 6fbc456fb3f4401ec951a618990f15a84fdfa236
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 25 13:20:25 2018 -0500

Added SALT testing to Travis CI.

Details:
- Modified .travis.yml to automatically employ the simulation of
application-level threading within the testsuite, with supporting
changes to common.mk, the top-level Makefile, and
travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
'.salt' suffix (similar to those with the '.fast' suffix) for
testing application-level threading.
- Updated docs/BuildSystem.md to document the new make targets
'testblis-salt' and 'checkblis-salt'.

commit 0e27963a6770e6b64f3299ad0613d5df45d8b6ae
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 24 12:16:19 2018 -0500

Add bli_pthread_mutex_trylock().

Details:
- Added the missing bli_pthread_mutex_trylock() function and prototype
to the non-Windows sections of bli_pthread.c and .h. This function
isn't needed by BLIS, but I figured why not make the Windows and
non-Windows sections consistent with one another.

commit 4b683740c12f83804a51ec610b16ce28607d5c85
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 24 11:56:16 2018 -0500

Defined bli_pthread_cond_*() and related defs.

Details:
- Added function definitions for bli_pthread_cond_*() as well as related
types and constants to bli_pthread.c, and corresponding prototypes to
bli_pthread.h.

commit 4b4f8072b9bb495b3e01d45698b0bad3dac31ba8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 24 11:31:46 2018 -0500

Define bli_pthreads barrier types on OS X.

Details:
- Fully define bli_pthreads barrier-related types on OS X. Only typedef
those types in terms of pthreads types on non-Windows, non-Apple OSes
(i.e. Linux).

commit ad98790dcef6bd9aab7f13d615b987b5daa58757
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 23 20:35:05 2018 -0500

Fix names of Windows pthread initializer macros.

Details:
- Renamed the PTHREAD_ initializer macros in the Windows cpp case to use
BLIS_ prefixes to match their non-Windows counterparts.

commit 06c23954e6b17219a50c3d37821544a46defaf89
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 23 19:16:54 2018 -0500

Defined unified bli_pthreads_*() API for all OSes.

Details:
- Expanded the bli_pthread_*() -> pthread_*() wrappers in
frame/thread/bli_pthread.c to include cases for Windows taken from
frame/base/bli_pthread_wrap.c. Now, bli_thread_*() is always defined
and always used by BLIS and the BLIS testsuite (in lieu of calling
pthreads directly, as before). The implementation used in this new
API depends on whether we are building for Windows, and to a lesser
extent, whether we are building on OS X. For the core API, Windows
uses Windows threads, non-Windows (Linux, OS X) uses pthreads.
OS X and Windows get barriers implemented in terms of other
bli_pthread_*() functions, and Linux gets barriers implemented in
terms of pthread_barrier*(). This commit addresses issue 273.
- Fixed a bug in the Linux definition of bli_pthread_mutex_unlock(),
which was erroneously calling pthread_mutex_lock().
- Minor changes to configure so that the auto-detection executable
can be built given the above changes (most notably, turning on
POSIX extensions via -D_GNU_SOURCE).
- Removed temporary play-test code for shiftd that accidentally got
committed into test/3m4m/test_gemm.c.

commit 0ae9585da1e3db1cf8034d4b16305a5883beb0d3
Author: pradeeptrgit <pradeep.raoamd.com>
Date: Tue Oct 23 09:36:23 2018 +0530

Update version number to 1.2

Change-Id: Ibb31f6683cdecca6b218bc2f0c14701d7e92ebf3

commit eac7d267a017d646a2c5b4fa565f4637ebfd9da7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 22 18:10:59 2018 -0500

Unconditionally define bli_l3_thread_entry().

Details:
- Define a dummy bli_l3_thread_entry() function when multithreading is
disabled altogether, or enabled via OpenMP. This function was
originally necessary when multithreading is enabled via pthreads.
By defining the function no matter the threading options given, it is
less likely that an AppVeyor Windows build will complain due to a
missing symbol in the DLL. (To be clear: AppVeyor was working fine
before, but a problem may have arisen if it were switched to an
OpenMP build.)
- Removed the prototype for bli_l3_thread_entry() from
bli_thrcomm_pthreads.c and placed it in bli_thrcomm.h.
- Regenerated the symbols list file build/libblis-symbols.def.

commit 4ee986f0a74207f4ca29df077929134725d62b80
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 22 14:09:44 2018 -0500

Added mixed-datatype testing to Travis CI (271).

Details:
- Modified .travis.yml to automatically test the mixed-datatype support
of the gemm operation, with supporting changes to common.mk, the
top-level Makefile, and travis/do_testsuite.sh.
- Added a new pair of input files to testsuite directory with the
'.mixed' suffix (similar to those with the '.fast' suffix) for testing
mixed-datatype gemm.
- Updated docs/BuildSystem.md to document the new make targets
'testblis-md' and 'checkblis-md'.

commit c3c6ebc9c6244053d654a9b0c955acb2fef42ee8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 21 18:48:54 2018 -0500

Fixed thrinfo_t printing for small problems.

Details:
- Fixed a bug in the code that prints out the communicator and work ids
from the various threads' thrinfo_t nodes. This bug manifested when
the dimension being parallelized was not large enough such that every
thread was assigned actual work (since the minimum amount of work is
determined by the register blocksize in the dimension being
parallelized). In those cases, the threads that receive no work in
that dimension do not finish building their thrinfo_t tree, leaving
lower-level nodes non-existent. (The bug itself was usually observed as
a segfault when the printing code attempted to dereference all the way
down the thrinfo_t tree.) The solution involves explicitly checking
each node as it is dereferenced, and if at any time NULL is found, all
subsequent communicator and work ids are set to -1.
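
The fix described above, checking each node as it is dereferenced and falling back to -1 once NULL is found, can be sketched as follows. The node_t type and the collect_ids() function are hypothetical simplifications; the real thrinfo_t has a different layout and accessor functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a thrinfo_t-like tree node. */
typedef struct node_s
{
    int            comm_id;  /* id within the communicator */
    int            work_id;  /* id of the assigned work */
    struct node_s *sub;      /* next level down, or NULL if never built */
} node_t;

/* Walk `depth` levels down from root, recording ids at each level. Once
   a NULL node is reached, all remaining levels are reported as -1
   instead of being dereferenced. */
static void collect_ids(const node_t *root, int depth,
                        int *comm_ids, int *work_ids)
{
    const node_t *cur = root;

    for (int i = 0; i < depth; ++i)
    {
        if (cur != NULL)
        {
            comm_ids[i] = cur->comm_id;
            work_ids[i] = cur->work_id;
            cur = cur->sub;
        }
        else
        {
            comm_ids[i] = -1;
            work_ids[i] = -1;
        }
    }
}
```

A thread whose tree stops after two levels but is printed with depth 4 would therefore report its two real id pairs followed by two pairs of -1, rather than crashing.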

commit 73a222c0d99dcc221be7dea10eaebf844f31f72e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Oct 20 14:13:04 2018 -0500

Minor edits to 'configure --help' text.

commit 14f3d5e6df183819a0c393b2661ad15df0786544
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 20:39:35 2018 -0500

Refresh libblis-symbols.def post-merge 090e4f0.

commit 090e4f08fc2f429a1b2db77b0a6f8276f892a7ac
Merge: c9be5889 0854e880
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 18:41:10 2018 -0500

Merge branch 'master' into dev

commit 0854e880b0848e0c2e3d0644c93c80b0fd13c0dc
Merge: 4e38a8d4 343a2715
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 18:05:00 2018 -0500

Merge pull request 261 from flame/win-pthreads

Implement missing pthreads function on Windows

commit c9be5889fbe947c64ef75740662e4d63032f4c35
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 17:42:40 2018 -0500

Added "Known issues" section to Multithreading.md.

Details:
- Added known issues section to Multithreading.md.
- Trivial changes to MixedDatatypes.md, Sandboxes.md.

commit 343a2715ebee28d250ee41b914abdcd1dc77c344
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 16:59:19 2018 -0500

Whitespace changes to configure, bli_pthread_wrap.

Details:
- Mostly whitespace changes (spaces to tabs) to configure and
bli_pthread_wrap.c and .h.

commit 3678a1cd518df9447b4b1ea86885eb2ba8abcf6e
Merge: 85397cd4 4e38a8d4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 16:11:31 2018 -0500

Merge branch 'master' into win-pthreads

commit 4e38a8d4eebb18ead74e644fac76a4fde8e7f6c6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 15:54:15 2018 -0500

Implemented python version checking in configure.

Details:
- Added python version checking to configure script. (Recall that python
is needed to execute the flatten-headers.py script.) Minimum versions
of python needed are currently as follows:
python2: 2.7 or later
python3: 3.5 or later
The standard search order for python interpreters is:
python python3 python2
The PYTHON environment variable is also supported and will be checked
before the standard search order list.
- Updated BuildSystem.md to include: a minimum make version; mention
that the C compiler must actually be a C99 compiler; and the caveat
that Windows builds do not require pthreads since BLIS can provide
an implementation of pthreads internally.

commit 85397cd4fa52f6c4c33f4fb715478c55533c680e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 13:12:43 2018 -0500

Added explanatory comment to bli_pthread.c.

Details:
- Added a verbose comment to bli_pthread.c that explains why a bli_
wrapper to pthreads APIs is useful.

commit 53c07035ef61cc9b8469636d4d8fa5085f37652d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 19 12:53:03 2018 -0500

Refresh libblis-symbols.def from bb6df28.

Details:
- Forgot to regenerate the symbols file after the previous commit
(bb6df281) in which shiftd operation was introduced.

commit 473ce54f5fbea4860ac0514e7e8b022c1ea03e63
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 18 19:03:56 2018 -0500

Added bli_pthread_*() API.

Details:
- Defined a bli_pthread_*() API so that the testsuite, when being linked
against a Windows DLL, will be able to access pthreads functionality
without those pthreads functions being explicitly exported by the DLL.
Instead, we export the bli_pthread_*() layer, which uses types and
functions that are identical to pthreads, but adds a 'bli_' prefix.
Only a few basic functions are present in the bli_pthreads_*() API
for now. Thanks to Devin Matthews and Isuru Fernando for their help
on a related PR (261) that this commit will hopefully facilitate.
- Updated testsuite so that it calls bli_pthread_*() layer instead of
pthread_*() functions directly.
- Regenerated build/libblis-symbols.def.
- Comment updated to build/regen-symbols.sh.
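
The idea of a prefixed wrapper layer over pthreads can be sketched as below. The ex_ names are hypothetical and cover only a mutex; they are not the bli_pthread_*() definitions, and the sketch shows only the non-Windows case in which each wrapper forwards to the pthreads function of the same name.

```c
#include <pthread.h>

/* Hypothetical wrapper layer: the same types and semantics as pthreads,
   but under a library-specific prefix, so that a shared library can
   export these symbols instead of the pthreads symbols themselves. */
typedef pthread_mutex_t ex_pthread_mutex_t;

#define EX_PTHREAD_MUTEX_INITIALIZER PTHREAD_MUTEX_INITIALIZER

int ex_pthread_mutex_lock(ex_pthread_mutex_t *mutex)
{
    return pthread_mutex_lock(mutex);
}

int ex_pthread_mutex_unlock(ex_pthread_mutex_t *mutex)
{
    /* Must forward to unlock, not lock. */
    return pthread_mutex_unlock(mutex);
}
```

Callers such as a test driver would then use only ex_pthread_mutex_lock() and ex_pthread_mutex_unlock(), leaving the library free to substitute a different implementation on platforms without pthreads.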

commit bb6df2814fcaa2fa62a549379f61be2f8667a598
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 18 17:11:39 2018 -0500

Defined a new level-1d operation: shiftd.

Details:
- Defined a new level-1d operation called 'shiftd', including object and
typed APIs. This operation adds a scalar value to every element along
an arbitrary diagonal of a matrix. Currently, shiftd is implemented in
terms of the addv kernel. (The scalar is passed in as the x vector
with an increment of zero.)
- Replaced ad-hoc usage of setd and addd (after creating a temporary
matrix object) with use of shiftd, which is much more concise, in
various test driver files in the testsuite. Similar changes were made
to the standalone test drivers and the example code.
- Added documentation entries in BLISObjectAPI.md and BLISTypedAPI.md
for bli_shiftd() and bli_?shiftd(), respectively.
- Added observed object properties to level-1d documentation in
BLISObjectAPI.md.
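
The effect of shiftd, adding a scalar to every element along an arbitrary diagonal, can be illustrated with a plain-C sketch. The function example_shiftd() is hypothetical and is not the signature of bli_shiftd() or bli_?shiftd(). It assumes a column-major matrix of doubles and a diagonal offset convention in which 0 is the main diagonal, positive values select superdiagonals, and negative values select subdiagonals. It also uses a direct loop rather than the addv kernel mentioned above.

```c
#include <stddef.h>

/* Hypothetical sketch: add alpha to each element on diagonal `diagoff`
   of the m x n column-major matrix a with leading dimension lda. */
static void example_shiftd(ptrdiff_t diagoff, size_t m, size_t n,
                           double alpha, double *a, size_t lda)
{
    /* First element of the diagonal: (0, diagoff) for superdiagonals,
       (-diagoff, 0) for subdiagonals. */
    size_t i = diagoff < 0 ? (size_t)(-diagoff) : 0;
    size_t j = diagoff > 0 ? (size_t)diagoff : 0;

    /* Step down and to the right until leaving the matrix. */
    for (; i < m && j < n; ++i, ++j)
        a[i + j * lda] += alpha;
}
```

For a 2 x 3 matrix, a call with diagoff = 0 touches elements (0,0) and (1,1), while diagoff = 1 touches (0,1) and (1,2).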

commit 53e0a0c9b38e8525c7224e280342ef56328af567
Merge: 1c7247b6 ec676799
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 18 14:54:59 2018 -0500

Merge branch 'master' into win-pthreads

commit ec67679990660a60362a49406595383672812287
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 18 14:27:02 2018 -0500

Refreshed Windows symbol list; added regen script.

Details:
- Moved windows/build/libblis-symbols.def to build/libblis-symbols.def.
Updated link commands in common.mk accordingly.
- Added a new script build/regen-symbols.sh that will regenerate the
libblis-symbols.def file in its new location after building a
haswell-targeted shared library. Thanks to Isuru Fernando for
providing the symbol generation command.
- Ran the new script to refresh the symbols file.

commit fdad54ab8eee4a7efd04ec4afb3e6902eb22e60a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 18 12:43:22 2018 -0500

Removed old symbol from libblis-symbols.def.

Details:
- Removed bli_gemm_ker_var1() from windows/build/libblis-symbols.def
since this function is no longer compiled.

commit 49d3f9fcbb4a75553439f97c099ea48d85763eea
Merge: 779d64dc 3c527256
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 17 18:00:40 2018 -0500

Merge branch 'master' into dev

commit 3c52725693d0d7726e1c8fb224f9b1ef786db8b9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 17 14:56:22 2018 -0500

Renamed/moved l3 zen ukernels to haswell kernel set.

Details:
- Renamed the microkernels in kernels/zen/3 to kernels/haswell/3 and
then updated the file contents to use the 'haswell' infix.
- Updated bli_cntx_init_zen.c and bli_cntx_init_haswell.c according to
above function renames.
- Moved/updated the corresponding prototypes in bli_kernels_zen.h to
bli_kernels_haswell.h.
- Updated config_registry according to above changes.
- NOTE: This rename reflects the fact that haswell microkernels are
specifically written to overcome the floating-point latency for FMA
instructions on Intel Haswell-like architectures, which can issue two
FMA instructions per cycle. These ukernels happen to work fine on AMD
Zen-based architectures. However, Zen only issues one FMA per cycle,
which, while halving its floating-point throughput, gives it extra
flexibility in the design of its microkernels--namely, mr and nr can
be smaller and still overcome the floating-point latency for those
single-issue cores. A smaller value of mr and nr allows for a larger
value of kc, which may be useful in some situations. In the future,
we may write such Zen-specific microkernels to take advantage of this
additional flexibility.

commit 71c5832d5f5596f25204980803423d08143a4010
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 17 14:11:01 2018 -0500

Consolidated slab/rr-explicit level-3 macrokernels.

Details:
- Consolidated the *sl.c and *rr.c level-3 macrokernels into a single
file per sl/rr pair, with those files named as they were before
c92762e. The consolidation does not take away the *option* of using
slab or round-robin assignment of micropanels to threads; it merely
*hides* the choice within the definitions of functions such as
bli_thread_range_jrir(), bli_packm_my_iter(), and bli_is_last_iter()
rather than expose that choice explicitly in the code. The choice of
slab or rr is not always hidden, however; there are some cases
involving herk and trmm, for example, that require some part of the
computation to use rr unconditionally. (The --thread-part-jrir option
controls the partitioning in all other cases.)
- Note: Originally, the sl and rr macrokernels were separated out for
clarity. However, aside from the additional binary code bloat, I later
deemed that clarity not worth the price of maintaining the additional
(mostly similar) codes.
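
The difference between slab and round-robin assignment can be sketched with two small helpers. These functions are hypothetical and much simpler than bli_thread_range_jrir() or bli_packm_my_iter(); in particular, they ignore register blocksizes and assume the unit of work is a single iteration index.

```c
#include <stdbool.h>
#include <stddef.h>

/* Slab: thread tid receives one contiguous range [*start, *end) of the
   n_iter iterations. Any remainder is spread over the lowest-numbered
   threads, one extra iteration each. */
static void slab_range(size_t tid, size_t n_threads, size_t n_iter,
                       size_t *start, size_t *end)
{
    size_t base = n_iter / n_threads;
    size_t rem  = n_iter % n_threads;

    *start = tid * base + (tid < rem ? tid : rem);
    *end   = *start + base + (tid < rem ? 1 : 0);
}

/* Round-robin: iteration i belongs to thread i mod n_threads, so each
   thread's iterations are interleaved with those of the others. */
static bool rr_owns(size_t i, size_t tid, size_t n_threads)
{
    return i % n_threads == tid;
}
```

A loop can hide the choice behind helpers like these: under slab it iterates from *start to *end, and under round-robin it iterates over all indices and skips those for which rr_owns() is false.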

commit 57eab3a4f0e43099fc2ff189df9fcc0d7801c2cd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 17 11:29:20 2018 -0500

CREDITS file update.

commit 6722ec21817cbab9d86ee63f00984eb407b5e627
Author: Ye Luo <xw111luoyegmail.com>
Date: Wed Oct 17 11:26:00 2018 -0500

Fix bgclang compilation on BGQ (270)

* Fix bgq kernels

* Support bgq with bgclang

commit 1c7247b6d146fc728d7c4240e4e069e33f8f8868
Merge: c1bc5530 6c5a1aaf
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 16 14:44:32 2018 -0500

Merge branch 'win-pthreads' of github.com:flame/blis into win-pthreads

commit c1bc5530d51bf55b4aa3c35165f6d4452a0fd779
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 16 14:44:10 2018 -0500

Don't call pthread_once in auto-detect.

commit b9c61d03f542a2e92551ff0595415bec3076ab25
Merge: 5a1e461f 3612ecac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 16 14:39:57 2018 -0500

Merge branch 'nested-omp-patch'

commit 5a1e461ffe09ed200ee2fc7aafccf6dd7e8c0080
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 16 14:21:45 2018 -0500

Execute flatten-headers.py via $(PYTHON).

Details:
- Execute build/flatten-headers.py python script via $(PYTHON) in
common.mk. This allows distributions that define the current/preferred
python interpreter in the PYTHON environment variable to use that
interpreter when executing flatten-headers.py. Thanks to Isuru
Fernando for this suggestion, and to Dave Love for submitting the
initial issue/request.

commit 6c5a1aaff540b19672e91501e894ed695aee322b
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 16 10:15:59 2018 -0500

Fix type in bli_pthread_wrap.c

commit 29e6245816760b1bd4ac738d7d3e11a9d9d13473
Merge: 0b73209f ed657714
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 16 10:12:25 2018 -0500

Merge branch 'master' into win-pthreads

commit 0b73209f6b22cc024169146d343627f6999b63d8
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 16 10:02:06 2018 -0500

Add missing argument to WaitForSingleObject and use $is_win in configure
to turn off pthreads.

commit ed65771482a705f7ed028d822489766327b44e76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 15 17:54:45 2018 -0500

Fixed merge fail on testsuite threading macros.

Details:
- Applied the following C preprocessor macro renames

BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

in src/test_libblis.c. This is apparently the result of a failure by
git to properly merge the 'master' and 'amd' branches in the previous
commit. (The 'master' branch contained a commit, 53a9ab1, in which
these same cpp macros were renamed throughout the source distribution.)

commit dc5fd898af8c74c2e2a75fc647157da0d04dd922
Merge: 667d3929 637c2ce7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 15 17:41:35 2018 -0500

Merge branch 'amd'

commit 779d64dc3091dea6b7530283304e52878151d218
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 15 17:13:18 2018 -0500

Added entry for xpbym to input.operations.fast.

Details:
- Forgot to add an entry for the new xpbym operation to
input.operations.fast in previous commit.

commit 5fec95b99f61761963834f62a9867f797687813c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 15 16:37:39 2018 -0500

Implemented mixed-datatype support for gemm.

Details:
- Implemented support for gemm where A, B, and C may have different
storage datatypes, as well as a computational precision (and implied
computation domain) that may be different from the storage precision
of either A or B. This results in 128 different combinations, all of
which are implemented within this commit. (For now, the mixed-datatype
functionality is only supported via the object API.) If desired, the
mixed-datatype support may be disabled at configure-time.
- Added a memory-intensive optimization to certain mixed-datatype cases
that requires a single m-by-n matrix be allocated (temporarily) per
call to gemm. This optimization aims to avoid the overhead involved in
repeatedly updating C with general stride, or updating C after a
typecast from the computation precision. This memory optimization may
be disabled at configure-time (provided that the mixed-datatype
support is enabled in the first place).
- Added support for testing mixed-datatype combinations to testsuite.
The user may test gemm with mixed domains, precisions, both, or
neither.
- Added a standalone test driver directory for building and running
mixed-datatype performance experiments.
- Defined a new variation of castm, castnzm, which operates like castm
except that imaginary values are not touched when casting a real
operand to a complex operand. (By contrast, in these situations castm
sets the imaginary components of the destination matrix to zero.)
- Defined bli_obj_imag_is_zero() and substituted calls in lieu of all
usages of bli_obj_imag_equals() that tested against BLIS_ZERO, and
also simplified the implementation of bli_obj_imag_equals().
- Fixed bad behavior from bli_obj_is_real() and bli_obj_is_complex()
when given BLIS_CONSTANT objects.
- Disabled dt_on_output field in auxinfo_t structure as well as all
accessor functions. Also commented out all usage of accessor
functions within macrokernels. (Typecasting in the microkernel is
still feasible, though probably unrealistic for now given the
additional complexity required.)
- Use void function pointer type (instead of void*) for storing function
pointers in bli_l0_fpa.c.
- Added documentation for using gemm with mixed datatypes in
docs/MixedDatatypes.md and example code in examples/oapi/11gemm_md.c.
- Defined level-1d operation xpbyd and level-1m operation xpbym.
- Added xpbym test module to testsuite.
- Updated frame/include/bli_x86_asm_macros.h with additional macros
(courtesy of Devin Matthews).
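
Illustrative note (not part of the original commit message): the figure of
128 combinations is consistent with the following reading, which is an
assumption and is not spelled out in the commit: each of A, B, and C takes
one of four storage datatypes (s, d, c, z), and the computation precision is
either single or double.

```python
from itertools import product

# Assumed: real/complex x single/double storage datatypes.
STORAGE_DATATYPES = ("s", "d", "c", "z")
# Assumed: the computation precision is chosen independently of storage.
COMPUTATION_PRECISIONS = ("single", "double")

combinations = list(
    product(
        STORAGE_DATATYPES,       # storage datatype of A
        STORAGE_DATATYPES,       # storage datatype of B
        STORAGE_DATATYPES,       # storage datatype of C
        COMPUTATION_PRECISIONS,  # computation precision
    )
)

print(len(combinations))
```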

commit 3612ecac98a9d36c3fcd64154121d420bb69febd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 11 15:16:41 2018 -0500

Added comments to nested OpenMP handling code.

Details:
- Added comments to bli_thrcomm_openmp.c relating to changes made in
6ac0c80 and 1064d79.

commit 667d3929ee20e94849b4e25b693b4037b7e3f350
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 11 11:47:57 2018 -0500

Added Fortran APIs for some thread functions.

Details:
- Defined Fortran-77 compatible APIs for bli_thread_set_num_threads()
and bli_thread_set_ways(). These wrappers are defined in
frame/compat/blis/thread/b77_thread.c. Thanks to Kay Dewhurst for
suggesting these new interfaces.
- Added missing prototype for bli_thread_set_ways() in bli_thread.h and
removed prototypes for non-existent functions bli_thread_set_*_nt().
- CREDITS file update.

commit 1064d79711f03a0541b92d8b8b9b7e25e04097a5
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 11 11:14:25 2018 -0500

Adjust rntm_t struct as well.

commit 6ac0c805609b85616ddb32e50101c4f9feb25a35
Author: Devin Matthews <damatthewssmu.edu>
Date: Thu Oct 11 10:45:07 2018 -0500

Fix OMP nesting problem.

Detect when OpenMP uses fewer threads than requested and correct accordingly, so that we don't wait forever for nonexistent threads. Fixes 267.

commit 78a6935483409ae277c766406e175772e820b1de
Author: sraut <Biplab.Rautamd.com>
Date: Thu Oct 11 10:49:40 2018 +0530

Added comments for the syrk small matrix change.

Change-Id: I958939e9953323730da49ef07d1b10e578837d82

commit 53a9ab1c85be14dcfd2560f5b16e898e3e258797
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 10 15:11:09 2018 -0500

Renamed thread auto-factorization macro constants.

Details:
- Renamed the following C preprocessor macros whose fallback/default
values are specified within frame/include/bli_kernel_macro_defs.h:

BLIS_DEFAULT_MR_THREAD_MAX -> BLIS_THREAD_MAX_IR
BLIS_DEFAULT_NR_THREAD_MAX -> BLIS_THREAD_MAX_JR
BLIS_DEFAULT_M_THREAD_RATIO -> BLIS_THREAD_RATIO_M
BLIS_DEFAULT_N_THREAD_RATIO -> BLIS_THREAD_RATIO_N

- Renamed the above cpp macro overrides within the knl, skx, and zen
sub-configurations, as well as invocations of those macros in
bli_rntm.c.
- Moved config/zen/bli_kernel.h to an 'old' directory as it is no longer
used by any code within BLIS.

commit 637c2ce794b0414ba8b25e9a452f7d64f825d63a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 9 17:18:04 2018 -0500

Updated column index range for irun.py -q.

Details:
- Forgot to apply the column index range fix in 10f179f to situations
when "quiet" mode (-q) is requested. This commit applies the new
column index range modifications to the quiet case.

commit e2a59400bdda7ed7ee0ff00edea70c00ed593b6c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 9 15:29:48 2018 -0500

Allow trsm_l parallelism in the jc loop.

Details:
- Previously, trsm was consolidating all ways of parallelism into the jr
loop. This was unnecessary and to some degree detrimental on some
types of hardware. Now, any parallelism bound for the jc loop will be
applied to the jc loop, while all other loops' parallelism is funneled
to the jr loop. Thanks to Devangi Parikh for helping investigate this
issue and suggesting the fix.
- NOTE: This change affects only left-side trsm. However,
right-side trsm is currently implemented in terms of the left-side
case, and thus the change effectively applies to both left and right
cases.
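
Illustrative note (not part of the original commit message): the change in
how ways of parallelism are redistributed can be sketched as below. The
function names are hypothetical, and the five loop names (jc, pc, ic, jr,
ir) are assumed to follow BLIS's usual loop naming; this is not the code in
BLIS itself.

```python
def funnel_all_to_jr(jc, pc, ic, jr, ir):
    """Old behavior as described: every way of parallelism goes to jr."""
    return {"jc": 1, "pc": 1, "ic": 1, "jr": jc * pc * ic * jr * ir, "ir": 1}

def funnel_keep_jc(jc, pc, ic, jr, ir):
    """New behavior as described: jc keeps its ways; the rest go to jr."""
    return {"jc": jc, "pc": 1, "ic": 1, "jr": pc * ic * jr * ir, "ir": 1}

requested = dict(jc=2, pc=1, ic=4, jr=3, ir=1)
print(funnel_all_to_jr(**requested))
print(funnel_keep_jc(**requested))
```

In both cases the total number of ways (the product over all loops) is
unchanged; only its distribution across loops differs.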

commit f1dba506c970f14e612580d3c171e7c5ffd0a5fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 8 17:59:41 2018 -0500

Output threading status/params from testsuite.

Details:
- Updated testsuite to output various parameters related to parallelism
in BLIS. These parameters include:
- threading status: disabled, openmp, or pthreads;
- thread partitioning for jr/ir loops: slab or rr (round-robin);
- ways of parallelism from environment variables, and also actual
values used by gemm, herk, trmm_l, trmm_r, trsm_l, and trsm_r for
square problems (assuming all dimensions are set to 1000);
- automatic thread factorization parameters.
- Also output the status of two relatively new configure-time options:
libmemkind and the sandbox.

commit 10f179fb13fc1179921a4ef8efdd2174f01e07da
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 8 14:36:38 2018 -0500

Updated irun.py to use updated column index range.

Details:
- Updated the irun.py script so that it updates the matlab column index
range (if found) to reflect the additional columns of data that are
substituted in. Thanks to Devangi Parikh for recognizing and reporting
this issue.

commit c244a716c97849dee41f52b5f424116aae1b710b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 7 20:59:40 2018 -0500

Added missing -r option to configure --help output.

Details:
- Added inadvertently-omitted mention of -r option-equivalent to
--thread-part-jrir to the output for 'configure --help'. Also made
minor edits to the same text.

commit c92762ecdca1eb0b08c8acd583b4739a1e3fbd39
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 7 20:30:32 2018 -0500

Added option of slab or rr partitioning in jr/ir.

Details:
- Updated existing macrokernel function names and definitions to
explicitly use slab assignment of micropanels to threads, then created
duplicate versions of macrokernels that explicitly use round-robin
assignment instead of slab. NOTE: As in ac18949, trsm_r macrokernels
were not substantially updated in this commit because they are
currently disabled in bli_trsm_front.c.
- Updated existing packing function (in bli_packm_blk_var1.c) to
explicitly use slab partitioning, and then duplicated for round-robin.
- Updated control tree initialization to use the appropriate macrokernel
and packm function pointers depending on which method (slab or rr) was
enabled at configure-time.
- Updated configure script to accept new --thread-part-jrir=[slab|rr]
option (-m [slab|rr] for short), which allows the user to explicitly
request either slab or round-robin assignment (partitioning) of
micropanels to threads.
- Updated sandbox/ref99 according to above changes.
- Minor updates to build/add-copyright.py.
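
Illustrative note (not part of the original commit message): the difference
between slab and round-robin assignment of micropanels to threads can be
sketched as follows. The function names are hypothetical, and the way
leftover iterations are distributed in the slab case is an assumption; the
actual BLIS partitioning functions may differ in detail.

```python
def slab_indices(n_iter, n_threads, tid):
    """Contiguous block of micropanel indices assigned to thread tid."""
    base, extra = divmod(n_iter, n_threads)
    # Assumed policy: the first `extra` threads take one additional iteration.
    start = tid * base + min(tid, extra)
    end = start + base + (1 if tid < extra else 0)
    return list(range(start, end))

def rr_indices(n_iter, n_threads, tid):
    """Interleaved (round-robin) micropanel indices assigned to thread tid."""
    return list(range(tid, n_iter, n_threads))

for tid in range(3):
    print(tid, slab_indices(10, 3, tid), rr_indices(10, 3, tid))
```

Both schemes cover every micropanel exactly once; slab gives each thread a
contiguous range, while round-robin interleaves the threads across the
range.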

commit 98e01ea04bfe1032e5bd4781043afd84f864a19e
Merge: ac18949a 541b8a3b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 4 20:44:12 2018 -0500

Merge branch 'master' into amd

commit 541b8a3b3e9af4078f5e6fb2f9608d681839952a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 4 20:39:06 2018 -0500

Removed 1h short-circuit from bli_clock_min_diff().

Details:
- Removed a guard from bli_clock_min_diff() that would return 0 if the
time delta was greater than 60 minutes. This was originally intended
to disregard extremely large values under the assumption that the
user probably didn't intend to run a test that long. However, since
it is in bli_clock_min_diff(), it doesn't actually help short-circuit
an implementation that is hanging or looping infinitely, since such
an implementation would first have to finish before the
bli_clock_min_diff() is called. Thanks to Kiran Varaganti for
reporting this issue.
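
Illustrative note (not part of the original commit message): a minimal
sketch of a running-minimum timer update without the one-hour guard. The
function below is hypothetical and takes the elapsed time as an argument
rather than reading a clock; the one-nanosecond threshold for discarding
implausibly small readings is an assumption.

```python
def clock_min_diff(time_min, elapsed):
    """Return the smaller of the running minimum and the new elapsed time.

    Readings that are non-positive or under a nanosecond are treated as
    garbled and ignored (assumed threshold). There is no upper bound, so a
    run longer than one hour is accepted like any other reading.
    """
    candidate = min(time_min, elapsed)
    if candidate < 1.0e-9:
        return time_min
    return candidate

print(clock_min_diff(1.0e9, 7200.0))  # a two-hour run is now a valid reading
print(clock_min_diff(5.0, 0.0))       # a zero reading is still discarded
```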

commit f0c3ef359f7c6c1687fb2671cb35deb346e00597
Author: Kiran V <Kiran.Varagantiamd.com>
Date: Thu Oct 4 16:32:21 2018 +0530

This is a fix for a floating-point exception error in BLIS SGEMM with larger matrix sizes.
BUG No: CPUPL-197 fixed by Thangaraj Santanu
The bli_clock_min_diff() function in BLIS assumed that if the time taken is greater than 1 hour then the reading must be wrong. However, this is not the case in general, whereas the other checks, such as for a time taken close to zero or on the nsec scale, are of course valid.
gerrit review: http://git.amd.com:8080/#/c/118694/1/frame/base/bli_clock.c

Change-Id: I9dc313d7c5fdc20684f67a516bf3237de3e0694a

commit 8bf30eb4735872388b5317883d99b775a344ce25
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Wed Oct 3 22:22:29 2018 -0400

Fixed runme.sh in test/studies/thunderx2

Details:
- Fixed the setting of threads for a single core run.

commit f6f2456ba2afa8f85f43c7c2c90acc439d61d94f
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Wed Oct 3 21:43:46 2018 -0400

Fixed the Makefile in test/studies/thunderx2

Details:
- Fixed target for make-all-st and make-all-mt so that the armpl
targets are built

commit 743a1a6dec1bd3908f0f15513b501c9bd59715b3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 3 14:40:10 2018 -0500

Fixed misleading version query from gcc 7+.

Details:
- gcc 7 introduced new behavior to the -dumpversion option whereby only
the major version component is output. However, as part of this
change, gcc 7 also introduced a new option, -dumpfullversion, which is
guaranteed to always output the major, minor, and revision numbers. If
we are using gcc 7 or later, we re-query the version string with this
new option and then re-parse the result so as to avoid misleading
output from configure (e.g. using gcc 7.3.0 is reported as 7.7.7).
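
Illustrative note (not part of the original commit message): the re-query
decision described above can be sketched as follows. The function name is
hypothetical, and this is not the shell code used by configure; it operates
on strings such as those printed by `gcc -dumpversion` and
`gcc -dumpfullversion`.

```python
def resolve_gcc_version(dumpversion, dumpfullversion=None):
    """Return (major, minor, revision) from gcc version query output.

    dumpversion:     text printed by `gcc -dumpversion`
    dumpfullversion: text printed by `gcc -dumpfullversion` (gcc 7+ only)
    """
    major = int(dumpversion.strip().split(".")[0])
    # For gcc 7 or later, prefer the full version string when available.
    text = dumpfullversion if major >= 7 and dumpfullversion else dumpversion
    fields = text.strip().split(".")
    # Assumed policy: pad missing components with zeros.
    fields = (fields + ["0", "0"])[:3]
    return tuple(int(field) for field in fields)

print(resolve_gcc_version("7", "7.3.0"))
print(resolve_gcc_version("5.4.0"))
```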

commit de07840ba5672b9d7b2ed2b918974e98c3f249fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 3 13:57:25 2018 -0500

Whitespace, https updates to README.md.

Details:
- Reformatted to fit all lines within 80 columns, unless a link is too
long to fit on a single line.
- Changed some links from http to https.

commit 80a8b3dd8034ec8bc03d31be3f9c837c3f6fc94b
Author: sraut <Biplab.Rautamd.com>
Date: Wed Oct 3 15:30:33 2018 +0530

Review comments incorporated for small TRSM.

Change-Id: Ia64b7b2c0375cc501c2cb0be8a1af93111808cd9

commit b8dfd82e0d1afda4ee5436662d63515a59b2dee3
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 15:37:12 2018 -0500

Get pthreads via blis.h in the test driver.

commit d0c0c20b7bd3ecf914b5910a50f618fb7d7aa355
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 15:16:00 2018 -0500

There seems to be a problem with _POSIX_BARRIERS on Travis.

commit 0904d9e4df0c8a256ac35c491f14a587ebe9fca2
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 15:04:36 2018 -0500

*Always* use Windows primitives instead of pthreads.

commit 998317d309934cd7129f8c818ea6e5f07534ebc8
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 14:43:24 2018 -0500

Remove pthreads from appveyor build.

commit 627d0c5bfd4b7b149803587391c93b164c11ced5
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 14:40:55 2018 -0500

Combine the alternative barrier implementation for macOS with the pthread wrapper for Windows. Also implement pthread_{create,join} for Windows.

commit 81d2c064a209df7eca7d6103696ca3a137a7f82e
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 11:46:36 2018 -0500

Add wrapper for basic pthreads functionality (mutex, once) with MSVC.

commit d33f130ea621fca1dccb30631f454d237918eb04
Author: Devin Matthews <damatthewssmu.edu>
Date: Tue Oct 2 11:45:43 2018 -0500

Some configure changes:

1) Allow environment variables to be set anywhere in the argument list.
2) Allow any environment variable to be set.
3) Allow LIBPTHREAD to be set to null without getting defaulted to -lpthread.

commit 9d5f1c4f3bf70c2c0ea84bfa326a0113ae2d176c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 1 17:39:26 2018 -0500

Patch to avoid gcc warning in blastest/f2c/open.c.

Details:
- Use the modulo operator to limit the size of an integer that is given
to sprintf(). This avoids a warning in some versions of gcc about the
integer potentially overflowing the available space in the string into
which the integer is being printed.

commit 0c3cd00ba76de607e807f8deb04b1a2ce18ea7a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 1 16:18:25 2018 -0500

More README.md updates.

Details:
- Replaced much of "Getting Started" section with a shortened version of
the bullet list of documentation currently shown in the github wiki
page. Thanks to Devangi Parikh for her feedback in this change.

commit 8eaf34bd23b30a1857a50d7142ee9811895f24bf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 1 14:29:07 2018 -0500

Very minor README.md update.

commit 599090e0eb41b2706fa1231fa7b90096f3281678
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 1 14:04:30 2018 -0500

README.md update.

Details:
- Added language mentioning SHPC group to Introduction.

commit ee46fa3efb6e920fa6c3d0b0601007f5de31deb5
Author: sraut <Biplab.Rautamd.com>
Date: Mon Oct 1 16:30:30 2018 +0530

Small TRSM optimization changes: 1) single-precision small trsm kernels for the XAt=B case are further optimized for performance. 2) double-precision small trsm kernels for the AX=B and XAt=B cases are implemented. 3) single-precision small trsm kernels for AutX=B are implemented in intrinsics to improve the current performance.

Change-Id: Ic9d67ae6d8522615257dde018903f049dcffa2cf

commit 08045a6c52b6e025652c5b18eb120c0f4e61cf6f
Author: sraut <Biplab.Rautamd.com>
Date: Mon Oct 1 15:38:23 2018 +0530

Corrected the fix made for blastest level-3 failure to check m,n,k non-zero condition in bli_gemm_small.c

Change-Id: Idaf9f2327c3127b04a2738ae8a058b83d6c57934

commit ac18949a4b9613741b9ea8e5026d8083acef6fe4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Sep 30 18:54:56 2018 -0500

Multithreading optimizations for l3 macrokernels.

Details:
- Adjusted the method by which micropanels are assigned to threads in
the 2nd (jr) and 1st (ir) loops around the microkernel to (mostly)
employ contiguous "slab" partitioning rather than interleaved (round
robin) partitioning. The new partitioning schemes and related details
for specific families of operations are listed below:
- gemm: slab partitioning.
- herk: slab partitioning for region corresponding to non-triangular
region of C; round robin partitioning for triangular region.
- trmm: slab partitioning for region corresponding to non-triangular
region of B; round robin partitioning for triangular region.
(NOTE: This affects both left- and right-side macrokernels:
trmm_ll, trmm_lu, trmm_rl, trmm_ru.)
- trsm: slab partitioning.
(NOTE: This affects only left-side macrokernels trsm_ll,
trsm_lu; right-side macrokernels were not touched.)
Also note that the previous macrokernels were preserved inside of
the 'other' directory of each operation family directory (e.g.
frame/3/gemm/other, frame/3/herk/other, etc).
- Updated gemm macrokernel in sandbox/ref99 in light of above changes
and fixed a stale function pointer type in blx_gemm_int.c
(gemm_voft -> gemm_var_oft).
- Added standalone test drivers in test/3m4m for herk, trmm, and trsm
and minor changes to test/3m4m/Makefile.
- Updated the arguments and definitions of bli_*_get_next_[ab]_upanel()
and bli_trmm_?_?r_my_iter() macros defined in bli_l3_thrinfo.h.
- Renamed bli_thread_get_range*() APIs to bli_thread_range*().

commit b952ca8feb6f17f71a4512649c2aa72bdee9c8f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 28 16:12:32 2018 -0500

CREDITS file update.

commit 7d96fc437ebaa9dd2d7071865b5df16402fadd64
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 28 15:40:45 2018 -0500

Allow slashes ('/') in version tags.

Details:
- Updated the configure script to allow slashes in version string. This
is needed so that downstream maintainers (such as those for Debian)
can create local tags such as "upstream/0.4.1". Thanks to M. Zhou for
reporting this issue via PR 256 and providing me the information
needed to debug the problem.

commit 5fdddf6f37c64da093c7f59e3a85214e819ae652
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 28 11:25:54 2018 -0500

Removed 'debian' directory.

Details:
- Removed the top-level 'debian' directory. This directory is apparently
no longer needed (issue 257). Thanks to M. Zhou and Nico Schlömer for
their contributions.

commit 9814cfdf3157ef4726ee604fc895d56e8063d765
Author: Meghana <meghana.vankadariamd.com>
Date: Fri Sep 28 11:02:39 2018 +0530

fixed blastest level-3 failure by adding ((M&N&K) != 0) to check condition in bli_gemm_small.c

Change-Id: I85e4a32996ebb880f3c00bd293edc38f74700fe6

commit 86330953b14c180862deef3ccdcc6431259be27b
Merge: 7af5283d 807a6548
Author: praveeng <praveen.gamd.com>
Date: Fri Sep 28 10:08:06 2018 +0530

Resolved conflicts and modified bli_trsm_small.c

Change-Id: I578d419cff658003e0fdd4c4cdc93145d951ce31

commit 60b2650d7406d266feffe232c2d5692a9e3886d0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 24 15:04:45 2018 -0500

Added statistics-collecting irun.py script.

Details:
- Added irun.py script to 'build' directory. This irun.py script is a
python script for repeatedly invoking a test driver executable, such
as those found in test/3m4m, and replacing the performance output column
with four columns that aggregate statistics. Specifically, the script
reports the minimum, average, maximum, and standard deviation for each
problem size. This script is useful especially (though not
exclusively) when trying to determine the impact of relatively minor
changes to the code, or other small optimizations that may be
difficult to distinguish from "noise." One way this "noise" manifests
is that a test executable may run slightly slower or faster for all
problem sizes (and all implementations) tested by the executable over
the life of a single execution. The cause of these minor
across-the-board perturbations in the overall performance signatures is
unknown, though we hypothesize that it may relate to any number of
issues such as operating system scheduling, where in memory the
program is loaded, or how the CPU clock frequency is throttled at the
time of execution. Regardless of the source of these subtle
performance anomalies, the statistical properties reported by the
irun.py script help the user to more precisely characterize the
underlying performance exhibited by any given test driver, which
allows him or her to make better judgments about the true difference
in performance between two implementations, or minor changes within a
single implementation.
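
Illustrative note (not part of the original commit message): the kind of
aggregation described above can be sketched as follows. This is not the
irun.py script itself; the function name and input layout are hypothetical,
and the use of the population standard deviation is an assumption.

```python
import statistics

def aggregate(runs):
    """Aggregate performance per problem size across repeated runs.

    runs: list of dicts, one per run, mapping problem size -> performance.
    Returns a dict mapping problem size -> (min, average, max, stddev).
    """
    table = {}
    for size in sorted(runs[0]):
        samples = [run[size] for run in runs]
        table[size] = (
            min(samples),
            statistics.mean(samples),
            max(samples),
            statistics.pstdev(samples),  # assumed: population std. deviation
        )
    return table

# Three hypothetical runs of a test driver over two problem sizes.
runs = [
    {100: 10.0, 200: 20.0},
    {100: 12.0, 200: 20.0},
    {100: 14.0, 200: 20.0},
]
for size, stats in aggregate(runs).items():
    print(size, stats)
```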

commit 807a654888117fb3a27ea36384f1c1c11b882cd5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 20 15:41:05 2018 -0500

Fixed confusing configure message for libmemkind.

Details:
- Corrected feedback echoed to user by configure when libmemkind is
found but not explicitly requested. In these cases, configure would
echo a message that it had received an explicit request to enable
libmemkind, which was not accurate, even if the end result was the
same--that libmemkind is enabled by default when it is found. Thanks
to Devangi Parikh for reporting this issue.

commit 02adab427c779b0aaf38a5877a5f0246b1909e8f
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Thu Sep 20 14:38:50 2018 -0400

Created a 'thunderx2' subdirectory within test/studies

Details:
- Created a 'thunderx2' subdirectory within test/studies to house
various level-3 test drivers used to measure performance on
ThunderX2.

commit d7537fb51dac0636591fc7c68261a2322642ab3c
Merge: dad07245 c03728f1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 12 15:24:20 2018 -0500

Merge branch 'dev'

commit dad07245dbcfaf35232ec379ba756eb133c361c1
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Wed Sep 12 04:16:58 2018 -0500

Fixed yet another bug in runme script in test/studies

Details:
- Fixed another copy-paste bug

commit e669057fe35f2037d8111af687d84a0ecf6d7a2a
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Tue Sep 11 22:29:42 2018 -0500

Fixed bug in runme script in test/studies

Details:
- Fixed bug in runme script for skx studies that set the number of
threads incorrectly

commit 232fdc3df3e01ae3f86d53767bd14eb93b511e6e
Author: Devangi N. Parikh <dnpcs.utexas.edu>
Date: Mon Sep 10 18:45:50 2018 -0500

Updated runme script in test/studies.

Details:
- Updated runme script for skx studies to run multithreading tests
on 1 and 2 sockets.

commit c03728f1f45edb5e434db90ab8a77ba0184a682b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 10 17:54:27 2018 -0500

Various minor cleanups.

Details:
- Rewrote bli_winsys.c to define bli_setenv() and bli_sleep()
unconditionally, but differently for Windows and non-Windows, but
then disabled the definition of bli_setenv() entirely since BLIS
no longer needs to set environment variables. Updated bli_winsys.h
accordingly, and call bli_sleep() from within testsuite instead of
sleep() directly.
- Use
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS != 200809L)
instead of
#if !defined(_POSIX_BARRIERS) || (_POSIX_BARRIERS < 0)
when guarding against local definition of pthread barrier in
testsuite. (The description for unistd.h implies that _POSIX_BARRIERS
should always be set to 200809L when barriers are supported, though I
won't be surprised if we encounter a case in the future where it is
set to something else such as 1 while still supported.)
- Removed old _VERS_CONF_INST definitions and installation rules in
top-level Makefile. These are no longer needed because we no longer
output libraries with the version and configuration name as
substrings.
- Comment/whitespace updates in Makefile, config.mk.in, common.mk,
configure, bli_extern_defs.h, and test_libblis.h.
- Added mention of 1m to README.md and other trivial tweaks.

commit e249a00a82908054ecd307cf602c8801275903e8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 10 16:48:35 2018 -0500

Imported skx dgemm ukernel from skx-redux branch.

Details:
- Added the new bli_dgemm_skx_asm_16x14.c microkernel from the skx-redux
branch, along with appropriate blocksizes in bli_cntx_init_skx.c and
a prototype in bli_kernels_skx.h. (Devin has not yet written the
sgemm analogue, so for now we will continue using the older sgemm
ukernel.)
- Updated frame/include/bli_x86_asm_macros.h with a minor change that
was present within the skx-redux branch.

commit e93b01ff60bf9742baa5eefd93e208d1219e7a43
Author: Isuru Fernando <isurufgmail.com>
Date: Sun Sep 9 15:57:43 2018 -0500

Windows DLL support (246)

* Enable shared

* Enable rdp

* Add support for dll

* Use libblis-symbols.def

* Fix building dlls

* Fix libblis-symbols.def

* Fix soname

* Fix Makefile error

* Fix install target

* Fix missing symbols

* Add BLIS_MINUS_TWO

* Add path to dll

* Fix OSX soname

* Add declspec for dll

* Add -DBLIS_BUILD_DLL

* Replace enable_shared in config

* switch to auto for now

* blis_ -> bli_

* Remove BLIS_BUILD_DLL in make check

* change auto->haswell

* enable_shared_01

* Add wno-macro-redefined

* print out.cblat3

* BLIS_BUILD_DLL -> BLIS_IS_BUILDING_LIBRARY

* Use V=1

* Remove fpic for windows

* Remember LIBPTHREAD

* Remove libm for windows

* Remember AR

* Fix remembering libpthread

* Add Wno-maybe-uninitialized in only gcc

* Don't do blastest for shared for now

* Fix install target

And remove unnecessary change

* test auto and x86_64

* Fix install target again

* Use IS_WIN variable

* Remove leading dot from LIBBLIS_SO_MAJ_EXT

* Make is_win yes/no

* Add comments for windows builds

* Change if else blocks location

commit 1330d5c4bc3b644ec0af54c3939a5b9f00eacd9c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 7 19:37:59 2018 -0500

Employ "user" cflags for tl Makefile test targets.

Details:
- Use get-user-cflags-for() to generate cflags when compiling BLAS test
drivers and BLIS testsuite from top-level Makefile. Meant to include
these changes in previous commit (4b5437e). Thanks to Isuru Fernando
for pointing out this oversight.

commit 4b5437ec7afb2befffffbb83f7872bcb4fc61e51
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 7 17:24:32 2018 -0500

Define a cpp macro specific to BLIS compilation.

Details:
- Tweaked the cflags functions in common.mk so that a new preprocessor
macro, BLIS_IS_BUILDING_LIBRARY, is defined, but only when BLIS
itself is being built. This macro will not be defined when, for
example, the testsuite or example code compiles code local to those
applications. This was done in part by defining a new cflags function
get-user-cflags-for(), which is now the designated function for
application Makefiles if they wish to inherit a basic set of CFLAGS
from BLIS. (The compiler flags returned are identical to that of
get-frame-cflags-for() except that -DBLIS_IS_BUILDING_LIBRARY is
omitted.)
- Updated all test driver-like makefiles to call get-user-cflags-for()
instead of get-frame-cflags-for().

commit cc2cca4f56eb30212a0dce3e5c121e64d9e59560
Merge: e19e7212 fb81c7fc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 6 17:12:13 2018 -0500

Merge branch 'dev'

commit e19e7212872da3d464734199193436faa51f0da0
Merge: 97965b09 b3d0702c
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Thu Sep 6 14:58:49 2018 -0700

Merge pull request 244 from kali/pthread-barrier-osx

add an adhoc impl for pthread_barrier

commit b3d0702cf2ef6dda19a23dd8a677be1b6f73c322
Merge: 4e7d0670 97965b09
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Thu Sep 6 14:58:23 2018 -0700

Merge branch 'master' into pthread-barrier-osx

commit 4e7d06700f176a62952d7d51e41fdcbc6b7a9d5f
Author: Mathieu Poumeyrol <kalizoy.org>
Date: Thu Sep 6 23:48:31 2018 +0200

second __APPLE__

commit fb81c7fc665d68e6a2add163feb29acc0bce8936
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 6 16:29:39 2018 -0500

Defined cortexa53 sub-configuration.

Details:
- Added a new sub-configuration 'cortexa53', which is a mirror image
of cortexa57 except that it will use slightly different compiler
flags. Thanks to Mathieu Poumeyrol for making this suggestion after
discovering that the compiler flags being used by cortexa57 were
not working properly in certain OS X environments (the fix to which
is currently pending in pull request 245).

commit 24ecc0d94aaa9ab4df1ae6d199c4ec6d7783169f
Author: Mathieu Poumeyrol <kalizoy.org>
Date: Thu Sep 6 22:10:16 2018 +0200

use _POSIX_BARRIERS instead of __APPLE__

commit 97965b09059a610db06fb7a22bdfa79c0d37d673
Author: Mathieu Poumeyrol <kaliusers.noreply.github.com>
Date: Thu Sep 6 21:10:29 2018 +0200

cortexa9 and cortexa53 travis build + qemu test (245)

commit a6802eab7d94b5a9de633c53beca8245b74f5dc6
Author: Mathieu Poumeyrol <kalizoy.org>
Date: Thu Sep 6 17:16:35 2018 +0200

reinstantiate test on macos

commit d688a2b7e5a19cba44ea398a99e325e19b8fce50
Author: Mathieu Poumeyrol <kalizoy.org>
Date: Thu Sep 6 15:25:16 2018 +0200

add an adhoc impl for pthread_barrier

commit ab9f9e684dc3ffbb70cc45b21c67af5d916919e5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 30 15:14:02 2018 -0500

CHANGELOG update (0.4.1)
