Blis

Latest version: v1.2.0


0.2.1

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 5 14:41:34 2016 -0500

Version file update (0.2.1)

commit 87fddeab3c8a5ccb1bbf02e5f89db1464e459ba9
Merge: 86969873 6f71cd34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 5 13:35:01 2016 -0500

Merge branch 'compose'

commit 6f71cd344951854e4cff9ea21bbdfe536e72611d (origin/compose)
Merge: c0630c40 8d55033c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 4 15:53:46 2016 -0500

Merge pull request #94 from flame/distcomm

Implemented distributed thrinfo_t management.

commit 86969873b5b861966d717d8f9f370af39e3d9de6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 4 14:24:59 2016 -0500

Reclassified amaxv operation as a level-1v kernel.

Details:
- Moved amaxv from being a utility operation to being a level-1v operation.
This includes the establishment of a new amaxv kernel to live beside all
of the other level-1v kernels.
- Added two new functions to bli_part.c:
bli_acquire_mij()
bli_acquire_vi()
The first acquires a scalar object for the (i,j) element of a matrix,
and the second acquires a scalar object for the ith element of a vector.
- Added integer support to bli_getsc level-0 operation. This involved
adding integer support to the bli_*gets level-0 scalar macros.
- Added a new test module to test amaxv as a level-1v operation. The test
module works by comparing the value identified by bli_amaxv() to the
value found from a reference-like code local to the test module
source file. In other words, it (intentionally) does not guarantee the
same index is found; only the same value. This allows for different
implementations in the case where a vector contains two or more elements
containing exactly the same floating point value (or values, in the case
of the complex domain).
- Removed the directory frame/include/old/.
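
The value-based check described for the amaxv test module can be sketched as follows. This is a minimal illustration for real double-precision vectors, not the testsuite's actual code: ref_amaxv() and amaxv_result_ok() are hypothetical names, and idx_under_test is assumed to be the index returned by whichever amaxv implementation is being tested.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, kept local so the sketch needs no libm. */
static double abs_d(double v)
{
    return v < 0.0 ? -v : v;
}

/* Hypothetical reference: index of the first element with the largest
   absolute value. Requires n > 0. */
static size_t ref_amaxv(size_t n, const double *x)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (abs_d(x[i]) > abs_d(x[best]))
            best = i;
    return best;
}

/* Accept any index whose element equals the element chosen by the
   reference, so that ties may be resolved differently by different
   implementations. Returns 1 on success and 0 on failure. */
static int amaxv_result_ok(size_t n, const double *x, size_t idx_under_test)
{
    if (idx_under_test >= n)
        return 0;
    return x[idx_under_test] == x[ref_amaxv(n, x)];
}
```

With x = { 1.0, -7.5, 3.0, -7.5 }, both index 1 and index 3 pass the check, while index 2 does not.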

commit 8d55033c966feed99fcca2a58017c3ab5b1646dc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 15:20:58 2016 -0500

Implemented distributed thrinfo_t management.

Details:
- Implemented Ricardo Magana's distributed thread info/communicator
management. Rather than fully construct the thrinfo_t structures, from
root to leaf, prior to spawning threads, the threads individually
construct their thrinfo_t trees (or, chains), and do so incrementally,
as needed, reusing the same structure nodes during subsequent blocked
variant iterations. This required moving the initial creation of the
thrinfo_t structure (now, the root nodes) from the _front() functions
to the bli_l3_thread_decorator(). The incremental "growing" of the tree
is performed in the internal back-end (i.e., _int()) function, and so is
mostly invisible. Also, the incremental growth of the thrinfo_t tree is
done as a function of the current and parent control tree nodes (as well
as the parent thrinfo_t node), further reinforcing the parallel
relationship between the two data structures.
- Removed the "inner" communicator from thrinfo_t structure definition,
as well as its id. Changed all APIs accordingly. Renamed
bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
- Defined bli_l3_thrinfo_print_paths(), which prints the information
in an array of thrinfo_t* structure pointers. (Used only as a
debugging/verification tool.)
- Deprecated the following thrinfo_t creation functions:
bli_packm_thrinfo_create()
bli_l3_thrinfo_create()
because they are no longer used. bli_thrinfo_create() is now called
directly when creating thrinfo_t nodes.
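
The incremental growth described above can be pictured with a small sketch. It is only an assumption-laden model: info_node and its functions are hypothetical stand-ins, and the real thrinfo_t carries communicator and work-id data that is not modeled here. The point is that a child node is created the first time it is needed and reused on later iterations.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-thread info node. */
typedef struct info_node {
    int               depth;  /* distance from the root node */
    struct info_node *sub;    /* child node, created on first use */
} info_node;

static info_node *info_node_create(int depth)
{
    info_node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->depth = depth;
    n->sub   = NULL;
    return n;
}

/* Return the child of `parent`, growing the chain by one node only if
   the child does not exist yet. Later calls reuse the same node. */
static info_node *info_node_grow(info_node *parent)
{
    if (parent->sub == NULL)
        parent->sub = info_node_create(parent->depth + 1);
    return parent->sub;
}

/* Free the whole chain, starting at `n`. */
static void info_node_free(info_node *n)
{
    while (n != NULL) {
        info_node *next = n->sub;
        free(n);
        n = next;
    }
}
```

Each thread would own its own root node and call info_node_grow() as it descends, so no complete tree has to exist before the threads are spawned.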

commit fd04869ae4d4a3b0ebb9052557c296456bce7c0d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 14:14:11 2016 -0500

Changed configure's 'omp' threading to 'openmp'.

Details:
- Changed the configure script so that the expected string argument to the
-t (or --enable-threading=) option that enables OpenMP multithreading is
'openmp'. The previous expected string, 'omp', is still supported but
should be considered deprecated.

commit 9424af87209e4e435e2e742430945152690170b0
Merge: efa7341d c0630c40
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 27 12:51:08 2016 -0500

Merge branch 'compose'

commit 7f32dd57c6bd41c0704341752842277dd6a4c8eb
Author: Shaden Smith <shadencs.umn.edu>
Date: Sat Sep 17 11:33:57 2016 -0500

Adds sanity check to configuration choice.

commit efa7341df0b0115926aa8a6e8a4ebfb24fdbf11e
Merge: 121c39d4 e1453f68
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 16 11:01:57 2016 -0500

Merge pull request #92 from ShadenSmith/readme_fix

Fixes broken URL in README.md

commit e1453f68f6afd90ae9a29b7a5faa46aa79bbf741
Author: Shaden Smith <ShadenTSmithgmail.com>
Date: Fri Sep 16 09:29:28 2016 -0500

Fixes broken URL in README.md

commit b922d7563422e14c49a4677bc6ae088a408861ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:38:36 2016 -0500

Avoid compiling BLAS/CBLAS files when disabled.

Details:
- Updated the top-level Makefile, build/config.mk.in template, and
configure script so that object files corresponding to source files
belonging to the BLAS compatibility layer are not compiled (or archived)
when the compatibility layer is disabled. (Same for CBLAS.) Thanks
to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
of converting (overwriting) some, such as enable_blas2blis and
enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
now stored in new variables that live alongside the originals (with the
suffix "_01"). This is convenient since some values need to be
sed-substituted into the config.mk.in template, which requires "yes" or
"no", while some need to be written to the bli_config.h.in template,
which requires "0" or "1".

Updated BLIS4 TOMS citation in README.md.

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
architectures. As with their real domain brethren, these kernels prefer
row storage (though this doesn't affect most users due to high-level
optimizations in most level-3 operations that induce a transpose to
whatever storage preference the kernel may have).

Change-Id: I512ab90784ecbb7cdaee24928d2ccebb544ba5c1

commit 69826110bab2a064ec76457c24843d28f2581281
Merge: 64598ee4 a58dd35e
Author: Pradeep Rao <Pradeep.Raoamd.com>
Date: Wed Sep 14 03:26:25 2016 -0400

Merge "Implemented trsm single precision for lower triangular matrices. Files added: bli_trsm_l_int_6x16.c. Files modified: bli_kernel.h, to enable the optimized trsm microkernel, and test_trsm.c, to test trsm single precision" into amd-staging

commit c0630c4024b08750043a2942a3e8a037aa6b6259
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 12 13:59:02 2016 -0500

Added debugging printf()'s to bli_l3_thrinfo.c.

Details:
- Added optional printf() statements to print out thread communicator
info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
- Minor changes to frame/thread/bli_thrinfo.h.

commit 7b3bf1ffcd7160ccbf6c2518af6d88f6742e4977
Merge: 35509818 121c39d4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 6 15:47:13 2016 -0500

Merge branch 'master' into compose

commit 121c39d455f2db6f7ce6802ba7f73ad5e088c68c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 5 13:11:42 2016 -0500

Added complex gemm micro-kernels for haswell.

Details:
- Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
architectures. As with their real domain brethren, these kernels prefer
row storage (though this doesn't affect most users due to high-level
optimizations in most level-3 operations that induce a transpose to
whatever storage preference the kernel may have).

commit 35509818cbea1598b123421f81c42120889a03c3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 31 17:34:15 2016 -0500

Added, moved some thread barriers.

Details:
- Removed thread barriers from the end of the loop bodies of
bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
and bli_trsm_blk_var2().
- Moved the thread barrier at the end of bli_packm_int() to the
end of bli_l3_packm(), and added missing barriers to that function.
- Removed the no longer necessary (and now incorrect) ochief guard
in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
- Thanks to Tyler Smith for help with these changes.

commit 64598ee4cfb86f64abbd4bcef5a82ba0d5565b67
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Aug 31 12:54:50 2016 +0530

fixed the symlink issue

Change-Id: I2186d529f295c576597c189e1ae219bc1a83f955

commit abd61f9fa75d77a96d1491b3e035451ee73238fe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 30 12:34:19 2016 -0500

Updated BLIS4 TOMS citation in README.md.

commit 8a2373f26ba8fcd5b2d7b2cc72cb8b2e1f841a03
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 29 14:10:45 2016 +0530

Norm 2 optimization

Change-Id: Ide9decaccd20bf0ccc32c9abb6556e038dceed2b

commit fdc663902347aa252ea88cf09ce24ab748958dff
Author: sthangar <Santanu.Thangarajamd.com>
Date: Mon Aug 29 10:43:38 2016 +0530

Placed 1 and 1f AMD optimized AVX routines under zen folder

Change-Id: I26795211ef11d232ed794ce36dd0a9c1f8706328

commit 701b9aa3ff028decbf90efac0dca5bd64fe26269
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 26 19:04:45 2016 -0500

Redesigned control tree infrastructure.

Details:
- Altered control tree node struct definitions so that all nodes have the
same struct definition, whose primary fields consist of a blocksize id,
a variant function pointer, a pointer to an optional parameter struct,
and a pointer to a (single) sub-node. This unified control tree type is
now named cntl_t.
- Changed the way control tree nodes are connected, and what computation
they represent, such that, for example, packing operations are now
associated with nodes that are "inline" in the tree, rather than off-
shoot branches. The original tree for the classic Goto gemm algorithm was
expressed (roughly) as:

blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
               |           |
               -> packb    -> packa

and now, the same tree would look like:

blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2

Specifically, the packb and packa nodes perform their respective packing
operations and then recurse (without any loop) to a subproblem. This means
there are now two kinds of level-3 control tree nodes: partitioning and
non-partitioning. The blocked variants are members of the former, because
they iteratively partition off submatrices and perform suboperations on
those partitions, while the packing variants belong to the latter group.
(This change has the effect of allowing greatly simplified initialization
of the nodes, which previously involved setting many unused node fields to
NULL.)
- Changed the way thrinfo_t tree nodes are arranged to mirror the new
connective structure of control trees. That is, packm nodes are no longer
off-shoot branches of the main algorithmic nodes, but rather connected
"inline".
- Simplified control tree creation functions. Partitioning nodes are created
concisely with just a few fields needing initialization. By contrast, the
packing nodes require additional parameters, which are stored in a
packm-specific struct that is tracked via the optional parameters pointer
within the control tree struct. (This parameter struct must always begin
with a uint64_t that contains the byte size of the struct. This allows
us to use a generic function to recursively copy control trees.) gemm,
herk, and trmm control tree creation continues to be consolidated into
a single function, with the operation family being used to select
among the parameter-agnostic macro-kernel wrappers. A single routine,
bli_cntl_free(), is provided to free control trees recursively, whereby
the chief thread within a group releases the blocks associated with
mem_t entries back to the memory broker from which they were acquired.
- Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
function pointer stored in the current control tree node (rather than
index into a local function pointer array). Before being invoked, these
function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
families) or trsm_voft (for trsm family) type, which is defined in
frame/3/bli_l3_var_oft.h.
- Retired herk and trmm internal back-ends, since all execution now flows
through gemm or trsm blocked variants.
- Merged forwards- and backwards-moving variants by querying the direction
from routines as a function of the variant's matrix operands. gemm and
herk always move forward, while trmm and trsm move in a direction that
is dependent on which operand (a or b) is triangular.
- Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
each of which takes additional arguments and hides complexity in managing
the difference between the way ranges are computed for the four families
of operations.
- Simplified level-3 blocked variants according to the above changes, so that
the only steps taken are:
1. Query partitioning direction (forwards or backwards).
2. Prune unreferenced regions, if they exist.
3. Determine the thread partitioning sub-ranges.
<begin loop>
4. Determine the partitioning blocksize (passing in the partitioning
direction)
5. Acquire the current iteration's partitions for the matrices affected
by the current variant's partitioning dimension (m, k, n).
6. Call the subproblem.
<end loop>
- Instantiate control trees once per thread, per operation invocation.
(This is a change from the previous regime in which control trees were
treated as stateless objects, initialized with the library, and shared
as read-only objects between threads.) This once-per-thread allocation
is done primarily to allow threads to use the control tree as a place
to cache certain data for use in subsequent loop iterations. Presently,
the only application of this caching is a mem_t entry for the packing
blocks checked out from the memory broker (allocator). If a non-NULL
control tree is passed in by the (expert) user, then the tree is copied
by each thread. This is done in bli_l3_thread_decorator(), in
bli_thrcomm_*.c.
- Added a new field to the context, an opid_t, which tracks the "family"
of the operation being executed. For example, gemm, hemm, and symm are
all part of the gemm family, while herk, syrk, her2k, and syr2k are
all part of the herk family. Knowing the operation's family is necessary
when conditionally executing the internal (beta) scalar reset on
C in blocked variant 3, which is needed for gemm and herk families,
but must not be performed for the trmm family (because beta has only
been applied to the current row-panel of C after the first rank-kc
iteration).
- Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
to conform with the new control tree design, and renamed the macro-
kernel codes corresponding to 3m2 and 4m1b.
- Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
- Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
frame/base/bli_auxinfo.h.
- Fixed a minor bug whereby the storage-to-ukr-preference matching
optimization in the various level-3 front-ends was not being applied
properly when the context indicated that execution would be via an
induced method. (Before, we always checked the native micro-kernel
corresponding to the datatype being executed, whereas now we check
the native micro-kernel corresponding to the datatype's real projection,
since that is the micro-kernel that is actually used by induced methods.)
- Added an option to the testsuite to skip the testing of native level-3
complex implementations. Previously, it was always tested, provided that
the c/z datatypes were enabled. However, some configurations use
reference micro-kernels for complex datatypes, and testing these
implementations can slow down the testsuite considerably.
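
The unified node layout and the size-prefixed parameter struct described in this commit can be sketched as follows. All names here (node, pack_params, node_copy(), and so on) are hypothetical and chosen for illustration; the actual cntl_t definition and its creation functions may differ. The sketch shows why storing the byte size first allows a generic recursive copy that never needs to know the concrete parameter type.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Generic function pointer; a caller would cast it to the variant's
   real signature before invoking it. */
typedef void (*void_fp)(void);

/* Hypothetical unified node: blocksize id, variant function pointer,
   optional parameter struct, and a single sub-node. */
typedef struct node {
    int          bszid;
    void_fp      var;
    void        *params;  /* NULL, or a struct whose first member is its size */
    struct node *sub;
} node;

/* Every parameter struct begins with its own byte size. */
typedef struct {
    uint64_t size;
    int      pack_schema;  /* illustrative packing parameter */
} pack_params;

static node *node_create(int bszid, void_fp var, void *params, node *sub)
{
    node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->bszid  = bszid;
    n->var    = var;
    n->params = params;
    n->sub    = sub;
    return n;
}

/* Recursively copy a chain of nodes, duplicating each parameter struct
   using only the size stored in its first eight bytes. */
static node *node_copy(const node *src)
{
    if (src == NULL)
        return NULL;
    void *params = NULL;
    if (src->params != NULL) {
        uint64_t size;
        memcpy(&size, src->params, sizeof size);
        params = malloc((size_t)size);
        assert(params != NULL);
        memcpy(params, src->params, (size_t)size);
    }
    return node_create(src->bszid, src->var, params, node_copy(src->sub));
}

/* Recursively free a chain of nodes and their parameter structs. */
static void node_free(node *n)
{
    if (n == NULL)
        return;
    node_free(n->sub);
    free(n->params);
    free(n);
}
```

A per-thread copy of a user-supplied tree, as described in the last bullets above, would then amount to one node_copy() call per thread followed by a node_free() when the operation completes.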

commit a58dd35ed7b5b77a6b272655d2edd7a822b8fa87
Author: Kiran Varaganti <Kiran.Varagantiamd.com>
Date: Fri Aug 26 14:55:12 2016 +0530

Implemented trsm single precision for lower triangular matrices. Files added: bli_trsm_l_int_6x16.c. Files modified: bli_kernel.h, to enable the optimized trsm microkernel, and test_trsm.c, to test trsm single precision

Change-Id: Ibddf989f4aad577e89558673e1038cf6ece654d9

commit 73517f522b69de429dd7f3df60a70c068149ab28
Merge: c6f5c215 50293da3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:46:59 2016 -0500

Merge branch 'master' into compose

commit 50293da38d5f2b7be9bbc94b9e85aacb6a10f672
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 23 13:38:36 2016 -0500

Avoid compiling BLAS/CBLAS files when disabled.

Details:
- Updated the top-level Makefile, build/config.mk.in template, and
configure script so that object files corresponding to source files
belonging to the BLAS compatibility layer are not compiled (or archived)
when the compatibility layer is disabled. (Same for CBLAS.) Thanks
to Devin Matthews for suggesting this optimization.
- Slight change to the way configure handles internal variables. Instead
of converting (overwriting) some, such as enable_blas2blis and
enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
now stored in new variables that live alongside the originals (with the
suffix "_01"). This is convenient since some values need to be
sed-substituted into the config.mk.in template, which requires "yes" or
"no", while some need to be written to the bli_config.h.in template,
which requires "0" or "1".

commit 22dd6a353ddb56614309c01533b1a94c9fd32bca
Merge: cdfb3c3f f20ed388
Author: praveeng <praveen.gamd.com>
Date: Tue Aug 23 15:15:35 2016 +0530

Merge master code as on 2016_08_23 to amd-staging branch by praveeng

Changes to be committed:
modified: frame/thread/bli_mutex_openmp.h
modified: frame/thread/bli_mutex_pthreads.h

Change-Id: Ica522edbb1d0173f53f38d5057b1f7aef73666be

commit c6f5c215ee793d03ea834469fc2adc53feaffc42
Merge: d52cb767 16a4c7a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 22 17:33:02 2016 -0500

Merge branch 'master' into compose

commit f20ed3885d628992fab88690f629a5a2bab3eb88
Merge: 02ac597e 4bc842ca
Author: praveeng <praveen.gamd.com>
Date: Mon Aug 22 15:27:33 2016 +0530

Merge branch 'master' of https://github.com/clMathLibraries/blis-amd for "Fixed bugs in bli_mutex_init() and friends."

commit 02ac597e4b9be2670d9fff65d28552f8e1ec81b3
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 84e41cc73c9c87ce64582acd4264b8e1b5316482
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 30ccfcee82db93d0109d1571242e2db925e95d0a
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit aeca25cd63fc8971f8fe7809599c57853f976548
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 6b2274864b36fd1019d97bcc4ca6dd7a57ef16d9
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit daa7a9ecb25982f2551adbd95e65f8ba97cfe944
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 5f66a4aa05aeffcb6eb587851d78d9527319466c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit c6cbd78d2388c08824822b91a1c36ac4349bb67f
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 9219a9060762525f87ebbf556d78fe8621858513
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 728573296efa7cf14d2381570e116509dfe2a240
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit ad7862e291c240505c733a41d231b1a126ade73c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit ad4b471a25ce77867295e5529dfc787e7c18b03f
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 55d641363fcd8bdfdabbd7c22822fa2d0b7f3fa6
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit f3b6b15f6d591d323802bd6c81c522a02056506d
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 16a4c7a823d60707ed9272f5d36e5c5d54c0ba4b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 19 11:38:36 2016 -0500

Fixed bugs in bli_mutex_init() and friends.

Details:
- Fixed a couple of bugs that affected OpenMP and POSIX threads
configurations that resulted in compiler errors and warnings due
to type mismatch, and in the case of pthreads, a missing function
argument. The bugs are fairly recent, introduced in a017062.

commit c8e4ef93953ba2b79fb7e0973c08469c0e28a2cd
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:13:03 2016 -0500

Add prefetchw to 30x8 kernel.

commit 4b5a2f3d6e7ffeb5cc2be8448554f5c2083ad68f
Merge: 380736bf 9f52a587
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:09:51 2016 -0500

Merge remote-tracking branch 'origin/knl' into knl

Conflicts:
kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c

commit 380736bfe955efbdd7274c90b6fd635688e83bc4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:08:28 2016 -0500

Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.

commit 9f52a587dee855daa73c194e41b6951416544e9a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 16:03:53 2016 -0500

Try prefetchw[t1] instead of regular prefetch for C.

commit 8945a1512d366bc6a8a85718d12cbf5de6f2898b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Aug 3 11:28:24 2016 -0500

This version gets ~1550 GFLOPs on KNL with 16x4.

commit cdfb3c3f29d321033fca106aa58ab67ead90a95d
Merge: 50a2f2ef 4bc842ca
Author: praveeng <praveen.gamd.com>
Date: Fri Jul 29 12:45:04 2016 +0530

Merge master code as on 2016_07_29 to amd-staging branch by praveeng

Change-Id: Ic78b84d8b8d10158fb2a612f9a64bbc7b1f9b486

commit 4bc842ca3a64e658c0808bfe4c5693a5ace97923
Merge: 117f8838 b0d510bf
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 17:32:12 2016 +0530

Merge branch 'master' of publicrepo

commit 117f8838511a478aa16137e770d27dd21f4227c5
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 2fcdc28f1055d385b2e662aa920fb97c472394d7
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 1b5d104afe0628b8b6c0650f1e58cfb08be67004
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit d81273047bff56501e9413a90991d3d1f8b56a06
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 65905c3011a11cda95761681d4ae84337e46bdb5
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 23cca231be10fe1797aed451bcbc69d38c78bc0c
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 922e3091702f25e3287b417719a33adbd5bbf138
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit b0d510bf0e4dfd177f9e4ae0069f41921e2ecdc1
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:11:08 2016 +0530

Revert commits 357c990bdd7bd5667aac5adf1bab3712973e7414

Change-Id: I12a34456d7eed93fda4369e76bcddb42ba7ccb99

commit 5ebeece5b4a8df81d59ca7558b278a4263d15128
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 28 15:01:36 2016 +0530

Revert commits 8aee306

Change-Id: I3dd999c77c6779332a40dbb84371ca487216f189

commit 6ce4c022ebdea00c2b951090e3c2e9e88735b9ce
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 16:26:36 2016 -0500

Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.

commit d52cb7671509592a8078729477b40b60380518a2
Merge: 95abea46 c31b1e7b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 16:04:55 2016 -0500

Merge branch 'master' into compose

commit c31b1e7b9d659b96433a87e5aecb90e457a104cc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 27 15:58:07 2016 -0500

Relax alignment restrictions for sandybridge ukrs.

Details:
- Relaxed the base pointer and leading dimension alignment restrictions
in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
instead of vmovaps/vmovapd. These changes mimic those made to the haswell
microkernels in e0d2fa0 and ee2c139.
- Updated testsuite modules as well as standalone test drivers in 'test'
directory to use DBL_MAX as the initial time candidate. Thanks to Devin
Matthews for suggesting this change.
- Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
- Minor update (vis-a-vis contexts) to driver code in test/3m4m.
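
The DBL_MAX change can be illustrated with a small best-of-N sketch. The function name best_time() and the samples array are assumptions made for this example; in a real driver each sample would be a measured wall-clock time. Because every finite measurement is less than or equal to DBL_MAX, the first sample always replaces the initial candidate, however slow the run was.

```c
#include <assert.h>
#include <float.h>

/* Return the smallest of n_repeats timing samples. If n_repeats is 0,
   the initial candidate DBL_MAX is returned unchanged. */
static double best_time(const double *samples, int n_repeats)
{
    assert(n_repeats >= 0);
    double best = DBL_MAX;
    for (int r = 0; r < n_repeats; ++r)
        if (samples[r] < best)
            best = samples[r];
    return best;
}
```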

commit b8f2b55532849d45d379afbdd05a52ff6100800d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 15:22:55 2016 -0500

Try an 8x24 kernel for the hell of it.

commit 7ede5863ae3567f7c0852efc2d5cd649ca19e0f3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 13:41:27 2016 -0600

Allocate pack buffer on MCDRAM for KNL.

commit ad89ed2e829c7b261d8ba0998a3cb83ad576ee04
Merge: 2c9de740 81e2b05f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:45:40 2016 -0500

Merge branch 'knl' of github.com:devinamatthews/blis into knl

commit 2c9de740edb66c4692c200731763bbd1d3171ccb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:44:54 2016 -0500

This version gets ~26GF on one core.

commit 81e2b05f31bca4e1e1676e7b533d1868d9f9be33
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jul 27 11:39:05 2016 -0500

Add optimized packing kernels for KNL.

commit a7d8ca97b8d835c32d90ff20a565c82733f014a8
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 15:15:13 2016 -0500

All fixed.

commit 963d0393b023f4134bb0c682923faf9964c0e645
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 14:40:53 2016 -0500

Add 24xk pack kernel.

commit 117b76739afba481768897d2580f8365d3345417
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 13:53:07 2016 -0500

In the midst of debugging.

commit 8c0a4fd1d3535d608a9a309a61ffee0a73c3646f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 13:09:24 2016 -0500

Fix some row/column confusion.

commit c44f9f96930312125b15e64c326ab5ab5cc02633
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 12:02:24 2016 -0500

Simplify displacements -- clang assembler was badly botching EVEX compressed displacements giving false alarms for instruction length.

commit e0cce177cc1b47ec9f11ac0556241feaa3564df1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Jul 25 10:02:25 2016 -0500

Minor fixes for 8x24 KNL kernel.

commit 50a2f2efcbeb46537f1deaa8e44dc579a4e49eb8
Merge: 1aa77dfc cfd46c88
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 17:01:20 2016 +0530

Merge master code as on 2016_07_25 to amd-staging branch by praveeng

Change-Id: I84886ae241db2aac0bef6b7ef399f04aa8bca16d

commit cfd46c88d59c8f61d5e7cf768d606e4c44623584
Merge: f493bf4d a017062f
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 15:38:13 2016 +0530

Merge remote-tracking branch 'publicrepo/master'

commit f493bf4d704fe0e967783cd6e6877d3302c056a1
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 25 14:14:00 2016 +0530

removed changes from readme file which are giving conflicts

Change-Id: Ic71ad1313e1404fed444e899466043704d875af6

commit 65735bbedf75784c48bd11e05b3fdc98fc66b4bc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jul 24 21:50:32 2016 -0500

Switch to 24x8 kernel, unrolled by 16.

commit 45d5dc97177117220bd9dd0abf85aafc185acad1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sun Jul 24 14:25:26 2016 -0500

Add 24x8 "KNC-style" kernel for KNL.

commit 95abea46f86816fddfc9ff0abfa52880801461be
Merge: d0dfe5b5 a017062f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Jul 23 15:38:33 2016 -0500

Merge branch 'master' into compose

commit a017062fdf763037da9d971a028bb07d47aa1c8a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 17:02:59 2016 -0500

Integrated "memory broker" (membrk_t) abstraction.

Details:
- Integrated a patch originally authored and submitted by Ricardo Magana
of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
(memory broker) that allows multiple sets of memory pools on, for example,
separate NUMA nodes, each of which has a separate memory space.
- Added membrk field to cntx_t and defined corresponding accessor macros.
- Added membrk field to mem_t object and defined corresponding accessor macros.
- Created new bli_membrk.c file, which contains the new memory broker API,
including:
bli_membrk_init(), bli_membrk_finalize()
bli_membrk_acquire_[mv](), bli_membrk_release(),
bli_membrk_init_pools(), bli_membrk_reinit_pools(),
bli_membrk_finalize_pools(),
bli_membrk_pool_size()
- In bli_mem.c, changed function calls to
bli_mem_init_pools() -> bli_membrk_init()
bli_mem_reinit_pools() -> bli_membrk_reinit()
bli_mem_finalize_pools() -> bli_membrk_finalize()
- In bli_packv_init.c, bli_packm_init.c, changed function calls to:
bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
bli_mem_release() -> bli_membrk_release()
- Added bli_mutex.c and related files to frame/thread. These files define
abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
single-threaded execution. This new API is employed within functions
such as bli_membrk_acquire_[mv]() and bli_membrk_release().
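
A rough sketch of the two ideas in this commit, a lock abstraction that compiles to either pthreads or a no-op, and a broker that hands out blocks from its own pool under that lock, is given below. Every identifier prefixed with sk_ or SK_ is hypothetical, the pool is a fixed array rather than anything resembling the real pool implementation, and the OpenMP variant is omitted.

```c
#include <assert.h>
#include <stddef.h>

#ifdef SK_USE_PTHREADS
#include <pthread.h>
typedef pthread_mutex_t sk_mutex;
static void sk_mutex_init(sk_mutex *m)     { pthread_mutex_init(m, NULL); }
static void sk_mutex_lock(sk_mutex *m)     { pthread_mutex_lock(m); }
static void sk_mutex_unlock(sk_mutex *m)   { pthread_mutex_unlock(m); }
static void sk_mutex_finalize(sk_mutex *m) { pthread_mutex_destroy(m); }
#else  /* single-threaded build: locking is a no-op */
typedef int sk_mutex;
static void sk_mutex_init(sk_mutex *m)     { *m = 0; }
static void sk_mutex_lock(sk_mutex *m)     { (void)m; }
static void sk_mutex_unlock(sk_mutex *m)   { (void)m; }
static void sk_mutex_finalize(sk_mutex *m) { (void)m; }
#endif

#define SK_NUM_BLOCKS 4
#define SK_BLOCK_SIZE 1024

/* Each broker owns a separate pool, so several brokers can coexist
   (for example, one per memory space). */
typedef struct {
    sk_mutex      lock;
    unsigned char storage[SK_NUM_BLOCKS][SK_BLOCK_SIZE];
    void         *free_list[SK_NUM_BLOCKS];
    int           n_free;
} sk_broker;

static void sk_broker_init(sk_broker *b)
{
    sk_mutex_init(&b->lock);
    for (int i = 0; i < SK_NUM_BLOCKS; ++i)
        b->free_list[i] = b->storage[i];
    b->n_free = SK_NUM_BLOCKS;
}

/* Check out one block, or return NULL if the pool is exhausted. */
static void *sk_broker_acquire(sk_broker *b)
{
    void *block = NULL;
    sk_mutex_lock(&b->lock);
    if (b->n_free > 0)
        block = b->free_list[--b->n_free];
    sk_mutex_unlock(&b->lock);
    return block;
}

/* Return a previously acquired block to the broker it came from. */
static void sk_broker_release(sk_broker *b, void *block)
{
    sk_mutex_lock(&b->lock);
    assert(b->n_free < SK_NUM_BLOCKS);
    b->free_list[b->n_free++] = block;
    sk_mutex_unlock(&b->lock);
}

static void sk_broker_finalize(sk_broker *b)
{
    sk_mutex_finalize(&b->lock);
}
```

Keeping the lock inside the broker means that two brokers never contend with each other, which is one plausible reason for pairing the mutex API with the broker in the same change.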

commit 8ff2e069c48c12fd06b9c48c6b3aeb4ea9b0e6e1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 16:22:26 2016 -0500

Add 4x unrolled variant for KNL microkernel.

commit 9cb2ed9b0c25f31a22c1c9719b062fa665ad7adf
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 16:10:30 2016 -0500

Get rid of one RBX update.

commit 451bde076f0320d60cd2475cfb048ac4a2b798bb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 15:43:00 2016 -0500

Add some more knobs to twiddle for KNL microkernel.

commit 8c6e621c099521e7a4d87e007bb8224faa5f33a3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 15:05:15 2016 -0500

Make knl conform to new kernel dir structure.

commit ce7214c6618d6f22f4ce2ee452336236916d1f30
Merge: 119d0399 ce59f811
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 14:59:53 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

commit ce59f81108ec9aea918a7e77030da8acfdd397ce
Merge: ff41153f 707a2b7f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 14:48:14 2016 -0500

Merge pull request 88 from devinamatthews/32bit-dim_t

Handle 32-bit dim_t in 64-bit microkernels.

commit 707a2b7faca137cca7cab7b11a12c44ddaf7ad53
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:49:44 2016 -0500

Somehow forgot the most important microkernel.

commit 47ec045056351ac4f0791c071fa0daaa81699c8c
Merge: 08f1d6b6 ff41153f
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:45:23 2016 -0500

Merge remote-tracking branch 'upstream/master' into 32bit-dim_t

commit 08f1d6b6fa344275de0f675f69737145ccf6646a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 13:44:37 2016 -0500

Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit.
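
Illustration (not the BLIS source): a small sketch of why the intermediate variable matters. If a kernel issues a 64-bit load at the address of a 32-bit k, it also reads whatever happens to sit next to k in memory. Copying k into a 64-bit variable first makes the 8-byte load well defined. Here sketch_load64() stands in for the load instruction, and sketch_dim_t is an assumed 32-bit dim_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: the build uses a 32-bit dim_t. */
typedef int32_t sketch_dim_t;

/* Stand-in for a 64-bit load instruction: reads 8 bytes starting at p. */
uint64_t sketch_load64(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Widen k to 64 bits so that a 64-bit load sees exactly the value of k. */
uint64_t sketch_widen_k(sketch_dim_t k)
{
    assert(k >= 0);
    return (uint64_t)k;
}
```

In the kernel itself, the widened copy would be the operand handed to the inline assembly in place of k.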

commit ff41153f4eb7f38ed94bdd9a3fd81fb979f3f401
Merge: f9214ced e0d2fa0d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 13:21:03 2016 -0500

Merge pull request 86 from devinamatthews/haswell-vmovups

Remove alignment restrictions on C in haswell kernel.

commit e0d2fa0d835ab49366aeb790363bb2b571d36ed8
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 12:56:51 2016 -0500

Relax alignment restrictions for haswell sgemm.

commit f9214ced97392861f5a0ea72abfcf6f41faf674c
Merge: 413d62ac 08666eaa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jul 22 12:16:39 2016 -0500

Merge pull request 85 from devinamatthews/qopenmp

Change -openmp to -fopenmp for icc.

commit ee2c139df6ad53c6aec8a67ab23b3b1912e8d259
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 12:06:03 2016 -0500

Remove alignment restrictions on C in haswell kernel.

commit 08666eaa20d8a31f2f92f944e5bfa7c1558c53e4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 11:07:34 2016 -0500

Change -openmp to -fopenmp for icc.

commit 119d0399428905053265f3aca1cc8cc1fde3b363
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Jul 22 10:23:31 2016 -0500

Add 8x24 KNL kernel.

commit 1aa77dfc1dc183d16e0b6a1196d9c263f021e83d
Merge: 9101a9c8 ec9f5983
Author: praveeng <praveen.gamd.com>
Date: Thu Jul 21 14:22:40 2016 +0530

Merge master code as on 2016_07_21 to amd-staging branch by praveeng

Change-Id: Ic7d0a21101358f08147736e7f1884e7409937344

commit b58cda9eba0c1e175460aae109baf792d29ba5bf
Merge: 318f063d 413d62ac
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Jul 19 14:09:09 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

Conflicts:
frame/base/bli_threading.h
frame/include/blis.h
frame/thread/bli_thread.c

commit ec9f59836b32260c29ff1cd24e629c7d8de14992
Merge: 197e182f 763babe4
Author: praveeng <praveen.gamd.com>
Date: Mon Jul 18 12:56:25 2016 +0530

Merge branch 'master' of https://github.com/clMathLibraries/blis-amd

commit 197e182fcbf1340fd4a202fac58bea6cfcfa9e2f
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 41fb32711031e7ec86b062aa7f53255d1f5905e2
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit d0dfe5b5372cc7558ee9c4104b29f82eecc7ed61
Merge: 31def12e 413d62ac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 14 11:01:06 2016 -0500

Merge branch 'master' into compose

commit 9101a9c880e3934f8a63ffc7fe15f5fc1077a73d
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Jul 13 16:51:14 2016 +0530

Checked in optimized 1V kernels along with benchmark codes. Also incorporated review comments for 1F kernels

Change-Id: I035c0d39e6b0bed28e6e2041242186c49f6ed55b

commit 763babe488880b42c86c7fc207aa7665bd0ff9f7
Merge: 357c990b 413d62ac
Author: praveeng <praveen.gamd.com>
Date: Wed Jul 13 11:57:19 2016 +0530

Merge remote-tracking branch 'publirepo/master'

commit 413d62aca28edabba56605a9f87d5b715831e1db
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 12 15:02:52 2016 -0500

README update (use official ACM TOMS links).

commit dfa431f696db2df4065ea454df268a2e0bc02eac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 12 14:21:19 2016 -0500

README update (BLIS2 TOMS article now in-print).

commit 357c990bdd7bd5667aac5adf1bab3712973e7414
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 16:51:23 2016 +0530

first commit

Change-Id: Ib50c81acda3b2c1583da3d421efc0ca547ef68e2

commit 8aee306300adb099b66036f2c2f7f3996433cf49
Author: praveeng <praveen.gamd.com>
Date: Tue Jul 5 15:00:31 2016 +0530

small modification to readme for git push test

Change-Id: I68506a49586b07eaa907f3f85304ee40d4c92d0a

commit 31def12e2629f187e40f93f6bae9e26a6c2660e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 30 15:19:20 2016 -0500

First phase of control tree redesign.

Details:
- These changes constitute the first set of changes in preparation for
revamping the structure and use of control trees in BLIS. Modifications
in this commit don't affect the control tree code yet, but rather lay
the groundwork.
- Defined wrappers for the following functions, where the wrappers
each take a direction parameter of a new enumerated type (BLIS_BWD or
BLIS_FWD), dir_t, and execute the correct underlying function.
- bli_acquire_mpart_*() and _vpart_*()
- bli_*_determine_kc_[fb]()
- bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
- Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
blocked variants for trmm and trsm, and renamed gemm and herk variants
accordingly. The direction is now queried via routines such as
bli_trmm_direct(), which determines the direction from the implied side
and uplo parameters. For gemm and herk, it is unconditionally BLIS_FWD.
- Defined wrappers to parameter-specific macrokernels for herk, trmm, and
trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
macrokernel based on the implied parameters. The same logic used to
choose the dir_t in _direct() functions is used here.
- Simplified the function pointer arrays in _int() functions given the
consolidation and dir_t querying mentioned above.
- Function signature (whitespace) reformatting for various functions.
- Removed old code in various 'old' directories.
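
Illustration (not the BLIS source): a minimal sketch of the wrapper pattern described above, in which a single function takes a direction argument and dispatches to the forward or backward helper. The sketch_* names are hypothetical, and the choice of where the partial edge block falls in the backward case is an assumption of this example, not a statement about BLIS.

```c
#include <assert.h>

/* Hypothetical direction type mirroring the BLIS_FWD / BLIS_BWD idea. */
typedef enum { SKETCH_FWD, SKETCH_BWD } sketch_dir_t;

/* Forward: block i starts i*b elements from the front. */
static void sketch_range_f(int n, int b, int i, int *start, int *len)
{
    *start = i * b;
    *len   = (n - *start < b) ? n - *start : b;
}

/* Backward: block i ends i*b elements before the end. */
static void sketch_range_b(int n, int b, int i, int *start, int *len)
{
    int end = n - i * b;
    *len    = (end < b) ? end : b;
    *start  = end - *len;
}

/* Wrapper: callers pass a direction instead of choosing a variant by name. */
void sketch_range(sketch_dir_t dir, int n, int b, int i, int *start, int *len)
{
    assert(b > 0 && i >= 0 && i * b < n);
    if (dir == SKETCH_FWD) sketch_range_f(n, b, i, start, len);
    else                   sketch_range_b(n, b, i, start, len);
}
```

With this shape, a blocked variant can be written once and reused for both traversal orders, which is the consolidation the commit describes for trmm and trsm.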

commit 405c9d46344d93c3eab5572b233900b50ca50d68
Author: sthangar <Santanu.Thangarajamd.com>
Date: Wed Jun 22 12:18:54 2016 +0530

Check-in the fused kernels optimized for Zen

Change-Id: I7b2f467b960e7b9a285f06e47be87de122e5fa24

commit 232754feecf29452987666b9f5ebba2619bfd0b0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 21 14:25:39 2016 -0500

Fixed compiler warning in rand[vm], randn[vm].

Details:
- Fixed compiler warnings about unused variables related to the disabling
of normalization in the structured cases of the rand[vm] and randn[vm]
operations.

commit a89555d1605574f3685813dcc972b636dd61264d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 17 14:08:35 2016 -0500

Added randn[vm] operations, support in testsuite.

Details:
- Defined a new randomization operation, randn, on vectors and matrices.
The randnv and randnm operations randomize each element of the target
object with values from a narrow range of values. Presently, those
values are all integer powers of two, but they do not need to be powers
of two in order to achieve the primary goal, which is to initialize
objects that can be operated on with plenty of precision "slack"
available to allow computations that avoid roundoff. Using this method
of randomization makes it much more likely that testsuite residuals of
properly-functioning operations are close to zero, if not exactly zero.
- Updated existing randomization operations randv and randm to skip
special diagonal handling and normalization for matrices with structure.
This is now handled by the testsuite modules by explicitly calling a
testsuite function that loads the diagonal (and scales off-diagonal
elements).
- Added support for randnv and randnm in the testsuite with a new switch
in input.general that universally toggles between use of the classic
randv/randm, which use real values on the interval [-1,1], and
randnv/randnm, which use only values from a narrow range. Currently,
the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
well as 0.0.
- Updated testsuite modules so that a testsuite wrapper function is called
instead of directly calling the randomization operations (such as
bli_randv() and bli_randm()). This wrapper also takes a bool_t that
indicates whether the object's elements should be normalized. (NOTE: As
alluded to above, in the test modules of triangular solve operations such
as trsv and trsm, we perform the extra step of loading the diagonal.)
- Defined a new level-0 operation, invertsc, which inverts a scalar.
- Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
but possible divide-by-zero.
- Updated function signature and prototype formatting in testsuite.

commit 318f063dcbd8b594969e401bc99146d24b01066a
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Jun 8 17:46:50 2016 -0500

Add new KNL microkernel derived from Haswell.

commit 096895c5d538a7f8817603d7cf28c52e99340def
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Jun 6 13:32:04 2016 -0500

Reorganized code, APIs related to multithreading.

Details:
- Reorganized code and renamed files defining APIs related to multithreading.
All code that is not specific to a particular operation is now located in a
new directory: frame/thread. Code is now organized, roughly, by the
namespace to which it belongs (see below).
- Consolidated all operation-specific *_thrinfo_t object types into a single
thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
functions (aside from a few general purpose bli_thrinfo_*() functions).
- Renamed thread_comm_t object type to thrcomm_t.
- Renamed many of the routines and functions (and macros) for multithreading.
We now have the following API namespaces:
- bli_thrinfo_*(): functions related to thrinfo_t objects
- bli_thrcomm_*(): functions related to thrcomm_t objects.
- bli_thread_*(): general-purpose functions, such as initialization,
finalization, and computing ranges. (For now, some macros, such as
bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
appropriate.)
- Renamed thread-related macros so that they use a bli_ prefix.
- Renamed control tree-related macros so that they use a bli_ prefix (to be
consistent with the thread-related macros that were also renamed).
- Removed undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
undef was a temporary fix to some macro defaults which were being applied
in the wrong order, which was recently fixed.

commit 232530e88ff99f37abcae5b6fb5319a9a375a45f
Merge: 4bcabd1b eef37f8b
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Jun 1 15:14:10 2016 -0500

Merge commit 'refs/pull/81/head' of https://github.com/flame/blis

Conflicts:
frame/base/bli_threading_pthreads.c
frame/base/bli_threading_pthreads.h

commit 4bcabd1bf60688c38cf562459fc5e8be8b831756
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Jun 1 13:27:28 2016 -0500

Use spin locks instead of pthread barriers

commit eef37f8b4d81845a6ba4bf25586d32b50c3e8a68
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Sun May 29 22:28:13 2016 -0700

use GCC intrinsic instead of pthread_mutex for atomic increment and fetch
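
Illustration (not the BLIS source): a minimal sketch of an atomic increment-and-fetch built on a GCC builtin in place of a mutex-protected counter. The commit does not say which intrinsic it uses; __sync_add_and_fetch is assumed here, and the wrapper name is hypothetical.

```c
#include <assert.h>

/* Atomically add 1 to *counter and return the updated value. */
int sketch_atomic_inc_fetch(int *counter)
{
    assert(counter != 0);
    return __sync_add_and_fetch(counter, 1);
}
```

Compared with locking a pthread_mutex around counter++, this avoids the lock and unlock calls entirely and leaves the update to the hardware's atomic instructions.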

commit 9dcd6f05c4c3ff2ce7cd87a9951a96ebef22681e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 24 13:15:32 2016 -0500

Implemented developer-configurable malloc()/free().

Details:
- Replaced all instances of bli_malloc() and bli_free() with one of:
- bli_malloc_pool()/bli_free_pool()
- bli_malloc_user()/bli_free_user()
- bli_malloc_intl()/bli_free_intl()
each of which can be configured to call malloc()/free() substitutes,
so long as the substitute functions have the same function type
signatures as malloc() and free() defined by C's stdlib.h. The _pool()
function is called when allocating blocks for the memory pools (used
for packing buffers, primarily), the _user() function is called when
obj_t's are created (via bli_obj_create() and friends), and the _intl()
function is called for internal use by BLIS, such as when creating
control tree nodes or temporary buffers for manipulating internal data
structures. Substitutes for any of the three types of bli_malloc() may
be specified by defining the following pairs of cpp macros in
bli_kernel.h:
- BLIS_MALLOC_POOL/BLIS_FREE_POOL
- BLIS_MALLOC_USER/BLIS_FREE_USER
- BLIS_MALLOC_INTL/BLIS_FREE_INTL
to be the name of the substitute functions. (Obviously, the object
code that contains these functions must be provided at link-time.)
These macros default to malloc() and free(). Substitute functions are
also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
- Removed definitions for bli_malloc() and bli_free().
- Note that bli_malloc_pool() and bli_malloc_user() are now defined in
terms of a new function, bli_malloc_align(), which aligns memory to an
arbitrary (power of two) alignment boundary, but does so manually,
whereas before alignment was performed behind the scenes by
posix_memalign(). Currently, bli_malloc_intl() is defined in terms
of bli_malloc_noalign(), which serves as a simple wrapper to the
designated function that is passed in (e.g. BLIS_MALLOC_INTL).
Similarly, there are bli_free_align() and bli_free_noalign(), which
are used in concert with their bli_malloc_*() counterparts.
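
Illustration (not the BLIS source): a minimal sketch of manual alignment on top of a substitutable malloc()/free() pair, in the spirit of bli_malloc_align() and bli_free_align(). The SKETCH_MALLOC/SKETCH_FREE macros and the technique of stashing the original pointer just below the aligned address are assumptions of this example; the commit does not describe how BLIS records the original pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Substitutable allocator pair (hypothetical macro names). Any functions
   with the same signatures as malloc() and free() may be plugged in. */
#ifndef SKETCH_MALLOC
#define SKETCH_MALLOC malloc
#endif
#ifndef SKETCH_FREE
#define SKETCH_FREE free
#endif

/* Allocate size bytes aligned to align, which must be a power of two. */
void *sketch_malloc_align(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    if (align < sizeof(void *)) align = sizeof(void *);

    /* Over-allocate: room for the saved pointer plus worst-case padding. */
    unsigned char *raw =
        (unsigned char *)SKETCH_MALLOC(size + align + sizeof(void *));
    if (raw == NULL) return NULL;

    /* Round up to the next multiple of align past the saved-pointer slot. */
    uintptr_t addr = (uintptr_t)raw + sizeof(void *);
    addr = (addr + align - 1) & ~(uintptr_t)(align - 1);

    /* Stash the original pointer immediately below the aligned address. */
    ((void **)addr)[-1] = raw;
    return (void *)addr;
}

/* Free a pointer obtained from sketch_malloc_align(). */
void sketch_free_align(void *p)
{
    if (p != NULL) SKETCH_FREE(((void **)p)[-1]);
}
```

The aligned and unaligned variants must be paired correctly: a pointer from sketch_malloc_align() is not the pointer the underlying allocator returned, so passing it directly to free() would be an error.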

commit 9dd440109a9d964f5cd286e9f83c487ad703e1e4
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Sat May 21 15:21:58 2016 -0700

fix 404 link to BuildSystem

Google Code is dead. Long live GitHub!

commit d309f20b7376a68efa3b864ad790c2021c071655
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 18 15:13:53 2016 -0500

Added alignment switch to testsuite.

Details:
- Added a new input parameter to input.general that globally toggles
whether testsuite tests are performed on objects whose buffers and
leading dimensions have been aligned, and changed the implementation
of libblis_test_mobj_create() to employ alignment (or not) regardless
of whether row, column, or general storage is being tested.
- Updated configure script's "--help" text to indicate default behavior
for internal integer type size and BLAS/CBLAS integer type size
options.

commit 32db0adc218ea4ae370164dbe8d23b41cd3526d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 17 15:20:16 2016 -0500

Generate prototypes for user-defined packm kernels.

Details:
- Created template prototypes for packm kernels (in bli_l1m_ker.h), and
then redefined reference packm kernels' prototyping headers in terms of
this template, as is already done for level-1v, -1f, and -3 kernels.
- Automatically generate prototypes for user-defined packm kernels in
bli_kernel_prototypes.h (using the new template prototypes in
bli_l1m_ker.h).
- Defined packm kernel function types in bli_l1m_ft.h, including for
packm kernels specific to induced methods, which are now used in
bli_packm_cxk.c and friends rather than using a locally-defined
function type.
- In bli_packm_cxk.c, extended the function pointer array for packm kernels
out to index 31 (from previous maximum of 17). This allows us to
store the unrolled 30xk kernel in the array for use (on knc, for
example). Note: This should have been done a long time ago.

commit e3bd5ca64ae7c190ba689396c0de687b829a11fe
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 12 20:54:13 2016 -0500

Fix SIMD definitions in KNL config, and a couple of fixes to C update.

commit 4fe02e3d497995d94d34d3fcf5af895084cfc8b9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu May 12 20:53:58 2016 -0500

Move bli_kernel.h before bli_threading.h in order of inclusion in blis.h.

commit 4bcf1b35abea3f3dfc8f2fe462dcf155cf199e55
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 11 16:09:49 2016 -0500

Fixed bli_get_range_*() bugs in trsm variants.

Details:
- Fixed incorrect calls to bli_get_range_*() from within trsm blocked
variants 1f, 2b, and 2f. The bug somehow went undetected since the
big commit (537a1f4), and, strangely, did not manifest via the BLIS
testsuite. The bug finally came to our attention when running the
libflame test suite while linking to BLIS. Thanks to Kiran Varaganti
for submitting the initial report that led to this bug.

commit 9cfa33023f123a6c17e987f72fba174ce073f0b6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 11 16:02:30 2016 -0500

Minor updates to bli_f2c.h.

Details:
- Added undef guards to certain define statements in bli_f2c.h,
and renamed the file guard to BLIS_F2C_H. This helps when
including "blis.h" from an application or library that already
includes an "f2c.h" header.

commit a09a2e23eacf5328858c8318bb637c5ff3b71d08
Merge: 4dcd37eb 7c604e1c
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed May 11 10:47:11 2016 -0500

Merge pull request 76 from devinamatthews/move_simd_defs

Move default SIMD-related definitions to bli_kernel_macro_defs.h

commit 4dcd37eb1b12a6e08cc13df7b61391ef8363f5d8
Author: Tyler Smith <tmscs.utexas.edu>
Date: Tue May 10 16:28:59 2016 -0500

fixing knc simd align size

commit 619dee0daec3474b4e5a55df90a61aabcae194f2
Merge: b790b3d9 7c604e1c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 12:13:24 2016 -0500

Merge branch 'move_simd_defs' into knl

commit 7c604e1cbc1609b6e12d3ee973c08b7af5035be4
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 12:11:55 2016 -0500

Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h.

commit b790b3d9e1820f3b691676de48c291cae083452d
Merge: 4f8c05c9 a7be2d28
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 11:49:47 2016 -0500

Merge branch 'master' into knl

commit a7be2d28e8930b154d0da1d6929b54a96e210af6
Merge: 97b512ef 4b1e55ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue May 10 11:48:51 2016 -0500

Merge pull request 74 from devinamatthews/fix_common_symbols

Default-initialize all extern global variables to avoid generating common symbols.

commit 4b1e55edbfe0e1cb2e7b9428424903497cb7a841
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue May 10 10:08:47 2016 -0500

Default-initialize all extern global variables to avoid generating common symbols. Fixes 73.

commit 97b512ef62c7e25c97ed5e9eca81cd7015b2ac91
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri May 6 10:24:30 2016 -0500

Include headers from cblas.h to pull in f77_int.

Details:
- Added include statements for certain key BLIS headers so that the
definition of f77_int is pulled in when a user compiles application
code with only include "cblas.h" (and no other BLIS header). This
is necessary since f77_int is now used within the cblas API.

commit c3a4d39d03665135f1616588b5ef7c3e9ef5688d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed May 4 17:22:56 2016 -0500

Updates to haswell gemm micro-kernels.

Details:
- Added two new sets of [sd]gemm micro-kernels for haswell architectures,
one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
- Changed the haswell configuration to use the 6x16/6x8 micro-kernels
by default.
- Updated various Makefiles, in test, test/3m4m, and testsuite.

commit 0b01d355ae861754ae2da6c9a545474af010f02e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 15:21:10 2016 -0500

Miscellaneous cleanups, fixes to recent commits.

Details:
- Fixed a typo in bli_l1f_ref.h, introduced into bbb8569, that only
manifested when non-reference level-1f kernels were used.
- Added an undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
configuration to prevent a compile-time warning until I can figure out
the proper permanent fix.
- Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
path (into 'other' directory). _ref_var2 is used by default, which is
the variant that is built on axpyf and dotxf instead of dotaxpyv.
- Removed section of frame/include/bli_config_macro_defs.h pertaining to
mixed datatype support.

commit ed7326c836f427e2f8420b015220ce293207b10c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 14:57:40 2016 -0500

Added 'restrict' to l1v/l1f code in 'kernels' dir.

Details:
- Added 'restrict' keyword to existing kernel definitions in 'kernels'
directory. These changes were meant for inclusion in bbb8569.

commit bbb8569b2a08c3bcd631d5a05eb389d01d94ac07
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 27 14:13:46 2016 -0500

Use 'restrict' in all kernel APIs; wspace changes.

Details:
- Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and
generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all
numerical operand pointers (ie: all pointers except the cntx_t).
- Updated level-1f reference kernel definitions to use 'restrict' for
all numerical operand pointers. (Level-1v reference kernel definitions
were already updated in bdbda6e.)
- Rewrote the level-1v and level-1f reference kernel prototypes in
bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply include
bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names
(as was already being done for the level-3 micro-kernel prototypes
in bli_l3_ref.h), rather than duplicate the signatures from the
_ker.h files.
- Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv
and xpbyv, which were probably meant for inclusion in bdbda6e.
- Converted a number of instances of four spaces, as introduced in
bdbda6e, to tabs.

commit 4ea419c72c789825e1f93a1eee88219bbf873930
Merge: f1e9be2a bdbda6e6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 26 12:50:45 2016 -0500

Merge pull request 70 from devinamatthews/daxpby

Give the level1v operations some love

commit bdbda6e6acc682ab1b6ca680edebd09ae12a832c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 25 11:05:57 2016 -0500

Give the level1v operations some love:

- Add missing axpby and xpby operations (plus test cases).
- Add special case for scal2v with alpha=1.
- Add restrict qualifiers.
- Add special-case algorithms for incx=incy=1.
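
Illustration (not the BLIS source): a minimal sketch of an axpby-style kernel (y := beta * y + alpha * x) that combines two of the items above, restrict-qualified operand pointers and a separate loop for the unit-stride case. The function name, the real double-precision type, and the restriction to non-negative increments are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* y := beta * y + alpha * x for n elements with strides incx and incy.
   This sketch assumes non-negative increments and non-overlapping x, y. */
void sketch_daxpbyv(size_t n,
                    double alpha, const double *restrict x, size_t incx,
                    double beta,  double *restrict y,       size_t incy)
{
    assert(n == 0 || (x != NULL && y != NULL));

    if (incx == 1 && incy == 1) {
        /* Unit-stride case: a plain contiguous loop that compilers
           can vectorize readily. */
        for (size_t i = 0; i < n; i++)
            y[i] = beta * y[i] + alpha * x[i];
    } else {
        /* General strided case. */
        for (size_t i = 0; i < n; i++)
            y[i * incy] = beta * y[i * incy] + alpha * x[i * incx];
    }
}
```

The restrict qualifiers promise the compiler that x and y do not alias, which is what allows loads of x to be reordered freely around stores to y.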

commit f1e9be2aba1a057eedb947bbae96848597777408
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 15:34:02 2016 -0500

Minor tweak to test/Makefile.

Details:
- Just committing a minor change to test/Makefile that has been lingering
in my local working copy for longer than I can remember.

commit aa0bceec277938328dabeb744680623f24fb0b61
Merge: 4136553f e2784b4c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 12:01:31 2016 -0500

Merge branch 'master' of github.com:flame/blis

commit 4136553f0d0661a668dfdb9edcd7ce1c5773dde7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 22 11:53:53 2016 -0500

Clear level-3 cntx_t's via memset() before use.

Details:
- In all level-3 operations' _cntx_init() functions, replaced calls to
bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
level-3 operations' _cntx_finalize() functions, removed calls to
bli_cntx_obj_finalize(), leaving those function definitions empty.
- Changed the definition of bli_cntx_obj_clear() so that the clearing
occurs via a single call to memset().
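
Illustration (not the BLIS source): a minimal sketch of clearing a context-like structure with a single memset() call. The sketch_cntx_t layout is a hypothetical stand-in and does not reflect the fields of the real cntx_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a context object. */
typedef struct {
    void  *kernels[4];
    size_t blocksizes[4];
    int    method;
} sketch_cntx_t;

/* Zero every field in one call rather than assigning them one by one. */
void sketch_cntx_clear(sketch_cntx_t *cntx)
{
    assert(cntx != NULL);
    memset(cntx, 0, sizeof(*cntx));
}
```

One consequence of this approach is that there is nothing to release afterwards, which is consistent with the _cntx_finalize() functions being left empty.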

commit 4f8c05c9e2ef4cbb82b35a3ebf1f0a0ac665830e
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Apr 21 10:00:59 2016 -0500

Rearrange KNL dgemm kernel again to streamline usage of ymm register. sgemm and dgemm now both working with Intel SDE.

commit e2784b4c921f706e756df3e146e20a4cb63f53e3
Merge: dd0ab1d9 a9b6c3ab
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 20 18:34:09 2016 -0500

Merge pull request 67 from devinamatthews/cblas-f77-int

Change CBLAS integer type to f77_int

commit a9b6c3abda6222a8b240361643932e83cf726c4f
Merge: e4c54c81 dd0ab1d9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 16:00:10 2016 -0500

Merge remote-tracking branch 'origin/master' into cblas-f77-int

Conflicts:
config/haswell/bli_config.h

commit e4c54c81463c2a19c9bb6b1f0f1be3fa9d018a45
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 15:56:46 2016 -0500

Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer.

commit dd0ab1d93f33abca6af9edd7b8e52da62dcfa5b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 20 14:38:23 2016 -0500

Converted some bli_cntx query functions to macros.

Details:
- Commented out several datatype-aware query functions (those ending in
_dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
added equivalent cpp query macros to bli_cntx.h.
- Added 'bli_config.h' to .gitignore.

commit 7193230f7d35edbd1d2f77842a613971f1603463
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 20 09:37:30 2016 -0500

Work around missing VPMULLQ on KNL.

commit a30ccbc4c6a6e6460e78af6b5c530ee0d06f98fb
Merge: eb2f18e4 0e1a9821
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 19 15:04:33 2016 -0500

Merge pull request 66 from devinamatthews/blas-configure

Add configure options and generate bli_config.h automatically.

commit bd44cf13e886069bc66c10ac0db178be96629a0d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Apr 19 13:43:04 2016 -0500

Fix copy-paste errors in KNL kernels.

commit eb2f18e4844d985715df20798f50f9cc12e3b5ad
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 19 12:50:32 2016 -0500

More compile-time fixes to bgq gemm ukernel code.

commit 0e1a9821d860f6c1d818baf4c48d21a23726c132
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Apr 19 11:44:37 2016 -0500

Add configure options and generate bli_config.h automatically.

Options to configure have been added for:
- Setting the internal BLIS and BLAS/CBLAS integer sizes.
- Enabling and disabling the BLAS and CBLAS layers.

Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.

Lastly, support for OMP in clang has been added (closes 56).

commit a11eec05928ddc5c43fa5dbcd35f2edd24ff35a1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 13:13:36 2016 -0500

Add sgemm ukernels for KNL. vpmullq is not implemented on KNL -- needs workaround.

commit ff84469a4575f1ef8a0010046fde52240a312cae
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 18 12:29:09 2016 -0500

Applied various compilation fixes to bgq kernels.

commit c38e0dab05b2dc36672eab96e1248fb7fb2d785b
Merge: bd5e2296 cbcd0b73
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:21:35 2016 -0500

Merge remote-tracking branch 'origin/master' into knl

commit bd5e2296e98e042c31f1e8ece2c1ca8e4bdc2d4c
Merge: 4745def0 49f85177
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:15:22 2016 -0500

Merge remote-tracking branch 'origin/knl' into knl

commit 4745def0c87377ae83ad73ac514d7de08a96b2ac
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:15:05 2016 -0500

Add 64-bit offset vector so we can use vgatherqpd.

commit 49f85177f886f38889b60503a4e12fa7f04be1fd
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 18 10:14:11 2016 -0500

KNL ukernel compiles with gcc.

commit cbcd0b739dc54bd14fbb46aeda267c26725cd70f
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Mon Apr 18 03:12:57 2016 -0500

Changing ifdef for OSX pthread barriers

commit 58b2c3cf040134d1be913c585a3c6905629116c0
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Apr 16 16:12:24 2016 -0500

Rewrite of KNL kernel in GNU extended asm syntax.

commit dd62080cea78f3a23616200d6640e52c102b2bb9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 15 11:15:41 2016 -0500

Compile-time fix to bgq l1f kernels.

Details:
- Fixed an old reference to bli_daxpyf_fusefac, which no longer exists,
by replacing it with the axpyf fusing factor (8), and cleaned up the
relevant section of config/bgq/bli_kernel.h.
- Removed most of the details of the level-3 kernels from the template
kernel code in config/template/kernels/3 and replaced it with a
reference to the relevant kernel wiki maintained on the BLIS github
website.

commit d5a915dd8d7a6ead42a68772e4420eb3647e6f1a
Merge: 4320b725 41694675
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 12:56:36 2016 -0500

Merge branch 'master' of github.com:flame/blis

commit 4320b725a1f8fd34101470b6cf52ad504a79c517
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 14 12:51:29 2016 -0500

Use kernel CFLAGS on "ukernels" directories.

Details:
- Updated the top-level Makefile so that the CFLAGS variable designated
for kernel source code is applied not only to source code in
directories named "kernels" but source code in any directory that
contains the substring "kernels", such as "ukernels".
- Formally disabled some code in gen-make-frag.sh script that was already
effectively disabled. The code was related to handling "noopt" and
"kernel" directories, which is now handled independently within the
top-level Makefile without needing to place these source files into
a separate makefile variable.

commit 41694675e4cb56e2e0323c7a7db48e0819606a31
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Apr 13 15:51:08 2016 -0500

pthreads bugfixes

Getting pthreads to work on my Mac
Implemented a pthread barrier when _POSIX_BARRIER isn't defined
Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
Add -lpthread instead of -pthread to LDFLAGS (for clang)

commit f756dbfa0d542cbc497724981520c83abf049c4b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 13 11:25:33 2016 -0500

Removed stale include from bgq configuration.

Details:
- Removed an old include statement ("bli_gemm_8x8.h") from the
bli_kernel.h file in the bgq configuration. It turns out this
file was no longer needed even prior to 537a1f4.

commit 0bd4169ea75f690714e7d2912229932a75d8a7e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 18:08:32 2016 -0500

Fixed context-broken dunnington/penryn kernels.

Details:
- Added missing context parameters to several instances where simpler
kernels, or reference kernels, are called instead of executing the
main body code contained in the kernel function in question.
- Renamed axpyv and dotv kernel files to use "opt" instead of "int"
substring, for consistency with level-1f kernels.

commit 7912af5db45b7372d19a9a3dfeb82df302a05628
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:32:13 2016 -0500

CHANGELOG update (0.2.0)

0.2.0

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:32:09 2016 -0500

Version file update (0.2.0)

commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Apr 11 17:21:28 2016 -0500

Implemented runtime contexts and reorganized code.

Details:
- Retrofitted a new data structure, known as a context, into virtually
all internal APIs for computational operations in BLIS. The structure
is now present within the type-aware APIs, as well as many supporting
utility functions that require information stored in the context. User-
level object APIs were unaffected and continue to be "context-free,"
however, these APIs were duplicated/mirrored so that "context-aware"
APIs now also exist, differentiated with an "_ex" suffix (for "expert").
These new context-aware object APIs (along with the lower-level, type-
aware, BLAS-like APIs) contain the address of a context as a last
parameter, after all other operands. Contexts, or specifically, cntx_t
object pointers, are passed all the way down the function stack into
the kernels and allow the code at any level to query information about
the runtime, such as kernel addresses and blocksizes, in a thread-
friendly manner--that is, one that allows thread-safety, even if the
original source of the information stored in the context changes at
run-time (see next bullet for more on this "original source" of info).
(Special thanks go to Lee Killough for suggesting the use of this kind
of data structure in discussions that transpired during the early
planning stages of BLIS, and also for suggesting such a perfectly
appropriate name.)
- Added a new API, in frame/base/bli_gks.c, to define a "global kernel
structure" (gks). This data structure and API will allow the caller to
initialize a context with the kernel addresses, blocksizes, and other
information associated with the currently active kernel configuration.
The currently active kernel configuration within the gks cannot be
changed (for now), and is initialized with the traditional cpp macros
that define kernel function names, blocksizes, and the like. However,
in the future, the gks API will be expanded to allow runtime management
of kernels and runtime parameters. The most obvious application of this
new infrastructure is the runtime detection of hardware (and the
implied selection of appropriate kernels). With contexts in place,
kernels may even be "hot swapped" at runtime within the gks. Once
execution enters a level-3 _front() function, the memory allocator will
be reinitialized on-the-fly, if necessary, to accommodate the new
kernels' blocksizes. If another application thread is executing with
another (previously loaded) kernel, it will finish in a deterministic
fashion because its kernel information was loaded into its context
before computation began, and also because the blocks it checked out
from the internal memory pools will be unaffected by the newer threads'
reinitialization of the allocator.
- Reorganized and streamlined the 'ind' directory, which contains much of
the code enabling use of induced methods for complex domain matrix
multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
those APIs' functionality is now mostly subsumed within the global
kernel structure.
- Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
that will reinitialize a memory pool if the necessary pool block size
has increased.
- Updated bli_mem.c to use bli_pool_reinit_if() instead of
bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
usage of contexts where appropriate to communicate cache and register
blocksizes to bli_mem_compute_pool_block_sizes().
- Simplified control trees now that much of the information resides in
the context and/or the global kernel structure:
- Removed blocksize object pointers (blksz_t*) fields from all control
tree node definitions and replaced them with blocksize id (bszid_t)
values instead, which may be passed into a context query routine in
order to extract the corresponding blocksize from the given context.
- Removed micro-kernel function pointers (func_t*) fields from all
control tree node definitions. Now, any code that needs these function
pointers can query them from the local context, as identified by a
level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
level-1v kernel id (l1vkr_t).
- Removed blksz_t object creation and initialization, as well as kernel
function object creation and initialization, from all operation-
specific control tree initialization files (bli_*_cntl.c), since this
information will now live in the gks and, secondarily, in the context.
- Removed blocksize multiples from blksz_t objects. Now, we track
blocksize multiples for each blocksize id (bszid_t) in the context
object.
- Removed the bool_t's that were required when a func_t was initialized.
These bools are meant to allow one to track the micro-kernel's storage
preferences (by rows or columns). This preference is now tracked
separately within the gks and contexts.
- Merged and reorganized many separate-but-related functions into single
files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
util directories, but has the most obvious effect of allowing BLIS
to compile noticeably faster.
- Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
in an attempt to reduce overhead for memory-bound operations. This
includes removal of default use of object-based variants for level-2
operations. Now, by default, level-2 operations will directly call a
low-level (non-object based) loop over a level-1v or -1f kernel.
- Converted many common query functions in bli_blksz.c (renamed from
bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
respective header files.
- Defined bli_mbool.c API to create and query "multi-bools", or
heterogeneous bool_t's (one for each floating-point datatype), in the
same spirit as blksz_t and func_t.
- Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
and BLIS_SIMD_SIZE. These values are needed in order to compute a third
new parameter, which may be set indirectly via the aforementioned
macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
statically allocate memory in macro-kernels and the induced methods'
virtual kernels to be used as temporary space to hold a single
micro-tile. These values are now output by the testsuite. The default
value of BLIS_STACK_BUF_MAX_SIZE is computed as
"2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
- Cleaned up top-level 'kernels' directory (for example, renaming the
embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
and "haswell," respectively) and gave more consistent and meaningful
names to many kernel files (as well as updating their interfaces to
conform to the new context-aware kernel APIs).
- Updated the testsuite to query blocksizes from a locally-initialized
context for test modules that need those values: axpyf, dotxf,
dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
- Reformatted many function signatures into a standard format that will
more easily facilitate future API-wide changes.
- Updated many "mxn" level-0 macros (ie: those used to inline double loops
for level-1m-like operations on small matrices) in frame/include/level0
to use more obscure local variable names in an effort to avoid variable
shadowing. (Thanks to Devin Matthews for pointing out these gcc warnings,
which are only output using -Wshadow.)
- Added a conj argument to setm, so that its interface now mirrors that
of scalm. The semantic meaning of the conj argument is to optionally
allow implicit conjugation of the scalar prior to being populated into
the object.
- Deprecated all type-aware mixed domain and mixed precision APIs. Note
that this does not preclude supporting mixed types via the object APIs,
where it produces absolutely zero API code bloat.
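
The context mechanism described in this commit can be illustrated with a
minimal sketch. All names below (toy_cntx_t, toy_gks, toy_cntx_init,
toy_scale_ex) are hypothetical and are not BLIS APIs; the sketch only
assumes what the bullets above state: a context is filled from a global
structure before computation begins, and it is passed as the last
parameter of an "_ex"-style function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: scale x[0..n-1] by alpha. */
typedef void (*toy_scal_ker_t)(size_t n, double alpha, double *x);

/* Hypothetical stand-in for a context: a snapshot of runtime info. */
typedef struct {
    toy_scal_ker_t scal_ker;  /* kernel address */
    size_t         blocksize; /* elements handled per kernel call */
} toy_cntx_t;

/* Reference kernel used by the default configuration. */
void toy_scal_ref(size_t n, double alpha, double *x)
{
    size_t i;
    for (i = 0; i < n; ++i) x[i] *= alpha;
}

/* Hypothetical stand-in for the global kernel structure. */
toy_cntx_t toy_gks = { toy_scal_ref, 4 };

/* Copy the currently active configuration into a caller-owned context. */
void toy_cntx_init(toy_cntx_t *cntx)
{
    assert(cntx != NULL);
    *cntx = toy_gks;
}

/* "Expert"-style entry point: the context is the last parameter. */
void toy_scale_ex(size_t n, double alpha, double *x, const toy_cntx_t *cntx)
{
    size_t i;
    for (i = 0; i < n; i += cntx->blocksize) {
        size_t nb = (n - i < cntx->blocksize) ? (n - i) : cntx->blocksize;
        cntx->scal_ker(nb, alpha, x + i);
    }
}
```

Because the caller holds its own copy, a later change to toy_gks does not
alter a computation that already initialized its context, which is the
thread-friendliness property the commit message describes.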

commit dd856c2cb75a2221a503a73dde27790c34b91570
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Apr 11 10:39:18 2016 -0500

Translated MIC kernel to KNL and cleaned up a bit. Only real change is lack of swizzle modifiers for FMA instructions (used bcast from memory instead).

commit 7f27431d3fffdda99c282ec412731d0a90cb32a7
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Apr 8 10:04:39 2016 -0500

Copy mic kernel to knl for transliteration.

commit f8f02f0334ac020021e15a415bcd33aeea01deb4
Merge: 32c92d94 d1f8e5d9
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 6 11:37:05 2016 -0500

Merge branch 'master' into const_correctness

commit 32c92d945c55708da0eb63be1771f8c5430e3910
Merge: 62914ccb 20af937b
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Apr 6 11:36:02 2016 -0500

Merge branch 'master' into const_correctness

commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173
Merge: 20af937b c11d28ee
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Apr 5 12:21:27 2016 -0500

Merge pull request 60 from esauvage/master

sgemm µkernel for bulldozer : bug correction for k%4 != 0

commit c11d28eed89d65494bc4019f04d046520866c0ff
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Sat Apr 2 21:15:48 2016 +0200

cgemm µkernel for bulldozer : bug correction for k%4 != 0

commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7
Merge: 36c3abb0 fc61a114
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 14:37:30 2016 -0500

Merge pull request 59 from devinamatthews/fix_testsuite_makefile

Fix testsuite makefile

commit fc61a1143edeba4946d4b9915f1775bb08e643fc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 31 10:53:01 2016 -0500

Fix formatting in configure.

commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 31 10:45:48 2016 -0500

Adjust paths in common.mk to support building from testsuite dir.

commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583
Merge: 64b41fa5 917ce754
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 31 10:26:17 2016 -0500

Merge pull request 58 from esauvage/master

cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…

commit 356d854fc9e34642cc46e0e02a8ceb56114878af
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 30 16:33:15 2016 -0500

Make symlink to common.mk in build directory.

commit edbb8470044f82ef959583ee09613a5a985292b5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Wed Mar 30 16:27:11 2016 -0500

Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile.

commit 917ce75482a543fef46553efff6c246939761e59
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Wed Mar 30 22:03:09 2016 +0200

cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 62914ccbcdb3c594f065dcfa65bd7e7b95c79283
Merge: bbf704bf 64b41fa5
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Tue Mar 29 15:24:25 2016 -0500

Merge branch 'master' into const_correctness

commit 64b41fa554dff44b2f9ad48901b67c63836407a8
Merge: 1b09e343 0171ad58
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 15:19:41 2016 -0500

Merge pull request 54 from devinamatthews/more_config_opts

More config opts

commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 29 12:55:28 2016 -0500

Updated gcc version from 4.8 to 4.9 in .travis.yml.

commit 0171ad58997b3a5a9b76301511dbe0751fffc940
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Mar 28 13:55:06 2016 -0500

Add icc and clang support for Intel architectures, fixes 47. 2bd036f fixes 49 BTW.

commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11
Merge: 8624e365 4ca5d5b1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 28 12:36:25 2016 -0500

Merge pull request 44 from esauvage/master

sgemm micro-kernel for FMA4 instruction set

commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc
Merge: 469429ec 8624e365
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Mar 26 14:10:15 2016 -0500

Merge branch 'master' into more_config_opts

commit 8624e36543160739d954c4dbcc5a5594458f3a12
Merge: a315833f 2bd036f1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Mar 26 13:56:28 2016 -0500

Merge pull request 50 from devinamatthews/fix_noopt_avx

Fix configuration issue where instruction set flags are not specified for debug builds.

commit 469429ec34e5b1a172ce35596f9c7afdaacac131
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 20:45:41 2016 -0500

Fix LD_FLAGS -> LDFLAGS.

commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 20:06:48 2016 -0500

Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures.

commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 17:22:58 2016 -0500

Add threading option to configure.

commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5
Merge: 9452bdb3 2bd036f1
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 15:00:02 2016 -0500

Merge branch 'fix_noopt_avx' into more_config_opts

commit 9452bdb3afbf2d7f898134a091d7790817e7be9c
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 14:59:50 2016 -0500

Add options for verbose make output and static/shared linking to configure.

commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 12:16:49 2016 -0500

Fix configuration issue where instruction set flags are not specified for debug builds.

commit bbf704bf7501411964a63a68f1af541f612cf92d
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 25 09:55:35 2016 -0500

Add missing const to bli_read_nway_from_env.

commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6
Merge: 1d1a426d af92773f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Mar 24 12:30:21 2016 -0500

Merge pull request 48 from figual/master

Updated and improved ARMv8 micro-kernels.

commit af92773f4f85a2441fe0c6e3a52c31b07253d08e
Author: figual <figualucm.es>
Date: Wed Mar 23 22:07:02 2016 +0100

Updated and improved ARMv8 micro-kernels.

commit a4d7729776d17d9bdf2341eacd70b9770b9ba8d2
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Mon Mar 21 09:55:21 2016 -0500

Set default value for debug_type variable.

commit 0e2447fa55d8c5fa2b1fc4150073512495c5f9eb
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 17 16:32:05 2016 -0500

Add const correctness to auxinfo_t struct (microkernels need update theoretically).

commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf
Merge: 5a978fff d226dfa0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Mar 7 15:17:53 2016 -0600

Merge pull request 46 from devinamatthews/new-config-opts

Add several changes to the build system.

commit d226dfa05190eb477b33563b1edccf8603973336
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Sat Mar 5 16:18:14 2016 -0600

Add several changes to the build system.

1) Add -- options.
2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
4) Add make V=[0,1] option to control build verbosity.

commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4
Merge: adb2b4e0 63e26423
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Mar 4 17:26:58 2016 -0600

Merge pull request 45 from devinamatthews/high_prec_timers

Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday

commit 63e264239053b913164a849dd8a45829087eaddc
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 4 13:17:50 2016 -0600

Make sure that -lrt is linked on Linux.

commit 44fddd48dc1708a956803d1948f04429ec0d8700
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Fri Mar 4 12:36:38 2016 -0600

Add missing \.

commit 7cabd2131f953de23e7015d760b0ddfda51b1251
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Mar 3 11:43:07 2016 -0600

Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday.
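
A minimal sketch of the clock_gettime(CLOCK_MONOTONIC) path, assuming a
POSIX system. The function name monotonic_seconds is hypothetical, and
the mach_absolute_time path mentioned in the commit is not shown.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Return seconds elapsed since an arbitrary, fixed starting point.
   Unlike gettimeofday(), CLOCK_MONOTONIC is not affected by wall-clock
   adjustments, so differences between two calls are never negative. */
double monotonic_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```

A timed region is then measured as the difference between two calls. On
older Linux C libraries this may require linking with -lrt, as the
follow-up commit 63e26423 notes.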

commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Mar 2 14:48:12 2016 -0600

Fixing guard for non implemented partitioning through packed matrices

commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Tue Mar 1 21:33:01 2016 +0100

sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 627d59b5ba06866b26f46e4434a0435b600925e3
Author: Etienne Sauvage <etienne.sauvagegmail.com>
Date: Mon Feb 29 21:53:12 2016 +0100

symbolic link for bulldozer configuration to kernels

commit 2dc5c0ae038ed175fab85751803ada05734d1ba1
Merge: f2809fc5 3d0fae81
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 29 12:22:51 2016 -0600

Merge pull request 40 from tkelman/bulldozer-symlink

Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

commit f2809fc5f74466c755da6a5b4632853e634060b5
Merge: f86b94f2 8624a33c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Feb 27 13:06:03 2016 -0600

Merge pull request 39 from devinamatthews/fix_f2c_conflicts

Devin's f2c type namespace update.

Details:
- Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
- Removed most of the body of bli_f2c.h, which was unused.

commit 3d0fae810d942085d8f2d389820b4e0027577db8
Author: Tony Kelman <tonykelman.net>
Date: Thu Feb 25 23:24:03 2016 -0800

Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

to fix linking issue mentioned in 37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI

commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Feb 25 13:51:26 2016 -0600

Fix remaining f2c conflicts.

commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba
Author: Devin Matthews <dmatthewsutexas.edu>
Date: Thu Feb 25 12:01:58 2016 -0600

Fixed most conflicts after hack-n-slash of bli_f2c.h, cleanup in
progress.

commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 23 18:12:34 2016 -0600

Included missing blas2blis integer def to CBLAS.

Details:
- Added include "bli_config_macro_defs" to all cblas_*.c files in
compat/cblas/src. This has the effect of defining
BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
not define it. Thanks to Tony Kelman for reporting this bug.
- In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
to 'f77_int'. This eliminates a compiler warning and a potential
runtime bug and/or crash when the size of an int differs from the size
of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).

commit 0b126de1342c11c65623bcb38e258e21e9244e3d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 13 16:29:12 2015 -0600

Consolidated packm_blk_var1 and packm_blk_var2.

Details:
- Consolidated the two blocked variants for packm into a single
implementation (packm_blk_var1) and removed the other variant.
- Updated all induced method _cntl_init() functions in frame/cntl/ind/
to use the new blocked variant 1.
- Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
to detect pack_t schemas for induced methods and native execution,
respectively.

commit 30e5eb29e060b97752f702d2ea5d101d950f53b2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 13 12:14:19 2015 -0600

Minor changes to treatment of rs, cs in bli_obj.c.

Details:
- Applied a patch submitted by Devin Matthews that:
- implements subtle changes to handling of somewhat unusual cases of
row and column strides to accommodate certain tensor cases, which
includes adding dimension parameters to _is_col_tilted() and
_is_row_tilted() macros,
- simplifies how buffers are sized when requesting BLIS-allocated
objects,
- re-consolidates bli_adjust_strides_*() into one function, and
- defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
environments.
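
The last item of the patch can be sketched as follows. This is one
plausible form of such a guard, not necessarily the one BLIS uses; the
function toy_axpy is hypothetical and only shows that code using
'restrict' still compiles when the macro expands to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* In C++ and pre-C99 C, 'restrict' is not a keyword, so define it away. */
#if defined(__cplusplus) || !defined(__STDC_VERSION__) || \
    (__STDC_VERSION__ < 199901L)
#define restrict
#endif

/* y := y + alpha * x. The qualifiers promise that x and y do not overlap. */
void toy_axpy(size_t n, double alpha, const double *restrict x,
              double *restrict y)
{
    size_t i;
    assert(n == 0 || (x != NULL && y != NULL));
    for (i = 0; i < n; ++i) y[i] += alpha * x[i];
}
```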

commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 15:22:50 2015 -0600

Fixed unimplemented case in core2 sgemm ukernel.

Details:
- Implemented the "beta == 0" case for general stride output for the
dunnington sgemm micro-kernel. This case had been, up until now,
identical to the "beta != 0" case, which does not work when the
output matrix has nan's and inf's. It had manifested as nan residuals
in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
Matthews for reporting this bug.
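
Why the "beta == 0" case needs its own branch can be shown with a small
sketch. The function toy_update_c is hypothetical and is not the
dunnington kernel; it assumes a general-stride output tile and a
column-major m x n product ab. When beta is zero the old contents of C
must be overwritten rather than scaled, since 0 * nan is nan.

```c
#include <assert.h>
#include <stddef.h>

/* General-stride update of an m x n tile: C := beta * C + alpha * AB. */
void toy_update_c(size_t m, size_t n, double alpha, const double *ab,
                  double beta, double *c, size_t rs_c, size_t cs_c)
{
    size_t i, j;
    assert(ab != NULL && c != NULL);
    for (j = 0; j < n; ++j) {
        for (i = 0; i < m; ++i) {
            double *cij = c + i * rs_c + j * cs_c;
            double  t   = alpha * ab[i + j * m];
            if (beta == 0.0)
                *cij = t;                  /* overwrite: old C is never read */
            else
                *cij = beta * (*cij) + t;  /* scale, then accumulate */
        }
    }
}
```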

commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 12 12:07:46 2015 -0600

Fixed minor bugs for uncommon obj_create cases.

Details:
- Separated bli_adjust_strides() into _alloc() and _attach() flavors so
that the latter can avoid a test performed by the former, in which the
rs and cs are overridden and set to zero if either matrix dimension is
zero. Actually, we also disable this overriding behavior, even for the
_alloc() case, since keeping the original strides (probably) does not
hurt anything. The original code has been kept commented-out, though,
in case an unintended consequence is later discovered.
- Fixed a typo in an error check for general stride cases where rs == cs.

commit 3e6dd11467643fbc2cb45c13cec8dd6024232833
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Nov 3 10:30:08 2015 -0600

Minor re-expression in quadratic partitioning code.

Details:
- Minor change to quadratic equation solution code that avoids
recomputation of the sqrt() parameter when the compiler is not
smart enough to perform this optimization automatically.

commit 0694b722f7e4df00efb32639095a2aca80e67f52
Merge: 3e116f0a 33557ecc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 17:24:25 2015 -0600

Merge branch 'master' of github.com:flame/blis

commit 3e116f0a2953f50b3c068759a775ad7ffae04e49
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 17:18:23 2015 -0600

Fixed imaginary bug in quadratic partitioning code.

Details:
- Fixed a bug in the relatively new quadratic partitioning code that,
under the right conditions, would perform sqrt() on a negative value.
If the solution is imaginary, we discard it and use an alternate
partition width that assumes no diagonal intersection. That alternate
width is actually already computed, so, the fix was quite simple.
Thanks to Devangi Parikh for reporting this bug.

commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68
Author: Jeff Hammond <jeff.sciencegmail.com>
Date: Mon Nov 2 12:18:43 2015 -0800

add Travis CI build status icon to the README

commit 4a502fbe77bd0f701108baaa559d9cfb483f88de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Nov 2 13:28:34 2015 -0600

Laid groundwork for runtime memory pool resizing.

Details:
- Changed bli_pool_finalize() so that the freeing begins with the block
at top_index instead of block 0. This allows us to use the function
for terminal finalization as well as temporary cleanup prior to
reinitialization. Also, clear the pool_t struct upon _pool_finalize()
in case it is called in the terminal case with some blocks still
checked out to threads (in which case the threads will see the new
block size as 0 and thus release the block as intended).
- Added bli_pool_reinit(), which calls _pool_finalize() followed by
_pool_init() with new parameters.
- Added bli_mem_reinit(), which is based on bli_pool_reinit().
- Added new wrapper, _mem_compute_pool_block_sizes(), which calls
_mem_compute_pool_block_sizes_dt().
- Updated bli_mem_release() so that the pblk_t is freed, via
_pool_free_block(), if the block size recorded in the mem_t at the
time the pblk_t was acquired is now different from the value in the
pool_t.
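
A rough sketch of the release rule in the last bullet, using hypothetical
toy_pool_t and toy_mem_t types rather than the BLIS structures. It
assumes a fixed-capacity pool and shows a block checked out before a
reinitialization being freed on release instead of returned to the pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_POOL_CAP 4

typedef struct {
    size_t block_size;
    void  *blocks[TOY_POOL_CAP];
    int    top;                 /* number of blocks available in the pool */
} toy_pool_t;

typedef struct {
    void  *buf;
    size_t block_size;          /* pool block size at checkout time */
} toy_mem_t;

void toy_pool_init(toy_pool_t *p, size_t block_size)
{
    p->block_size = block_size;
    for (p->top = 0; p->top < TOY_POOL_CAP; ++p->top) {
        p->blocks[p->top] = malloc(block_size);
        assert(p->blocks[p->top] != NULL);
    }
}

/* Free the blocks still held by the pool and clear its block size. */
void toy_pool_finalize(toy_pool_t *p)
{
    while (p->top > 0) free(p->blocks[--p->top]);
    p->block_size = 0;
}

void toy_pool_reinit(toy_pool_t *p, size_t new_block_size)
{
    toy_pool_finalize(p);
    toy_pool_init(p, new_block_size);
}

void toy_mem_acquire(toy_pool_t *p, toy_mem_t *m)
{
    assert(p->top > 0);
    m->buf        = p->blocks[--p->top];
    m->block_size = p->block_size;
}

/* Returns 1 if the block was freed as stale, 0 if it went back to the pool. */
int toy_mem_release(toy_pool_t *p, toy_mem_t *m)
{
    int stale = (m->block_size != p->block_size) || (p->top == TOY_POOL_CAP);
    if (stale) free(m->buf);
    else       p->blocks[p->top++] = m->buf;
    m->buf = NULL;
    return stale;
}
```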

commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 30 18:25:04 2015 -0500

Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.

Details:
- Fixed a family of bugs in the triangular level-3 operations for
certain complex implementations (3m1 and 4m1a) that only manifest if
one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
- Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
for the triangular case.
- Fixed the incorrect computation of imaginary stride, as stored in
the auxinfo_t struct in trmm and trsm macro-kernels.
- Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
cases where the register blocksize for the triangular matrix is
odd. Introduced a new byte-granular pointer arithmetic macro,
bli_ptr_add(), that computes the correct value.
- Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
terms of __typeof__, which is used by bli_ptr_add() macro.
- Disabled the row- vs. column-storage optimization in bli_trmm_front()
for singleton problems because the inherent ambiguity of whether a
scalar is row-stored or column-stored causes the wrong parameter
combination code to be executed (by dumb luck of our checking for
row storage first).
- Added commented-out debugging lines to 3m1/4m1a and reference
micro-kernels, and trsm_ll macro-kernel.
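
Byte-granular pointer arithmetic, as mentioned for bli_ptr_add() above,
can be sketched in portable C without typeof(). The helper and the
toy_dcomplex type below are hypothetical and do not reproduce the BLIS
macro. The point is that an offset that is not a whole number of
elements (here 1.5 complex elements) cannot be expressed with typed
pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double real; double imag; } toy_dcomplex;

/* Advance p by nbytes bytes, regardless of the type p points to. */
void *toy_ptr_add_bytes(void *p, size_t nbytes)
{
    assert(p != NULL);
    return (void *)((char *)p + nbytes);
}
```

For example, advancing a toy_dcomplex pointer by 3 * sizeof(double)
lands on the fourth double of the underlying buffer, whereas adding 1 or
2 to the typed pointer would land on the third or fifth.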

commit 46294d80e5a79c598e200e1c8ec2a642ff839971
Merge: d3159c57 a0a7b85a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 27 12:41:23 2015 -0500

Merge pull request 35 from figual/master

Fixed incomplete code in the double precision ARMv8 microkernel.

commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c
Author: Francisco Igual <figualucm.es>
Date: Tue Oct 27 08:59:15 2015 +0000

Fixed incomplete code in the double precision ARMv8 microkernel.

commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f
Merge: b489152e 7e03e45b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 21 14:54:00 2015 -0500

Merge branch 'master' of github.com:flame/blis

commit b489152e112644ec3b6d19e687231a9607f7694f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 21 14:53:17 2015 -0500

Use vzeroall in haswell micro-kernels.

commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49
Merge: 77ddb0b1 4f88c29f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 14 13:26:07 2015 -0500

Merge pull request 33 from xianyi/master

Enable Travis CI

commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Wed Oct 14 12:57:50 2015 -0500

Detect Intel Broadwell (using Haswell config).

commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5
Merge: fe3e355c 77ddb0b1
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Wed Oct 14 12:51:05 2015 -0500

Merge branch 'upstream_master'

commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Oct 13 12:53:06 2015 -0500

Removed flop-counting mechanism.

Details:
- Removed the optional flop-counting feature introduced in commit
7574c994.

commit 276da366187460a4c8e6e0910e79cb39ce780bfe
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 12 11:43:03 2015 -0500

Minor formatting change to README.md.

commit d17057446f5404824478e8a6cd08f242ab75544a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 12 11:39:49 2015 -0500

Added "Getting Started" section to README.md.

Details:
- Added section to README.md file containing links to wikis with brief
descriptions.

commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 2 16:51:52 2015 -0500

Minor updates to CREDITS, README files.

commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Sep 26 20:47:19 2015 -0500

Minor edits to README.md, testsuite.

Details:
- Fixed typos in README.md.
- Fixed column heading alignment for testsuite when matlab output is
enabled.
- Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 25 14:47:27 2015 -0500

Replaced README with README.md.

Details:
- Replaced the old (and short) README file with a much more comprehensive
version written in github-flavored markdown. The new file is based on
content taken from the old Google Code homepage.

commit e2e9d64a63485461192d9c2a6dd0183a8b71013c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 24 12:14:03 2015 -0500

Load balance thread ranges for arbitrary diagonals.

Details:
- Expanded/updated interface for bli_get_range_weighted() and
bli_get_range() so that the direction of movement is specified in the
function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
and also so that the object being partitioned is passed instead of an
uplo parameter. Updated invocations in level-3 blocked variants, as
appropriate.
- (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
carefully take into account the location of the diagonal when computing
ranges so that the area of each subpartition (which, in all present
level-3 operations, is proportional to the amount of computation
engendered) is as equal as possible.
- Added calls to a new class of routines to all non-gemm level-3 blocked
variants:
bli_<oper>_prune_unref_mparts_[mnk]()
where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
dimension is being partitioned. These routines call a more basic
routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
regions from matrices and simultaneously adjust other matrices which
share the same dimension accordingly.
- Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new
pruning routines.
- Fixed incorrect blocking factors passed into bli_get_range_*() in
bli_trsm_blk_var[12][fb].c
- Added a new test driver in test/thread_ranges that can exercise the new
bli_get_range_*() and bli_get_range_weighted_*() under a range of
conditions.
- Reimplemented m and n fields of obj_t as elements in a "dim"
array field so that dimensions could be queried via index constant
(e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
macros accordingly.
- Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
- Added bli_round() macro, which calls C math library function round(),
and bli_round_to_mult(), which rounds a value to the nearest multiple
of some other value.
- Added miscellaneous pruning- and mdim_t-related macros.
- Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
bli_obj_row_off(), bli_obj_col_off().
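
Rounding to the nearest multiple, as described for bli_round_to_mult()
above, might look like the following integer sketch. The function name
is hypothetical, and the tie-breaking choice (ties round up) is an
assumption; the BLIS macro may behave differently.

```c
#include <assert.h>

/* Round a non-negative val to the nearest multiple of mult (mult > 0). */
long toy_round_to_mult(long val, long mult)
{
    assert(val >= 0 && mult > 0);
    return ((val + mult / 2) / mult) * mult;
}
```

One use for such a helper is snapping a computed range boundary to a
multiple of a blocking factor.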

commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1
Merge: efa641e3 4dd9dd3e
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Fri Aug 21 14:38:36 2015 -0500

Merge branch 'upstream_master'

commit efa641e36b73abee34166a252e90e28a6281d92d
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Sat Aug 22 03:15:50 2015 +0800

Try to fix the compiling bug on travis.

commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 21 11:52:37 2015 -0500

Fixed minor alignment ambiguity bug in bli_pool.c.

Details:
- Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
pointer arithmetic was performed on a void* as if it were a byte
pointer (such as char*). Some compilers may have already been
interpreting this situation as intended, despite the sloppiness.
Thanks to Aleksei Rechinskii for reporting this issue.
- Redefined pointer alignment macros to typecast to uintptr_t instead of
siz_t.
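
Both points in this commit can be seen in a short sketch: the address is
converted to uintptr_t for the alignment computation, and no arithmetic
is performed on a void* directly. The function name toy_align_up is
hypothetical and the code is not taken from bli_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round p up to the next address that is a multiple of align.
   align must be a nonzero power of two. */
void *toy_align_up(void *p, size_t align)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)align - 1;
    assert(align != 0 && (align & (align - 1)) == 0);
    return (void *)((addr + mask) & ~mask);
}
```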

commit 12ffd568b04feda57147c13b67717416a01c82f8
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Sat Aug 22 00:24:28 2015 +0800

Add Travis CI.

commit ecc3ebb749e0861c27deda52b5f87236ede4901b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 29 13:31:12 2015 -0500

CHANGELOG update (0.1.8)

0.1.8

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 29 13:31:09 2015 -0500

Version file update (0.1.8)

commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e
Merge: fdfe14f1 d4b89136
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 9 13:54:54 2015 -0500

Merge branch 'master' of github.com:flame/blis

commit fdfe14f1e17ba5a2f8dfa0bdb799c6b0e730211b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 9 13:52:39 2015 -0500

Added support for Intel Haswell/Broadwell.

Details:
- Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
and FMA instructions. (Complex support is currently provided by default
induced method, 4m1a.)
- Added a 'haswell' configuration, which uses the aforementioned kernels.
- Inserted auto-detection support for haswell configuration in
build/auto-detect/cpuid_x86.c.
- Modified configure script to explicitly echo when automatic or manual
configuration is in progress.
- Changed beta scalar in test_gemm.c module of test suite to -1.0 to 0.9.

commit d4b891369c1eb0879ade662ff896a5b9a7fca207
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 7 10:06:53 2015 -0500

Added 'carrizo' configuration.

Details:
- Added a new configuration for AMD Excavator-based hardware, also known
as Carrizo when referring to the entire APU. This configuration uses
the same micro-kernels as the piledriver, but with different
cache blocksizes.

commit 0b7255a642d56723f02d7ca1f8f21809967b8515
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 19 12:01:50 2015 -0500

CHANGELOG update (0.1.7)

0.1.7

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 19 12:01:49 2015 -0500

Version file update (0.1.7)

commit 7cd01b71b5e757a6774625b3c9f427f5e7664a76
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 19 11:31:53 2015 -0500

Implemented dynamic allocation for packing buffers.

Details:
- Replaced the old memory allocator, which was based on statically-
allocated arrays, with one based on a new internal pool_t type, which,
combined with a new bli_pool_*() API, provides a new abstract data
type that implements the same memory pool functionality but with blocks
from the heap (i.e., malloc() or equivalent). Hiding the details of the
pool in a separate API also allows for a much simpler bli_mem.c family
of functions.
- Added a new internal header, bli_config_macro_defs.h, which enables
sane defaults for the values previously found in bli_config. Those
values can be overridden by defining them in bli_config.h the same
way kernel defaults can be overridden in bli_kernel.h. This file most
resembles what was previously a typical configuration's bli_config.h.
- Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
blocks in the memory pool. Also added a corresponding query routine to
the bli_info API.
- Deprecated (once again) the micro-panel alignment feature. Upon further
reflection, it seems that the goal of more predictable L1 cache
replacement behavior is outweighed by the harm caused by non-contiguous
micro-panels when k % kc != 0. I honestly don't think anyone will even
miss this feature.
- Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
bli_cntl_init() instead of bli_init().
- Removed query functions from bli_info.c that are no longer applicable
given the dynamic memory allocator.
- Removed unnecessary definitions from configurations' bli_config.h files,
which are now pleasantly sparse.
- Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
modules. Thanks to Devangi Parikh for pointing out these
miscalculations.
- Comment, whitespace changes.
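
Illustration (not the actual pool_t or bli_pool_*() API): a minimal sketch of a memory pool whose blocks come from the heap rather than from statically-allocated arrays. All names prefixed with demo_ are hypothetical, and alignment of the blocks is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pool: a stack of equally-sized heap blocks. */
typedef struct
{
    void  **blocks;      /* block pointers; entries below 'top' are checked out */
    size_t  num_blocks;  /* total number of blocks owned by the pool */
    size_t  top;         /* index of the next available block */
    size_t  block_size;  /* size of each block in bytes */
} demo_pool_t;

/* Allocate num_blocks (> 0) blocks of block_size bytes. Returns 0 on success. */
static int demo_pool_init(demo_pool_t *pool, size_t num_blocks, size_t block_size)
{
    pool->blocks = malloc(num_blocks * sizeof(void *));
    if (pool->blocks == NULL) return -1;

    for (size_t i = 0; i < num_blocks; ++i)
    {
        pool->blocks[i] = malloc(block_size);
        if (pool->blocks[i] == NULL)
        {
            /* Undo the partial allocation before reporting failure. */
            while (i > 0) free(pool->blocks[--i]);
            free(pool->blocks);
            pool->blocks = NULL;
            return -1;
        }
    }

    pool->num_blocks = num_blocks;
    pool->top        = 0;
    pool->block_size = block_size;
    return 0;
}

/* Hand out the next available block, or NULL if the pool is exhausted. */
static void *demo_pool_checkout(demo_pool_t *pool)
{
    if (pool->top == pool->num_blocks) return NULL;

    void *block = pool->blocks[pool->top];
    pool->blocks[pool->top++] = NULL;
    return block;
}

/* Return a previously checked-out block to the pool. */
static void demo_pool_checkin(demo_pool_t *pool, void *block)
{
    assert(pool->top > 0);
    pool->blocks[--pool->top] = block;
}

/* Free every block currently held by the pool, then the pool itself.
   Blocks still checked out are the caller's responsibility. */
static void demo_pool_finalize(demo_pool_t *pool)
{
    for (size_t i = pool->top; i < pool->num_blocks; ++i) free(pool->blocks[i]);
    free(pool->blocks);
    pool->blocks = NULL;
}
```

A caller would check out a block before packing and check it back in afterwards. A real allocator would also need locking and aligned allocation, which this sketch omits.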

commit 9848f255a3bab17d1139c391cca13ff3f1ffe6ed
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 11 19:14:22 2015 -0500

Added early return to API-level _init() routines.

Details:
- Added conditional code that returns early from the API-level _init()
routines if the API is already initialized. Actually meant for this to
be included in 5f93cbe8.

commit 5f93cbe870f3478870e15581e7fd450dad5bba1e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jun 11 18:52:12 2015 -0500

Introduced API-level initialization.

Details:
- Added API-level initialization state to _const, _error, _mem, _thread,
_ind, and _cntl APIs. While this functionality will mostly go unused,
adding minuscule overhead at init-time, there will be at least one
instance in the near future where, in order to avoid an infinite loop,
a certain portion of the initialization will call a query function that
itself attempts to call bli_init(). API-level initialization will allow
this later stage to verify that an earlier stage of initialization has
completed, even if the overall call to bli_init() has not yet returned.
- Added _is_initialized() functions for each API, setting the underlying
bool_t during _init() and unsetting it during _finalize().
- Comment, whitespace changes.
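
Illustration (hypothetical names, not the BLIS sources): a sketch of per-API initialization state with an _is_initialized() query and an early return from _init() when the API is already initialized. The counter exists only to make the early return observable.

```c
#include <stdbool.h>

/* Hypothetical state for one API (here called "demo_const"). */
static bool demo_const_is_init    = false;
static int  demo_const_init_count = 0;   /* times real init work was done */

bool demo_const_is_initialized(void)
{
    return demo_const_is_init;
}

void demo_const_init(void)
{
    /* Early return: do nothing if this API is already initialized. */
    if (demo_const_is_initialized()) return;

    ++demo_const_init_count;   /* stand-in for the real initialization work */

    demo_const_is_init = true;
}

void demo_const_finalize(void)
{
    /* Stand-in for the real teardown work, then unset the flag. */
    demo_const_is_init = false;
}
```

A later stage of initialization can call demo_const_is_initialized() to confirm that this stage has finished, even though the top-level init routine has not yet returned.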

commit ee129c6b028bc5ac88da7c74fde72c49803742ff
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 10 12:53:28 2015 -0500

Fixed bugs in _get_range(), _get_range_weighted().

Details:
- Fixed some bugs that only manifested in multithreaded instances of
some (non-gemm) level-3 operations. The bugs were related to invalid
allocation of "edge" cases to thread subpartitions. (Here, we define
an "edge" case to be one where the dimension being partitioned for
parallelism is not a whole multiple of whatever register blocksize
is needed in that dimension.) In BLIS, we always require edge cases
to be part of the bottom, right, or bottom-right subpartitions.
(This is so that zero-padding only has to happen at the bottom, right,
or bottom-right edges of micro-panels.) The previous implementations
of bli_get_range() and _get_range_weighted() did not adhere to this
implicit policy and thus produced bad ranges for some combinations of
operation, parameter cases, problem sizes, and n-way parallelism.
- As part of the above fix, the functions bli_get_range() and
_get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
and _b2t suffixes, similar to the partitioning functions. This is
an easy way to make sure that the variants are calling the right
version of each function. The function signatures have also been
changed slightly.
- Comment/whitespace updates.
- Removed unnecessary '/' from macros in bli_obj_macro_defs.h.
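
Illustration (a simplified sketch, not the renamed bli_get_range_*() functions): one way to split a dimension among threads in whole multiples of a register blocksize so that the leftover "edge" always lands in the last (right or bottom) subpartition. The function name, argument order, and the way surplus blocks are handed to the lowest-numbered threads are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical left-to-right (or top-to-bottom) range computation.
   n is the dimension being partitioned, bf the register blocksize.
   Thread tid receives the half-open range [*start, *end). */
static void demo_get_range_l2r(size_t tid, size_t n_threads,
                               size_t n, size_t bf,
                               size_t *start, size_t *end)
{
    assert(bf > 0 && n_threads > 0 && tid < n_threads);

    size_t n_whole = n / bf;              /* number of whole blocks       */
    size_t n_edge  = n % bf;              /* leftover edge elements       */
    size_t base    = n_whole / n_threads; /* blocks every thread receives */
    size_t extra   = n_whole % n_threads; /* threads 0..extra-1 get one more */

    size_t blocks_before = tid * base + (tid < extra ? tid : extra);
    size_t my_blocks     = base + (tid < extra ? 1 : 0);

    *start = blocks_before * bf;
    *end   = *start + my_blocks * bf;

    /* The edge case is always assigned to the last subpartition. */
    if (tid == n_threads - 1) *end += n_edge;
}
```

For example, with n = 10, bf = 4, and two threads, thread 0 receives [0, 4) and thread 1 receives [4, 10), so only the rightmost range contains a partial block.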

commit 9135dfd69d39f3bbd75034f479f27a78dbfebcce
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Jun 5 13:37:44 2015 -0500

Minor updates to test/3m4m files.

commit d62ceece943b20537ec4dd99f25136b9ba2ae340
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 3 12:56:45 2015 -0500

Minor update to test/3m4m/runme.sh.

Details:
- Removed some stale script code that should have been removed
during 590bb3b8c.

commit b6ee82a3d421c9c4f1eb6848c7c6e37aa46de799
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jun 3 12:14:23 2015 -0500

Minor cleanup to bli_init() and friends.

Details:
- Spun-off initialization of global scalar constants to bli_const_init()
and of threading stuff to bli_thread_init().
- Added some missing _finalize() functions, even when there is nothing
to do.

commit 1213f5cebabc1637ce9dd45c4bfa87bb93677c29
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jun 2 13:27:47 2015 -0500

POSIX thread bugfixes/edits to bli_init.c, _mem.c.

Details:
- Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
was used to lock access to initialization/finalization actions.
But everything worked out okay as long as bli_init() was called by
single-threaded code.
- Changed to static initialization for memory allocator mutex in
bli_mem.c, and moved mutex to that file (from bli_init.c).
- Fixed some type mismatches in bli_threading_pthreads.c that resulted
in compiler warnings.
- Fixed a small memory leak with allocated-but-never-freed (and unused)
pthread_attr_t objects.
- Whitespace changes to bli_init.c and bli_mem.c.

commit 590bb3b8c5c0389159c5a9451b6c156c5f237e8a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun May 24 16:02:53 2015 -0500

Backed-out adjusted dim changes to test/3m4m.

Details:
- Reverted most changes applied during commit ec25807b.

commit ec25807b26da943868f0d0517c3720e50181b8f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 10 13:23:50 2015 -0500

Tweaks to test/3m4m to test with adjusted dims.

Details:
- Updated test/3m4m driver files to build test drivers that allow
comparison of real "asm_blis" results to complex "asm_blis" results,
except with the latter's problem sizes adjusted so that problems are
generated with equal flop counts.

commit 426b6488580a92bf071a62dc319a9c837ce39821
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 8 15:12:21 2015 -0500

Fixed a packing bug that manifested in trsm_r.

Details:
- Fixed a bug that caused a memory leak in the contiguous memory
allocator. Because packm_init() was using simple aliasing when
a subpartition object was marked as zeros by bli_acquire_mpart_*(),
the "destination" pack object's mem_t entry was being overwritten
by the corresponding field of the "source" object (which was likely
NULL). This prevented the block from being released back to the
memory allocator. But this bug only manifested when changing the
location of packing B from outside the var1 loop to inside the
var3 loop, and only for trsm with triangular B (side = right). The
bug was fixed by changing the type of alias used in packm_init()
when handling zero partition cases. Specifically, we now use
bli_obj_alias_for_packing(), which does not clobber the destination
(pack) object's mem_t field. Thanks to Devangi Parikh for this bug
report.

commit c84286d5cef48f16d83831baac1f46b9856b9a36
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Apr 4 15:39:14 2015 -0500

More minor tweaks to test/3m4m.

Details:
- Added a line of output that forces matlab to allocate the entire array
up-front.
- Re-enabled real domain benchmarks in runme.sh, which were temporarily
disabled.

commit 309717c8ebf4ef1369f15cf41340e13c25b41573
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 3 19:28:49 2015 -0500

More tweaks to test/3m4m, configurations.

Details:
- Fixed incorrect number of mc_x_kc memory blocks in
sandybridge/bli_config.h.
- Enabled OpenMP multithreading in piledriver/bli_config.h.
- More updates to test/3m4m driver files.

commit 4baf3b9c69b2f648be9e46e07ccc9859dd675828
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 3 16:44:32 2015 -0500

Tweaked test/3m4m driver, including acml support.

Details:
- Added ACML support to test/3m4m driver Makefile and runme.sh script.

commit a32f7c49ca4ea869d2a6c66818780f4321743d67
Merge: 349e075a 4bfd1ce8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Apr 3 08:28:11 2015 -0500

Merge pull request 23 from xianyi/master

Add auto-detecting CPU on configure stage.

commit 349e075ad6a8e2a1211d94f36d24828c9d44b052
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Apr 2 18:12:28 2015 -0500

Tweaks to sandybridge config, test/3m4m driver.

Details:
- Enable OpenMP support by default in sandybridge's bli_config.h.
- Reorganized sandybridge's bli_kernel.h.
- Updated 3m4m Makefile, runme.sh to also test MKL implementation.

commit 4bfd1ce8ca93f93d170dd2715f0a32027b417b46
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Thu Apr 2 16:40:21 2015 -0500

Detect NEON for cortex-a9 and cortex-a15.

commit aa6eec4f43137057276fe6119bdbfb5c52682527
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Thu Apr 2 16:03:44 2015 -0500

Detect the CPU architecture. Support ARM cores.

Detect the CPU architecture by compiler's predefined macros.
Then, detect the CPU cores.

Support detecting x86 and ARM architectures.

commit 2947cfb749c937b0f62fac36cc92f123bd45b53c
Author: Zhang Xianyi <traits.zhanggmail.com>
Date: Wed Apr 1 12:24:00 2015 -0500

Add auto-detecting CPU on configure stage.
e.g. /Path_to_BLIS/configure auto

Now, it only supports detecting x86 CPUs.

commit 26a4b8f6f985597f80e0174990bf541f1d9bafac
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Apr 1 10:44:54 2015 -0500

Implemented 3m2, 3m3 induced algorithms (gemm only).

Details:
- Defined a new "3ms" (separated 3m) pack schema and added appropriate
support in packm_init(), packm_blk_var2().
- Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
as an argument instead of computing it locally. Exception: for trmm,
is_p must be computed locally, since it changes for triangular
packed matrices. Also exposed is_p in interface to dt-specific
packm_blk_var2 (and _var1, even though it does not use imaginary
stride).
- Renamed many functions/variables from _3mi to _3mis to indicate that
they work for either interleaved or separated 3m pack schemas.
- Generalized gemm and herk macro-kernels to pass in imaginary stride
rather than compute them locally.
- Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
and 3m3-specific virtual micro-kernels.
- Added special gemm macro-kernels to support 3m2 and 3m3.
- Added support for 3m2 and 3m3 to testsuite.
- Corrected the type of the panel dimension (pd_) in various macro-
kernels from inc_t to dim_t.
- Renamed many functions defined in bli_blocksize.c.
- Moved most induced-related macro defs from frame/include to
frame/ind/include.
- Updated the _ukernel.c files so that the micro-kernel function pointers
are obtained from the func_t objects rather than the cpp macros that
define the function names.
- Updated test/3m4m driver, Makefile, and run script.

commit ddf62ba7d2da08225b201585b85e06c967767dea
Author: Tyler Smith <tmscs.utexas.edu>
Date: Fri Mar 27 14:27:51 2015 -0500

Refuse to free the packm thread info if it uses the single threaded version

commit 016fc587584d958a0e430a56a5e2c05022ac2f17
Author: Tyler Smith <tmscs.utexas.edu>
Date: Fri Mar 27 14:23:02 2015 -0500

Don't free packm thread info if it is null

commit 00a443c529a60862a57b93e303a0b3212c9b1df4
Author: Tyler Smith <tmscs.utexas.edu>
Date: Fri Mar 27 14:11:07 2015 -0500

Use bli_malloc instead of malloc for the thread info paths

commit f1a6b7d02861ccebdc500ea98778cc0f6cddad17
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 18 15:37:10 2015 -0500

Reorganized code for induced complex methods.

Details:
- Consolidated most of the code relating to induced complex methods
(e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
are now enabled on a per-operation basis. The current "available"
(enabled and implemented) implementation can then be queried on
an operation basis. Micro-kernel func_t objects as well as blksz_t
objects can also be queried in a similar manner.
- Redefined several micro-kernel and operation-related functions in
bli_info_*() API, in accordance with above changes.
- Added mr and nr fields to blksz_t object, which point to the mr
and nr blksz_t objects for each cache blocksize (and are NULL for
register blocksizes). Renamed the sub-blocksize field "sub" to
"mult" since it is really expressing a blocksize multiple.
- Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
trsm to correctly query mr and nr (for purposes of nudging kc).
- Introduced an enumerated opid_t in bli_type_defs.h that uniquely
identifies an operation. For now, only level-3 id values are defined,
along with a generic, catch-all BLIS_NOID value.
- Reworked testsuite so that all induced methods that are enabled
are tested (one at a time) rather than only testing the first
available method.
- Reformatted summary at the beginning of testsuite output so that
blocksize and micro-kernel info is shown for each induced method
that was requested (as well as native execution).
- Reduced the number of columns needed to display non-matlab
testsuite output (from approx. 90 to 80).

commit 8d5169ccda954e5f72944308a036dcb7ebfc9097
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Mar 18 11:38:08 2015 -0500

Fixed bug in release of mem_t buffer.

Details:
- Fixed a bug that affects all level-2 and level-3 blocked variants. The
bug only manifested, however, if the packing of operands (A and B in
gemm, for example) spanned multiple nodes in the control tree. Until
recently, the main consumers of packm were level-3 operations, all of
which packed both input operands from blocked variant 1 (B outside of
the loop, and A within the loop). This particular usage masked a flaw
in the code whereby bli_obj_release_pack() would always release the
underlying mem_t buffer (provided it was allocated), even if the buffer
was not allocated in the current variant. This has been fixed by
replacing all calls to bli_obj_release_pack() with calls to a new
function, bli_packm_release(), which takes the same control tree node
argument passed into the object's corresponding call to packm_init()
or packv_init(). bli_packm_release() then proceeds to invoke
bli_obj_release_pack() only if the control tree node indicates that
packing was requested. Thanks to Devangi Parikh for identifying this
bug.
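
Illustration (hypothetical types and names, not the BLIS control tree): a sketch of the fix described above, where the release routine consults the control tree node and frees the pack buffer only if that node actually requested packing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for mem_t and a packm control tree node. */
typedef struct { void *buf; }       demo_mem_t;
typedef struct { bool  does_pack; } demo_packm_node_t;

/* Unconditional release, analogous in spirit to the old behavior. */
static void demo_release_pack(demo_mem_t *mem)
{
    free(mem->buf);
    mem->buf = NULL;
}

/* Conditional release: only the variant whose node requested packing
   (and therefore acquired the buffer) gives it back. */
static void demo_packm_release(demo_mem_t *mem, const demo_packm_node_t *node)
{
    assert(mem != NULL);

    if (node != NULL && node->does_pack) demo_release_pack(mem);
}
```

A variant whose node does not request packing leaves the buffer untouched, so a buffer acquired higher up in the tree survives until its owner releases it.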

commit c0acca0f5182ba96fd39c9d10b34a896a6e74206
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Mar 3 10:56:22 2015 -0600

Clarified comments in testsuite input.operations.

commit 03ba9a6b17861d9e1adc0cf924439c4d7e860d19
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Feb 24 10:33:28 2015 -0600

Removed some 'old' directories.

commit a86db60ee270cdeb745ae7cf68f9e0becc9f522d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 23 18:42:39 2015 -0600

Extensive renaming of 3m/4m-related files, symbols.

Details:
- Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
('i' for "interleaved"). Similar changes to 3M/4M macros.
- Renamed all 3m/4m files and functions to 3m1/4m1.
- Whitespace changes.

commit 8cf8da291a0fb2f491f410969a76ec0fbda47faf
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 20 15:24:27 2015 -0600

Minor updates to induced complex mode management.

Details:
- Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and
associated headers) from frame/base to frame/base/induced.
- Added bli_xm.? to frame/base/induced, which implements
bli_xm_is_enabled(), which detects whether ANY induced complex method
is currently enabled.
- The new function bli_xm_is_enabled() is now used in bli_info.c to
detect when an induced complex method is used, so we know when to
return blocksizes from one of the induced methods' blocksize objects.

commit 411e637ee7d1083a84f58f08938d51e63d7c3c9a
Merge: c2569b88 fc0b7712
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Fri Feb 20 20:39:25 2015 -0600

Merge branch 'master' of http://github.com/flame/blis

commit c2569b8803d4ccc1d7b6f391713461b51443601d
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Fri Feb 20 20:38:19 2015 -0600

Fixed a memory leak in freeing the thread infos

commit fc0b771227abf86d81f505b324f69f6e83db1d8f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Feb 20 11:47:44 2015 -0600

Added max(mr,nr) to kc in static mem pools.

Details:
- Changed the static memory definitions to compute the maximum register
blocksize for each datatype and add it to kc when computing the size
of blocks of A and B. This formally accounts for the nudging of kc
up to a multiple of mr or nr at runtime for triangular operations
(e.g. trmm).
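
Illustration (hypothetical helpers, not the actual static memory macros): a sketch of why adding max(mr,nr) to kc is sufficient. Rounding kc up to a multiple of mr or nr adds fewer than max(mr,nr) elements, so a block sized with kc + max(mr,nr) always has room for the nudged value.

```c
#include <assert.h>
#include <stddef.h>

static size_t demo_max(size_t a, size_t b)
{
    return a > b ? a : b;
}

/* Round kc up to the next multiple of mult (mult > 0). */
static size_t demo_nudge_up(size_t kc, size_t mult)
{
    assert(mult > 0);

    size_t rem = kc % mult;
    return rem == 0 ? kc : kc + (mult - rem);
}

/* Bytes reserved for a packed block of A: mc x (kc + max(mr,nr)) elements. */
static size_t demo_block_a_bytes(size_t mc, size_t kc,
                                 size_t mr, size_t nr, size_t elem_size)
{
    return mc * (kc + demo_max(mr, nr)) * elem_size;
}
```

With the assumed values kc = 250, mr = 6, and nr = 8, the nudged kc is 252 or 256, both of which fit within the reserved 258.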

commit af32e3a608631953ef770341df10a14a991bf290
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Thu Feb 19 22:51:11 2015 -0600

Fixed a bug where get_range_weighted would return end = 0 for small problem sizes

commit 441d47542a64e131578d00da7404c1ed387a721c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 19 17:06:10 2015 -0600

Renamed 3m and 4m symbols/macros to 3mi and 4mi.

Details:
- Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
because those packing schemas were always implicitly "interleaved".
This new naming scheme will make way for new schemas that separate
instead of interleave the real and imaginary (and summed) parts.
- Expanded the pack format sub-field of the pack schema field of the
info_t to 4 bits (from 3). This will allow for more schema types
going forward.
- Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.

commit 518a1756ccf02122b96fc437b538604a597df42a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Feb 19 14:27:09 2015 -0600

Fixed indexing bug for trmm3 via 3mh, 4mh.

Details:
- Fixed a bug that only affected trmm3 when performed via 3mh or 4mh,
whereby micro-panels of the triangular matrix were packed with "dead
space" between them due to failing to adjust for the fact that pointer
arithmetic was occurring in units of complex elements while the data
being packed consisted of real elements. It turns out that the macro-
kernel suffered from the same bug, meaning the panels were actually
being packed and read consistently. The only way I was able to
discover the bug in the first place was because the packed block of A
was overflowing into the beginning of the packed row panel of B using
the sandybridge configuration.
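
Illustration (hypothetical names, not the packm or macro-kernel code): a sketch of the class of bug described above. The same element offset lands at different byte addresses depending on whether the pointer is typed as a real or a complex element, which is how "dead space" can appear between panels of real data.

```c
#include <stddef.h>

/* Hypothetical double-precision complex type. */
typedef struct { double real; double imag; } demo_dcomplex;

/* Intended behavior: the buffer holds real elements, so step in units
   of double. */
static double *demo_panel_real_units(void *buf, size_t panel, size_t panel_len)
{
    return (double *)buf + panel * panel_len;
}

/* Buggy analogue: the same offset applied to a complex-typed pointer
   advances twice as many bytes per element. */
static double *demo_panel_complex_units(void *buf, size_t panel, size_t panel_len)
{
    return (double *)((demo_dcomplex *)buf + panel * panel_len);
}
```

If both the packing code and the code that reads the panels use the complex-unit version, they agree with each other and results stay correct, but the data occupies twice the expected space and can overflow into a neighboring buffer.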

commit 493087d730f01d5169434f461644e5633f48a42f
Merge: 650d2a6f 25021299
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 18 09:45:51 2015 -0600

Merge branch 'master' of github.com:flame/blis

commit 25021299b670775df8ca9c87910c63d7e74ed946
Merge: fe2b8d39 f05a5763
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 11 20:03:21 2015 -0600

Merge branch 'master' of github.com:flame/blis

commit fe2b8d39a445ac848686e78c7540fd046cb95492
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 11 19:33:10 2015 -0600

Fixed an obscure bug in 3mh/3m/4mh/4m packing.

Details:
- Modified bli_packm_blk_var1.c and _var2.c to increase the triangular
case's panel increment by 1 if it would otherwise be odd. This is
particularly necessary in _var2.c when handling the interleaved 3m
or ro/io/rpi pack schemas, since division of an odd number by 2 can
happen if both the panel length and the panel packing dimension
(register packing blocksize) are odd, thus making their product odd.
- Modified bli_packm_init.c so that panel strides are increased by 1
if they would otherwise be odd, even for non-3m related packing.
- Modified the trmm and trsm macro-kernels so that triangular packed
micro-panels are traversed with this new "increment by 1 if odd"
policy.
- Added sanity checks in trmm and trsm macro-kernels that would result
in an abort() if the conditions that would lead to a "divide odd
integer by 2" scenario ever manifest.
- Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.
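
Illustration (hypothetical function names, written as functions rather than macros): a sketch of the "increment by 1 if odd" policy. When both the panel length and the panel packing dimension are odd, their product is odd and cannot be halved exactly, so the increment is bumped to the next even value.

```c
#include <stddef.h>

static int demo_is_odd(size_t a)  { return (a % 2) != 0; }
static int demo_is_even(size_t a) { return (a % 2) == 0; }

/* Panel increment in elements, forced to be even so that a later
   division by 2 is exact. */
static size_t demo_panel_increment(size_t panel_len, size_t pack_dim)
{
    size_t inc = panel_len * pack_dim;

    if (demo_is_odd(inc)) inc += 1;

    return inc;
}
```

For example, a panel length of 5 with a packing dimension of 3 gives 15, which becomes 16, while 4 by 3 gives 12 and is left alone.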

commit 650d2a6ff2e593151a296ca86b5214afcc747afc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Feb 9 14:59:20 2015 -0600

Added initial support for imaginary stride.

Details:
- Added an imaginary stride field ("is") to obj_t.
- Renamed bli_obj_set_incs() macro to bli_obj_set_strides().
- Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and
added invocations in key locations.
- Added some basic error-checking related to imaginary stride.
- For now, imaginary stride will not be exposed into the most-used
BLIS APIs such as bli_obj_create(), and certainly not the
computational APIs such as bli_dgemm().

commit f05a57634a7c8e3864b25b3335d1194c1ea1aeb9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Feb 8 19:40:34 2015 -0600

Defined gemm cntl function to query ukrs func_t.

Details:
- Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t*
for the gemm micro-kernels from the leaf node of the control tree.
This allows all the func_t* fields from higher-level nodes in the tree
to be NULL, which makes the function that builds the control trees
slightly easier to read.
- Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in
all bli_*_front() functions (which is needed to apply the row/column
preference optimization).
- In all level-3 bli_*_cntl_init() functions, changed the _obj_create()
function arguments corresponding to the gemm_ukrs fields in higher-
level cntl tree nodes to NULL.
- Removed some old her2k macro-kernels.

commit cefd3d5d2001264de17cf63dae541f890cb9daaf
Author: Tyler Smith <tmscs.utexas.edu>
Date: Thu Feb 5 11:09:12 2015 -0600

A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this

commit 7574c9947d57a19f613880e3b9f62f8c8f6df4ec
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Feb 4 12:11:55 2015 -0600

Added basic flop-counting mechanism (level-3 only).

Details:
- Added optional flop counting to all level-3 front-ends, which is
enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be
reset at any time via bli_flop_count_reset() and queried via
bli_flop_count(). Caveats:
- flop counts are approximate for her[2]k, syr[2]k, trmm, and
trsm operations;
- flop counts ignore extra flops due to non-unit alpha;
- flop counts do not account for situations where beta is zero.
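
Illustration (hypothetical names; the real mechanism is enabled via BLIS_ENABLE_FLOP_COUNT and queried via bli_flop_count()): a sketch of a global flop counter for gemm using the conventional count of 2mnk flops in the real domain and four times that in the complex domain. Like the caveats above, it ignores alpha and beta entirely. Thread safety is not addressed.

```c
#include <stddef.h>

/* Hypothetical running total of counted flops. */
static double demo_flops = 0.0;

void demo_flop_count_reset(void)
{
    demo_flops = 0.0;
}

double demo_flop_count(void)
{
    return demo_flops;
}

/* Add the conventional flop count of one m x n x k gemm. */
void demo_flop_count_gemm(size_t m, size_t n, size_t k, int is_complex)
{
    double flops = 2.0 * (double)m * (double)n * (double)k;

    /* A complex multiply-add is conventionally counted as 8 real flops. */
    if (is_complex) flops *= 4.0;

    demo_flops += flops;
}
```

A front-end would call demo_flop_count_gemm() once per operation; the caller resets the counter before a region of interest and queries it afterwards.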

commit ceda4f27d1f1bcf19320e09848e0f2e3b9941e6c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jan 29 13:22:54 2015 -0600

Implemented bli_obj_imag_equals().

Details:
- Implemented a new function, bli_obj_imag_equals(), which compares the
imaginary part of the first argument to the second argument, which may
be a BLIS_CONSTANT or of a regular real datatype.

commit 81114824a05a9053229efd577a8a94a856deda93
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jan 6 12:15:21 2015 -0600

Minor 4m/3m consolidation to mem_pool_macro_defs.h.

Details:
- Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to
reduce code and improve readability.

commit 36a9b7b7436d9423ba4de2a9f85cfcd43577b783
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Dec 17 21:53:50 2014 +0000

reduced the default number of MC by KC blocks for bgq

commit c60619c7c3568f044a849abbab60209aa7455423
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 16 17:08:22 2014 -0600

Minor tweaks for 3m4m test drivers.

Details:
- Changed gemm_kc blocksizes to be reduced by two-thirds instead of
half.
- Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when
computing the fixed k dimension.
- Fixed runme.sh so that it would use multiple threads for s/dgemm
cases.

commit c6929ba6a5e6f633a7295e979a2b8df8c7ecdb1b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 16 11:27:50 2014 -0600

Added 4m_1b to test/3m4m test driver and script.

commit 785d480805fc0d6f4251b5499933515740b6b2a7
Merge: 9456f330 4156c088
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 12 14:34:19 2014 -0600

Merge branch 'master' of github.com:flame/blis

commit 9456f330af4617f9ee32972d51f974aa2d84f97b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Dec 12 14:31:57 2014 -0600

Added 4m_1b implementation for gemm.

Details:
- Added yet another 4m-based implementation for complex domain level-3
operations. This method, which the 3m/4m paper identifies as Algorithm
"4m_1b", fissures the first loop around the micro-kernel so that the
real sub-panel of the current micro-panel of B is multiplied against
(both sub-panels of) all micro-panels of A, before doing the same for
the imaginary sub-panel of the micro-panel of B. For now, only gemm is
supported, and 4m_1b (labeled "4mb" within the framework) is not yet
integrated into the test suite.
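
Illustration (a plain triple loop, not the BLIS macro-kernel): a sketch of the ordering idea behind a 4m-style complex product computed with four real products, where the real part of B is multiplied against both parts of A before the imaginary part of B is touched. The function name, the separate real/imaginary arrays, and the row-major layout are assumptions made for brevity.

```c
#include <stddef.h>

/* Computes C += A * B for complex matrices stored as separate real and
   imaginary arrays (row-major). A is m x k, B is k x n, C is m x n. */
static void demo_gemm_4m_fissured(size_t m, size_t n, size_t k,
                                  const double *a_r, const double *a_i,
                                  const double *b_r, const double *b_i,
                                  double *c_r, double *c_i)
{
    /* Pass 1: real part of B against both parts of A. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t p = 0; p < k; ++p)
            {
                c_r[i * n + j] += a_r[i * k + p] * b_r[p * n + j];
                c_i[i * n + j] += a_i[i * k + p] * b_r[p * n + j];
            }

    /* Pass 2: imaginary part of B against both parts of A. */
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t p = 0; p < k; ++p)
            {
                c_r[i * n + j] -= a_i[i * k + p] * b_i[p * n + j];
                c_i[i * n + j] += a_r[i * k + p] * b_i[p * n + j];
            }
}
```

The two passes together accumulate (A_r B_r - A_i B_i) into the real part of C and (A_i B_r + A_r B_i) into the imaginary part, which is the ordinary complex product.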

commit 4156c0880d9aea4ff04a9c4fa139ba8c437d8bfb
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Dec 9 16:03:14 2014 -0600

Fixed obscure level-2 packing / general stride bug.

Details:
- Fixed a bug in certain structured level-2 operations that manifested
only when the structured matrix was provided to BLIS as a matrix stored
with general stride. The bug was introduced in c472993b when the
densify field was removed from the packm control tree node and
associated APIs. Since then, the packed object was unconditionally
marked with an uplo field of BLIS_DENSE. This is fine for level-3
operations where micro-panels are always densified, but in level-2
contexts, the underlying unblocked variant (fused or unfused) of
structured operations (e.g. trmv) still needs to know whether to
execute its "lower" or "upper" branches of code. Since this field
was unconditionally being set to BLIS_DENSE, the unblocked variants
were always executing the "else" branch, which happened to be the
"lower" case code. Thus, running an upper case produced the wrong
answer. This most obviously manifested in the form of failures for
trmm, trmm3, and trsm in the test suite.
The bug was fixed by setting the packed object's uplo field to
BLIS_DENSE only if the schema indicated that micro-panels were to be
packed. Otherwise, we can assume we are packing to regular row or
column storage, as is the case with level-2 packing. Thanks to
Francisco Igual for reporting the testsuite failures and ultimately
leading us to this bug.

commit 689f60a578b461119e9ea90c74f642b9eb79addb
Merge: bef24e67 483e4d6a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Dec 7 14:03:30 2014 -0600

Merge pull request 21 from figual/master

Adding armv8a configuration and micro-kernels.

commit 483e4d6a3fdbef9d9ab47fb674c9476c70ca9f0f
Author: Francisco D. Igual <figualucm.es>
Date: Sun Dec 7 20:27:49 2014 +0100

Adding armv8a configuration and micro-kernels.

Only sgemm micro-kernel is fully functional at this point.

commit bef24e67e0f93579c2a80315348dc2e227f72a72
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Nov 26 18:00:56 2014 -0600

Fixed a type of race condition exposed by pthreads implementation.
Lead thread of the inner thread communicator could exit subproblem, move on to the next iteration of the loop and modify a1_pack, b1_pack, or c1_pack while other threads were still using those.

Barriers were inserted to fix this.

commit 76bde44411f0e34266bab9d666a54ef22be97320
Merge: e56e6143 f3d729e5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 26 17:25:24 2014 -0600

Merge branch 'master' of github.com:flame/blis

commit f3d729e504ec012e7dc7e02b2ecd42e004c6894d
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Nov 26 22:25:24 2014 -0600

Added static mutex to bli_init and bli_finalize

commit d71cc797866ff502ad1127527016f463267eef80
Author: Tyler Michael Smith <tmscs.utexas.edu>
Date: Wed Nov 26 21:35:39 2014 -0600

Refactored bli_threading files and added support for pthreads

commit e56e61438ff7fcf25a48c0b7603f18df782b50b6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Nov 26 17:20:35 2014 -0600

Minor cleanups to bli_threading.h and friends.

Details:
- No longer need to define BLIS_ENABLE_MULTITHREADING manually in
bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or
BLIS_ENABLE_PTHREADS is defined.
- Added sanity check to prevent both BLIS_ENABLE_OPENMP and
BLIS_ENABLE_PTHREADS from being enabled simultaneously.
- Reorganization of bli_threading*.h header files, which led to
simplification of threading-related part of blis.h.
- Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk
file.
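
Illustration (a sketch of the preprocessor logic described above, not a copy of the BLIS headers): BLIS_ENABLE_MULTITHREADING is derived from the two backend macros, and defining both backends is rejected at compile time. The helper demo_multithreading_enabled() is hypothetical and exists only to make the result observable.

```c
/* A configuration would define at most one of these. */
#define BLIS_ENABLE_OPENMP
/* #define BLIS_ENABLE_PTHREADS */

/* Sanity check: the two threading backends are mutually exclusive. */
#if defined(BLIS_ENABLE_OPENMP) && defined(BLIS_ENABLE_PTHREADS)
#error "BLIS_ENABLE_OPENMP and BLIS_ENABLE_PTHREADS may not both be defined."
#endif

/* Derive the generic multithreading macro from whichever backend is set. */
#if defined(BLIS_ENABLE_OPENMP) || defined(BLIS_ENABLE_PTHREADS)
#define BLIS_ENABLE_MULTITHREADING
#endif

/* Hypothetical query: returns 1 if any threading backend is enabled. */
static int demo_multithreading_enabled(void)
{
#ifdef BLIS_ENABLE_MULTITHREADING
    return 1;
#else
    return 0;
#endif
}
```

Uncommenting the second #define alongside the first makes the translation unit fail to compile, which is the intent of the sanity check.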

commit 3be2744cbe2c56d38c23fd818aa5c1f10cc7ea51
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Nov 21 12:28:08 2014 -0600

Update to template gemm ukernel comments.

Details:
- Updated comments on alignment of a1 and b1 to match wiki.

commit 994429c6881b2ade92d9d7949bcaebfbf2cc65eb
Merge: 58796abd 694029d9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 20 13:55:35 2014 -0600

Merge pull request 20 from TimmyLiu/master

define PASTEF773 required by cblas compatibility layer

commit 694029d9d7db857d642ab536955c0621791108c8
Author: Timmy <timmy.liuamd.com>
Date: Wed Nov 19 15:25:14 2014 -0600

define PASTEF773 required by cblas compatibility layer

commit 58796abda66b133346f8d523b39178afc336351f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Nov 6 14:31:52 2014 -0600

Removed KC constraint comments from _kernel.h files.

Details:
- Since 4674ca8c, the constraint that KC be a multiple of both MR and
NR has been relaxed, and thus it was time to remove the comments
from the top of the bli_kernel.h files of all configurations.

commit 7bbc95a54f706d43c7f7951f0e5995f86130cd52
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 29 10:52:23 2014 -0500

Added new piledriver micro-kernels.

Details:
- Added new micro-kernels for the AMD piledriver architecture (one
for each datatype).
- Updates and tweaks to piledriver configuration.
- Added 3xk packm micro-kernel support.
- Explicitly unrolled some of the smaller packm micro-kernels.
- Added notes to avx/sandybridge and piledriver micro-kernel files
acknowledging the influence of the corresponding kernel code in
OpenBLAS.

commit 59613f1d5500f6279963327db2fbc84bc9135183
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 23 17:21:37 2014 -0500

Added separate micro-panel alignment for A and B.

Details:
- Changed the recently-added micro-panel alignment macros so that we now
have two sets--one for micro-panels of matrix A and one for micro-
panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?.
- Store each set of alignment values into a separate blksz_t object in
bli_gemm_cntl_init().
- Adjusted packm_init() to use the separate alignment values.
- Added query routines for the new alignment values to bli_info.c.
- Modified test suite output accordingly.

commit a8e12884ee1fddd3fd77ca5a68aa0cb857f3af57
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 23 11:35:48 2014 -0500

CHANGELOG update (0.1.6)

0.1.6

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 23 11:35:45 2014 -0500

Version file update (0.1.6)

commit a3e6341bdb0e28411f935d6b4708a6389663e004
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 23 11:13:28 2014 -0500

Factored common code from blocksize functions.

Details:
- Split bli_determine_blocksize_[fb]() into two functions each, the
newer ones ending with the _sub suffix. These new sub-functions are
now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which
eliminates redundant code and will allow any future tweaks to the
core sub-functions to automatically be inherited by the operation-
specific versions.

commit 4674ca8cffb58331ff7edf23bbe0e3f6a7558489
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 23 10:50:59 2014 -0500

Extended newly relaxed KC to hemm, symm.

Details:
- These changes were intended for the previous commit.
- Defined bli_gemm_determine_kc_[fb]() and bli_gemm_determine_kc_[fb](),
which determine blocksizes for gemm-based operations, taking special
care to "nudge" the kc dimension up to a multiple of MR or NR for
hemm and symm operations, as needed.
- Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f()
instead of bli_determine_blocksize_f().
- Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.

commit ab954ba6f874eaca7b001804491f866ef6b9b327
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Oct 22 17:21:58 2014 -0500

Relaxed constraint that KC be multiple of MR, NR.

Details:
- Relaxed a long-held requirement in register blocksizes that required
the kernel programmer to choose a KC that was divisible by both MR
and NR. This was very constraining on some architectures that did not
use register blocksizes that were powers of two. The constraint is
now enforced only for trmm and trsm, where it is needed, and it is
now handled by "nudging" kc upward at runtime, if necessary, to be a
multiple of MR or NR, as needed.
- Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](),
which determine blocksizes for trmm and trsm, taking special care to
"nudge" the kc dimension up to a multiple of MR or NR, as needed.
- Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]()
instead of bli_determine_blocksize_[fb]().
- Added safeguard to bli_align_dim_to_mult() that returns the dimension
unmodified if the dimension multiple is zero (to avoid division by
zero).
- Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from
bli_kernel_macro_defs.h.
- Whitespace, variable name changes to bli_blocksize.c.
- Removed old commented code from bli_gemm_cntl.c.
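
The "nudging" described above can be illustrated with a minimal sketch. The
function names, signatures, and the needs_multiple flag below are
assumptions made for illustration only; they are not the BLIS
implementation.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are assumptions): round dim up
   to the nearest multiple of mult. A zero multiple returns dim unchanged,
   which avoids a division by zero. */
size_t align_dim_to_mult(size_t dim, size_t mult)
{
    if (mult == 0) return dim;
    return ((dim + mult - 1) / mult) * mult;
}

/* Hypothetical helper: choose the kc used at runtime. Only operations
   that need the constraint (trmm and trsm in the text above) request
   nudging; all others use the configured kc as-is. */
size_t determine_kc(size_t kc_config, size_t reg_blksz, int needs_multiple)
{
    return needs_multiple ? align_dim_to_mult(kc_config, reg_blksz)
                          : kc_config;
}
```

With a configured kc of 256 and a register blocksize of 6, this sketch
leaves kc at 256 for operations that do not need the constraint and
rounds it up to 258 for those that do.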

commit 95cdae65d6b88e043ee14bcd53cd2e800d7aecb4
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Oct 22 16:30:16 2014 -0500

Fixed bug in KNC microkernel where k=0 and beta != 1

commit e64dba5633fc49b768b5edc7762f2b5d8a4d0588
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Oct 20 19:23:06 2014 -0500

Re-implemented micro-panel alignment.

Details:
- This commit re-implements a feature that was removed in commit
c2b2ab62. It was removed because, at the time, I wasn't sure how the
micro-panel alignment feature would interact with the 4m method (when
applied at the micro-kernel level), and so it seemed safer to disable
the feature entirely rather than allow possible breakage. This commit
revisits the issue and safely re-implements the feature in a way that
is compatible with 4m, 3m, 4mh, and 3mh (and native execution).
- Modified the static memory pool to account for micro-panel alignment
space.
- Modified packm_init and blocked variants to align whole micro-panels
by a datatype-specific alignment value that may be set by the
configuration. (If it is not set by the configuration, it will default
to BLIS_SIZEOF_?.)
- Modified macro-kernels so that:
- storage stride is handled properly given the new micro-panel
alignment behavior;
- indexing through 3m/4m/rih-type sub-panels, as is done by trmm and
trsm, is more robust (e.g. will work if the applicable packing
register blocksize is odd);
- imaginary strides are computed and stored within auxinfo_t structs,
which allows the virtual micro-kernels to more easily determine how
to index into the micro-panel operands.
- Modified virtual 3m and 4m micro-kernels to use the imaginary strides
within the auxinfo_t structs instead of panel strides.
- Deprecated the panel stride fields from the auxinfo_t structs.
- Updated test suite to print out the micro-panel alignment values.

commit add16b0e5402924301e7078e4ca5e3ef725bff0b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 17 11:49:24 2014 -0500

Added 3m4m test driver subdir of 'test'.

Details:
- Added a modified test driver for [cz]gemm that will test all 3m/4m
as well as assembly-based and OpenBLAS implementations of gemm
in single and multithreaded modes.

commit e171504a72406c61a173241d8bccf0a5ceb10582
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 17 11:25:59 2014 -0500

Use correct definition of bli_is_last_iter().

Details:
- As intended for previous commit, the new definition of
bli_is_last_iter() is now disabled in favor of the old
definition.

commit 0d954087b2b55d2f5f3c5e57d702b318ca2300f6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 17 11:19:34 2014 -0500

Minor changes and fixes.

Details:
- Redefined bli_is_last_iter() to take thread_id and num_thread
arguments, which allows the macro to correctly compute whether a
given iteration is the last that the thread will compute in that
particular loop. The new definition, however, remains disabled
(commented out) until someone can look at this more closely, as
the new definition seems to actually hurt performance slightly.
- Whitespace and related updates to level-3 macro-kernels.
- Updated test suite so that performance results in the hundreds of
gigaflops do not disrupt the column alignment of the output.

commit d1e86e1876e433f54b501ec5a005b4ba7c5ce4e6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 12 13:43:47 2014 -0500

More minor tweaks to sandybridge/avx micro-kernel.

Details:
- Re-enabled use of b_next for dgemm and cgemm micro-kernels.

commit 7b6fe4cae57cb22c09c1a97595e1a201a02cbcd2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Oct 12 12:01:51 2014 -0500

Minor tweaks to sandybridge/avx micro-kernels.

Details:
- Changed the MC blocksize for zgemm micro-kernel from 128 to 64.
- Removed usage of b_next in all x86_64/avx gemm micro-kernels.

commit a6a156e9feec47154e7a0fd43bcc006b1fc04aba
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 10 14:26:41 2014 -0500

Added cgemm ukernel for avx/sandybridge.

Details:
- Implemented AVX-based cgemm micro-kernel (via GNU extended inline
assembly syntax).
- Updated sandybridge configuration accordingly.

commit 6f8575ab2580e167a022293b76ddf0514f71b613
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Oct 10 10:01:45 2014 -0500

Added zgemm ukernel for avx/sandybridge.

Details:
- Implemented AVX-based zgemm micro-kernel (via GNU extended inline
assembly syntax).
- Updated sandybridge configuration accordingly.

commit 23ce7ee542a12ca40b4b6090ad2558d180e16d37
Merge: 99fd9a39 7a8ad47f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 9 16:41:22 2014 -0500

Merge branch 'master' of github.com:flame/blis

commit 99fd9a39718cb7281f6fb23f9fef7cca4fe514f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 9 16:38:04 2014 -0500

Fixed two minor bugs.

Details:
- Fixed a bug in the test suite for the trsm_ukr and gemmtrsm_ukr test
modules whereby the uplo bits of some packed matrix objects were not
being set properly, resulting in false FAILURE results for those
tests. Thanks to Tyler Smith for bringing this issue to my attention.
- Fixed a bug in bli_obj_alloc_buffer() that caused an unnecessary
"not yet implemented" abort() when creating a 1x1 object with non-unit
strides.

commit 7a8ad47fb2d100a9da93aa8cab774fcceeaab733
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Oct 8 15:52:13 2014 -0500

Minor changes to knc configuration, including a preference for row-major storage.
Also fixed a bug in the knc micro-kernel where it would fail if k == 0

commit 76b7c34af0c09f47d9615b18857a356acddc788a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 2 14:15:38 2014 -0500

Fixed a bug in the pack schema-related bit macros.

Details:
- Expanded the BLIS_PACK_SCHEMA_BITS value in bli_type_defs.h to
include all six bits presently used in the pack schema bitfield of
the info field of obj_t structs. Prior to this commit, the macro
constant only included the lowest five bits, which excluded the
"is or is not packed" bit. This manifested as a strange bug in
probably many level-2 codes that invoked packing, though we only
observed it in ger before fixing. Thanks to Devin Matthews for
finding and reporting this bug.
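
A minimal sketch of how a too-narrow mask can lose the top bit of a
bitfield. The shift amount, macro names, and helper functions are
assumptions for illustration; only the "five bits versus six bits" detail
comes from the text above.

```c
#include <stdint.h>

/* Hypothetical layout: a six-bit schema field starting at bit 16, whose
   highest bit records whether the object is packed. */
#define DEMO_SCHEMA_SHIFT   16u
#define DEMO_SCHEMA_BITS_5  (0x1Fu << DEMO_SCHEMA_SHIFT) /* too narrow */
#define DEMO_SCHEMA_BITS_6  (0x3Fu << DEMO_SCHEMA_SHIFT) /* full field */
#define DEMO_PACKED_BIT     (0x20u << DEMO_SCHEMA_SHIFT)

/* Store a schema value into the info word using the given mask. */
uint32_t demo_set_schema(uint32_t info, uint32_t mask, uint32_t schema)
{
    return (info & ~mask) | (schema & mask);
}

/* Read the schema value back out of the info word using the given mask. */
uint32_t demo_get_schema(uint32_t info, uint32_t mask)
{
    return info & mask;
}
```

In this sketch, reading or writing with the five-bit mask silently drops
the packed bit, so code that tests for packedness sees an unpacked object
even after a packed schema was stored.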

commit a5763e332226598d70c47dfa9cad4578e15ef5f4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Oct 2 13:28:17 2014 -0500

Added extra output to bli_obj_print().

Details:
- Print extra values from info field of obj_t struct within
bli_obj_print().

commit 9bba209fc44fbfce943ba6a51cd8278a0cb6b159
Author: Tyler Smith <tmscs.utexas.edu>
Date: Mon Sep 29 14:56:36 2014 -0500

Fixed bug when packing anywhere besides in blk_var_1 for gemm.

commit 614a4afc9272adb47e5a8b83b39d56c2804d95d6
Merge: b541b667 4a7df04e
Author: Tyler Smith <tmscs.utexas.edu>
Date: Fri Sep 26 10:49:57 2014 -0500

Merge branch 'master' of http://github.com/flame/blis

commit 4a7df04e8a4ffdb9561d26426afd35e4fe15b013
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 22 16:06:15 2014 -0500

Added 30xk support for packm ukernels.

Details:
- Updated bli_kernel_*_macro_defs.h headers to include default
definitions for 30xk packm kernels.
- Extended function pointer arrays in bli_packm_cxk_*() out to 31 and
included 30xk kernels.
- Added 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.

commit b6d4bd792e0d44ce4b28afef343f5ff3ba89c285
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 22 16:02:37 2014 -0500

Fixed missing tabs from Makefile patch.

commit 32630f9b6f0d5ba28d5b56dae4c7288a37158743
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 19 17:18:20 2014 -0500

Comment update to virtual micro-kernels.

commit 13447cffead7c6d137a7a3ccbf9e552ed0477467
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Sep 19 13:00:48 2014 -0500

Minor bugfix to top-level Makefile.

Details:
- Applied a patch that allows the top-level Makefile to work on certain
systems. The patch simply separates out the source-to-object code
generation rules for .c and .S files into two separate rules. Thanks
to Devin Matthews for submitting this patch.

commit e80a4537846416719c067ae08a53aeda978c572d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 18 10:24:20 2014 -0500

Fixed bug introduced by bugfix in 25b258d.

Details:
- We actually need to check alignment of lda*sizeof(double) and NOT
a+lda because in the latter case, alignment could cancel out and
still allow the optimized code to run when it shouldn't. Thanks
to Devin for pointing this out.
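
The cancellation mentioned above can be sketched as follows. The helper
names and the use of raw integer addresses are assumptions made so the
example stays self-contained; the base address is assumed to be checked
separately.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: is the address of the second column (a + lda,
   in units of double) a multiple of align bytes? */
bool demo_second_column_is_aligned(uintptr_t a_addr, size_t lda, size_t align)
{
    return (a_addr + lda * sizeof(double)) % align == 0;
}

/* Hypothetical check: is the byte size of the leading dimension itself
   a multiple of align bytes? */
bool demo_stride_is_aligned(size_t lda, size_t align)
{
    return (lda * sizeof(double)) % align == 0;
}
```

For example, assuming an 8-byte double, a base address of 8, lda = 3, and
a 32-byte alignment, a + lda lands on address 32 and looks aligned, even
though the 24-byte column stride is not. The misalignment of the base and
of the stride cancel, which is why the stride is checked directly.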

commit 25b258d61f9c8cee64e922f4131784b6edb196dd
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 18 10:10:49 2014 -0500

Fixed a non-fatal problem with bugfix in a68b316c.

Details:
- The bugfix in a68b316c was inadvertently checking alignment of the
leading dimension itself, rather than the byte size of the leading
dimension. Now, we simply check alignment of a+lda.

commit 96302d4fc81363410e41c3a3c43a65df44d97ad9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 18 09:43:40 2014 -0500

Renamed bli_info_get_*_ukr_type() functions.

Details:
- Added _string() suffix to bli_info_get_*_ukr_type() function names.
This makes them consistent with the bli_info_get_*_impl_string()
functions.

commit a68b316ca4852509f84ed50e01afac486bf70f58
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 17 11:10:07 2014 -0500

Fixed alignment bugs in level-1f kernels.

Details:
- Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels
were attempting to compute problems with unaligned leading dimensions
with optimized code, rather than (correctly) using the reference
implementations. Thanks to Devin Matthews for reporting this bug.

commit 870761eb902e4866090d1d3446a345df3d6d4599
Merge: e9899be0 a2b59a37
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 16 18:20:49 2014 -0500

Merge branch 'master' of github.com:flame/blis

commit e9899be09044829e23386bd73e394f1dd7778210
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 16 18:19:32 2014 -0500

Added high-level implementations of 4m, 3m.

Details:
- Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
high levels, respectively. APIs for trmm and trsm were NOT added due
to the fact that these approaches are inherently incompatible with
implementing 4m or 3m at high levels (because the input right-hand
side matrix is overwritten).
- Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
3m so that all are stylistically consistent.
- Added new "rih" packing kernels (both low-level and structure-aware)
to support both 4mh and 3mh.
- Defined new pack_t schemas to support real-only, imaginary-only, and
real+imaginary packing formats.
- Added various level0 scalar macros to support the rih packm kernels.
- Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
- Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
that order) and execute the first one that is enabled, or the native
implementation if none are enabled.
- Added implementation query functions for each level-3 operation so
that the user can query a string that describes the implementation
that is currently enabled.
- Updated test suite to output implementation types for each level-3
operation, as well as micro-kernel types for each of the five micro-
kernels.
- Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
- Fixed an obscure bug when packing Hermitian matrices (regular packing
type) whereby the diagonal elements of the packed micro-panels could
get tainted if the source matrix's imaginary diagonal part contained
garbage.
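
The front-end selection order described above (3mh, 3m, 4mh, 4m, then
native) might look roughly like the following sketch. The types, field
names, and strings are hypothetical and do not reflect the actual BLIS
data structures.

```c
#include <stdbool.h>

/* Hypothetical implementation identifiers. */
typedef enum { IMPL_NATIVE, IMPL_4M, IMPL_4MH, IMPL_3M, IMPL_3MH } demo_impl_t;

/* Hypothetical record of which induced methods are currently enabled. */
typedef struct
{
    bool en_3mh;
    bool en_3m;
    bool en_4mh;
    bool en_4m;
} demo_state_t;

/* Return the first enabled method in the order 3mh, 3m, 4mh, 4m,
   falling back to the native implementation if none is enabled. */
demo_impl_t demo_choose_impl(const demo_state_t *s)
{
    if (s->en_3mh) return IMPL_3MH;
    if (s->en_3m)  return IMPL_3M;
    if (s->en_4mh) return IMPL_4MH;
    if (s->en_4m)  return IMPL_4M;
    return IMPL_NATIVE;
}

/* Return a string describing the chosen implementation, in the spirit
   of the query functions mentioned above. */
const char *demo_impl_string(demo_impl_t impl)
{
    switch (impl)
    {
        case IMPL_3MH: return "3mh";
        case IMPL_3M:  return "3m";
        case IMPL_4MH: return "4mh";
        case IMPL_4M:  return "4m";
        default:       return "native";
    }
}
```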

commit a2b59a37f166f70a6dd5793db2530823ef590c2b
Author: Tyler Smith <tmscs.utexas.edu>
Date: Mon Sep 15 10:44:44 2014 -0500

Fixed make defs so that they actually compile for bulldozer

commit 86fc7e40764f78ec217f50216ef4fa5b57dbfbc7
Author: Tyler Smith <tmscs.utexas.edu>
Date: Mon Sep 15 10:35:46 2014 -0500

Added bulldozer configuration and updated piledriver micro-kernel

commit 0644e61a79a57f136be5f4c47b9099cff2af06e0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 11 12:55:34 2014 -0500

Minor updates to bli_packm_init.c.

commit 9dc9b44a057a08e20ad4d423344f0ecad54c1eb2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 11 12:03:28 2014 -0500

Renamed bli_obj_pack_status() to _pack_schema().

Details:
- Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in
order to help avoid confusion as to what the macro returns.

commit cf5efdde0588a0d5b6ea57fe7d7be5000be06f8e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Sep 11 11:47:56 2014 -0500

Pass pack_t schemas into ukernels via auxinfo_t.

Details:
- Modified macro-kernels to pass the pack_t schema values for matrices
A and B into the datatype-specific functions, where they are now
inserted into a newly-expanded auxinfo_t struct. This gives the
micro-kernels access to the pack_t schema values embedded in the
control trees, which determine the precise format into which the
matrix elements are packed.
- Updated a call to bli_packm_init_pack() in src/test_libblis.c to
remove densify argument. Meant to include this in commit c472993b.

commit cc8d2b82775cca3c2d51bf427f4e77c8024a6d15
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 9 13:48:22 2014 -0500

Updated old test drivers in 'test'.

commit c472993bbccb69e9ffc409c79b742426c8ad2ad4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Sep 9 13:42:04 2014 -0500

Removed densify argument to packm_cntl_obj_create().

Details:
- Removed the "densify" bool_t argument to bli_packm_cntl_obj_create().
This argument was inserted very early in BLIS's development, when it
was anticipated that the developer may sometimes wish to pack a
Hermitian, symmetric, or triangular matrix without making it dense.
But as it turns out, if we are packing a matrix, we always want to
make it dense in some way or another due to the fact that the micro-
kernel only multiplies dense micro-panels. Thus, unless/until there
is a real need for the feature, it seems reasonable to remove it from
the packm_cntl API.

commit 5c43ee387146cd76dc59b730dac6683a8446b834
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 8 15:19:29 2014 -0500

Moved trmm4m/3m_cntl files to 'old' directory.

Details:
- Meant to include this in previous commit.

commit 7b2f469d5465ed73b1ca88124bc9a1987388aa27
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 8 14:49:50 2014 -0500

Retired trmm_t control tree definitions, usage.

Details:
- Replaced all trmm_t control tree instances and usage with that of
gemm_t. This change is similar to the recent retirement of the herk_t
control tree.
- Tweaked packm blocked variants so that the triangular code does NOT
assume that k is a multiple of MR (when A is triangular) or NR (when
B is triangular). This means that bottom-right micro-panels packed for
trmm will have different zero-padding when k is not already a multiple
of the relevant register blocksize. While this creates a seemingly
arbitrary and unnecessary distinction between trmm and trsm packing,
it actually allows trmm to be handled with one control tree, instead
of one for left and one for right side cases. Furthermore, since only
one tree is required, it can now be handled by the gemm tree, and thus
the trmm control tree definitions can be disposed of entirely.
- Tweaked trmm macro-kernels so that they do NOT inflate k up to a
multiple of MR (when A is triangular) or NR (when B is triangular).
- Misc. tweaks and cleanups to bli_packm_struc_cxk_4m.c and _3m.c, some
of which are to facilitate above-mentioned changes whereby k is no
longer required to be a multiple of register blocksize when packing
triangular micro-panels.
- Adjusted trmm3 according to above changes.
- Retired trmm_t control tree creation/initialization functions.

commit 576e9e9255a79dba9cd3c804267f51e0b4aa6e8a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Sep 7 16:12:52 2014 -0500

Retired herk_t control tree definitions, usage.

Details:
- Replaced all herk_t control tree instances and usage with that of
gemm_t, since the two types presently have the same fields. This means
that herk, her2k, syrk, and syr2k can simply use the gemm control tree
as-is, just as hemm and symm have been doing for some time now.
- Retired herk_t control tree creation/initialization functions.
- Retired many _target.c and .h files into 'old' directories.

commit b2fed052c9a23d858ef0afbe220b342bce9aa7f7
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 3 17:07:25 2014 -0500

Minor code cleanup to bli_packm_struc_cxk*.c

Details:
- Realized that we don't need to track rs_p11 and cs_p11 for
Hermitian/symmetric case of bli_packm_struc_cxk*(). They are always
equal to rs_p and cs_p.

commit 023ce770966b3b5a98bba729c5af1f45e15ebb97
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Sep 3 10:47:53 2014 -0500

Minor update to packm_cxk kernels.

Details:
- Changed m and n dimension parameter names to panel_dim and panel_len,
respectively, in packm_cxk, packm_cxk_3m, packm_cxk_4m kernel wrapper
functions. This makes the code a little easier to read since "m" and
"n" have connotations that are not applicable here.
- Comment updates.

commit 189def3667d9218adbeec45e2801fd074341a679
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 1 16:23:17 2014 -0500

Retired portions of bli_kernel_3m/4m_macro_defs.h.

Details:
- Removed sections of bli_kernel_[4m|3m]_macro_defs.h that defined
4m/3m-specific blocksizes after realizing that this can be done in
bli_gemm[4m|3m]_cntl.c, since that is (mostly) the only place they
are used.
- The maximum cache values for 4m/3m are still needed when computing mem
pool dimensions in bli_mem_pool_macro_defs.h. As a workaround, "local"
definitions in terms of the regular cache blocksizes are now in place.
- Similarly, the register blocksizes for 4m/3m are still needed in
bli_kernel_post_macro_defs.h. As a workaround, "local" definitions in
terms of the regular register blocksizes are now in place.

commit af521ee6f2a77d61c98b833e85c09969987bc00d
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Sep 1 14:06:46 2014 -0500

Changed semantics of blocksize extensions.

Details:
- Changed semantics of cache and register blocksize extensions so that
the extended values are tracked, rather than just the marginal
extensions.
- BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?.
- BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?.
- bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note
that these "max" query routines grab the maximum value for cache
blocksizes and the packdim value for register blocksizes.
- bli_info_*() API has been updated accordingly.
- All configurations have been updated accordingly.

commit 07f23aefd52f5ba4960dbd46e59b180a2136b8e9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 31 11:58:50 2014 -0500

Pass pack schema into packm_struc_cxk*().

Details:
- Changed the interface to the packm_struc_cxk*() kernels to include
the pack_t schema. This allows the implementation to more easily
determine how the micro-panel is stored (row-stored column panel
or column-stored row panel).
- Updated packm blocked variants to pass in the schema.
- Updated packm_ker_t function pointer definition accordingly.

commit f032ba9b1186cb02184574d339565f53d733aa42
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 30 16:21:20 2014 -0500

Reorganized packm implementation.

Details:
- Reorganized packm variants and structure-aware kernels so that all
routines for a given pack format (4m, 3m, regular) reside in a single
file.
- Renamed _blk_var4 to _blk_var2 and generalized so that it will work
for both 4m and 3m, and adjusted 4m/3m _cntl_init() functions
accordingly.
- Added a new packm_ker_t function pointer type to
bli_kernel_type_defs.h to facilitate function pointer typecasting in
the datatype-specific packm_blk_var2() functions.
- Deprecated _blk_var3.
- Fixed a bug in the triangular micro-panel packing facility that
affected trmm and trmm3 with unit diagonals.

commit c6793cecb70788bdf2c76ab8102504ea97be9d2a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 28 17:14:48 2014 -0500

Reorganized includes for scalar macro headers.

Details:
- Reordered the include statements in bli_scalar_macro_defs.h so that
conventional, ri-, and ri3-based macros are grouped together.
- Renamed bli_eqri.h (and macros within) to end with 'ris' suffix.

commit b4da8907284345be4374f87a88679c4886ab866e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 28 14:10:32 2014 -0500

Whitespace, comments updates on packm_blk_var?.c.

commit 46e46a1d83da586c3dd9fd7a01eb16067abbaee1
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 28 12:05:45 2014 -0500

Minor updates to packm blocked, cxk_3m/4m code.

Details:
- Added 'const' qualifier to inlined packing code that handles
micro-panel packing that is too large for an existing packm ukernel.
- Comment updates.

commit 908dc688b5979995eaacb3aa937f241551a8df00
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 28 11:55:12 2014 -0500

Pass pack schema into blocked packm routines.

Details:
- Rather than passing the packm blocked routines a boolean value that
represents whether the matrix is being packed to row or column storage,
we now pass in the pack schema itself.

commit a0ff6066e06075ab5f92b19247b39b92ed15f1bf
Merge: c4c99c48 d40b32bc
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 15:56:21 2014 -0500

Merge branch 'master' of github.com:flame/blis

commit c4c99c4813bf9817592a7899c5d33412fe22313f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 15:52:22 2014 -0500

Renamed packm scalar from beta to kappa.

Details:
- The packm implementation (i.e. source files in frame/1m/packm and
frame/1m/packm/ukernels), interchangeably used the names "beta" and
"kappa" to refer to the optional scalar to be applied during packing.
This commit renames all uses of "beta" to be "kappa", since "beta"
sometimes evokes the scalar specifically on the output matrix of a
level-2 or level-3 operation.

commit d40b32bc24ffbae24123e054307b3138969bb095
Merge: 9331f794 6c25c379
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 13:46:36 2014 -0500

Merge branch 'master' of github.com:flame/blis

commit 6c25c379fadb50834146e1614f7b80c093c2aad0
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 13:44:10 2014 -0500

Consolidated unpackm ukernels into single file.

Details:
- Reorganized unpackm ukernels into a single file,
bli_unpackm_ref_cxk.c, in a manner similar to what was done for packm
ukernels in commit 4cc2b46.

commit 9331f79443223fe267676ee54c439e1ed320380c
Merge: 7fc48a7d 670b6392
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 10:54:21 2014 -0500

Merge branch 'master' of github.com:flame/blis

commit 670b63926a7f4fc694abc5b1582ef8a4f367f5a8
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 24 10:46:27 2014 -0500

Added whitespace to bli_obj_scalar_ routine calls.

Details:
- Added extra spaces to align arguments of
bli_obj_scalar_init_detached_copy_of(). This misalignment was due to
the fact that the function was previously named
bli_obj_init_scalar_copy_of() and the name change, performed in
b444489f, was done via recursive sed commands which left subsequent
lines untouched.

commit 7fc48a7d920e07fd8e9528ab2565123f8f4e67f9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 23 16:50:58 2014 -0500

Combined 4m/3m bits into an expanded bitfield.

Details:
- Combined the 4m/3m bits into an expanded bitfield, which will encode
the packing "format" of the micro-panels. This will allow for more
easily and compactly encoding additional formats.
- Other minor comment/whitespace updates to bli_type_defs.h.
- Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
format bitfield.
- Comment update to bli_kernel_post_macro_defs.h.
- Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.

commit ef0143cc1417e4815e4cafd5a464cc83fe7a1e86
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sat Aug 23 14:02:27 2014 -0500

Renamed _ri, _ri3 packm ukernels to _4m, _3m.

Details:
- Renamed packm ukernels, _cxk dispatcher, and structure-aware _cxk
helper functions to use _4m and _3m instead of _ri and _ri3 suffixes.
- Updated names of cpp macros that correspond to packm ukernels.

commit b0ccac116158b5ed3316d34798748ba0c6d78672
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 21 19:21:52 2014 -0500

Cleaned up front-end layering for 4m/3m.

Details:
- Added an extra layer to level-3 front-ends (examples: bli_gemm_entry()
and bli_gemm4m_entry()) to hide the control trees from the code that
decides whether to execute native or 4m-based implementations. The
layering was also applied to 3m.
- Branch to 4m code based on the return value of bli_4m_is_enabled(),
rather than the cpp macros BLIS_ENABLE_?COMPLEX_VIA_4M. This lays
the groundwork for users to be able to change at runtime which
implementation is called by the main front-ends (e.g. bli_gemm()).
- Retired some experimental gemm code that hadn't been touched in
months.

commit bedec95451cabfa7a8906b51018a5e0572998a5e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 21 18:25:48 2014 -0500

Added bli_4m API for querying 4m enabled state.

Details:
- Added bli_4m.c (and header), which defines a simple API that can be
used to query, enable, and disable 4m-based complex support in BLIS.
The macros BLIS_ENABLE_?COMPLEX_VIA_4M are now used to initialize
the variable that determines the state (enabled or disabled).
- Changed bli_info*() API so that all cache and register blocksize-
related query routines return the blksz_t objects' values as they
exist at runtime, rather than return the values as determined by the
configuration system (e.g. bli_kernel.h, or defaults for those values
not specified). This sets the foundation for being able to change
those blocksizes at runtime.
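
One way to picture a query/enable/disable API whose initial state comes
from a compile-time macro is the sketch below. Every name here, including
the DEMO_ENABLE_COMPLEX_VIA_4M macro, is a hypothetical stand-in and not
the contents of bli_4m.c.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a configuration-time macro. */
#ifndef DEMO_ENABLE_COMPLEX_VIA_4M
#define DEMO_ENABLE_COMPLEX_VIA_4M 1
#endif

/* Runtime state, initialized once from the compile-time setting. */
static bool demo_4m_state = (DEMO_ENABLE_COMPLEX_VIA_4M != 0);

/* Query whether the 4m-based path is currently enabled. */
bool demo_4m_is_enabled(void) { return demo_4m_state; }

/* Enable or disable the 4m-based path at runtime. */
void demo_4m_enable(void)  { demo_4m_state = true; }
void demo_4m_disable(void) { demo_4m_state = false; }
```

The macro only seeds the variable; after that, callers branch on the
query function, which is what allows the choice to change at runtime.
This sketch ignores thread safety.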

commit b541b667cabfa6d41b50ad1e49209651ee6812cc
Merge: 699a8151 dd61307f
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Aug 20 14:44:51 2014 -0500

Merge branch 'master' of http://github.com/flame/blis

Conflicts:
frame/3/trsm/bli_trsm_blk_var2b.c
frame/3/trsm/bli_trsm_blk_var2f.c

commit 699a8151ca3d5021e834a1784ef45dcc3a3d17cd
Author: Tyler Smith <tmscs.utexas.edu>
Date: Wed Aug 20 14:43:17 2014 -0500

Some improvements to trsm parallelism

commit dd61307f55bb6bc762fe0ef0446479d6c0536723
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 20 09:52:16 2014 -0500

Minor update to sandybridge MC_S, KC_S.

Details:
- Changed sandybridge MC and KC for single-precision real to 128 and 384,
respectively.
- Updated comments in template configuration's gemm micro-kernel file
to document the new "contiguous row preference" macro.

commit d0eec4bddd740ce360d0f655362c551287cf925b
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 19 15:49:19 2014 -0500

Added optional row preference to ukernel config.

Details:
- Added the ability for the kernel developer to indicate the gemm micro-
kernel as having a preference for accessing the micro-tile of C via
contiguous rows (as opposed to contiguous columns). This property may
be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS,
which may be defined or left undefined. Leaving it undefined leads to
the default assumption of column preference.
- Changed conditionals in frame/3/*/*_front.c that induce transposition
of the operation so that the transposition is induced only if there
is disagreement between the storage of C and the preference of the
micro-kernel. Previously, the only conditional that needed to be met
was that C was row-stored, which is to say that we assumed the micro-
kernel preferred column-contiguous access on C.
- Added a "prefers_contig_rows" property to func_t objects, and updated
calls to bli_func_obj_create() in _cntl.c files in order to support
the above changes.
- Removed the row-storage optimization from bli_trsm_front.c because
it is actually ineffective. This is because the right-side case of
trsm flips the A and B micro-panel operands (since BLIS only requires
left-side gemmtrsm/trsm kernels), meaning any transposition done
at the high level is then undone at the low level.
- Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant
invocation of the bli_obj_swap() macro.
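
The change to the transposition conditional can be summarized with a small
sketch. The enum and function names are assumptions for illustration; the
logic follows the description above (transpose only when the storage of C
disagrees with the micro-kernel's preference).

```c
#include <stdbool.h>

/* Hypothetical storage descriptor for the output matrix C. */
typedef enum { DEMO_COL_STORED, DEMO_ROW_STORED } demo_storage_t;

/* Previous behavior: transpose whenever C is row-stored, which assumes
   the micro-kernel prefers column-contiguous access. */
bool demo_transpose_old(demo_storage_t c_storage)
{
    return c_storage == DEMO_ROW_STORED;
}

/* New behavior: transpose only if the storage of C and the micro-kernel
   preference disagree. */
bool demo_transpose_new(demo_storage_t c_storage, bool ukr_prefers_rows)
{
    bool c_is_row_stored = (c_storage == DEMO_ROW_STORED);
    return c_is_row_stored != ukr_prefers_rows;
}
```

When the micro-kernel prefers columns, the two functions agree; they
differ only for a row-preferring micro-kernel, where a row-stored C no
longer triggers a transposition and a column-stored C does.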

commit 4cc2b464f29cafbfef9295b073b857fe0752f710
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Fri Aug 15 11:49:15 2014 -0500

Reorganized packm ukernels.

Details:
- Previously, packm micro-kernels were organized by the implied register
blocksize (panel dimension) assumed by the kernel, meaning conventional,
ri, and ri3 variations of some micro-kernel size were housed in the same
file. This commit reorganizes the micro-kernels so that all sizes reside
in the same file for each format type (conventional, ri, and ri3).

commit fcc10054a11b6fc3976986f57feccf741596cbf6
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 13 12:32:06 2014 -0500

Tweaks to gemm4m, gemm3m virtual ukernels.

Details:
- Fixed a potential, but as-yet unobserved, bug in gemm3m that would
allow undesirable inf/NaN propagation, since C was being scaled by
beta even if beta was equal to zero.
- In gemm3m micro-kernel, we now avoid copying C to the temporary
micro-tile if beta is zero.
- Rearranged computation in gemm4m so that the temporary C micro-tile
is accessed less, and C is accessed only after the micro-kernel
calls. This improves performance marginally in most situations.
- Comment updates to both gemm4m and gemm3m micro-kernels.

commit cdcbacc2fa871317c8e7ef961ecc6d70ab22dc34
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 12 12:45:38 2014 -0500

Removed redundant redef of packm ukr prototypes.

Details:
- Removed redundant macro code that redefined packm ukernel prototypes
when the previous macro was already sufficient. This helps de-clutter
the packm ukernel prototyping headers a little bit.

commit 82dac98d9032ccb598068a55ddf23d7898491e9e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 12 12:36:25 2014 -0500

Relocated packm ukernel includes.

Details:
- Consolidated the include statements for packm ukernel headers from
bli_packm_cxk.h, bli_packm_cxk_ri.h, and bli_packm_cxk_ri3.h to
bli_packm.h.
- Comment/whitespace updates to bli_packm_blk_var3.c, _var4.c.

commit 7f77856e25aad5fc6f172ed3e57b6351804e31a4
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Aug 12 12:20:15 2014 -0500

Removed unused 4m/3m-related packm macro defs.

Details:
- Removed unused and unneeded s- and d-flavored macro definitions for
packm ukernels related to the complex 4m and 3m methods, as
implemented in BLIS.

commit bc1d86b2d4d436b1dfba2d0098501aaca9cbb8b5
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 7 19:01:20 2014 -0500

Sandy Bridge configuration, micro-kernel update.

Details:
- Minor updates to bli_config and bli_kernel.h for sandybridge
configuration.
- Renamed existing AVX intrinsic-based micro-kernel file to
bli_gemm_int_d8x4.c.
- Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based
gemm micro-kernels for single- and double-precision real.

commit 98ec95877a95242e159b2bf0c879115a59e4c6e2
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 7 18:28:32 2014 -0500

Corrected comment for _obj_is_[row|col]_stored().

Details:
- Fixed a mistake in the comments introduced in the previous commit for
bli_obj_is_row_stored() and bli_obj_is_col_stored().

commit 43d5e419e1b424d2143817103dbee8ead797e8aa
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 7 18:20:40 2014 -0500

Reverted _obj_is_[row|col]_stored() macros.

Details:
- Rolled back recent changes to bli_obj_is_row_stored() and
bli_obj_is_col_stored() so that those macros now only inspect the
strides (row or column). It turns out that the more sophisticated
definitions introduced in a51e32e are not necessary, because these
"obj" macros are virtually never used on packed matrices, and when
they are, they can use bli_obj_is_[row|col]_packed() macros, which
inspect the info bitfield.

commit 45692e3ad4b7e1d05ac4302398df4efce04b4284
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Aug 7 13:21:15 2014 -0500

Reverted some accidental changes.

Details:
- Reverted some changes that were unintentionally included in the
previous commit (9526ce98). Thanks to Tony Kelman for pointing
this out. (Note: a few select changes were not reverted.)

commit 9526ce98812be908bc4915f2849b657fb6ce1b49
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 6 14:13:46 2014 -0500

Updated copyright headers of emscripten configuration files.

commit 30833ed71d56f231ddba21e632bcbbc90b12a97c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Aug 6 12:12:03 2014 -0500

Minor edits to configurations' make_defs.mk files.

Details:
- Redefined CFLAGS, CFLAGS_NOOPT, and CFLAGS_KERNELS so that CFLAGS_NOOPT
is defined first and then the other two are defined in terms of
CFLAGS_NOOPT. This textually cleans up the definitions and makes them a
little easier to read.

commit 9d61afeae2ba70fe1df07e7546f6954ea83aed12
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 4 16:01:59 2014 -0500

CHANGELOG update (0.1.5)

0.1.5

Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 4 16:01:58 2014 -0500

Version file update (0.1.5)

commit 4c6ceea4be35d089630986eb5b959b9e97214077
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Mon Aug 4 15:49:59 2014 -0500

Added CBLAS compatibility layer.

Details:
- Added a new section in bli_config.h files of all configurations for
enabling CBLAS support. (Currently, the default is for the CBLAS layer
to be disabled.)
- Added a directory, frame/compat/cblas, to house CBLAS source code. A
subdirectory 'f77_sub' holds subroutine wrappers corresponding to
subroutines found in CBLAS that allow calling some BLAS routines with
the return value passed as the last argument rather than as an actual
(function) return value. This was probably intended to allow CBLAS to
avoid the whole f2c debacle altogether. However, since BLIS does not
assume the presence of a Fortran compiler, we had to provide similar
routines in C.
- A script, integrate-cblas-tarball.sh, is included to streamline the
integration of future revisions of the CBLAS source code.
- The current tarball, cblas.tgz, that was used with the above script to
generate the present set of CBLAS source code is also included.
- Updated blis.h to include necessary CBLAS-related headers.
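
The idea behind the 'f77_sub' wrappers can be sketched as follows: a routine that returns its result as a function value is wrapped by a subroutine-style routine that stores the result through its last argument. The names my_dot and my_dot_sub are hypothetical, and the sketch assumes positive increments only; the actual CBLAS wrappers are not reproduced here.

```c
/* Hypothetical function-style routine: returns the dot product.
   Assumes incx > 0 and incy > 0 for simplicity. */
static double my_dot(int n, const double *x, int incx,
                     const double *y, int incy)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += x[i * incx] * y[i * incy];
    return sum;
}

/* Subroutine-style wrapper: same computation, but the result is
   written through the last argument instead of being returned. */
static void my_dot_sub(int n, const double *x, int incx,
                       const double *y, int incy, double *result)
{
    *result = my_dot(n, x, incx, y, incy);
}
```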

commit caab62dac0fb0bd0d674118f409c81680db94d29
Merge: 383631b5 db97ce97
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Aug 3 14:36:18 2014 -0500

Merge pull request 19 from kevinoid/fix-install-perms-error

Fix permissions error installing to non-owned directory

commit db97ce979b88c051922c2f946ce52d523c7a12c6
Author: Kevin Locke <kevinkevinlocke.name>
Date: Sun Aug 3 12:48:04 2014 -0600

Fix permissions error installing to non-owned directory

When installing to a directory which is not owned by the installing
user, even when the user has write permission for the directory, the
installation can fail with an error similar to the following:

Installing libblis-0.1.4-7-sandybridge.a into /usr/local/lib/
install: cannot change permissions of ‘/usr/local/lib’: Operation not permitted
Makefile:658: recipe for target '/usr/local/lib/libblis-0.1.4-7-sandybridge.a' failed
make: *** [/usr/local/lib/libblis-0.1.4-7-sandybridge.a] Error 1

In the example case, the error occurred because the user attempted to
install to /usr/local and /usr/local/lib is owned by root with mode 2755
which the Makefile unsuccessfully attempted to change to 0755.

Given that installing to /usr/local is likely to be quite common and the
ownership/permissions are the default for Debian and Debian-derived
Linux distributions (perhaps others as well), this commit attempts to
support that use case by using mkdir rather than install to create the
directory (which is the same approach as Automake).

Signed-off-by: Kevin Locke <kevinkevinlocke.name>

commit 383631b514c3d42b724640f57644eea276cc418c
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 31 14:51:48 2014 -0500

Redefined bit field macros with bitshift operator.

Details:
- Redefined many of the macros that define bit fields and bit values in
the obj_t info field using the bitshift operator (<<). This makes it
easier to reorder bit fields, or expand existing bit fields, or add
new fields. The bitshifting should be evaluated by the compiler at
compile-time.
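
A minimal sketch of the shift-based style, using a made-up info-word layout (the field names, widths, and positions below are illustrative assumptions, not the actual obj_t layout). Because each shift is derived from the previous field, widening or reordering a field only requires touching one definition.

```c
/* Hypothetical layout of an info word. Each field's position is
   expressed relative to the field before it. */
#define DT_SHIFT        0
#define DT_NUM_BITS     3
#define TRANS_SHIFT     (DT_SHIFT + DT_NUM_BITS)
#define TRANS_NUM_BITS  1
#define CONJ_SHIFT      (TRANS_SHIFT + TRANS_NUM_BITS)

/* Masks built with the bitshift operator; these are integer constant
   expressions, so the compiler can evaluate them at compile time. */
#define DT_BITS    (0x7u << DT_SHIFT)
#define TRANS_BIT  (0x1u << TRANS_SHIFT)
#define CONJ_BIT   (0x1u << CONJ_SHIFT)

/* Extract the (hypothetical) datatype field from an info word. */
static unsigned get_dt(unsigned info)
{
    return (info & DT_BITS) >> DT_SHIFT;
}
```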

commit 137143345dc93cc9a83da5ba88b25bac7502de86
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Thu Jul 31 12:12:45 2014 -0500

Reimplemented unit blocksize fix in prev commit.

Details:
- Instead of inferring the storage format of the micro-panels from within
the packm variants, we now pass in a bool_t value that denotes whether
the packed matrix contains row-stored column panels or column-stored
row panels. This value can then be tested more easily inside the main
packm variant loop.
- Renumbered pack_t schema values in bli_type_defs.h so that there are
now five bits, each with a different meaning:
- 4: packed or not packed?
- 3: packed for 3m?
- 2: packed for 4m?
- 1: packed to panels?
- 0: stored by rows or columns?
- Added new macros that test for status of above bits in schema bit
subfield, and renamed some existing macros related to 4m/3m.
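
The five-bit schema listed above might be tested along these lines. The macro and function names are hypothetical, and the polarity of bit 0 (which value means rows and which means columns) is not stated in the commit message, so the sketch only exposes the raw bit.

```c
#include <stdbool.h>

/* Hypothetical names for the five schema bits described above. */
#define SCHEMA_PACKED_BIT  (1u << 4)  /* packed or not packed?       */
#define SCHEMA_3M_BIT      (1u << 3)  /* packed for 3m?              */
#define SCHEMA_4M_BIT      (1u << 2)  /* packed for 4m?              */
#define SCHEMA_PANEL_BIT   (1u << 1)  /* packed to panels?           */
#define SCHEMA_RC_BIT      (1u << 0)  /* stored by rows or columns?  */

static bool schema_is_packed(unsigned s) { return (s & SCHEMA_PACKED_BIT) != 0; }
static bool schema_is_3m(unsigned s)     { return (s & SCHEMA_3M_BIT) != 0; }
static bool schema_is_4m(unsigned s)     { return (s & SCHEMA_4M_BIT) != 0; }
static bool schema_is_panel(unsigned s)  { return (s & SCHEMA_PANEL_BIT) != 0; }

/* Returns the raw row/column bit; its polarity is an assumption left
   to the caller, since the source does not specify it. */
static unsigned schema_rc_bit(unsigned s) { return s & SCHEMA_RC_BIT; }
```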

commit a51e32ec061941cd10119ea80115c82a40b1673f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Wed Jul 30 10:41:48 2014 -0500

Fixed unit register blocksize brokenness.

Details:
- Fixed a breakdown in BLIS's ability to differentiate between row-stored
and column-stored micro-panels when MR or NR is unit. When either
register blocksize (or both) is equal to one, inspecting the strides of
the affected packed micro-panel is no longer sufficient to determine
whether the micro-panel is a row-stored column panel or a column-stored
row panel (because both strides are unit). At that point, dimension
information is necessary when invoking the bli_is_row_stored_f() and
bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
Ilya Polkovnichenko for reporting this bug.
- Added panel dimensions (m and n) to obj_t, which are set in
packm_init() and then passed into the blocked variants to support the
aforementioned update.
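
The ambiguity can be illustrated with a small sketch. It assumes the usual packed layouts: a column-stored row panel of height MR has row stride 1 and column stride MR, and a row-stored column panel of width NR has row stride NR and column stride 1. All names are hypothetical. Once MR or NR equals one, both strides are unit and the two layouts become indistinguishable by strides alone.

```c
#include <stdbool.h>

/* Hypothetical stride pair; not a BLIS type. */
typedef struct { int rs; int cs; } strides_t;

/* Column-stored row panel (MR x k): each column of MR elements is contiguous. */
static strides_t row_panel_strides(int mr)
{
    strides_t s = { 1, mr };
    return s;
}

/* Row-stored column panel (k x NR): each row of NR elements is contiguous. */
static strides_t col_panel_strides(int nr)
{
    strides_t s = { nr, 1 };
    return s;
}

/* True when strides alone cannot tell the two layouts apart. */
static bool strides_are_ambiguous(strides_t s)
{
    return s.rs == 1 && s.cs == 1;
}
```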

commit c2732272f0ac680a0ad19fa9db5d587398a1479a
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Tue Jul 29 16:37:18 2014 -0500

Removed old/unused packm variants.

commit b97fa9a5a70fe0123e5eebd999b947461d38445f
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jul 27 18:54:09 2014 -0500

Minor usage update to build/bump-version.sh.

commit b18ba5f62d98629cdd519ff4c96fc67ec1a62fb9
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jul 27 18:52:05 2014 -0500

Added missing 'bla_' prefix to r_imag(), d_imag().

Details:
- Added "bla_" to f2c functions r_imag() and d_imag(). Thanks to Murtaza
Ali for pointing out the mis-named functions.

commit af7a8e6c042cade452130a6729377f1a3ef4e19e
Author: Field G. Van Zee <fieldcs.utexas.edu>
Date: Sun Jul 27 18:20:13 2014 -0500

CHANGELOG update (0.1.4)

