## Benchmark

We evaluate the following categories of operations:

1. **FP16 Matrix Operations**
   - GEMM (Matrix Multiplication)
   - GEMV (Matrix-Vector Multiplication)
2. **INT8 Matrix Operations**
   - GEMM (Matrix Multiplication)
   - GEMV (Matrix-Vector Multiplication)
3. **Dequantization Operations**
   - Weight-Quantized (WQ) GEMM and GEMV
4. **Contiguous Batching** for enhanced GPU utilization
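To make the GEMM/GEMV distinction concrete, here is a minimal sketch in plain NumPy (not the BitBLAS kernels themselves; shapes are scaled down and illustrative — in the configuration table further down, `M == 1` rows are GEMV cases and `M == 8192` rows are GEMM cases):

```python
import numpy as np

# Scaled-down illustrative shapes; the real benchmark sweeps the configs
# listed in the configuration table.
M, N, K = 1, 1024, 1024   # with M == 1 the GEMM degenerates into a GEMV

A = np.random.randn(M, K).astype(np.float16)   # activation
W = np.random.randn(K, N).astype(np.float16)   # weight

C = A @ W   # (M, K) @ (K, N) -> (M, N)
assert C.shape == (M, N)
```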
### FP16 GEMM and GEMV


### INT8 GEMM and GEMV


### Dequantize GEMM and GEMV


### Contiguous Batching Performance
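Contiguous batching (often called continuous batching in serving frameworks) packs variable-length requests into one contiguous buffer so a single large kernel covers all of them. A minimal NumPy sketch of the packing step, with hypothetical shapes (the kernels that consume the packed buffer are not shown):

```python
import numpy as np

hidden = 8  # hypothetical hidden size
# Three requests with different sequence lengths.
seqs = [np.ones((3, hidden)), np.ones((5, hidden)), np.ones((2, hidden))]

# Pack into one contiguous (total_tokens, hidden) activation plus offsets,
# so one GEMM serves every request instead of one launch per request.
packed = np.concatenate(seqs, axis=0)
offsets = np.cumsum([0] + [s.shape[0] for s in seqs])

assert packed.shape == (10, hidden)
assert offsets.tolist() == [0, 3, 8, 10]
```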

### Benchmark Configuration
The benchmark configurations for each test scenario are detailed below:
<div align="center">
<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Config</th><th>Provider</th><th>M</th><th>N</th><th>K</th></tr></thead><tbody>
<tr><td>V0</td><td>None</td><td>1</td><td>16384</td><td>16384</td></tr>
<tr><td>V1</td><td>BLOOM</td><td>1</td><td>43008</td><td>14336</td></tr>
<tr><td>V2</td><td>BLOOM</td><td>1</td><td>14336</td><td>14336</td></tr>
<tr><td>V3</td><td>BLOOM</td><td>1</td><td>57344</td><td>14336</td></tr>
<tr><td>V4</td><td>BLOOM</td><td>1</td><td>14336</td><td>57344</td></tr>
<tr><td>V5</td><td>OPT</td><td>1</td><td>9216</td><td>9216</td></tr>
<tr><td>V6</td><td>OPT</td><td>1</td><td>36864</td><td>9216</td></tr>
<tr><td>V7</td><td>OPT</td><td>1</td><td>9216</td><td>36864</td></tr>
<tr><td>V8</td><td>LLAMA</td><td>1</td><td>22016</td><td>8192</td></tr>
<tr><td>V9</td><td>LLAMA</td><td>1</td><td>8192</td><td>22016</td></tr>
<tr><td>V10</td><td>LLAMA-2</td><td>1</td><td>8192</td><td>8192</td></tr>
<tr><td>V11</td><td>LLAMA-2</td><td>1</td><td>28672</td><td>8192</td></tr>
<tr><td>V12</td><td>LLAMA-2</td><td>1</td><td>8192</td><td>28672</td></tr>
<tr><td>M0</td><td>None</td><td>16384</td><td>16384</td><td>16384</td></tr>
<tr><td>M1</td><td>BLOOM</td><td>8192</td><td>43008</td><td>14336</td></tr>
<tr><td>M2</td><td>BLOOM</td><td>8192</td><td>14336</td><td>14336</td></tr>
<tr><td>M3</td><td>BLOOM</td><td>8192</td><td>57344</td><td>14336</td></tr>
<tr><td>M4</td><td>BLOOM</td><td>8192</td><td>14336</td><td>57344</td></tr>
<tr><td>M5</td><td>OPT</td><td>8192</td><td>9216</td><td>9216</td></tr>
<tr><td>M6</td><td>OPT</td><td>8192</td><td>36864</td><td>9216</td></tr>
<tr><td>M7</td><td>OPT</td><td>8192</td><td>9216</td><td>36864</td></tr>
<tr><td>M8</td><td>LLAMA</td><td>8192</td><td>22016</td><td>8192</td></tr>
<tr><td>M9</td><td>LLAMA</td><td>8192</td><td>8192</td><td>22016</td></tr>
<tr><td>M10</td><td>LLAMA-2</td><td>8192</td><td>8192</td><td>8192</td></tr>
<tr><td>M11</td><td>LLAMA-2</td><td>8192</td><td>28672</td><td>8192</td></tr>
<tr><td>M12</td><td>LLAMA-2</td><td>8192</td><td>8192</td><td>28672</td></tr>
</tbody></table>
</div>
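Each row above defines a single matmul of shape (M, K) × (K, N). As a rough way to interpret measured latencies for these configs, achieved throughput can be derived as follows (a hypothetical helper for illustration, not part of the BitBLAS benchmark suite):

```python
def matmul_tflops(M, N, K, latency_s):
    """Achieved TFLOPS of an (M, K) x (K, N) matmul that ran in latency_s seconds."""
    return 2 * M * N * K / latency_s / 1e12   # 2 FLOPs (mul + add) per MAC

# Config M10 (LLAMA-2, M = N = K = 8192): ~1.1e12 FLOPs per call,
# so a hypothetical 10 ms latency corresponds to roughly 110 TFLOPS.
print(matmul_tflops(8192, 8192, 8192, 0.01))
```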
## What's Changed
* fix typos by xzyaoi in https://github.com/microsoft/BitBLAS/pull/23
* [Kernel] Extend Fast Decoding to UINT2 + QZeros by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/25
* [FP8] Support FP8 MatrixCore Code gen and related test by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/29
* [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/30
* [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/32
* [BUG] Make sure the torch tensor is contiguous by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/34
* [BitNet] Disable accelerate for BitNET by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/36
* [FP8] Support Weight Dequantize FP16xFP8_E4M3 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/42
* [DEV][FP8] Improve e4m3 decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/43
* [Target] Improve TVM Target related items by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/45
* [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/46
* [Feature] Enhancing MatmulOps with Splitk Support by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/48
* [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/49
* [Dev] Improve General Matmul With Splitk by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/50
* [Dev] Bump Version to 0.0.1.dev9 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/51
* [Dev] Fix GEMV Dynamic Scheduling with Splitk by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/52
* [BugFix] Fix a bug in Static shape build by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/53
* [Dev] Fix a bug within FP8 E4M3 Fast Decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/54
* [Dev] Issue 24: Fix a bug of repacking AutoGPTQ quantized parameters by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/57
* [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by Qubitium in https://github.com/microsoft/BitBLAS/pull/58
* [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by Qubitium in https://github.com/microsoft/BitBLAS/pull/59
* Fix gpu model missing from tvm target remap by Qubitium in https://github.com/microsoft/BitBLAS/pull/61
* [Dev] Potentially improve performance through block reduction by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/63
* [Readme] Update support matrix in README by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/67
* [Dev] Move bitblas package to the project root by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/68
* [Dev] Refactor scripts based on our new directory structure by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/69
* [Dev] Refactor testing scripts and fix security issues by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/72
* [CI] Auto Format Checking and test checking. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/73
* [Fix] Fix Bitblas Relax relevant pass and test by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/74
* [CI] Edit the notify setting in our CI by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/76
* [Dev] Move Relax Pass from testing to integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/77
* [Dev] Refactor the ops script implementation with SE by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/78
* [Dev] Fix a bug in general matmul ops with zero by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/79
* [Dev] Append Efficient CUDA test for low precision batch decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/80
* [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/83
* [Dev] Refactor Modeling BitNet to support vLLM quant linear by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/84
* Fix database path default by janEbert in https://github.com/microsoft/BitBLAS/pull/85
* [Issue 62] flexible whl for different cuda version by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/86
* Limiting parallel jobs for local build by bibo-msft in https://github.com/microsoft/BitBLAS/pull/88
* [Dev] Bump version to 0.0.1.dev13 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/87
* [Dev] Feature Improves for bitnet and block reduction by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/92
* [Dev] Bug fix within block reduce schedule template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/93
* [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/94
* [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/95
* [Dev] Append CUTLASS submodule by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/96
* [Dev] Add Basic Benchmark Implementation for operators by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/98
* [Dev] Improve benchmark scripts by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/99
* Fix virtual env issue for our benchmark workflow by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/101
* [BUG Fix] Add missing checkout statements in benchmark workflow by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/102
* Update benchmark.yml by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/103
* [BUG Fix] remove ref assignments of the pr commit by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/104
* Ref GPTQModel for 3rd support/integration by Qubitium in https://github.com/microsoft/BitBLAS/pull/106
* [Dev] Complete benchmark op sets of ci by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/100
* [Dev] Remove Redundant Dynamic Shared Memory sync by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/107
* [Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/110
* [Dev] Enhance Lower Warp memory to support multi stage tensorization by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/111
* Refactor benchmark yml to disable alters on issue by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/113
* [Dev] Enhance LOP3 Instruction Registration to support incoming warp level lop3 instructions by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/114
* [Dev] Merge BlockReduce with naive schedule template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/119
* [Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/124
* [Fix][Dev] Typo fix for our workflow and enhance lop3 decode to support scaling by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/125
* [Dev] Convert the quant compress from numpy into tvm runtime by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/126
* Update documents by xysmlx in https://github.com/microsoft/BitBLAS/pull/129
* [Dev] Refactor the weight transformation to support upcoming stage3 transform by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/130
* [Dev] Bring Block Reduction into our search space and policy by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/132
* Fix retrieve head commit in benchmark by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/134
* [Integration] Upload tutorial for making a bitnet ckpt for vLLM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/135
* [Typo] Fix missing links in the bitnet integration's docs by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/136
* fix BitNet integration for vLLM by xysmlx in https://github.com/microsoft/BitBLAS/pull/137
* fix BitNet integration for vLLM by xysmlx in https://github.com/microsoft/BitBLAS/pull/139
* [Dev] Set default weight transformation into Ladder Stage3 LDMatrixTransform by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/133
* [Dev] Disable Block reduction for int8 by default by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/140
* [Dev] BUG Fix for bitnet integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/141
* [Feature] Register Missing FastDecoding for INT8xINT4 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/142
* [BUG Fix] Fix the NVCC compile options for CUDA Version >= 12.5 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/143
* [Integration] Compress Gateup and QKV for bitnet integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/144
* [Enhancement] Improve elementwise schedule via vectorization by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/145
* [Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/146
* [Fix] Fix scale and zero scopes for scale only template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/147
* [Dev] Support Numeric Precision BFloat16 as activation type by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/148
* [Version] Bump Version to 0.0.1.dev15 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/149
* [Dev] Serialize Generated Kernel Name with Operator Config and Hint by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/153
* [BUG] Set Device when kernel be applied into Multiple GPUs. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/155
* [Benchmark] Fast Decoding Benchmark by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/158
* [BUGFix] Disable tensorcore when shape is really small by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/159
* [BUGFix] Register missing FP8 LDMATRIX Instructions for dynamic shared memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/162
* [Docs] Update install command from github repo by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/163
* [BugFix] Fix BitBLAS Linear with BFloat16 input by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/164
* [BUGFix] Fix LowerThreadAllReduce Pass for Hopper Arch by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/165
* [Dev] Enhance Thread Sync Injector for Stream-K Implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/166
* [Dev] Revert Hack impl for memory caching by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/167
* [TL] Update several TL Examples by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/168
* [TL] Enhance Layout Annotate Pass to handle PTX Inst by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/170
* chore(deps): bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by dependabot in https://github.com/microsoft/BitBLAS/pull/175
* [TL] Add TL Layout and Macro utils by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/174
* [TL] Support GEMM_SS Macro to perform gemm directly from shared memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/176
* [TL] Inject Storage Sync Scope Automatically for TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/177
* [TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/178
* [TL] Enhance TL to import customized c headers by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/179
* [Dev] Bug fix for Block Reduce Template and improve TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/183
* [BugFix] Disable 8bit TensorCore for SM Version lower than 80 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/185
* [Dev] Dequantize SIMT Matmul Implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/188
* [Dev] Improve Dequant performance on CUDA Simt by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/189
* [TL] Append Macro Test Case for GEMM and Dequant GEMM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/190
* [TL] Add example usage/test case for Dynamic Symbolic by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/191
* [BugFix] Fix LLVM install bug by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/193
* [Test] Add Thread Level Macro Dequantize Gemm Test Cases by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/194
* [TL][BugFix] Add implementation of TL Gemm and Fix a bug for TL Jit by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/195
* [TL] test flashattention script by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/196
* [TL][BugFix] Disable Buffer Vectorization and Add OP Related TL Test Cases by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/197
* [TL] Wrap TL Kernel with Scheduler by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/199
* [Dev][TL] Add TL BaseScheduler and Library Generator by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/200
* [Dev][TL] Hardware Aware Tuning Examples with TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/201
* [TL] Initial implementation of the FlashAttention op in TL by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/202
* [Dev] Enhance Operator Cache to support multi-thread environments by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/205
* [TL] Adapt TL Hardware-aware Search Space with Roller by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/207
* [TL] [Doc] add flash attention usage document by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/210
* [Dev] Add support and test case for Ladder Weight only Transformation Matmul Operator by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/212
* [Dev][TL] Merge Hopper and Pipeline Modifications by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/213
* [Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/214
* [TL] [Issue215] add simplify pass for TL and test script, fixing issue by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/216
* [Bugfix] Enhance LowerAsyncCopy Pass to handle INT8 dma copy with predicate by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/219
* [Dev] Disable smooth layout rewrite for buffer store in some case by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/220
* [Dev][TL] Enhance TL Parser to support flexible tile lang kernel implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/222
* [Dev][TL] Implement Tile Language Dequant Matmul and Test Case by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/224
* [Issue 192] Tail split support for dynamic matmul by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/227
* [Dev][TL] Following updates of Tile Language Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/226
* [Dev] Add some tests and examples by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/228
* [AMD][HIP] Add HIP Code Generation with Block Primitives from Composable kernel Tile by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/223
* [Dev][Bugfix] Add target argument and remove override register for hip callback compile by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/229
* [Bugfix] Fix build bug due to submodule update by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/230
* [Dev] Support Tile Lang INT8xINT8 TensorCore Macro by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/231
* [Dev][TL] Implement MMA INT4 Tensor Core and Correctness Test Case. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/232
* [Dev][BitNET] Implement INT4xINT2 GEMM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/233
* [Dev][Bitnet] Implement Operator with INT4xINT4/INT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/234
* [Dev] Update News in Readme by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/235
* [Dev] Enhance TileLang Backend and fix a bug for INT4xINT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/236
* [DEV][TL] Support AMD Matrix Code Implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/237
* [Dev][HIP] Fix MFMA Codegen by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/238
* [CI] Disable Benchmark workflow due to github action v4 updates by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/239
* [Dev] Enhance Infra for ROCM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/240
* [Dev][AMD] Add AMD CDNA Arch by Cunxiao2002 in https://github.com/microsoft/BitBLAS/pull/225
* [Dev] Fix some lint issues by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/241
* [Dev][Relax] Update Bitblas end2end tuning example with relax by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/242
* [Dev] Fix illegal pass order by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/243
* [Docs] update the contributing's table of contents by emmanuel-ferdman in https://github.com/microsoft/BitBLAS/pull/245
* [Dev][AMD] Implement LDS Async Copy for CDNA Arch by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/246
* [Dev][AMD] Support LDS and Flash Attention for AMD Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/247
* [AMD][TL] Introduce K Pack and a Conflict Free swizzling into Matrix Core by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/248
* [BUGFix] Introduce our own `assert_close` to allow a few mismatched elements in some cases by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/249
* [Dev][AMD] Implement conditional async load for AMD HIP Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/250
* [BUGFix] Fix MatmulDequantize with FP4 Format by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/254
* [Dev] Enhance Backend Abstraction for TileLang by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/255
* [Docker] Add Dockerfile to set up the application environment by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/260
* [Relax] Fix end2end tuning for relax graph by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/261
* [Dev] Refactor codebase to save import time by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/262
* [Enhancement][TileLang] Introduce Pass `LegalizeSafeMemoryAccess` to auto protect memory access by Injecting IfThenElse Node by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/267
* [TileLang][Dev] Enhance Layout Inference Pass to infer with complex parallel primitives by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/268
* [Dev] Migrate default backend from tir into tilelang by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/270
* [Dev] Fallback NF format to TIR backend as TileLang implementation is not currently supported. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/272
* [Dev] Implement TileLang NF Format Dequantize by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/273
* [Release] Bump version to 0.1.0 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/274
* [Bugfix] Fix Mismatched Retnet LinearAttention Layout by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/275
* [Bugfix] Fix correctness issue for float16xuint1 with fast dequantize by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/277
* Fix: Remove composable_kernel include from setup.py by LeslinD in https://github.com/microsoft/BitBLAS/pull/279
* [Bugfix] Fix VERSION FileNotFoundError bugs via pip installation by senlyu163 in https://github.com/microsoft/BitBLAS/pull/285
* [Doc] Move Torch Tensors to GPU by senlyu163 in https://github.com/microsoft/BitBLAS/pull/286
## New Contributors
* xzyaoi made their first contribution in https://github.com/microsoft/BitBLAS/pull/23
* tzj-fxz made their first contribution in https://github.com/microsoft/BitBLAS/pull/57
* Qubitium made their first contribution in https://github.com/microsoft/BitBLAS/pull/58
* janEbert made their first contribution in https://github.com/microsoft/BitBLAS/pull/85
* dependabot made their first contribution in https://github.com/microsoft/BitBLAS/pull/175
* Cunxiao2002 made their first contribution in https://github.com/microsoft/BitBLAS/pull/225
* emmanuel-ferdman made their first contribution in https://github.com/microsoft/BitBLAS/pull/245
* LeslinD made their first contribution in https://github.com/microsoft/BitBLAS/pull/279
* senlyu163 made their first contribution in https://github.com/microsoft/BitBLAS/pull/285
**Full Changelog**: https://github.com/microsoft/BitBLAS/compare/v0.0.1dev...v0.1.0