Bitblas

Latest version: v0.1.0.post1

0.1.0

Benchmark

We evaluate the following categories of operations:
1. FP16 Matrix Operations
- GEMM (Matrix Multiplication)
- GEMV (Matrix-Vector Multiplication)
2. INT8 Matrix Operations
- GEMM (Matrix Multiplication)
- GEMV (Matrix-Vector Multiplication)
3. Dequantization Operations
- Weight Quantization (WQ) GEMM and GEMV
4. Contiguous batching performance for enhanced GPU utilization

FP16 GEMM and GEMV
![op_benchmark_a100_fp16_gemm](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dense_matmul/png/op_benchmark_a100_fp16_gemm.png)
![op_benchmark_a100_fp16_gemv](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dense_matmul/png/op_benchmark_a100_fp16_gemv.png)
INT8 GEMM and GEMV
![op_benchmark_a100_int8_gemm](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dense_matmul/png/op_benchmark_a100_int8_gemm.png)
![op_benchmark_a100_int8_gemv](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dense_matmul/png/op_benchmark_a100_int8_gemv.png)

Dequantize GEMM and GEMV
![op_benchmark_a100_wq_gemm](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dequant_matmul/png/op_benchmark_a100_wq_gemm.png)
![op_benchmark_a100_wq_gemv](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/dequant_matmul/png/op_benchmark_a100_wq_gemv.png)
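
To make the term "dequantize GEMV" concrete, here is a minimal pure-Python sketch of a weight-quantized matrix-vector product: 4-bit weights are unpacked, shifted by a zero point, scaled, and multiplied with the activation vector. It is not the BitBLAS kernel or its storage format. The packing order (low nibble first), the zero point of 8, the per-row scale, and the helper names `pack_uint4`, `unpack_uint4`, and `dequant_gemv` are all assumptions made for illustration.

```python
def pack_uint4(values):
    """Pack pairs of 4-bit unsigned values (0..15) into bytes, low nibble first.

    The nibble order is an assumption for this sketch, not BitBLAS's layout.
    """
    assert len(values) % 2 == 0
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_uint4(packed):
    """Inverse of pack_uint4: recover the list of 4-bit values."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)  # low nibble
        out.append(byte >> 4)    # high nibble
    return out


def dequant_gemv(x, packed_rows, scales, zero=8):
    """Compute y[n] = sum_k x[k] * scales[n] * (q[n][k] - zero).

    packed_rows holds one packed weight row per output element; scales holds
    one (hypothetical) per-row scale factor.
    """
    y = []
    for packed, scale in zip(packed_rows, scales):
        q = unpack_uint4(packed)
        y.append(sum(xk * scale * (qk - zero) for xk, qk in zip(x, q)))
    return y


x = [1.0, 2.0, 3.0, 4.0]
rows = [pack_uint4([8, 9, 10, 11]), pack_uint4([7, 8, 8, 0])]
print(dequant_gemv(x, rows, scales=[0.5, 0.25]))  # [10.0, -8.25]
```

A real kernel fuses the unpack and scale steps into the matrix multiply so the full-precision weights are never written to memory; the sketch separates them only for readability.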

Contiguous Batching Performance
![contiguous_batching_benchmark_a100](https://github.com/LeiWang1999/bitblas-benchmark/raw/master/ampere_benchmark/contiguous_dequant_matmul/png/contiguous_batching_benchmark_a100.png)

Benchmark Configuration

The benchmark configurations for each test scenario are detailed below:

<!-- center -->
<div align="center">

<table class="tableizer-table">
<thead><tr class="tableizer-firstrow"><th>Config</th><th>Provider</th><th>M</th><th>N</th><th>K</th></tr></thead><tbody>
<tr><td>V0</td><td>None</td><td>1</td><td>16384</td><td>16384</td></tr>
<tr><td>V1</td><td>BLOOM</td><td>1</td><td>43008</td><td>14336</td></tr>
<tr><td>V2</td><td>BLOOM</td><td>1</td><td>14336</td><td>14336</td></tr>
<tr><td>V3</td><td>BLOOM</td><td>1</td><td>57344</td><td>14336</td></tr>
<tr><td>V4</td><td>BLOOM</td><td>1</td><td>14336</td><td>57344</td></tr>
<tr><td>V5</td><td>OPT</td><td>1</td><td>9216</td><td>9216</td></tr>
<tr><td>V6</td><td>OPT</td><td>1</td><td>36864</td><td>9216</td></tr>
<tr><td>V7</td><td>OPT</td><td>1</td><td>9216</td><td>36864</td></tr>
<tr><td>V8</td><td>LLAMA</td><td>1</td><td>22016</td><td>8192</td></tr>
<tr><td>V9</td><td>LLAMA</td><td>1</td><td>8192</td><td>22016</td></tr>
<tr><td>V10</td><td>LLAMA-2</td><td>1</td><td>8192</td><td>8192</td></tr>
<tr><td>V11</td><td>LLAMA-2</td><td>1</td><td>28672</td><td>8192</td></tr>
<tr><td>V12</td><td>LLAMA-2</td><td>1</td><td>8192</td><td>28672</td></tr>
<tr><td>M0</td><td>None</td><td>16384</td><td>16384</td><td>16384</td></tr>
<tr><td>M1</td><td>BLOOM</td><td>8192</td><td>43008</td><td>14336</td></tr>
<tr><td>M2</td><td>BLOOM</td><td>8192</td><td>14336</td><td>14336</td></tr>
<tr><td>M3</td><td>BLOOM</td><td>8192</td><td>57344</td><td>14336</td></tr>
<tr><td>M4</td><td>BLOOM</td><td>8192</td><td>14336</td><td>57344</td></tr>
<tr><td>M5</td><td>OPT</td><td>8192</td><td>9216</td><td>9216</td></tr>
<tr><td>M6</td><td>OPT</td><td>8192</td><td>36864</td><td>9216</td></tr>
<tr><td>M7</td><td>OPT</td><td>8192</td><td>9216</td><td>36864</td></tr>
<tr><td>M8</td><td>LLAMA</td><td>8192</td><td>22016</td><td>8192</td></tr>
<tr><td>M9</td><td>LLAMA</td><td>8192</td><td>8192</td><td>22016</td></tr>
<tr><td>M10</td><td>LLAMA-2</td><td>8192</td><td>8192</td><td>8192</td></tr>
<tr><td>M11</td><td>LLAMA-2</td><td>8192</td><td>28672</td><td>8192</td></tr>
<tr><td>M12</td><td>LLAMA-2</td><td>8192</td><td>8192</td><td>28672</td></tr>
</tbody></table>
</div>
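
As a rough guide to the scale of these shapes, the sketch below computes the arithmetic work and the weight storage for a few rows copied from the table. It assumes the usual count of 2 FLOPs per multiply-accumulate and an N x K weight matrix, and it treats rows with M = 1 as GEMV; the helper names `gemm_flops` and `weight_bytes` are hypothetical and not part of BitBLAS.

```python
# A few (M, N, K) rows copied from the configuration table.
CONFIGS = {
    "V0": (1, 16384, 16384),
    "V8": (1, 22016, 8192),
    "M0": (16384, 16384, 16384),
    "M8": (8192, 22016, 8192),
}


def gemm_flops(m, n, k):
    """FLOPs for an (m x k) by (k x n) product, counting 2 per multiply-add."""
    return 2 * m * n * k


def weight_bytes(n, k, bits):
    """Bytes needed to store an N x K weight matrix at the given bit width."""
    return n * k * bits // 8


for name, (m, n, k) in CONFIGS.items():
    kind = "GEMV" if m == 1 else "GEMM"  # assumption: M = 1 rows are GEMV
    print(
        f"{name} ({kind}): {gemm_flops(m, n, k) / 1e9:.2f} GFLOP, "
        f"FP16 weights {weight_bytes(n, k, 16) / 2**20:.0f} MiB, "
        f"INT4 weights {weight_bytes(n, k, 4) / 2**20:.0f} MiB"
    )
```

For the M = 1 rows the FLOP count is small relative to the bytes of weights that must be read, which is why narrower weight formats matter most in the GEMV cases.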

What's Changed
* fix typos by xzyaoi in https://github.com/microsoft/BitBLAS/pull/23
* [Kernel] Extend Fast Decoding to UINT2 + QZeros by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/25
* [FP8] Support FP8 MatrixCore Code gen and related test by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/29
* [FP8] Improve tensor adapter to support fp8 conversion between torch and numpy by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/30
* [Bug] Improve the Default Config Value and fix a Bug for TensorCore Config with Small shapes by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/32
* [BUG] Make sure the torch tensor is contiguous by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/34
* [BitNet] Disable accelerate for BitNET by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/36
* [FP8] Support Weight Dequantize FP16xFP8_E4M3 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/42
* [DEV][FP8] Improve e4m3 decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/43
* [Target] Improve TVM Target related items by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/45
* [BUGFix] Fix UINT/INT8 dequantize implementation and optimize the schedule template for float32 accum by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/46
* [Feature] Enhancing MatmulOps with Splitk Support by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/48
* [Dev] Bump Version to dev0.8 and fix issue INT8xINT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/49
* [Dev] Improve General Matmul With Splitk by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/50
* [Dev] Bump Version to 0.0.1.dev9 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/51
* [Dev] Fix GEMV Dynamic Scheduling with Splitk by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/52
* [BugFix] Fix a bug in Static shape build by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/53
* [Dev] Fix a bug within FP8 E4M3 Fast Decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/54
* [Dev] Issue24: Fix a bug of repack AutoGPTQ quantized parameters by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/57
* [FIX] GPU detection in multigpu env and OEM A100 not matching TVM by Qubitium in https://github.com/microsoft/BitBLAS/pull/58
* [FIX] Must validate ENV settings or wrong gpu selected by nvidia-smi by Qubitium in https://github.com/microsoft/BitBLAS/pull/59
* Fix gpu model missing from tvm target remap by Qubitium in https://github.com/microsoft/BitBLAS/pull/61
* [Dev] Potentially improve performance through block reduction by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/63
* [Readme] Update support matrix in README by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/67
* [Dev] Move bitblas package to the project root by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/68
* [Dev] Refactor scripts based on our new directory structure by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/69
* [Dev] Refactor testing scripts and fix security issues by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/72
* [CI] Auto Format Checking and test checking. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/73
* [Fix] Fix Bitblas Relax relevant pass and test by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/74
* [CI] Edit the notify setting in our CI by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/76
* [Dev] Move Relax Pass from testing to integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/77
* [Dev] Refactor the ops script implementation with SE by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/78
* [Dev] Fix a bug in general matmul ops with zero by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/79
* [Dev] Append Efficient CUDA test for low precision batch decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/80
* [Dev] Refactor Backend Dispatch and Kernel Wrap Related Design by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/83
* [Dev] Refactor Modeling BitNet to support vLLM quant linear by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/84
* Fix database path default by janEbert in https://github.com/microsoft/BitBLAS/pull/85
* [Issue 62] flexible whl for different cuda version by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/86
* Limiting parallel jobs for local build by bibo-msft in https://github.com/microsoft/BitBLAS/pull/88
* [Dev] Bump version to 0.0.1.dev13 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/87
* [Dev] Feature Improves for bitnet and block reduction by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/92
* [Dev] Bug fix within block reduce schedule template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/93
* [Dev] Fix a correctness issue when block reduce is applied with pipeline stage by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/94
* [Dev] Transform 3rdparty tvm from bitblas into bitblas_tl by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/95
* [Dev] Append CUTLASS submodule by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/96
* [Dev] Add Basic Benchmark Implementation for operators by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/98
* [Dev] Improve benchmark scripts by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/99
* Fix virtual env issue for our benchmark workflow by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/101
* [BUG Fix] Add missing checkout statements in benchmark workflow by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/102
* Update benchmark.yml by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/103
* [BUG Fix] remove ref assignments of the pr commit by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/104
* Ref GPTQModel for 3rd support/integration by Qubitium in https://github.com/microsoft/BitBLAS/pull/106
* [Dev] Complete benchmark op sets of ci by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/100
* [Dev] Remove Redundant Dynamic Shared Memory sync by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/107
* [Dev] Enhancing Lower Warp Memory Pass to support decode within warp memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/110
* [Dev] Enhance Lower Warp memory to support multi stage tensorization by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/111
* Refactor benchmark yml to disable alerts on issue by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/113
* [Dev] Enhance LOP3 Instruction Registration to support incoming warp level lop3 instructions by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/114
* [Dev] Merge BlockReduce with naive schedule template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/119
* [Dev] Implement ScheduleUnsafeInjectCallArgument Primitive to Hack decoding by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/124
* [Fix][Dev] Typo fix for our workflow and enhance lop3 decode to support scaling by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/125
* [Dev] Convert the quant compress from numpy into tvm runtime by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/126
* Update documents by xysmlx in https://github.com/microsoft/BitBLAS/pull/129
* [Dev] Refactor the weight transformation to support upcoming stage3 transform by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/130
* [Dev] Bring Block Reduction into our search space and policy by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/132
* Fix retrieve head commit in benchmark by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/134
* [Integration] Upload tutorial for making a bitnet ckpt for vLLM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/135
* [Typo] Fix missing links in the bitnet integration's docs by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/136
* fix BitNet integration for vLLM by xysmlx in https://github.com/microsoft/BitBLAS/pull/137
* fix BitNet integration for vLLM by xysmlx in https://github.com/microsoft/BitBLAS/pull/139
* [Dev] Set default weight transformation into Ladder Stage3 LDMatrixTransform by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/133
* [Dev] Disable Block reduction for int8 by default by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/140
* [Dev] BUG Fix for bitnet integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/141
* [Feature] Register Missing FastDecoding for INT8xINT4 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/142
* [BUG Fix] Fix the NVCC Compile options for CUDA Version >= 12.5 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/143
* [Integration] Compress Gateup and QKV for bitnet integration by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/144
* [Enhancement] Improve elementwise schedule via vectorization by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/145
* [Dev] Add LowerAllReduce Pass to support cross thread Reduction lowering by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/146
* [Fix] Fix scale and zero scopes for scale only template by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/147
* [Dev] Support Numeric Precision BFloat16 as activation type by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/148
* [Version] Bump Version to 0.0.1.dev15 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/149
* [Dev] Serialize Generated Kernel Name with Operator Config and Hint by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/153
* [BUG] Set Device when kernel be applied into Multiple GPUs. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/155
* [Benchmark] Fast Decoding Benchmark by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/158
* [BUGFix] Disable tensorcore when shape is really small by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/159
* [BUGFix] Register missing FP8 LDMATRIX Instructions for dynamic shared memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/162
* [Docs] Update install command from github repo by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/163
* [BugFix] Fix BitBLAS Linear with BFloat16 input by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/164
* [BUGFix] Fix LowerThreadAllReduce Pass for Hopper Arch by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/165
* [Dev] Enhance Thread Sync Injector for Stream-K Implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/166
* [Dev] Revert Hack impl for memory caching by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/167
* [TL] Update several TL Examples by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/168
* [TL] Enhance Layout Annotate Pass to handle PTX Inst by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/170
* chore(deps): bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by dependabot in https://github.com/microsoft/BitBLAS/pull/175
* [TL] Add TL Layout and Macro utils by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/174
* [TL] Support GEMM_SS Macro to perform gemm directly from shared memory by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/176
* [TL] Inject Storage Sync Scope Automatically for TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/177
* [TL] Allow T.clear be applied on a "local" Buffer and improve L2 Swizzle by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/178
* [TL] Enhance TL to import customized c headers by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/179
* [Dev] Bug fix for Block Reduce Template and improve TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/183
* [BugFix] Disable 8bit TensorCore for SM Version lower than 80 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/185
* [Dev] Dequante SIMT Matmul Implementation. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/188
* [Dev] Improve Dequant performance on CUDA Simt by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/189
* [TL] Append Macro Test Case for GEMM and Dequant GEMM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/190
* [TL] Add example usage/test case for Dynamic Symbolic by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/191
* [BugFix]Fix llvm install bug by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/193
* [Test] Add Thread Level Macro Dequantize Gemm Test Cases by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/194
* [TL][BugFix] Add implementation of TL Gemm and Fix a bug for TL Jit by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/195
* [TL] test flashattention script by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/196
* [TL][BugFix] Disable Buffer Vectorization and Add OP Related TL Test Cases by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/197
* [TL] Wrap TL Kernel with Scheduler by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/199
* [Dev][TL] Add TL BaseScheduler and Library Generator by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/200
* [Dev][TL] Hardware Aware Tuning Examples with TL by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/201
* [TL] initial implement flashattention op in TL by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/202
* [Dev] Enhance Operator Cache to support multi-thread environments by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/205
* [TL] Adapt TL Hardware-aware Search Space with Roller by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/207
* [TL] [Doc] add flash attention usage document by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/210
* [Dev] Add support and test case for Ladder Weight only Transformation Matmul Operator by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/212
* [Dev][TL] Merge Hopper and Pipeline Modifications by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/213
* [Dev][TL] Integrate TL Dequant Implementation into BitBLAS OPs by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/214
* [TL] [Issue215] add simplify pass for TL and test script, fixing issue by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/216
* [Bugfix] Enhance LowerAsyncCopy Pass to handle INT8 dma copy with predicate by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/219
* [Dev] Disable smooth layout rewrite for buffer store in some case by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/220
* [Dev][TL] Enhance TL Parser to support flexible tile lang kernel implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/222
* [Dev][TL] Implement Tile Language Dequant Matmul and Test Case by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/224
* [Issue 192] Tail split support for dynamic matmul by tzj-fxz in https://github.com/microsoft/BitBLAS/pull/227
* [Dev][TL] Following updates of Tile Language Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/226
* [Dev] Add some tests and examples by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/228
* [AMD][HIP] Add HIP Code Generation with Block Primitives from Composable kernel Tile by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/223
* [Dev][Bugfix] Add target argument and remove override register for hip callback compile by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/229
* [Bugfix] Fix build bug due to submodule update by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/230
* [Dev] Support Tile Lang INT8xINT8 TensorCore Macro by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/231
* [Dev][TL] Implement MMA INT4 Tensor Core and Correctness Test Case. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/232
* [Dev][BitNET] Implement INT4xINT2 GEMM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/233
* [Dev][Bitnet] Implement Operator with INT4xINT4/INT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/234
* [Dev] Update News in Readme by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/235
* [Dev] Enhance TileLang Backend and fix a bug for INT4xINT2 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/236
* [DEV][TL] Support AMD Matrix Code Implementation by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/237
* [Dev][HIP] Fix MFMA Codegen by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/238
* [CI] Disable Benchmark workflow due to github action v4 updates by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/239
* [Dev] Enhance Infra for ROCM by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/240
* [Dev][AMD] Add AMD CDNA Arch by Cunxiao2002 in https://github.com/microsoft/BitBLAS/pull/225
* [Dev] Fix some lint issues by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/241
* [Dev][Relax] Update Bitblas end2end tuning example with relax by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/242
* [Dev] Fix illegal pass order by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/243
* [Docs] update the contributing's table of contents by emmanuel-ferdman in https://github.com/microsoft/BitBLAS/pull/245
* [Dev][AMD] Implement LDS Async Copy for CDNA Arch by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/246
* [Dev][AMD] Support LDS and Flash Attention for AMD Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/247
* [AMD][TL] Introduce K Pack and a Conflict Free swizzling into Matrix Core by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/248
* [BUGFix] Introduce our own `asser_close` to allow few mismatch elements for some case by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/249
* [Dev][AMD] Implement conditional async load for AMD HIP Backend by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/250
* [BUGFix] Fix MatmulDequantize with FP4 Format by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/254
* [Dev] Enhance Backend Abstraction for TileLang by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/255
* [Docker] Add Dockerfile to set up the application environment by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/260
* [Relax] Fix end2end tuning for relax graph by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/261
* [Dev] Refactor codebase to save import time by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/262
* [Enhancement][TileLang] Introduce Pass `LegalizeSafeMemoryAccess` to auto protect memory access by Injecting IfThenElse Node by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/267
* [TileLang][Dev] Enhance Layout Inference Pass to infer with complex parallel primitives by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/268
* [Dev] Migrate default backend from tir into tilelang by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/270
* [Dev] Fallback NF format to TIR backend as TileLang implementation is not currently supported. by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/272
* [Dev] Implement TileLang NF Format Dequantize by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/273
* [Release] Bump version to 0.1.0 by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/274
* [Bugfix] Fix Mismatched Retnet LinearAttention Layout by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/275
* [Bugfix] Fix correctness issue for float16xuint1 with fast dequantize by LeiWang1999 in https://github.com/microsoft/BitBLAS/pull/277
* Fix: Remove composable_kernel include from setup.py by LeslinD in https://github.com/microsoft/BitBLAS/pull/279
* [Bugfix] Fix VERSION FileNotFoundError bugs via pip installation by senlyu163 in https://github.com/microsoft/BitBLAS/pull/285
* [Doc] Move Torch Tensors to GPU by senlyu163 in https://github.com/microsoft/BitBLAS/pull/286

New Contributors
* xzyaoi made their first contribution in https://github.com/microsoft/BitBLAS/pull/23
* tzj-fxz made their first contribution in https://github.com/microsoft/BitBLAS/pull/57
* Qubitium made their first contribution in https://github.com/microsoft/BitBLAS/pull/58
* janEbert made their first contribution in https://github.com/microsoft/BitBLAS/pull/85
* dependabot made their first contribution in https://github.com/microsoft/BitBLAS/pull/175
* Cunxiao2002 made their first contribution in https://github.com/microsoft/BitBLAS/pull/225
* emmanuel-ferdman made their first contribution in https://github.com/microsoft/BitBLAS/pull/245
* LeslinD made their first contribution in https://github.com/microsoft/BitBLAS/pull/279
* senlyu163 made their first contribution in https://github.com/microsoft/BitBLAS/pull/285

**Full Changelog**: https://github.com/microsoft/BitBLAS/compare/v0.0.1dev...v0.1.0

0.0.1dev

Pre-release for v0.0.1, currently under testing.
