What's Changed
* [Docker] Add libstdcxx-ng-12 to Dockerfiles for CUDA versions by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/160
* Add cpu jit with backend ctypes by xs-keju in https://github.com/tile-ai/tilelang/pull/154
* [Carver] Multi-Threads Compilation for Fast Auto Tuning by SiriusNEO in https://github.com/tile-ai/tilelang/pull/156
* [Refactor] Replace T.If with native Python if statement for mla paged kernel by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/162
* [Enhancement] Improve CUDA path detection by xwhzz in https://github.com/tile-ai/tilelang/pull/157
* [Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/163
* [Bugfix] Cast bool dtype into int8 in blocksparse examples by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/167
* [Example] Implement NSA Decode tilelang examples by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/168
* [Release] Bump version to v0.1.2.post1 by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/166
* Use SS-GEMM for PV in mla by YouJiacheng in https://github.com/tile-ai/tilelang/pull/165
* [Example] Implement tilelang native sparse attention varlen example by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/170
* [Bugfix] Implement boundary check for the buffer shape with dynamic symbolic by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/173
* [AutoTune] Enable config-performance trace by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/174
* [Feat] Append Pass Context and TMA lowering configuration option by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/175
* [Feat] Introduce new caching mechanism for compiled kernels by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/176
* [Refactor] Enhance GPU Kernel Launch with Environment Thread Creation by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/178
* [Bugfix] Improve Thread Variable Handling in Layout Inference by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/179
* [Examples] Implement NSA Backward kernels by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/180
* [Enhancement] Optimize CMake build process with dynamic job count calculation by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/183
* [Bugfix] Add dynamic shape support with out_idx in Cython JIT kernel compilation by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/185
* [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug by chengyupku in https://github.com/tile-ai/tilelang/pull/188
* [Dev] Add the failed nvcc command to the exception message by penguin-wwy in https://github.com/tile-ai/tilelang/pull/189
* [Bugfix] Fix `T.copy` for scalar datatypes by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/190
* [Enhancement] Simplify GEMM example with direct kernel compilation by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/191
* [Bugfix] Make quickstart work properly on cu118 by penguin-wwy in https://github.com/tile-ai/tilelang/pull/193
* [Language] Support clamp in language by hyx1999 in https://github.com/tile-ai/tilelang/pull/192
* [Refactor] Add SetMaxNRegCollector to Improve Register Hint Handling in Warp Specialized Rewriter by chengyupku in https://github.com/tile-ai/tilelang/pull/194
* [Feature] Add TMA Store Synchronization Support by chengyupku in https://github.com/tile-ai/tilelang/pull/195
* Update expired example code. by 66RING in https://github.com/tile-ai/tilelang/pull/196
* [CMake] Add CUDA Major Version Detection for Conditional Compilation by chengyupku in https://github.com/tile-ai/tilelang/pull/197
* [Feature] Support Async Pipeline inference within if scope by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/198
* [Dev] Add new example for FlashAttention with pipelined execution by chengyupku in https://github.com/tile-ai/tilelang/pull/200
* [Enhancement] Enhancing the handling of conditional statements in the pipeline by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/201
* [Feature] Upgrade cutlass version and support fp8 T.gemm by zqh-wz in https://github.com/tile-ai/tilelang/pull/202
* [Docker] Update Dockerfiles to specify exact version of libstdcxx-ng by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/203
* [Dev] Add GQA backward example by chengyupku in https://github.com/tile-ai/tilelang/pull/205
* [LICENSE] Typo fix in LICENSE by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/208
* [Enhancement] Allow mma fallback when wgmma is not supported by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/206
* [Examples] Expand tuning configurations for FlashAttention example by chenghuaWang in https://github.com/tile-ai/tilelang/pull/204
* [Enhancement] Avoid tvm ffi handling when out_idx is specified by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/209
* [Fix] Fix K // block_K to T.ceildiv(K,block_K) and add tests by hyx1999 in https://github.com/tile-ai/tilelang/pull/210
* [Dev] Implement IfStmtBinding and MergeIfStmt transformations by chengyupku in https://github.com/tile-ai/tilelang/pull/211
* [Language] Introduce `T.reshape` and `T.view` by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/212
* [Enhancement] Improve device handling in Cython kernel adapter by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/220
* [Enhancement] Update format script to support force compare with upstream by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/221
* [Refactor] Introduce KernelParam integration across modules by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/223
* [Bugfix] Fix mismatch of shared memory layout and mma atom on Hopper by zqh-wz in https://github.com/tile-ai/tilelang/pull/224
* [Refactor] Update kernel compilation and profiling in examples by chengyupku in https://github.com/tile-ai/tilelang/pull/225
* [Examples] Add fp8 gemm 2xAcc and deepgemm example by cherichy in https://github.com/tile-ai/tilelang/pull/217
* [Doc] Add instructions for installing nightly version by xwhzz in https://github.com/tile-ai/tilelang/pull/226
* [Bugfix] Disable force inline for ldmatrix by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/227
* [Bugfix] Support duplicate tma desc declaration by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/228
* [Refactor] Rename clamp functions and enhance dtype handling in tests by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/232
* [Enhancement] Simplify kernel source extraction in JIT adapters by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/230
* [Feature] Add reduce_max corresponding tests by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/236
* [BugFix] Fix bug of missing MBarrierExpectTX by chengyupku in https://github.com/tile-ai/tilelang/pull/241
* [Refactor] Refactor for Better Layout Conflict Handling by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/240
* [Refactor] Align torch_assert_close tensor comparison with torch.testing.assert_close by xwhzz in https://github.com/tile-ai/tilelang/pull/239
* [Dev] Implement FlashAttention3 Backward by chengyupku in https://github.com/tile-ai/tilelang/pull/244
* [BugFix] Fix bug of mismatching dtype in testing by xwhzz in https://github.com/tile-ai/tilelang/pull/245
* [Enhancement] Add zero initialization option to GEMM operations by chengyupku in https://github.com/tile-ai/tilelang/pull/246
* [Enhancement][CUDA] Avoid C7508 for CUDA backend via assigning default value to `minBlocksPerMultiprocessor` by cherichy in https://github.com/tile-ai/tilelang/pull/248
* [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters by Alex4210987 in https://github.com/tile-ai/tilelang/pull/213
* [Examples] Implement elementwise add kernel by chenghuaWang in https://github.com/tile-ai/tilelang/pull/219
* [Refactor] Phase out LLVM Dependency by Making it Optional by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/247
* [Readme] Update Bib Citation Section by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/249
* [Enhancement] Support float variable as arguments by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/250
* add autotune to example_gemm.py by yyttt6 in https://github.com/tile-ai/tilelang/pull/252
* [Language] Introduce `T.alloc_var` to define a variable like `int var;` by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/255
* [Example] Implement Kernel Example cumsum by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/258
* [Refactor] Refactor CUDA post-processing callback registration in TileLang by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/259
* [Refactor] Move compilation outside critical section by YouJiacheng in https://github.com/tile-ai/tilelang/pull/260
* [CI] Use auditwheel to generate manylinux wheels by oraluben in https://github.com/tile-ai/tilelang/pull/251
* [Bugfix] Fix Benchmark/Example Code for Autotuning by SiriusNEO in https://github.com/tile-ai/tilelang/pull/254
* [Language] Enhance alias to support blockwise memory load by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/261
* [Bugfix] Fix auto tuning tma handling by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/263
* [Release] Bump version to 0.1.3 by LeiWang1999 in https://github.com/tile-ai/tilelang/pull/264
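
The `K // block_K` to `T.ceildiv(K, block_K)` fix listed above (pull/210) presumably matters when `K` is not a multiple of `block_K`: floor division drops the trailing partial tile. The sketch below illustrates the arithmetic in plain Python only. The `ceildiv` helper is a hypothetical stand-in written for this example, not tilelang's `T.ceildiv`, and the sizes are illustrative rather than taken from the PR.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive integers (stand-in helper, not tilelang's T.ceildiv)."""
    return -(-a // b)


# Illustrative sizes (assumed): K is deliberately not a multiple of block_K.
K, block_K = 100, 32

floor_tiles = K // block_K        # number of full tiles only
ceil_tiles = ceildiv(K, block_K)  # full tiles plus one partial tile if needed

covered_by_floor = floor_tiles * block_K  # elements visited with floor division
covered_by_ceil = ceil_tiles * block_K    # elements visited with ceiling division

print(f"floor: {floor_tiles} tiles cover {covered_by_floor} of {K} elements")
print(f"ceil:  {ceil_tiles} tiles cover {covered_by_ceil} (>= {K}) elements")
```

With ceiling division the last tile extends past `K`, so a kernel iterating this way would also need boundary handling for the out-of-range tail.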
New Contributors
* xs-keju made their first contribution in https://github.com/tile-ai/tilelang/pull/154
* YouJiacheng made their first contribution in https://github.com/tile-ai/tilelang/pull/165
* penguin-wwy made their first contribution in https://github.com/tile-ai/tilelang/pull/189
* hyx1999 made their first contribution in https://github.com/tile-ai/tilelang/pull/192
* 66RING made their first contribution in https://github.com/tile-ai/tilelang/pull/196
* zqh-wz made their first contribution in https://github.com/tile-ai/tilelang/pull/202
* chenghuaWang made their first contribution in https://github.com/tile-ai/tilelang/pull/204
* cherichy made their first contribution in https://github.com/tile-ai/tilelang/pull/217
* Alex4210987 made their first contribution in https://github.com/tile-ai/tilelang/pull/213
* yyttt6 made their first contribution in https://github.com/tile-ai/tilelang/pull/252
* oraluben made their first contribution in https://github.com/tile-ai/tilelang/pull/251
**Full Changelog**: https://github.com/tile-ai/tilelang/compare/v0.1.2...v0.1.3