Highlights:
- **Bug fixes**
  - Fix compilation crash when there is a container statement after an unconditional continue (1299) (by **xumingkuan**)
- **CUDA backend**
  - Fix on-demand memory pool on certain GPUs (1314) (by **Yuanming Hu**)
- **Intermediate representation**
  - Replace "OffsetAndExtractBitsStmt" with "BitExtractStmt" (1306) (by **xumingkuan**)
- **Language and syntax**
  - Support sep and end in print() (1311) (by **Ye Kuang**)
  - Support Python-scope scalar functions / matrix operations, e.g. ti.sqrt(2) (1188) (by **彭于斌**)
- **Miscellaneous**
  - Postpone backend detection to prevent possible compatibility issues (1273) (by **彭于斌**)
- **IR optimization passes**
  - Move unreachable code elimination to a separate pass (1315) (by **xumingkuan**)
  - Constant folding for BitExtractStmt (1307) (by **xumingkuan**)
  - Remove exceptions from lower_access pass (1292) (by **Xuanda Yang**)
- **Performance improvements**
  - Thread local storage for range-for reductions on CPUs (1296) (by **Yuanming Hu**)
- **Standard library**
  - Add ti.rsqrt() and ti.Vector.norm_inv() (1293) (by **彭于斌**)
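
For readers unfamiliar with the two standard-library additions listed above, the following plain-Python sketch illustrates what the names suggest: a reciprocal square root and the reciprocal of a vector's Euclidean norm. It does not call Taichi. The helper names `rsqrt`, `norm_inv`, and `normalized` are illustrative stand-ins, and the exact numerical behavior of `ti.rsqrt()` and `ti.Vector.norm_inv()` (for example, any epsilon handling) is not stated in these notes.

```python
import math


def rsqrt(x):
    """Reciprocal square root, 1 / sqrt(x) (assumed meaning of the name)."""
    return 1.0 / math.sqrt(x)


def norm_inv(v):
    """Reciprocal of the Euclidean norm of a sequence of numbers."""
    return rsqrt(sum(c * c for c in v))


def normalized(v):
    """Scale v to unit length with one multiply per component."""
    inv = norm_inv(v)
    return [c * inv for c in v]


if __name__ == "__main__":
    print(rsqrt(4.0))
    print(normalized([3.0, 4.0]))
```

The usual motivation for such helpers is that normalizing a vector needs only one reciprocal square root followed by multiplications, rather than a square root and a division per component.
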
Full changelog:
- [cuda] Support numpy and torch tensors with zeros in shapes (e.g., (5, 0, 5)) (1305) (by **Yuanming Hu**)
- [refactor] Rename the file created in 1315 (1316) (by **xumingkuan**)
- [Opt] [refactor] Move unreachable code elimination to a separate pass (1315) (by **xumingkuan**)
- [CUDA] Fix on-demand memory pool on certain GPUs (1314) (by **Yuanming Hu**)
- [Opt] Constant folding for BitExtractStmt (1307) (by **xumingkuan**)
- [lang] [test] Improve code coverage in SNode (1214) (by **彭于斌**)
- [Lang] Support sep and end in print() (1311) (by **Ye Kuang**)
- [metal] Add kernel side util to support print() (1301) (by **Ye Kuang**)
- [Misc] Postpone backend detection to prevent possible compatibility issues (1273) (by **彭于斌**)
- [IR] [refactor] Replace "OffsetAndExtractBitsStmt" with "BitExtractStmt" (1306) (by **xumingkuan**)
- [Bug] [opt] Fix compilation crash when there is a container statement after an unconditional continue (1299) (by **xumingkuan**)
- [ir] [refactor] Simplify the "re_id" pass (1304) (by **xumingkuan**)
- [Perf] Thread local storage for range-for reductions on CPUs (1296) (by **Yuanming Hu**)
- [bug] [std] Fix matrix print shape in Taichi-scope (1300) (by **彭于斌**)
- [metal] [autodiff] Fix StackLoadTopStmt codegen in Metal (1298) (by **Ye Kuang**)
- [Lang] [refactor] Support Python-scope scalar functions / matrix operations, e.g. ti.sqrt(2) (1188) (by **彭于斌**)
- [Std] [lang] Add ti.rsqrt() and ti.Vector.norm_inv() (1293) (by **彭于斌**)
- [Opt] [ir] [refactor] Remove exceptions from lower_access pass (1292) (by **Xuanda Yang**)
- [misc] Show LLVM version on startup (1294) (by **FantasyVR**)
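
The changelog mentions constant folding for BitExtractStmt without describing it. As a rough illustration of the general idea only, the sketch below folds a bit-extraction node whose input is a compile-time constant. It assumes the statement extracts the half-open bit range [bit_begin, bit_end) from an integer. The tuple-based toy IR and the names `bit_extract` and `fold` are hypothetical and are not Taichi's actual data structures or pass.

```python
def bit_extract(value, bit_begin, bit_end):
    """Return bits [bit_begin, bit_end) of a non-negative integer."""
    if not 0 <= bit_begin <= bit_end:
        raise ValueError("expected 0 <= bit_begin <= bit_end")
    width = bit_end - bit_begin
    return (value >> bit_begin) & ((1 << width) - 1)


def fold(node):
    """Fold a toy IR node.

    Nodes are tuples (hypothetical encoding):
      ("const", value)
      ("var", name)
      ("bit_extract", input_node, bit_begin, bit_end)
    """
    if node[0] != "bit_extract":
        return node
    _, inp, begin, end = node
    inp = fold(inp)  # fold nested extractions first
    if inp[0] == "const":
        # Input known at compile time: replace the node with its value.
        return ("const", bit_extract(inp[1], begin, end))
    return ("bit_extract", inp, begin, end)


if __name__ == "__main__":
    print(fold(("bit_extract", ("const", 0b110110), 1, 4)))
    print(fold(("bit_extract", ("var", "x"), 1, 4)))
```

A node whose input is a variable is left unchanged, since its value is only known at run time.
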