Highlights:
- **CUDA backend**
  - Upgrade to PTX 6.3 and add a few CUDA intrinsics (1548) (by **Yuanming Hu**)
- **Performance improvements**
  - Improve dynamic listgen and access performance (1547) (by **Yuanming Hu**)
- **Refactor**
  - 'ti.Matrix(n, m, dt, shape)' is deprecated, use 'ti.Matrix.var(n, m, dt, shape)' instead (1531) (by **彭于斌**)

Full changelog:
- [cc] The C backend is now capable of running mpm128 (1553) (by **彭于斌**)
- [bug] Update mpm_lagrangian_force and fix Matrix constructor (1545) (by **Ye Kuang**)
- [opengl] [refactor] KernelParallelAttribs -> ParallelSize + virtual methods to make way for grid-stride loops (1540) (by **彭于斌**)
- [opengl] Fix reversed nested for loops error on OpenGL (1554) (by **彭于斌**)
- [Perf] Improve dynamic listgen and access performance (1547) (by **Yuanming Hu**)
- [cuda] [llvm] Report a broken module with TI_WARN instead of TI_ERROR (1557) (by **彭于斌**)
- [linux] Fix LLVM symbol leakage in release mode by using RTLD_GLOBAL (1544) (by **彭于斌**)
- [CUDA] Upgrade to PTX 6.3 and add a few CUDA intrinsics (1548) (by **Yuanming Hu**)
- [ir] Move struct-for demotion pass after offload pass (1541) (by **Ye Kuang**)
- [cc] Support "range for" and "while" statements on C backend (1536) (by **彭于斌**)
- [refactor] Better import order by using \_\_all\_\_ (1510) (by **彭于斌**)
- [misc] Add is_path_all_dense to SNode (1538) (by **Ye Kuang**)
- [Refactor] 'ti.Matrix(n, m, dt, shape)' is deprecated, use 'ti.Matrix.var(n, m, dt, shape)' instead (1531) (by **彭于斌**)
- [lang] [refactor] Set up a multipass AST transformer (1467) (by **彭于斌**)
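
Migration note for the `ti.Matrix` deprecation (1531): the sketch below illustrates how the old and new spellings are expected to behave side by side. It is a minimal, standard-library-only illustration. The `Matrix` class here is a hypothetical stand-in, not Taichi's actual implementation, and the string `"f32"` stands in for a Taichi data type such as `ti.f32`.

```python
import warnings


class Matrix:
    """Hypothetical stand-in for ti.Matrix; NOT Taichi's real implementation."""

    def __init__(self, n, m, dt=None, shape=None):
        if dt is not None:
            # Old spelling: Matrix(n, m, dt, shape) still works but warns.
            warnings.warn(
                "Matrix(n, m, dt, shape) is deprecated, "
                "use Matrix.var(n, m, dt, shape) instead",
                DeprecationWarning,
                stacklevel=2,
            )
        self.n, self.m, self.dt, self.shape = n, m, dt, shape

    @classmethod
    def var(cls, n, m, dt, shape=None):
        # New spelling: build the object without the deprecated code path.
        self = cls.__new__(cls)
        self.n, self.m, self.dt, self.shape = n, m, dt, shape
        return self


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = Matrix(2, 2, "f32", (128,))      # deprecated call, emits a warning
    new = Matrix.var(2, 2, "f32", (128,))  # preferred call, silent

print(len(caught))  # number of deprecation warnings recorded
```

In user code the change should amount to replacing `ti.Matrix(n, m, dt, shape)` with `ti.Matrix.var(n, m, dt, shape)`, keeping the arguments in the same order.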