**NOTE: CUDA graph support in this release is broken. Use v0.1.2 instead.**
What's new:
- Support for setting the activation type to `float16` for DeepSeek R1 (by appending `keep_dtype_in_checkpoint=False dtype=float16` to the command-line arguments).
- Config file for QwQ-32B.
- Several bug fixes for running with CUDA graphs.
- Further optimizations of operator kernels.