## What's Changed
- [BUG] Add comp server requirements (661) by **Vadim Gimpelson** 300fd33d
- [BUG] A number of fixes for vllm's TP (651) by **Vadim Gimpelson** 9c29f669
- matmul_f16 with wgmma (627) by **kjiang170** 9f0ea7dd
- [BUG] VLLM (and DMWL) compile with hidet backend (647) by **zhumakhan** 6c6be7a0
- [IR] Add support for `swizzle`, `interleave` and `l2Promotion` in tensor map creation (643) by **Bolin Sun** 21ff63f5
- [BUG] Fix attaching hash values to the function signature (638) by **xiaocenxiaocen** dbd66132
- Hexcute base branch (all related PRs will be merged into this base PR) (294) by **xiaocenxiaocen** b1fdf17d
- [PERF] Default value for parallel_k is 'disabled' (634) by **Vadim Gimpelson** 135212bd
- Adapt to bfloat16 where necessary (624) by **ZichuWu** 9045865d
- [Bug] Parallel compilation sync (616) by **ZichuWu** 4c16c576
- [COMPTIME] Hot start speedup (625) by **Vadim Gimpelson** 22c657b0
- [BUG] Fix torch2.5 OoM and docs build fix (637) by **zhumakhan** bf32f8b1
- Revert "[BUG] Fix torch2.5 OoM issue" (635) by **zhumakhan** 9131a5c5
- [BUG] Fix torch2.5 OoM issue (609) by **zhumakhan** fe59c639
- [CI] Fix small typos for building and publishing to the internal Hidet PyPI index (598) by **xinli-centml** f8400fe1
- [PERF] Support bf16 in one more place (623) by **Vadim Gimpelson** 7f773490
- [Tests] Adapt tests/operators for bfloat16 (615) by **ZichuWu** ba9c0ad5
- [DISTRIBUTED] Support `all_reduce` in `torch.compile` mode (612) by **Vadim Gimpelson** 0bca591c
- [torchAPI] Inherit cuda stream from torch (618) by **Vadim Gimpelson** ad4e00a0 (usage sketch after this list)
- [BUG] Fix bugs in shared map implementation (608) by **Vadim Gimpelson** ffdbde4b
- [CI] Turn off search space 2 for tests/lang (617) by **ZichuWu** 5f7fae83
- [Tests] Adapt tests/lang for bfloat16 test cases (594) by **ZichuWu** 5b829cbd
- [Tests] Adapt tests/frontends to bfloat16 (592) by **ZichuWu** a5b72e62
- [Tests] Adapt tests/ir for bfloat16 test cases (593) by **ZichuWu** 545aeea4
- [Tests] Adjust test cases for tests/models for bfloat16. (595) by **ZichuWu** bedff214
- Use one global CUDA workspace for all `CompiledGraph` instances (603) by **Max Hu** 66523079
- [Fix] Fixing a minor mistake encountered while adapting test cases for `bfloat16` data type (607) by **Bolin Sun** 275070da
- wgmma tf32/u8/i8 support (549) by **kjiang170** a0e6658f
- [CI] Exclude tests/unit_tests/test_dynamic_shape.py::test_attention[cuda] (606) by **Vadim Gimpelson** 5579392a
- [Tests] Adjust test cases for tests/unit-tests for bfloat16. (596) by **ZichuWu** 0e5ec55b
- [BUG] Fix incorrect conversion of fxgraph to hidet's flow graph + extend the nccl lib search to user site packages (604) by **Vadim Gimpelson** 1995d431
- [Tests] Added bfloat16 test cases for tests/cuda (590) by **ZichuWu** febfbd71
- [Tests] Adjust test cases for tests/utils for bfloat16. (597) by **ZichuWu** 36aab6f3
- [Tests] Change float16 to bfloat16 for tests/apps (589) by **ZichuWu** 83cddbbf
- [CI] Add a new GitHub Actions workflow to manually build and push to the internal PyPI index (554) by **xinli-centml** 6beffab9
- [OPTIONS] Remove unnecessary parallel_k (572) by **ZichuWu** 9051f264
- fix `test_wgmma.py` error for illegal warp address (588) by **kjiang170** 8f7e1396
- [Operators] Allow NT `matmul` layout for `bfloat16` data type (562) by **Bolin Sun** d5d0e518
- python3.8 -> python3.9 (558) by **Vadim Gimpelson** a09713ca
- [CI] Move `import torch` inside `run_torch()` (570) by **ZichuWu** 4bc4d290
- [CI] Shorten build-docs run time (565) by **ZichuWu** edadb073
- [CI] Tests Workflow. Add manual trigger of tests on different gpu types (555) by **c-fteixeira** 66d9568c
- [OPTIONS] Clean Huggingface tokens option (561) by **ZichuWu** cdf2c8af
- [Bug] Fix out of memory error occurred while running `llama-2-7b` (547) by **Bolin Sun** b8826d0e
- [OPTIONS] Set mma as default in PassContext() (530) by **ZichuWu** 35f02b96
- wgmma bf16 support (531) by **kjiang170** f8c057b4
- [Bug] `uint32_t` was not declared in this scope in CI build-wheel for runtime (545) by **ZichuWu** 4ced47ef
- Add more shapes to reduce op in regression (534) by **zhumakhan** 8ef1bc21
- [COMPTIME] Added support for run_torch for the rest of the transform operations (525) by **ZichuWu** 04e4d5e6
- Remaining f16 options supported and tested (527) by **kjiang170** e5e2404a
- [Operators] `bfloat16` data type support for attention operators (524) by **Bolin Sun** 07e597a3
- [Enhancement] Save running time by using `symbolic_run` to replace `async_run` in `optimize` (490) by **ZichuWu** 92c81e8f
- [BUG] Fix distilbert by changing variables names in ops.where (512) by **zhumakhan** 2d615b6a
- [OP] Support for `logsoftmax` (517) by **Vadim Gimpelson** ce43f1e8 (see the sketch after this list)
- refactor wgmma (521) by **kjiang170** 4a80b9ab
- [Bug] Fix the incorrect result after merging changes related to `matmul_nt` (518) by **Bolin Sun** 2b7c348a
- [PERF] Rewrite softmax (516) by **Vadim Gimpelson** b50cca4c
- wgmma instruction support and test for f16 input … (499) by **kjiang170** c758e546
- [BUG] Fix NT matmul corner case where `n` or `k` dimension is odd (513) by **Bolin Sun** 1e54f773
- [Operators] Support `bfloat16` data type in `matmul` operator (511) by **Bolin Sun** a467c76f
- [Operators] Support matmul with NT layout (496) by **Bolin Sun** 8fc6de3a
- [CI] Make test and publish workflows use built wheel on tests (492) by **c-fteixeira** bc5b54e1
- [Hidet Script] Import externally defined function automatically (503) by **Yaoyao Ding** 43750c28
- [PERF] Fix for indexes optimization (488) by **Vadim Gimpelson** f8c679ae
- [CI] Update the set of Regression tests (493) by **Vadim Gimpelson** 7e3ae1f3
- [Enhancement] Causal attention with fp32 accumulator (481) by **zhumakhan** 8b569bd3
- [IR] Bound check for task mapping worker (483) by **Vadim Gimpelson** 1544cdf6
- [Bug] Rule based simplifier. Fix incorrect rule e/c1/c2 -> e/(c1*c2) (487) by **Vadim Gimpelson** fd6b4390
- [TOOLS] Task benchmark utilities (479) by **Vadim Gimpelson** dc175f29
- [Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod (464) by **Max Hu** c8d9158b (see the sketch after this list)
- Revert accidental commit (484) by **Vadim Gimpelson** 6c8ad3e2
- bug fix by **Vadim Gimpelson** 3405b558
- [PERF] Continue index optimizations (473) by **Vadim Gimpelson** da24ee3c
- [Bug] Resolved multi-threading conflict with save_lower_ir() (480) by **ZichuWu** 6a116adc
- Fixed the format change in the new transformers version (482) by **ZichuWu** 0a81840c
- Fix masked attention by using fp32 accumulation in the first matmul (q and k) (468) by **zhumakhan** 40c12c93
- remove mpt-7b due to accuracy failure (477) by **zhumakhan** 53a0cc49
- [BUG] Support concat empty tensors (475) by **ZichuWu** 85bb6dd5
- [TOOLS] Attached hash values to function signature in source.cu (459) by **ZichuWu** a6f10331
- [BUG] Fix `ValueError` caused by different operand data types in `if_then_else` while initializing `Conv2dTransposeGemmImageTask` (470) by **Bolin Sun** 28264907
- [BUG] Fix `ZeroDivisionError` triggered within the function `parallel_part_heuristic` in `graph/ops/conv2d/conv2d_gemm.py` (472) by **Bolin Sun** a11d69ca
- [BUG] Fixing memory issue encountered while compiling the model `sam` (466) by **Bolin Sun** c6959747
- [PERF] Indexes optimization (458) by **Vadim Gimpelson** f1ee08fe
- Added more LLMs to the Regression test (432) by **zhumakhan** 03d62505
- Revert "[Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod" (463) by **Max Hu** 29893891
- [Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod (405) by **Max Hu** 0cffe7ec
- [CI] Print stderr in `run_tests.py` (443) by **Vadim Gimpelson** 015ffcdd
- [BUG] Fix `NotImplementedError` encountered while compiling the model `doctr_det_predictor` (462) by **Bolin Sun** 868dc9da
- [Operators] Adding support for the `torch.nn.GLU` module (461) by **Bolin Sun** f7560512 (toy example after this list)
- [BUG] Fixing another error encountered while compiling `detectron2_fcos_r_50_fpn` (457) by **Bolin Sun** 798ce6e4
- [Ir][Primitives] fix #436 by adding missing instructions (440) by **xiaocenxiaocen** 131ec204
- [BUG] Fixing errors encountered while compiling `detectron2_fcos_r_50_fpn` (455) by **Bolin Sun** c74732d8
- [PERF] Introduce a new IR optimization pass that rewrites `spatial(1,47)` -> `spatial(47)` (452) by **Vadim Gimpelson** 0f2990b3
- [Bug] Fixing the `ValueError` triggered while compiling the model `dlrm` during operator fusion pass (437) by **Bolin Sun** de949461
- [Scripts] Add scripts of our wheel server (439) by **Yaoyao Ding** 628eb603
- [Graph][Ops] disable cublas matmul for parallel k (431) by **xiaocenxiaocen** 2696c34b
- [BUG] Fixing an error triggered from the `conv_channel_last_pass` while compiling the model `sam` (444) by **Bolin Sun** ba455220
- [BUG] Fixing a bug triggered while compiling in-place operator `torch.Tensor.scatter_add_` (429) by **Bolin Sun** 4f142c4b
- [PERF] Specialize `pow(x, 2)` as `x*x` (llama-7B) (434) by **Vadim Gimpelson** f421a43a
- [Version] Update 0.4.0 -> 0.5.0.dev in `setup.py` (433) by **Vadim Gimpelson** d9da46f5
- [PERF] Allow prologue fusion for `reduce` op (426) by **Vadim Gimpelson** 6606477f
- [Bug] fixing regression (422) by **zhumakhan** 646f7e7c
- [Utility] Add ncu and nsys test utilities (413) by **Yaoyao Ding** 2fc304f0
- [Operators] Adding support for the method `torch.Tensor.scatter_add_` (421) by **Bolin Sun** 8568afbb
- [Fix] fixed torch.pow (420) by **zhumakhan** cac4a0e5
- [Primitives] Add CUDA primitives: prmt, lop3, f16x2 sub and fma, and barrier (414) by **Yaoyao Ding** 5186d875
- [Ir][Primitives] add exp2 (410) by **xiaocenxiaocen** bbbfb7ba
- [Update] Updating torch docker image from 24.04 to 24.07 (418) by **zhumakhan** 9899060d
- [Fix] Support writing subbyte data to global memory (415) by **Yaoyao Ding** 9cacfe7e
- [Bug] Fixing longformer compilation (403) by **zhumakhan** 09d1bc03
- [Bug][Enhancement] Correct the behavior of non-parallel build when option `parallel_tune` is set to 1 (406) by **Max Hu** c2e8ec9a
- [CuTe] fix longformer (411) by **xiaocenxiaocen** 4953f737
- [Tests] Adding tests for math primitives (412) by **Bolin Sun** f859fac5
- Adding accuracy check for huggingface LLMs in Regression (368) by **zhumakhan** 4a1f72db
- [Bug] Fix `hidet.ops.gather`; add `torch.sign` and `torch.ceil`; disable `torch.autograd.function.FunctionCtx` (394) by **zhumakhan** 3b6cb586
- workaround for gpt-j (395) by **zhumakhan** 9d0e0c09
- [Bug] Cast dtypes in `hidet.where` when they mismatch (386) by **zhumakhan** 2172e163
- Make llama2 work with all transformers versions (385) by **zhumakhan** 426d14b8
- [DEBUG] Save `Task` pickle in translations cache (380) by **Vadim Gimpelson** cb72bc7c
- [BUILD] Several changes in wheel building (392) by **Vadim Gimpelson** f416ee50
- [Operators] Adding support for the `torch.nn.EmbeddingBag` (378) by **Bolin Sun** 9d309c12
- [CI] Adding successfully compiled vision models to the tests/benchmark/run_config.json (205) by **Bolin Sun** 8eb61c92
- Fix float return when limited by memory (389) by **Max Hu** 03e5966a
- [BUG] Fix bug in `normalize_launch_dims()` (381) by **Vadim Gimpelson** b61d6b11
- [Operators] Extend the functionality of `einsum` to support `Ellipsis` (374) by **Bolin Sun** d412db16 (example after this list)
- [Dependency] Remove the version restriction of transformers and diffusers (475) by **Yaoyao Ding** 3e76c2f3
- [README] Fix broken links (474) by **Yaoyao Ding** 7b2f6804
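
A few of the changes above are easiest to see in code. For the torch stream inheritance (618), here is a minimal usage sketch of my own, assuming a CUDA-enabled hidet install; it is not taken from the PR:

```python
# Sketch: with PR 618, kernels compiled through the hidet backend should
# launch on torch's current stream instead of a private hidet stream.
import torch
import hidet  # registers the 'hidet' torch.compile backend

model = torch.nn.Linear(64, 64).cuda().half()
compiled = torch.compile(model, backend='hidet')

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    x = torch.randn(8, 64, device='cuda', dtype=torch.float16)
    y = compiled(x)  # expected to run on `stream`
stream.synchronize()
```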
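For `logsoftmax` (517): the usual reason for a fused op is numerical stability, since naive `log(softmax(x))` underflows for large-magnitude inputs. A small sketch of the stable identity, checked against torch's reference (hidet's internal operator name may differ):

```python
# log_softmax(x) = (x - max(x)) - log(sum(exp(x - max(x))))
import torch

def log_softmax_stable(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    shifted = x - x.amax(dim=dim, keepdim=True)  # subtract the row max
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()

x = torch.randn(4, 1000) * 50  # magnitudes where the naive formula fails
torch.testing.assert_close(
    log_softmax_stable(x),
    torch.nn.functional.log_softmax(x, dim=-1),
)
```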
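The fast int div/mod conversion (405, re-landed as 464) is the classic magic-number trick: when the divisor is only known at kernel-launch time (e.g. a symbolic shape), precompute a multiplier once on the host so the kernel can replace `x // d` with a multiply and a shift. Below is a plain-Python illustration of the idea (the Granlund-Montgomery round-up variant); hidet's actual implementation may differ in its details:

```python
def precompute_magic(d: int, bits: int = 32):
    """Return (m, s) with x // d == (x * m) >> (bits + s) for 0 <= x < 2**bits."""
    assert d > 0
    s = (d - 1).bit_length()              # ceil(log2(d))
    m = ((1 << (bits + s)) + d - 1) // d  # ceil(2**(bits + s) / d)
    return m, s

def fast_divmod(x: int, d: int, m: int, s: int, bits: int = 32):
    q = (x * m) >> (bits + s)             # one multiply + one shift
    return q, x - q * d                   # quotient, remainder

m, s = precompute_magic(7)                # host side, once per divisor
assert all(fast_divmod(x, 7, m, s) == divmod(x, 7) for x in range(100_000))
```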
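For `torch.nn.GLU` (461), a toy correctness check of my own (GLU splits the input in half along `dim` and computes `a * sigmoid(b)`):

```python
import torch

glu = torch.nn.GLU(dim=-1).cuda()
x = torch.randn(4, 32, device='cuda')
out = torch.compile(glu, backend='hidet')(x)  # routed through hidet
torch.testing.assert_close(out, glu(x), rtol=1e-4, atol=1e-4)
```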
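And for `einsum` with `Ellipsis` (374), the kind of batched contraction this enables; again my own example, not one from the PR:

```python
import torch

def batched_mm(a, b):
    # '...' stands for any number of leading batch dimensions
    return torch.einsum('...ij,...jk->...ik', a, b)

a = torch.randn(2, 3, 4, 5, device='cuda')
b = torch.randn(2, 3, 5, 6, device='cuda')
out = torch.compile(batched_mm, backend='hidet')(a, b)
torch.testing.assert_close(out, a @ b, rtol=1e-4, atol=1e-4)
```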
## Contributors
* yaoyaoding
* xiaocenxiaocen
* vadiklyutiy
* BolinSNLHM
* kjiang170
* zhumakhan
* ZichuWu
* maxyanghu
* c-fteixeira
* xinli-centml