## What's Changed
- [BUG] Add comp server requirements (661) by **Vadim Gimpelson** 300fd33d
- [BUG] A number of fixes for vllm's TP (651) by **Vadim Gimpelson** 9c29f669
- matmul_f16 with wgmma (627) by **kjiang170** 9f0ea7dd
- [BUG] VLLM (and DMWL) compile with hidet backend (647) by **zhumakhan** 6c6be7a0
- [IR] Add support for `swizzle`, `interleave` and `l2Promotion` in tensor map creation (643) by **Bolin Sun** 21ff63f5
- [BUG] Fix attaching hash values to the function signature (638) by **xiaocenxiaocen** dbd66132
- Hexcute base branch (all related PRs will be merged into this base PR) (294) by **xiaocenxiaocen** b1fdf17d
- [PERF] Default value for parallel_k is 'disabled' (634) by **Vadim Gimpelson** 135212bd
- Adapt to bfloat16 where necessary (624) by **ZichuWu** 9045865d
- [Bug] Parallel compilation sync (616) by **ZichuWu** 4c16c576
- [COMPTIME] Hot start speedup (625) by **Vadim Gimpelson** 22c657b0
- [BUG] Fix torch2.5 OoM and docs build fix (637) by **zhumakhan** bf32f8b1
- Revert "[BUG] Fix torch2.5 OoM issue" (635) by **zhumakhan** 9131a5c5
- [BUG] Fix torch2.5 OoM issue (609) by **zhumakhan** fe59c639
- [CI] Fix small typos for building and publishing to the internal Hidet PyPI index (598) by **xinli-centml** f8400fe1
- [PERF] Support bf16 in one more place (623) by **Vadim Gimpelson** 7f773490
- [Tests] Adapt tests/operators for bfloat16 (615) by **ZichuWu** ba9c0ad5
- [DISTRIBUTED] Support `all_reduce` in `torch.compile` mode (612) by **Vadim Gimpelson** 0bca591c
- [torchAPI] Inherit cuda stream from torch (618) by **Vadim Gimpelson** ad4e00a0 (usage sketch after this list)
- [BUG] Fix bugs in shared map implementation (608) by **Vadim Gimpelson** ffdbde4b
- [CI] Turn off search space 2 for tests/lang (617) by **ZichuWu** 5f7fae83
- [Tests] Adapt tests/lang for bfloat16 test cases (594) by **ZichuWu** 5b829cbd
- [Tests] Adapt tests/frontends to bfloat16 (592) by **ZichuWu** a5b72e62
- [Tests] Adapt tests/ir for bfloat16 test cases (593) by **ZichuWu** 545aeea4
- [Tests] Adjust test cases for tests/models for bfloat16. (595) by **ZichuWu** bedff214
- Use one global CUDA workspace for all `CompiledGraph` instances (603) by **Max Hu** 66523079
- [Fix] Fixing a minor mistake encountered while adapting test cases for `bfloat16` data type (607) by **Bolin Sun** 275070da
- wgmma tf32/u8/i8 support (549) by **kjiang170** a0e6658f
- [CI] Exclude tests/unit_tests/test_dynamic_shape.py::test_attention[cuda] (606) by **Vadim Gimpelson** 5579392a
- [Tests] Adjust test cases for tests/unit-tests for bfloat16. (596) by **ZichuWu** 0e5ec55b
- [BUG] Fix incorrect conversion of fxgraph to hidet's flow graph + extend the nccl lib search to user site packages (604) by **Vadim Gimpelson** 1995d431
- [Tests] Added bfloat16 test cases for tests/cuda (590) by **ZichuWu** febfbd71
- [Tests] Adjust test cases for tests/utils for bfloat16. (597) by **ZichuWu** 36aab6f3
- [Tests] Change float16 to bfloat16 for tests/apps (589) by **ZichuWu** 83cddbbf
- [CI] Add a new GitHub Actions workflow to manually build and push to the internal PyPI index (554) by **xinli-centml** 6beffab9
- [OPTIONS] Remove unnecessary parallel_k (572) by **ZichuWu** 9051f264
- fix `test_wgmma.py` error for illegal warp address (588) by **kjiang170** 8f7e1396
- [Operators] Allow NT `matmul` layout for `bfloat16` data type (562) by **Bolin Sun** d5d0e518
- python3.8 -> python3.9 (558) by **Vadim Gimpelson** a09713ca
- [CI] Move `import torch` inside `run_torch()` (570) by **ZichuWu** 4bc4d290
- [CI] Shorten build-docs run time (565) by **ZichuWu** edadb073
- [CI] Tests Workflow. Add manual trigger of tests on different gpu types (555) by **c-fteixeira** 66d9568c
- [OPTIONS] Clean Huggingface tokens option (561) by **ZichuWu** cdf2c8af
- [Bug] Fix out of memory error occurred while running `llama-2-7b` (547) by **Bolin Sun** b8826d0e
- [OPTIONS] Set mma as default in PassContext() (530) by **ZichuWu** 35f02b96
- wgmma bf16 support (531) by **kjiang170** f8c057b4
- [Bug] `uint32_t` was not declared in this scope in CI build-wheel for runtime (545) by **ZichuWu** 4ced47ef
- Add more shapes to reduce op in regression (534) by **zhumakhan** 8ef1bc21
- [COMPTIME] Added support for run_torch for the rest of the transform operations (525) by **ZichuWu** 04e4d5e6
- Remaining f16 options supported and tested (527) by **kjiang170** e5e2404a
- [Operators] `bfloat16` data type support for attention operators (524) by **Bolin Sun** 07e597a3
- [Enhancement] Save running time by using `symbolic_run` to replace `async_run` in `optimize` (490) by **ZichuWu** 92c81e8f
- [BUG] Fix distilbert by changing variables names in ops.where (512) by **zhumakhan** 2d615b6a
- [OP] Support for `logsoftmax` (517) by **Vadim Gimpelson** ce43f1e8 (see the sketch after this list)
- refactor wgmma (521) by **kjiang170** 4a80b9ab
- [Bug] Fix the incorrect result after merging changes related to `matmul_nt` (518) by **Bolin Sun** 2b7c348a
- [PERF] Rewrite softmax (516) by **Vadim Gimpelson** b50cca4c
- wgmma instruction support and test for f16 input … (499) by **kjiang170** c758e546
- [BUG] Fix NT matmul corner case where `n` or `k` dimension is odd (513) by **Bolin Sun** 1e54f773
- [Operators] Support `bfloat16` data type in `matmul` operator (511) by **Bolin Sun** a467c76f
- [Operators] Support matmul with NT layout (496) by **Bolin Sun** 8fc6de3a
- [CI] Make test and publish workflows use built wheel on tests (492) by **c-fteixeira** bc5b54e1
- [Hidet Script] Import externally defined function automatically (503) by **Yaoyao Ding** 43750c28
- [PERF] Fix for indexes optimization (488) by **Vadim Gimpelson** f8c679ae
- [CI] Update the set of Regression tests (493) by **Vadim Gimpelson** 7e3ae1f3
- [Enhancement] Causal attention with fp32 accumulator (481) by **zhumakhan** 8b569bd3
- [IR] Bound check for task mapping worker (483) by **Vadim Gimpelson** 1544cdf6
- [Bug] Rule based simplifier. Fix incorrect rule e/c1/c2 -> e/(c1*c2) (487) by **Vadim Gimpelson** fd6b4390
- [TOOLS] Task benchmark utilities (479) by **Vadim Gimpelson** dc175f29
- [Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod (464) by **Max Hu** c8d9158b (see the sketch after this list)
- Revert accidental commit (484) by **Vadim Gimpelson** 6c8ad3e2
- bug fix by **Vadim Gimpelson** 3405b558
- [PERF] Continue index optimizations (473) by **Vadim Gimpelson** da24ee3c
- [Bug] Resolved multi-threading conflict with save_lower_ir() (480) by **ZichuWu** 6a116adc
- Fixed the format change in the new transformers version (482) by **ZichuWu** 0a81840c
- Fix masked attention by using fp32 accumulation in the first matmul (q and k) (468) by **zhumakhan** 40c12c93
- remove mpt-7b due to accuracy failure (477) by **zhumakhan** 53a0cc49
- [BUG] Support concat empty tensors (475) by **ZichuWu** 85bb6dd5
- [TOOLS] Attached hash values to function signature in source.cu (459) by **ZichuWu** a6f10331
- [BUG] Fix `ValueError` caused by different operand data types in `if_then_else` while initializing `Conv2dTransposeGemmImageTask` (470) by **Bolin Sun** 28264907
- [BUG] Fix `ZeroDivisionError` triggered within the function `parallel_part_heuristic` in `graph/ops/conv2d/conv2d_gemm.py` (472) by **Bolin Sun** a11d69ca
- [BUG] Fixing memory issue encountered while compiling the model `sam` (466) by **Bolin Sun** c6959747
- [PERF] Indexes optimization (458) by **Vadim Gimpelson** f1ee08fe
- Added more LLMs to the Regression test (432) by **zhumakhan** 03d62505
- Revert "[Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod" (463) by **Max Hu** 29893891
- [Dynamic][Enhancement] Convert div and mod including symbolvars to fast int div/mod (405) by **Max Hu** 0cffe7ec
- [CI] Print stderr in `run_tests.py` (443) by **Vadim Gimpelson** 015ffcdd
- [BUG] Fix `NotImplementedError` encountered while compiling the model `doctr_det_predictor` (462) by **Bolin Sun** 868dc9da
- [Operators] Adding support for the `torch.nn.GLU` module (461) by **Bolin Sun** f7560512 (toy example after this list)
- [BUG] Fixing another error encountered while compiling `detectron2_fcos_r_50_fpn` (457) by **Bolin Sun** 798ce6e4
- [Ir][Primitives] fix #436 by adding missing instructions (440) by **xiaocenxiaocen** 131ec204
- [BUG] Fixing errors encountered while compiling `detectron2_fcos_r_50_fpn` (455) by **Bolin Sun** c74732d8
- [PERF] Introduce a new IR optimization pass that rewrites `spatial(1,47)` -> `spatial(47)` (452) by **Vadim Gimpelson** 0f2990b3
- [Bug] Fixing the `ValueError` triggered while compiling the model `dlrm` during operator fusion pass (437) by **Bolin Sun** de949461
- [Scripts] Add scripts of our wheel server (439) by **Yaoyao Ding** 628eb603
- [Graph][Ops] disable cublas matmul for parallel k (431) by **xiaocenxiaocen** 2696c34b
- [BUG] Fixing an error triggered from the `conv_channel_last_pass` while compiling the model `sam` (444) by **Bolin Sun** ba455220
- [BUG] Fixing a bug triggered while compiling in-place operator `torch.Tensor.scatter_add_` (429) by **Bolin Sun** 4f142c4b
- [PERF] Specialize `pow(x, 2)` as `x*x` (llama-7B) (434) by **Vadim Gimpelson** f421a43a
- [Version] Update 0.4.0 -> 0.5.0.dev in `setup.py` (433) by **Vadim Gimpelson** d9da46f5
- [PERF] Allow prologue fusion for `reduce` op (426) by **Vadim Gimpelson** 6606477f
- [Bug] fixing regression (422) by **zhumakhan** 646f7e7c
- [Utility] Add ncu and nsys test utilities (413) by **Yaoyao Ding** 2fc304f0
- [Operators] Adding support for the method `torch.Tensor.scatter_add_` (421) by **Bolin Sun** 8568afbb
- [Fix] fixed torch.pow (420) by **zhumakhan** cac4a0e5
- [Primitives] Add CUDA primitives: prmt, lop3, f16x2 sub and fma, and barrier (414) by **Yaoyao Ding** 5186d875
- [Ir][Primitives] add exp2 (410) by **xiaocenxiaocen** bbbfb7ba
- [Update] Updating torch docker image from 24.04 to 24.07 (418) by **zhumakhan** 9899060d
- [Fix] Support writing subbyte data to global memory (415) by **Yaoyao Ding** 9cacfe7e
- [Bug] Fixing longformer compilation (403) by **zhumakhan** 09d1bc03
- [Bug][Enhancement] Correct the behavior of non-parallel build when option `parallel_tune` is set to 1 (406) by **Max Hu** c2e8ec9a
- [CuTe] fix longformer (411) by **xiaocenxiaocen** 4953f737
- [Tests] Adding tests for math primitives (412) by **Bolin Sun** f859fac5
- Adding accuracy check for huggingface LLMs in Regression (368) by **zhumakhan** 4a1f72db
- [Bug] Fix `hidet.ops.gather`; add `torch.sign` and `torch.ceil`; disable `torch.autograd.function.FunctionCtx` (394) by **zhumakhan** 3b6cb586
- workaround for gpt-j (395) by **zhumakhan** 9d0e0c09
- [Bug] Cast dtypes in `hidet.where` when they mismatch (386) by **zhumakhan** 2172e163
- Make llama2 work with all transformers versions (385) by **zhumakhan** 426d14b8
- [DEBUG] Save `Task` pickle in translations cache (380) by **Vadim Gimpelson** cb72bc7c
- [BUILD] Several changes in wheel building (392) by **Vadim Gimpelson** f416ee50
- [Operators] Adding support for the `torch.nn.EmbeddingBag` (378) by **Bolin Sun** 9d309c12
- [CI] Adding successfully compiled vision models to the tests/benchmark/run_config.json (205) by **Bolin Sun** 8eb61c92
- Fix float return when limited by memory (389) by **Max Hu** 03e5966a
- [BUG] Fix bug in `normalize_launch_dims()` (381) by **Vadim Gimpelson** b61d6b11
- [Operators] Extend the functionality of `einsum` to support `Ellipsis` (374) by **Bolin Sun** d412db16 (example after this list)
- [Dependency] Remove the version restriction of transformers and diffusers (475) by **Yaoyao Ding** 3e76c2f3
- [README] Fix broken links (474) by **Yaoyao Ding** 7b2f6804
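
A few of the changes above are easiest to see in code. For the torch stream inheritance (618), here is a minimal usage sketch of my own, assuming a CUDA-enabled hidet install; it is not taken from the PR:

```python
# Sketch: with PR 618, kernels compiled through the hidet backend should
# launch on torch's current stream instead of a private hidet stream.
import torch
import hidet  # registers the 'hidet' torch.compile backend

model = torch.nn.Linear(64, 64).cuda().half()
compiled = torch.compile(model, backend='hidet')

stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    x = torch.randn(8, 64, device='cuda', dtype=torch.float16)
    y = compiled(x)  # expected to run on `stream`
stream.synchronize()
```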
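For `logsoftmax` (517): the usual reason for a fused op is numerical stability, since naive `log(softmax(x))` underflows for large-magnitude inputs. A small sketch of the stable identity, checked against torch's reference (hidet's internal operator name may differ):

```python
# log_softmax(x) = (x - max(x)) - log(sum(exp(x - max(x))))
import torch

def log_softmax_stable(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    shifted = x - x.amax(dim=dim, keepdim=True)  # subtract the row max
    return shifted - shifted.exp().sum(dim=dim, keepdim=True).log()

x = torch.randn(4, 1000) * 50  # magnitudes where the naive formula fails
torch.testing.assert_close(
    log_softmax_stable(x),
    torch.nn.functional.log_softmax(x, dim=-1),
)
```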
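The fast int div/mod conversion (405, re-landed as 464) is the classic magic-number trick: when the divisor is only known at kernel-launch time (e.g. a symbolic shape), precompute a multiplier once on the host so the kernel can replace `x // d` with a multiply and a shift. Below is a plain-Python illustration of the idea (the Granlund-Montgomery round-up variant); hidet's actual implementation may differ in its details:

```python
def precompute_magic(d: int, bits: int = 32):
    """Return (m, s) with x // d == (x * m) >> (bits + s) for 0 <= x < 2**bits."""
    assert d > 0
    s = (d - 1).bit_length()              # ceil(log2(d))
    m = ((1 << (bits + s)) + d - 1) // d  # ceil(2**(bits + s) / d)
    return m, s

def fast_divmod(x: int, d: int, m: int, s: int, bits: int = 32):
    q = (x * m) >> (bits + s)             # one multiply + one shift
    return q, x - q * d                   # quotient, remainder

m, s = precompute_magic(7)                # host side, once per divisor
assert all(fast_divmod(x, 7, m, s) == divmod(x, 7) for x in range(100_000))
```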
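For `torch.nn.GLU` (461), a toy correctness check of my own (GLU splits the input in half along `dim` and computes `a * sigmoid(b)`):

```python
import torch

glu = torch.nn.GLU(dim=-1).cuda()
x = torch.randn(4, 32, device='cuda')
out = torch.compile(glu, backend='hidet')(x)  # routed through hidet
torch.testing.assert_close(out, glu(x), rtol=1e-4, atol=1e-4)
```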
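And for `einsum` with `Ellipsis` (374), the kind of batched contraction this enables; again my own example, not one from the PR:

```python
import torch

def batched_mm(a, b):
    # '...' stands for any number of leading batch dimensions
    return torch.einsum('...ij,...jk->...ik', a, b)

a = torch.randn(2, 3, 4, 5, device='cuda')
b = torch.randn(2, 3, 5, 6, device='cuda')
out = torch.compile(batched_mm, backend='hidet')(a, b)
torch.testing.assert_close(out, a @ b, rtol=1e-4, atol=1e-4)
```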
## Contributors
* yaoyaoding
* xiaocenxiaocen
* vadiklyutiy
* BolinSNLHM
* kjiang170
* zhumakhan
* ZichuWu
* maxyanghu
* c-fteixeira
* xinli-centml