Torchao

Latest version: v0.8.0


0.6.1

Highlights

We are excited to announce the 0.6.1 release of torchao! This release adds Auto-Round support, float8 axiswise scaled training, a BitNet b1.58 training recipe, an implementation of AWQ, and much more!

Auto-Round Support (581)
Auto-Round is a new weight-only quantization algorithm that has achieved superior accuracy compared to [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), and [OmniQuant](https://arxiv.org/abs/2308.13137) across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, involves low tuning costs, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our [paper](https://arxiv.org/abs/2309.05516), [GitHub repository](https://github.com/intel/auto-round/blob/main/docs/acc.md?rgh-link-date=2024-07-22T01%3A42%3A54Z), and the Hugging Face [low-bit quantization leaderboard](https://huggingface.co/spaces/Intel/low_bit_open_llm_leaderboard).

```python
from torchao import quantize_
from torchao.prototype.autoround.core import prepare_model_for_applying_auto_round_
from torchao.prototype.autoround.core import apply_auto_round

# Prepare the target modules for Auto-Round (4-bit, group size 128, 200 optimization iters)
prepare_model_for_applying_auto_round_(
    model,
    is_target_module=is_target_module,
    bits=4,
    group_size=128,
    iters=200,
    device=device,
)

# Collect calibration inputs and run them through the model as a single MultiTensor
input_ids_lst = []
for data in dataloader:
    input_ids_lst.append(data["input_ids"].to(model_device))

multi_t_input_ids = MultiTensor(input_ids_lst)
out = model(multi_t_input_ids)

# Quantize the target modules using the tuned rounding values
quantize_(model, apply_auto_round(), is_target_module)
```

Added float8 training axiswise scaling support with per-gemm-argument configuration (940)

We added experimental support for rowwise scaled float8 gemm to `torchao.float8`, with per-gemm-input configurability to enable exploration of various recipes. Here is how a user can configure all-axiswise scaling:

```python
# all-axiswise scaling; `m` is the model to be converted in place
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.ALL_AXISWISE)
m = torchao.float8.convert_to_float8_training(m, config=config)

# or, a custom recipe by lw where grad_weight is left in bfloat16
config = torchao.float8.config.recipe_name_to_linear_config(Float8LinearRecipeName.LW_AXISWISE_WITH_GW_HP)
m = torchao.float8.convert_to_float8_training(m, config=config)
```


Early performance benchmarks show that all-axiswise scaling achieves a 1.13x speedup vs bf16 on torchtitan / LLaMa 3 8B / 8 H100 GPUs (compared to 1.17x from all-tensorwise scaling in the same setup), with loss curves that match bf16 and all-tensorwise scaling. Further performance and accuracy benchmarks will follow in future releases.

Introduced BitNet b1.58 training recipe (930)
Adds a recipe for [BitNet b1.58](https://arxiv.org/abs/2402.17764) training with ternary weight clamping.
```python
from torchao.prototype.quantized_training import bitnet_training
from torchao import quantize_

model = ...
quantize_(model, bitnet_training())
```

Notably, our implementation utilizes INT8 Tensor Cores to offset the overhead of quantized training; in fact, it is faster than BF16 training in most cases.
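For intuition, BitNet b1.58 quantizes weights to the ternary set {-1, 0, +1} using a per-tensor absmean scale. The sketch below illustrates just that weight transform from the paper in plain PyTorch; it is not the torchao recipe itself, which also handles activation quantization and the straight-through estimator during training:

```python
import torch

def bitnet_ternary_weight(w: torch.Tensor, eps: float = 1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with an absmean scale (BitNet b1.58)."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor absmean scale
    w_q = (w / scale).round().clamp(-1, 1)   # ternary weights
    return w_q, scale                        # dequantize as w_q * scale

w = torch.randn(256, 256)
w_q, s = bitnet_ternary_weight(w)
print(w_q.unique(), s)
```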

[Prototype] Implemented Activation-aware Weight Quantization [AWQ](https://arxiv.org/pdf/2306.00978) (#743)
Perplexity and performance measured on an A100 GPU:
| Model | Quantization | Tokens/sec | Throughput (GB/sec) | Peak Mem (GB) | Model Size (GB) |
|--------------------|--------------|------------|---------------------|---------------|-----------------|
| Llama-2-7b-chat-hf | bfloat16 | 107.38 | 1418.93 | 13.88 | 13.21 |
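As a refresher on the core idea: AWQ uses activation statistics from a small calibration set to find per-channel scales, multiplies the salient weight channels by those scales before quantization, and folds the inverse scale into the preceding op so the float output is unchanged. The snippet below is a generic sketch of that scaling step (it is not the torchao prototype API added in #743; `awq_scale_weights` is an illustrative helper):

```python
import torch

def awq_scale_weights(weight: torch.Tensor, act_abs_mean: torch.Tensor, alpha: float = 0.5):
    """Compute AWQ-style per-input-channel scales from activation magnitudes.

    weight:       (out_features, in_features) linear weight
    act_abs_mean: (in_features,) mean |activation| per input channel, from calibration data
    alpha:        exponent; the full algorithm grid-searches alpha in [0, 1]
    """
    scales = act_abs_mean.clamp(min=1e-5).pow(alpha)
    scales = scales / (scales.max() * scales.min()).sqrt()  # normalize the scale range
    # Scale weights up before quantization; the inverse scale is folded into the
    # previous op (e.g. LayerNorm), so the full-precision computation is unchanged.
    return weight * scales, scales

w = torch.randn(128, 64)
act_stats = torch.rand(64) + 0.1
w_scaled, s = awq_scale_weights(w, act_stats)
```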

0.5.0

```python
from torchao.quantization.prototype.qat._module_swap_api import Int8DynActInt4WeightQATQuantizerModuleSwap

quantizer = Int8DynActInt4WeightQATQuantizerModuleSwap()
model = quantizer.prepare(model)   # insert fake-quantize modules
train(model)
model = quantizer.convert(model)   # swap in real quantized modules
```


Deprecations

New Features

* Optimizer CPU offload for single GPU training https://github.com/pytorch/ao/pull/584
* Add support for save quantized checkpoint in llama code https://github.com/pytorch/ao/pull/553
* Intx quantization tensor subclass https://github.com/pytorch/ao/pull/468
* Add superblock to sparse/prototype https://github.com/pytorch/ao/pull/660
* Add AffineQuantizedObserver https://github.com/pytorch/ao/pull/650
* Add BSR subclass + torch.compile and clean up superblock https://github.com/pytorch/ao/pull/680
* Add HQQ support https://github.com/pytorch/ao/pull/605
* Add performance profiler https://github.com/pytorch/ao/pull/690
* Add experimental INT8 quantized training https://github.com/pytorch/ao/pull/644
* Add high-level operator interface https://github.com/pytorch/ao/pull/708
* Add sparse marlin 2:4 gemm op https://github.com/pytorch/ao/pull/733
* Example for GPTQ-like calibration flow https://github.com/pytorch/ao/pull/721
* Llama3.1 and KV cache quantization https://github.com/pytorch/ao/pull/738
* Add float8 weight only and weight + dynamic activation https://github.com/pytorch/ao/pull/740 (see the sketch after this list)
* Add Auto-Round support https://github.com/pytorch/ao/pull/581
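To make the float8 inference item above concrete, here is a minimal sketch of applying the weight-only and dynamic-activation variants via `quantize_`; the config names `float8_weight_only` and `float8_dynamic_activation_float8_weight` are assumed (they match later torchao releases), and float8 kernels require a recent GPU (e.g. sm89+):

```python
import torch
from torchao.quantization import quantize_, float8_weight_only, float8_dynamic_activation_float8_weight

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Weight-only float8: weights stored in float8, activations stay in bf16
quantize_(model, float8_weight_only())

# Alternatively: dynamic float8 activations + float8 weights
# quantize_(model, float8_dynamic_activation_float8_weight())
```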

Mixed-Precision Quantization
* Add sensitivity analysis tool for layer-wise FIT and Hessian trace https://github.com/pytorch/ao/pull/592
* Bayesian optimization tool for mixed precision quantization https://github.com/pytorch/ao/pull/694

Improvements

* Move sam eval from `scripts` to `torchao/_models` https://github.com/pytorch/ao/pull/591
* QOL improvements to float8 gemm benchmark https://github.com/pytorch/ao/pull/596
* Move lowbit universal kernels from torchaccel to torchao https://github.com/pytorch/ao/pull/582
* Refactor autoquant to use AQT https://github.com/pytorch/ao/pull/609
* Add support for using AffineQuantizedTensor with `weights_only=True` https://github.com/pytorch/ao/pull/630
* Move Uintx out of prototype for future extension https://github.com/pytorch/ao/pull/635
* Refactor `_quantized_linear` for better extensibility https://github.com/pytorch/ao/pull/634
* Update micro benchmarking code for AQT https://github.com/pytorch/ao/pull/673
* Refactor superblock code + add final benchmark/eval scripts https://github.com/pytorch/ao/pull/691
* Relax QAT dtype assertion https://github.com/pytorch/ao/pull/692
* Add option to move param to `device` before quantization https://github.com/pytorch/ao/pull/699
* Add gpu benchmarking script https://github.com/pytorch/ao/pull/192
* Enable `to(device=device_name)` for `Uintx` https://github.com/pytorch/ao/pull/722
* Make torchao's llama model trainable https://github.com/pytorch/ao/pull/728
* Specify output dtype to `torch.float32` in `_foreach_norm` https://github.com/pytorch/ao/pull/727
* Add semi-structured sparsity to hf eval https://github.com/pytorch/ao/pull/576
* Use `torch.uint1` to `torch.uint7` for Uintx tensor subclass https://github.com/pytorch/ao/pull/672
* Add AdamW to `CPUOffloadOptimizer` default https://github.com/pytorch/ao/pull/742
* Make developer experience better for extending AQT https://github.com/pytorch/ao/pull/749
* Add back QAT module swap API https://github.com/pytorch/ao/pull/762
* Refactor quant_llm to work with affine quantized tensor https://github.com/pytorch/ao/pull/772
* Move iOS benchmarking infra code to torchao https://github.com/pytorch/ao/pull/766
* Add CPU bandwidth benchmark https://github.com/pytorch/ao/pull/773
* Update method names to support intx and floatx changes https://github.com/pytorch/ao/pull/775
* Add implementation for torchao::parallel_for backends https://github.com/pytorch/ao/pull/774
* Add Llama2-7B finetune benchmarks for low-bit optimizers https://github.com/pytorch/ao/pull/746
* Fix Adam4bit support on PyTorch 2.3 and 2.4 and update AdamFp8 torch requirement https://github.com/pytorch/ao/pull/755
* Improve compile time + fix PyTorch 2.3 support for 4-bit optim https://github.com/pytorch/ao/pull/812
* Allow quantized linear registration in a different file https://github.com/pytorch/ao/pull/783
* Add 2bit, 5bit packing routines https://github.com/pytorch/ao/pull/797, https://github.com/pytorch/ao/pull/798
* Freeze dataclass in nf4, prep for better pt2 support https://github.com/pytorch/ao/pull/799
* Format and lint nf4 file and test https://github.com/pytorch/ao/pull/800
* Move more utils to TorchAOBaseTensor https://github.com/pytorch/ao/pull/784
* Add more information to quantized linear module and added some logs https://github.com/pytorch/ao/pull/782
* Add int4 mode to autoquant https://github.com/pytorch/ao/pull/804
* Add uintx quant to generate and eval https://github.com/pytorch/ao/pull/811
* Move non-NF4 tensor to device prior to quantization on copy https://github.com/pytorch/ao/pull/737

Static quantization
* Add float8 static quant support https://github.com/pytorch/ao/pull/787
* Update how block_size is calculated with Observers https://github.com/pytorch/ao/pull/815
* Add a linear observer class and test https://github.com/pytorch/ao/pull/807

Float8

* Update benchmarks to be more useful for smaller shapes https://github.com/pytorch/ao/pull/615
* Remove unneeded kernel for scale generation https://github.com/pytorch/ao/pull/616
* Filter out microbenchmarking overhead in profiling script https://github.com/pytorch/ao/pull/629
* Save torch_logs, and attach them to profiling trace https://github.com/pytorch/ao/pull/645
* Add option for gpu time in GEMM benchmarks https://github.com/pytorch/ao/pull/666
* Add roofline estimation of GEMM + overhead https://github.com/pytorch/ao/pull/668
* Make roofline utils reusable https://github.com/pytorch/ao/pull/731
* Use `torch.compiler.is_compiling` https://github.com/pytorch/ao/pull/739
* Float8 support in AQT https://github.com/pytorch/ao/pull/671
* Add static scaling for float8 training https://github.com/pytorch/ao/pull/760
* Make roofline script calculate observed overhead https://github.com/pytorch/ao/pull/734
* Make Inference and training code independent https://github.com/pytorch/ao/pull/808
* Add rowwise scaling option to float8 dynamic quant https://github.com/pytorch/ao/pull/819

Bug fixes

* Fix all-gather in 2D with DTensor (WeightWithDynamicFloat8CastTensor) https://github.com/pytorch/ao/pull/590
* Fix FP6-LLM API and add `.to(device)` op https://github.com/pytorch/ao/pull/595
* Fix linear_activation_tensor dynamic quant https://github.com/pytorch/ao/pull/622
* Fix bug with float8 inference_mode https://github.com/pytorch/ao/pull/659
* Quantization kernel bug fixes https://github.com/pytorch/ao/pull/717
* Cast `local_scale_tensor` to fp32 for precompute of float8 dynamic scaling https://github.com/pytorch/ao/pull/713
* Fix affine quantized tensor to device calls https://github.com/pytorch/ao/pull/726
* Small fix for micro benchmark code https://github.com/pytorch/ao/pull/711
* Fix LR schedule handling for low-bit optimizers https://github.com/pytorch/ao/pull/736
* Fix FPX inductor error https://github.com/pytorch/ao/pull/790
* Fixed llama model inference https://github.com/pytorch/ao/pull/769

Docs

* Add QAT README https://github.com/pytorch/ao/pull/597
* Update serialization.rst to include get_model_size_in_bytes import https://github.com/pytorch/ao/pull/604
* Clarify details around unwrap_tensor_subclass in README.md https://github.com/pytorch/ao/pull/618, https://github.com/pytorch/ao/pull/619
* Spelling fixes https://github.com/pytorch/ao/pull/662
* Move developer guide file to a folder https://github.com/pytorch/ao/pull/681
* Update docs on how to use AUTOQUANT_CACHE https://github.com/pytorch/ao/pull/649
* Update pip install command in README https://github.com/pytorch/ao/pull/723
* Fix docstring args names https://github.com/pytorch/ao/pull/735
* Update README example with correct import of `sparsify_` https://github.com/pytorch/ao/pull/741
* Update main and quantization README https://github.com/pytorch/ao/pull/745, https://github.com/pytorch/ao/pull/747, https://github.com/pytorch/ao/pull/757
* Add README for mixed-precision search tool and code refactor https://github.com/pytorch/ao/pull/776
* Add performance section to float8 README.md https://github.com/pytorch/ao/pull/794
* Make float8 README.md examples standalone https://github.com/pytorch/ao/pull/809
* Add KV cache quantization to READMEs https://github.com/pytorch/ao/pull/813
* Update main README.md with more current float8 speedup https://github.com/pytorch/ao/pull/816

Not user facing
* Fix float8 inference tests and add export test https://github.com/pytorch/ao/pull/613
* Reduce atol/rtol for stable tests https://github.com/pytorch/ao/pull/617
* Fix version guard in https://github.com/pytorch/ao/pull/620, https://github.com/pytorch/ao/pull/679, https://github.com/pytorch/ao/pull/684
* Fix BC for QAT location https://github.com/pytorch/ao/pull/626
* Enable float8 CI on sm89 https://github.com/pytorch/ao/pull/587
* Fix Inductor bench BC change https://github.com/pytorch/ao/pull/638, https://github.com/pytorch/ao/pull/641
* Add CUDA compute capability compile guard https://github.com/pytorch/ao/pull/636
* Remove numpy as bitpack dependency https://github.com/pytorch/ao/pull/677
* Add PyTorch 2.4 tests in CI https://github.com/pytorch/ao/pull/654
* Remove torchao_nightly package https://github.com/pytorch/ao/pull/661
* Update licenses in torchao/experimental https://github.com/pytorch/ao/pull/720
* Add lint checks for float8 inference https://github.com/pytorch/ao/pull/779

New Contributors
* sayakpaul made their first contribution in https://github.com/pytorch/ao/pull/604
* metascroy made their first contribution in https://github.com/pytorch/ao/pull/582
* raziel made their first contribution in https://github.com/pytorch/ao/pull/618
* nmacchioni made their first contribution in https://github.com/pytorch/ao/pull/641
* Diogo-V made their first contribution in https://github.com/pytorch/ao/pull/670
* mobicham made their first contribution in https://github.com/pytorch/ao/pull/605
* crcrpar made their first contribution in https://github.com/pytorch/ao/pull/703
* ebsmothers made their first contribution in https://github.com/pytorch/ao/pull/737
* a-r-r-o-w made their first contribution in https://github.com/pytorch/ao/pull/741
* kimishpatel made their first contribution in https://github.com/pytorch/ao/pull/766

We were able to close about [70% of tasks for 0.5.0](https://github.com/pytorch/ao/issues/667), which will now spill over into upcoming releases. We will post a list for 0.6.0 next, which we aim to release at the end of September 2024. We want to follow a monthly release cadence until further notice.

**Full Changelog**: https://github.com/pytorch/ao/compare/v0.4.0...v0.5.0-rc1

0.4

```python
from torchao.sparsity import sparsify_, semi_sparse_weight
sparsify_(model, semi_sparse_weight())
```

0.4.0

Highlights

We are excited to announce the 0.4 release of torchao! This release adds support for KV cache quantization, quantization-aware training (QAT), low-bit optimizers, composing quantization and sparsity, and more!

KV cache quantization (https://github.com/pytorch/ao/pull/532)

We've added support for KV cache quantization, showing a peak memory reduction from 19.7 -> 19.2 GB on Llama3-8B at an 8192 context length. We plan to investigate Llama3.1 next.

<img src="https://github.com/user-attachments/assets/31946f46-e8eb-45c2-ac1c-3a7d981c58a2" width="300" height="auto">



Quantization-Aware Training (QAT) ([#383](https://github.com/pytorch/ao/pull/383), [#555](https://github.com/pytorch/ao/pull/555))

We now support two QAT schemes for linear layers: Int8 per token dynamic activations + int4 per group weights, and int4 per group weights (using the efficient [tinygemm int4 kernel](https://github.com/pytorch/pytorch/blob/a672f6c84e318bbf455f13dfdd3fd7c68a388bf5/aten/src/ATen/native/cuda/int4mm.cu#L1097) after training). Users can access this feature by transforming their models before and after training using the appropriate quantizer, for example:


```python
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

# Quantizer for int8 dynamic per-token activations +
# int4 grouped per-channel weights, only for linear layers
qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics during
# training without performing any dtype casting
model = qat_quantizer.prepare(model)

# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
```


Initial evaluation results indicate that QAT in torchao can recover up to 96% of quantized accuracy degradation on hellaswag and up to 68% of quantized perplexity degradation on wikitext for Llama3 compared to post-training quantization (PTQ). For more details, please refer to the [README](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat) and [this blog post](https://pytorch.org/blog/quantization-aware-training/).

Composing quantization and sparsity (457, 473)

We've added support for composing int8 dynamic quantization with 2:4 sparsity, using the `quantize_` API. We also added SAM benchmarks that show a 7% speedup over standalone sparsity / int8 dynamic quantization [here](https://github.com/pytorch/ao/tree/main/torchao/sparsity#segment-anything-fast).

```python
from torchao.quantization import quantize_, int8_dynamic_activation_int8_semi_sparse_weight
quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())
```


Community Contributions

low-bit optimizer support (478, 463, 482, 484, 538)

gau-nernst added implementations for 4-bit, 8-bit, and FP8 Adam with FSDP2/FSDP support. Our API is a drop-in replacement for `torch.optim.Adam` and can be used as follows:
```python
from torchao.prototype.low_bit_optim import Adam8bit, Adam4bit, AdamFp8
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8

model = ...
optim = Adam8bit(model.parameters())  # replace with Adam4bit or AdamFp8 for the 4-bit / fp8 versions
```


For more information about low bit optimizer support please refer to our [README](https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim).

Improvements to 4-bit quantization (https://github.com/pytorch/ao/pull/517, https://github.com/pytorch/ao/pull/552, https://github.com/pytorch/ao/pull/544, #479)

bdhirsh, jeromeku, yanbing-j, manuelcandales, and larryliu0820 added torch.compile support for NF4 Tensor, custom CUDA int4 tinygemm unpacking ops, and several bugfixes to torchao.

BC breaking
* `quantize` has been renamed to `quantize_` https://github.com/pytorch/ao/pull/467
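A minimal sketch of the renamed call (in-place, hence the trailing underscore); the `int4_weight_only` config here is just an illustrative choice:

```python
from torchao.quantization import quantize_, int4_weight_only

# quantize_ modifies the model in place instead of returning a new one
quantize_(model, int4_weight_only())
```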

0.3.1

Highlights

We are excited to announce the 0.3 release of torchao! This release adds support for a new `quantize` API, the MX format, an FP6 dtype and bitpacking, 2:4 sparse accelerated training, and benchmarking infra for llama2/llama3 models.


`quantize` API (https://github.com/pytorch/ao/pull/256)
We added a tensor-subclass-based quantization API; see the [docs](https://github.com/pytorch/ao/tree/main/torchao/quantization) and README for usage details. It is planned to replace all existing quantization APIs in torchao for torch 2.4 and later.

Accelerated training with 2:4 sparsity (184)
You can now accelerate training with 2:4 sparsity, using the runtime pruning + compression kernels written by xFormers. These kernels prune each 4x4 sub-tile to be 2:4 sparse in both directions, handling both the forward and backward pass when training. We see a [1.3x speedup](https://github.com/pytorch/ao/tree/main/torchao/sparsity/training#benchmarking) for the MLP layers of ViT-L across a forward and backward pass.
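In practice this is exposed as a module swap on selected linear layers. The sketch below assumes the `SemiSparseLinear` / `swap_linear_with_semi_sparse_linear` helpers under `torchao.sparsity.training` (names taken from later torchao documentation and possibly different in this release):

```python
import torch
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)

# Swap the MLP Linear layers for runtime-sparse versions that prune to 2:4 on the fly
sparse_config = {
    name: SemiSparseLinear
    for name, mod in model.named_modules()
    if isinstance(mod, torch.nn.Linear)
}
swap_linear_with_semi_sparse_linear(model, sparse_config)

# Training then proceeds as usual; forward and backward use the 2:4 sparse kernels
```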


MX support (https://github.com/pytorch/ao/pull/264)
We added prototype support for the MX formats for training and inference, with a reference native PyTorch implementation of the training and inference primitives for MX-accelerated matrix multiplications. The MX numerical formats are new low-precision formats recently accepted into the OCP spec:
https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf
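For intuition about the format itself (a generic illustration, not the torchao prototype API): MX groups elements into blocks of 32 that share a single power-of-two scale, with the elements stored in a narrow dtype such as FP8, FP6, FP4, or INT8. A rough sketch of the shared-scale computation, assuming FP8 e4m3 elements:

```python
import torch

BLOCK = 32  # block size from the MX spec

def mx_block_quantize(x: torch.Tensor):
    """Quantize a 1D tensor to an MX-style layout: per-block power-of-two scales + FP8 e4m3 elements."""
    assert x.numel() % BLOCK == 0
    blocks = x.reshape(-1, BLOCK)
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    # Power-of-two scale chosen so the scaled block max fits e4m3's max normal value (448)
    scales = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))
    elems = (blocks / scales).to(torch.float8_e4m3fn)
    return elems, scales

def mx_block_dequantize(elems: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (elems.to(torch.float32) * scales).reshape(-1)

x = torch.randn(4096)
elems, scales = mx_block_quantize(x)
x_hat = mx_block_dequantize(elems, scales)
```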

Benchmarking (https://github.com/pytorch/ao/pull/276, https://github.com/pytorch/ao/pull/374)
We added a stable way to benchmark llama2 and llama3 models that includes perf/accuracy comparisons. See `torchao/_models/llama/benchmarks.sh` for more details.

🌟 💥 Community Contributions 🌟 💥
FP6 support (https://github.com/pytorch/ao/pull/279, https://github.com/pytorch/ao/pull/283, https://github.com/pytorch/ao/pull/358)
gau-nernst added support for the FP6 dtype and a mixed FP16 x FP6 matmul kernel with torch.compile support. Benchmark results show a [2.3x speedup](https://github.com/pytorch/ao/issues/208#issuecomment-2143240728) over the BF16 baseline for meta-llama/Llama-2-7b-chat-hf.

Bitpacking (https://github.com/pytorch/ao/pull/307, https://github.com/pytorch/ao/pull/282)
vayuda, melvinebenezer, CoffeeVampir3, and andreaskoepf added support for packing/unpacking lower-bit dtypes, leveraging torch.compile to generate the kernels, and added UInt2 and Bitnet tensors based on this approach.
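As a simple illustration of the idea (a generic sketch, not the torchao kernels themselves): four 2-bit values can be packed into one uint8 with shifts and masks, and torch.compile can fuse such elementwise ops into a single kernel:

```python
import torch

def pack_2bit(x: torch.Tensor) -> torch.Tensor:
    """Pack a uint8 tensor of 2-bit values (0..3) into one quarter of the bytes."""
    assert x.numel() % 4 == 0
    x = x.reshape(-1, 4)
    return x[:, 0] | (x[:, 1] << 2) | (x[:, 2] << 4) | (x[:, 3] << 6)

def unpack_2bit(packed: torch.Tensor) -> torch.Tensor:
    """Recover the original 2-bit values from the packed uint8 tensor."""
    shifts = torch.tensor([0, 2, 4, 6], dtype=torch.uint8)
    return ((packed.unsqueeze(-1) >> shifts) & 0x3).reshape(-1)

vals = torch.randint(0, 4, (1024,), dtype=torch.uint8)
packed = pack_2bit(vals)
assert torch.equal(unpack_2bit(packed), vals)
```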

FP8 split-gemm kernel (https://github.com/pytorch/ao/pull/263)
Added the kernel written by AdnanHoque to torchao, with [speedups](https://github.com/pytorch/ao/pull/263#issuecomment-2130284378) compared to the cuBLAS kernel for batch sizes <= 16.


BC Breaking

Deprecations
* Deprecate top level quantization APIs https://github.com/pytorch/ao/pull/344

1. int8 weight only quantization
`apply_weight_only_int8_quant(model)` or `change_linear_weights_to_int8_woqtensors(model)`

-->

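The replacement is the unified `quantize_` flow; a minimal sketch, assuming the `int8_weight_only` config name used in later releases:

```python
from torchao.quantization import quantize_, int8_weight_only

# replaces apply_weight_only_int8_quant(model) / change_linear_weights_to_int8_woqtensors(model)
quantize_(model, int8_weight_only())
```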

0.3

```python
from torchao.sparsity import apply_sparse_semi_structured
apply_sparse_semi_structured(model)
```


Deprecations


New Features
* Added kv_cache quantization https://github.com/pytorch/ao/pull/532
* Migrated float8_experimental to `torchao.float8`, enabling float8 training support https://github.com/pytorch/ao/pull/551 https://github.com/pytorch/ao/pull/529
* Added FP5 E2M2 https://github.com/pytorch/ao/pull/399
* Added 4-bit, 8-bit, and FP8 ADAM support https://github.com/pytorch/ao/pull/478 https://github.com/pytorch/ao/pull/463 https://github.com/pytorch/ao/pull/482
* Added FSDP2 support for low-bit optimizers https://github.com/pytorch/ao/pull/484
* [prototype] mixed-precision quantization and eval framework https://github.com/pytorch/ao/pull/531
* Added int4 weight-only QAT support https://github.com/pytorch/ao/pull/555, https://github.com/pytorch/ao/pull/383
* Added custom CUDA `tinygemm` unpacking ops https://github.com/pytorch/ao/pull/415


Improvements
* Composing quantization and sparsity now uses the unified AQT Layout https://github.com/pytorch/ao/pull/498
* Added default inductor config settings https://github.com/pytorch/ao/pull/423
* Better dtype and device handling for `Int8DynActInt4WeightQuantizer` and `Int4WeightOnlyQuantizer` https://github.com/pytorch/ao/pull/475 https://github.com/pytorch/ao/pull/479
* Enable `model.to` for int4/int8 weight only quantized models https://github.com/pytorch/ao/pull/486 https://github.com/pytorch/ao/pull/522
* Added more logging to `TensorCoreTiledAQTLayout` https://github.com/pytorch/ao/pull/520
* Added general `fake_quantize_affine` op with mask support https://github.com/pytorch/ao/pull/492 https://github.com/pytorch/ao/pull/500
* QAT now uses the shared `fake_quantize_affine` primitive https://github.com/pytorch/ao/pull/527
* Improve FSDP support for low-bit optimizers https://github.com/pytorch/ao/pull/538
* Custom op and inductor decomp registration now uses a decorator https://github.com/pytorch/ao/pull/434
* Updated torch version to no longer require `unwrap_tensor_subclass` https://github.com/pytorch/ao/pull/595


Bug fixes
* Fixed import for `TORCH_VERSION_AFTER_*` https://github.com/pytorch/ao/pull/433
* Fixed crash when PYTORCH_VERSION is not defined https://github.com/pytorch/ao/pull/455
* Added `torch.compile` support for `NF4Tensor` https://github.com/pytorch/ao/pull/544
* Added fbcode check to fix torchtune in Genie https://github.com/pytorch/ao/pull/480
* Fixed `int4pack_mm` error https://github.com/pytorch/ao/pull/517
* Fixed cuda device check https://github.com/pytorch/ao/pull/536
* Weight shuffling now runs on CPU for int4 quantization due to a MPS memory issue https://github.com/pytorch/ao/pull/552
* Scale and input now are the same dtype for int8 weight only quantization https://github.com/pytorch/ao/pull/534
* Fixed FP6-LLM API https://github.com/pytorch/ao/pull/595

Performance
* Added `segment-anything-fast` benchmarks for composed quantization + sparsity https://github.com/pytorch/ao/pull/457
* Updated low-bit Adam benchmark https://github.com/pytorch/ao/pull/481


Docs
* Updated README.md https://github.com/pytorch/ao/pull/583 https://github.com/pytorch/ao/pull/438 https://github.com/pytorch/ao/pull/445 https://github.com/pytorch/ao/pull/460
* Updated installation instructions https://github.com/pytorch/ao/pull/447 https://github.com/pytorch/ao/pull/459
* Added more docs for int4_weight_only API https://github.com/pytorch/ao/pull/469
* Added developer guide notebook https://github.com/pytorch/ao/pull/588
* Added optimized model serialization/deserialization doc https://github.com/pytorch/ao/pull/524 https://github.com/pytorch/ao/pull/525
* Added new float8 feature tracker https://github.com/pytorch/ao/pull/557
* Added static quantization tutorial for calibration-based techniques https://github.com/pytorch/ao/pull/487


Devs
* Fix numpy version in CI https://github.com/pytorch/ao/pull/537
* trymerge now uploads merge records to s3 https://github.com/pytorch/ao/pull/448
* Updated python version to 3.9 https://github.com/pytorch/ao/pull/488
* `torchao` no longer depends on `torch` https://github.com/pytorch/ao/pull/449
* `benchmark_model` now accepts args and kwargs and supports `cpu` and `mps` backends https://github.com/pytorch/ao/pull/586 https://github.com/pytorch/ao/pull/406
* Add git version suffix to package name https://github.com/pytorch/ao/pull/547
* Added validations to torchao https://github.com/pytorch/ao/pull/453 https://github.com/pytorch/ao/pull/454
* Parallel test support with pytest-xdist https://github.com/pytorch/ao/pull/518
* `Quantizer` now uses `logging` instead of `print` https://github.com/pytorch/ao/pull/472


Not user facing
* Refactored `_replace_linear_8da4w` https://github.com/pytorch/ao/pull/451
* Remove unused code from AQT implementation https://github.com/pytorch/ao/pull/476 https://github.com/pytorch/ao/pull/440 https://github.com/pytorch/ao/pull/441 https://github.com/pytorch/ao/pull/471
* Improved error message for lm_eval script https://github.com/pytorch/ao/pull/444
* Updated HF_TOKEN env variable https://github.com/pytorch/ao/pull/427
* Fixed typo in Quant-LLM in https://github.com/pytorch/ao/pull/450
* Add a test for map_location="cpu" in https://github.com/pytorch/ao/pull/497
* Removed sparse test collection warning https://github.com/pytorch/ao/pull/489
* Refactored layout implementation https://github.com/pytorch/ao/pull/491
* Refactored `LinearActQuantizedTensor` https://github.com/pytorch/ao/pull/542

New Contributors
* qingquansong made their first contribution in https://github.com/pytorch/ao/pull/433
* Hanxian97 made their first contribution in https://github.com/pytorch/ao/pull/451
* larryliu0820 made their first contribution in https://github.com/pytorch/ao/pull/472
* SLR722 made their first contribution in https://github.com/pytorch/ao/pull/480
* jainapurva made their first contribution in https://github.com/pytorch/ao/pull/406
* bdhirsh made their first contribution in https://github.com/pytorch/ao/pull/544
* yanbing-j made their first contribution in https://github.com/pytorch/ao/pull/517
* manuelcandales made their first contribution in https://github.com/pytorch/ao/pull/552
* Valentine233 made their first contribution in https://github.com/pytorch/ao/pull/534

**Full Changelog**: https://github.com/pytorch/ao/compare/v0.3.1-rc1...v0.4.0-rc1

We were able to close about [60% of tasks for 0.4.0](https://github.com/pytorch/ao/issues/493), which will now spill over into upcoming releases. We will post a list for 0.5.0 next, which we aim to release at the end of August 2024. We want to follow a monthly release cadence until further notice.
