Torchao

Latest version: v0.8.0


Benchmarks on A100 for the new W4A8 CUTLASS kernel (from the 0.8.0 release notes below):

| `-q parameter` | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
| :--- | ---: | ---: | ---: | ---: |
| `-q int8wo` | 155.31 | 1028.37 | 8.97 | 6.62 |
| `-q int4wo-32` | 186.70 | 774.98 | 5.31 | 4.15 |
| `-q int4wo-hqq` | 186.47 | 774.01 | 5.04 | 4.15 |
| `-q int8dq` | 49.64 | 328.72 | 9.44 | 6.62 |
| `-q w4a8-cutlass` (**tuned**) | 119.31 | 394.86 | 4.52 | 3.31 |

Prefill performance benchmarks

We’ve added TTFT [benchmarks](https://github.com/pytorch/ao/pull/1140) to torchao and compared the speedups from different quantization + sparsity techniques for prefill and decoding. During prefill we are compute bound, so dynamic quantization offers greater speedups than weight-only quantization, which is better suited to memory-bound decoding. We’ve also added an [option](https://github.com/pytorch/ao/pull/1436) that applies int8 dynamic quantization during prefill and falls back to weight-only quantization during decoding.

![Screenshot 2025-01-15 at 10 06 09 AM](https://github.com/user-attachments/assets/06a029db-db48-4053-9c7b-9e6a47d9361f)
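Conceptually, the per-phase selection looks like the sketch below. This is a hand-rolled illustration of the compute-bound vs. memory-bound tradeoff, not torchao's implementation; the real logic lives inside the int8 dynamic quantization path added in the PR above.

```python
import torch

def linear_forward(x: torch.Tensor, w_int8: torch.Tensor, w_scale: torch.Tensor) -> torch.Tensor:
    """x: (..., tokens, in); w_int8: (out, in) int8; w_scale: (out, 1) per-row scales."""
    if x.shape[-2] > 1:
        # Prefill: many tokens at once -> compute bound. Quantize activations on the
        # fly (dynamic quantization); a real kernel would then use an int8 matmul,
        # which is simulated here in floating point.
        x_scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 127.0
        x_int8 = torch.clamp(torch.round(x / x_scale), -128, 127)
        return (x_int8 @ w_int8.to(x.dtype).t()) * x_scale * w_scale.t()
    # Decode: one token at a time -> memory bound. Weight-only quantization: just
    # dequantize the int8 weight and run the matmul in the activation dtype.
    return x @ (w_int8.to(x.dtype) * w_scale).t()
```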

BC Breaking

Delete the float8-all-gather-only functionality from float8 training ([https://github.com/pytorch/ao/pull/1451](https://github.com/pytorch/ao/pull/1451))

`use_fp8_all_gather_only` was an experimental flag, off by default, which was never publicized and, as far as we know, not used by anyone. We are removing it to simplify the code.

**Before**

```python
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

config = Float8LinearConfig(
    ...,
    # the option below is being removed
    use_fp8_all_gather_only=True,
)
convert_to_float8_training(model, config=config, ...)
```


**After**

The `use_fp8_all_gather_only` option is no longer supported.
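For completeness, a minimal sketch of the equivalent setup after upgrading, assuming nothing else changes; the flag is simply dropped from the snippet above:

```python
config = Float8LinearConfig(
    ...,  # all other options unchanged; use_fp8_all_gather_only no longer exists
)
convert_to_float8_training(model, config=config, ...)
```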

New Features

* Add TTFT benchmarks + update sparsity benchmarks ([https://github.com/pytorch/ao/pull/1140](https://github.com/pytorch/ao/pull/1140))
* Gemlite integration in torchao ([https://github.com/pytorch/ao/pull/1034](https://github.com/pytorch/ao/pull/1034))
* W4A8 based on CUTLASS ([https://github.com/pytorch/ao/pull/880](https://github.com/pytorch/ao/pull/880))

Improvements

quantize_

* Expose zero_point_domain as arguments ([https://github.com/pytorch/ao/pull/1401](https://github.com/pytorch/ao/pull/1401))
* Add convert path for quantize_ QAT API ([https://github.com/pytorch/ao/pull/1540](https://github.com/pytorch/ao/pull/1540))
* Int8 dynamic prefill weight only decode ([https://github.com/pytorch/ao/pull/1436](https://github.com/pytorch/ao/pull/1436))

autoquant

* Make int8 dynamic quant in autoquant serializable ([https://github.com/pytorch/ao/pull/1484](https://github.com/pytorch/ao/pull/1484))
* Additional fixes for autoquant serialization ([https://github.com/pytorch/ao/pull/1486](https://github.com/pytorch/ao/pull/1486))
* Add exhaustive config option to intmm kernel ([https://github.com/pytorch/ao/pull/1392](https://github.com/pytorch/ao/pull/1392))

float8 training

* \[float8\] Allow specifying arbitrary dtype for each tensor, enabling recipes with e4m3 in both the forward and the backward ([https://github.com/pytorch/ao/pull/1378](https://github.com/pytorch/ao/pull/1378))

experimental

* Remove temp build files from torchao ([https://github.com/pytorch/ao/pull/1551](https://github.com/pytorch/ao/pull/1551))

other

* Torchao setup.py with cmake ([https://github.com/pytorch/ao/pull/1490](https://github.com/pytorch/ao/pull/1490))

Bug Fixes

* Fix bfloat16/float16/float32 options ([https://github.com/pytorch/ao/pull/1369](https://github.com/pytorch/ao/pull/1369))
* Fix a bug in LinearActivationQuantizedTensor ([https://github.com/pytorch/ao/pull/1400](https://github.com/pytorch/ao/pull/1400))
* Fix error message in float8 FSDP utils ([https://github.com/pytorch/ao/pull/1423](https://github.com/pytorch/ao/pull/1423))
* Fixes observer attachment to model based on config for wanda sparsifier ([https://github.com/pytorch/ao/pull/1265](https://github.com/pytorch/ao/pull/1265))
* \[resubmit\] Gemlite fix ([https://github.com/pytorch/ao/pull/1435](https://github.com/pytorch/ao/pull/1435))
* 🐛 Fix: Memory leak in image processing endpoint ([https://github.com/pytorch/ao/pull/1513](https://github.com/pytorch/ao/pull/1513))

Performance

* \[float8\] Re-enable slow-accum in the bwd of axis-wise scaling schemes ([https://github.com/pytorch/ao/pull/1377](https://github.com/pytorch/ao/pull/1377))

Documentation

* Update api\_ref\_quantization.rst ([https://github.com/pytorch/ao/pull/1408](https://github.com/pytorch/ao/pull/1408))
* Update index.rst ([https://github.com/pytorch/ao/pull/1409](https://github.com/pytorch/ao/pull/1409))
* Update QAT READMEs using new APIs ([https://github.com/pytorch/ao/pull/1541](https://github.com/pytorch/ao/pull/1541))

Developers

* Pytorch/ao/torchao/experimental/ops/mps/test ([https://github.com/pytorch/ao/pull/1442](https://github.com/pytorch/ao/pull/1442))
* Verify that submodules are checked out ([https://github.com/pytorch/ao/pull/1536](https://github.com/pytorch/ao/pull/1536))

New Contributors

* sanchitintel made their first contribution in [https://github.com/pytorch/ao/pull/1375](https://github.com/pytorch/ao/pull/1375)
* philipbutler made their first contribution in [https://github.com/pytorch/ao/pull/1337](https://github.com/pytorch/ao/pull/1337)
* airMeng made their first contribution in [https://github.com/pytorch/ao/pull/1401](https://github.com/pytorch/ao/pull/1401)
* DerekLiu35 made their first contribution in [https://github.com/pytorch/ao/pull/1299](https://github.com/pytorch/ao/pull/1299)
* agrawal-aka made their first contribution in [https://github.com/pytorch/ao/pull/1265](https://github.com/pytorch/ao/pull/1265)
* gmagogsfm made their first contribution in [https://github.com/pytorch/ao/pull/1443](https://github.com/pytorch/ao/pull/1443)
* dongxiaolong made their first contribution in [https://github.com/pytorch/ao/pull/1513](https://github.com/pytorch/ao/pull/1513)

**Full Changelog**: [https://github.com/pytorch/ao/compare/v0.7.0...v0.8.0-rc2](https://github.com/pytorch/ao/compare/v0.7.0...v0.8.0-rc2)


| | awq-uint4 | 43.59 | 194.93 | 7.31 | 4.47 |
| | int4wo-hqq | 209.19 | 804.32 | 4.89 | 3.84 |
| | int4wo-64 | 201.14 | 751.42 | 4.87 | 3.74 |

Usage:

```python
import torch

from torchao.quantization import quantize_
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear

# model, device, and calibration_data are assumed to be defined already
quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024

model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
```

New Features

- [Prototype] Added Float8 support for AQT tensor parallel (1003)
- Added composable QAT quantizer (938)
- Introduced torchchat quantizer (897)
- Added INT8 mixed-precision training (748)
- Implemented sparse marlin AQT layout (621)
- Added a PerTensor static quant api (787)
- Introduced uintx quant to generate and eval (811)
- Added Float8 Weight Only and FP8 weight + dynamic activation (740)
- Implemented Auto-Round support (581)
- Added 2, 3, 4, 5 bit custom ops (828)
- Introduced symmetric quantization with no clipping error in the tensor subclass based API (845)
- Added int4 weight-only embedding QAT (947)
- Added support for 1-bit and 6-bit quantization for Llama in torchchat (910, 1007)
- Added a linear_observer class for doing static activation calibration (807)
- Exposed hqq through uintx_weight_only API (786)
- Added RowWise scaling option for Float8 dynamic activation quantization (819)
- Added Float8 weight only to autoquant api (866)

Improvements

- Enhanced Auto-Round functionality (870)
- Improved FSDP support for low-bit optimizers (538)
- Added support for using AffineQuantizedTensor with `weights_only=True` for torch.load (630)
- Optimized 3-bit packing (1029)
- Added more evaluation metrics to llama/eval.sh (934)
- Improved eager numerics for dynamic scales in float8 (904)

Bug fixes

- Fixed inference_mode issues (885)
- Fixed failing FP6 benchmark (931)
- Resolved various issues with float8 support (918, 923)
- Fixed load state dict when device is different for low-bit optim (1021)

Performance

- Added SM75 (Turing) support for FP6 kernel (942)
- Implemented int8 dynamic quant + bsr support (821)
- Added workaround to recover the perf for quantized vit in torch.compile (926)

INT8 Mixed-Precision Training

On NVIDIA GPUs, INT8 tensor cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training we dynamically down-cast activations and weights to INT8 to take advantage of the faster matmuls. However, since INT8 has a very limited range [-128, 127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The weights themselves remain stored in the original precision; only the matmul inputs are cast.

```python
from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# apply INT8 matmul to all 3 matmuls (output, grad_input, grad_weight)
quantize_(model, int8_mixed_precision_training())

# customize which matmuls are left in the original precision
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
```
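To make the row-wise scheme above concrete, here is a hand-rolled sketch of row-wise dynamic INT8 quantization of a tensor (purely illustrative; the torchao kernels fuse this scaling with the matmul):

```python
import torch

def rowwise_int8_quantize(x: torch.Tensor):
    # One scale per row, chosen so the row's max magnitude maps to 127.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 127.0
    x_int8 = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_int8, scale

x = torch.randn(8, 2048)
x_int8, scale = rowwise_int8_quantize(x)
x_approx = x_int8.float() * scale  # dequantized approximation of x
```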

**End2end speed benchmark** using `benchmarks/quantized_training/pretrain_llama2.py`

Model & GPU | bs x seq_len | Config | Tok/s | Peak mem (GB)
-----|-----|-----|-----|-----
Llama2-7B, A100 | 8 x 2048 | BF16 (baseline) | ~4400 | 59.69
Llama2-7B, A100 | 8 x 2048 | INT8 mixed-precision | ~6100 (**+39%**) | 58.28
Llama2-1B, 4090 | 16 x 2048 | BF16 (baseline) | ~17,900 | 18.23
Llama2-1B, 4090 | 16 x 2048 | INT8 mixed-precision | ~30,700 (**+72%**) | 18.34

Docs

- Updated README with more current float8 speedup information (816)
- Added tutorial for trainable tensor subclass (908)
- Improved documentation for float8 unification and inference (895, 896)

Devs

- Added compile tests to test suite (906)
- Improved CI setup and build processes (887)
- Added M1 wheel support (822)
- Added more benchmarking and profiling tools (1017)
- Renamed `fpx` to `floatx` (877)
- Removed torchao_nightly package (661)
- Added more lint fixes (827)
- Added better subclass testing support (839)
- Added CI to catch syntax errors (861)
- Added tutorial on composing quantized subclass w/ Dtensor based TP (785)

Security

No significant security updates in this release.

Uncategorized

- Added basic SAM2 AutomaticMaskGeneration example server (1039)

New Contributors

* iseeyuan made their first contribution in https://github.com/pytorch/ao/pull/805
* YihengBrianWu made their first contribution in https://github.com/pytorch/ao/pull/860
* kshitij12345 made their first contribution in https://github.com/pytorch/ao/pull/863
* ZainRizvi made their first contribution in https://github.com/pytorch/ao/pull/887
* alexsamardzic made their first contribution in https://github.com/pytorch/ao/pull/899
* vaishnavi17 made their first contribution in https://github.com/pytorch/ao/pull/911
* tobiasvanderwerff made their first contribution in https://github.com/pytorch/ao/pull/931
* kwen2501 made their first contribution in https://github.com/pytorch/ao/pull/937
* y-sq made their first contribution in https://github.com/pytorch/ao/pull/912
* jimexist made their first contribution in https://github.com/pytorch/ao/pull/969
* danielpatrickhug made their first contribution in https://github.com/pytorch/ao/pull/914
* ramreddymounica made their first contribution in https://github.com/pytorch/ao/pull/1007
* yushangdi made their first contribution in https://github.com/pytorch/ao/pull/1006
* ringohoffman made their first contribution in https://github.com/pytorch/ao/pull/1023

**Full Changelog**: https://github.com/pytorch/ao/compare/v0.5.0...v0.6.1

2.4

```python
from torchao.quantization import quantize, int4_weight_only
quantize(model, int4_weight_only())
```

2.3

```python
from torchao.quantization.quant_api import change_linear_weights_to_int4_woqtensors
change_linear_weights_to_int4_woqtensors(model)
```



New Features
* Add `quantize` https://github.com/pytorch/ao/pull/256
* Add a prototype of MX format training and inference https://github.com/pytorch/ao/pull/264
* [FP6-LLM] Port splitK map from DeepSpeed https://github.com/pytorch/ao/pull/283
* Improve FP6-LLM 2+4bit weight splitting + user API https://github.com/pytorch/ao/pull/279
* Bitpacking https://github.com/pytorch/ao/pull/291
* training acceleration via runtime semi-structured sparsity https://github.com/pytorch/ao/pull/184
* Bitpackingv2 https://github.com/pytorch/ao/pull/307
* Add FP6-LLM doc and move FP6-LLM to prototype https://github.com/pytorch/ao/pull/358
* Added first bits of Uint2Tensor and BitnetTensor https://github.com/pytorch/ao/pull/282


Improvements
* Improve primitives for FP6 quant https://github.com/pytorch/ao/pull/248
* Extract eval code from GPTQ for more general usage https://github.com/pytorch/ao/pull/275
* Factor out the specific configurations to helper functions https://github.com/pytorch/ao/pull/286
* Add support for `AQTLayout`, `PlainAQTLayout` and `TensorCoreTiledAQTLayout` https://github.com/pytorch/ao/pull/278
* Graceful handling of cpp extensions https://github.com/pytorch/ao/pull/296
* Refactor int8 dynamic quantization with call to `quantize` https://github.com/pytorch/ao/pull/294
* [NF4][FSDP] return contiguous `quantization_factor` https://github.com/pytorch/ao/pull/298
* Refactor int4 and int8 weight only quantization to use `quantize` https://github.com/pytorch/ao/pull/301
* Adding a quick way for users to test model eval for hf models https://github.com/pytorch/ao/pull/328
* Wrap torch.ops.quantized_decomposed to improve import errors https://github.com/pytorch/ao/pull/310
* [NF4Tensor] Switch to save for backward since are now a tensor input https://github.com/pytorch/ao/pull/323
* Refactor rest of tinygemm quant primitive ops https://github.com/pytorch/ao/pull/321
* Move some util functions from quantization.utils to torchao.utils https://github.com/pytorch/ao/pull/337
* Clean up FP6-LLM https://github.com/pytorch/ao/pull/304
* Move quant ops to utils.py https://github.com/pytorch/ao/pull/331
* FP6-LLM clean up (again) https://github.com/pytorch/ao/pull/339
* Improving hf_eval.py https://github.com/pytorch/ao/pull/342
* Generalize Model Size Code https://github.com/pytorch/ao/pull/364
* Minor upgrades to bit pack https://github.com/pytorch/ao/pull/347
* Factor out dispatch and layout registration table https://github.com/pytorch/ao/pull/360
* Add `register_apply_tensor_subclass` https://github.com/pytorch/ao/pull/366
* Refactor custom FPx cast https://github.com/pytorch/ao/pull/363
* Remove all dependencies except torch https://github.com/pytorch/ao/pull/369
* Enable a test for loading state_dict with tensor subclasses https://github.com/pytorch/ao/pull/389
* 073 scripts for benchmarks https://github.com/pytorch/ao/pull/372
* Add WOQ int8 test with Inductor Freeze https://github.com/pytorch/ao/pull/362
* Benchmarking updates for semi-structured sparse training https://github.com/pytorch/ao/pull/398
* add FSDP QLoRA test and revert failing PR https://github.com/pytorch/ao/pull/403
* Refactor the API for quant method argument for quantize function https://github.com/pytorch/ao/pull/400
* eval script fixes https://github.com/pytorch/ao/pull/414


Bug Fixes
* Fixed the HQQ import skip https://github.com/pytorch/ao/pull/262
* fixing autoquant bug https://github.com/pytorch/ao/pull/265
* Fix eval import after 275 https://github.com/pytorch/ao/pull/290
* Fixed f-string printing of `NF4Tensor`s https://github.com/pytorch/ao/pull/297
* Check and fix dequantize_affine is idempotent https://github.com/pytorch/ao/pull/309
* Update old pretrained TorchVision API in ao tutorials (313) https://github.com/pytorch/ao/pull/314
* Fix dimension issues for int4 weight only quant path https://github.com/pytorch/ao/pull/330
* Fix compile in `hf_eval.py` https://github.com/pytorch/ao/pull/341
* task_list to tasks in hf_eval https://github.com/pytorch/ao/pull/343
* fixing peak memory stats for benchmark https://github.com/pytorch/ao/pull/353
* Fix inductor config BC change https://github.com/pytorch/ao/pull/382
* fixing scripts https://github.com/pytorch/ao/pull/395

Performance
* FP8 splitgemm user defined triton kernel https://github.com/pytorch/ao/pull/263
* sparse benchmarking numbers https://github.com/pytorch/ao/pull/303
* Fix FP6-LLM benchmark https://github.com/pytorch/ao/pull/312
* Adding Llama to TorchAO https://github.com/pytorch/ao/pull/276
* Generalize Model Size Code https://github.com/pytorch/ao/pull/364
* eval script for llama https://github.com/pytorch/ao/pull/374
* 077 autoquant gpt fast https://github.com/pytorch/ao/pull/361


Docs
* add static folder for images + fix links https://github.com/pytorch/ao/pull/271
* Fix Readme and remove unused kernel https://github.com/pytorch/ao/pull/270
* Kernel docs https://github.com/pytorch/ao/pull/274
* Quantization Docstrings https://github.com/pytorch/ao/pull/273
* Add `AffineQuantizedTensor` based workflow doc and examples https://github.com/pytorch/ao/pull/277
* Add `AUTOQUANT_CACHE` docs for reusing the same quantization plan https://github.com/pytorch/ao/pull/329
* Update nightly build instructions https://github.com/pytorch/ao/pull/334
* add link to benchmarking script https://github.com/pytorch/ao/pull/355
* New README https://github.com/pytorch/ao/pull/392
* Minor README updates https://github.com/pytorch/ao/pull/401
* Add `quantize` to doc page https://github.com/pytorch/ao/pull/367
* Add link to new custom op tutorial https://github.com/pytorch/ao/pull/424

Devs
* ci: Add push trigger for binary build workflows https://github.com/pytorch/ao/pull/259
* Make fp8 test explicit https://github.com/pytorch/ao/pull/266
* Move `AffineQuantizedTensor` to torchao/dtypes https://github.com/pytorch/ao/pull/272
* Add suffix to package version https://github.com/pytorch/ao/pull/293
* Re-enable AOTI tests https://github.com/pytorch/ao/pull/212
* Add fused QKV `HQQ` `triton_mm` test https://github.com/pytorch/ao/pull/306
* Pin CUDA nightly to mitigate regression https://github.com/pytorch/ao/pull/322
* Unpin CUDA nightly https://github.com/pytorch/ao/pull/333
* Add architecture to index postfix for nightly builds https://github.com/pytorch/ao/pull/336
* Update regression test to python 3.8 https://github.com/pytorch/ao/pull/340
* Remove test_ops.py warning spew https://github.com/pytorch/ao/pull/267
* Add torchao.__version__ https://github.com/pytorch/ao/pull/359
* make torchao test discovery pass in fbcode https://github.com/pytorch/ao/pull/351
* use pytorch version env variable https://github.com/pytorch/ao/pull/373
* Update pre_build_script.sh https://github.com/pytorch/ao/pull/390
* Add support for building CUDA extension on Windows https://github.com/pytorch/ao/pull/396
* Add trymerge https://github.com/pytorch/ao/pull/388
* Fix github CI error https://github.com/pytorch/ao/pull/409
* Fix missing dependencies in trymerge workflow https://github.com/pytorch/ao/pull/413
* Setup trymerge secrets https://github.com/pytorch/ao/pull/416
* Pin CUDA nightlies for mx failures https://github.com/pytorch/ao/pull/428
* fix mx triton kernel after PyTorch triton pin change https://github.com/pytorch/ao/pull/431

Uncategorized
* Print the code when the check failed https://github.com/pytorch/ao/pull/254
* Retry of D58015187 Move AsyncCompile to a different file by jamesjwu in https://github.com/pytorch/ao/pull/302
* Revert "Clean up FP6-LLM" https://github.com/pytorch/ao/pull/338
* Update version to 0.3.0 https://github.com/pytorch/ao/pull/348
* Add torchao.__version__ https://github.com/pytorch/ao/pull/359


New Contributors
* seemethere made their first contribution in https://github.com/pytorch/ao/pull/259
* yiliu30 made their first contribution in https://github.com/pytorch/ao/pull/262
* vkuzo made their first contribution in https://github.com/pytorch/ao/pull/264
* vayuda made their first contribution in https://github.com/pytorch/ao/pull/291
* awgu made their first contribution in https://github.com/pytorch/ao/pull/297
* jamesjwu made their first contribution in https://github.com/pytorch/ao/pull/302
* kit1980 made their first contribution in https://github.com/pytorch/ao/pull/314
* RobinKa made their first contribution in https://github.com/pytorch/ao/pull/329
* andreaskoepf made their first contribution in https://github.com/pytorch/ao/pull/282
* clee2000 made their first contribution in https://github.com/pytorch/ao/pull/388

**Full Changelog**: https://github.com/pytorch/ao/compare/v0.2.0...v0.3.0-rc1


We were able to close about [60% of the tasks planned for 0.3.0](https://github.com/pytorch/ao/issues/252); the remainder will spill over into upcoming releases. We will post a list for 0.4.0 next, which we aim to release at the end of July 2024. We plan to follow a monthly release cadence until further notice.

EDIT: We made a patch release, 0.3.1, to include two more PRs, https://github.com/pytorch/ao/pull/449 and https://github.com/pytorch/ao/pull/455, so ao now has no runtime dependencies.

0.8.0

Highlights

We are excited to announce the 0.8.0 release of torchao! In this release we’ve shipped the first CUTLASS kernel in torchao, adding support for a W4A8 linear operator. We’ve also added TTFT benchmarks to torchao and compared the speedups from different quantization + sparsity techniques for prefill and decoding.

W4A8 based on CUTLASS

A new W4A8 linear operator has been implemented, corresponding to `int8_dynamic_activation_int4_weight` quantization, where two 4-bit weights are packed into a single 8-bit integer. In addition, CUTLASS has been made a submodule of the torchao repo so that more of its functionality can be used to implement new kernels.
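As a rough illustration of the weight packing (a standalone sketch of the idea, not the exact layout the CUTLASS kernel uses), two signed 4-bit values can be stored in one byte like this:

```python
import torch

def pack_int4_pairs(w: torch.Tensor) -> torch.Tensor:
    # w holds signed 4-bit values in [-8, 7]; pack adjacent pairs into one byte.
    assert w.shape[-1] % 2 == 0
    w32 = w.to(torch.int32)
    lo = w32[..., 0::2] & 0xF
    hi = w32[..., 1::2] & 0xF
    return ((hi << 4) | lo).to(torch.uint8)

def unpack_int4_pairs(packed: torch.Tensor) -> torch.Tensor:
    p = packed.to(torch.int32)
    lo, hi = p & 0xF, (p >> 4) & 0xF
    # Sign-extend the 4-bit fields back to signed integers.
    lo = torch.where(lo >= 8, lo - 16, lo)
    hi = torch.where(hi >= 8, hi - 16, hi)
    return torch.stack([lo, hi], dim=-1).flatten(-2).to(torch.int8)

w = torch.randint(-8, 8, (4, 8), dtype=torch.int8)
assert torch.equal(unpack_int4_pairs(pack_int4_pairs(w)), w)
```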

Benchmarks on A100
| `-q parameter` | Average tokens/sec | Average Bandwidth in GB/s | Peak Memory Usage in GB | Model Size in GB |
| :--- | ---: | ---: | ---: | ---: |

0.7.0rc3

Highlights

We are excited to announce the 0.7.0 release of torchao! This release moves QAT out of prototype with improved LoRA support and more flexible APIs, and adds support for new experimental kernels such as Marlin QQQ (for CUDA), `int8_dynamic_activation_intx_weight` (for ARM CPU), and more!

QAT moved out of prototype, LoRA integration, new flexible APIs (1020, 1085, 1152, 1037, 1152)

QAT has been moved out of prototype to `torchao/quantization/qat` to provide better API stability guarantees moving forward. In addition to the existing `*QATQuantizer` classes, we now also support the more flexible `FakeQuantizedLinear` and `FakeQuantizedEmbedding` modules for users to configure the exact quantization settings they wish to use during QAT.

```python
import torch

from torchao.quantization.qat.api import FakeQuantizeConfig
from torchao.quantization.qat.embedding import FakeQuantizedEmbedding
from torchao.quantization.qat.linear import FakeQuantizedLinear

# Specify the quantization schemes to use during QAT
activation_config = FakeQuantizeConfig(torch.int8, "per_token", is_symmetric=False)
weight_config = FakeQuantizeConfig(torch.int4, group_size=8)

# Replace nn.Linear and nn.Embedding with these in your model
fq_linear = FakeQuantizedLinear(16, 32, False, activation_config, weight_config)
fq_embedding = FakeQuantizedEmbedding(16, 32, weight_config=weight_config)
```

We also leveraged the new flexible APIs to build a new QAT + LoRA fine-tuning flow in torchtune. Try it out today!

```bash
tune run --nnodes 1 --nproc_per_node 4 qat_lora_finetune_distributed --config llama3/8B_qat_lora
```

Marlin QQQ for CUDA (1113)

Marlin QQQ is an optimized GPU kernel that supports W4A8 mixed-precision GEMM. For more details about Marlin QQQ, please refer to the [paper](https://arxiv.org/pdf/2406.09904).

```python
from torchao.dtypes import MarlinQQQLayout
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight
from torchao.quantization.quant_primitives import MappingType

quantize_(
    model,
    int8_dynamic_activation_int4_weight(
        group_size=128,
        mapping_type=MappingType.SYMMETRIC,
        act_mapping_type=MappingType.SYMMETRIC,
        layout=MarlinQQQLayout(),
    ),
)
```


Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#marlin-qqq.

This is a prototype feature, so feel free to try it out!

int8_dynamic_activation_intx_weight Quantization for ARM CPU (995, 1027, 1254, 1353)

We have kernels that do 8-bit dynamic quantization of activations and uintx groupwise quantization of weights. These kernels are experimental and can only be run on a device with an ARM CPU (e.g., a Mac with Apple silicon).

```python
import os

import torch
from torchao.quantization import quantize_
from torchao.experimental.quant_api import int8_dynamic_activation_intx_weight

# `model` and `precision` come from the surrounding script
assert precision == torch.float32, "int8_dynamic_activation_intx_weight requires fp32 precision"

# Build the kernels in a temp location and load them in torch.
# This requires an ARM CPU.
from torchao.experimental.temp_build import temp_build_and_load_torchao_ops
temp_build_and_load_torchao_ops(cmake_lists_path=os.path.dirname(os.path.realpath(__file__)) + "/../../experimental")

# Quantize the model
nbit = 4
assert 1 <= nbit <= 8, "nbit must be 1 to 8"
group_size = 128
has_weight_zeros = False
quantize_(
    model,
    int8_dynamic_activation_intx_weight(
        group_size=group_size,
        nbit=nbit,
        has_weight_zeros=has_weight_zeros,
    ),
)
```


Benchmarking results can be found in https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#int8_dynamic_activation_intx_weight-quantization

We are still trying to figure out how to ship the ARM CPU kernels, so the exact API is subject to change.

BC Breaking

Rename AQT2 LayoutType -> Layout ([1049](https://github.com/pytorch/ao/pull/1049))

Before:


```python
from torchao.dtypes import (
    BlockSparseLayoutType,
    Int4CPULayoutType,
    MarlinQQQLayoutType,
    MarlinSparseLayoutType,
    SemiSparseLayoutType,
    TensorCoreTiledLayoutType,
    UintxLayoutType,
    Float8LayoutType,
    LayoutType,
    PlainLayoutType,
)
```


After:


```python
from torchao.dtypes import (
    BlockSparseLayout,
    Int4CPULayout,
    MarlinQQQLayout,
    MarlinSparseLayout,
    SemiSparseLayout,
    TensorCoreTiledLayout,
    UintxLayout,
    Float8Layout,
    Layout,
    PlainLayout,
)
```


QAT imports after move out of prototype (1091)

Before:

```python
from torchao.quantization.prototype.qat import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.prototype.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.prototype.qat.fake_quantizer import (
    FakeQuantizer,
)
```


After:

```python
from torchao.quantization.qat import (
    ComposableQATQuantizer,
    Int4WeightOnlyQATQuantizer,
    Int4WeightOnlyEmbeddingQATQuantizer,
    Int8DynActInt4WeightQATQuantizer,
)
from torchao.quantization.qat.linear import (
    disable_4w_fake_quant,
    disable_8da4w_fake_quant,
    enable_4w_fake_quant,
    enable_8da4w_fake_quant,
    Int8DynActInt4WeightQATLinear,
)
from torchao.quantization.qat.api import (
    FakeQuantizeConfig,
)
from torchao.quantization.qat.fake_quantizer import (
    FakeQuantizer,
)
```


New Features

* Add BF16 stochastic rounding option for optimizers (https://github.com/pytorch/ao/pull/1124)
* Add quantize_() API support for NF4 (https://github.com/pytorch/ao/pull/1216)
* Support W4A8 Marlin kernel (https://github.com/pytorch/ao/pull/1113)

Improvements

quantize_

* Add default filtering to remove mis-aligned weights (https://github.com/pytorch/ao/pull/1194)
* Add tensor parallelism support for int4_weight_only quantization (https://github.com/pytorch/ao/pull/1120)
* Add support for asymmetric act quant for int8 dynamic quant (https://github.com/pytorch/ao/pull/1131)
* Add support for groupwise quantization for int8 weight only quantization (https://github.com/pytorch/ao/pull/1121)
* Add AQT tensor parallel for float8_dynamic_quant (https://github.com/pytorch/ao/pull/1078)
* Int8wo Embedding Quant (https://github.com/pytorch/ao/pull/1167)
* Making sure int4 weight only supports cpu as well (https://github.com/pytorch/ao/pull/1203)
* BF16 support for Quant-LLM kernel (https://github.com/pytorch/ao/pull/1147)
* Add hardware check to fp8 quant (https://github.com/pytorch/ao/pull/1314)
* Add support for quantize_() with Float8Linear module (https://github.com/pytorch/ao/pull/1344)

autoquant

* Added support for Per Tensor Scaling for Float8 Dynamic Autoquant (https://github.com/pytorch/ao/pull/1175)
* Add floating point options for autoquant and add accuracy measurement (https://github.com/pytorch/ao/pull/1355)

benchmarks

* Adding batchsize support for torchao llama benchmarks (https://github.com/pytorch/ao/pull/1182)
* Add capability of benchmarking arbitrary binary (https://github.com/pytorch/ao/pull/1107)

experimental

* Add embedding ops aten (https://github.com/pytorch/ao/pull/1129)
* Add embedding ops executorch (https://github.com/pytorch/ao/pull/1137)
* Add quantized embedding kernels to torchao (https://github.com/pytorch/ao/pull/1018)
* Allow deprecated declarations when using Parallel ExecuTorch (https://github.com/pytorch/ao/pull/1031)
* Introduce lowbit quantized linear MPS kernels (https://github.com/pytorch/ao/pull/954)
* Enable 6-bit kernel (https://github.com/pytorch/ao/pull/1027)
* Kleidi 4b blockwise gemv prototype (https://github.com/pytorch/ao/pull/997)
* Experimental 6-bit quantization for Llama in torchchat (https://github.com/pytorch/ao/pull/1094)
* Introduce 7-bit quantization for Llama in torchchat. (https://github.com/pytorch/ao/pull/1139)
* Executorch Subclass API (966) (https://github.com/pytorch/ao/pull/995)
* 8-bit packing support (https://github.com/pytorch/ao/pull/1248)
* Experimental Enable 8-bit (https://github.com/pytorch/ao/pull/1254)
* Experimental Benchmarking (https://github.com/pytorch/ao/pull/1353)

optimizer

* [low-bit optim] Upcast everything to FP32 for internal calculations (https://github.com/pytorch/ao/pull/1068)
* [Low-bit optim] Support for dcp.save() and dcp.load() (https://github.com/pytorch/ao/pull/1217)
* Enable CPU Offload for Intel GPU (https://github.com/pytorch/ao/pull/1324)

SAM2

* SAM2.1 copy (https://github.com/pytorch/ao/pull/1172)
* SAM2 AMG server side request batching (https://github.com/pytorch/ao/pull/1197)
* More SAM2-fast server improvements (https://github.com/pytorch/ao/pull/1285)
* SAM2 Fast AMG: memory profiling and more compile (https://github.com/pytorch/ao/pull/1296)
* SAM2 AMG cli and other QoL improvements (https://github.com/pytorch/ao/pull/1336)
* SAM2 AMG cli.py on modal (https://github.com/pytorch/ao/pull/1349)
* Reduce SAM2 AMG cli startup by using deploy (https://github.com/pytorch/ao/pull/1350)
* Reduce startup time for SAM2 AMG by using torch.export (https://github.com/pytorch/ao/pull/1358)
* More batching and improved furious accuracy/performance (https://github.com/pytorch/ao/pull/1253)
* SAM2.1 and example README (https://github.com/pytorch/ao/pull/1048)
* SAM2 AMG example mIoU, perf numbers and more SAM2 model annotations (https://github.com/pytorch/ao/pull/1196)

other

* Add SpinQuant to generate.py (https://github.com/pytorch/ao/pull/1069)
* SpinQuant (https://github.com/pytorch/ao/pull/983)
* SmoothQuant using tensor subclassing (https://github.com/pytorch/ao/pull/1030)
* Expose FakeQuantizeConfigs in QAT quantizers (https://github.com/pytorch/ao/pull/1214)
* Add module-swap UX for INT8 mixed-precision training (https://github.com/pytorch/ao/pull/1179)
* Float8 training: move module attribute setting to sync function (https://github.com/pytorch/ao/pull/1341)

Bug Fixes

* Header bug fix (https://github.com/pytorch/ao/pull/1079)
* Temporary fix for QAT quantizer when linear layer bias is True (https://github.com/pytorch/ao/pull/1087)
* Fix out-of-bounds memory access in Galore dequant kernel (https://github.com/pytorch/ao/pull/1125)
* Fixed weights_only=True load for float8_dynamic_activation_float8_weight in quant_api (https://github.com/pytorch/ao/pull/1122)
* Fix int8_weight_only group_size (https://github.com/pytorch/ao/pull/1165)
* Is_linear fix for MHA (https://github.com/pytorch/ao/pull/1141)
* Fixing eval.py to use GPTQ_MT for gptq (https://github.com/pytorch/ao/pull/1176)
* [CPU offload optim] Fix when there are non-trainable params (https://github.com/pytorch/ao/pull/1210)
* Fix for weights-only load (https://github.com/pytorch/ao/pull/1228)
* Pin nightlies to deal with std::badalloc (https://github.com/pytorch/ao/pull/1256)
* Fix 2.5.1 failing sparsity test (https://github.com/pytorch/ao/pull/1261)
* Call narrow only for TensorCoreTiledLayout (https://github.com/pytorch/ao/pull/1207)
* Fix an autoquant bug in flatten/unflatten (https://github.com/pytorch/ao/pull/1288)
* Float8 with delayed scaling: fix autocast handling (https://github.com/pytorch/ao/pull/1306)
* Fix bug with float8 training + FSDP2 + TP (https://github.com/pytorch/ao/pull/1327)
* Float8 training: fix bug with AC + compile (https://github.com/pytorch/ao/pull/1329)
* Fix torchtitan + float8 + delayed + compile (https://github.com/pytorch/ao/pull/1334)
* [low-bit optim] Fix edge cases for FSDP2 integration (https://github.com/pytorch/ao/pull/1269)
* [NF4] .to() fixes (https://github.com/pytorch/ao/pull/1312)
* Check scale.ndim before applying t/transpose (https://github.com/pytorch/ao/pull/1339)

Performance

* Swap in faster uint6 bitpacking function (https://github.com/pytorch/ao/pull/1098)
* Implement more efficient pack and unpack uint5 (https://github.com/pytorch/ao/pull/1138)
* Fix 20x slowdown of FP6 kernel due to device properties query (https://github.com/pytorch/ao/pull/1092)

Documentation

* Add a developer guide for exporting to executorch (https://github.com/pytorch/ao/pull/1219)
* Enable AWQ example on CPU (https://github.com/pytorch/ao/pull/1043)
* Add readme doc for experimental (https://github.com/pytorch/ao/pull/1130)
* Move float8 out of prototype in quantization README (https://github.com/pytorch/ao/pull/1166)
* Update torchao api reference and add contributor guide (https://github.com/pytorch/ao/pull/1255)
* Fix pickle.dump missing file argument typo in README (https://github.com/pytorch/ao/pull/1316)
* Update README.md (https://github.com/pytorch/ao/pull/1319)
* Update README.md: Fix bibtex and sglang links (https://github.com/pytorch/ao/pull/1361)
* Add bibtex (https://github.com/pytorch/ao/pull/1177)
* Clarify torchao.float8 PyTorch version support (https://github.com/pytorch/ao/pull/1191)

Developers

* [Tp Test] Fix the placement of the device tensor (https://github.com/pytorch/ao/pull/1054)
* Skip test_fpx_weight_only in fbcode (https://github.com/pytorch/ao/pull/1056)
* Pin pt nightly CPU version (https://github.com/pytorch/ao/pull/1061)
* Unpin CUDA Nightly (https://github.com/pytorch/ao/pull/1064)
* Update smoke test (https://github.com/pytorch/ao/pull/1111)
* Update regression_test.yml (https://github.com/pytorch/ao/pull/1163)
* Add PyTorch 2.5 to regression test (https://github.com/pytorch/ao/pull/1168)
* Fix Bias APIs, re-enable kleidi tests for arm64 (https://github.com/pytorch/ao/pull/1162)
* Create CITATION.cff (https://github.com/pytorch/ao/pull/1178)
* Unpin nightlies (https://github.com/pytorch/ao/pull/1183)
* [experimental] Kleidi - add operator level tests (https://github.com/pytorch/ao/pull/1173)
* Ruff format and lint (https://github.com/pytorch/ao/pull/1226)
* Update pre-commit to match CI/CD (https://github.com/pytorch/ao/pull/1227)
* Fixing pytest skip for only test_floatx.py (https://github.com/pytorch/ao/pull/1251)
* Fixed invalid url in citation section (https://github.com/pytorch/ao/pull/1348)
* Add to safe globals (https://github.com/pytorch/ao/pull/1171)
* Aqt rename1 Layout -> TensorImpl (https://github.com/pytorch/ao/pull/1046)
* Move and rename GranularityType -> Granularity (https://github.com/pytorch/ao/pull/1038)
* Change torchao quantization types from int to size_t and preface vars with "preferred_" (https://github.com/pytorch/ao/pull/1041)
* Shrink hadamard matrices (https://github.com/pytorch/ao/pull/1051)
* Use ExecuTorch prebuilt library in pip package to build custom kernels (https://github.com/pytorch/ao/pull/1059)
* Update base.h unit to unsigned int (https://github.com/pytorch/ao/pull/962)
* Create header for packed weight ops (https://github.com/pytorch/ao/pull/1072)
* Update cmake files (https://github.com/pytorch/ao/pull/1070)
* Create build_wheels_aarch64_linux.yml (https://github.com/pytorch/ao/pull/1083)
* ROCM binary upload (https://github.com/pytorch/ao/pull/1099)
* Create build_wheels_windows.yml (https://github.com/pytorch/ao/pull/1101)
* Use fewer instructions when unpacking uint6s. (https://github.com/pytorch/ao/pull/1109)
* [CI] XPU binary build enable (https://github.com/pytorch/ao/pull/1105)
* Move common ET/Aten op stuff to ops/library.h (https://github.com/pytorch/ao/pull/1116)
* Move bias from kernel to packed_weights (https://github.com/pytorch/ao/pull/1119)
* Update gpu_sparsity kernel benchmarking script (https://github.com/pytorch/ao/pull/1143)
* [ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1142)
* Move files to prototype/sparsity (https://github.com/pytorch/ao/pull/1145)
* C10::nullopt -> std::nullopt (1032) (https://github.com/pytorch/ao/pull/1151)
* [reland][ROCm] use dataclass for fnuz type setting (https://github.com/pytorch/ao/pull/1150)
* Move float8_aten_api to float8_ops (https://github.com/pytorch/ao/pull/1155)
* Initialize model with meta device for generation benchmarking (https://github.com/pytorch/ao/pull/1144)
* Replace torch.empty with torch.zeros (https://github.com/pytorch/ao/pull/1157)
* Update utils.py (https://github.com/pytorch/ao/pull/1186)
* Remove int_scaled_mm's dependency on triton for cpu (https://github.com/pytorch/ao/pull/128)
* at::optional -> std::optional (1170) (https://github.com/pytorch/ao/pull/1212)
* fast_flush kwarg of do_bench is removed (https://github.com/pytorch/ao/pull/1222)
* Remove calibration args from generate.py (https://github.com/pytorch/ao/pull/1258)
* Skip marlin QQQ ops test in fbcode (https://github.com/pytorch/ao/pull/1289)
* Fix Marlin QQQ ops test with unittest (https://github.com/pytorch/ao/pull/1294)
* Fix Failing CI - Update bitsandbytes import (https://github.com/pytorch/ao/pull/1343)
* Remove lm_eval warning (https://github.com/pytorch/ao/pull/1347)
* Refactor Affine Quantized Tensor ([1234](https://github.com/pytorch/ao/pull/1234))
* Move files from quantization/prototype -> prototype/quantization (1187)
* Add TTFT benchmarks + update sparsity benchmarks (https://github.com/pytorch/ao/pull/1140)
* Add "_gemm_input_role" to dunder slots (https://github.com/pytorch/ao/pull/984)
* Add an option to use fp8-all-gather only without fp8 computation. (https://github.com/pytorch/ao/pull/1093)
* Bump version to 0.7 (https://github.com/pytorch/ao/pull/1045)

New Contributors

* Jack-Khuu made their first contribution in https://github.com/pytorch/ao/pull/1031
* keyan made their first contribution in https://github.com/pytorch/ao/pull/1041
* digantdesai made their first contribution in https://github.com/pytorch/ao/pull/997
* EnragedAntelope made their first contribution in https://github.com/pytorch/ao/pull/962
* c4lcut3c made their first contribution in https://github.com/pytorch/ao/pull/1094
* elfisworking made their first contribution in https://github.com/pytorch/ao/pull/1087
* chuanqi129 made their first contribution in https://github.com/pytorch/ao/pull/1105
* p4arth made their first contribution in https://github.com/pytorch/ao/pull/1122
* xuzijian629 made their first contribution in https://github.com/pytorch/ao/pull/1138
* jeffdaily made their first contribution in https://github.com/pytorch/ao/pull/1142
* r-barnes made their first contribution in https://github.com/pytorch/ao/pull/1151
* helunwencser made their first contribution in https://github.com/pytorch/ao/pull/1157
* bertmaher made their first contribution in https://github.com/pytorch/ao/pull/1222
* tibidoh made their first contribution in https://github.com/pytorch/ao/pull/1248
* mandroid6 made their first contribution in https://github.com/pytorch/ao/pull/1250
* HandH1998 made their first contribution in https://github.com/pytorch/ao/pull/1113
* readleyj made their first contribution in https://github.com/pytorch/ao/pull/1316
* 22dimensions made their first contribution in https://github.com/pytorch/ao/pull/1318
* galqiwi made their first contribution in https://github.com/pytorch/ao/pull/1348
* dbyoung18 made their first contribution in https://github.com/pytorch/ao/pull/1324
* sunjiweiswift made their first contribution in https://github.com/pytorch/ao/pull/1259
* merrymercy made their first contribution in https://github.com/pytorch/ao/pull/1361

Full Changelog: https://github.com/pytorch/ao/compare/v0.6.1...v0.7.0-rc1
