| | awq-uint4 | 43.59 | 194.93 | 7.31 | 4.47 |
| | int4wo-hqq | 209.19 | 804.32 | 4.89 | 3.84 |
| | int4wo-64 | 201.14 | 751.42 | 4.87 | 3.74 |
Usage:
```python
import torch
from torchao.prototype.awq import insert_awq_observer_, awq_uintx, AWQObservedLinear
from torchao.quantization import quantize_

quant_dtype = torch.uint4
group_size = 64
calibration_limit = 10
calibration_seq_length = 1024

# Insert observers to record activation statistics during calibration
model = model.to(device)
insert_awq_observer_(model, calibration_limit, calibration_seq_length, quant_dtype=quant_dtype, group_size=group_size)

# Run calibration data through the model
with torch.no_grad():
    for batch in calibration_data:
        model(batch.to(device))

# Quantize the observed linear layers
is_observed_linear = lambda m, fqn: isinstance(m, AWQObservedLinear)
quantize_(model, awq_uintx(quant_dtype=quant_dtype, group_size=group_size), is_observed_linear)
```
New Features
- [Prototype] Added Float8 support for AQT tensor parallel (1003)
- Added composable QAT quantizer (938)
- Introduced torchchat quantizer (897)
- Added INT8 mixed-precision training (748)
- Implemented sparse marlin AQT layout (621)
- Added a PerTensor static quant api (787)
- Added uintx quantization to the generate and eval scripts (811)
- Added Float8 weight-only quantization and Float8 dynamic activation + Float8 weight quantization (740); see the example after this list
- Implemented Auto-Round support (581)
- Added 2, 3, 4, 5 bit custom ops (828)
- Introduced symmetric quantization with no clipping error in the tensor subclass based API (845)
- Added int4 weight-only embedding QAT (947)
- Added support for 1-bit and 6-bit quantization for Llama in torchchat (910, 1007)
- Added a linear_observer class for doing static activation calibration (807)
- Exposed hqq through the uintx_weight_only API (786); see the example after this list
- Added RowWise scaling option for Float8 dynamic activation quantization (819)
- Added Float8 weight only to autoquant api (866)
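As a rough illustration of two of the new weight-only options above (740, 786), the sketch below applies Float8 weight-only quantization and HQQ-based uint4 weight-only quantization via `quantize_`. This is a minimal sketch, not a definitive recipe: the exact import paths and argument names (e.g. `use_hqq`) reflect the APIs as of this release and may differ in other versions.
```python
# Illustrative sketch only; assumes torchao v0.6.x-era APIs (float8_weight_only,
# uintx_weight_only with a use_hqq flag). Verify against your installed version.
import copy
import torch
from torchao.quantization import quantize_, float8_weight_only, uintx_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).cuda()

# Float8 weight-only quantization (740)
m_fp8 = copy.deepcopy(model)
quantize_(m_fp8, float8_weight_only())

# uint4 weight-only quantization backed by HQQ (786)
m_hqq = copy.deepcopy(model)
quantize_(m_hqq, uintx_weight_only(torch.uint4, group_size=64, use_hqq=True))
```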
Improvements
- Enhanced Auto-Round functionality (870)
- Improved FSDP support for low-bit optimizers (538)
- Added support for using AffineQuantizedTensor with `weights_only=True` for torch.load (630)
- Optimized 3-bit packing (1029)
- Added more evaluation metrics to llama/eval.sh (934)
- Improved eager numerics for dynamic scales in float8 (904)
Bug Fixes
- Fixed inference_mode issues (885)
- Fixed failing FP6 benchmark (931)
- Resolved various issues with float8 support (918, 923)
- Fixed loading the state dict onto a different device for low-bit optimizers (1021)
Performance
- Added SM75 (Turing) support for FP6 kernel (942)
- Implemented int8 dynamic quant + bsr support (821)
- Added a workaround to recover performance for quantized ViT under torch.compile (926)
INT8 Mixed-Precision Training
On NVIDIA GPUs, INT8 Tensor Cores are approximately 2x faster than their BF16/FP16 counterparts. In mixed-precision training, we can dynamically down-cast activations and weights to INT8 to leverage these faster matmuls. However, since INT8 has a very limited range [-128, 127], we perform row-wise quantization, similar to how INT8 post-training quantization (PTQ) is done. The weights themselves remain stored in the original precision; they are only down-cast on the fly for the matmuls.
```python
from torchao.prototype.quantized_training import int8_mixed_precision_training, Int8MixedPrecisionTrainingConfig
from torchao.quantization import quantize_

model = ...

# Apply INT8 matmul to all 3 matmuls (output, grad_input, grad_weight)
quantize_(model, int8_mixed_precision_training())

# Alternatively, customize which matmuls are left in the original precision
config = Int8MixedPrecisionTrainingConfig(
    output=True,
    grad_input=True,
    grad_weight=False,
)
quantize_(model, int8_mixed_precision_training(config))
```
**End-to-end speed benchmark** using `benchmarks/quantized_training/pretrain_llama2.py`:
| Model & GPU | bs x seq_len | Config | Tok/s | Peak mem (GB) |
|-----|-----|-----|-----|-----|
| Llama2-7B, A100 | 8 x 2048 | BF16 (baseline) | ~4400 | 59.69 |
| Llama2-7B, A100 | 8 x 2048 | INT8 mixed-precision | ~6100 (**+39%**) | 58.28 |
| Llama2-1B, 4090 | 16 x 2048 | BF16 (baseline) | ~17,900 | 18.23 |
| Llama2-1B, 4090 | 16 x 2048 | INT8 mixed-precision | ~30,700 (**+72%**) | 18.34 |
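For intuition, the row-wise dynamic quantization described above can be sketched as follows. This is illustrative only: the helper `int8_rowwise_quantize` is made up for this sketch and is not torchao's kernel, and the matmul is emulated in float for portability (on CUDA, the actual work is dispatched to INT8 Tensor Core matmuls). It shows how each row gets its own scale so the limited INT8 range is used effectively.
```python
# Illustrative sketch of row-wise INT8 quantization; not the torchao implementation.
import torch

def int8_rowwise_quantize(x: torch.Tensor):
    # One scale per row: map each row's absolute maximum onto 127
    scale = x.abs().amax(dim=-1, keepdim=True) / 127
    scale = scale.clamp(min=1e-12)  # avoid division by zero for all-zero rows
    x_int8 = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_int8, scale

a, b = torch.randn(8, 64), torch.randn(16, 64)
a_int8, a_scale = int8_rowwise_quantize(a)
b_int8, b_scale = int8_rowwise_quantize(b)

# Integer matmul (emulated in float here), then fold the per-row scales back in
out = (a_int8.float() @ b_int8.float().T) * a_scale * b_scale.T
print((out - a @ b.T).abs().max())  # quantization error is small relative to the outputs
```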
Docs
- Updated README with more current float8 speedup information (816)
- Added tutorial for trainable tensor subclass (908)
- Improved documentation for float8 unification and inference (895, 896)
Devs
- Added compile tests to test suite (906)
- Improved CI setup and build processes (887)
- Added M1 wheel support (822)
- Added more benchmarking and profiling tools (1017)
- Renamed `fpx` to `floatx` (877)
- Removed torchao_nightly package (661)
- Added more lint fixes (827)
- Added better subclass testing support (839)
- Added CI to catch syntax errors (861)
- Added tutorial on composing quantized subclass w/ Dtensor based TP (785)
Security
No significant security updates in this release.
Uncategorized
- Added basic SAM2 AutomaticMaskGeneration example server (1039)
New Contributors
* iseeyuan made their first contribution in https://github.com/pytorch/ao/pull/805
* YihengBrianWu made their first contribution in https://github.com/pytorch/ao/pull/860
* kshitij12345 made their first contribution in https://github.com/pytorch/ao/pull/863
* ZainRizvi made their first contribution in https://github.com/pytorch/ao/pull/887
* alexsamardzic made their first contribution in https://github.com/pytorch/ao/pull/899
* vaishnavi17 made their first contribution in https://github.com/pytorch/ao/pull/911
* tobiasvanderwerff made their first contribution in https://github.com/pytorch/ao/pull/931
* kwen2501 made their first contribution in https://github.com/pytorch/ao/pull/937
* y-sq made their first contribution in https://github.com/pytorch/ao/pull/912
* jimexist made their first contribution in https://github.com/pytorch/ao/pull/969
* danielpatrickhug made their first contribution in https://github.com/pytorch/ao/pull/914
* ramreddymounica made their first contribution in https://github.com/pytorch/ao/pull/1007
* yushangdi made their first contribution in https://github.com/pytorch/ao/pull/1006
* ringohoffman made their first contribution in https://github.com/pytorch/ao/pull/1023
**Full Changelog**: https://github.com/pytorch/ao/compare/v0.5.0...v0.6.1