Bigger again at 4311 lines :( But tons of new features this time!
Just over 500 commits since `0.6.0`.
Release Highlights
- Windows support has been dropped to focus on Linux and macOS.
- Some functionality may work on Windows, but no support will be provided; use WSL instead.
- [DiskTensors](/tinygrad/runtime/ops_disk.py): a way to store tensors on disk has been added.
- This is coupled with functionality in [`state.py`](/tinygrad/nn/state.py), which supports saving/loading safetensors and loading torch weights; a save/load sketch follows the highlights list.
- Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
- Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
- Tensor Core behaviour/usage is controlled by the `TC` envvar.
- Kernel optimization with nevergrad.
- This optimizes the shapes going into the kernel, gated by the `KOPT` envvar.
- P2P buffer transfers are supported on *most* AMD GPUs when using a single Python process.
- This is controlled by the `P2P` envvar.
- LLaMA 2 support.
- Loading the weights requires bfloat16 support, which is semi-supported by casting them to float16; proper bfloat16 support is tracked in #1290.
- The LLaMA example now also supports 8-bit quantization using the flag `--quantize`.
- Most MLPerf models have working inference examples. Training these models is currently being worked on.
- Initial multigpu training support.
- Training is currently *slow*, as data is copied between GPUs through host shared memory.
- It roughly follows the high-level design of torch's multiprocessing and DistributedDataParallel.
- See the [hlb_cifar10.py](/examples/hlb_cifar10.py) example.
- SymbolicShapeTracker and Symbolic JIT.
- Combined, these allow models with changing shapes, like transformers, to be jitted.
- This means that LLaMA can now be jitted for a massive increase in performance.
- Be warned that the API for this is very much a work in progress and may change, as may the rest of the tinygrad API; a minimal JIT example appears below the list.
- [aarch64](/tinygrad/renderer/assembly_arm64.py) and [ptx](/tinygrad/renderer/assembly_ptx.py) assembly backends.
- WebGPU backend; see the [`compile_efficientnet.py`](/examples/compile_efficientnet.py) example.
- Support for torch-like indexing of tensors by other tensors; see the example below the list.
- Some more `nn` layers were promoted, namely `Embedding` and various `Conv` layers; a short example follows the list.
- [VITS](/examples/vits.py) and [so-vits-svc](/examples/so_vits_svc.py) examples added.
- Initial documentation work.
- Quickstart guide: [`/docs/quickstart.md`](/docs/quickstart.md)
- Environment variable reference: [`/docs/env_vars.md`](/docs/env_vars.md)
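
Below are a few short, hedged sketches of the APIs mentioned above. First, the safetensors round trip via [`state.py`](/tinygrad/nn/state.py); `TinyNet` here is a made-up two-layer model for illustration, while `get_state_dict`, `safe_save`, `safe_load`, and `load_state_dict` are the helpers in `tinygrad/nn/state.py`:

```python
# sketch: safetensors round trip with tinygrad/nn/state.py
# TinyNet is a hypothetical model, used only for illustration
from tinygrad.tensor import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.state import get_state_dict, load_state_dict, safe_save, safe_load

class TinyNet:
  def __init__(self):
    self.l1 = Linear(784, 128)
    self.l2 = Linear(128, 10)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).relu())

net = TinyNet()
safe_save(get_state_dict(net), "model.safetensors")   # gather params, write safetensors
load_state_dict(net, safe_load("model.safetensors"))  # read them back into the model
```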
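The JIT is used by decorating a function that takes and returns Tensors. This is a minimal fixed-shape sketch (assuming `TinyJit` lives in `tinygrad.jit` as of this release, and that jitted functions must return realized tensors):

```python
# sketch: jitting a fixed-shape function with TinyJit
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit

@TinyJit
def step(x: Tensor) -> Tensor:
  # after warm-up calls, the kernels here are captured and replayed
  # instead of being re-lowered on every invocation
  return (x @ x).relu().realize()  # jitted functions return realized tensors

for _ in range(5):
  out = step(Tensor.randn(64, 64))
print(out.numpy().shape)
```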
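Tensor-by-tensor indexing works much like torch's fancy indexing; a small sketch (semantics for more advanced cases may differ):

```python
# sketch: indexing a Tensor with another Tensor, as in torch
from tinygrad.tensor import Tensor

t = Tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
idx = Tensor([2, 0])
print(t[idx].numpy())  # selects rows 2 and 0
```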
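And the newly promoted `nn` layers; the constructor arguments shown follow the `(vocab_size, embed_size)` and `(in_channels, out_channels, kernel_size)` conventions, which is my reading of the current signatures:

```python
# sketch: the promoted Embedding and Conv2d layers in tinygrad.nn
from tinygrad.tensor import Tensor
from tinygrad.nn import Conv2d, Embedding

emb = Embedding(1000, 32)               # vocab of 1000, 32-dim embeddings
tok = emb(Tensor([[1, 2, 3]]))          # index tensor in -> (1, 3, 32) out
conv = Conv2d(3, 16, kernel_size=3)     # 3 input channels, 16 filters, 3x3 kernel
img = conv(Tensor.randn(1, 3, 32, 32))  # -> (1, 16, 30, 30) with no padding
print(tok.shape, img.shape)
```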
And lots of small optimizations all over the codebase.
See the full changelog: https://github.com/tinygrad/tinygrad/compare/v0.6.0...v0.7.0
See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc
Join the [Discord](https://discord.gg/beYbxwxVdx)!