Tinygrad

Latest version: v0.8.0

0.8.0

Close to the new limit of 5000 lines at 4981.

Release Highlights

- Real dtype support within kernels!
- New `.schedule()` API to separate the concerns of scheduling and running (see the sketch after this list)
- New lazy.py implementation that doesn't reorder at build time; `GRAPH=1` can be used to debug issues
- 95 TFLOPS FP16->FP32 matmuls on the 7900 XTX
- GPT2 runs (jitted) in 2 ms on an NVIDIA 3090
- Powerful and fast kernel beam search with `BEAM=2`
- GPU/CUDA/HIP backends switched to `gpuctypes`
- New (alpha) multigpu sharding API with `.shard` (also sketched below)
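
To make the schedule/run split concrete, here is a minimal sketch. The `Tensor`-level `.schedule()` entry point, the `run_schedule` helper in `tinygrad.realize`, and the alpha `.shard` API are all assumptions taken from these notes; exact names and import paths may differ between versions.

```python
# Hedged sketch of the v0.8.0 schedule/run split and the alpha sharding API.
# Entry-point names and import paths are assumptions and may move between versions.
from tinygrad.tensor import Tensor

a, b = Tensor.rand(64, 64), Tensor.rand(64, 64)
out = (a @ b).relu()                   # lazy: nothing has executed yet

sched = out.schedule()                 # 1) ask for the kernels (assumed Tensor-level
print(f"{len(sched)} schedule items")  #    entry point) and inspect before running

from tinygrad.realize import run_schedule
run_schedule(sched)                    # 2) actually run them

# Alpha multigpu sharding: split a tensor across devices along an axis.
# The device names are placeholders for whatever your machine exposes.
sharded = Tensor.rand(256, 256).shard(("gpu:0", "gpu:1"), axis=0)

# The kernel beam search is driven by the environment, e.g. `BEAM=2 python3 script.py`.
```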

See the full changelog: https://github.com/tinygrad/tinygrad/compare/v0.7.0...v0.8.0

Join the [Discord](https://discord.gg/beYbxwxVdx)!

0.7.0

Bigger again at 4311 lines :( But tons of new features this time!

Just over 500 commits since `0.6.0`.

Release Highlights

- Windows support has been dropped to focus on Linux and macOS.
  - Some functionality may work on Windows, but no support will be provided; use WSL instead.
- [DiskTensors](/tinygrad/runtime/ops_disk.py): a way to store tensors on disk has been added.
  - This is coupled with functionality in [`state.py`](/tinygrad/nn/state.py), which supports saving/loading safetensors and loading torch weights (see the first sketch after this list).
- Tensor Cores are supported on M1/Apple Silicon and on the 7900 XTX (WMMA).
  - Support on the 7900 XTX requires weights and data to be in float16; full float16 compute support will come in a later release.
  - Tensor Core behaviour/usage is controlled by the `TC` envvar.
- Kernel optimization with nevergrad.
  - This optimizes the shapes going into the kernel, gated by the `KOPT` envvar.
- P2P buffer transfers are supported on *most* AMD GPUs when using a single Python process.
  - This is controlled by the `P2P` envvar.
- LLaMA 2 support.
  - Loading the weights requires bfloat16 support, which is semi-supported by casting them to float16; proper bfloat16 support is tracked in #1290.
  - The LLaMA example now also supports 8-bit quantization using the `--quantize` flag.
- Most MLPerf models have working inference examples. Training these models is currently being worked on.
- Initial multigpu training support.
  - *Slow* multigpu training by copying through host shared memory.
  - Somewhat follows torch's multiprocessing and DistributedDataParallel high-level design.
  - See the [hlb_cifar10.py](/examples/hlb_cifar10.py) example.
- SymbolicShapeTracker and Symbolic JIT.
  - Combined, these allow models with changing shapes, like transformers, to be jitted (see the TinyJit sketch after this list).
  - This means that LLaMA can now be jitted for a massive increase in performance.
  - Be warned that the API for this is very WIP and may change in the future, as may the rest of the tinygrad API.
- [aarch64](/tinygrad/renderer/assembly_arm64.py) and [ptx](/tinygrad/renderer/assembly_ptx.py) assembly backends.
- WebGPU backend; see the [`compile_efficientnet.py`](/examples/compile_efficientnet.py) example.
- Support for torch-like tensor indexing by other tensors (sketched after this list).
- Some more `nn` layers were promoted, namely `Embedding` and various `Conv` layers (`Embedding` is sketched after this list).
- [VITS](/examples/vits.py) and [so-vits-svc](/examples/so_vits_svc.py) examples added.
- Initial documentation work.
  - Quickstart guide: [`/docs/quickstart.md`](/docs/quickstart.md)
  - Environment variable reference: [`/docs/env_vars.md`](/docs/env_vars.md)
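
A short sketch of the save/load flow in `state.py`. `get_state_dict`, `load_state_dict`, `safe_save`, and `safe_load` are the names in `tinygrad/nn/state.py`, but treat the exact signatures here as assumptions for this release.

```python
# Hedged sketch of safetensors round-tripping via tinygrad/nn/state.py.
from tinygrad.tensor import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.state import get_state_dict, load_state_dict, safe_save, safe_load

class TinyNet:
  def __init__(self):
    self.l1, self.l2 = Linear(784, 128), Linear(128, 10)
  def __call__(self, x: Tensor) -> Tensor:
    return self.l2(self.l1(x).relu())

net = TinyNet()
safe_save(get_state_dict(net), "tinynet.safetensors")    # weights -> disk

net2 = TinyNet()
load_state_dict(net2, safe_load("tinynet.safetensors"))  # disk -> weights
```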
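Since the symbolic API is flagged as very WIP above, this next sketch sticks to plain fixed-shape `TinyJit` usage; around this release the decorator lived in `tinygrad.jit`, and jitted functions conventionally return a realized Tensor.

```python
# Hedged sketch of basic TinyJit usage with fixed shapes; the symbolic JIT
# extends this to changing shapes, but that API is WIP and not shown here.
from tinygrad.tensor import Tensor
from tinygrad.jit import TinyJit

@TinyJit
def mul_add(a: Tensor, b: Tensor) -> Tensor:
  return (a * b + a).realize()  # return a realized Tensor from jitted code

# Early calls capture the kernels; later calls replay them with new inputs.
for _ in range(5):
  out = mul_add(Tensor.rand(16, 16), Tensor.rand(16, 16))
```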
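And a sketch of tensor-by-tensor indexing together with the newly promoted `Embedding` layer; the constructor argument order is an assumption.

```python
# Hedged sketch: torch-like fancy indexing and the promoted nn.Embedding.
from tinygrad.tensor import Tensor
from tinygrad.nn import Embedding

t = Tensor.rand(10, 4)
idx = Tensor([0, 3, 3, 7])  # integer-valued index tensor
rows = t[idx]               # gathers rows 0, 3, 3, 7 -> shape (4, 4)

emb = Embedding(100, 8)     # vocab of 100, 8-dim vectors (arg order assumed)
tokens = Tensor([[1, 2, 3, 4]])
vectors = emb(tokens)       # -> shape (1, 4, 8)
```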

And lots of small optimizations all over the codebase.

See the full changelog: https://github.com/tinygrad/tinygrad/compare/v0.6.0...v0.7.0

See the known issues: https://github.com/tinygrad/tinygrad/issues?q=is%3Aissue+is%3Aopen+label%3Abug+sort%3Aupdated-desc

Join the [Discord](https://discord.gg/beYbxwxVdx)!

0.6.0

2516 lines now. Some day I promise a release will make it smaller.
* float16 support, needed for LLaMA (see the sketch after this list)
* Fixed a critical bug in BatchNorm training
* Limited support for multiple GPUs
* ConvNeXt + several MLPerf models in models/
* More torch-like methods in tensor.py
* Big refactor of the codegen into the Linearizer and CStyle
* Removed CompiledBuffer; the LazyBuffer ShapeTracker is used instead
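
A tiny sketch of the float16 support; note that around this release `dtypes` lived in `tinygrad.helpers` (it moved later), so treat the import path as an assumption.

```python
# Hedged float16 sketch; dtypes lived in tinygrad.helpers around v0.6.0.
from tinygrad.tensor import Tensor
from tinygrad.helpers import dtypes

w = Tensor.rand(4, 4).cast(dtypes.float16)  # cast weights down to float16,
assert w.dtype == dtypes.float16            # as LLaMA's weights require
```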

0.5.0

An upsetting 2223 lines of code, but so much great stuff!

* 7 backends: CLANG, CPU, CUDA, GPU, LLVM, METAL, and TORCH
* A TinyJit for speed (decorate your GPU function today)
* Support for a lot of ONNX, including all the models in the backend tests (see the sketch after this list)
* No more MLOP convs, all HLOP (autodiff for convs)
* Improvements to the ShapeTracker and symbolic engine
* 15% faster at running the openpilot model
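
A hedged sketch of the ONNX path; `get_run_onnx` lives in `extra/onnx.py` rather than the installed package, so this assumes a source checkout on `sys.path`, and the model path and input name are placeholders.

```python
# Hedged sketch of running an ONNX model; get_run_onnx lives in extra/onnx.py
# (a source checkout, not the pip package), so treat this as an assumption.
import onnx
from tinygrad.tensor import Tensor
from extra.onnx import get_run_onnx

run_onnx = get_run_onnx(onnx.load("model.onnx"))        # placeholder path
out = run_onnx({"input": Tensor.rand(1, 3, 224, 224)})  # key must match the graph's input name
print(out)                                              # dict of output name -> Tensor
```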

0.4.0

0.3.0

Fairly stable and correct, though still not fast. The hlops/mlops are solid; the llops still need work.

The first automated release, so hopefully it works?
