Improvements
- Added compile backend support
- Added ATen C++ backend (experimental)
- Faster bit unpacking via a pre-allocated empty tensor
- Added vLLM support
- Refactored `quantize_model()` to operate on model instances
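The bit-unpacking speedup comes from writing results into an output buffer allocated once up front (e.g. an empty tensor) instead of growing the output incrementally. A minimal plain-Python sketch of the idea, assuming a 4-bit, two-values-per-byte layout; the function name and layout are illustrative, not the library's actual kernel:

```python
def unpack_4bit(packed: bytes) -> bytearray:
    # Pre-allocate the full output buffer once (analogous to torch.empty),
    # avoiding per-element reallocation or list growth.
    out = bytearray(2 * len(packed))
    for i, b in enumerate(packed):
        out[2 * i] = b >> 4        # high nibble
        out[2 * i + 1] = b & 0x0F  # low nibble
    return out
```

The same pattern applies on tensors: allocate the destination with the final shape, then fill it with vectorized shift-and-mask operations.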
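The `quantize_model()` refactor means quantization is applied to an already-constructed model instance, mutating its layers in place rather than requiring a dedicated model class. A toy sketch of that calling pattern, with a hypothetical layer type and a naive rounding scheme standing in for the real quantizer:

```python
class Linear:
    # Hypothetical stand-in for a real layer; weights as a list of floats.
    def __init__(self, weight):
        self.weight = weight

def quantize_model(model, scale=0.1):
    # Illustrative only: replace each layer's weights with rounded integer
    # values, mutating the given instance and returning the same object.
    for layer in model:
        layer.weight = [round(w / scale) for w in layer.weight]
    return model

model = [Linear([0.1, 0.2]), Linear([-0.3])]
quantize_model(model)  # quantizes the instance in place
```

The in-place, instance-based interface shown here is an assumption about the API shape; the point is that any pre-loaded model can be passed in directly.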
Supported models
- Llama (Hugging Face + vLLM)
- ViT-CLIP (timm)
Limitations
- The Hugging Face backend only supports single-GPU runtime
- vLLM only supports a single GPU with a single worker
- The compile backend can sometimes cause issues with the async runtime
- PEFT (LoRA, etc.) is not supported