* Optimize context computing (GEMM) for the Metal backend.
* Support a repetition penalty option for generation (see the sketch after this list).
* Update the Dockerfiles for the CPU & CUDA backends with full functionality, hosted on GHCR.
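For reference, below is a minimal sketch of the widely used CTRL-style repetition penalty, which dampens the logits of tokens that have already been generated. It illustrates the general technique only; the exact formula and option name used in chatglm.cpp may differ.

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: list[int], penalty: float) -> np.ndarray:
    """CTRL-style repetition penalty over a vocabulary logit vector.

    For every token already generated, a positive logit is divided by
    `penalty` and a negative logit is multiplied by it, so penalty > 1.0
    always makes repeated tokens less likely to be sampled again.
    """
    out = logits.copy()
    for token_id in set(generated_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out
```

With `penalty = 1.0` the logits are returned unchanged, which is the conventional "disabled" setting.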
0.2.4
* Python binding enhancement: support load-and-convert directly from original Hugging Face models, so intermediate GGML model files are no longer necessary (see the example after this list).
* Small fix for the CLI demo on Windows.
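A hedged usage example of the load-and-convert path: the `dtype` keyword and the plain-string chat history below are assumptions based on typical usage, so check the Python binding's documentation for the exact signatures.

```python
import chatglm_cpp

# Load-and-convert straight from the original Hugging Face repo id; no
# intermediate GGML file needs to be produced by hand first.
# NOTE: the `dtype` keyword (target quantization type) and the history
# format below are assumptions -- verify against the binding's actual API.
pipeline = chatglm_cpp.Pipeline("THUDM/chatglm2-6b", dtype="q4_0")
print(pipeline.chat(["Hello"]))
```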
0.2.3
* Windows support: enable AVX/AVX2 for better performance, fix stdout encoding issues, and support the Python binding on Windows.
* API server: support LangChain integration & an OpenAI-API-compatible server (see the example after this list).
* New model: support CodeGeeX2 model inference in native C++ & the Python binding.
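Because the server speaks the OpenAI chat-completions protocol, any standard OpenAI client can be pointed at it. A minimal sketch, assuming the server listens on 127.0.0.1:8000; the host, port, and model name here are placeholders, not values taken from the project.

```python
from openai import OpenAI

# Point a standard OpenAI client at the local chatglm.cpp API server.
# Base URL and model name are illustrative assumptions -- adjust to match
# how the server was launched; the API key is typically ignored locally.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="chatglm2-6b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```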
0.2.2
* Support MPS (Metal Performance Shaders) backend on Apple silicon devices for ChatGLM2.
* Support Volta, Turing and Ampere CUDA architectures.
0.2.1
* 3x speedup for the CUDA implementation.
* Increase scratch size to accommodate up to 2k context.
0.2.0
First release:
* Accelerated CPU inference for ChatGLM-6B and ChatGLM2-6B, enabling real-time chatting on a MacBook.
* Support int4/int5/int8 quantization, KV cache, efficient sampling, parallel computing, and streaming generation (see the sketch after this list).
* Python binding, web demo, and more possibilities.
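A hedged sketch of streaming generation through the Python binding, printing pieces of the reply as they are produced; the `stream=True` keyword and the list-of-strings history are assumptions about the binding's API rather than confirmed signatures.

```python
import chatglm_cpp

pipeline = chatglm_cpp.Pipeline("./chatglm-ggml.bin")  # path to a converted GGML model

# Stream pieces of the reply as they are generated instead of waiting for
# the full response. NOTE: `stream=True` and the plain-string history are
# assumptions -- consult the binding for the exact streaming entry point.
for piece in pipeline.chat(["Hello"], stream=True):
    print(piece, end="", flush=True)
print()
```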