Highlights
Features
Examples
Validated Configurations
**Highlights**
- Created the Neural Speed project as a spin-off from Intel Extension for Transformers
**Features**
- Support GPTQ models.
- Enable beam-search post-processing.
- Add MX-format data types (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4).
- Refactor the Transformers Extension for Low-bit Inference Runtime on top of the latest Jblas.
- Support tensor parallelism with Jblas and shared memory.
- Improve performance on client CPUs.
- Enable StreamingLLM for the Runtime.
- Enhance QLoRA on CPU with an optimized dropout operator.
- Add a script for PPL evaluation.
- Refine the Python API (see the sketch after this list).
- Allow CompileBF16 on GCC 11.
- Support multi-round chat with ChatGLM2.
- Add Shift-RoPE-based StreamingLLM.
- Enable MHA fusion for LLMs.
- Support AVX_VNNI and AVX2.
- Optimize the QBits backend.
- Add GELU support.
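
As a quick illustration of the refined Python API, the minimal sketch below loads a model through the transformers-style interface with 4-bit weight-only quantization and generates with streaming output. The model name, prompt, and generation settings are illustrative assumptions, not part of this release.

```python
# Minimal sketch of the refined Python API: transformers-style loading with
# 4-bit weight-only quantization. Model name and prompt are illustrative.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # assumption: any supported HF causal LM
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# load_in_4bit=True routes inference through the low-bit runtime
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```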
**Examples**
- Enable fine-tuning for Qwen-7B-Chat on CPU.
- Enable the Whisper C++ API.
- Apply the STS task to BAAI/BGE models (see the sketch after this list).
- Enable the Qwen graph.
- Enable instruction-tuning Stable Diffusion examples.
- Enable Mistral-7B.
- Enable Falcon-180B.
- Enable Baichuan/Baichuan2 examples.
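
For the STS example, the sketch below scores sentence similarity with a BAAI/BGE embedding model. It uses the generic sentence-transformers API rather than this repository's example script, so the package, checkpoint name, and settings are assumptions for illustration only.

```python
# Minimal STS sketch with a BAAI/BGE embedding model, using the generic
# sentence-transformers API (assumption: not this repo's example script).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # illustrative BGE checkpoint

sentences = ["A man is playing a guitar.", "Someone is strumming a guitar."]
# BGE models recommend normalized embeddings when using cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"STS similarity: {score:.3f}")  # closer to 1.0 means more similar
```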
**Validated Configurations**
- Python 3.9, 3.10, 3.11
- GCC 13.1, 11.1
- CentOS 8.4, Ubuntu 20.04, and Windows 10