Highlights
Features
Examples
Validated Configurations
**Highlights**
- Created the Neural Speed project as a spin-off from Intel Extension for Transformers
**Features**
- Support GPTQ models.
- Enable beam-search post-processing.
- Add MX-format data types (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4).
- Refactor the Transformers Extension for Low-bit Inference Runtime on top of the latest Jblas.
- Support tensor parallelism with Jblas and shared memory.
- Improve performance on client CPUs.
- Enable StreamingLLM for the Runtime.
- Enhance QLoRA on CPU with an optimized dropout operator.
- Add a script for PPL evaluation.
- Refine the Python API (see the sketch after this list).
- Allow CompileBF16 on GCC 11.
- Support multi-round chat with ChatGLM2.
- Add Shift-RoPE-based StreamingLLM.
- Enable MHA fusion for LLMs.
- Support AVX_VNNI and AVX2.
- Optimize the QBits backend.
- Add GELU support.
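
As a quick illustration of the refined Python API, the minimal sketch below loads a model through the transformers-style interface with 4-bit weight-only quantization and generates with streaming output. The model name, prompt, and generation settings are illustrative assumptions, not part of this release.

```python
# Minimal sketch of the refined Python API: transformers-style loading with
# 4-bit weight-only quantization. Model name and prompt are illustrative.
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # assumption: any supported HF causal LM
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)  # prints tokens as they are generated

# load_in_4bit=True routes inference through the low-bit runtime
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```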
**Examples**
- Enable fine-tuning for Qwen-7B-Chat on CPU.
- Enable the Whisper C++ API.
- Apply the STS task to BAAI/BGE models (see the sketch after this list).
- Enable the Qwen graph.
- Enable instruction-tuning Stable Diffusion examples.
- Enable Mistral-7B.
- Enable Falcon-180B.
- Enable Baichuan/Baichuan2 examples.
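
For the STS example, the sketch below scores sentence similarity with a BAAI/BGE embedding model. It uses the generic sentence-transformers API rather than this repository's example script, so the package, checkpoint name, and settings are assumptions for illustration only.

```python
# Minimal STS sketch with a BAAI/BGE embedding model, using the generic
# sentence-transformers API (assumption: not this repo's example script).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # illustrative BGE checkpoint

sentences = ["A man is playing a guitar.", "Someone is strumming a guitar."]
# BGE models recommend normalized embeddings when using cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"STS similarity: {score:.3f}")  # closer to 1.0 means more similar
```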
**Validated Configurations**
- Python 3.9, 3.10, 3.11
- GCC 13.1, 11.1
- CentOS 8.4, Ubuntu 20.04, and Windows 10