
Latest version: v1.0




- Support models from ModelScope

- Enable Mistral-base-v0.2 (ee40f28)

**Validated Configurations**
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04



- Improve performance on CPU client
- Support batching and submit GPT-J results to MLPerf v4.0

- Support continuous batching and beam search inference (7c2199)
- Improvements for the AVX2 platform (bc5ee16, aa4a8a, 35c6d10)
- Support FFN_fusion for ChatGLM2 (96fadd)
- Enable loading models from ModelScope (ad3d19)
- Extend the supported input token length (eb41b9, e76a58e)
- [BesTLA] Improve RTN quantization accuracy for int4 and int3 (a90aea)
- [BesTLA] New thread pool and hybrid dispatcher (fd19a44)
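
The BesTLA item above refers to round-to-nearest (RTN) quantization for int4 and int3. As a rough illustration of the idea only (this is not the BesTLA kernel; the function names, the symmetric scheme, and the group size are assumptions made for this sketch), a per-group RTN quantizer can be written in plain Python:

```python
def rtn_quantize(weights, bits=4, group_size=4):
    """Symmetric round-to-nearest quantization with one scale per group.

    Illustrative sketch: names and defaults are hypothetical, not BesTLA's API.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for int4, 3 for int3
    qmin = -(2 ** (bits - 1))       # -8 for int4, -4 for int3
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto qmax.
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)                    # round to nearest integer code
            codes.append(max(qmin, min(qmax, q)))   # clamp to the signed range
    return codes, scales


def rtn_dequantize(codes, scales, group_size=4):
    """Reconstruct approximate weights from integer codes and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Calling `rtn_quantize([7.0, -3.0, 0.0, 1.2])` yields integer codes plus one scale for the group; the value `1.2` is reconstructed as `1.0`, which shows the rounding error that accuracy work on RTN tries to reduce.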

- Enable Mixtral 8x7B (9bcb612)
- Enable Mistral-GPTQ (96dc55)
- Implement the YaRN RoPE scaling feature (6c36f54)
- Enable Qwen 1.5 (750b35)
- Support GPTQ & AWQ inference for Qwen v1, v1.5 and Mixtral-8x7B (a129213)
- Support GPTQ for Baichuan2-13B & Falcon 7B & Phi-1.5 (eed9b3)
- Enable Baichuan-7B and refactor Baichuan-13B (8d5fe2d)
- Enable StableLM2-1.6B & StableLM2-Zephyr-1.6B & StableLM-3B (872876)
- Enable ChatGLM3 (94e74d)
- Enable Gemma-2B (e4c5f71)

**Bug Fixing**
- Fix convert_quantized model bug (37d01f3)
- Fix AutoRound accuracy regression (991c35)
- Fix Qwen load error (2309fbb)
- Fix the GGUF convert issue (5293ffa)

**Validated Configurations**
- Python 3.9, 3.10, 3.11
- Ubuntu 22.04



- Contributed GPT-J inference to MLPerf v4.0 submission (mlperf commits)
- Enabled 3-bit low precision inference (ee40f28)

- Optimize layer normalization (98ffee45)
- Update Qwen Python API (51088a)
- Load processed model automatically (662553)
- Support continuous batching in Offline and Server (66cb9f5)
- Support loading models from HF directly (bb80273)
- Support AutoRound (e2d3652)
- Enable OMP in BesTLA (3afae427)
- Enable log with NEURAL_SPEED_VERBOSE (a8d9e7)
- Add YaRN RoPE scaling data structure (8c846d6)
- Improvements targeting Windows (464239)
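
The list above mentions logging controlled by `NEURAL_SPEED_VERBOSE`. The release notes do not say which values the variable accepts, so the sketch below only shows the general pattern of setting an environment variable for a child process; the level `"1"` and the helper name `run_with_verbose` are assumptions, and the child command is a stand-in that echoes the variable rather than a real inference script.

```python
import os
import subprocess
import sys


def run_with_verbose(cmd, level="1"):
    """Run `cmd` with NEURAL_SPEED_VERBOSE set in the child's environment.

    The meaning of `level` is an assumption; check the project docs for
    the values that are actually supported.
    """
    env = dict(os.environ, NEURAL_SPEED_VERBOSE=level)
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Stand-in child process: it only echoes the variable back.
echo_cmd = [sys.executable, "-c",
            "import os; print(os.environ['NEURAL_SPEED_VERBOSE'])"]
result = run_with_verbose(echo_cmd)
```

Passing the variable through `env` keeps the setting scoped to the child process instead of changing the parent's environment.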

- Enable Qwen 1.8B (ea4b713)
- Enable Phi-2, Phi-1.5 and Phi-1 (c212d8)
- Support 3-bit & 4-bit GPTQ for GPT-J 6B (4c9070)
- Support Solar 10.7B with GPTQ (26c68c7, 90f5cbd)
- Support Qwen GGUF inference (cd67b92)

**Bug Fixing**
- Fix performance problem introduced by log level (6833b2f, 6f85518f)
- Fix straightforward-API issues (4c082b7)
- Fix a blocker on Windows platforms (4adc15)
- Fix Whisper Python API (c97dbe)
- Fix Qwen loading & Mistral GPTQ convert (d47984c)
- Fix clang-tidy issues (ad54a1f)
- Fix Mistral online loading issues (0470b1f)
- Handle models that require an HF access token (33ffaf07)
- Fix the GGUF convert issue (5293ffa5)
- Fix GPTQ & AWQ convert issue (150e752)

**Validated Configurations**
- Python 3.10
- Ubuntu 22.04



- Support Q4_0, Q5_0 and Q8_0 GGUF models and AWQ
- Enhance Tensor Parallelism with shared memory across multiple sockets in a single node
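
For readers unfamiliar with the GGUF quantization types named above, the following is a minimal sketch of the Q8_0 block layout as it is commonly described for ggml: 32 values per block, one half-precision scale, and 32 signed 8-bit codes. It is written from that general description, not from Neural Speed's loader; the helper names are hypothetical, and Python's `round` breaks ties to even, which may differ from a C implementation.

```python
import struct

QK8_0 = 32  # values per Q8_0 block


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(values):
    """Split `values` into blocks of 32 and return (scale, codes) pairs."""
    if len(values) % QK8_0 != 0:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for start in range(0, len(values), QK8_0):
        chunk = values[start:start + QK8_0]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                  # scale maps amax onto the int8 range
        inv = 1.0 / d if d else 0.0
        codes = [round(v * inv) for v in chunk]
        blocks.append((to_fp16(d), codes))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate floats: value = scale * code."""
    return [d * q for d, codes in blocks for q in codes]
```

Q4_0 and Q5_0 follow the same block-plus-scale idea with fewer bits per code, which is why they trade more accuracy for a smaller footprint.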

- Rename BesTLA files and their usage (d5c26d4)
- Update Python API and reorganize scripts (40663e)
- Enable AWQ with Llama2 example (9be307f)
- Enable clang-tidy (227e89)
- Support multi-node Tensor Parallelism (TP) (6dbaa0)
- Support accuracy calculation for GPTQ models (7b124aa)
- Enable log with NEURAL_SPEED_VERBOSE (a8d9e7)

- Add Magicoder example (749caca)
- Enable Whisper large example (24b270)
- Add Docker file and Readme (f57d4e1)
- Support multi-batch ChatGLM-V1 inference (c9fb9d)

**Bug Fixing**
- Fix avx512-s8-dequant and asymmetric related bug (fad80b14)
- Fix warmup prompt length and add ns_log_level control (070b6b)
- Fix convert: remove hardcode of AWQ (7729bb)
- Fix the ChatGLM convert issue (7671467)
- Fix BesTLA Windows compile issue (760e5f)

**Validated Configurations**
- Python 3.10
- Ubuntu 22.04



- Created Neural Speed project, spinning off from Intel Extension for Transformers

- Support GPTQ models
- Enable Beam Search post-processing.
- Add MX-Format (FP8_E5M2, FP8_E4M3, FP4_E2M1, NF4)
- Refactor Transformers Extension for Low-bit Inference Runtime based on the latest Jblas
- Support Tensor Parallelism with jblas and shared memory.
- Improve performance on client CPUs.
- Enable streaming LLM for the runtime.
- Enhance QLoRA on CPU with optimized dropout operator.
- Add Script for PPL Evaluation.
- Refine Python API.
- Allow CompileBF16 on GCC11.
- Multi-Round chat with ChatGLM2.
- Shift-RoPE-based Streaming-LLM.
- Enable MHA fusion for LLM.
- Support AVX_VNNI and AVX2
- Optimize QBits backend.
- GELU support
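
The low-bit float formats listed above (FP8_E5M2, FP8_E4M3, FP4_E2M1) are named by their exponent and mantissa widths. As a hedged illustration of what those names mean, the sketch below decodes a raw bit pattern into a float. The exponent biases (15, 7, 1) follow the usual definitions of these formats and are an assumption about what the runtime uses; NaN and infinity encodings are deliberately not handled, and `decode_minifloat` is a hypothetical helper.

```python
def decode_minifloat(bits, exp_bits, man_bits, bias):
    """Decode a sign/exponent/mantissa bit pattern (finite values only)."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    frac = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of (1 - bias).
        return sign * frac * 2.0 ** (1 - bias)
    return sign * (1.0 + frac) * 2.0 ** (exp - bias)


# (exponent bits, mantissa bits, bias); biases are assumed, see lead-in.
FORMATS = {
    "FP8_E5M2": (5, 2, 15),
    "FP8_E4M3": (4, 3, 7),
    "FP4_E2M1": (2, 1, 1),
}

# All non-negative FP4_E2M1 values, in bit-pattern order.
fp4_values = [decode_minifloat(b, *FORMATS["FP4_E2M1"]) for b in range(8)]
```

Listing the eight non-negative FP4_E2M1 values this way makes the coarse, non-uniform spacing of a 4-bit float visible, which is the main trade-off against integer formats of the same width.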

- Enable finetune for Qwen-7b-chat on CPU.
- Enable Whisper C++ API
- Apply the STS task to BAAI/BGE models.
- Enable Qwen graph.
- Enable instruction_tuning Stable Diffusion examples.
- Enable Mistral-7b.
- Enable Falcon-180B
- Enable Baichuan/Baichuan2 example.

**Validated Configurations**
- Python 3.9, 3.10, 3.11
- GCC 13.1, 11.1
- CentOS 8.4 & Ubuntu 20.04 & Windows 10



