xFasterTransformer

Latest version: v1.8.2

1.8.2

Performance
- Enable flash attention by default for the `W8A8` dtype to speed up first-token generation (see the sketch below).
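
As a rough way to exercise this path, a model can be loaded with the W8A8 data type through the Python API and timed on a one-token generation. This is a minimal sketch, not taken from the release note: the paths are placeholders, and the lower-case `dtype="w8a8"` spelling is an assumption.

```python
import time

from transformers import AutoTokenizer
import xfastertransformer

MODEL_PATH = "/path/to/xft/converted/model"  # placeholder: xFT-converted weights
TOKEN_PATH = "/path/to/hf/model"             # placeholder: Hugging Face tokenizer files

# Load the model with the W8A8 data type (assumed dtype spelling).
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="w8a8")
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, trust_remote_code=True)

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Rough first-token latency: generate exactly one new token and time it.
start = time.time()
model.generate(input_ids, max_length=input_ids.shape[-1] + 1)
print(f"first token latency: {time.time() - start:.3f} s")
```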

Benchmark
- When the number of ranks is 1, the benchmark runs in single-process mode to avoid the dependency on `mpirun`.
- Support the `SNC-3` platform.

1.8.1

Functionality
- Expose the embedding lookup interface.

Performance
- Optimized the performance of grouped query attention (GQA).
- Sped up key creation for the oneDNN primitive cache.
- Made `[bs][nh][seq][hs]` the default KV cache layout, resulting in better performance.
- Reduced task-split imbalance in self-attention.

1.8.0

v1.8.0 - Continuous batching on a single ARC GPU and AMX_FP16 support.

Highlight
- Continuous batching on a single ARC GPU is supported and can be integrated through `vllm-xft` (see the sketch after this list).
- Introduce Intel AMX instruction support for the `float16` data type.
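
`vllm-xft` is a fork of vLLM, so an offline continuous-batching run would presumably reuse vLLM's standard Python entry points, roughly as sketched below. The model and tokenizer paths are placeholders, and the assumption that the fork keeps vLLM's `LLM` API unchanged is not taken from this release note.

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Assumption: vllm-xft keeps vLLM's LLM API; the model dir holds xFT-converted
# weights and the tokenizer dir holds the original Hugging Face files.
llm = LLM(
    model="/path/to/xft/converted/model",  # placeholder
    tokenizer="/path/to/hf/model",         # placeholder
    trust_remote_code=True,
)

# Requests are batched continuously by the engine.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```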

Models
- Support ChatGLM4 series models.
- Introduce full-path BF16/FP16 support for Qwen series models.

BUG fix
- Fixed a memory leak in the oneDNN primitive cache.
- Fixed the SPR-HBM flat QUAD mode detection issue in the benchmark scripts.
- Fixed a head-split error in distributed grouped-query attention (GQA).
- Fixed an issue with the `invokeAttentionLLaMA` API.

1.7.3

BUG fix
- Fixed SHM `reduceAdd` and RoPE errors when the batch size is large.
- Fixed abnormal usage of the oneDNN primitive cache.

1.7.2

v1.7.2 - Continuous batching feature supports Qwen 1.0 & hybrid data types.

Functionality
- Add continuous batching support for Qwen 1.0 models.
- Enable hybrid data types for the continuous batching feature, including `BF16_FP16, BF16_INT8, BF16_W8A8, BF16_INT4, BF16_NF4, W8A8_INT8, W8A8_INT4, W8A8_NF4` (see the sketch after this list).
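
As a minimal sketch (not taken from this release note), a hybrid data type is requested the same way as a single one, by passing its name to `dtype` when loading the model; the path and the lower-case spelling of the dtype string are assumptions.

```python
import xfastertransformer

MODEL_PATH = "/path/to/xft/converted/model"  # placeholder

# Assumed spelling "bf16_int8": a hybrid data type mixing BF16 and INT8 kernels.
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16_int8")
```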

BUG fix
- Fixed a conversion fault in Baichuan1 models.

1.7.1

v1.7.1 - Continuous batching feature supports ChatGLM2/3.

Functionality
- Add continuous batching support for ChatGLM2/3 models.
- `Qwen2Convert` supports GPTQ-quantized Qwen2 models, such as GPTQ-Int8 and GPTQ-Int4, via the parameter `from_quantized_model="gptq"` (see the sketch after this list).
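
A conversion call for a GPTQ checkpoint might look like the sketch below. The parameter `from_quantized_model="gptq"` comes from this release note; the directory paths are placeholders, and the positional input/output arguments of `convert()` are an assumption.

```python
import xfastertransformer as xft

HF_GPTQ_DIR = "/path/to/Qwen2-GPTQ-Int8"    # placeholder: Hugging Face GPTQ checkpoint
XFT_OUTPUT_DIR = "/path/to/xft/Qwen2-Int8"  # placeholder: destination for converted weights

# Convert a GPTQ-quantized Qwen2 model into xFasterTransformer's format.
xft.Qwen2Convert().convert(HF_GPTQ_DIR, XFT_OUTPUT_DIR, from_quantized_model="gptq")
```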

BUG fix
- Fixed a segmentation fault when running with more than 2 ranks in vllm-xft serving.
