xFasterTransformer

Latest version: v1.7.0


1.7.0

v1.7.0 - Continuous batching feature supported.

Functionality
- Refactor the framework to support the continuous batching feature. `vllm-xft`, a fork of vLLM, integrates the xFasterTransformer backend and maintains compatibility with most of the official vLLM's features.
- Remove the FP32 data type option for the KV cache.
- Add a `get_env()` Python API to get the recommended `LD_PRELOAD` set (see the sketch after this list).
- Add a GPU build option for the Intel Arc GPU series.
- Expose the interfaces of the LLaMA model, including Attention and the decoder.
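
A minimal usage sketch for the new `get_env()` API, assuming it is exported from the `xfastertransformer` Python module and returns the recommended `LD_PRELOAD` value as a string:

```python
# Hedged sketch: query the recommended LD_PRELOAD set before launching
# a workload. The exact return format is an assumption, not documented here.
import xfastertransformer

ld_preload = xfastertransformer.get_env()
print(ld_preload)
# Then launch with, e.g.:  LD_PRELOAD=<printed value> python your_app.py
```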

Performance

1.6.0

v1.6.0 - Llama3 and Qwen2 series models supported.

Functionality
- Support Llama3 and Qwen2 series models.
- Add an INT8 KV cache data type, selected via the `kv_cache_dtype` parameter; supported values are `int8`, `fp16` (default), and `fp32`.
- Enable the full BF16 pipeline for more models, including ChatGLM2/3 and yarn-llama.
- Add the `invokeMLPLLaMA` FP16 API.
- Support logits output through the `forward()` API (see the sketch after this list).
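
A minimal sketch combining the two items above; `AutoModel.from_pretrained`, the model path, and the exact `forward()` signature are assumptions for illustration, not the documented API:

```python
# Hedged sketch: load an xFT model with the new INT8 KV cache option
# and fetch logits from forward().
import torch
import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/path/to/xft/model",        # converted xFT model directory (illustrative)
    dtype="bf16",                # compute data type
    kv_cache_dtype="int8",       # new in 1.6.0: int8 | fp16 (default) | fp32
)

input_ids = torch.tensor([[1, 2, 3, 4]], dtype=torch.int64)
logits = model.forward(input_ids)  # per the note above, forward() can return logits
```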

Dependency
- Bump `transformers` to `4.40.0` to support Llama3 models.

Performance

1.5.1

- Baichuan series models support the full FP16 pipeline to improve performance.
- More FP16 data type kernels added, including MHA, MLP, YaRN `rotary_embedding`, `rmsnorm`, and `rope`.
- Add a kernel implementation of `crossAttnByHead`.

Dependency
- Bump `torch` to `2.3.0`.

BUG fix
- Fixed a segmentation fault when running with more than 4 ranks.
- Fixed core dump and hang bugs when running across nodes.

1.5.0

v1.5.0 - Gemma series models supported.

Functionality
- Support Gemma series models, including Gemma and CodeGemma, and the DeepSeek model.
- The Llama converter supports converting a quantized Hugging Face model into xFT-format INT8/INT4 model files via the parameter `from_quantized_model='gptq'` (see the sketch after this list).
- Support loading INT4 data weights directly from local files.
- Optimize memory usage during QWen model conversion, particularly for QWen 72B.
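
A hedged sketch of the quantized-model conversion described above; the `LlamaConvert` class name follows the converter naming in the notes, and the paths are placeholders:

```python
# Hedged sketch: convert a GPTQ-quantized Hugging Face Llama checkpoint
# into xFT-format INT8/INT4 model files. Argument names should be
# checked against the installed version.
import xfastertransformer as xft

xft.LlamaConvert().convert(
    "/path/to/hf_gptq_model",     # quantized Hugging Face model dir (illustrative)
    "/path/to/xft_model",         # output dir for xFT-format weights
    from_quantized_model="gptq",  # new in 1.5.0
)
```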

Dependency
- Bump `transformers` to `4.38.1` to support Gemma models.
- Add `protobuf` to support new behavior in `tokenizer`.

Performance

1.4.6

BUG fix
- Fix numeric overflow when calculating softmax in sampling (illustrated below).
- Fix an assert bug when concatenating the gate and up projections.
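
For context, the class of overflow being fixed: a naive softmax exponentiates raw logits and overflows for large values, while the standard max-subtraction rewrite keeps `exp()` in range. This is a generic illustration, not the xFT kernel:

```python
# Generic illustration of softmax overflow and the usual fix.
import numpy as np

def softmax_naive(x):
    e = np.exp(x)                # overflows to inf for large x
    return e / e.sum()           # inf / inf -> nan

def softmax_stable(x):
    e = np.exp(x - np.max(x))    # shift by the max: exp() stays <= 1
    return e / e.sum()

logits = np.array([1000.0, 999.0, 998.0])
print(softmax_naive(logits))     # -> [nan nan nan] (overflow)
print(softmax_stable(logits))    # -> a well-defined distribution
```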

1.4.5

- Add GPU kernel library gpuDNN v0.1 to support Intel Arc GPU series.
- Optimize RoPE performance by reducing repeated sin and cos embedding table data (see the sketch after this list).
- Accelerate KV cache copy by increasing parallelism in self-attention.
- Accelerate the addreduce operation in the long-sequence case by transposing the KV cache and tuning communication.
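
A sketch of the sin/cos-table idea behind the RoPE optimization above: compute the table once per (position, frequency) pair and reuse it across calls, instead of rematerializing it repeatedly. Shapes and pairing layout are illustrative only, not the xFT kernel:

```python
# Illustrative RoPE with a shared, precomputed sin/cos table.
import numpy as np

def build_rope_table(max_pos, rot_dim, base=10000.0):
    inv_freq = 1.0 / base ** (np.arange(0, rot_dim, 2) / rot_dim)
    angles = np.outer(np.arange(max_pos), inv_freq)   # (max_pos, rot_dim/2)
    return np.cos(angles), np.sin(angles)             # computed once, reused

cos_tab, sin_tab = build_rope_table(max_pos=4096, rot_dim=128)

def apply_rope(x, pos, cos_tab, sin_tab):
    # x: (..., rot_dim); rotate each (even, odd) pair by the cached angles.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = cos_tab[pos], sin_tab[pos]
    return np.stack([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1).reshape(x.shape)
```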

BUG fix
- Fix an incorrect computation that should have been done in float but was done in integer.
- Fix disordered timeline output.
- Fix a runtime issue in Qwen when `seq_length` is greater than 32768.
