xFasterTransformer

Latest version: v1.6.0

1.6.0

v1.6.0 - Llama3 and Qwen2 series models supported.

Functionality
- Support Llama3 and Qwen2 series models.
- Add INT8 KV cache data type, specified via the `kv_cache_dtype` parameter: `int8`, `fp16` (default) or `fp32` (see the sketch after this list).
- Enable the full BF16 pipeline for more models, including ChatGLM2/3 and yarn-llama.
- Add invokeMLPLLaMA FP16 API.
- Support logits output using the `forward()` API.
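
Taken together, the new KV cache option and logits output can be exercised from the Python wrapper roughly as below. This is a minimal sketch, assuming `AutoModel.from_pretrained` accepts `kv_cache_dtype` directly and that `forward()` takes a token-id tensor and returns logits; the model path and token ids are placeholders.

```python
import torch
import xfastertransformer

MODEL_PATH = "/data/llama-3-8b-xft"  # hypothetical path to converted xFT weights

# Select the INT8 KV cache via the kv_cache_dtype parameter
# (valid values per these notes: "int8", "fp16" (default), "fp32").
model = xfastertransformer.AutoModel.from_pretrained(
    MODEL_PATH,
    dtype="bf16",
    kv_cache_dtype="int8",
)

input_ids = torch.tensor([[1, 15043, 3186]], dtype=torch.int64)  # placeholder token ids

# forward() returns the logits for the given tokens rather than sampled ids.
logits = model.forward(input_ids)
print(logits.shape)
```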

Dependency
- Bump `transformers` to `4.40.0` to support Llama3 models.

Performance

1.5.0

v1.5.0 - Gemma series models supported.

Functionality
- Support Gemma series models, including Gemma and CodeGemma, as well as the DeepSeek model.
- The Llama converter supports converting a quantized Hugging Face model into xFT-format INT8/INT4 model files via the `from_quantized_model='gptq'` parameter (see the sketch after this list).
- Support loading INT4 data weights directly from local files.
- Optimize memory usage during Qwen model conversion, particularly for Qwen 72B.
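
A minimal sketch of the quantized-model conversion described above, assuming the `LlamaConvert` class exposed by the Python wheel and a `convert()` call that takes an input path, an output path, and the `from_quantized_model` parameter named in these notes; the paths are placeholders.

```python
import xfastertransformer

HF_GPTQ_PATH = "/data/llama-2-7b-gptq"  # hypothetical GPTQ-quantized Hugging Face model
XFT_OUT_PATH = "/data/llama-2-7b-xft"   # hypothetical output directory for xFT weights

# Convert the GPTQ checkpoint into xFT-format INT8/INT4 weight files.
xfastertransformer.LlamaConvert().convert(
    HF_GPTQ_PATH,
    XFT_OUT_PATH,
    from_quantized_model="gptq",
)
```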

Dependency
- Bump `transformers` to `4.38.1` to support Gemma models.
- Add `protobuf` to support new behavior in `tokenizer`.

Performance

1.4.6

BUG fix
- Fix numeric overflow when calculating softmax in sampling.
- Fix assert bug when concatenating gate & up weights.

1.4.5

- Add GPU kernel library gpuDNN v0.1 to support Intel Arc GPU series.
- Optimize RoPE performance by reducing repeated sin and cos embedding table data.
- Accelerate KVCache copy by increasing parallelism in self attention.
- Accelerate the addreduce operation for long sequences by transposing the KV cache and tuning communication.

BUG fix
- Fix an incorrect computation that should have been performed in float but was done in integer.
- Fix disordered timeline output.
- Fix runtime issue in Qwen when `seq_length` is larger than 32768.

1.4.0

v1.4.0 - Fully BF16 support in Llama for better performance and serving framework support.

Functionality
- Introduce pure BF16 support for Llama series models; the fully BF16 data path utilizes AMX more effectively when deploying Llama models.
- Add MLServer serving framework support and a demo in the `serving` directory.
- The GCC used to compile release binaries has been updated from GCC 8.5 to GCC 12.
- Introduce a pipeline parallel feature for distributed deployment. Enable it with `cmake .. -DWITH_PIPELINE_PARALLEL=ON` at compile time and use the `XFT_PIPELINE_STAGE` macro to define the number of pipeline parallel stages.
- Deprecate the convert tool scripts in the `tools` directory; it is recommended to use `Convert` from the xfastertransformer Python wheel instead (see the sketch after this list).
- Support loading INT8 data weights directly from local files.
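
A minimal sketch of the wheel-based conversion that replaces the deprecated `tools` scripts; the `LlamaConvert` class name and `convert()` signature follow the project's per-model converter naming but are assumptions here, and the paths are placeholders.

```python
import xfastertransformer

# Convert a Hugging Face checkpoint into xFT-format weight files
# using the converter shipped in the Python wheel.
xfastertransformer.LlamaConvert().convert(
    "/data/llama-2-7b-hf",   # hypothetical Hugging Face model directory
    "/data/llama-2-7b-xft",  # hypothetical output directory for xFT weights
)
```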

Performance
- Update xDNN to release `v1.4.4`.
- Accelerate model weight loading by optimizing the cast operation after loading, gaining up to a 50% speedup.
- Optimize BF16 performance using AMX instructions when batch size <= 8, and add `XFT_USE_AMX_M` to set the threshold of M above which AMX is used instead of AVX512 (default `1`); see the sketch after this list.
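
A minimal sketch of tuning the AMX threshold; treating `XFT_USE_AMX_M` as an environment variable read by the library is an assumption, so it is set before the import here, and the model path is a placeholder.

```python
import os

# Use AMX for BF16 GEMMs only when M >= 4; the release default is 1 (always prefer AMX).
os.environ["XFT_USE_AMX_M"] = "4"

import xfastertransformer

model = xfastertransformer.AutoModel.from_pretrained(
    "/data/llama-2-7b-xft",  # hypothetical path to converted xFT weights
    dtype="bf16",
)
```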

Demo & Benchmark
- Update the `transformers` dependency requirement from `4.30.0` to `4.36.0` due to high-risk CVE vulnerabilities.
- Add a distributed inference benchmark script that supports deployment across platforms.
- Add single-node platform support in the benchmark script.
- Add Yi model web demo.
- Enhance the command-line chat mode in the PyTorch demo.py; enable it with `--chat true`.

BUG fix
- Fix a calculation issue in Qwen models and enhance LogN support for long token sequences.
- Fix unsynchronized results across ranks when `do_sample` is enabled in multi-rank models.
- Fix calculation and conversion issues in Baichuan models.
- Fix repetition penalties not taking effect on other batches.

1.3.1

BUG fix
- Fix issue where the oneCCL environment was still required when running in single-rank mode.
