Text-generation

Latest version: v0.7.0

0.7.0

Features

- **server**: reduce vram requirements of continuous batching (contributed by njhill)
- **server**: Support BLOOMChat-176B (contributed by njhill)
- **server**: add watermarking tests (contributed by ehsanmok)
- **router**: Adding response schema for compat_generate (contributed by gsaivinay)
- **router**: use number of tokens in batch as input for dynamic batching (co-authored by njhill)
- **server**: improve download and decrease conversion to safetensors RAM requirements
- **server**: optimize flash causal lm decode token
- **server**: shard decode token
- **server**: use cuda graph in logits warping
- **server**: support trust_remote_code (see the sketch after this list)
- **tests**: add snapshot testing
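
For context on the trust_remote_code feature: the flag maps onto the same option on the underlying `transformers` loaders. A minimal sketch of what opting in looks like on that path, assuming a hypothetical model id with custom modeling code on the Hub:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id: repositories that ship custom modeling code can only
# be loaded when the caller explicitly opts in with trust_remote_code=True.
model_id = "some-org/model-with-custom-code"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```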

Fix

- **server**: use float16
- **server**: fix multinomial implementation in Sampling (see the sketch after this list)
- **server**: do not use device_map auto on single GPU
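
For the multinomial sampling fix, here is a minimal sketch of how multinomial next-token sampling is typically done with `torch.multinomial` over already-warped logits; it illustrates the technique, not the repository's exact implementation:

```python
from typing import Optional

import torch

def sample_next_token(
    logits: torch.Tensor, generator: Optional[torch.Generator] = None
) -> torch.Tensor:
    # logits: [batch_size, vocab_size], already passed through temperature /
    # top-k / top-p warpers.
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Draw one token id per sequence from the categorical distribution.
    return torch.multinomial(probs, num_samples=1, generator=generator)
```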

Misc

- **docker**: use nvidia base image

New Contributors
* ehsanmok made their first contribution in https://github.com/huggingface/text-generation-inference/pull/248
* gsaivinay made their first contribution in https://github.com/huggingface/text-generation-inference/pull/292
* xyang16 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/343
* oOraph made their first contribution in https://github.com/huggingface/text-generation-inference/pull/359

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v0.6.0...v0.7.0

0.6.0

Features

- **server**: flash attention past key values optimization (contributed by njhill)
- **router**: remove requests when client closes the connection (co-authored by njhill)
- **server**: support quantization for flash models
- **router**: add info route
- **server**: optimize token decode
- **server**: support flash sharded santacoder
- **security**: image signing with cosign
- **security**: image analysis with trivy
- **docker**: improve image size

Fix

- **server**: check cuda capability before importing flash attention (see the sketch after this list)
- **server**: fix hf_transfer issue with private repositories
- **router**: add auth token for private tokenizers
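
The CUDA-capability fix amounts to guarding the flash-attention import behind a device check. A hedged sketch of the pattern; the `flash_attn` package name and the compute-capability threshold used here are assumptions, not taken from the repository:

```python
import torch

FLASH_ATTENTION_AVAILABLE = False
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    # Assumed threshold: flash attention kernels require a recent GPU architecture.
    if (major, minor) >= (7, 5):
        try:
            import flash_attn  # noqa: F401
            FLASH_ATTENTION_AVAILABLE = True
        except ImportError:
            pass
```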

Misc

- **rust**: update to 1.69

0.5.0

Features

- **server**: add flash-attention based version of Llama
- **server**: add flash-attention based version of Santacoder
- **server**: support OPT models
- **router**: make router input validation optional
- **docker**: improve layer caching

Fix

- **server**: improve token streaming decoding (see the sketch after this list)
- **server**: fix escape characters in stop sequences
- **router**: fix NCCL desync issues
- **router**: use buckets for metrics histograms
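
The streaming-decode improvement addresses a general problem: decoding one token id at a time can split multi-byte characters. The usual remedy is to re-decode the running sequence and emit only the newly stable suffix. A rough sketch of that idea (any tokenizer works for the illustration), not the repository's exact code:

```python
from typing import Iterator, List

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

def stream_text(token_ids: List[int]) -> Iterator[str]:
    """Yield text incrementally while consuming token ids one at a time."""
    previous_text = ""
    ids: List[int] = []
    for token_id in token_ids:
        ids.append(token_id)
        text = tokenizer.decode(ids, skip_special_tokens=True)
        # Re-decoding the full prefix each step avoids splitting multi-byte
        # characters; only emit the part that was not produced before, and
        # hold back output that still ends in an incomplete byte sequence.
        if len(text) > len(previous_text) and not text.endswith("\ufffd"):
            yield text[len(previous_text):]
            previous_text = text
```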

0.4.3

Fix

- **router**: fix OTLP distributed tracing initialization

0.4.2

Features

- **benchmark**: tui based benchmarking tool
- **router**: Clear cache on error
- **server**: Add mypy-protobuf
- **server**: reduce mlp and attn in one op for flash neox
- **image**: aws sagemaker compatible image

Fix

- **server**: avoid try/except to determine the kind of AutoModel (see the sketch after this list)
- **server**: fix flash neox rotary embedding
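
The AutoModel fix replaces control flow by exception with an explicit inspection of the model config. A minimal sketch of the config-based approach using `AutoConfig`; the two-way "causal vs. seq2seq" split here is illustrative, not the server's actual dispatch table:

```python
from transformers import AutoConfig

def get_model_kind(model_id: str) -> str:
    # Read only the config instead of attempting a full model load inside try/except.
    config = AutoConfig.from_pretrained(model_id)
    if config.is_encoder_decoder:
        return "seq2seq"
    return "causal"
```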

0.4.1

Features

- **server**: New faster GPTNeoX implementation based on flash attention

Fix

- **server**: fix input-length discrepancy between Rust and Python tokenizers (see the sketch below)
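
The tokenizer discrepancy is easiest to reason about by comparing the two code paths directly. A small sketch that counts input tokens with both the Rust-backed `tokenizers` API and the `transformers` API so any length mismatch becomes visible; the model id is chosen only for illustration:

```python
from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_id = "gpt2"  # illustrative model id
prompt = "Hello, world!"

rust_tokenizer = Tokenizer.from_pretrained(model_id)
python_tokenizer = AutoTokenizer.from_pretrained(model_id)

rust_len = len(rust_tokenizer.encode(prompt).ids)
python_len = len(python_tokenizer(prompt)["input_ids"])

# The two counts should agree once special-token handling matches on both sides.
print(rust_len, python_len)
```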
