## Features
- **server**: reduce VRAM requirements of continuous batching (contributed by njhill)
- **server**: support BLOOMChat-176B (contributed by njhill)
- **server**: add watermarking tests (contributed by ehsanmok)
- **router**: add response schema for compat_generate (contributed by gsaivinay; see the request sketch after this list)
- **router**: use the number of tokens in the batch as the input for dynamic batching (co-authored by njhill)
- **server**: improve model download speed and reduce RAM requirements when converting weights to safetensors
- **server**: optimize flash causal lm decode token
- **server**: shard decode token
- **server**: use CUDA graphs in logits warping
- **server**: support trust_remote_code (sketched below)
- **tests**: add snapshot testing
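
As a sketch of the new compat_generate response schema in action, the request below assumes a server listening on http://127.0.0.1:8080; the compatibility route dispatches to generate or generate_stream depending on the `stream` field. The URL and payload values are illustrative, not part of this release's changes.

```python
import requests

# Assumption: a text-generation-inference server is running locally on port 8080.
URL = "http://127.0.0.1:8080"

# The compatibility route dispatches on the `stream` field:
# stream=False returns a single JSON response from generate,
# stream=True returns server-sent events from generate_stream.
payload = {
    "inputs": "What is deep learning?",
    "parameters": {"max_new_tokens": 20},
    "stream": False,
}

resp = requests.post(URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # generated text in the documented response schema
```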
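
The trust_remote_code option mirrors the flag of the same name in transformers, which executes a model repository's custom modeling code from the Hub. A minimal sketch at the transformers level (the model id is illustrative, and this is not the server's internal loading code):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id: a repository that ships its own modeling code.
model_id = "mosaicml/mpt-7b"

# trust_remote_code=True runs Python code from the model repository,
# so only enable it for sources you trust.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```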
## Fix
- **server**: use float16
- **server**: fix the multinomial implementation in Sampling
- **server**: do not use device_map="auto" on a single GPU (see the sketch below)
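
A minimal sketch of the idea behind the device_map fix, written against plain transformers rather than the server's actual loading code (the model id is illustrative): on a single GPU, loading the weights and moving them to the device directly sidesteps accelerate's dispatch machinery, which only pays off when weights must be sharded across devices.

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "bigscience/bloom-560m"  # illustrative

if torch.cuda.device_count() == 1:
    # Single GPU: load the weights and move them to the device directly;
    # accelerate's dispatch layer adds overhead without benefit here.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda:0")
else:
    # Multiple GPUs: let accelerate place weight shards across devices.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
```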
## Misc
- **docker**: use the NVIDIA base image
## New Contributors
* ehsanmok made their first contribution in https://github.com/huggingface/text-generation-inference/pull/248
* gsaivinay made their first contribution in https://github.com/huggingface/text-generation-inference/pull/292
* xyang16 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/343
* oOraph made their first contribution in https://github.com/huggingface/text-generation-inference/pull/359
**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v0.6.0...v0.7.0