🍱 To better support LLM serving through response streaming, we are proud to introduce experimental server-sent events (SSE) streaming support in this release of BentoML `v1.1.4` and OpenLLM `v0.2.27`. See an example [service definition](https://gist.github.com/ssheng/38e59e475f3ac5b0f9299c71f7dc3185) for SSE streaming with Llama2.
- Added response streaming through SSE to the `bentoml.io.Text` IO Descriptor type.
- Added async generator support to both API Server and Runner to `yield` incremental text responses.
- Added native SSE streaming support to ☁️ BentoCloud.
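The async generator pattern behind the new streaming support can be sketched in plain Python. This is a minimal illustration only: `stream_text`, the token list, and the chunking are hypothetical stand-ins, not BentoML internals. In an actual service, a generator like this would back an API endpoint declared with a `bentoml.io.Text` output, and the server would flush each yielded chunk to the client as a separate SSE `data:` event (see the linked gist for a real service definition).

```python
import asyncio
from typing import AsyncGenerator

# Hypothetical token source standing in for an LLM runner's output.
TOKENS = ["Hello", ", ", "world", "!"]

async def stream_text(prompt: str) -> AsyncGenerator[str, None]:
    # Yield incremental chunks instead of one final response; an
    # SSE-capable server sends each chunk to the client as it arrives.
    for token in TOKENS:
        await asyncio.sleep(0)  # simulate per-token latency
        yield token

async def collect(prompt: str) -> str:
    # A real client would render chunks as they stream in;
    # here we simply join them to show the full response.
    return "".join([chunk async for chunk in stream_text(prompt)])

if __name__ == "__main__":
    print(asyncio.run(collect("Hi")))  # prints "Hello, world!"
```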
🦾 OpenLLM added token streaming capabilities to support streaming responses from LLMs.
- Added `/v1/generate_stream` endpoint for streaming responses from LLMs.
```bash
curl -N -X 'POST' 'http://0.0.0.0:3000/v1/generate_stream' -H 'accept: application/json' -H 'Content-Type: application/json' -d '{
  "prompt": " Instruction:\n What is the definition of time (200 words essay)?\n\n Response:",
  "llm_config": {
    "use_llama2_prompt": false,
    "max_new_tokens": 4096,
    "early_stopping": false,
    "num_beams": 1,
    "num_beam_groups": 1,
    "use_cache": true,
    "temperature": 0.89,
    "top_k": 50,
    "top_p": 0.76,
    "typical_p": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "diversity_penalty": 0,
    "repetition_penalty": 1,
    "encoder_repetition_penalty": 1,
    "length_penalty": 1,
    "no_repeat_ngram_size": 0,
    "renormalize_logits": false,
    "remove_invalid_values": false,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "encoder_no_repeat_ngram_size": 0,
    "n": 1,
    "best_of": 1,
    "presence_penalty": 0.5,
    "frequency_penalty": 0,
    "use_beam_search": false,
    "ignore_eos": false
  },
  "adapter_name": null
}'
```
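On the client side, the response body is a stream of events where each `data:` line carries a payload and a blank line terminates the event. A minimal parser for such a stream might look like the sketch below; `parse_sse` is a hypothetical helper (not part of BentoML or OpenLLM), and it handles only `data:` fields, ignoring `event:`, `id:`, and comment lines.

```python
from typing import Iterable, Iterator

def parse_sse(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event.

    Simplified: only `data:` fields are handled; multiple data lines
    within one event are joined with newlines, per the SSE format.
    """
    data: list[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            value = line[5:]
            if value.startswith(" "):  # a single leading space is stripped
                value = value[1:]
            data.append(value)
        elif line == "" and data:
            # A blank line terminates the current event.
            yield "\n".join(data)
            data = []

if __name__ == "__main__":
    raw = ["data: Hello\n", "\n", "data: world\n", "\n"]
    print(list(parse_sse(raw)))  # ['Hello', 'world']
```

Note the `-N` flag in the curl example above: it disables output buffering so each event is printed as soon as it arrives.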
## What's Changed
* docs: Update the models doc by Sherlock113 in https://github.com/bentoml/BentoML/pull/4145
* docs: Add more workflows to the GitHub Actions doc by Sherlock113 in https://github.com/bentoml/BentoML/pull/4146
* docs: Add text embedding example to readme by Sherlock113 in https://github.com/bentoml/BentoML/pull/4151
* fix: bento build cache miss by xianml in https://github.com/bentoml/BentoML/pull/4153
* fix(buildx): parsing attestation on docker desktop by aarnphm in https://github.com/bentoml/BentoML/pull/4155
## New Contributors
* xianml made their first contribution in https://github.com/bentoml/BentoML/pull/4153
**Full Changelog**: https://github.com/bentoml/BentoML/compare/v1.1.3...v1.1.4