v0.8.0 is our biggest release yet, featuring significant reliability improvements to VoiceAssistant. This update includes a few breaking API changes that will impact the way you build your agents. We strive to minimize breaking changes and will stabilize the API as we approach version 1.0.
Migrating to v0.8.0 (Breaking Changes)
<details>
<summary><h2>Job and Worker</h2></summary>
entrypoint moved from req.accept() to WorkerOptions
Previously the job entrypoint was in the req.accept() method call. Now the job entrypoint has been moved into WorkerOptions.
namespace removed
The WorkerOptions namespace field has been removed and will be replaced in the future.
explicit connection to the room
You now need to call ctx.connect() to initiate the connection to the room. This allows for pre-connect setup (such as callback registrations) to avoid race conditions.
The following shows a minimal_worker.py example:
```python
from livekit.agents import JobContext, JobRequest, WorkerOptions, cli


async def job_entrypoint(ctx: JobContext):
    await ctx.connect()
    ...


if __name__ == "__main__":
    cli.run_app(
        WorkerOptions(entrypoint_fnc=job_entrypoint)
    )
```
</details>
<details>
<summary><h2>LLM</h2></summary>
> 💡 These changes may not be relevant to users of the VoiceAssistant class.
The LLM class has been restructured to enhance ergonomics and improve the function calling experience.
Function/tool calling
Function calling has been completely overhauled in v0.8.0. Most of the changes are additive and can be found in the New Features section.
The primary breaking change is that function calls are now **NOT** automatically invoked when iterating the LLM stream. `LLMStream.execute_functions` needs to be called instead.
TODO: insert code snippet showing some ai_callable fncs
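The change in control flow can be sketched without the library. The block below is a minimal illustration, not the livekit-agents implementation: `FakeLLMStream`, `get_weather`, and `run_turn` are hypothetical stand-ins, and the real `LLMStream.execute_functions` may have a different signature and return type. It only shows that iterating the stream yields text and leaves function calls pending until they are executed explicitly.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, Callable


@dataclass
class FakeLLMStream:
    """Hypothetical stand-in for LLMStream; not the livekit-agents class."""

    chunks: list[str]
    # (function, kwargs) pairs the model requested while streaming
    pending_calls: list[tuple[Callable[..., Any], dict[str, Any]]]
    results: list[Any] = field(default_factory=list)

    def __aiter__(self) -> AsyncIterator[str]:
        return self._iterate()

    async def _iterate(self) -> AsyncIterator[str]:
        # Iterating only yields text; requested functions are NOT invoked here.
        for chunk in self.chunks:
            yield chunk

    def execute_functions(self) -> list[Any]:
        # Invocation is a separate, explicit step.
        self.results = [fnc(**kwargs) for fnc, kwargs in self.pending_calls]
        return self.results


def get_weather(location: str) -> str:
    """Hypothetical AI-callable function."""
    return f"sunny in {location}"


async def run_turn(stream: FakeLLMStream) -> tuple[str, list[Any], list[Any]]:
    text = ""
    async for chunk in stream:
        text += chunk
    results_before = list(stream.results)  # still empty after iteration
    results_after = stream.execute_functions()
    return text, results_before, results_after


if __name__ == "__main__":
    demo = FakeLLMStream(
        chunks=["Let me ", "check."],
        pending_calls=[(get_weather, {"location": "Paris"})],
    )
    print(asyncio.run(run_turn(demo)))
```

Code that previously relied on functions running as a side effect of iteration needs an explicit call like the one in `run_turn` after the stream has been consumed.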
`LLM.chat()` is no longer an async method
Previously, LLM.chat() was an async method that returned an LLMStream (which itself was an AsyncIterable).
We found it simpler and less confusing for LLM.chat() to be synchronous, while still returning the same AsyncIterable LLMStream.
LLM.chat() `history` has been renamed to `chat_ctx`
The argument was renamed to improve consistency and reduce confusion.
TODO: insert code snippet
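The two `chat()` changes can be illustrated with plain Python. This is a sketch under stated assumptions rather than the library's code: `EchoLLM` and `EchoStream` are hypothetical stand-ins, and the chat context is simplified to a list of strings. It shows a synchronous `chat()` that takes a `chat_ctx` keyword and returns an object that is consumed with `async for`.

```python
import asyncio
from typing import AsyncIterator


class EchoStream:
    """Hypothetical stand-in for LLMStream: an AsyncIterable of text chunks."""

    def __init__(self, chat_ctx: list[str]) -> None:
        self._chat_ctx = chat_ctx

    def __aiter__(self) -> AsyncIterator[str]:
        return self._generate()

    async def _generate(self) -> AsyncIterator[str]:
        for message in self._chat_ctx:
            await asyncio.sleep(0)  # yield control, as a network call would
            yield f"echo: {message}"


class EchoLLM:
    """Hypothetical stand-in for LLM."""

    def chat(self, *, chat_ctx: list[str]) -> EchoStream:
        # A plain `def`: the caller does not await chat() itself.
        return EchoStream(chat_ctx)


async def collect(llm: EchoLLM, chat_ctx: list[str]) -> list[str]:
    # Before v0.8.0 this would have been: await llm.chat(history=...)
    stream = llm.chat(chat_ctx=chat_ctx)
    return [chunk async for chunk in stream]


if __name__ == "__main__":
    print(asyncio.run(collect(EchoLLM(), ["hi", "there"])))
```

When migrating, drop the `await` in front of `chat()` and rename the `history` keyword; the `async for` loop over the returned stream stays the same.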
</details>
<details>
<summary><h2>STT</h2></summary>
> 💡 These changes may not be relevant to users of the VoiceAssistant class.
SpeechStream.flush()
Previously, to tell an STT provider that you had sent enough input to generate a response, you could call push_frame(None) to coax the STT into producing a response.
In v0.8.0 that API has been removed and replaced with flush().
SpeechStream.end_input()
`end_input` signals to the STT provider that the input is complete and no additional input will follow. Previously, this was done using aclose(wait=True).
SpeechStream.aclose()
The `wait` arg of aclose() has been removed in favor of SpeechStream.end_input() (see above). Now, if you call SpeechStream.aclose() without first calling SpeechStream.end_input(), the request is cancelled.
```python
stt_stream = my_stt_instance.stream()
async for ev in audio_stream:
    stt_stream.push_frame(ev.frame)

    # optionally flush when enough frames have been pushed
    stt_stream.flush()

stt_stream.end_input()
await stt_stream.aclose()
```
</details>
<details>
<summary><h2>TTS</h2></summary>
> 💡 These changes may not be relevant to users of the VoiceAssistant class.
SynthesizedAudio changed and SynthesisEvent removed
Most of the fields of the SynthesizedAudio dataclass have been changed:
```python
# New SynthesizedAudio dataclass
@dataclass
class SynthesizedAudio:
    request_id: str
    """Request ID (one segment could be made up of multiple requests)"""
    segment_id: str
    """Segment ID, each segment is separated by a flush"""
    frame: rtc.AudioFrame
    """Synthesized audio frame"""
    delta_text: str = ""
    """Current segment of the synthesized audio"""


# Old SynthesizedAudio dataclass
@dataclass
class SynthesizedAudio:
    text: str
    data: rtc.AudioFrame
```
SynthesisEvent has been removed entirely. All occurrences of it have been replaced with SynthesizedAudio.
SynthesizeStream.flush()
Similar to the STT changes, this coaxes the TTS provider into generating a response. The SynthesizedAudio response will have a new segment_id after calls to flush().
SynthesizeStream.end_input()
Similar to the STT changes, this replaces aclose(wait=True).
SynthesizeStream.aclose()
Similar to the STT changes, the wait arg has been removed.
```python
tts_stream = my_tts_instance.stream()

tts_stream.push_text("This is the first sentence")
tts_stream.flush()
tts_stream.push_text("This is the second sentence")
tts_stream.end_input()

await tts_stream.aclose()
```
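On the consuming side, `segment_id` can be used to map output back to the text pushed between flushes. The following is a rough sketch, not the library's behavior: `FakeSynthesizeStream` and `FakeAudio` are hypothetical stand-ins that carry only `segment_id` and `delta_text` (no audio frames), and the `seg-N` ID format is invented for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class FakeAudio:
    """Hypothetical stand-in for SynthesizedAudio (audio frame omitted)."""

    segment_id: str
    delta_text: str


class FakeSynthesizeStream:
    """Hypothetical stand-in for SynthesizeStream."""

    def __init__(self) -> None:
        self._segment = 0
        self._events: list[FakeAudio] = []
        self._input_ended = False

    def push_text(self, text: str) -> None:
        if self._input_ended:
            raise RuntimeError("push_text() called after end_input()")
        self._events.append(FakeAudio(f"seg-{self._segment}", text))

    def flush(self) -> None:
        # Output produced after a flush belongs to a new segment.
        self._segment += 1

    def end_input(self) -> None:
        self._input_ended = True

    def __aiter__(self) -> AsyncIterator[FakeAudio]:
        return self._drain()

    async def _drain(self) -> AsyncIterator[FakeAudio]:
        for event in self._events:
            yield event


async def text_by_segment(stream: FakeSynthesizeStream) -> dict[str, str]:
    # Group the text of each event under the segment it belongs to.
    segments: dict[str, str] = {}
    async for event in stream:
        segments[event.segment_id] = segments.get(event.segment_id, "") + event.delta_text
    return segments


if __name__ == "__main__":
    demo = FakeSynthesizeStream()
    demo.push_text("This is the first sentence")
    demo.flush()
    demo.push_text("This is the second sentence")
    demo.end_input()
    print(asyncio.run(text_by_segment(demo)))
```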
</details>
<details>
<summary><h2>VAD</h2></summary>
flush(), end_input(), aclose()
The same changes made to STT and TTS have also been made to VAD.
```python
vad_stream = my_vad_instance.stream()
async for ev in audio_stream:
    vad_stream.push_frame(ev.frame)

    # optionally flush when enough frames have been pushed
    vad_stream.flush()

vad_stream.end_input()
await vad_stream.aclose()
```
</details>
<details>
<summary><h2>VoiceAssistant</h2></summary>
Much of the VoiceAssistant API remains unchanged, despite significant improvements to functionality and internals. However, there have been changes to the configuration.
Initialization args
- Removed
  - base_volume
  - debug
  - sentence_tokenizer, word_tokenizer, hyphenate_word
- Changed
  - transcription-related options now all fall under the `transcription` arg
```python
class VoiceAssistant(utils.EventEmitter[EventTypes]):
    def __init__(
        self,
        *,
        vad: vad.VAD,
        stt: stt.STT,
        llm: LLM,
        tts: tts.TTS,
        chat_ctx: ChatContext | None = None,
        fnc_ctx: FunctionContext | None = None,
        allow_interruptions: bool = True,
        interrupt_speech_duration: float = 0.6,
        interrupt_min_words: int = 0,
        preemptive_synthesis: bool = True,
        transcription: AssistantTranscriptionOptions = AssistantTranscriptionOptions(),
        will_synthesize_assistant_reply: WillSynthesizeAssistantReply = _default_will_synthesize_assistant_reply,
        plotting: bool = False,
        loop: asyncio.AbstractEventLoop | None = None,
    ) -> None:
        ...
```
</details>
New features
Job and Worker
- New prewarm_fnc in WorkerOptions that can be used to set up agent subprocesses before the agent joins the room. Useful for things like loading model weights.
- New num_idle_processes in WorkerOptions for keeping a process pool available for subsequent agents. This reduces the time it takes agents to join rooms and be ready to participate.
- The health server now listens on 0.0.0.0 by default instead of localhost.
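The idea behind prewarming is to pay the expensive setup cost once per process, before any job is assigned. The sketch below illustrates the pattern in plain Python and makes several assumptions: `FakeProcess`, its `userdata` dict, and `load_model` are hypothetical stand-ins for the example, and the argument actually passed to `prewarm_fnc` by the library may differ.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeProcess:
    """Hypothetical stand-in for the per-process object handed to prewarm."""

    userdata: dict[str, Any] = field(default_factory=dict)
    load_count: int = 0


def load_model() -> dict[str, Any]:
    # Placeholder for expensive work such as reading model weights from disk.
    return {"name": "demo-model"}


def prewarm(proc: FakeProcess) -> None:
    # Runs once when the idle process starts, before any job is assigned.
    proc.userdata["model"] = load_model()
    proc.load_count += 1


async def job_entrypoint(proc: FakeProcess) -> str:
    # Jobs reuse the preloaded model instead of loading it on the hot path.
    model = proc.userdata["model"]
    return f"job using {model['name']}"


if __name__ == "__main__":
    process = FakeProcess()
    prewarm(process)
    print(asyncio.run(job_entrypoint(process)))
```

In the real worker, both functions would presumably be passed through WorkerOptions (`prewarm_fnc` alongside `entrypoint_fnc`) rather than called by hand as in this sketch.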
LLM
- You can now add AI functions at runtime.
- AI functions can now return values and raise exceptions. Return values and exceptions are automatically added to the chat_ctx so the LLM is aware of them.
VAD
- livekit-plugins-silero
  - ONNX Runtime is now used directly, which removes the PyTorch dependency
  - Model weights are included in the Python package itself; you no longer need to download them as a build step
  - The model has been updated to the latest Silero model (V5), which has improved [accuracy](https://github.com/snakers4/silero-vad/issues/2#issuecomment-2195433115)
  - Logic fixes to inference and hidden-state handling, which improve accuracy
TTS
- A new Cartesia plugin has been introduced
- SynthesizeStream now has flush() and end_input() for better control over synchronizing text input with audio output
- SynthesizedAudio now has a segment_id for finer-grained tracking of which audio corresponds to which input text
VoiceAssistant
- Big improvements and bug fixes to interrupt logic
- Bug fixes for duplicated responses
- Bug fixes for stuck responses
RAG
- New livekit-plugins-rag package to help with RAG-related tasks
  - Index builder for creating a searchable index
  - Nearest-neighbor search on indexes, based on Spotify's Annoy library
New Contributors
Thanks to Ocupe, mattherzog, lukasIO, seanmuirhead, PaulLockett, CalinR, cs50victor, vanics, brightsparc, ty-elastic, naman-scogo, eltociear, hauntsaninja, minhpq331, and nbsp for their first contributions to the project!
**Full Detailed Changes**: [https://github.com/livekit/agents/compare/3c340eabfc6fc42bcd88fb08c90c101463cca8f5..596ac9042b3ecbe40c270d035d5da8f25474e569?diff=split&w=](https://github.com/livekit/agents/compare/3c340eabfc6fc42bcd88fb08c90c101463cca8f5..596ac9042b3ecbe40c270d035d5da8f25474e569?diff=split&w=)