- major additions to the algorithm:
- we completely unload the transcription model from the GPU after the first main transcription
- we then load the synthesis model into freshly cleaned VRAM and let it take as much VRAM as it wants, since synthesis is our bottleneck
- after the first synthesis we lazy-load the transcription model AGAIN
- we can then transcribe the synthesis and verify it by measuring text distance (Levenshtein and Jaro-Winkler)
- and we can detect whether the model generated hallucinations using the transcription's word timestamps
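The unload/reload dance above can be sketched as a small context-manager pattern. The loader functions here are stand-ins for the real transcription/synthesis models (not the actual APIs), and the commented-out `torch.cuda.empty_cache()` is what a PyTorch-based pipeline would add:

```python
import gc
from contextlib import contextmanager

@contextmanager
def exclusive_vram(load_fn):
    """Load a model, hand it out, then drop every reference so the
    next model starts against a freshly cleaned VRAM budget."""
    model = load_fn()
    try:
        yield model
    finally:
        del model
        gc.collect()
        # with real PyTorch models you would additionally call:
        # torch.cuda.empty_cache()

# Stand-in loaders; the real pipeline loads the actual transcription
# and synthesis models here.
def load_transcriber():
    return lambda audio: f"text<{audio}>"

def load_synthesizer():
    return lambda text: f"audio<{text}>"

with exclusive_vram(load_transcriber) as transcribe:
    text = transcribe("input.wav")    # first main transcription
with exclusive_vram(load_synthesizer) as synthesize:
    speech = synthesize(text)         # synthesis gets the whole VRAM budget
with exclusive_vram(load_transcriber) as transcribe:
    check = transcribe(speech)        # lazy reload for verification
```

The point of the pattern is that only one heavyweight model holds VRAM at any time; the order of the three `with` blocks mirrors the steps above.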
So with this we have
=> a massive speed gain (~5x)
=> way lower VRAM usage (the huge transcription model gets removed from VRAM; we also unload the translation model if one is used)
=> way more solid synthesis via verification (hallucinations and strange artifacts are reduced by retrying the synthesis)
We can now voiceturn a 20-min video on 8 GB of VRAM in ~33 min
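A minimal sketch of the verification step described above: re-transcribe the synthesis, compare it to the target text with Levenshtein and Jaro-Winkler, and flag timestamp patterns that typically signal hallucinations. All thresholds (`max_lev_ratio`, `min_jw`, `max_word_s`, `max_gap_s`) are illustrative assumptions, not the shipped values:

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def jaro(a, b):
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if not la or not lb:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    match_a, match_b = [False] * la, [False] * lb
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not match_b[j] and b[j] == ca:
                match_a[i] = match_b[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    ma = [a[i] for i in range(la) if match_a[i]]
    mb = [b[j] for j in range(lb) if match_b[j]]
    transpositions = sum(x != y for x, y in zip(ma, mb)) // 2
    return (matches / la + matches / lb
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    """Jaro-Winkler: boosts Jaro for a shared prefix (up to 4 chars)."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def synthesis_ok(target, retranscribed, max_lev_ratio=0.2, min_jw=0.9):
    """Accept the synthesis only if the re-transcription is close enough
    to the target text (thresholds are illustrative)."""
    ratio = levenshtein(target, retranscribed) / max(len(target), 1)
    return ratio <= max_lev_ratio and jaro_winkler(target, retranscribed) >= min_jw

def looks_hallucinated(words, max_word_s=3.0, max_gap_s=2.0):
    """words: list of (text, start, end) word timestamps.
    Overlong words or long internal gaps are typical hallucination
    signatures (thresholds are illustrative)."""
    for i, (_, start, end) in enumerate(words):
        if end - start > max_word_s:
            return True
        if i and start - words[i - 1][2] > max_gap_s:
            return True
    return False
```

A failed `synthesis_ok` or a positive `looks_hallucinated` would then trigger a retry of the synthesis, as described above.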
- added fades at the start and end of the synthesis, since it gets trimmed, so we don't clip
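The anti-click fades can be sketched with NumPy; the 20 ms default ramp length is an assumed value, not necessarily what ships:

```python
import numpy as np

def apply_fades(samples, sr, fade_ms=20):
    """Apply short linear fade-in/fade-out ramps so a trimmed clip
    doesn't start or end on a nonzero sample (an audible click)."""
    out = np.asarray(samples, dtype=np.float32).copy()
    n = min(int(sr * fade_ms / 1000), len(out) // 2)
    if n:
        ramp = np.linspace(0.0, 1.0, n, endpoint=False)
        out[:n] *= ramp          # fade in
        out[-n:] *= ramp[::-1]   # fade out
    return out
```

The `len(out) // 2` cap keeps the two ramps from overlapping on very short clips.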
- autostart finished video after rendering