https://github.com/user-attachments/assets/077c8692-e148-46aa-a659-7171a1bbc165
🔧 Changes
- [x] Utilize llama.cpp's KV cache mechanism for faster inference. See (1)
- [x] Summarize the chat history when it is about to exceed the context length. See (1)
- [x] Recursively check and update missing settings from the DEFAULT CONFIG (a sketch follows this list)
- [x] Add validators (type, min, max value) for input fields in the settings dialog (a second sketch follows this list)
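
A minimal sketch of what the recursive default-filling could look like; `fill_missing`, `DEFAULT_CONFIG`, and the keys shown here are illustrative assumptions, not this PR's actual code:

```python
# Hypothetical names (fill_missing, DEFAULT_CONFIG, the keys shown) for illustration only.
from copy import deepcopy

DEFAULT_CONFIG = {
    "model": {"context_length": 4096, "n_gpu_layers": 0},
    "ui": {"theme": "dark", "font_size": 12},
}

def fill_missing(config: dict, defaults: dict) -> dict:
    """Recursively add every key that exists in `defaults` but is missing in `config`."""
    for key, default_value in defaults.items():
        if key not in config:
            config[key] = deepcopy(default_value)
        elif isinstance(default_value, dict) and isinstance(config[key], dict):
            fill_missing(config[key], default_value)
    return config

# A config saved by an older app version lacks the newly added keys:
user_config = {"model": {"context_length": 8192}}
fill_missing(user_config, DEFAULT_CONFIG)
# -> {'model': {'context_length': 8192, 'n_gpu_layers': 0}, 'ui': {'theme': 'dark', 'font_size': 12}}
```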
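
And a minimal sketch of the input-field validation idea; the `NumberField` helper, field names, and bounds are hypothetical:

```python
# Hypothetical sketch: field names, bounds, and the NumberField helper are not from this PR.
from dataclasses import dataclass

@dataclass
class NumberField:
    name: str
    value_type: type        # int or float
    min_value: float
    max_value: float

    def validate(self, raw: str):
        """Reject non-numeric input and values outside [min_value, max_value]."""
        try:
            value = self.value_type(raw)
        except ValueError:
            raise ValueError(f"{self.name} must be a {self.value_type.__name__}")
        if not (self.min_value <= value <= self.max_value):
            raise ValueError(f"{self.name} must be between {self.min_value} and {self.max_value}")
        return value

temperature = NumberField("temperature", float, 0.0, 2.0)
temperature.validate("0.7")   # -> 0.7
# temperature.validate("abc") # -> ValueError: temperature must be a float
```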
(1) llama.cpp's KV cache checks the prefix of your chat history against the previously processed prompt and reuses the cached Key/Value vectors for the matching part. For example:
- Generated sequence so far = "ABCDEF"
- If we modify the chat history to something like "ABCDXT", the prefix "ABCD" matches and its cache is reused; the Key and Value vectors for "XT" are computed fresh, and new responses are generated from there.
-> So we should make the most of this mechanism by keeping the history prefix as fixed as possible.
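
A minimal sketch of how the summarization change can cooperate with this prefix reuse, assuming llama-cpp-python's `Llama` API; `summarize_turns`, the token budget, and the prompt layout are illustrative assumptions, not this PR's actual implementation:

```python
# Minimal sketch, assuming llama-cpp-python's `Llama` API; `summarize_turns`,
# the token budget, and the prompt layout are illustrative, not this PR's code.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096, verbose=False)

SYSTEM = "You are a helpful assistant."
summary = ""   # rolling summary of turns that were folded away
turns = []     # recent turns kept verbatim (the append-only suffix)

def summarize_turns(old_summary: str, old_turns: list) -> str:
    """Placeholder: ask the model itself to compress the older turns."""
    prompt = "Summarize briefly:\n" + old_summary + "\n" + "\n".join(old_turns) + "\nSummary:"
    return llm(prompt, max_tokens=128)["choices"][0]["text"].strip()

def build_prompt() -> str:
    # The prefix (SYSTEM + summary) stays byte-identical between calls, so
    # llama.cpp's prefix match reuses its KV cache; new turns are only appended.
    header = SYSTEM + (f"\n[Earlier conversation]: {summary}" if summary else "")
    return header + "\n" + "\n".join(turns)

def chat(user_msg: str) -> str:
    global summary
    turns.append(f"User: {user_msg}")
    # When close to the context limit, fold old turns into the summary.
    # This rewrites the prefix once (cache recomputed), then it is stable again.
    if len(llm.tokenize(build_prompt().encode())) > 3500:
        summary = summarize_turns(summary, turns[:-2])
        del turns[:-2]
    reply = llm(build_prompt() + "\nAssistant:", max_tokens=256)["choices"][0]["text"]
    turns.append("Assistant:" + reply)
    return reply
```

The design point is that the system prompt and the rolling summary form a prefix that stays identical across turns, so the KV cache is only invalidated at the rare moment the summary itself is rewritten.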