Latest version: v0.1.5.post1
The information on this page was curated by experts in our Cybersecurity Intelligence Team.
To speed up inference for long-context LLMs, MInference computes attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
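As a rough illustration of the dynamic sparse idea (not the package's actual CUDA kernels or API), the sketch below estimates which key blocks matter for each query block from mean-pooled scores and then computes exact attention only over the selected blocks; the function name, `block_size`, and `top_k_blocks` are illustrative choices, not library parameters.

```python
# Minimal sketch of dynamic block-sparse attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    """q, k, v: (seq_len, head_dim) arrays. Returns (seq_len, head_dim)."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    scale = 1.0 / np.sqrt(dim)

    # Cheaply estimate block importance from mean-pooled query/key blocks.
    q_pool = q[: n_blocks * block_size].reshape(n_blocks, block_size, dim).mean(axis=1)
    k_pool = k[: n_blocks * block_size].reshape(n_blocks, block_size, dim).mean(axis=1)
    block_scores = q_pool @ k_pool.T                       # (n_blocks, n_blocks)
    top_blocks = np.argsort(-block_scores, axis=1)[:, :top_k_blocks]

    out = np.zeros_like(q)
    for qb in range(n_blocks):
        q_rows = slice(qb * block_size, (qb + 1) * block_size)
        # Gather only the selected key/value blocks for this query block.
        cols = np.concatenate(
            [np.arange(kb * block_size, (kb + 1) * block_size) for kb in top_blocks[qb]]
        )
        attn = softmax(q[q_rows] @ k[cols].T * scale)       # exact attention on the sparse set
        out[q_rows] = attn @ v[cols]
    return out

# Usage with random data: attention cost scales with top_k_blocks, not seq_len.
rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32)).astype(np.float32)
k = rng.standard_normal((256, 32)).astype(np.float32)
v = rng.standard_normal((256, 32)).astype(np.float32)
print(dynamic_block_sparse_attention(q, k, v).shape)  # (256, 32)
```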
No known vulnerabilities found