Major changes:
* Introduced new attention kernels, including flash decoding and a kernel for speculative decoding, to improve inference efficiency.
* Implemented an initial Python wrapper (via pybind11), simplifying integration and extending accessibility.
* Added support for new models, including Baichuan/Baichuan2 and ChatGLM.
* Added support for Jinja chat templates, enabling per-model prompt customization.
* Added usage statistics to responses for compatibility with the OpenAI API.
* Enabled ccache to speed up builds and shorten development cycles.
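The usage-statistics bullet above follows the OpenAI API convention, in which each response carries a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`. A minimal sketch of constructing such an object (the `make_usage` helper is hypothetical, not ScaleLLM's actual code; only the field names come from the OpenAI API):

```python
import json

def make_usage(prompt_tokens: int, completion_tokens: int) -> dict:
    """Build an OpenAI-compatible usage block for a response.

    Field names follow the OpenAI chat/completions "usage" object;
    the helper itself is an illustrative assumption.
    """
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

# Example: a request with 12 prompt tokens and 30 generated tokens.
print(json.dumps(make_usage(12, 30)))
```

Clients that already parse OpenAI responses can read this block unchanged, which is the point of matching the field names exactly.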
What's Changed
* add timestamp into ccache cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/42
* use ${GITHUB_SHA} in cache key by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/43
* replace GITHUB_SHA with ${{ github.sha }} by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/44
* encapsulate class of time for performance tracking. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/46
* upgrade paged_atten kernel to v0.2.7 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/47
* [feat] add speculative decoding. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/50
* added a new attention kernel for speculative decoding by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/52
* added support for small page size. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/53
* enable flash decoding for both prefill and decode phase. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/54
* enable split-k for flash decoding and fix bugs. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/59
* [ut] add unit tests for speculative scheduler. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/57
* added a custom command to generate instantiation for flashinfer by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/61
* add custom command to generate instantiation for flash-attn by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/62
* added gpu memory profiling to decide kv cache size precisely. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/63
* moved attention related files into attention subfolder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/65
* add pybind11 to support python user interface. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/64
* added support to build python wrapper with installed pytorch ( pre-cxx11 abi) by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/66
* merge huggingface tokenizers and safetensors rust projects into one. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/67
* more changes to support python wrapper by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/68
* [feat] added attention handler for different implementations by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/71
* [perf] enabled speedup for GQA and MQA decoding. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/72
* [perf] use a separate cuda stream for kv cache by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/73
* [models] added baichuan/baichuan2 model support. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/70
* [minor] cleanup redundant code for models. by liutongxuan in https://github.com/vectorch-ai/ScaleLLM/pull/74
* [feat] moved rope logic into attention handler to support applying positional embedding on the fly by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/76
* [refactor] replace dtype and device with options since they are used together usually by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/77
* [refactor] move cutlass and flashinfer into third_party folder by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/78
* [refactor] split model forward function into two: 1> get hidden states 2> get logits from hidden states by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/80
* [models] support both baichuan and baichuan2 by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/81
* [models] fix chatglm model issue. by guocuimi in https://github.com/vectorch-ai/ScaleLLM/pull/82
**Full Changelog**: https://github.com/vectorch-ai/ScaleLLM/compare/v0.0.5...v0.0.6