* Dynamic memory allocation on demand to fully utilize device memory; the scratch size and memory size no longer need to be preset.
* Drop Baichuan/InternLM support since they have been integrated into llama.cpp.
* API changes:
  * CMake CUDA option: `-DGGML_CUBLAS` changed to `-DGGML_CUDA`
  * CMake CUDA architecture: `-DCUDA_ARCHITECTURES` changed to `-DCMAKE_CUDA_ARCHITECTURES` (see the build sketch after this list)
  * `num_threads` was removed from `GenerationConfig`: the optimal number of threads is now selected automatically (see the config sketch below)
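
For reference, a minimal CUDA build sketch under the renamed options; the `build` directory and the architecture value `80` are illustrative choices, not requirements:

```sh
# Enable CUDA with the renamed option and select the target GPU architecture
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="80"
cmake --build build -j
```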
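
And a hypothetical C++ sketch of the `GenerationConfig` change; aside from `num_threads`, the names used here (`chatglm.h`, `max_length`, `temperature`) are assumptions modeled on typical sampling parameters and may not match the actual header:

```cpp
#include "chatglm.h"  // assumed project header exposing chatglm::GenerationConfig

int main() {
    chatglm::GenerationConfig gen_config;  // namespace and field names are assumptions
    gen_config.max_length = 2048;          // illustrative sampling parameters
    gen_config.temperature = 0.95f;
    // gen_config.num_threads = 8;         // removed: thread count is now chosen automatically
    return 0;
}
```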