## Parallel requests
Ollama can now serve multiple requests at the same time, using only a small amount of additional memory for each request. This enables use cases such as:
- Handling multiple chat sessions at the same time
- Hosting a code completion LLM for your internal team
- Processing different parts of a document simultaneously
- Running several agents at the same time
https://github.com/ollama/ollama/assets/251292/9772a5f1-c072-41db-be6c-dd3c621aa2fd
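As a quick illustration, here is a minimal sketch of issuing concurrent requests with `curl` (the server address and model name are assumptions; adjust for your setup):

```shell
# Fire two generate requests at once; with parallel serving, both are
# processed concurrently instead of queuing behind each other.
# Assumes the default server address and a model pulled as "llama3".
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}' &
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Explain TCP in one sentence."}' &
wait
```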
## Multiple models
Ollama now supports loading different models at the same time, enabling use cases such as:
- Retrieval-augmented generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously
- Agents: multiple agents can now run at the same time
- Running large and small models side by side
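For example, a RAG pipeline can keep an embedding model and a chat model resident at the same time. A minimal sketch with `curl`, assuming the default server address and the models shown in the `ollama ps` output below:

```shell
# Embed a document chunk with all-minilm, then generate with llama3.
# Both models remain loaded in memory (subject to available GPU memory),
# so neither call pays a model-load penalty.
curl http://localhost:11434/api/embeddings \
  -d '{"model": "all-minilm", "prompt": "Ollama can load multiple models at once."}'
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Summarize: Ollama can load multiple models at once.", "stream": false}'
```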
Models are automatically loaded and unloaded based on incoming requests and available GPU memory.
To see which models are loaded, run `ollama ps`:
```
% ollama ps
NAME              	ID          	SIZE  	PROCESSOR	UNTIL
gemma:2b          	030ee63283b5	2.8 GB	100% GPU 	4 minutes from now
all-minilm:latest 	1b226e2802db	530 MB	100% GPU 	4 minutes from now
llama3:latest     	365c0bd3c000	6.7 GB	100% GPU 	4 minutes from now
```
For more information on concurrency, see the [FAQ](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests).
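Concurrency can also be tuned via server environment variables such as `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` (see the FAQ linked above for defaults and exact behavior). For example:

```shell
# Serve up to 4 parallel requests per model and keep up to 2 models loaded.
# (Values here are illustrative, not recommendations.)
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
```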
## New models
* [GLM-4](https://ollama.com/library/glm4): A strong multilingual general language model with performance competitive with Llama 3.
* [CodeGeeX4](https://ollama.com/library/codegeex4): A versatile model for AI software development scenarios, including code completion.
* [Gemma 2](https://ollama.com/library/gemma2): Improved output quality, and base text generation models are now available.
## What's Changed
* Improved Gemma 2:
  * Fixed an issue where the model would generate invalid tokens after hitting the context window limit
  * Fixed inference output issues with `gemma2:27b`
  * Re-downloading the model may be required: `ollama pull gemma2` or `ollama pull gemma2:27b`
* Ollama will now show a better error if a model architecture isn't supported
* Improved handling of quotes and spaces in Modelfile `FROM` lines
* On Linux, Ollama will now return an error if the system does not have enough memory to run a model
## New Contributors
* @Muku784 made their first contribution in https://github.com/ollama/ollama/pull/5382
* @abitrolly made their first contribution in https://github.com/ollama/ollama/pull/4821
**Full Changelog**: https://github.com/ollama/ollama/compare/v0.1.48...v0.2.0