<div align="center">
<img src="https://github.com/user-attachments/assets/1a70f8b1-b794-435a-8b7f-c9d4b64ba6db" width="512" />
</div>
## New models
- [Llama 3.3](https://ollama.com/library/llama3.3): a new state-of-the-art 70B model. Llama 3.3 70B offers performance similar to the Llama 3.1 405B model.
- [Snowflake Arctic Embed 2](https://ollama.com/library/snowflake-arctic-embed2): Snowflake's frontier embedding model. Arctic Embed 2.0 adds multilingual support without sacrificing English performance or scalability.
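Once pulled, the new embedding model can be called like any other. A minimal sketch using the [Ollama Python library](https://github.com/ollama/ollama-python), assuming its `embed()` helper and that `ollama pull snowflake-arctic-embed2` has already been run:

```py
import ollama

# Embed two sentences (one English, one French) with the new
# multilingual model; assumes the model has already been pulled.
response = ollama.embed(
    model='snowflake-arctic-embed2',
    input=['The quick brown fox', 'Le rapide renard brun'],
)

# One embedding vector is returned per input string.
print(len(response.embeddings), len(response.embeddings[0]))
```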
## Structured outputs
Ollama now supports structured outputs, making it possible to constrain a model's output to a specific format defined by a JSON schema. The Ollama Python and JavaScript libraries have been updated to support structured outputs, as have Ollama's OpenAI-compatible API endpoints (see the sketch after the JavaScript example below).
### REST API
To use structured outputs in Ollama's generate or chat APIs, provide a JSON schema object in the `format` parameter:
```shell
curl -X POST http://localhost:11434/api/chat -H "Content-Type: application/json" -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Tell me about Canada."}],
  "stream": false,
  "format": {
    "type": "object",
    "properties": {
      "name": {
        "type": "string"
      },
      "capital": {
        "type": "string"
      },
      "languages": {
        "type": "array",
        "items": {
          "type": "string"
        }
      }
    },
    "required": [
      "name",
      "capital",
      "languages"
    ]
  }
}'
```
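The reply's `message.content` field is a JSON string conforming to the schema, so it can be parsed directly. A minimal sketch of the same request from Python using the `requests` package (error handling omitted; assumes a local server on the default port):

```py
import json

import requests

# Same schema as the curl example above: constrain the output to
# an object with name, capital, and languages fields.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "capital", "languages"],
}

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Tell me about Canada."}],
        "stream": False,
        "format": schema,
    },
)

# message.content is a JSON string matching the schema.
country = json.loads(resp.json()["message"]["content"])
print(country["capital"])
```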
### Python library
Using the [Ollama Python library](https://github.com/ollama/ollama-python), pass the schema to the `format` parameter as a JSON object (a `dict`), or use Pydantic (recommended) to serialize the schema with `model_json_schema()`:
```py
from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
    name: str
    capital: str
    languages: list[str]

response = chat(
    messages=[
        {
            'role': 'user',
            'content': 'Tell me about Canada.',
        }
    ],
    model='llama3.1',
    format=Country.model_json_schema(),
)

country = Country.model_validate_json(response.message.content)
print(country)
```
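Note that `Country.model_validate_json()` raises a Pydantic `ValidationError` if the model's reply does not match the schema, so production code may want to catch it and retry the request.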
### JavaScript library
Using the [Ollama JavaScript library](https://github.com/ollama/ollama-js), pass the schema to the `format` parameter as a JSON object, or use Zod (recommended) to serialize the schema with `zodToJsonSchema()`:
```js
import ollama from 'ollama';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

const Country = z.object({
  name: z.string(),
  capital: z.string(),
  languages: z.array(z.string()),
});

const response = await ollama.chat({
  model: 'llama3.1',
  messages: [{ role: 'user', content: 'Tell me about Canada.' }],
  format: zodToJsonSchema(Country),
});

const country = Country.parse(JSON.parse(response.message.content));
console.log(country);
```
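Structured outputs also work through Ollama's OpenAI-compatible endpoints. A minimal sketch using the official `openai` Python client pointed at a local Ollama server; the `response_format` payload shape here follows OpenAI's JSON-schema convention and is an assumption, so check Ollama's OpenAI compatibility docs for the exact form:

```py
from openai import OpenAI

# Point the OpenAI client at the local Ollama server.
# An api_key is required by the client but ignored by Ollama.
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

completion = client.chat.completions.create(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Tell me about Canada.'}],
    response_format={
        'type': 'json_schema',
        'json_schema': {
            'name': 'country',  # hypothetical schema name, for illustration
            'schema': {
                'type': 'object',
                'properties': {
                    'name': {'type': 'string'},
                    'capital': {'type': 'string'},
                    'languages': {'type': 'array', 'items': {'type': 'string'}},
                },
                'required': ['name', 'capital', 'languages'],
            },
        },
    },
)

print(completion.choices[0].message.content)
```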
## What's Changed
* Fixed error importing model vocabulary files
* Experimental: new flag to set KV cache quantization to 4-bit (`q4_0`), 8-bit (`q8_0`), or 16-bit (`f16`). This reduces VRAM requirements for longer context windows.
  * To enable it for all models, use `OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q4_0 ollama serve`
  * Note: in the future, flash attention will be enabled by default where available, with KV cache quantization available on a per-model basis
  * Thank you sammcj for the contribution in https://github.com/ollama/ollama/pull/7926
## New Contributors
* dmayboroda made their first contribution in https://github.com/ollama/ollama/pull/7906
* Geometrein made their first contribution in https://github.com/ollama/ollama/pull/7908
* owboson made their first contribution in https://github.com/ollama/ollama/pull/7693
**Full Changelog**: https://github.com/ollama/ollama/compare/v0.4.7...v0.5.0