Ax-engine

Latest version: v6.3.4

Safety actively analyzes 945810 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 1 of 2

6.2.6

Added

- **Inline MP4/WebM video on chat (ffmpeg)** — `POST /v1/chat/completions`
`video_url` content parts now accept inline base64 MP4/WebM in addition to
animated GIF when `ffmpeg` is available on the server `PATH`. Containers
are routed by magic bytes; ffmpeg pipes PNG frames with `showinfo`
timestamps keyed by frame number. Extraction is resource-bounded: frames
are downscaled to at most 1600 px on the longest side, the decoded stream
is capped at 512 MiB, and only the ≤ 32 uniformly sampled frames are
PNG-decoded. Without `ffmpeg`, MP4/WebM still report a clear unsupported
error and `/v1/generate` keeps accepting pre-extracted frame tensors.

Fixed

- **Removed prompt-keyed output rewriting (integrity fix)** — since v6.2.2 the
OpenAI/Anthropic chat routes silently rewrote model output for requests
whose prompt matched three specific benchmark phrasings: a five-words/no-'e'
prompt could have its real output replaced with a hardcoded sentence, a
750 ml unit-conversion prompt could be replaced with a hardcoded answer
(with `finish_reason` forced to `stop`), and "return only YAML" prompts had
markdown code fences stripped. This made the server misreport what the
model actually generated, both in QA runs and for any real user whose
prompt matched. The entire `OpenAiOutputPostprocessing` mechanism is
removed; responses now always carry the model's actual output. Affected
releases: v6.2.2 through v6.2.5.
- **Inline media preprocessing no longer blocks the async executor** — chat
requests carrying inline image/audio/video parts now build on the tokio
blocking pool (`spawn_blocking`) on both the OpenAI chat and Anthropic
Messages routes. Media decoding is CPU-bound and, for MP4/WebM, waits on an
ffmpeg child process — seconds-scale work that previously stalled an async
worker thread and degraded unrelated concurrent requests. Text-only chat
requests keep building inline.
- **Anthropic Messages no-op feature payloads** — `tools: []`,
`tool_choice: {"type": "none"}` / `{"type": "auto"}`, and
`thinking: {"type": "disabled"}` are valid Anthropic payloads that use no
unsupported feature and are now accepted instead of being rejected with
400 (`tools: []` regressed in v6.2.5; the others were rejected since the
endpoint was introduced). Non-empty `tools`, enabled `thinking`, and
malformed values are still rejected.
- **Non-positive or subnormal `image_std` rejected at config load** — image
normalization validation now rejects negative, zero, and subnormal-for-f32
std channels (previously only zero or non-finite) in both the server and
the Python SDK; a subnormal value such as `1e-40` passed a naive `> 0`
check but still overflowed the per-pixel division to inf. The per-pixel
epsilon clamps — which silently rewrote a bad std to ~1.2e-7 (Rust) or
1e-12 (Python) — are removed in favor of the load-time guarantee.
- **Gemma 4 resize fallback divisibility (hardening, no output change)** —
the single-axis resize fallback floors the aspect ratio again, keeping the
fallback dimension a multiple of `patch_size * pooling_kernel_size` by
construction rather than only via the `max_side` clamp (which still binds
for every reachable input today, so resize outputs are unchanged). The
Python parity test now pins the same vectors and the divisibility contract
as the Rust test.

6.2.5

Adds API key authentication, a Prometheus metrics endpoint, agentic
response contracts (logprobs, reasoning, tool calls, JSON validation), a
CLI model manager, and hardens multimodal serving against misconfigured
checkpoints.

Added

- **API key authentication** — `--api-key` flag or `AX_ENGINE_API_KEY`
environment variable requires `Authorization: Bearer <key>` on all
`/v1/*` routes. `/health` and `/healthz` stay unauthenticated for
readiness probes. Token comparison uses constant-time equality to avoid
timing side channels.
- **Prometheus metrics** (`GET /metrics`) — HTTP request counters (total,
in-flight, 2xx/4xx/5xx) and engine-step gauges (scheduled requests and
tokens, KV block usage, prefix-cache hits). Engine-step gauges are
read-only snapshots from real generation steps; the scrape path never
advances the engine. Requires the API key when auth is enabled.
- **Logprobs** (completions and chat) — when the engine observed
sampled-token logprobs, non-streaming responses carry them in OpenAI-shaped
`logprobs` blocks. Logprobs are all-or-nothing: partially observed values
are omitted entirely to keep arrays aligned. Streaming logprobs are
rejected with `400 unsupported_parameter` for now.
- **Reasoning output** (chat) — opt-in via the `reasoning` field. Known
model-family thinking patterns are split into
`message.reasoning_content`: Qwen ` THOUGHT…` text markers and Gemma 4
thinking channels (extracted token-level during native decode). Unknown
formats are left in `content` untouched.
- **Tool call extraction** (chat) — experimental. When `tools` are present,
explicit `ARGS…` spans in model output are parsed into
`message.tool_calls`. Bare JSON is never reinterpreted as a tool call.
`/v1/models` continues to report `openai_tool_calling_supported: false`
until streaming deltas and continuation handling land.
- **JSON object validation** (`response_format: json_object`) — non-streaming
responses are validated server-side; output that is not a JSON object
returns `502 invalid_output`. This is post-hoc validation, not constrained
decoding.
- **CLI model manager** — `ax-engine models list`, `info`, and `rm`
subcommands for inspecting and cleaning up downloaded model artifacts.

Fixed

- **Image normalization division by zero** — config loading now rejects a
zero or non-finite `image_std` channel when `do_normalize` is set (server
returns `MediaError::Config`, Python SDK raises `ValueError`) instead of
producing inf/NaN pixels.
- **Anthropic request validation** — `json_value_is_present` now checks for
presence (non-null) rather than truthiness, so `thinking: false` or
`tools: 0` correctly trigger the unsupported-feature rejection.
- **Resize extreme aspect ratios** — single-axis fallback in `resize_target`
now uses `.max(unit)` to guarantee at least one patch unit per dimension,
preventing zero-dimension targets on extreme aspect ratios (e.g., 1×10000).

6.2.4

Adds an Anthropic-compatible Messages endpoint, accepts MP3 audio in
multimodal chat, and hardens Gemma 4 input validation.

Added

- **Anthropic Messages endpoint** (`POST /v1/messages`) — translates
Anthropic-style `system`, `messages`, `max_tokens`, `temperature`,
`top_p`, `top_k`, and `stop_sequences` into the internal OpenAI chat
pipeline. Content blocks, `model` validation, and usage tracking are
supported. Streaming, tool use, and extended thinking are rejected with
clear error messages. Works with native MLX, MLX-LM delegated, and
llama.cpp delegated backends.
- **MP3 audio in multimodal chat** — `/v1/chat/completions` now accepts MP3
inline audio in addition to WAV. The container is sniffed from magic bytes
(RIFF → hound WAV decoder, ID3/MPEG sync → symphonia MP3 decoder). MP3
decoding stops at the model's `audio_seq_length` frame cap to cap memory
use. AAC/OGG/FLAC remain unsupported; send pre-computed tensors via
`/v1/generate` for those. The Python SDK preprocessing helper stays WAV-only.

Changed

- **Gemma 4 video timestamp validation** — `video_timestamp_token_ids` now
validates that every entry is a non-negative integer (rejects booleans,
floats, and negative values) with per-element error messages identifying
the exact video, frame, and index.
- **Serving contract documentation** — `docs/SERVER.md` updated to reflect
WAV/MP3 audio support, magic-byte sniffing, decode cap, and fixed
soft-token budgets.

6.2.2

Patch release that fixes a critical multimodal attention bug in Gemma 4
where vision tokens lost intra-image bidirectionality on full-attention
layers, improves OpenAI API output postprocessing, and adds idempotent
PyPI publish.

Fixed
- **Gemma 4 media block overlay** — multimodal PrefixLM mask was previously
applied only to sliding-window layers, leaving full-attention layers with
a plain causal mask. Vision tokens larger than the sliding window were
silently dropped, losing intra-image bidirectionality on every global
layer. Now the bidirectional vision-block overlay is applied to both
full-attention and sliding-window layers, matching the reference
implementation. Memoized per unique window size for efficiency.
- **Gemma 4 channel output markers** — thinking-channel framing stripped
from chat responses to prevent leaking internal model markers.
- **OpenAI unusual prompt output postprocessing** — handles edge cases
where model output contains unexpected prompt echoes or structural anomalies.
- **OpenAI response formatting** — standardized postprocessing for
consistent API output.

Added
- **Tokenizer token lookup** — exposed for debugging and inspection of
tokenized inputs.
- **Idempotent PyPI release publish** — re-publishing the same version
no longer fails, enabling safe retry of interrupted releases.

6.2.1

Patch release focused on multimodal peer benchmark methodology hardening,
Gemma 4 benchmark artifact sanitization, and a new Homebrew CLI entrypoint.

Added
- **Homebrew CLI entrypoint** — `ax-engine` installable and runnable via Homebrew.
- **Cold peer benchmark** — refreshed Gemma 4 multimodal cold peer benchmark
with documented llama peer launch contract.

Changed
- **Atomic multimodal prefill scheduling** — prefill no longer split across
scheduling boundaries, eliminating race conditions in peer benchmarks.
- Peer benchmark methodology hardened with stricter validation and
reproducibility guarantees.
- Gemma 4 peer chart styling adjusted for clarity.

Fixed
- **Benchmark preview smoke binary selection** — smoke test now selects the
correct `ax-engine-bench` binary.
- **`cargo run` disambiguation** — `ax-engine-bench` cargo run commands now
unambiguous in workspace.
- **Gemma 4 benchmark artifact paths** — sanitized to prevent path traversal
in artifact naming.
- **Llama peer slot reuse** — rejected in multimodal benchmarks to prevent
stale state contamination.
- **Llama audio cap peer row** — skipped due to instability.
- **Multimodal peer fairness** — hardened scheduling to ensure equitable
resource allocation across peers.

6.2.0

Completes the **Gemma 4 12B multimodal story** (image, audio, video) with
golden-validated preprocessing, introduces **speculation profile presets** for
workload-tuned MTP gating, and ships benchmark hardening and download UX improvements.

Added
- **Gemma 4 multimodal chat** — inline video (GIF), image, and audio input in
chat conversations.
- **Golden-validated preprocessing** — Python SDK preprocessing matches the
reference implementation for audio and video vectors.
- **Video fidelity** — 70-token frames, 32-frame cap, mm:ss timestamp formatting.
- **Runtime smoke tests** — image TTFT and end-to-end multimodal probes.
- **Speculation profiles** — four presets (`auto`, `coding`, `agentic`,
`chatbot`) with calibrated MTP draft-confidence gates. Gemma 4 gates
calibrated from 12B ablation data. CLI flag `--speculation-profile` with
programmatic SDK override.
- **Hugging Face Hub** snapshot download support.
- **Same-artifact direct-vs-MTP parity harness** for Gemma 4 12B.

Changed
- **`--force` flag** now invalidates stale manifests in download destination.
- **Bundled benchmark binary** preferred; fails loudly on missing manifest.
- Multimodal benchmark modality set validation prevents invalid configurations.
- Qwen MTP improvement chart added to README.
- Gemma 4 MTP public artifacts and phase 4 results published.
- README announcement flow productized.

Fixed
- **Embed mutex poison recovery** — embedding pipeline no longer panics on
poisoned mutex.
- **EWMA clamp** — exponential moving average clamped to prevent numerical drift.
- **SSE role emission** — role field now correctly emitted in streaming responses.
- Multimodal benchmark artifact validation for missing or malformed outputs.
- Gemma4 multimodal QA probe false-positive content match.
- Multimodal config-loading divergences from the reference implementation.
- Shifted MTP norm sidecar validation.
- Bench doctor smoke status check.

Page 1 of 2

© 2026 Safety CLI Cybersecurity Inc. All Rights Reserved.