Inspect-ai

Latest version: v0.3.82

Safety actively analyzes 723650 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 6 of 14

0.3.52

- Eval: `--sample-id` option for evaluating specific sample id(s).
- Bedrock: Detect and report HTTP rate limit errors.
- Azure AI: Add `emulate_tools` model arg to force tool emulation (emulation is enabled by default for Llama models).
- Basic Agent: Add `max_tool_output` parameter to override default max tool output from generate config.
- Inspect View: Correct display of sample ID for single sample tasks.
- Trace: Show custom tool views in `--trace` mode.
- Bugfix: Support for dynamic metric names in realtime scoring display.

0.3.51

- Bugfix: Task display fails to load when no scorers are defined for a task.

0.3.50

- Tools: Improved typing/schema support (unions, optional params, enums).
- Tools: Added `append` argument to `use_tools()` for adding (rather than replacing) the currently available tools.
- Docker sandbox: Streamed reads of stderr/stdout (enabling us to enforce output limits for read_file and exec at the source).
- Sandbox API: Enable passing `BaseModel` types for sandbox `config` (formerly only a file path could be passed).
- Task display: Show all task scores in realtime (expand task progress to see scores).
- Task display: Show completed samples and align progress more closely to completed samples (as opposed to steps).
- Task display: Show sample messages/tokens used (plus limits if specified).
- Task display: Resolve issue where task display would lose mouse input after VS Code reload.
- Datasets: Validate that all IDs in datasets are unique (as several downstream problems occur w/ duplicate IDs).
- Inspect View: Fix issue with incorrectly displayed custom tool views.
- Human approval: Use fullscreen display (makes approval UI async and enables rapid processing of approvals via the `Enter` key).
- Added `input_panel()` API for adding custom panels to the fullscreen task display.
- Log recorder: Methods are now async which will improve performance for fsspec filesystems with async implementations (e.g. S3)
- Log recorder: Improve `.eval` log reading performance for remote filesystem (eagerly fetch log to local buffer).
- Add `token_usage` property to `TaskState` which has current total tokens used across all calls to `generate()` (same value that is used for enforcing token limits).
- Add `time` field to `ModelOutput` that records total time spent within call to ModelAPI `generate()`.
- Web browser: Remove base64 images from web page contents (prevent filling up model context with large images).
- Match scorer: If the target of a match isn’t numeric, ignore the numeric flag and instead use text matching (improved handling for percentages).
- Hugging Face: Support for native HF tool calling for Llama, Mistral, Qwen, and others if they conform to various standard schemas.
- Hugging Face: `tokenizer_call_args` dict to specify custom args during tokenization, such as `max_length` and `truncation`.
- Azure AI: Fix schema validation error that occurred when model API returns `None` for `content`.
- Display: Throttle updating of sample list based on number of samples.
- Display: Add explicit 'ctrl+c' keybinding (as textual now disables this by default).
- Bugfix: Correct rate limit error display when running in fullscreen mode.
- Bugfix: `hf_dataset` now explicitly requires the `split` argument (previously, it would crash when not specified).
- Bugfix: Prevent cascading textual error when an error occurs during task initialisation.
- Bugfix: Correctly restore sample summaries from log file after amend.
- Bugfix: Report errors that occur during task finalisation.

0.3.49

- Logging: Only call CreateBucket on Amazon S3 when the bucket does not already exist.
- Improve cancellation feedback and prevent multiple cancellations when using fullscreen display.
- Inspect View: Prevent circular reference error when rendering complex tool input.
- Inspect View: Resolve display issue with sorting by sample then epoch.

0.3.48

- [Realtime display](https://github.com/UKGovernmentBEIS/inspect_ai/pull/865) of sample transcripts (including ability to cancel running samples).
- Scoring: When using a dictionary to map metrics to score value dictionaries, you may now use globs as keys. See our [scorer documentation](https://inspect.aisi.org.uk/scorers.html#sec-multiple-scorers) for more information.
- `EvalLog` now includes a [location](https://github.com/UKGovernmentBEIS/inspect_ai/pull/872) property indicating where it was read from.
- Use [tool views](https://inspect.aisi.org.uk/approval.html#tool-views) when rendering tool calls in Inspect View.
- Consistent behavior for `max_samples` across sandbox and non-sandbox evals (both now apply `max_samples` per task, formerly evals with sandboxes applied `max_samples` globally).
- Log files now properly deal with scores that produce Nan. (fixes [834](https://github.com/UKGovernmentBEIS/inspect_ai/issues/834))
- Bash tool: add `--login` option so that e.g. .bashrc is read before executing the command.
- Google: Support for tools/functions that have no parameters.
- Google/Vertex: Support for `logprobs` and other new 1.5 (002 series) options.
- AzureAI: Change default max_tokens for Llama models to 2048 (4096 currently yields an error w/ Llama 3.1).
- Mistral: Various compatibility changes for their client and tool calling implementation.
- Handle exponents in numeric normalisation for match, include, and answer scorers.
- hf_dataset: Added `cached` argument to control whether to use a previously cached version of the dataset if available (defaults to `True`).
- hf_dataset: Added `revision` option to load a specific branch or commit SHA (when using `revision` datasets are always revalidated on Hugging Face, i.e. `cached` is ignored).
- Log viewer: Display sample ids rather than indexes.
- Log viewer: Add timestamps to transcript events.
- Log viewer: Metadata which contains images will now render the images.
- Log viewer: Show custom tool call views in messages display.
- Bugfix: Correctly read and forward image detail property.
- Bugfix: Correct resolution of global eval override of task or sample sandboxes.
- Bugfix: Don't do eval log listing on background threads (s3fs can deadlock when run from multiple threads).

0.3.47

- Basic agent: Ensure that the scorer is only run once when max_attempts = 1.
- Basic agent: Support custom function for incorrect_message reply to model.
- Tool calling: Execute multiple tool calls serially (some models assume that multiple calls are executed this way rather than in parallel).
- Google: Combine consecutive tool messages into single content part; ensure no empty text content parts.
- AzureAI: Create and close client with each call to generate (fixes issue w/ using azureai on multiple passes of eval).
- Bedrock: Migrate to the [Converse API](https://docs.aws.amazon.com/bedrock/latest/userguide/conversation-inference-supported-models-features.html), which supports many more features including tool calling and multimodal models.
- Scoring: When using a dictionary to map metrics to score value dictionaries, you may now use globs as keys. See our [scorer documentation](https://inspect.aisi.org.uk/scorers.html#sec-multiple-scorers) for more information.
- Sample limit events will now appear in the transcript if a limit (e.g. message, token, or time limit) halt a sample. The sample list and sample detail also display the limit, if applicable.

Page 6 of 14

© 2025 Safety CLI Cybersecurity Inc. All Rights Reserved.