Inspect-ai

Latest version: v0.3.82

Safety actively analyzes 723650 Python packages for vulnerabilities to keep your Python projects secure.

Scan your dependencies

Page 12 of 14

0.3.16

- Various fixes for the use of Docker tool environments on Windows.
- Ability to disable cleanup of tool environments via `--no-toolenv-cleanup`.
- New `inspect toolenv cleanup` command for manually cleaning up tool environments.
- `ToolError` exception type for explicitly raising tool errors to the model. Formerly, any exception would be surfaced as a tool error to the model. Now, the `ToolError` exception is required for reporting to the model (otherwise other exception types go through the call stack and result in an eval error).
- Resolve `INSPECT_LOG_DIR` in `.env` file relative to `.env` file parent directory.
- Use `-` for delimiting `--limit` ranges rather than `,`.
- Use HF model device for generate (compatibility with multi-GPU).

0.3.15

- [Sandbox Environments](https://inspect.aisi.org.uk/sandboxing.html) for executing tool code in a sandbox.
- [Caching](https://inspect.aisi.org.uk/caching.html) to reduce the number of model API calls made.
- The `multiple_choice()` solver now has support for questions with multiple correct answers.
- More fine grained handling of Claude `BadRequestError` (400) errors (which were formerly all treated as content moderation errors).
- Filter out empty TextBlockParam when playing messages back to Claude.
- Automatically combine Claude user messages that include tool content.
- Revert to "auto" rather than "none" after forced tool call.
- Provide `TaskState.tools` getter/setter (where the setter automatically syncs the system messages to the specified set of tools).
- The `use_tools()` function now uses the `TaskState.tools` setter, so replaces the current set of tools entirely rather than appending to it.
- Set `state.completed = False` when `max_messages` is reached.
- Allow tools to be declared with no parameters.
- Allow for null `bytes` field in `Logprobs` and `TopLogprobs`.
- Support all Llama series models on Bedrock.
- Added `truthfulqa` benchmark.
- Added `intercode-ctf` example.

0.3.14

- Stream samples to the evaluation log as they are completed (subject to the new `--log-buffer` option). Always write completed samples in the case of an error or cancelled task.
- New `"cancelled"` status in eval log for tasks interrupted with SIGINT (e.g. Ctrl-C). Logs are now written for cancellations (previously they were not).
- Default `--max-samples` (maximum concurrent samples) to `--max-connections`, which will result in samples being more frequently completed and written to the log file.
- For `eval_retry()`, copy previously completed samples in the log file being retried so that work is not unnecessarily repeated.
- New `inspect eval-retry` command to retry a log file from a task that ended in error or cancellation.
- New `retryable_eval_logs()` function and `--retryable` option for `inspect list logs` to query for tasks not yet completed within a log directory.
- Add `shuffled` property to datasets to determine if they were shuffled.
- Remove unused `extensions` argument from `list_eval_logs()`.

0.3.13

- Bugfix: Inspect view was not reliably updating when new evaluation logs were written.

0.3.12

- Bugfix: `results` was not defined when no scorer was provided resulting in an error being thrown. Fixed by setting `results = EvalResults()` when no scorer is provided.
- Bugfix: The viewer was not properly handling samples without scores.

0.3.11

- Update to non-beta version of Anthropic tool use (remove legacy xml tools implementation).

Page 12 of 14

© 2025 Safety CLI Cybersecurity Inc. All Rights Reserved.