What's Changed
* Tests no longer have a `get_metadata()` method. The dependency helper now uses the Test's class name instead.
* Introduced the concept of SUT capabilities (`ProducesPerTokenLogProbabilities`, `AcceptsChatPrompt`, `AcceptsTextPrompt`). SUTs and Tests must now declare their capabilities and requirements, respectively, in the `newhelm_sut` and `newhelm_test` decorators.
* SUTs can now return per-token log probabilities in a `SUTCompletion`. `OpenAIChat` has been updated to support this capability.
* SafeTest updates:
  * Restructured to have one test per hazard, with each test grouping all applicable persona types (typical, malicious, or vulnerable).
  * Results are reported as a mapping from persona type to `PersonaResult`, which contains `num_items` in addition to `frac_safe`.
  * Added tests for new hazards.
* Added a new test, DiscrimEval.

**Full Changelog**: https://github.com/mlcommons/newhelm/compare/v0.2.4...v0.2.5
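
The SUT capability mechanism described above can be pictured with a small, self-contained sketch. It does not use newhelm itself. The names `sut_with`, `requires_capabilities`, `missing_capabilities`, `can_run`, and the `Fake*` classes are hypothetical stand-ins for the idea behind the `newhelm_sut` and `newhelm_test` decorators, assuming capabilities are plain marker classes that are compared as sets.

```python
"""Hypothetical sketch of SUT capability matching (NOT the newhelm API)."""
from typing import Callable, Sequence, Type, TypeVar


class SUTCapability:
    """Marker base class for something a SUT can do."""


# Capability names taken from the release notes; here they are just markers.
class AcceptsTextPrompt(SUTCapability):
    pass


class AcceptsChatPrompt(SUTCapability):
    pass


class ProducesPerTokenLogProbabilities(SUTCapability):
    pass


T = TypeVar("T", bound=type)


def sut_with(capabilities: Sequence[Type[SUTCapability]]) -> Callable[[T], T]:
    """Hypothetical stand-in for `newhelm_sut`: records what a SUT can do."""
    def decorator(cls: T) -> T:
        cls.capabilities = frozenset(capabilities)
        return cls
    return decorator


def requires_capabilities(requirements: Sequence[Type[SUTCapability]]) -> Callable[[T], T]:
    """Hypothetical stand-in for `newhelm_test`: records what a Test needs."""
    def decorator(cls: T) -> T:
        cls.requirements = frozenset(requirements)
        return cls
    return decorator


def missing_capabilities(test_cls: type, sut_cls: type) -> set:
    """Return the capabilities the Test requires that the SUT does not declare."""
    return set(test_cls.requirements - sut_cls.capabilities)


def can_run(test_cls: type, sut_cls: type) -> bool:
    """A Test can run against a SUT only if no required capability is missing."""
    return not missing_capabilities(test_cls, sut_cls)


# Hypothetical SUTs and Test used only for illustration.
@sut_with([AcceptsChatPrompt, AcceptsTextPrompt, ProducesPerTokenLogProbabilities])
class FakeChatSUT:
    pass


@sut_with([AcceptsTextPrompt])
class FakeTextOnlySUT:
    pass


@requires_capabilities([AcceptsTextPrompt, ProducesPerTokenLogProbabilities])
class FakeLogprobEval:
    pass


if __name__ == "__main__":
    print(can_run(FakeLogprobEval, FakeChatSUT))      # True
    print(can_run(FakeLogprobEval, FakeTextOnlySUT))  # False
    print(sorted(c.__name__ for c in missing_capabilities(FakeLogprobEval, FakeTextOnlySUT)))
    # ['ProducesPerTokenLogProbabilities']
```

Treating capabilities as sets makes the compatibility check a single set difference, and reporting the missing items (rather than only a boolean) explains why a Test was skipped for a given SUT.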