Released on 18 December, 2023
Description
- Conflicting directives in the SmartSim packaging instructions were
fixed
- `sacct` and `sstat` errors are now fatal for Slurm-based workflow
  executions
- Added documentation section about ML features and TorchScript
- Added TorchScript functions to Online Analysis tutorial
- Added multi-DB example to documentation
- Improved test stability on HPC systems
- Added support for producing & consuming telemetry outputs
- Split tests into groups for parallel execution in CI/CD pipeline
- Changed the signature of `Experiment.summary()`
- Exposed the `first_device` parameter for scripts, functions, and models
- Added support for MINBATCHTIMEOUT in model execution
- Removed support for RedisAI 1.2.5 in favor of a RedisAI 1.2.7 commit
- Added support for multiple databases
Detailed Notes
- Several conflicting directives between `setup.py` and `setup.cfg`
  were fixed to mitigate warnings issued when building the pip wheel.
  ([SmartSim-PR435](https://github.com/CrayLabs/SmartSim/pull/435))
- When the Slurm commands `sacct` and `sstat` returned an error, the
  error was ignored and SmartSim's state could become inconsistent. To
  prevent this, errors raised by `sacct` or `sstat` now result in an
  exception.
  ([SmartSim-PR392](https://github.com/CrayLabs/SmartSim/pull/392))
- A section named *ML Features* was added to the documentation. It
  contains multiple examples of how ML models and functions can be
  added to and executed on the DB. TorchScript-based post-processing
  was also added to the *Online Analysis* tutorial
  ([SmartSim-PR411](https://github.com/CrayLabs/SmartSim/pull/411))
- An example of how to use multiple Orchestrators concurrently was
added to the documentation
([SmartSim-PR409](https://github.com/CrayLabs/SmartSim/pull/409))
- The test infrastructure was improved. Tests on HPC systems are now
  stable, and issues such as `Orchestrator` instances that were not
  stopped or experiments created in the wrong paths have been fixed
  ([SmartSim-PR381](https://github.com/CrayLabs/SmartSim/pull/381))
- A telemetry monitor was added to check updates and produce events
for SmartDashboard
([SmartSim-PR426](https://github.com/CrayLabs/SmartSim/pull/426))
- Split tests into `group_a`, `group_b`, and `slow_tests` for parallel
  execution in the CI/CD pipeline
  ([SmartSim-PR417](https://github.com/CrayLabs/SmartSim/pull/417),
  [SmartSim-PR424](https://github.com/CrayLabs/SmartSim/pull/424))
- Renamed the `format` argument of `Experiment.summary()` to `style`.
  This is an API break
  ([SmartSim-PR391](https://github.com/CrayLabs/SmartSim/pull/391))
- Added support for the `first_device` parameter for scripts,
  functions, and models. This causes them to be loaded onto
  `num_devices` consecutive devices, starting at `first_device`
  ([SmartSim-PR394](https://github.com/CrayLabs/SmartSim/pull/394))
- Added support for `MINBATCHTIMEOUT` in model execution, which caps
  how long execution waits for a minimum number of model execution
  operations to accumulate before they are executed as a batch
  ([SmartSim-PR387](https://github.com/CrayLabs/SmartSim/pull/387))
- RedisAI 1.2.5 is no longer supported. The only supported RedisAI
  version is now 1.2.7. Because the officially released RedisAI 1.2.7
  has a bug that breaks the build process on macOS, SmartSim instead
  uses commit
  [634916c](https://github.com/RedisAI/RedisAI/commit/634916c722e718cc6ea3fad46e63f7d798f9adc2)
  from RedisAI's GitHub repository, in which this bug has been fixed.
  This applies to all operating systems.
  ([SmartSim-PR383](https://github.com/CrayLabs/SmartSim/pull/383))
- Added support for creating multiple databases, each with a unique
  identifier.
  ([SmartSim-PR342](https://github.com/CrayLabs/SmartSim/pull/342))
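As a rough illustration of the `first_device` behavior described above, the following sketch computes which devices a script, function, or model would be placed on. It is not SmartSim code: the helper `device_list` is hypothetical, and the `"<DEVICE>:<index>"` naming scheme is an assumption made for this example.

```python
from typing import List


def device_list(device: str, num_devices: int, first_device: int = 0) -> List[str]:
    """Hypothetical helper: name the consecutive devices an entity is loaded onto.

    Assumes devices are addressed as "<DEVICE>:<index>" and that the entity is
    placed on num_devices consecutive indices starting at first_device.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if first_device < 0:
        raise ValueError("first_device must be non-negative")
    return [f"{device}:{i}" for i in range(first_device, first_device + num_devices)]


# A model loaded onto two GPUs, leaving GPU 0 free for other work.
print(device_list("GPU", num_devices=2, first_device=1))  # ['GPU:1', 'GPU:2']
```

With the default `first_device=0`, placement starts at the first device on the node; raising it shifts the whole window of `num_devices` devices.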
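The interplay between a minimum batch size and `MINBATCHTIMEOUT` described above can be sketched with a small simulation. This is a simplified model, not RedisAI's scheduler: it assumes the timeout is measured from the arrival of the oldest queued request, and it ignores any maximum batch size and the time spent executing a batch. The function `simulate_batches` is a hypothetical name.

```python
from typing import List, Tuple


def simulate_batches(
    arrivals_ms: List[float], min_batch_size: int, min_batch_timeout_ms: float
) -> List[Tuple[float, int]]:
    """Return (execution_time_ms, batch_size) for each batch in a simplified model.

    A batch runs as soon as min_batch_size requests are queued, or once
    min_batch_timeout_ms has elapsed since the oldest queued request arrived.
    """
    if min_batch_size < 1:
        raise ValueError("min_batch_size must be at least 1")
    batches: List[Tuple[float, int]] = []
    pending: List[float] = []  # arrival times of requests waiting to run
    for t in sorted(arrivals_ms):
        # The timeout expired before this request arrived: flush what is queued.
        if pending and t > pending[0] + min_batch_timeout_ms:
            batches.append((pending[0] + min_batch_timeout_ms, len(pending)))
            pending = []
        pending.append(t)
        # Enough requests have accumulated: run them immediately as one batch.
        if len(pending) >= min_batch_size:
            batches.append((t, len(pending)))
            pending = []
    # Requests still queued at the end run once their timeout expires.
    if pending:
        batches.append((pending[0] + min_batch_timeout_ms, len(pending)))
    return batches


print(simulate_batches([0, 1, 2, 50, 120], min_batch_size=3, min_batch_timeout_ms=10))
# [(2, 3), (60, 1), (130, 1)]
```

In this model the first three requests reach the minimum and run together at 2 ms, while the isolated requests at 50 ms and 120 ms each wait out the 10 ms timeout and then run alone instead of waiting indefinitely for two more requests.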
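The change that makes `sacct` and `sstat` errors fatal follows a common pattern: check the command's exit status and raise instead of continuing with stale data. The sketch below shows that pattern in plain Python; it is not SmartSim's implementation. `SlurmQueryError` and `run_slurm_query` are hypothetical names, and the `runner` parameter exists only so the example can run without a Slurm installation.

```python
import subprocess
from typing import Callable, Sequence


class SlurmQueryError(RuntimeError):
    """Hypothetical exception raised when a Slurm query command fails."""


def run_slurm_query(
    cmd: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Run a Slurm query such as sacct or sstat and return its stdout.

    A non-zero exit status raises instead of being ignored, so callers never
    update job state from a failed query.
    """
    result = runner(list(cmd), capture_output=True, text=True)
    if result.returncode != 0:
        raise SlurmQueryError(
            f"{cmd[0]} exited with status {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


def fake_failing_query(args, **kwargs):
    # Stand-in for subprocess.run that simulates a failed query.
    return subprocess.CompletedProcess(args, 1, stdout="", stderr="simulated failure")


try:
    run_slurm_query(["sacct", "--noheader", "-j", "1234"], runner=fake_failing_query)
except SlurmQueryError as err:
    print(err)  # sacct exited with status 1: simulated failure
```

Raising at the point of failure means a caller either handles the error explicitly or stops, rather than silently carrying an inconsistent view of job status forward.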