📈 Better Defaults for `StreamingDataset` (479)
- The default values of several `StreamingDataset` parameters have been updated to perform better out of the box. The new defaults suit most use cases and are detailed below:
| Parameter | Old Value | New Value | Benefit |
|-----------------------|------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------|
| `shuffle_algo` | `py1s` | `py1e` | Better shuffle and balanced downloading |
| `num_canonical_nodes` | `64 * physical_nodes` | `64 * physical_nodes` if `shuffle_algo` is `py1s` or `py2s`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algorithms |
| `shuffle_block_size` | `262,144` | `4,000,000 / num_canonical_nodes` | Consistently good shuffle for all `num_canonical_nodes` values |
| `predownload` | `max(batch_size, 256 * batch_size // num_canonical_nodes)` | `8 * batch_size` | Better balanced downloading |
| `partition_algo` | `orig` | `relaxed` | More flexible deterministic resumptions on nodes |
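
To make the table concrete, here is a minimal sketch that computes the new default values from a node count and batch size. It is not the library's implementation: the helper `resolve_new_defaults` and the `ResolvedDefaults` container are hypothetical names, and integer division for `shuffle_block_size` is an assumption, since the table only states `4,000,000 / num_canonical_nodes`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedDefaults:
    """Hypothetical container for the values in the 'New Value' column."""
    shuffle_algo: str
    num_canonical_nodes: int
    shuffle_block_size: int
    predownload: int
    partition_algo: str


def resolve_new_defaults(physical_nodes: int,
                         batch_size: int,
                         shuffle_algo: str = "py1e") -> ResolvedDefaults:
    """Apply the defaults from the table above (illustrative only)."""
    if physical_nodes < 1 or batch_size < 1:
        raise ValueError("physical_nodes and batch_size must be positive")

    # py1s and py2s keep the 64x multiplier; every other algorithm uses
    # one canonical node per physical node.
    if shuffle_algo in ("py1s", "py2s"):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes

    return ResolvedDefaults(
        shuffle_algo=shuffle_algo,
        num_canonical_nodes=num_canonical_nodes,
        # Assumption: the division in the table is rounded down.
        shuffle_block_size=4_000_000 // num_canonical_nodes,
        predownload=8 * batch_size,
        partition_algo="relaxed",
    )


if __name__ == "__main__":
    print(resolve_new_defaults(physical_nodes=8, batch_size=32))
```

Under these assumptions, 8 physical nodes with a per-device batch size of 32 resolve to 8 canonical nodes, a shuffle block size of 500,000 samples, and a predownload of 256 samples.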
💎 New Features
🤖 Streaming Simulator: Easily simulate the performance of training configurations. (385)
- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.
- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
- Easily de-risk runs and find performant parameter settings.
- Check out the [docs](https://docs.mosaicml.com/projects/streaming/en/stable/fundamentals/simulator.html) for more information!
🔢 More flexible deterministic training and resumption (476)
- Deterministic training and resumption are now possible on a wider range of node counts!
- Previously, determinism required `num_canonical_nodes` to either evenly divide the number of physical nodes or be a multiple of it.
- Now, deterministic training is possible on any number of nodes that evenly divides your run's global batch size.
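
The difference between the two rules can be sketched as a pair of checks. This is only an illustration of the conditions as stated in the bullets above, not the library's partitioning code, and all function names are hypothetical.

```python
def old_rule_allows(num_canonical_nodes: int, physical_nodes: int) -> bool:
    """Previous constraint: one count must evenly divide the other."""
    return (num_canonical_nodes % physical_nodes == 0
            or physical_nodes % num_canonical_nodes == 0)


def new_rule_allows(global_batch_size: int, physical_nodes: int) -> bool:
    """Relaxed constraint: the node count must evenly divide the global batch size."""
    return global_batch_size % physical_nodes == 0


def deterministic_node_counts(global_batch_size: int, max_nodes: int) -> list[int]:
    """List the node counts up to max_nodes that satisfy the relaxed constraint."""
    return [n for n in range(1, max_nodes + 1)
            if new_rule_allows(global_batch_size, n)]


if __name__ == "__main__":
    # With 8 canonical nodes and a global batch size of 96, a 6-node run
    # fails the old check but passes the new one.
    print(old_rule_allows(num_canonical_nodes=8, physical_nodes=6))
    print(new_rule_allows(global_batch_size=96, physical_nodes=6))
    print(deterministic_node_counts(global_batch_size=96, max_nodes=16))
```

For example, a run with 8 canonical nodes and a global batch size of 96 could previously resume deterministically on 4, 8, or 16 nodes, but not on 6 or 12; under the relaxed rule as described here, 6 and 12 qualify as well.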
🐛 Bug Fixes
- Check for invalid hash algorithm names (486)
What's Changed
* Bump fastapi from 0.103.2 to 0.104.0 by dependabot in https://github.com/mosaicml/streaming/pull/480
* Bump gitpython from 3.1.37 to 3.1.40 by dependabot in https://github.com/mosaicml/streaming/pull/481
* Bump sphinx-tabs from 3.4.1 to 3.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/482
* do not remove local directory when out is local by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/477
* Update __init__.py by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/484
* Check for invalid hash algorithm name by karan6181 in https://github.com/mosaicml/streaming/pull/486
* Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by snarayan21 in https://github.com/mosaicml/streaming/pull/476
* Better default values for StreamingDataset args by snarayan21 in https://github.com/mosaicml/streaming/pull/479
* Update release yaml to not write anything to GitHub by karan6181 in https://github.com/mosaicml/streaming/pull/487
* Bump pypandoc from 1.11 to 1.12 by dependabot in https://github.com/mosaicml/streaming/pull/490
* Bump pytest from 7.4.2 to 7.4.3 by dependabot in https://github.com/mosaicml/streaming/pull/491
* Bumping version for streaming v0.7.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/495
**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.1...v0.7.0