Mosaicml-streaming

Latest version: v0.7.5

Page 1 of 4

0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support
Using the `replication` argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.
* Replicating samples across devices (SP / TP enablement) by knighton in https://github.com/mosaicml/streaming/pull/597
* Expanded replication testing + documentation by snarayan21 in https://github.com/mosaicml/streaming/pull/607
* Make streaming use the correct number of unique samples with SP/TP by snarayan21 in https://github.com/mosaicml/streaming/pull/619

2. Overhauled Streaming Documentation
New and improved streaming documentation can be found [here](https://docs.mosaicml.com/projects/streaming/en/stable/#) -- please submit issues with any feedback.
* Major overhaul of Streaming documentation by snarayan21 in https://github.com/mosaicml/streaming/pull/636

3. `batch_size` is now required for StreamingDataset
As we have seen multiple errors and performance degradations from users not setting the `batch_size` argument to StreamingDataset, we are making it a requirement to iterate over the dataset.
* You must set batch size. There is no other way. by snarayan21 in https://github.com/mosaicml/streaming/pull/624

4. Support for Python 3.11, deprecate Python 3.8
* Add support for Python 3.11 and deprecate Python 3.8 by karan6181 in https://github.com/mosaicml/streaming/pull/586
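
As a rough illustration of the `replication` and `batch_size` items above, the sketch below groups consecutive ranks so that every rank in a group would receive the same samples, and collects keyword arguments that always include `batch_size`. The helper names (`replica_group`, `num_unique_streams`, `dataset_kwargs`), the divisibility check, and the placeholder paths are assumptions made for this example and are not taken from the library's implementation.

```python
from typing import Any, Dict, Optional


def replica_group(rank: int, replication: int) -> int:
    """Index of the group of consecutive ranks assumed to share samples."""
    if replication < 1:
        raise ValueError('replication must be a positive integer')
    return rank // replication


def num_unique_streams(world_size: int, replication: int) -> int:
    """Number of distinct sample streams across all ranks (assumed rule)."""
    if world_size % replication != 0:
        raise ValueError('world_size must be divisible by replication')
    return world_size // replication


def dataset_kwargs(remote: str, local: str, batch_size: int,
                   replication: Optional[int] = None) -> Dict[str, Any]:
    """Collect keyword arguments, always including batch_size."""
    if batch_size < 1:
        raise ValueError('batch_size must be a positive integer')
    return {
        'remote': remote,
        'local': local,
        'batch_size': batch_size,
        'replication': replication,
    }


# 8 ranks with replication=2: ranks (0, 1), (2, 3), ... share samples.
groups = [replica_group(rank, replication=2) for rank in range(8)]
print(groups)
print(num_unique_streams(world_size=8, replication=2))

# Placeholder paths. With mosaicml-streaming installed, these kwargs could be
# passed on as StreamingDataset(**kwargs).
kwargs = dataset_kwargs('s3://my-bucket/dataset', '/tmp/dataset',
                        batch_size=32, replication=2)
```

With `replication=2` on 8 ranks, only 4 distinct sample streams remain, which is why the number of unique samples per step differs from plain data parallelism.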

🐛 Bug Fixes
* [easy typo fix] fix f-string by bigning in https://github.com/mosaicml/streaming/pull/596
* Change comparison in partitions to include equals by JAEarly in https://github.com/mosaicml/streaming/pull/587
* Use type int when initializing SharedMemory size by bchiang2 in https://github.com/mosaicml/streaming/pull/604
* COCO Dataset fix -- avoids `allow_unsafe_types=True` by snarayan21 in https://github.com/mosaicml/streaming/pull/647

🔧 Improvements
* Allow writers to overwrite existing data by JAEarly in https://github.com/mosaicml/streaming/pull/594
* Update careers link by milocress in https://github.com/mosaicml/streaming/pull/611
* Update license by b-chu in https://github.com/mosaicml/streaming/pull/568
* Updated documentation for S3-compatible object stores by AIproj in https://github.com/mosaicml/streaming/pull/592
* Make yamllint consistent with Composer by b-chu in https://github.com/mosaicml/streaming/pull/583
* Switch linting workflows to ci-testing repo by b-chu in https://github.com/mosaicml/streaming/pull/616

What's Changed
* Bump uvicorn from 0.26.0 to 0.27.1 by dependabot in https://github.com/mosaicml/streaming/pull/599
* Bump pytest-split from 0.8.1 to 0.8.2 by dependabot in https://github.com/mosaicml/streaming/pull/581
* Update ruff to 0.2.2 by Skylion007 in https://github.com/mosaicml/streaming/pull/608
* Bump fastapi from 0.109.0 to 0.110.0 by dependabot in https://github.com/mosaicml/streaming/pull/610
* Bump yamllint from 1.33.0 to 1.35.1 by dependabot in https://github.com/mosaicml/streaming/pull/601
* Bump uvicorn from 0.27.1 to 0.28.0 by dependabot in https://github.com/mosaicml/streaming/pull/626
* Update moto requirement from <5,>=4.0 to >=4.0,<6 by dependabot in https://github.com/mosaicml/streaming/pull/580
* Bump furo from 2023.7.26 to 2024.1.29 by dependabot in https://github.com/mosaicml/streaming/pull/631
* Bump pypandoc from 1.12 to 1.13 by dependabot in https://github.com/mosaicml/streaming/pull/630
* Bump databricks-sdk from 0.14.0 to 0.22.0 by dependabot in https://github.com/mosaicml/streaming/pull/629
* Add batch_size to 1 if not provided for regression testing by karan6181 in https://github.com/mosaicml/streaming/pull/635
* Fixed docstring note for getting sequential sample ordering by snarayan21 in https://github.com/mosaicml/streaming/pull/632
* Bump pytest and fix failing test by snarayan21 in https://github.com/mosaicml/streaming/pull/642
* Update pytest-cov requirement from <5,>=4 to >=4,<6 by dependabot in https://github.com/mosaicml/streaming/pull/638
* Bump pydantic from 2.5.3 to 2.6.4 by dependabot in https://github.com/mosaicml/streaming/pull/639
* Bump uvicorn from 0.28.0 to 0.29.0 by dependabot in https://github.com/mosaicml/streaming/pull/640
* Bump databricks-sdk from 0.22.0 to 0.23.0 by dependabot in https://github.com/mosaicml/streaming/pull/644
* Version bump to 0.7.5 by snarayan21 in https://github.com/mosaicml/streaming/pull/650

New Contributors
* bigning made their first contribution in https://github.com/mosaicml/streaming/pull/596
* JAEarly made their first contribution in https://github.com/mosaicml/streaming/pull/587
* AIproj made their first contribution in https://github.com/mosaicml/streaming/pull/592
* milocress made their first contribution in https://github.com/mosaicml/streaming/pull/611
* bchiang2 made their first contribution in https://github.com/mosaicml/streaming/pull/604

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.4...v0.7.5

0.7.4

🐛 Bug Fixes
* Download to temporary path from azure by philipnrmn in https://github.com/mosaicml/streaming/pull/566
* fix(merge_index): scheme was not well formatted by fwertel in https://github.com/mosaicml/streaming/pull/576
* Update misplaced params of _format_remote_index_files by lsongx in https://github.com/mosaicml/streaming/pull/584
* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by snarayan21 in https://github.com/mosaicml/streaming/pull/593

What's Changed
* Bump fastapi from 0.108.0 to 0.109.0 by dependabot in https://github.com/mosaicml/streaming/pull/564
* Bump gitpython from 3.1.40 to 3.1.41 by dependabot in https://github.com/mosaicml/streaming/pull/565
* Download to temporary path from azure by philipnrmn in https://github.com/mosaicml/streaming/pull/566
* Use `tempfile.gettempdir()` instead of a hardcoded temp root. by knighton in https://github.com/mosaicml/streaming/pull/570
* fix(merge_index): scheme was not well formatted by fwertel in https://github.com/mosaicml/streaming/pull/576
* Bump uvicorn from 0.25.0 to 0.26.0 by dependabot in https://github.com/mosaicml/streaming/pull/572
* Bump sphinx-tabs from 3.4.4 to 3.4.5 by dependabot in https://github.com/mosaicml/streaming/pull/571
* Update misplaced params of _format_remote_index_files by lsongx in https://github.com/mosaicml/streaming/pull/584
* Remove .ci folder and move FILE_HEADER and CODEOWNERS by irenedea in https://github.com/mosaicml/streaming/pull/588
* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by snarayan21 in https://github.com/mosaicml/streaming/pull/593
* Bump version to 0.7.4 by snarayan21 in https://github.com/mosaicml/streaming/pull/595

New Contributors
* philipnrmn made their first contribution in https://github.com/mosaicml/streaming/pull/566
* fwertel made their first contribution in https://github.com/mosaicml/streaming/pull/576
* lsongx made their first contribution in https://github.com/mosaicml/streaming/pull/584
* irenedea made their first contribution in https://github.com/mosaicml/streaming/pull/588

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.3...v0.7.4

0.7.3

🐛 Bug Fixes
- Logging messages for new defaults only show once per rank. (543)
- Fixed padding calculation for repeat samples in the partition. (544)

🔧 Other improvements
- Update copyright license year from 2023 -> 2022-2024. (560)

What's Changed
* Logging messages from new defaults only show once per rank. by snarayan21 in https://github.com/mosaicml/streaming/pull/543
* Fixed condition for warning when partitioning over tiny datasets. by snarayan21 in https://github.com/mosaicml/streaming/pull/544
* Removing stray print statement by snarayan21 in https://github.com/mosaicml/streaming/pull/553
* Bump pydantic from 2.5.2 to 2.5.3 by dependabot in https://github.com/mosaicml/streaming/pull/548
* Bump uvicorn from 0.24.0.post1 to 0.25.0 by dependabot in https://github.com/mosaicml/streaming/pull/549
* Bump fastapi from 0.104.1 to 0.108.0 by dependabot in https://github.com/mosaicml/streaming/pull/557
* Bump pytest from 7.4.3 to 7.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/558
* Update copyright: 2023 -> 2022-2024. by knighton in https://github.com/mosaicml/streaming/pull/560
* Bump version to 0.7.3 by karan6181 in https://github.com/mosaicml/streaming/pull/562


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.2...v0.7.3

0.7.2

💎 New Features
1. Canned ACL Support (512)
Added support for canned ACLs on AWS S3 through the environment variable `S3_CANNED_ACL`. Check out the [Canned ACL](https://docs.mosaicml.com/projects/streaming/en/stable/how_to_guides/configure_cloud_storage_credentials.html#canned-acl) documentation to learn how to use it.

2. Allow/reject datasets containing unsafe types (519)
The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag `allow_unsafe_types` to the `StreamingDataset` class to allow or reject datasets containing pickle-encoded data.
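
A minimal sketch of how the two settings above might be used together. `S3_CANNED_ACL` and `allow_unsafe_types` come from these release notes; the helper names (`set_canned_acl`, `reader_kwargs`), the validation step, and the placeholder path are assumptions added for illustration. The ACL names listed are the canned ACLs defined by Amazon S3.

```python
import os
from typing import Any, Dict

# Canned ACL names defined by Amazon S3.
S3_CANNED_ACLS = {
    'private',
    'public-read',
    'public-read-write',
    'aws-exec-read',
    'authenticated-read',
    'bucket-owner-read',
    'bucket-owner-full-control',
    'log-delivery-write',
}


def set_canned_acl(acl: str) -> None:
    """Validate the ACL name (hypothetical check) and export it."""
    if acl not in S3_CANNED_ACLS:
        raise ValueError(f'Unknown canned ACL: {acl!r}')
    os.environ['S3_CANNED_ACL'] = acl


def reader_kwargs(local: str, batch_size: int,
                  allow_unsafe_types: bool = False) -> Dict[str, Any]:
    """Keyword arguments for a reader that rejects pickle data by default."""
    return {
        'local': local,
        'batch_size': batch_size,
        'allow_unsafe_types': allow_unsafe_types,
    }


# Set the variable before any shards are uploaded to S3.
set_canned_acl('bucket-owner-full-control')

# Placeholder path. With mosaicml-streaming installed, these kwargs could be
# passed on as StreamingDataset(**kwargs).
kwargs = reader_kwargs('/tmp/dataset', batch_size=32)
```

Leaving `allow_unsafe_types` at `False` is the cautious choice: opt in only for datasets whose origin you trust.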



🐛 Bug Fixes
- Retrieve batch size correctly from vision yamls for the streaming simulator (501)
- Fix for CVE-2023-47248 (504)
- Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (514)
  - A proportion of `None`, rather than the string `'None'`, is now handled correctly.
  - A repeat of `None`, rather than the string `'None'`, is now handled correctly.
- Added warning for StreamingDataset subclass defaults
- Fix sample partitioning algorithm bug for tiny datasets (517)

🔧 Improvements
- Added warning messages for new streaming dataset defaults to inform users about the old and new values. (502)

What's Changed
* Migrate pydocstyle to ruff by Skylion007 in https://github.com/mosaicml/streaming/pull/500
* Bump fastapi from 0.104.0 to 0.104.1 by dependabot in https://github.com/mosaicml/streaming/pull/496
* Bump uvicorn from 0.23.2 to 0.24.0.post1 by dependabot in https://github.com/mosaicml/streaming/pull/497
* Retrieve batch size correctly from vision yamls for simulator by snarayan21 in https://github.com/mosaicml/streaming/pull/501
* Adding warning messages for new defaults by snarayan21 in https://github.com/mosaicml/streaming/pull/502
* Fix for CVE-2023-47248 by bandish-shah in https://github.com/mosaicml/streaming/pull/504
* Bump pydantic from 2.4.2 to 2.5.2 by dependabot in https://github.com/mosaicml/streaming/pull/513
* Bump yamllint from 1.32.0 to 1.33.0 by dependabot in https://github.com/mosaicml/streaming/pull/506
* Fixed comments and update dataframe_to_MDS API signature by karan6181 in https://github.com/mosaicml/streaming/pull/515
* Simulator bug fixes (proportion, repeat, yaml ingestion) by snarayan21 in https://github.com/mosaicml/streaming/pull/514
* Add support for the Canned ACL environment variable for AWS S3 by karan6181 in https://github.com/mosaicml/streaming/pull/512
* Fixed bugs when trying to use very small datasets by snarayan21 in https://github.com/mosaicml/streaming/pull/517
* Bump databricks-sdk from 0.8.0 to 0.14.0 by dependabot in https://github.com/mosaicml/streaming/pull/518
* Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) by knighton in https://github.com/mosaicml/streaming/pull/519
* improve exception error messages for downloading by Skylion007 in https://github.com/mosaicml/streaming/pull/525
* doc: add NDArray format by OrenLeung in https://github.com/mosaicml/streaming/pull/527
* Offload exception to mds_write. by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/528
* Add allow_unsafe_types parameter to the streaming regression tests by karan6181 in https://github.com/mosaicml/streaming/pull/531
* Bump version to 0.7.2 by karan6181 in https://github.com/mosaicml/streaming/pull/532

New Contributors
* OrenLeung made their first contribution in https://github.com/mosaicml/streaming/pull/527

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.1...v0.7.2

0.7.1

Not secure
🐛 Bug Fixes

- Simulation from command line with `simulator` is fixed (499)

What's Changed
* Fixing simulator command with simulation directories being included in package by snarayan21 in https://github.com/mosaicml/streaming/pull/499


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.0...v0.7.1

0.7.0

Not secure
📈 Better Defaults for `StreamingDataset` (479)
- The default values for `StreamingDataset` have been updated to be more performant and to suit most use cases, as detailed below:

| Parameter | Old Value | New Value | Benefit |
|-----------------------|------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------|
| `shuffle_algo` | `py1s` | `py1e` | Better shuffle and balanced downloading |
| `num_canonical_nodes` | `64 * physical nodes` | if `py1s` or `py2s`, `64 * physical_nodes`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algos |
| `shuffle_block_size` | `262,144` | `4,000,000 / num_canonical_nodes` | Consistently good shuffle for all `num_canonical_nodes` values |
| `predownload` | `max(batch_size, 256 * batch_size // num_canonical_nodes)` | `8 * batch_size` | Better balanced downloading |
| `partition_algo` | `orig` | `relaxed` | More flexible deterministic resumptions on nodes |
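
To make the table concrete, the sketch below computes the old and new default values for a given number of physical nodes and batch size. It follows the formulas in the table only; the function names are hypothetical, integer division is an assumption, and the library may apply additional rounding or bounds that are not shown here.

```python
from typing import Dict, Union

Defaults = Dict[str, Union[str, int]]


def old_defaults(physical_nodes: int, batch_size: int) -> Defaults:
    """Defaults before 0.7.0, as listed in the 'Old Value' column."""
    num_canonical_nodes = 64 * physical_nodes
    return {
        'shuffle_algo': 'py1s',
        'num_canonical_nodes': num_canonical_nodes,
        'shuffle_block_size': 262_144,
        'predownload': max(batch_size,
                           256 * batch_size // num_canonical_nodes),
        'partition_algo': 'orig',
    }


def new_defaults(physical_nodes: int, batch_size: int,
                 shuffle_algo: str = 'py1e') -> Defaults:
    """Defaults from 0.7.0 onward, as listed in the 'New Value' column."""
    if shuffle_algo in ('py1s', 'py2s'):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes
    return {
        'shuffle_algo': shuffle_algo,
        'num_canonical_nodes': num_canonical_nodes,
        # Integer division is an assumption made for this sketch.
        'shuffle_block_size': 4_000_000 // num_canonical_nodes,
        'predownload': 8 * batch_size,
        'partition_algo': 'relaxed',
    }


before = old_defaults(physical_nodes=4, batch_size=32)
after = new_defaults(physical_nodes=4, batch_size=32)
for name in after:
    print(f'{name}: {before[name]} -> {after[name]}')
```

For 4 nodes and a batch size of 32, `predownload` grows from 32 to 256 samples, while `num_canonical_nodes` drops from 256 to 4.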

💎 New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (385)
- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.
- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
- Easily de-risk runs and find performant parameter settings.
- Check out the [docs](https://docs.mosaicml.com/projects/streaming/en/stable/fundamentals/simulator.html) for more information!

🔢 More flexible deterministic training and resumption (476)
- Deterministic training and resumption are now possible on a wider range of node counts!
- Previously, the `num_canonical_nodes` parameter had to divide or be a multiple of the number of physical nodes for determinism.
- Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.
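
The two rules above can be sketched as simple divisibility checks. The function names are hypothetical, and the checks cover only the conditions stated in these notes; other requirements for determinism, such as keeping the same seed and dataset settings across runs, are not modeled.

```python
def deterministic_before_0_7_0(num_canonical_nodes: int,
                               physical_nodes: int) -> bool:
    """Old rule: one node count must divide the other."""
    return (num_canonical_nodes % physical_nodes == 0
            or physical_nodes % num_canonical_nodes == 0)


def deterministic_from_0_7_0(physical_nodes: int,
                             global_batch_size: int) -> bool:
    """New rule: the node count must evenly divide the global batch size."""
    return global_batch_size % physical_nodes == 0


# Resuming a run with 64 canonical nodes on 6 physical nodes.
print(deterministic_before_0_7_0(num_canonical_nodes=64, physical_nodes=6))
print(deterministic_from_0_7_0(physical_nodes=6, global_batch_size=96))
```

Under the old rule, 6 physical nodes cannot be used with 64 canonical nodes; under the new rule the same setup qualifies as long as the global batch size (96 here) is divisible by 6.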

🐛 Bug Fixes

- Check for invalid hash algorithm names (486)

What's Changed
* Bump fastapi from 0.103.2 to 0.104.0 by dependabot in https://github.com/mosaicml/streaming/pull/480
* Bump gitpython from 3.1.37 to 3.1.40 by dependabot in https://github.com/mosaicml/streaming/pull/481
* Bump sphinx-tabs from 3.4.1 to 3.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/482
* do not remove local directory when out is local by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/477
* Update __init__.py by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/484
* Check for invalid hash algorithm name by karan6181 in https://github.com/mosaicml/streaming/pull/486
* Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by snarayan21 in https://github.com/mosaicml/streaming/pull/476
* Better default values for StreamingDataset args by snarayan21 in https://github.com/mosaicml/streaming/pull/479
* Update release yaml to not write anything to GitHub by karan6181 in https://github.com/mosaicml/streaming/pull/487
* Bump pypandoc from 1.11 to 1.12 by dependabot in https://github.com/mosaicml/streaming/pull/490
* Bump pytest from 7.4.2 to 7.4.3 by dependabot in https://github.com/mosaicml/streaming/pull/491
* Bumping version for streaming v0.7.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/495


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.1...v0.7.0
