Mosaicml-streaming

Latest version: v0.9.1

Page 2 of 5

0.7.4

🐛 Bug Fixes
* Download to temporary path from azure by philipnrmn in https://github.com/mosaicml/streaming/pull/566
* fix(merge_index): scheme was not well formatted by fwertel in https://github.com/mosaicml/streaming/pull/576
* Update misplaced params of _format_remote_index_files by lsongx in https://github.com/mosaicml/streaming/pull/584
* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by snarayan21 in https://github.com/mosaicml/streaming/pull/593

What's Changed
* Bump fastapi from 0.108.0 to 0.109.0 by dependabot in https://github.com/mosaicml/streaming/pull/564
* Bump gitpython from 3.1.40 to 3.1.41 by dependabot in https://github.com/mosaicml/streaming/pull/565
* Download to temporary path from azure by philipnrmn in https://github.com/mosaicml/streaming/pull/566
* Use `tempfile.gettempdir()` instead of a hardcoded temp root. by knighton in https://github.com/mosaicml/streaming/pull/570
* fix(merge_index): scheme was not well formatted by fwertel in https://github.com/mosaicml/streaming/pull/576
* Bump uvicorn from 0.25.0 to 0.26.0 by dependabot in https://github.com/mosaicml/streaming/pull/572
* Bump sphinx-tabs from 3.4.4 to 3.4.5 by dependabot in https://github.com/mosaicml/streaming/pull/571
* Update misplaced params of _format_remote_index_files by lsongx in https://github.com/mosaicml/streaming/pull/584
* Remove .ci folder and move FILE_HEADER and CODEOWNERS by irenedea in https://github.com/mosaicml/streaming/pull/588
* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by snarayan21 in https://github.com/mosaicml/streaming/pull/593
* Bump version to 0.7.4 by snarayan21 in https://github.com/mosaicml/streaming/pull/595

New Contributors
* philipnrmn made their first contribution in https://github.com/mosaicml/streaming/pull/566
* fwertel made their first contribution in https://github.com/mosaicml/streaming/pull/576
* lsongx made their first contribution in https://github.com/mosaicml/streaming/pull/584
* irenedea made their first contribution in https://github.com/mosaicml/streaming/pull/588

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.3...v0.7.4

0.7.3

🐛 Bug Fixes
- Logging messages for new defaults only show once per rank. (543)
- Fixed padding calculation for repeat samples in the partition. (544)

🔧 Other improvements
- Update copyright license year from 2023 -> 2022-2024. (560)

What's Changed
* Logging messages from new defaults only show once per rank. by snarayan21 in https://github.com/mosaicml/streaming/pull/543
* Fixed condition for warning when partitioning over tiny datasets. by snarayan21 in https://github.com/mosaicml/streaming/pull/544
* Removing stray print statement by snarayan21 in https://github.com/mosaicml/streaming/pull/553
* Bump pydantic from 2.5.2 to 2.5.3 by dependabot in https://github.com/mosaicml/streaming/pull/548
* Bump uvicorn from 0.24.0.post1 to 0.25.0 by dependabot in https://github.com/mosaicml/streaming/pull/549
* Bump fastapi from 0.104.1 to 0.108.0 by dependabot in https://github.com/mosaicml/streaming/pull/557
* Bump pytest from 7.4.3 to 7.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/558
* Update copyright: 2023 -> 2022-2024. by knighton in https://github.com/mosaicml/streaming/pull/560
* Bump version to 0.7.3 by karan6181 in https://github.com/mosaicml/streaming/pull/562


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.2...v0.7.3

0.7.2

💎 New Features
1. Canned ACL Support (512)
Added support for canned ACLs on AWS S3 through the environment variable `S3_CANNED_ACL`. Check out the [Canned ACL](https://docs.mosaicml.com/projects/streaming/en/stable/how_to_guides/configure_cloud_storage_credentials.html#canned-acl) documentation for how to use it.

2. Allow/reject datasets containing unsafe types (519)
The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag `allow_unsafe_types` to the `StreamingDataset` class to allow or reject datasets containing pickle-encoded data.
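
As a rough illustration of the two features above, the sketch below reads `S3_CANNED_ACL` from the environment and rejects pickle-encoded columns unless explicitly allowed. The helper names, the `{"ACL": ...}` argument dictionary, and the `"pkl"` encoding label are assumptions made for this example; this is not the library's implementation.

```python
import os

# Assumed label for pickle-encoded MDS columns; illustrative only.
UNSAFE_ENCODINGS = {"pkl"}


def canned_acl_extra_args(environ=os.environ):
    """Build upload arguments from S3_CANNED_ACL if it is set (hypothetical helper)."""
    acl = environ.get("S3_CANNED_ACL")
    return {"ACL": acl} if acl else {}


def check_column_encodings(column_encodings, allow_unsafe_types=False):
    """Raise ValueError if a column uses an unsafe encoding and the flag is off."""
    unsafe = sorted(set(column_encodings) & UNSAFE_ENCODINGS)
    if unsafe and not allow_unsafe_types:
        raise ValueError(f"Dataset contains unsafe column encodings: {unsafe}")
    return list(column_encodings)
```

Rejecting by default and requiring an explicit opt-in keeps untrusted datasets from being unpickled by accident.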



🐛 Bug Fixes
- Retrieve batch size correctly from vision yamls for the streaming simulator (501)
- Fix for CVE-2023-47248 (504)
- Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (514)
    - Proportion of None instead of a string 'None' is now handled correctly.
    - Repeat of None instead of a string 'None' is now handled correctly.
    - Added warning for StreamingDataset subclass defaults
- Fix sample partitioning algorithm bug for tiny datasets (517)

🔧 Improvements
- Added warning messages for new streaming dataset defaults to inform users about the old and new values. (502)

What's Changed
* Migrate pydocstyle to ruff by Skylion007 in https://github.com/mosaicml/streaming/pull/500
* Bump fastapi from 0.104.0 to 0.104.1 by dependabot in https://github.com/mosaicml/streaming/pull/496
* Bump uvicorn from 0.23.2 to 0.24.0.post1 by dependabot in https://github.com/mosaicml/streaming/pull/497
* Retrieve batch size correctly from vision yamls for simulator by snarayan21 in https://github.com/mosaicml/streaming/pull/501
* Adding warning messages for new defaults by snarayan21 in https://github.com/mosaicml/streaming/pull/502
* Fix for CVE-2023-47248 by bandish-shah in https://github.com/mosaicml/streaming/pull/504
* Bump pydantic from 2.4.2 to 2.5.2 by dependabot in https://github.com/mosaicml/streaming/pull/513
* Bump yamllint from 1.32.0 to 1.33.0 by dependabot in https://github.com/mosaicml/streaming/pull/506
* Fixed comments and update dataframe_to_MDS API signature by karan6181 in https://github.com/mosaicml/streaming/pull/515
* Simulator bug fixes (proportion, repeat, yaml ingestion) by snarayan21 in https://github.com/mosaicml/streaming/pull/514
* Add support for the Canned ACL environment variable for AWS S3 by karan6181 in https://github.com/mosaicml/streaming/pull/512
* Fixed bugs when trying to use very small datasets by snarayan21 in https://github.com/mosaicml/streaming/pull/517
* Bump databricks-sdk from 0.8.0 to 0.14.0 by dependabot in https://github.com/mosaicml/streaming/pull/518
* Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) by knighton in https://github.com/mosaicml/streaming/pull/519
* improve exception error messages for downloading by Skylion007 in https://github.com/mosaicml/streaming/pull/525
* doc: add NDArray format by OrenLeung in https://github.com/mosaicml/streaming/pull/527
* Offload exception to mds_write. by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/528
* Add allow_unsafe_types parameter to the streaming regression tests by karan6181 in https://github.com/mosaicml/streaming/pull/531
* Bump version to 0.7.2 by karan6181 in https://github.com/mosaicml/streaming/pull/532

New Contributors
* OrenLeung made their first contribution in https://github.com/mosaicml/streaming/pull/527

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.1...v0.7.2

0.7.1

Not secure
🐛 Bug Fixes

- Simulation from command line with `simulator` is fixed (499)

What's Changed
* Fixing simulator command with simulation directories being included in package by snarayan21 in https://github.com/mosaicml/streaming/pull/499


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.0...v0.7.1

0.7.0

Not secure
📈 Better Defaults for `StreamingDataset` (479)
- The default values for `StreamingDataset` have been updated to be more performant and to suit most use cases, as detailed below:

| Parameter | Old Value | New Value | Benefit |
|-----------------------|------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------|
| `shuffle_algo` | `py1s` | `py1e` | Better shuffle and balanced downloading |
| `num_canonical_nodes` | `64 * physical nodes` | if `py1s` or `py2s`, `64 * physical_nodes`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algos |
| `shuffle_block_size` | `262,144` | `4,000,000 / num_canonical_nodes` | Consistently good shuffle for all `num_canonical_nodes` values |
| `predownload` | `max(batch_size, 256 * batch_size // num_canonical_nodes)` | `8 * batch_size` | Better balanced downloading |
| `partition_algo` | `orig` | `relaxed` | More flexible deterministic resumptions on nodes |
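
The formulas in the table can be turned into a small calculator. This is a minimal sketch of the arithmetic only: the function names are hypothetical, and the use of integer division for `shuffle_block_size` is an assumption, since the table only gives `4,000,000 / num_canonical_nodes`.

```python
def new_defaults(physical_nodes, batch_size, shuffle_algo="py1e"):
    """Compute the new default values from the table above (illustrative only)."""
    if shuffle_algo in ("py1s", "py2s"):
        num_canonical_nodes = 64 * physical_nodes
    else:
        num_canonical_nodes = physical_nodes
    return {
        "shuffle_algo": shuffle_algo,
        "num_canonical_nodes": num_canonical_nodes,
        # Integer division is an assumption made for this sketch.
        "shuffle_block_size": 4_000_000 // num_canonical_nodes,
        "predownload": 8 * batch_size,
        "partition_algo": "relaxed",
    }


def old_predownload(batch_size, num_canonical_nodes):
    """Old default for `predownload`, as given in the table."""
    return max(batch_size, 256 * batch_size // num_canonical_nodes)
```

For example, `new_defaults(physical_nodes=8, batch_size=32)` gives `num_canonical_nodes=8`, `shuffle_block_size=500000`, and `predownload=256`.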

💎 New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (385)
- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.
- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
- Easily de-risk runs and find performant parameter settings.
- Check out the [docs](https://docs.mosaicml.com/projects/streaming/en/stable/fundamentals/simulator.html) for more information!

🔢 More flexible deterministic training and resumption (476)
- Deterministic training and resumption are now possible on a wider range of node counts!
- Previously, the `num_canonical_nodes` parameter had to divide or be a multiple of the number of physical nodes for determinism.
- Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.
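
The difference between the two constraints can be sketched as a pair of checks. This is a simplified reading of the notes above with hypothetical function names; the library may apply further conditions.

```python
def old_rule_ok(num_canonical_nodes, physical_nodes):
    """Previous constraint: one value must divide the other."""
    return (num_canonical_nodes % physical_nodes == 0
            or physical_nodes % num_canonical_nodes == 0)


def new_rule_ok(global_batch_size, physical_nodes):
    """Relaxed constraint as described above: node count divides the global batch size."""
    return global_batch_size % physical_nodes == 0
```

With `num_canonical_nodes=8`, a run on 6 physical nodes fails the old check, but passes the new one if the global batch size is, say, 48.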

🐛 Bug Fixes

- Check for invalid hash algorithm names (486)
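
A check of this kind might look like the sketch below, which validates names against Python's `hashlib`. The helper name is hypothetical, and the set of algorithms the library actually accepts may differ from what `hashlib` reports.

```python
import hashlib


def validate_hash_names(names):
    """Raise ValueError for any name hashlib does not recognize on this system."""
    invalid = [name for name in names if name not in hashlib.algorithms_available]
    if invalid:
        raise ValueError(f"Invalid hash algorithm name(s): {invalid}")
    return list(names)
```

Failing early on a bad name avoids discovering the typo only after shards have already been written.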

What's Changed
* Bump fastapi from 0.103.2 to 0.104.0 by dependabot in https://github.com/mosaicml/streaming/pull/480
* Bump gitpython from 3.1.37 to 3.1.40 by dependabot in https://github.com/mosaicml/streaming/pull/481
* Bump sphinx-tabs from 3.4.1 to 3.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/482
* do not remove local directory when out is local by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/477
* Update __init__.py by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/484
* Check for invalid hash algorithm name by karan6181 in https://github.com/mosaicml/streaming/pull/486
* Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by snarayan21 in https://github.com/mosaicml/streaming/pull/476
* Better default values for StreamingDataset args by snarayan21 in https://github.com/mosaicml/streaming/pull/479
* Update release yaml to not write anything to GitHub by karan6181 in https://github.com/mosaicml/streaming/pull/487
* Bump pypandoc from 1.11 to 1.12 by dependabot in https://github.com/mosaicml/streaming/pull/490
* Bump pytest from 7.4.2 to 7.4.3 by dependabot in https://github.com/mosaicml/streaming/pull/491
* Bumping version for streaming v0.7.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/495


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.1...v0.7.0

0.6.1

Not secure
💎 New Features

🚃 Merge metadata from a dataset's subdirectories to form one unified dataset. (449)
- Added the `merge_index()` utility method to merge the index files of an MDS dataset's subdirectories. The subdirectories can be local paths or URL paths on any supported cloud provider.
- Check out the [dataset conversion](https://docs.mosaicml.com/projects/streaming/en/stable/examples/multiprocess_dataset_conversion.html) and [Spark Dataframe to MDS](https://docs.mosaicml.com/projects/streaming/en/stable/examples/spark_dataframe_to_MDS.html) Jupyter notebooks for examples in action.
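
Conceptually, merging index files means concatenating the shard lists of each subdirectory and re-rooting the shard file names. The sketch below works on already-parsed index dictionaries; the function name and the assumed index layout (`"version"`, `"shards"`, and `"raw_data"`/`"zip_data"` entries with a `"basename"`) are illustrative assumptions, not the behavior of `merge_index()` itself.

```python
import posixpath


def merge_indexes(indexes_by_subdir):
    """Merge per-subdirectory index dicts into a single index dict.

    `indexes_by_subdir` maps a subdirectory name to its parsed index
    (assumed shape: {"version": ..., "shards": [{"raw_data": {"basename": ...}}, ...]}).
    """
    merged_shards = []
    for subdir, index in indexes_by_subdir.items():
        for shard in index["shards"]:
            shard = dict(shard)  # shallow copy so the input is not mutated
            for key in ("raw_data", "zip_data"):
                info = shard.get(key)
                if info:
                    # Re-root the shard file name under its subdirectory.
                    shard[key] = {**info, "basename": posixpath.join(subdir, info["basename"])}
            merged_shards.append(shard)
    # The version number is an assumption made for this sketch.
    return {"version": 2, "shards": merged_shards}
```

Because only file names are rewritten, the shard files themselves can stay where they were written.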

🔁 Retry uploading a file to a cloud provider path. (448)
- Added upload retry logic with backoff and jitter during dataset conversion as part of parameter `retry` in [Writer](https://github.com/mosaicml/streaming/blob/v0.6.1/streaming/base/format/base/writer.py#L65).
```python
from streaming import MDSWriter

# `dataset` is any iterable of sample dicts; other MDSWriter arguments are elided.
with MDSWriter(
        ...,
        retry=3) as out:
    for sample in dataset:
        out.write(sample)
```
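
For readers unfamiliar with the technique, retrying with exponential backoff and jitter can be sketched generically as follows. This is not the library's implementation: the function name, the delay parameters, and the reading of `retry` as the number of extra attempts are all assumptions made for illustration.

```python
import random
import time


def call_with_retry(fn, retry=3, base_delay=1.0, max_delay=30.0,
                    sleep=time.sleep, rng=random.random):
    """Call `fn`, retrying up to `retry` extra times with exponential backoff and jitter."""
    for attempt in range(retry + 1):
        try:
            return fn()
        except Exception:
            if attempt == retry:
                raise  # out of retries: surface the last error
            # Exponential backoff capped at max_delay, scaled by jitter in [0.5, 1.5).
            delay = min(max_delay, base_delay * 2 ** attempt) * (0.5 + rng())
            sleep(delay)
```

The jitter spreads retries from many upload threads over time so they do not all hit the storage service at the same instant.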



🐛 Bug Fixes

- Validate [Writer](https://github.com/mosaicml/streaming/blob/v0.6.1/streaming/base/format/base/writer.py#L32) arguments and raise a `ValueError` if any argument is invalid. (434)
- Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (448)

🔧 Improvements

- Better-balanced inter-node downloading for the `py1e` shuffling algorithm by varying shard sample ranges, which helps reduce throughput drops at scale. (442)

What's Changed
* Validate writer arguments by karan6181 in https://github.com/mosaicml/streaming/pull/434
* Bump pytest from 7.4.1 to 7.4.2 by dependabot in https://github.com/mosaicml/streaming/pull/428
* Bump gitpython from 3.1.34 to 3.1.36 by dependabot in https://github.com/mosaicml/streaming/pull/435
* Fix stylistic issues (mostly 100col, docstring conventions) by knighton in https://github.com/mosaicml/streaming/pull/439
* Bump pytest-codeblocks from 0.16.1 to 0.17.0 by dependabot in https://github.com/mosaicml/streaming/pull/436
* py1e randomized by snarayan21 in https://github.com/mosaicml/streaming/pull/442
* Bump gitpython from 3.1.36 to 3.1.37 by dependabot in https://github.com/mosaicml/streaming/pull/446
* Fix BatchFeature of Transformers not handled by StreamingDataloader by Hubert-Bonisseur in https://github.com/mosaicml/streaming/pull/450
* Add a retry logic with backoff and jitter by karan6181 in https://github.com/mosaicml/streaming/pull/448
* Fix broken bibtext by Skylion007 in https://github.com/mosaicml/streaming/pull/452
* Update integration test to include sample order comparison by karan6181 in https://github.com/mosaicml/streaming/pull/456
* Bump pydantic from 2.3.0 to 2.4.2 by dependabot in https://github.com/mosaicml/streaming/pull/455
* Update MCLI credential page for Databricks by karan6181 in https://github.com/mosaicml/streaming/pull/466
* Add merge index file utility by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/449
* Add py1e warning when Shuffle block size is smaller than shard size by snarayan21 in https://github.com/mosaicml/streaming/pull/463
* Fix doc strings by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/469
* Bump fastapi from 0.103.1 to 0.103.2 by dependabot in https://github.com/mosaicml/streaming/pull/454
* Maintain order for merge_index_from_list by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/472
* Fixed codeql out of disk space issue by karan6181 in https://github.com/mosaicml/streaming/pull/473
* Bump version to 0.6.1 by karan6181 in https://github.com/mosaicml/streaming/pull/474

New Contributors
* Hubert-Bonisseur made their first contribution in https://github.com/mosaicml/streaming/pull/450

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.0...v0.6.1

