Mosaicml-streaming

Latest version: v0.9.1

0.9.1

What's New
1. Streaming added to Gurubase (https://github.com/mosaicml/streaming/pull/805)
* Streaming now has an AI assistant to help users with their questions! Try out Streaming Guru, which draws on this repo and the [docs](https://docs.mosaicml.com/projects/streaming/en/stable/) to answer questions using an LLM.

Improvements
1. Permission Issue Resolution (https://github.com/mosaicml/streaming/pull/813)
* Resolved read permission issues occurring when shared memory files are created in shared computing environments. We added retry conditions to allow the creation of new shared memory files upon encountering permission errors.
* Prefix Integrity for Shared Memory Files: When creating shared memory files, both LOCALS and FILELOCKS are now validated to ensure no overlap with existing files, and they are matched with consistent prefix identifiers.
* Handling Abnormal Program Exits: Enhanced cleanup procedures to address cases where an abnormal program exit left some shared memory files uncleared. All files in SHM_TO_CLEAN are now checked to prevent duplicates.
These changes improve shared memory management and reliability in shared environments.
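
The retry-on-permission-error idea can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `create_with_prefix_retry`, the `000000_locals` naming scheme, and the injected `create` callable are all hypothetical stand-ins for whatever Streaming does internally.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def create_with_prefix_retry(create: Callable[[str], T],
                             max_attempts: int = 10) -> tuple[int, T]:
    """Try successive prefixes until `create` succeeds (hypothetical helper)."""
    for prefix_int in range(max_attempts):
        name = f"{prefix_int:06}_locals"  # assumed naming scheme, for illustration
        try:
            return prefix_int, create(name)
        except (PermissionError, FileExistsError):
            # Someone else owns or already created this segment: try the next prefix.
            continue
    raise RuntimeError(f"No usable prefix found in {max_attempts} attempts")


# Stand-in for a real shared memory constructor, so the sketch runs anywhere.
taken = {"000000_locals"}  # pretend another user owns this segment


def fake_create(name: str) -> str:
    if name in taken:
        raise PermissionError(name)
    return name


prefix_int, handle = create_with_prefix_retry(fake_create)
```

Injecting the constructor keeps the retry logic separate from the actual shared memory call, which makes the behavior easy to exercise without touching `/dev/shm`.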

2. Fix Shard Eviction Hanging (https://github.com/mosaicml/streaming/pull/795)
* Changed the search for coldest shard to avoid looping over remote shards by considering local shards only as possible candidates for eviction.
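
As a rough sketch of the idea (not the library's actual code), restricting the candidate set to local shards before picking the least recently accessed one might look like this. `ShardInfo` and `find_coldest_local_shard` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ShardInfo:
    """Hypothetical per-shard bookkeeping; not the library's data structure."""
    shard_id: int
    is_local: bool      # True if the shard is currently present on local disk
    last_access: float  # timestamp of the most recent access


def find_coldest_local_shard(shards: list[ShardInfo]) -> Optional[int]:
    """Return the id of the least recently accessed shard among local shards only."""
    local = [s for s in shards if s.is_local]
    if not local:
        return None  # nothing on disk, so nothing to evict
    return min(local, key=lambda s: s.last_access).shard_id


shards = [
    ShardInfo(0, is_local=False, last_access=1.0),  # remote: never a candidate
    ShardInfo(1, is_local=True, last_access=5.0),
    ShardInfo(2, is_local=True, last_access=3.0),
]
coldest = find_coldest_local_shard(shards)
```

Here shard 0 has the oldest access time but is skipped because it is not local, so shard 2 is selected.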

What's Changed
* Bump pydantic from 2.9.1 to 2.9.2 by dependabot in https://github.com/mosaicml/streaming/pull/785
* Bump fastapi from 0.114.2 to 0.115.0 by dependabot in https://github.com/mosaicml/streaming/pull/786
* Bump uvicorn from 0.30.6 to 0.31.0 by dependabot in https://github.com/mosaicml/streaming/pull/793
* Fixed broken links in README.md by LukaszSztukiewicz in https://github.com/mosaicml/streaming/pull/794
* Shard evict fix by snarayan21 in https://github.com/mosaicml/streaming/pull/795
* Update huggingface-hub requirement from <0.25,>=0.23.4 to >=0.23.4,<0.26 by dependabot in https://github.com/mosaicml/streaming/pull/787
* Fix dataset.size() typo in docs by snarayan21 in https://github.com/mosaicml/streaming/pull/798
* Warning -> info about defaults from v0.7.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/799
* Bump uvicorn from 0.31.0 to 0.31.1 by dependabot in https://github.com/mosaicml/streaming/pull/803
* Bump fastapi from 0.115.0 to 0.115.2 by dependabot in https://github.com/mosaicml/streaming/pull/804
* Introducing Streaming Guru on Gurubase.io by kursataktas in https://github.com/mosaicml/streaming/pull/805
* Add better error message for shared prefix by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/806
* Bump uvicorn from 0.31.1 to 0.32.0 by dependabot in https://github.com/mosaicml/streaming/pull/809
* Bump pytest-split from 0.9.0 to 0.10.0 by dependabot in https://github.com/mosaicml/streaming/pull/810
* Fix logo png by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/808
* Update huggingface-hub requirement from <0.26,>=0.23.4 to >=0.23.4,<0.27 by dependabot in https://github.com/mosaicml/streaming/pull/814
* Bump fastapi from 0.115.2 to 0.115.4 by dependabot in https://github.com/mosaicml/streaming/pull/815
* Fix shared memory permission issue in a shared pod environment by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/813

New Contributors
* LukaszSztukiewicz made their first contribution in https://github.com/mosaicml/streaming/pull/794
* kursataktas made their first contribution in https://github.com/mosaicml/streaming/pull/805

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.9.0...v0.9.1

0.9.0

What's New
1. Improved compatibility for ndarray and JSON types (#776, #777)
Columns containing a map type now convert to JSON successfully in an MDS file when the column type is specified as 'json', and the JSON encoder can now handle ndarray values.
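
One common way to let a JSON encoder accept array values is to override `json.JSONEncoder.default`. The sketch below illustrates that general technique and is not the code from the PRs: `ArrayFriendlyEncoder` is a hypothetical name, and it relies on duck typing via `tolist()`, which `numpy.ndarray` provides. The stdlib `array` type is used in the demo only so the example runs without numpy installed.

```python
import json
from array import array


class ArrayFriendlyEncoder(json.JSONEncoder):
    """Illustrative encoder: converts array-like objects to plain lists."""

    def default(self, obj):
        # numpy.ndarray exposes tolist(); so does the stdlib array type used below.
        if hasattr(obj, "tolist"):
            return obj.tolist()
        # Anything else falls through to the standard TypeError.
        return super().default(obj)


sample = {"tokens": array("i", [1, 2, 3]), "meta": {"lang": "en"}}
encoded = json.dumps(sample, cls=ArrayFriendlyEncoder)
decoded = json.loads(encoded)
```

Note that decoding yields plain lists, so the original array type is not recovered on the way back.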

What's Changed
* Bump fastapi from 0.112.1 to 0.112.2 by dependabot in https://github.com/mosaicml/streaming/pull/768
* Bump ci testing by snarayan21 in https://github.com/mosaicml/streaming/pull/770
* Bump jupyter from 1.0.0 to 1.1.1 by dependabot in https://github.com/mosaicml/streaming/pull/772
* Bump fastapi from 0.112.2 to 0.114.0 by dependabot in https://github.com/mosaicml/streaming/pull/779
* Bump pydantic from 2.8.2 to 2.9.1 by dependabot in https://github.com/mosaicml/streaming/pull/778
* Allow JSON encoder to handle ndarray by srowen in https://github.com/mosaicml/streaming/pull/777
* Add MapType as JSON-compatible by srowen in https://github.com/mosaicml/streaming/pull/776
* Bump fastapi from 0.114.0 to 0.114.2 by dependabot in https://github.com/mosaicml/streaming/pull/783
* Update datasets requirement from <3,>=2.4.0 to >=2.4.0,<4 by dependabot in https://github.com/mosaicml/streaming/pull/784
* Bump pytest from 8.3.2 to 8.3.3 by dependabot in https://github.com/mosaicml/streaming/pull/782
* Bump main branch to 0.10.0.dev0 by dakinggg in https://github.com/mosaicml/streaming/pull/790


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.8.1...v0.9.0

0.8.1

🔧 Improvements
**Dataloader hanging between epochs has now been resolved!** We've seen training time improvements of up to 40% for some many-epoch training jobs. If this was impacting your runs and has now been fixed, please let us know!
* Fix dataloader hang at the end of an epoch by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/741
* Add default compression, and warning about local paths to dataframe_to_mds by srowen in https://github.com/mosaicml/streaming/pull/748
* Throw exception when event.is_set() after write()s by srowen in https://github.com/mosaicml/streaming/pull/754

🐛 Bug Fixes
* Ensure deterministic sample order between epochs when `shuffle=False` by snarayan21 in https://github.com/mosaicml/streaming/pull/750

What's Changed
* Make Pytest log in color in Github Action by eitanturok in https://github.com/mosaicml/streaming/pull/739
* fix azure container name and blob name in download_from_azure by jaehwana2z in https://github.com/mosaicml/streaming/pull/733
* Bump uvicorn from 0.30.3 to 0.30.5 by dependabot in https://github.com/mosaicml/streaming/pull/743
* Update huggingface-hub requirement from <0.24,>=0.23.4 to >=0.23.4,<0.25 by dependabot in https://github.com/mosaicml/streaming/pull/729
* Bump fastapi from 0.111.1 to 0.112.0 by dependabot in https://github.com/mosaicml/streaming/pull/744
* Bump ci-testing to v0.1.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/745
* Patching conf.py due to Sphinx deprecating config manipulation by snarayan21 in https://github.com/mosaicml/streaming/pull/746
* Bump ci-testing to v0.1.2 by snarayan21 in https://github.com/mosaicml/streaming/pull/747
* Type hints conformant with pep 585 by snarayan21 in https://github.com/mosaicml/streaming/pull/752
* Ruff rule to remove unused imports by snarayan21 in https://github.com/mosaicml/streaming/pull/756
* Fix linting for numpy 2.1.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/764
* Bump fastapi from 0.112.0 to 0.112.1 by dependabot in https://github.com/mosaicml/streaming/pull/760
* Bump uvicorn from 0.30.5 to 0.30.6 by dependabot in https://github.com/mosaicml/streaming/pull/762
* Version 0.8.1 bump! by snarayan21 in https://github.com/mosaicml/streaming/pull/766

New Contributors
* eitanturok made their first contribution in https://github.com/mosaicml/streaming/pull/739
* jaehwana2z made their first contribution in https://github.com/mosaicml/streaming/pull/733
* srowen made their first contribution in https://github.com/mosaicml/streaming/pull/748

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.8.0...v0.8.1

0.8.0

✨ What's New ✨

**1. HF File System Streaming (#711)**

Streaming now supports streaming data from the Hugging Face (HF) file system! This adds another popular backend as an option for hosting your data.


What's Changed
* Bump fastapi from 0.110.2 to 0.111.0 by dependabot in https://github.com/mosaicml/streaming/pull/670
* Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/668
* Ensure shards cannot be larger than 4GB by snarayan21 in https://github.com/mosaicml/streaming/pull/672
* Helpful error on `py1e` for improperly written datasets by snarayan21 in https://github.com/mosaicml/streaming/pull/673
* Bump pytest from 8.2.0 to 8.2.1 by dependabot in https://github.com/mosaicml/streaming/pull/680
* Update platform references by aspfohl in https://github.com/mosaicml/streaming/pull/675
* Update CODEOWNERS by karan6181 in https://github.com/mosaicml/streaming/pull/681
* Fix `batch_size` typo for `Stream` object in docs by snarayan21 in https://github.com/mosaicml/streaming/pull/682
* Bump databricks-sdk from 0.27.0 to 0.27.1 by dependabot in https://github.com/mosaicml/streaming/pull/679
* Improve local temp directory error when only `remote` is specified by snarayan21 in https://github.com/mosaicml/streaming/pull/683
* Fix node calculation in `replication` for `World` object by snarayan21 in https://github.com/mosaicml/streaming/pull/685
* Warning condition changed for Sequence Parallelism by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/688
* Bump pydantic from 2.7.1 to 2.7.2 by dependabot in https://github.com/mosaicml/streaming/pull/692
* Bump uvicorn from 0.29.0 to 0.30.1 by dependabot in https://github.com/mosaicml/streaming/pull/691
* Make sure epoch_size is an int by snarayan21 in https://github.com/mosaicml/streaming/pull/693
* Bump databricks-sdk from 0.27.1 to 0.28.0 by dependabot in https://github.com/mosaicml/streaming/pull/687
* Bump pytest from 8.2.1 to 8.2.2 by dependabot in https://github.com/mosaicml/streaming/pull/697
* fix: expand user path for Writer's output directory. by huxuan in https://github.com/mosaicml/streaming/pull/694
* Bump pydantic from 2.7.2 to 2.7.3 by dependabot in https://github.com/mosaicml/streaming/pull/696
* Fix edge cases with scalar or empty numpy array encoding by snarayan21 in https://github.com/mosaicml/streaming/pull/702
* Raise IndexError in `Spanner` object instead of `ValueError` by snarayan21 in https://github.com/mosaicml/streaming/pull/701
* Fix linting issues with numpy 2 by snarayan21 in https://github.com/mosaicml/streaming/pull/705
* Bump pydantic from 2.7.3 to 2.7.4 by dependabot in https://github.com/mosaicml/streaming/pull/704
* Enable correct resumption from the end of an epoch by snarayan21 in https://github.com/mosaicml/streaming/pull/700
* Fix `drop_first` checking in partitioning to account for `world_size` divisibility by snarayan21 in https://github.com/mosaicml/streaming/pull/706
* fix convert imagenet by Hprairie in https://github.com/mosaicml/streaming/pull/708
* Bump pytest-split from 0.8.2 to 0.9.0 by dependabot in https://github.com/mosaicml/streaming/pull/710
* Remove duplicate `dbfs:` prefix from error message by vanshcsingh in https://github.com/mosaicml/streaming/pull/712
* enable adaptive retry for s3 download by bigning in https://github.com/mosaicml/streaming/pull/713
* Upgrade ci_testing, remove codeql by snarayan21 in https://github.com/mosaicml/streaming/pull/714
* Fix Linting from Pillow version update by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/719
* Bump pydantic from 2.7.4 to 2.8.2 by dependabot in https://github.com/mosaicml/streaming/pull/718
* Bump databricks-sdk from 0.28.0 to 0.29.0 by dependabot in https://github.com/mosaicml/streaming/pull/715
* Add HF File System Support to Streaming by orionw in https://github.com/mosaicml/streaming/pull/711
* Improve error message on non-0 rank when index file download failed by bigning in https://github.com/mosaicml/streaming/pull/723
* Bump pytest from 8.2.2 to 8.3.2 by dependabot in https://github.com/mosaicml/streaming/pull/735
* Bump uvicorn from 0.30.1 to 0.30.3 by dependabot in https://github.com/mosaicml/streaming/pull/730
* Bump fastapi from 0.111.0 to 0.111.1 by dependabot in https://github.com/mosaicml/streaming/pull/724
* Bump Streaming Version to 0.8.0 by mvpatel2000 in https://github.com/mosaicml/streaming/pull/738

New Contributors
* aspfohl made their first contribution in https://github.com/mosaicml/streaming/pull/675
* huxuan made their first contribution in https://github.com/mosaicml/streaming/pull/694
* Hprairie made their first contribution in https://github.com/mosaicml/streaming/pull/708
* vanshcsingh made their first contribution in https://github.com/mosaicml/streaming/pull/712
* orionw made their first contribution in https://github.com/mosaicml/streaming/pull/711

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.6...v0.8.0

0.7.6

💎 New Features

1. `device_per_stream` batching method
Users can now construct batches such that each device sees only samples from a single stream. This is very useful in cases where different data sources have samples/tensors of different sizes, but the model should still see samples from these different data sources at each optimizer step.
* Adding `device_per_stream` batching by snarayan21 in https://github.com/mosaicml/streaming/pull/661
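
To make the batching idea concrete, here is a small conceptual sketch in plain Python. It is not the library's algorithm: the function name, the round-robin assignment of streams to devices, and the sequential sampling are all assumptions chosen to keep the example short.

```python
def device_per_stream_batch(streams: dict[str, list[int]],
                            num_devices: int,
                            device_batch_size: int) -> list[tuple[str, list[int]]]:
    """Build one global batch in which each device's samples come from one stream.

    Hypothetical helper: streams are assigned to devices round-robin and
    samples are taken sequentially, purely for illustration.
    """
    names = sorted(streams)
    cursors = {name: 0 for name in names}  # next unread position per stream
    batch = []
    for device in range(num_devices):
        name = names[device % len(names)]
        start = cursors[name]
        batch.append((name, streams[name][start:start + device_batch_size]))
        cursors[name] = start + device_batch_size
    return batch


streams = {
    "images_small": [0, 1, 2, 3],
    "images_large": [100, 101, 102, 103],
}
global_batch = device_per_stream_batch(streams, num_devices=4, device_batch_size=2)
```

Each entry of `global_batch` is one device's batch, drawn from a single stream, while the global batch as a whole still mixes both streams within the same optimizer step.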

2. Add `ndarray` type for Spark dataframes.
Enable parsing Spark's ArrayType (of ShortType, LongType, IntegerType, FloatType, DoubleType) when converting a Spark dataframe to MDS.
* Add ndarray type by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/623

3. Support for Alipan storage
Adds support for Alipan, Alibaba's cloud storage service.
* Add support for Alipan Storage backend by PeterDing in https://github.com/mosaicml/streaming/pull/651

What's Changed
* Bump fastapi from 0.110.0 to 0.110.2 by dependabot in https://github.com/mosaicml/streaming/pull/660
* Bump pydantic from 2.6.4 to 2.7.0 by dependabot in https://github.com/mosaicml/streaming/pull/653
* Bump pydantic from 2.7.0 to 2.7.1 by dependabot in https://github.com/mosaicml/streaming/pull/666
* Bump pytest from 8.1.1 to 8.2.0 by dependabot in https://github.com/mosaicml/streaming/pull/664
* Bump databricks-sdk from 0.23.0 to 0.27.0 by dependabot in https://github.com/mosaicml/streaming/pull/667
* Version bump to v0.7.6 by snarayan21 in https://github.com/mosaicml/streaming/pull/669

New Contributors
* PeterDing made their first contribution in https://github.com/mosaicml/streaming/pull/651

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.5...v0.7.6

0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support
Using the `replication` argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.
* Replicating samples across devices (SP / TP enablement) by knighton in https://github.com/mosaicml/streaming/pull/597
* Expanded replication testing + documentation by snarayan21 in https://github.com/mosaicml/streaming/pull/607
* Make streaming use the correct number of unique samples with SP/TP by snarayan21 in https://github.com/mosaicml/streaming/pull/619
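
Conceptually, replication groups consecutive ranks so that every rank in a group receives the same samples. The sketch below illustrates that grouping arithmetic only. The helper names and the assumption that groups are formed as `rank // replication` are illustrative and not taken from the library source.

```python
def data_group(rank: int, replication: int) -> int:
    """Ranks with the same group index receive identical samples (assumed mapping)."""
    return rank // replication


def unique_data_ranks(world_size: int, replication: int) -> int:
    """Number of distinct sample streams across the whole job."""
    if world_size % replication != 0:
        raise ValueError("world_size must be divisible by replication")
    return world_size // replication


# 8 ranks with replication=2: ranks (0,1), (2,3), (4,5), (6,7) share samples.
groups = [data_group(rank, replication=2) for rank in range(8)]
num_unique = unique_data_ranks(world_size=8, replication=2)
```

With this mapping, a job on 8 ranks with `replication=2` consumes unique samples as if it had 4 data-parallel ranks, which is why the number of unique samples per step has to account for replication.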

2. Overhauled Streaming Documentation
New and improved streaming documentation can be found [here](https://docs.mosaicml.com/projects/streaming/en/stable/#) -- please submit issues with any feedback.
* Major overhaul of Streaming documentation by snarayan21 in https://github.com/mosaicml/streaming/pull/636

3. `batch_size` is now required for StreamingDataset
As we have seen multiple errors and performance degradations from users not setting the `batch_size` argument to StreamingDataset, we are making it a requirement to iterate over the dataset.
* You must set batch size. There is no other way. by snarayan21 in https://github.com/mosaicml/streaming/pull/624

4. Support for Python 3.11, deprecate Python 3.8
* Add support for Python 3.11 and deprecate Python 3.8 by karan6181 in https://github.com/mosaicml/streaming/pull/586

🐛 Bug Fixes
* [easy typo fix] fix f-string by bigning in https://github.com/mosaicml/streaming/pull/596
* Change comparison in partitions to include equals by JAEarly in https://github.com/mosaicml/streaming/pull/587
* Use type int when initializing SharedMemory size by bchiang2 in https://github.com/mosaicml/streaming/pull/604
* COCO Dataset fix -- avoids `allow_unsafe_types=True` by snarayan21 in https://github.com/mosaicml/streaming/pull/647

🔧 Improvements
* Allow writers to overwrite existing data by JAEarly in https://github.com/mosaicml/streaming/pull/594
* Update careers link by milocress in https://github.com/mosaicml/streaming/pull/611
* Update license by b-chu in https://github.com/mosaicml/streaming/pull/568
* Updated documentation for S3-compatible object stores by AIproj in https://github.com/mosaicml/streaming/pull/592
* Make yamllint consistent with Composer by b-chu in https://github.com/mosaicml/streaming/pull/583
* Switch linting workflows to ci-testing repo by b-chu in https://github.com/mosaicml/streaming/pull/616

What's Changed
* Bump uvicorn from 0.26.0 to 0.27.1 by dependabot in https://github.com/mosaicml/streaming/pull/599
* Bump pytest-split from 0.8.1 to 0.8.2 by dependabot in https://github.com/mosaicml/streaming/pull/581
* Update ruff to 0.2.2 by Skylion007 in https://github.com/mosaicml/streaming/pull/608
* Bump fastapi from 0.109.0 to 0.110.0 by dependabot in https://github.com/mosaicml/streaming/pull/610
* Bump yamllint from 1.33.0 to 1.35.1 by dependabot in https://github.com/mosaicml/streaming/pull/601
* Bump uvicorn from 0.27.1 to 0.28.0 by dependabot in https://github.com/mosaicml/streaming/pull/626
* Update moto requirement from <5,>=4.0 to >=4.0,<6 by dependabot in https://github.com/mosaicml/streaming/pull/580
* Bump furo from 2023.7.26 to 2024.1.29 by dependabot in https://github.com/mosaicml/streaming/pull/631
* Bump pypandoc from 1.12 to 1.13 by dependabot in https://github.com/mosaicml/streaming/pull/630
* Bump databricks-sdk from 0.14.0 to 0.22.0 by dependabot in https://github.com/mosaicml/streaming/pull/629
* Add batch_size to 1 if not provided for regression testing by karan6181 in https://github.com/mosaicml/streaming/pull/635
* Fixed docstring note for getting sequential sample ordering by snarayan21 in https://github.com/mosaicml/streaming/pull/632
* Bump pytest and fix failing test by snarayan21 in https://github.com/mosaicml/streaming/pull/642
* Update pytest-cov requirement from <5,>=4 to >=4,<6 by dependabot in https://github.com/mosaicml/streaming/pull/638
* Bump pydantic from 2.5.3 to 2.6.4 by dependabot in https://github.com/mosaicml/streaming/pull/639
* Bump uvicorn from 0.28.0 to 0.29.0 by dependabot in https://github.com/mosaicml/streaming/pull/640
* Bump databricks-sdk from 0.22.0 to 0.23.0 by dependabot in https://github.com/mosaicml/streaming/pull/644
* Version bump to 0.7.5 by snarayan21 in https://github.com/mosaicml/streaming/pull/650

New Contributors
* bigning made their first contribution in https://github.com/mosaicml/streaming/pull/596
* JAEarly made their first contribution in https://github.com/mosaicml/streaming/pull/587
* AIproj made their first contribution in https://github.com/mosaicml/streaming/pull/592
* milocress made their first contribution in https://github.com/mosaicml/streaming/pull/611
* bchiang2 made their first contribution in https://github.com/mosaicml/streaming/pull/604

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.7.4...v0.7.5
