Mosaicml-streaming

Latest version: v0.11.0

Page 4 of 5

0.4.1

Not secure
New Feature
- Support of Torch 2.0. (234)
- Addition of two new sample shuffling algorithms. (223)
- Support of the AWS S3 Requester Pays bucket permission for streaming. (231)
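
PR 223 describes the two new shuffling algorithms as `naive` (global) and `py1b` (fixed-size blocks). As a rough illustration of that difference, and not the library's implementation, the sketch below shuffles a list of sample IDs either globally or only within fixed-size blocks. The function names and the `block_size` parameter are hypothetical.

```python
import random
from typing import List


def shuffle_globally(sample_ids: List[int], seed: int) -> List[int]:
    """Shuffle every sample ID against every other one (the global idea)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_in_blocks(sample_ids: List[int], block_size: int, seed: int) -> List[int]:
    """Shuffle only within consecutive fixed-size blocks of sample IDs."""
    rng = random.Random(seed)
    shuffled: List[int] = []
    for start in range(0, len(sample_ids), block_size):
        block = list(sample_ids[start:start + block_size])
        rng.shuffle(block)  # samples never leave their own block
        shuffled.extend(block)
    return shuffled


ids = list(range(10))
print(shuffle_globally(ids, seed=0))
print(shuffle_in_blocks(ids, block_size=4, seed=0))
```

In this sketch a sample never leaves its block, so a reader only needs the data for one block at a time, while the global variant can place any sample anywhere in the epoch.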

Documentation

- Added a streaming installation guide and a streaming environment guide. (221)
- Added an instruction guide for converting a multimodal dataset into the MDS format. (220)
- Streaming documentation now supports Algolia search. (224)

What's Changed
* Refactor StreamingDataset shared memory prefix setup by knighton in https://github.com/mosaicml/streaming/pull/218
* Bump pytest from 7.2.2 to 7.3.0 by dependabot in https://github.com/mosaicml/streaming/pull/222
* Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). by knighton in https://github.com/mosaicml/streaming/pull/223
* Add installation and environments documentation by karan6181 in https://github.com/mosaicml/streaming/pull/221
* Add a readme for multimodal convert script modal type by karan6181 in https://github.com/mosaicml/streaming/pull/220
* Bump sphinx-copybutton from 0.5.1 to 0.5.2 by dependabot in https://github.com/mosaicml/streaming/pull/229
* Bump pytest from 7.3.0 to 7.3.1 by dependabot in https://github.com/mosaicml/streaming/pull/230
* Bump sphinxext-opengraph from 0.8.1 to 0.8.2 by dependabot in https://github.com/mosaicml/streaming/pull/228
* Bump fastapi from 0.95.0 to 0.95.1 by dependabot in https://github.com/mosaicml/streaming/pull/227
* Virtually split the repeats of repeated shards by knighton in https://github.com/mosaicml/streaming/pull/226
* Switch documentation search to use Algolia by bandish-shah in https://github.com/mosaicml/streaming/pull/224
* Add a requester pays bucket permission args to boto3 for s3 download file by karan6181 in https://github.com/mosaicml/streaming/pull/231
* Bump yamllint from 1.30.0 to 1.31.0 by dependabot in https://github.com/mosaicml/streaming/pull/233
* Support of torch 2.0 by karan6181 in https://github.com/mosaicml/streaming/pull/234
* Removed pushing auto release branch due to GH action permission by karan6181 in https://github.com/mosaicml/streaming/pull/235
* Fixed local directory check by karan6181 in https://github.com/mosaicml/streaming/pull/238
* Skip distributed all_gather test since CI non-deterministically hangs by karan6181 in https://github.com/mosaicml/streaming/pull/240
* Bump version to 0.4.1 by karan6181 in https://github.com/mosaicml/streaming/pull/239


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.4.0...v0.4.1

0.4.0

Not secure
New Feature

🔀 Dataset Mixing
- Weighted mixing of sub-datasets on the fly during model training (184). `StreamingDataset` now supports an optional `streams` parameter, which takes one or more sub-datasets and fetches samples across them. You can mix (upsample or downsample) datasets by weighting each one either relatively (`proportion`) or absolutely (`repeat` or `samples`); specify none of these to sample 1:1.
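
To make the relative-versus-absolute distinction concrete, here is a small sketch of how per-stream sample counts for one epoch could be derived. It is a simplified model and not `StreamingDataset`'s code: `StreamSpec`, `resolve_counts`, and `epoch_size` are hypothetical names, and only the `proportion`, `repeat`, and `samples` options come from the release note.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StreamSpec:
    """Hypothetical description of one sub-dataset."""
    name: str
    size: int                            # samples physically present
    proportion: Optional[float] = None   # relative share of the epoch
    repeat: Optional[float] = None       # absolute: multiplier on size
    samples: Optional[int] = None        # absolute: exact sample count


def resolve_counts(streams: List[StreamSpec],
                   epoch_size: Optional[int] = None) -> Dict[str, int]:
    """Return how many samples each stream contributes to one epoch."""
    if any(s.proportion is not None for s in streams):
        # Assumption of this sketch: relative weights are all-or-nothing.
        if any(s.proportion is None for s in streams):
            raise ValueError('mixing relative and absolute weights is not modeled here')
        total = sum(s.proportion for s in streams)
        if epoch_size is None:
            epoch_size = sum(s.size for s in streams)
        return {s.name: round(epoch_size * s.proportion / total) for s in streams}

    counts: Dict[str, int] = {}
    for s in streams:
        if s.repeat is not None:
            counts[s.name] = round(s.size * s.repeat)  # up- or downsample by a factor
        elif s.samples is not None:
            counts[s.name] = s.samples                 # exact count
        else:
            counts[s.name] = s.size                    # 1:1
    return counts


streams = [
    StreamSpec('web', size=1000, repeat=0.5),
    StreamSpec('code', size=200, samples=600),
    StreamSpec('books', size=300),
]
print(resolve_counts(streams))  # {'web': 500, 'code': 600, 'books': 300}
```

Here `web` is downsampled to half its size, `code` is upsampled to three times its size, and `books` is used as-is.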

Documentation

- Added a README that shows how to convert a raw dataset into the MDS format for text and vision datasets. (183)

Bug Fixes

- Raise an exception if the cloud storage bucket does not exist during shard file upload. (212)
- Remove a ThreadPoolExecutor shutdown parameter that is unsupported on Python 3.8. (199)
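
The release note for the ThreadPoolExecutor fix (199) does not name the parameter. A plausible candidate is `cancel_futures`, which `Executor.shutdown()` accepts only from Python 3.9 onward; that identification is an assumption. The sketch below shows a version-guarded shutdown that runs on both 3.8 and later interpreters, with `shutdown_executor` as a hypothetical helper rather than anything in the library.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def shutdown_executor(executor: ThreadPoolExecutor) -> None:
    """Shut down an executor without passing arguments Python 3.8 rejects."""
    if sys.version_info >= (3, 9):
        # cancel_futures was added to Executor.shutdown() in Python 3.9.
        executor.shutdown(wait=True, cancel_futures=False)
    else:
        executor.shutdown(wait=True)


executor = ThreadPoolExecutor(max_workers=2)
future = executor.submit(sum, [1, 2, 3])
shutdown_executor(executor)
print(future.result())  # 6
```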

What's Changed
* Update GCS cloud storage credential document by karan6181 in https://github.com/mosaicml/streaming/pull/181
* Update API reference doc to be compatible with sphinx by karan6181 in https://github.com/mosaicml/streaming/pull/182
* Add a readme for text and vision convert script modal type by karan6181 in https://github.com/mosaicml/streaming/pull/183
* Fix docstrings by knighton in https://github.com/mosaicml/streaming/pull/185
* Synchronize before destroying process group by coryMosaicML in https://github.com/mosaicml/streaming/pull/186
* Bump pytest from 7.2.1 to 7.2.2 by dependabot in https://github.com/mosaicml/streaming/pull/187
* Bump pypandoc from 1.10 to 1.11 by dependabot in https://github.com/mosaicml/streaming/pull/188
* White-box weighted mixing of streaming datasets by knighton in https://github.com/mosaicml/streaming/pull/184
* Organize partitioning code by knighton in https://github.com/mosaicml/streaming/pull/190
* Bump pydantic from 1.10.5 to 1.10.6 by dependabot in https://github.com/mosaicml/streaming/pull/194
* Bump uvicorn from 0.20.0 to 0.21.0 by dependabot in https://github.com/mosaicml/streaming/pull/196
* Bump fastapi from 0.92.0 to 0.94.0 by dependabot in https://github.com/mosaicml/streaming/pull/198
* Remove unsupported ThreadPoolExecutor shutdown param in python38 by karan6181 in https://github.com/mosaicml/streaming/pull/199
* Fix doctstrings (maybe?) by Landanjs in https://github.com/mosaicml/streaming/pull/200
* Demo: crawling, converting, and iterating weighted dataset subsets by knighton in https://github.com/mosaicml/streaming/pull/191
* Update WebVid README.md by knighton in https://github.com/mosaicml/streaming/pull/202
* Fix leftover test dirs and improve dataset method and variable names by knighton in https://github.com/mosaicml/streaming/pull/201
* Bump fastapi from 0.94.0 to 0.95.0 by dependabot in https://github.com/mosaicml/streaming/pull/205
* Bump uvicorn from 0.21.0 to 0.21.1 by dependabot in https://github.com/mosaicml/streaming/pull/206
* Raise an exception if bucket does not exist during upload by karan6181 in https://github.com/mosaicml/streaming/pull/212
* Bump yamllint from 1.29.0 to 1.30.0 by dependabot in https://github.com/mosaicml/streaming/pull/209
* Bump pydantic from 1.10.6 to 1.10.7 by dependabot in https://github.com/mosaicml/streaming/pull/211
* Register atexit handler for resource cleanup by karan6181 in https://github.com/mosaicml/streaming/pull/215
* Bump version to 0.4.0 by karan6181 in https://github.com/mosaicml/streaming/pull/216

New Contributors
* coryMosaicML made their first contribution in https://github.com/mosaicml/streaming/pull/186
* Landanjs made their first contribution in https://github.com/mosaicml/streaming/pull/200

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.3.0...v0.4.0

0.3.0

Not secure
New Features

☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to `MDSWriter`. Track the progress of individual uploads with `progress_bar=True`, and tune background upload workers with `max_workers=4`.

You can upload output shard files automatically to a supported cloud provider (AWS S3, GCP, OCI) by passing a bucket location as the `out` parameter of the `Writer` class. The example below uploads output files to an AWS S3 bucket:

```python
from streaming import MDSWriter

output_dir = 's3://bucket/dir/path'
# `...` stands for the remaining writer arguments.
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```


You can instead keep the output shard files locally by passing a local directory path to the `Writer`. For example:

```python
from streaming import MDSWriter

output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        pass
```


You can display the progress of the cloud upload by setting `progress_bar=True` on the `Writer`. For example:

```python
from streaming import MDSWriter

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        pass
```


You can control the number of background upload threads with the `Writer` parameter `max_workers`. These threads upload the shard files to the remote location, if one is provided, and each thread handles one file upload at a time. For example, with `max_workers=4`, at most 4 threads are active at the same time, each uploading one shard file.

```python
from streaming import MDSWriter

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        pass
```
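
The one-thread-per-file behavior can be pictured with a standard-library thread pool. This is only a sketch of the concurrency pattern described above, not the `Writer` implementation; `upload_shard` is a hypothetical stand-in that records the file name instead of contacting cloud storage, and the shard file names are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

uploaded: List[str] = []
lock = threading.Lock()


def upload_shard(filename: str) -> str:
    """Hypothetical upload: a real one would copy the file to remote storage."""
    with lock:
        uploaded.append(filename)
    return filename


shard_files = [f'shard.{i:05d}.mds' for i in range(10)]

# At most 4 uploads run at once; each worker thread handles one file at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(upload_shard, shard_files))

print(len(results))  # 10
```

Raising `max_workers` in a pool like this allows more concurrent uploads, at the cost of more threads competing for network bandwidth.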


🔀 2x faster shuffling

We’ve added a new shuffling algorithm `py1s` which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding `shuffle_algo` (old behavior: `py2s`). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.

📨 2x faster partitioning

We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the `partition_algo` argument to `StreamingDataset`. You will experience this as faster start and resumption for large datasets.

🔗 Extensible downloads

We provide examples of modifying `StreamingDataset` to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.

**API changes**

- The parameter of class `Writer` and its derived classes (`MDSWriter`, `XSVWriter`, `TSVWriter`, `CSVWriter`, and `JSONWriter`) has been renamed from `dirname` to `out`, with the following added functionality:
  - If `out` is a local directory, shard files are saved locally. For example, `out='/tmp/mds/'`.
  - If `out` is a remote directory, a local temporary directory is created to cache the shard files, which are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, `out='s3://bucket/dir/path'`.
  - If `out` is a tuple of `(local_dir, remote_dir)`, shard files are saved in `local_dir` and also uploaded to the remote location. For example, `out=('/tmp/mds/', 's3://bucket/dir/path')`.

- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of `Writer` and its subclasses (like `MDSWriter`) and `StreamingDataset` to require kwargs.
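
The three accepted forms of `out`, together with the keyword-only requirement, can be sketched as a small resolver. `resolve_out` and its return shape are hypothetical and do not mirror the library's internals, and treating any string containing `://` as remote is an assumption of this sketch.

```python
import tempfile
from typing import Optional, Tuple, Union


def resolve_out(*, out: Union[str, Tuple[str, str]]) -> Tuple[str, Optional[str], bool]:
    """Return (local_dir, remote_dir, keep_local) for the three forms of `out`.

    The bare `*` makes `out` keyword-only, mirroring the kwargs requirement.
    """
    if isinstance(out, tuple):
        local_dir, remote_dir = out
        return local_dir, remote_dir, True  # keep the local copy and upload
    if '://' in out:
        # Remote only: stage shards in a temporary directory, to be deleted after upload.
        return tempfile.mkdtemp(), out, False
    return out, None, True  # plain local directory, nothing to upload


print(resolve_out(out='/tmp/mds/'))
print(resolve_out(out=('/tmp/mds/', 's3://bucket/dir/path')))
```

Calling `resolve_out('/tmp/mds/')` without the keyword raises `TypeError`, which is the kind of failure a positional caller would see after an API switches to required kwargs.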

Bug Fixes

- Fix broken blog post link and community email link in the README (177).
- Download shard files with a tmp extension until the download finishes for OCI blob storage (178).
- Supported cloud providers documentation (169).
  - Streaming Dataset supports Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) doc on how to configure cloud storage credentials.
- Make `setup.py` deterministic by sorting dependencies (165).
- Fix overlong lines for better readability (163).
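
The OCI fix (178) downloads to a temporary name until the transfer completes. A common way to get that behavior is to write to a `.tmp` path and rename it only on success, so readers never see a partial shard. The sketch below shows that general pattern with the standard library; `download_atomically` and `fake_fetch` are hypothetical names, and no OCI client is involved.

```python
import os
import tempfile
from typing import Callable


def download_atomically(fetch: Callable[[str], None], dest: str) -> None:
    """Write to `dest + '.tmp'` first, then rename once the download succeeded."""
    tmp_path = dest + '.tmp'
    try:
        fetch(tmp_path)             # may fail or be interrupted midway
        os.replace(tmp_path, dest)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # never leave a partial file behind
        raise


def fake_fetch(path: str) -> None:
    """Hypothetical downloader that just writes some bytes."""
    with open(path, 'wb') as f:
        f.write(b'shard bytes')


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, 'shard.00000.mds')
    download_atomically(fake_fetch, target)
    print(os.path.exists(target), os.path.exists(target + '.tmp'))  # True False
```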

What's Changed
* Bump fastapi from 0.89.1 to 0.91.0 by dependabot in https://github.com/mosaicml/streaming/pull/154
* Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by dependabot in https://github.com/mosaicml/streaming/pull/155
* Compare arrow vs mds vs parquet. by knighton in https://github.com/mosaicml/streaming/pull/160
* Improve serialization format comparison. by knighton in https://github.com/mosaicml/streaming/pull/161
* WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by knighton in https://github.com/mosaicml/streaming/pull/143
* Update download badge link to pepy by karan6181 in https://github.com/mosaicml/streaming/pull/162
* CloudWriter interface: local=, remote=, keep=. by knighton in https://github.com/mosaicml/streaming/pull/148
* Fix overlong lines. by knighton in https://github.com/mosaicml/streaming/pull/163
* Make setup.py deterministic by sorting dependencies. by nharada1 in https://github.com/mosaicml/streaming/pull/165
* Bump pydantic from 1.10.4 to 1.10.5 by dependabot in https://github.com/mosaicml/streaming/pull/166
* Bump gitpython from 3.1.30 to 3.1.31 by dependabot in https://github.com/mosaicml/streaming/pull/167
* Bump fastapi from 0.91.0 to 0.92.0 by dependabot in https://github.com/mosaicml/streaming/pull/168
* Adjust StreamingDataset arguments by knighton in https://github.com/mosaicml/streaming/pull/170
* add 2x faster shuffle algorithm; add shuffle bench/plot by knighton in https://github.com/mosaicml/streaming/pull/137
* Docstring fix by knighton in https://github.com/mosaicml/streaming/pull/173
* Add a supported cloud providers documentation by karan6181 in https://github.com/mosaicml/streaming/pull/169
* Add callout fence to Configure Cloud Storage Credentials guide by karan6181 in https://github.com/mosaicml/streaming/pull/174
* Fix broken links in the README by knighton in https://github.com/mosaicml/streaming/pull/177
* Download the shard files as tmp extension until it finishes for OCI by karan6181 in https://github.com/mosaicml/streaming/pull/178
* Add a support of uploading shard files to a cloud as part of Writer by karan6181 in https://github.com/mosaicml/streaming/pull/171
* Refactor partitioning to be much faster. by knighton in https://github.com/mosaicml/streaming/pull/179
* Bump version to 0.3.0 by karan6181 in https://github.com/mosaicml/streaming/pull/180

New Contributors
* nharada1 made their first contribution in https://github.com/mosaicml/streaming/pull/165

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.5...v0.3.0

0.2.5

Not secure
Bug Fixes
* Fixed CPU crash (https://github.com/mosaicml/streaming/pull/153)
* Update example notebooks (https://github.com/mosaicml/streaming/pull/157)

What's Changed
* Update README.md by knighton in https://github.com/mosaicml/streaming/pull/152
* Fix typo by dakinggg in https://github.com/mosaicml/streaming/pull/156
* Fixed CPU crash by karan6181 in https://github.com/mosaicml/streaming/pull/153
* Update example notebooks by karan6181 in https://github.com/mosaicml/streaming/pull/157
* bump version to 0.2.5 by karan6181 in https://github.com/mosaicml/streaming/pull/158


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.4...v0.2.5

0.2.4

Not secure
What's Changed
* Fix Lossy JPEG reencoding for MDS format by JJGO in https://github.com/mosaicml/streaming/pull/142
* Add message to size assert & change to KeyError by samhavens in https://github.com/mosaicml/streaming/pull/146
* Synchronize prefix_int across all ranks to resolve hang issue by karan6181 in https://github.com/mosaicml/streaming/pull/147
* Pin setuptools in build requirements by dakinggg in https://github.com/mosaicml/streaming/pull/136
* Graphics. by knighton in https://github.com/mosaicml/streaming/pull/150
* bump version to 0.2.4 by karan6181 in https://github.com/mosaicml/streaming/pull/151

New Contributors
* JJGO made their first contribution in https://github.com/mosaicml/streaming/pull/142
* samhavens made their first contribution in https://github.com/mosaicml/streaming/pull/146

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.3...v0.2.4

0.2.3

Not secure
New Features

* Add scalar MDS encodings data types (https://github.com/mosaicml/streaming/pull/130)
* Support of WebVid-10M dataset (https://github.com/mosaicml/streaming/pull/132)
* Support of LAION-400M dataset (https://github.com/mosaicml/streaming/pull/87)
* Make `StreamingDataset[sample_id]` block until the given sample's shard is downloaded if it is not already present, so that the dataset can be used lazily (https://github.com/mosaicml/streaming/pull/118)
* Add a Streaming benchmarking script that reports the time taken by each individual component (https://github.com/mosaicml/streaming/pull/121)
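
The lazy indexing behavior (118) can be illustrated with a toy dataset whose `__getitem__` fetches a shard on first access. This is a sketch of the idea only: `LazyShardDataset`, `fake_download`, and the fixed samples-per-shard layout are hypothetical and much simpler than `StreamingDataset`.

```python
from typing import Callable, Dict, List


class LazyShardDataset:
    """Hypothetical dataset that fetches a shard the first time it is indexed."""

    def __init__(self, samples_per_shard: int, num_samples: int,
                 download_shard: Callable[[int], List[str]]) -> None:
        self.samples_per_shard = samples_per_shard
        self.num_samples = num_samples
        self.download_shard = download_shard
        self.cache: Dict[int, List[str]] = {}  # shard ID -> its samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, sample_id: int) -> str:
        if not 0 <= sample_id < self.num_samples:
            raise IndexError(sample_id)
        shard_id, offset = divmod(sample_id, self.samples_per_shard)
        if shard_id not in self.cache:
            # Blocks the caller until the shard is available locally.
            self.cache[shard_id] = self.download_shard(shard_id)
        return self.cache[shard_id][offset]


downloads: List[int] = []


def fake_download(shard_id: int) -> List[str]:
    """Hypothetical downloader that records which shards were requested."""
    downloads.append(shard_id)
    return [f'sample-{shard_id * 4 + i}' for i in range(4)]


dataset = LazyShardDataset(samples_per_shard=4, num_samples=12,
                           download_shard=fake_download)
print(dataset[9], downloads)  # sample-9 [2]
```

Only shard 2 is fetched for sample 9; the other shards stay remote until one of their samples is requested.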

Bug Fixes
* Nuke concat option in C4 dataset (https://github.com/mosaicml/streaming/pull/129)
* Fixed bug report markdown doc (https://github.com/mosaicml/streaming/pull/140)
* Fixed ADE20K dataset conversion script (https://github.com/mosaicml/streaming/pull/133)

What's Changed
* Make __getitem__ block to download shard if not present. by knighton in https://github.com/mosaicml/streaming/pull/118
* 2022 -> 2023. by knighton in https://github.com/mosaicml/streaming/pull/119
* Benchmark generating the epoch. by knighton in https://github.com/mosaicml/streaming/pull/121
* Move datasets dependency into .[dev]. by knighton in https://github.com/mosaicml/streaming/pull/123
* Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by dependabot in https://github.com/mosaicml/streaming/pull/113
* Bump sphinxext-opengraph from 0.7.4 to 0.7.5 by dependabot in https://github.com/mosaicml/streaming/pull/114
* Bump pytest from 7.2.0 to 7.2.1 by dependabot in https://github.com/mosaicml/streaming/pull/124
* Bump fastapi from 0.88.0 to 0.89.1 by dependabot in https://github.com/mosaicml/streaming/pull/125
* Bump yamllint from 1.28.0 to 1.29.0 by dependabot in https://github.com/mosaicml/streaming/pull/126
* Update paramiko requirement from <3,>=2.11.0 to >=2.11.0,<4 by dependabot in https://github.com/mosaicml/streaming/pull/127
* Bump nbsphinx from 0.8.11 to 0.8.12 by dependabot in https://github.com/mosaicml/streaming/pull/128
* Nuke concat option. by knighton in https://github.com/mosaicml/streaming/pull/129
* Add scalar MDS encodings (data types). by knighton in https://github.com/mosaicml/streaming/pull/130
* WebVid. by knighton in https://github.com/mosaicml/streaming/pull/132
* LAION-400M processing by knighton in https://github.com/mosaicml/streaming/pull/87
* Update isort version by karan6181 in https://github.com/mosaicml/streaming/pull/135
* Update pre-commit requirement from <3,>=2.18.1 to >=2.18.1,<4 by dependabot in https://github.com/mosaicml/streaming/pull/134
* Fixed bug report markdown by karan6181 in https://github.com/mosaicml/streaming/pull/140
* Fix ade20k conversion script by dblalock in https://github.com/mosaicml/streaming/pull/133
* bump version to 0.2.3 by karan6181 in https://github.com/mosaicml/streaming/pull/141


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.2...v0.2.3

© 2025 Safety CLI Cybersecurity Inc. All Rights Reserved.