Mosaicml

Latest version: v0.28.0


Page 13 of 15

0.5.0

Not secure
Streaming `v0.5.0` is released! Install via `pip`:


```
pip install --upgrade mosaicml-streaming==0.5.0
```



New Features

🆕 Cold Shard Eviction. (#219)

Dynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument `cache_limit`. See the [shuffling](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md) guide for more details.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    cache_limit='100gb',
    ...
)
```
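To make the eviction idea concrete, here is a minimal sketch of least-recently-used eviction under a byte limit. It is an illustration only, not the library's implementation: the `ShardCache` class, its `touch` method, and the integer byte sizes are all hypothetical names chosen for this example.

```python
from collections import OrderedDict


class ShardCache:
    """Toy LRU cache that tracks shard sizes against a byte limit (hypothetical)."""

    def __init__(self, cache_limit: int) -> None:
        self.cache_limit = cache_limit
        self.shards = OrderedDict()  # shard id -> size in bytes, oldest first
        self.usage = 0

    def touch(self, shard_id: int, size: int) -> list:
        """Record an access to a shard; return the ids evicted to make room."""
        if shard_id in self.shards:
            # Already cached: just mark it as most recently used.
            self.shards.move_to_end(shard_id)
            return []
        evicted = []
        # Drop least recently used shards until the new one fits.
        while self.shards and self.usage + size > self.cache_limit:
            old_id, old_size = self.shards.popitem(last=False)
            self.usage -= old_size
            evicted.append(old_id)
        self.shards[shard_id] = size
        self.usage += size
        return evicted


cache = ShardCache(cache_limit=100)
cache.touch(0, 40)
cache.touch(1, 40)
cache.touch(0, 40)            # shard 0 becomes most recently used
evicted = cache.touch(2, 40)  # over the limit, so shard 1 is evicted
```

In this toy run, shard 1 is the coldest entry when shard 2 arrives, so it is the one removed to stay under the limit.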


🤙 Fetch sample using NumPy style indexing. (#120)

Users can now randomly access samples using NumPy-style indexing with `StreamingDataset`. For example,

```python
import numpy as np
from streaming import StreamingDataset

# `local` and `remote` are the local cache directory and the remote dataset path.
dataset = StreamingDataset(local=local, remote=remote)

dataset[0]                 # Fetch sample 0
dataset[-1]                # Fetch last sample
dataset[[10, 20]]          # Fetch samples 10 and 20
dataset[slice(1, 10, 2)]   # Fetch samples 1, 3, 5, 7, and 9
dataset[5:0:-1]            # Fetch samples 5, 4, 3, 2, and 1
dataset[np.array([4, 7])]  # Fetch samples 4 and 7
```
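The changelog mentions an `Array` base class that provides this indexing. A rough sketch of how such dispatch could work is shown below; the class body, `get_item`, and `size` here are assumptions for illustration and are not taken from the library's source.

```python
class Array:
    """Toy base class that turns NumPy-style indices into single-item lookups."""

    def __init__(self, items):
        self.items = list(items)

    @property
    def size(self):
        return len(self.items)

    def get_item(self, idx):
        # Fetch exactly one item by non-negative index.
        return self.items[idx]

    def __getitem__(self, at):
        if isinstance(at, int):
            # Negative indices count from the end.
            if at < 0:
                at += self.size
            if not 0 <= at < self.size:
                raise IndexError(f'index {at} out of range for size {self.size}')
            return self.get_item(at)
        if isinstance(at, slice):
            # slice.indices() resolves start/stop/step against the length.
            return [self.get_item(i) for i in range(*at.indices(self.size))]
        # Any other iterable (list, tuple, array) is handled element-wise.
        return [self[int(i)] for i in at]


data = Array(range(100, 110))
first = data[0]
last = data[-1]
picked = data[[1, 2]]
strided = data[slice(1, 10, 2)]
reverse = data[5:0:-1]
```

Every index form reduces to repeated calls to `get_item`, so a subclass only has to implement single-sample access.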


🦾 Any S3 compatible object store. (#265)

Streaming now supports any S3-compatible object store, that is, any object store that exposes the S3 API. Examples include [Cloudflare R2](https://www.cloudflare.com/products/r2/), [Coreweave](https://docs.coreweave.com/storage/object-storage), and [Backblaze B2](https://www.backblaze.com/b2/cloud-storage.html). Set the environment variable `S3_ENDPOINT_URL` to the endpoint of the object store you are using. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#any-s3-compatible-object-store).
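As a sketch of how an endpoint override like this is typically consumed, the snippet below reads `S3_ENDPOINT_URL` and turns it into client keyword arguments. The helper name `s3_client_kwargs` and the example URL are hypothetical, and how the library reads the variable internally is an assumption; `endpoint_url` itself is a standard `boto3.client` parameter.

```python
import os


def s3_client_kwargs() -> dict:
    """Build keyword arguments for an S3 client from the environment."""
    kwargs = {}
    endpoint_url = os.environ.get('S3_ENDPOINT_URL')
    if endpoint_url:
        # Point the client at the S3-compatible store instead of AWS.
        kwargs['endpoint_url'] = endpoint_url
    return kwargs


# Hypothetical endpoint; substitute the URL your provider gives you.
os.environ['S3_ENDPOINT_URL'] = 'https://object-store.example.com'
kwargs = s3_client_kwargs()
# With boto3 installed, this could be used as: boto3.client('s3', **kwargs)
```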

🦾 Azure cloud blob storage. (#256)

Streaming now supports Azure Blob Storage. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#azure-blob-storage).

Bug Fixes

- Wait for download and ready thread to finish before terminating job. (#286)
- Fixed length calculation to use resampled epoch size, not underlying num samples. (#278)
- Fixed mypy errors by adding a py.typed marker file. (#245)
- Create a new boto3 session per thread to avoid sharing resources. (#251)

🔧 **API changes**

- The argument `samples_per_epoch` has been renamed to `epoch_size` in `StreamingDataset` to better distinguish the actual number of underlying samples as serialized from the number of observed samples when iterating (which may differ due to weighting sub-datasets).
- The argument `samples` has been renamed to `choose` in `Stream` to better distinguish the underlying samples from the resampled data.
- The argument `keep_raw` has been removed from `StreamingDataset` in the process of finalizing the design for shard eviction (see the newly added `cache_limit` parameter).
- The default value of `predownload` in `StreamingDataset` was updated; it is now derived from the batch size and the number of canonical nodes instead of the previous constant value of `100_000`. This prevents predownloaded shards from being evicted before they are ever used.
- The default value of `num_canonical_nodes` in `StreamingDataset` was changed from the number of nodes of the initial run to 64 times that number, to increase data source diversity and improve convergence.
- The default value of `shuffle_algo` in `StreamingDataset` was changed from `py1b` to `py1s`, as it requires fewer shards to be downloaded during iteration. More details about the different shuffling algorithms can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md).
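For code that still passes the old `StreamingDataset` arguments, the renames above can be applied mechanically. The helper below is a hypothetical migration sketch, not part of the library; the `pin_old_shuffle` flag is an invented convenience for callers who want to keep the previous `py1b` default.

```python
def migrate_dataset_kwargs(kwargs: dict, pin_old_shuffle: bool = False) -> dict:
    """Translate pre-0.5.0 StreamingDataset keyword arguments (hypothetical helper)."""
    migrated = dict(kwargs)
    # `samples_per_epoch` was renamed to `epoch_size`.
    if 'samples_per_epoch' in migrated:
        migrated['epoch_size'] = migrated.pop('samples_per_epoch')
    # `keep_raw` was removed; see `cache_limit` for controlling disk usage.
    migrated.pop('keep_raw', None)
    if pin_old_shuffle:
        # Keep the previous default unless the caller already chose an algorithm.
        migrated.setdefault('shuffle_algo', 'py1b')
    return migrated


old_kwargs = {'local': '/tmp/cache', 'samples_per_epoch': 1000, 'keep_raw': True}
new_kwargs = migrate_dataset_kwargs(old_kwargs)
pinned_kwargs = migrate_dataset_kwargs(old_kwargs, pin_old_shuffle=True)
```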

What's Changed
* Redesign shard index by knighton in https://github.com/mosaicml/streaming/pull/236
* Propagate an exception raise by a thread to its caller by karan6181 in https://github.com/mosaicml/streaming/pull/241
* Raise descriptive error message when index.json is corrupted by karan6181 in https://github.com/mosaicml/streaming/pull/242
* Rename "samples" to "choose" (distinguish underlying vs resampled) by knighton in https://github.com/mosaicml/streaming/pull/243
* Added py.typed to indicate that the repository has typing annotations by karan6181 in https://github.com/mosaicml/streaming/pull/245
* Add "Array" base class, which provides numpy-style indexing. by knighton in https://github.com/mosaicml/streaming/pull/120
* Better organize code by knighton in https://github.com/mosaicml/streaming/pull/246
* Update readthedocs python version to 3.9 by karan6181 in https://github.com/mosaicml/streaming/pull/249
* Create a new boto3 session per thread by karan6181 in https://github.com/mosaicml/streaming/pull/251
* Bump uvicorn from 0.21.1 to 0.22.0 by dependabot in https://github.com/mosaicml/streaming/pull/253
* Add support for Cloudflare R2 cloud storage by hlky in https://github.com/mosaicml/streaming/pull/255
* Fix typo in documentation's conversion `pile.py` link by ouhenio in https://github.com/mosaicml/streaming/pull/259
* Add support for Azure cloud storage by hlky in https://github.com/mosaicml/streaming/pull/256
* Fix slack link in readme by growlix in https://github.com/mosaicml/streaming/pull/262
* Bugfix in user_guide.md sample code by tginart in https://github.com/mosaicml/streaming/pull/263
* Add `Stream` usage example to README by hanlint in https://github.com/mosaicml/streaming/pull/266
* Update Stream documentation by karan6181 in https://github.com/mosaicml/streaming/pull/267
* Update README.md - slack by ejyuen in https://github.com/mosaicml/streaming/pull/273
* Bump fastapi from 0.95.1 to 0.95.2 by dependabot in https://github.com/mosaicml/streaming/pull/269
* Cold shard eviction by knighton in https://github.com/mosaicml/streaming/pull/219
* Update slack link with a URL shortener by karan6181 in https://github.com/mosaicml/streaming/pull/274
* Bump pydantic from 1.10.7 to 1.10.8 by dependabot in https://github.com/mosaicml/streaming/pull/276
* Bump yamllint from 1.31.0 to 1.32.0 by dependabot in https://github.com/mosaicml/streaming/pull/277
* Fix SD length calculation when resampling by knighton in https://github.com/mosaicml/streaming/pull/278
* Fixed performance degradation when not doing shard eviction by karan6181 in https://github.com/mosaicml/streaming/pull/279
* Derived predownload value using batch size and NCN by karan6181 in https://github.com/mosaicml/streaming/pull/280
* Support any S3-compatible object store (R2, Coreweave, Backblaze, etc.) by abhi-mosaic in https://github.com/mosaicml/streaming/pull/265
* Update docs pypi package and Improved documentation by karan6181 in https://github.com/mosaicml/streaming/pull/281
* Change the default number of canonical nodes by karan6181 in https://github.com/mosaicml/streaming/pull/282
* Set predownload value correctly for all usecase by karan6181 in https://github.com/mosaicml/streaming/pull/283
* Add documentation for MDSWriter, conversion scripts, and supported format by karan6181 in https://github.com/mosaicml/streaming/pull/232
* Ensure int64 by knighton in https://github.com/mosaicml/streaming/pull/284
* Wait for thread job to finish and Fixed filelock directory structure by karan6181 in https://github.com/mosaicml/streaming/pull/286
* Bump fastapi from 0.95.2 to 0.96.0 by dependabot in https://github.com/mosaicml/streaming/pull/287
* Bump version to 0.5.0 by karan6181 in https://github.com/mosaicml/streaming/pull/289
* Remove github action workflow concurrency check by karan6181 in https://github.com/mosaicml/streaming/pull/290

New Contributors
* hlky made their first contribution in https://github.com/mosaicml/streaming/pull/255
* ouhenio made their first contribution in https://github.com/mosaicml/streaming/pull/259
* growlix made their first contribution in https://github.com/mosaicml/streaming/pull/262
* tginart made their first contribution in https://github.com/mosaicml/streaming/pull/263
* hanlint made their first contribution in https://github.com/mosaicml/streaming/pull/266
* abhi-mosaic made their first contribution in https://github.com/mosaicml/streaming/pull/265

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.4.1...v0.5.0

0.4.1

Streaming `v0.4.1` is released! Install via `pip`:


```
pip install --upgrade mosaicml-streaming==0.4.1
```


New Feature
- Support for Torch 2.0. (#234)
- Added two new sample shuffling algorithms. (#223)
- Support for AWS S3 Requester Pays buckets when streaming. (#231)

Documentation

- Added a streaming installation guide and a streaming environment guide. (#221)
- Added an instruction guide for converting a multimodal dataset into the MDS format. (#220)
- Streaming documentation now supports Algolia search. (#224)

What's Changed
* Refactor StreamingDataset shared memory prefix setup by knighton in https://github.com/mosaicml/streaming/pull/218
* Bump pytest from 7.2.2 to 7.3.0 by dependabot in https://github.com/mosaicml/streaming/pull/222
* Add two shuffling algos: naive (globally) and py1b (fixed-size blocks). by knighton in https://github.com/mosaicml/streaming/pull/223
* Add installation and environments documentation by karan6181 in https://github.com/mosaicml/streaming/pull/221
* Add a readme for multimodal convert script modal type by karan6181 in https://github.com/mosaicml/streaming/pull/220
* Bump sphinx-copybutton from 0.5.1 to 0.5.2 by dependabot in https://github.com/mosaicml/streaming/pull/229
* Bump pytest from 7.3.0 to 7.3.1 by dependabot in https://github.com/mosaicml/streaming/pull/230
* Bump sphinxext-opengraph from 0.8.1 to 0.8.2 by dependabot in https://github.com/mosaicml/streaming/pull/228
* Bump fastapi from 0.95.0 to 0.95.1 by dependabot in https://github.com/mosaicml/streaming/pull/227
* Virtually split the repeats of repeated shards by knighton in https://github.com/mosaicml/streaming/pull/226
* Switch documentation search to use Algolia by bandish-shah in https://github.com/mosaicml/streaming/pull/224
* Add a requester pays bucket permission args to boto3 for s3 download file by karan6181 in https://github.com/mosaicml/streaming/pull/231
* Bump yamllint from 1.30.0 to 1.31.0 by dependabot in https://github.com/mosaicml/streaming/pull/233
* Support of torch 2.0 by karan6181 in https://github.com/mosaicml/streaming/pull/234
* Removed pushing auto release branch due to GH action permission by karan6181 in https://github.com/mosaicml/streaming/pull/235
* Fixed local directory check by karan6181 in https://github.com/mosaicml/streaming/pull/238
* Skip distributed all_gather test since CI non-deterministically hangs by karan6181 in https://github.com/mosaicml/streaming/pull/240
* Bump version to 0.4.1 by karan6181 in https://github.com/mosaicml/streaming/pull/239


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.4.0...v0.4.1

0.4.0

Not secure
Streaming `v0.4.0` is released! Install via `pip`:


```
pip install --upgrade mosaicml-streaming==0.4.0
```


New Feature

🔀 Dataset Mixing
- Weighted mixing of sub-datasets on the fly during model training (#184). `StreamingDataset` now supports an optional `streams` parameter, which takes one or more sub-datasets and fetches samples across them. You can mix (upsample or downsample) datasets by weighting each one either relatively (`proportion`) or absolutely (`repeat` or `samples`); if none of these is set, the sub-dataset is sampled 1:1.
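The arithmetic behind relative and absolute weighting can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's algorithm: streams are plain dicts, `resolve_epoch_counts` is a hypothetical helper, and rounding behavior is chosen for the example.

```python
def resolve_epoch_counts(streams, epoch_size=None):
    """Return how many samples to draw from each sub-dataset per epoch.

    Each stream is a dict with `size` (underlying samples) and at most one of
    `proportion`, `repeat`, or `samples`.
    """
    if any('proportion' in s for s in streams):
        # Relative weighting: split the epoch according to normalized proportions.
        if not all('proportion' in s for s in streams):
            raise ValueError('cannot mix relative and absolute weighting in this sketch')
        total = sum(s['proportion'] for s in streams)
        if epoch_size is None:
            epoch_size = sum(s['size'] for s in streams)
        return [round(epoch_size * s['proportion'] / total) for s in streams]
    counts = []
    for s in streams:
        if 'repeat' in s:
            # Absolute weighting as a multiple of the underlying size.
            counts.append(round(s['size'] * s['repeat']))
        elif 'samples' in s:
            # Absolute weighting as an explicit sample count.
            counts.append(s['samples'])
        else:
            # No weighting given: sample 1:1.
            counts.append(s['size'])
    return counts


absolute = resolve_epoch_counts([
    {'size': 1000, 'repeat': 2},
    {'size': 500, 'samples': 100},
    {'size': 300},
])
relative = resolve_epoch_counts(
    [{'size': 1000, 'proportion': 3}, {'size': 1000, 'proportion': 1}],
    epoch_size=400,
)
```

Here the first call upsamples one sub-dataset 2x and downsamples another to 100 samples, while the second splits a 400-sample epoch 3:1.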

Documentation

- Added a README which shows how to convert raw text and vision datasets into the MDS format. (#183)

Bug Fixes

- Raise an exception if the cloud storage bucket does not exist during shard file upload. (#212)
- Remove the unsupported ThreadPoolExecutor shutdown param for Python 3.8. (#199)

What's Changed
* Update GCS cloud storage credential document by karan6181 in https://github.com/mosaicml/streaming/pull/181
* Update API reference doc to be compatible with sphinx by karan6181 in https://github.com/mosaicml/streaming/pull/182
* Add a readme for text and vision convert script modal type by karan6181 in https://github.com/mosaicml/streaming/pull/183
* Fix docstrings by knighton in https://github.com/mosaicml/streaming/pull/185
* Synchronize before destroying process group by coryMosaicML in https://github.com/mosaicml/streaming/pull/186
* Bump pytest from 7.2.1 to 7.2.2 by dependabot in https://github.com/mosaicml/streaming/pull/187
* Bump pypandoc from 1.10 to 1.11 by dependabot in https://github.com/mosaicml/streaming/pull/188
* White-box weighted mixing of streaming datasets by knighton in https://github.com/mosaicml/streaming/pull/184
* Organize partitioning code by knighton in https://github.com/mosaicml/streaming/pull/190
* Bump pydantic from 1.10.5 to 1.10.6 by dependabot in https://github.com/mosaicml/streaming/pull/194
* Bump uvicorn from 0.20.0 to 0.21.0 by dependabot in https://github.com/mosaicml/streaming/pull/196
* Bump fastapi from 0.92.0 to 0.94.0 by dependabot in https://github.com/mosaicml/streaming/pull/198
* Remove unsupported ThreadPoolExecutor shutdown param in python38 by karan6181 in https://github.com/mosaicml/streaming/pull/199
* Fix doctstrings (maybe?) by Landanjs in https://github.com/mosaicml/streaming/pull/200
* Demo: crawling, converting, and iterating weighted dataset subsets by knighton in https://github.com/mosaicml/streaming/pull/191
* Update WebVid README.md by knighton in https://github.com/mosaicml/streaming/pull/202
* Fix leftover test dirs and improve dataset method and variable names by knighton in https://github.com/mosaicml/streaming/pull/201
* Bump fastapi from 0.94.0 to 0.95.0 by dependabot in https://github.com/mosaicml/streaming/pull/205
* Bump uvicorn from 0.21.0 to 0.21.1 by dependabot in https://github.com/mosaicml/streaming/pull/206
* Raise an exception if bucket does not exist during upload by karan6181 in https://github.com/mosaicml/streaming/pull/212
* Bump yamllint from 1.29.0 to 1.30.0 by dependabot in https://github.com/mosaicml/streaming/pull/209
* Bump pydantic from 1.10.6 to 1.10.7 by dependabot in https://github.com/mosaicml/streaming/pull/211
* Register atexit handler for resource cleanup by karan6181 in https://github.com/mosaicml/streaming/pull/215
* Bump version to 0.4.0 by karan6181 in https://github.com/mosaicml/streaming/pull/216

New Contributors
* coryMosaicML made their first contribution in https://github.com/mosaicml/streaming/pull/186
* Landanjs made their first contribution in https://github.com/mosaicml/streaming/pull/200

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.3.0...v0.4.0

0.3.1

Not secure
Hotfix

Hotfix to fix installation of the `composer` package

0.3.0

Not secure
Streaming `v0.3.0` is released! Install via `pip`:


```
pip install --upgrade mosaicml-streaming==0.3.0
```


New Features

☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to `MDSWriter`. Track the progress of individual uploads with `progress_bar=True`, and tune background upload workers with `max_workers=4`.

You can upload output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by passing a cloud bucket location as the `out` parameter of the `Writer` class. The example below uploads output files to an AWS S3 bucket:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)
```


You can keep the output shard files locally by passing a local directory path to `Writer`. For example,

```python
output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)
```


You can watch the progress of the cloud upload by setting `progress_bar=True` on `Writer`. For example,

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        out.write(sample)
```


You can control the number of background upload threads with the `max_workers` parameter of `Writer`. These threads upload the shard files to the remote location, if one is provided, and each thread handles one file upload at a time. For example, with `max_workers=4`, at most 4 threads are active at the same time, each uploading one shard file.

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        out.write(sample)
```
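The "one thread per file, at most `max_workers` at once" behavior can be illustrated with the standard library's thread pool. This is a generic sketch, not the writer's actual upload code: `upload_shards` and `fake_upload` are hypothetical stand-ins, and no network call is made.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_shards(shard_files, upload_fn, max_workers=4):
    """Upload each shard on a background thread, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # One task per shard file; map() returns results in input order.
        return list(executor.map(upload_fn, shard_files))


def fake_upload(path):
    # Stand-in for a real cloud upload call.
    return f's3://bucket/dir/path/{path}'


uploaded = upload_shards(['shard.00000.mds', 'shard.00001.mds'], fake_upload, max_workers=4)
```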


🔀 2x faster shuffling

We’ve added a new shuffling algorithm `py1s` which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding `shuffle_algo` (old behavior: `py2s`). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.

📨 2x faster partitioning

We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the `partition_algo` argument to `StreamingDataset`. You will experience this as faster start and resumption for large datasets.

🔗 Extensible downloads

We provide examples of modifying `StreamingDataset` to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.

**API changes**

- The `dirname` parameter of `Writer` and its derived classes (`MDSWriter`, `XSVWriter`, `TSVWriter`, `CSVWriter`, and `JSONWriter`) has been renamed to `out`, which accepts the following forms:
  - If `out` is a local directory, shard files are saved locally. For example, `out='/tmp/mds/'`.
  - If `out` is a remote directory, a local temporary directory is created to cache the shard files, which are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, `out='s3://bucket/dir/path'`.
  - If `out` is a tuple of `(local_dir, remote_dir)`, shard files are saved in `local_dir` and also uploaded to the remote location. For example, `out=('/tmp/mds/', 's3://bucket/dir/path')`.

- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of `Writer` and its subclasses (like `MDSWriter`) and `StreamingDataset` to require kwargs.
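The three accepted forms of `out` can be summarized in a small resolver. The function below is a hypothetical sketch of that decision logic, not the library's code; it treats any string with a URL scheme as remote, which is an assumption made for this example.

```python
import tempfile
from urllib.parse import urlparse


def resolve_out(out):
    """Return (local_dir, remote_dir) for the three accepted forms of `out`."""
    if isinstance(out, tuple):
        # Explicit (local_dir, remote_dir) pair: keep shards locally and upload them.
        local, remote = out
        return local, remote
    if urlparse(out).scheme:
        # Remote path: shards are staged in a temporary local directory first.
        return tempfile.mkdtemp(), out
    # Plain local directory: nothing is uploaded.
    return out, None


local_only = resolve_out('/tmp/mds/')
both = resolve_out(('/tmp/mds/', 's3://bucket/dir/path'))
staged_local, staged_remote = resolve_out('s3://bucket/dir/path')
```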

Bug Fixes

- Fix broken blog post link and community email link in the README (#177).
- Download shard files with a tmp extension until the download finishes for OCI blob storage (#178).
- Add documentation of supported cloud providers (#169). Streaming Dataset supports Amazon S3, Google Cloud Storage, and Oracle Cloud Storage, so you can stream your data to any compute cluster. Read [this doc](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) on how to configure cloud storage credentials.
- Make `setup.py` deterministic by sorting dependencies (#165).
- Fix overlong lines for better readability (#163).

What's Changed
* Bump fastapi from 0.89.1 to 0.91.0 by dependabot in https://github.com/mosaicml/streaming/pull/154
* Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by dependabot in https://github.com/mosaicml/streaming/pull/155
* Compare arrow vs mds vs parquet. by knighton in https://github.com/mosaicml/streaming/pull/160
* Improve serialization format comparison. by knighton in https://github.com/mosaicml/streaming/pull/161
* WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by knighton in https://github.com/mosaicml/streaming/pull/143
* Update download badge link to pepy by karan6181 in https://github.com/mosaicml/streaming/pull/162
* CloudWriter interface: local=, remote=, keep=. by knighton in https://github.com/mosaicml/streaming/pull/148
* Fix overlong lines. by knighton in https://github.com/mosaicml/streaming/pull/163
* Make setup.py deterministic by sorting dependencies. by nharada1 in https://github.com/mosaicml/streaming/pull/165
* Bump pydantic from 1.10.4 to 1.10.5 by dependabot in https://github.com/mosaicml/streaming/pull/166
* Bump gitpython from 3.1.30 to 3.1.31 by dependabot in https://github.com/mosaicml/streaming/pull/167
* Bump fastapi from 0.91.0 to 0.92.0 by dependabot in https://github.com/mosaicml/streaming/pull/168
* Adjust StreamingDataset arguments by knighton in https://github.com/mosaicml/streaming/pull/170
* add 2x faster shuffle algorithm; add shuffle bench/plot by knighton in https://github.com/mosaicml/streaming/pull/137
* Docstring fix by knighton in https://github.com/mosaicml/streaming/pull/173
* Add a supported cloud providers documentation by karan6181 in https://github.com/mosaicml/streaming/pull/169
* Add callout fence to Configure Cloud Storage Credentials guide by karan6181 in https://github.com/mosaicml/streaming/pull/174
* Fix broken links in the README by knighton in https://github.com/mosaicml/streaming/pull/177
* Download the shard files as tmp extension until it finishes for OCI by karan6181 in https://github.com/mosaicml/streaming/pull/178
* Add a support of uploading shard files to a cloud as part of Writer by karan6181 in https://github.com/mosaicml/streaming/pull/171
* Refactor partitioning to be much faster. by knighton in https://github.com/mosaicml/streaming/pull/179
* Bump version to 0.3.0 by karan6181 in https://github.com/mosaicml/streaming/pull/180

New Contributors
* nharada1 made their first contribution in https://github.com/mosaicml/streaming/pull/165

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.5...v0.3.0

0.2.5

Streaming `v0.2.5` is released! Install via `pip`:

```
pip install --upgrade mosaicml-streaming==0.2.5
```


Bug Fixes
* Fixed CPU crash (https://github.com/mosaicml/streaming/pull/153)
* Update example notebooks (https://github.com/mosaicml/streaming/pull/157)

What's Changed
* Update README.md by knighton in https://github.com/mosaicml/streaming/pull/152
* Fix typo by dakinggg in https://github.com/mosaicml/streaming/pull/156
* Fixed CPU crash by karan6181 in https://github.com/mosaicml/streaming/pull/153
* Update example notebooks by karan6181 in https://github.com/mosaicml/streaming/pull/157
* bump version to 0.2.5 by karan6181 in https://github.com/mosaicml/streaming/pull/158


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.4...v0.2.5
