Streaming `v0.3.0` is released! Install via `pip`:
```bash
pip install --upgrade mosaicml-streaming==0.3.0
```
New Features
:cloud: Cloud uploading
Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to `MDSWriter`. Track the progress of individual uploads with `progress_bar=True`, and tune background upload workers with `max_workers=4`.
You can upload output shard files automatically to a supported cloud (AWS S3, GCP, OCI) by passing a bucket location as the `out` parameter of the `Writer` class. The example below uploads output files to an AWS S3 bucket:

```python
from streaming import MDSWriter

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)  # write each sample into the current shard
```
To keep output shard files locally, pass a local directory path as `out`. For example:

```python
output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)  # shards stay in the local directory
```
To see the progress of cloud uploads, set `progress_bar=True` on the `Writer`. For example:

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        out.write(sample)  # upload progress is displayed as shards are uploaded
```
The `max_workers` parameter of `Writer` controls the number of background threads that upload shard files to the remote location, if one is provided. Each thread handles one file upload at a time. For example, with `max_workers=4`, at most 4 threads are active at once, each uploading one shard file.

```python
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        out.write(sample)  # finished shards are uploaded by up to 4 threads
```
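To illustrate the `max_workers` behavior described above, here is a minimal standard-library sketch of a bounded pool of upload threads, each handling one shard file at a time. It is not Streaming's implementation: `upload_shard` and `upload_all` are hypothetical helpers, and the "upload" is simulated with a local file copy in place of a cloud client.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def upload_shard(shard: Path, remote_dir: Path) -> str:
    """Hypothetical upload: copy one shard file into a stand-in 'remote' directory."""
    shutil.copy(shard, remote_dir / shard.name)
    return shard.name


def upload_all(shards: List[Path], remote_dir: Path, max_workers: int = 4) -> List[str]:
    """Upload shards with at most `max_workers` threads, one file per thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; the pool caps concurrency at max_workers.
        return list(pool.map(lambda s: upload_shard(s, remote_dir), shards))


# Create a few fake shard files locally, then "upload" them.
local_dir = Path(tempfile.mkdtemp())
remote_dir = Path(tempfile.mkdtemp())
shards = []
for i in range(8):
    shard = local_dir / f'shard.{i:05}.mds'
    shard.write_bytes(b'\x00' * 16)
    shards.append(shard)

uploaded = upload_all(shards, remote_dir, max_workers=4)
print(uploaded)
```

With 8 shards and `max_workers=4`, no more than 4 copies run concurrently; the remaining shards wait until a thread becomes free.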
:twisted_rightwards_arrows: 2x faster shuffling
We’ve added a new shuffling algorithm `py1s` which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding `shuffle_algo` (old behavior: `py2s`). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.
📨 2x faster partitioning
We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the `partition_algo` argument to `StreamingDataset`. You will experience this as faster start and resumption for large datasets.
:link: Extensible downloads
We provide examples of modifying `StreamingDataset` to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.
**API changes**
- The `dirname` parameter of `Writer` and its derived classes (`MDSWriter`, `XSVWriter`, `TSVWriter`, `CSVWriter`, and `JSONWriter`) has been renamed to `out`, which adds the following functionality:
    - If `out` is a local directory, shard files are saved locally. For example, `out='/tmp/mds/'`.
    - If `out` is a remote directory, a local temporary directory is created to cache the shard files, which are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, `out='s3://bucket/dir/path'`.
    - If `out` is a tuple of `(local_dir, remote_dir)`, shard files are saved in `local_dir` and also uploaded to the remote location. For example, `out=('/tmp/mds/', 's3://bucket/dir/path')`.
- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of `Writer` and its subclasses (like `MDSWriter`) and `StreamingDataset` to require kwargs.
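The three forms of `out` described above can be sketched as a small dispatch function. This is an illustration only, not Streaming's code: `resolve_out` is a hypothetical helper, and treating any string containing `://` as remote is an assumption made for the sketch.

```python
import tempfile
from typing import Optional, Tuple, Union


def resolve_out(out: Union[str, Tuple[str, str]]) -> Tuple[str, Optional[str], bool]:
    """Hypothetical helper: return (local_dir, remote_dir, local_is_temporary)."""
    if isinstance(out, tuple):
        # (local_dir, remote_dir): keep shards locally and upload them too.
        local_dir, remote_dir = out
        return local_dir, remote_dir, False
    if '://' in out:
        # Remote only: cache shards in a temporary directory before upload.
        return tempfile.mkdtemp(), out, True
    # Local only: nothing to upload.
    return out, None, False


print(resolve_out('/tmp/mds/'))
print(resolve_out(('/tmp/mds/', 's3://bucket/dir/path')))
local, remote, is_temp = resolve_out('s3://bucket/dir/path')
print(remote, is_temp)
```

The third element of the result marks whether the local directory is a temporary cache that should be deleted after the upload finishes, which corresponds to the remote-only case.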
Bug Fixes
- Fix broken blog post link and community email link in the README (#177).
- Download shard files with a tmp extension until the download finishes for OCI blob storage (#178).
- Add supported cloud providers documentation (#169).
    - Streaming Dataset supports Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) doc on how to configure cloud storage credentials.
- Make `setup.py` deterministic by sorting dependencies (#165).
- Fix overlong lines for better readability (#163).
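The tmp-extension fix for OCI downloads listed above follows a common pattern: write to a temporary name, then rename once the transfer is complete, so readers never see a partial shard. A minimal standard-library sketch of that pattern follows. It is not the code from the PR: `download_atomically` and the `fetch` callback are hypothetical, and the `.tmp` suffix is assumed for illustration.

```python
import os
import tempfile
from typing import Callable


def download_atomically(fetch: Callable[[str], None], dest: str) -> None:
    """Hypothetical helper: fetch into `dest + '.tmp'`, then rename to `dest`."""
    tmp = dest + '.tmp'
    try:
        fetch(tmp)  # write the full payload to the temporary path
        os.replace(tmp, dest)  # atomic rename on the same filesystem
    finally:
        # Clean up a partial file if the fetch failed before the rename.
        if os.path.exists(tmp):
            os.remove(tmp)


def fake_fetch(path: str) -> None:
    """Stand-in for a cloud download; writes a few bytes to `path`."""
    with open(path, 'wb') as f:
        f.write(b'shard-bytes')


dest = os.path.join(tempfile.mkdtemp(), 'shard.00000.mds')
download_atomically(fake_fetch, dest)
print(os.path.exists(dest), os.path.exists(dest + '.tmp'))
```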
What's Changed
* Bump fastapi from 0.89.1 to 0.91.0 by dependabot in https://github.com/mosaicml/streaming/pull/154
* Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by dependabot in https://github.com/mosaicml/streaming/pull/155
* Compare arrow vs mds vs parquet. by knighton in https://github.com/mosaicml/streaming/pull/160
* Improve serialization format comparison. by knighton in https://github.com/mosaicml/streaming/pull/161
* WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by knighton in https://github.com/mosaicml/streaming/pull/143
* Update download badge link to pepy by karan6181 in https://github.com/mosaicml/streaming/pull/162
* CloudWriter interface: local=, remote=, keep=. by knighton in https://github.com/mosaicml/streaming/pull/148
* Fix overlong lines. by knighton in https://github.com/mosaicml/streaming/pull/163
* Make setup.py deterministic by sorting dependencies. by nharada1 in https://github.com/mosaicml/streaming/pull/165
* Bump pydantic from 1.10.4 to 1.10.5 by dependabot in https://github.com/mosaicml/streaming/pull/166
* Bump gitpython from 3.1.30 to 3.1.31 by dependabot in https://github.com/mosaicml/streaming/pull/167
* Bump fastapi from 0.91.0 to 0.92.0 by dependabot in https://github.com/mosaicml/streaming/pull/168
* Adjust StreamingDataset arguments by knighton in https://github.com/mosaicml/streaming/pull/170
* add 2x faster shuffle algorithm; add shuffle bench/plot by knighton in https://github.com/mosaicml/streaming/pull/137
* Docstring fix by knighton in https://github.com/mosaicml/streaming/pull/173
* Add a supported cloud providers documentation by karan6181 in https://github.com/mosaicml/streaming/pull/169
* Add callout fence to Configure Cloud Storage Credentials guide by karan6181 in https://github.com/mosaicml/streaming/pull/174
* Fix broken links in the README by knighton in https://github.com/mosaicml/streaming/pull/177
* Download the shard files as tmp extension until it finishes for OCI by karan6181 in https://github.com/mosaicml/streaming/pull/178
* Add a support of uploading shard files to a cloud as part of Writer by karan6181 in https://github.com/mosaicml/streaming/pull/171
* Refactor partitioning to be much faster. by knighton in https://github.com/mosaicml/streaming/pull/179
* Bump version to 0.3.0 by karan6181 in https://github.com/mosaicml/streaming/pull/180
New Contributors
* nharada1 made their first contribution in https://github.com/mosaicml/streaming/pull/165
**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.2.5...v0.3.0