New Features
1. **Elastic world size deterministic shuffle**
Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. **This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.**
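The property described above can be illustrated with a toy model. This is only a sketch and not the library's actual shuffling algorithm: the helper names `global_order`, `rank_slice`, and `collective_order` are hypothetical, and it ignores shards and canonical nodes. The idea it shows is that the sample order depends on the seed and epoch alone, so changing the number of devices mid-epoch changes only who reads each sample, not the order in which the job as a whole visits them.

```python
import random


def global_order(num_samples: int, seed: int, epoch: int) -> list[int]:
    """Return one permutation that depends only on (seed, epoch)."""
    order = list(range(num_samples))
    # World size is deliberately not an input to the shuffle.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def rank_slice(order: list[int], start: int, world_size: int, rank: int) -> list[int]:
    """Samples one rank reads when the job resumes at global position `start`."""
    return order[start + rank::world_size]


def collective_order(order: list[int], start: int, stop: int, world_size: int) -> list[int]:
    """Interleave the per-rank slices back into the order the whole job visits."""
    window = order[:stop]
    parts = [rank_slice(window, start, world_size, r) for r in range(world_size)]
    return [parts[i % world_size][i // world_size] for i in range(stop - start)]


order = global_order(10, seed=7, epoch=0)
# Train the first 4 samples on 2 devices, checkpoint, then resume on 3 devices.
first = collective_order(order, 0, 4, world_size=2)
rest = collective_order(order, 4, 10, world_size=3)
assert first + rest == order
```

In this toy version the checkpoint only needs to record the global position (4 here); the partitioning across devices is recomputed from the new world size.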
2. **Instant Mid-Epoch Resumption**
Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.
3. **NEW StreamingDataLoader**
`StreamingDataLoader` is a drop-in replacement for the PyTorch `DataLoader` that adds mid-epoch resumption: it resumes from where you left off without spinning the dataloader.
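To make the idea of resuming "without spinning" concrete, here is a minimal toy loader. It is a sketch, not the `StreamingDataLoader` implementation: the class name `ResumableLoader` is hypothetical, and the `state_dict()` / `load_state_dict()` method names simply follow the usual PyTorch checkpointing convention. Instead of replaying and discarding the batches that were already consumed, it records how many samples were handed out and starts the next iteration at that offset.

```python
class ResumableLoader:
    """Toy batch loader that can resume mid-epoch from a saved offset."""

    def __init__(self, samples, batch_size):
        self.samples = list(samples)
        self.batch_size = batch_size
        self.num_seen = 0  # samples handed out so far in the current epoch

    def __iter__(self):
        # Start directly at the saved offset; earlier samples are never touched.
        while self.num_seen < len(self.samples):
            batch = self.samples[self.num_seen:self.num_seen + self.batch_size]
            self.num_seen += len(batch)
            yield batch
        self.num_seen = 0  # epoch finished, so the next epoch starts from the top

    def state_dict(self):
        return {'num_seen': self.num_seen}

    def load_state_dict(self, state):
        self.num_seen = state['num_seen']


# Consume two batches, then checkpoint.
loader = ResumableLoader(range(10), batch_size=2)
it = iter(loader)
consumed = [next(it), next(it)]
checkpoint = loader.state_dict()

# A fresh loader picks up exactly where the first one stopped.
resumed = ResumableLoader(range(10), batch_size=2)
resumed.load_state_dict(checkpoint)
remaining = list(resumed)
```

The cost of resuming here is independent of how far into the epoch the checkpoint was taken, which is the point of the feature.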
4. **Support for Oracle Cloud Infrastructure (OCI) blob storage**
Streaming now supports OCI blob storage as a storage backend. Pass the remote path to `StreamingDataset` as either `oci://<bucket_name>@<namespace>/<folder_name>/<filename>` or `oci://<bucket_name>/<folder_name>/<filename>`. For example:
```python
from streaming import StreamingDataset

# Remote OCI location of the dataset and the local cache directory
remote = 'oci://<bucket>@<namespace>/<path>'
local = '/tmp/dataset/'
train_dataset = StreamingDataset(local=local, remote=remote, split='train')
```
Streaming expects the OCI credentials to be present in `~/.oci/config`.
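As a rough illustration of the two accepted path forms, the sketch below splits an OCI remote into bucket, optional namespace, and object key using only the standard library. It assumes the bucket and namespace are separated by `@`; the helper `split_oci_remote` and the `OciPath` tuple are hypothetical names for this example and are not part of the Streaming API.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class OciPath(NamedTuple):
    bucket: str
    namespace: Optional[str]  # None when the path omits the namespace
    key: str


def split_oci_remote(remote: str) -> OciPath:
    """Split 'oci://bucket[@namespace]/key' into its parts (assumed format)."""
    parts = urlsplit(remote)
    if parts.scheme != 'oci':
        raise ValueError(f'expected an oci:// path, got {remote!r}')
    bucket, sep, namespace = parts.netloc.partition('@')
    return OciPath(bucket, namespace if sep else None, parts.path.lstrip('/'))


with_namespace = split_oci_remote('oci://my-bucket@my-namespace/data/train/index.json')
without_namespace = split_oci_remote('oci://my-bucket/data/train/index.json')
```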
5. **Support for public AWS S3 buckets**
In addition to private AWS S3 buckets, Streaming now supports public S3 buckets, which can be accessed without credentials. Instantiate `StreamingDataset` with a public bucket as follows:
```python
from streaming import StreamingDataset

# Public S3 location of the dataset and the local cache directory
remote = 's3://<bucket>/<path>'
local = '/tmp/dataset/'
train_dataset = StreamingDataset(local=local, remote=remote, split='train')
```
API changes
- The class `Dataset` has been renamed as class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- Similarly, the built-in classes for the most popular datasets have been renamed. For example:
- `C4` renamed as `StreamingC4`
- `EnWiki` renamed as `StreamingEnWiki`
- `Pile` renamed as `StreamingPile`
- `ADE20K` renamed as `StreamingADE20K`
- `CIFAR10` renamed as `StreamingCIFAR10`
- `COCO` renamed as `StreamingCOCO`
- `ImageNet` renamed as `StreamingImageNet`
- The parameter `prefetch` in class `Dataset` has been renamed as `predownload` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `retry` in class `Dataset` has been renamed as `download_retry` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `timeout` in class `Dataset` has been renamed as `download_timeout` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `hash` in class `Dataset` has been renamed as `validate_hash` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
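When migrating existing call sites, the parameter renames listed above can be applied mechanically. The helper below is a small sketch of that: `migrate_kwargs` and `RENAMED_PARAMS` are hypothetical names, not part of the library, and the mapping contains only the four renames listed here. It assumes the values themselves keep their meaning and only the keyword names change.

```python
# Old `Dataset` keyword -> new `StreamingDataset` keyword, per the list above.
RENAMED_PARAMS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs: dict) -> dict:
    """Return a copy of `old_kwargs` with the renamed parameters updated."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_PARAMS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were supplied.
            raise ValueError(f'duplicate value for {new_name!r}')
        new_kwargs[new_name] = value
    return new_kwargs


old = {'local': '/tmp/dataset/', 'prefetch': 100, 'retry': 2, 'hash': 'sha1'}
new = migrate_kwargs(old)
```

Keywords that were not renamed, such as `local`, pass through unchanged.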
What's Changed
* Bump nbsphinx from 0.8.9 to 0.8.10 by dependabot in https://github.com/mosaicml/streaming/pull/73
* Bump sphinx-argparse from 0.3.2 to 0.4.0 by dependabot in https://github.com/mosaicml/streaming/pull/74
* The Pile (conversion + streaming dataset) by knighton in https://github.com/mosaicml/streaming/pull/71
* [Docs] Switch back to RTD search by bandish-shah in https://github.com/mosaicml/streaming/pull/83
* make pyright precommit check actually run by dblalock in https://github.com/mosaicml/streaming/pull/84
* Fixed stale URL references by bandish-shah in https://github.com/mosaicml/streaming/pull/85
* Bump sphinx-copybutton from 0.5.0 to 0.5.1 by dependabot in https://github.com/mosaicml/streaming/pull/78
* Bump pandoc from 2.2 to 2.3 by dependabot in https://github.com/mosaicml/streaming/pull/79
* Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by dependabot in https://github.com/mosaicml/streaming/pull/80
* Bump sphinxext-opengraph from 0.7.2 to 0.7.3 by dependabot in https://github.com/mosaicml/streaming/pull/81
* Support for concat option in C4 Dataset by karan6181 in https://github.com/mosaicml/streaming/pull/77
* Elastic world size deterministic shuffle with mid-epoch resumption by knighton in https://github.com/mosaicml/streaming/pull/37
* Support for S3 public bucket by karan6181 in https://github.com/mosaicml/streaming/pull/88
* Add OCI Cloud Storage support by karan6181 in https://github.com/mosaicml/streaming/pull/86
* Make StreamingDataset state_dict() more flexible by knighton in https://github.com/mosaicml/streaming/pull/90
* Bump version to 0.2.0 by karan6181 in https://github.com/mosaicml/streaming/pull/92
**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.1.2...v0.2.0