Mosaicml-streaming

Latest version: v0.9.1

0.2.0

Not secure
New Features

1. **Elastic world size deterministic shuffle**

Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. **This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.**
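A toy model may help make this guarantee concrete. The sketch below is not Streaming's actual shuffling algorithm; `global_order`, `device_slice`, and `collective_order` are hypothetical helpers. It assumes only that the global sample order is a function of the seed, the epoch, and the canonical node count, and that each physical device reads a strided slice of that order.

```python
import random


def global_order(num_samples, seed, epoch, canonical_nodes):
    """Build a sample order that depends only on seed, epoch, and canonical nodes."""
    chunks = []
    for node in range(canonical_nodes):
        # Give each canonical node one contiguous chunk of sample IDs.
        begin = node * num_samples // canonical_nodes
        end = (node + 1) * num_samples // canonical_nodes
        chunk = list(range(begin, end))
        # The RNG seed never mentions the physical world size.
        random.Random(f'{seed}-{epoch}-{node}').shuffle(chunk)
        chunks.append(chunk)
    # Interleave the chunks round-robin to form one global order.
    order = []
    for i in range(max(len(chunk) for chunk in chunks)):
        for chunk in chunks:
            if i < len(chunk):
                order.append(chunk[i])
    return order


def device_slice(order, rank, world_size):
    """Samples read by one physical device: every world_size-th sample."""
    return order[rank::world_size]


def collective_order(order, world_size):
    """Order in which all devices together consume samples, step by step."""
    slices = [device_slice(order, rank, world_size) for rank in range(world_size)]
    merged = []
    for step in range(len(slices[0])):  # rank 0 always holds the longest slice
        for device_samples in slices:
            if step < len(device_samples):
                merged.append(device_samples[step])
    return merged


order = global_order(16, seed=42, epoch=0, canonical_nodes=2)
# The collective traversal is the same for 1, 2, 4, or 8 devices.
assert all(collective_order(order, w) == order for w in (1, 2, 4, 8))
```

Because the physical world size only affects how the fixed order is sliced, changing the number of devices between a checkpoint and a resume does not change which sample comes next in this model.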

2. **Instant Mid-Epoch Resumption**

Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.

3. **NEW StreamingDataLoader**

A `StreamingDataLoader` is a drop-in replacement for the PyTorch `DataLoader` that adds mid-epoch resumption: it picks up from where you left off without re-iterating the data loader.
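The idea behind instant resumption can be sketched with a toy iterator. This is not the `StreamingDataLoader` implementation; `ResumableOrder` is a hypothetical class, and it assumes the sample order is fixed and known up front, so that a saved position is all that is needed to continue.

```python
class ResumableOrder:
    """Toy iterator over a fixed sample order that resumes without replaying."""

    def __init__(self, order):
        self.order = list(order)
        self.position = 0  # number of samples already yielded this epoch

    def __iter__(self):
        # Start at the saved position instead of re-iterating from sample 0.
        while self.position < len(self.order):
            sample = self.order[self.position]
            self.position += 1
            yield sample
        self.position = 0  # epoch finished; the next epoch starts from the top

    def state_dict(self):
        return {'position': self.position}

    def load_state_dict(self, state):
        self.position = state['position']


# Consume four samples, then checkpoint.
first_run = ResumableOrder(range(10))
iterator = iter(first_run)
seen = [next(iterator) for _ in range(4)]
checkpoint = first_run.state_dict()

# A fresh object resumes directly at the fifth sample.
resumed = ResumableOrder(range(10))
resumed.load_state_dict(checkpoint)
rest = list(resumed)
assert seen + rest == list(range(10))
```

The resume cost here is constant because the position is restored directly, rather than being reached by drawing and discarding the samples already seen.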

4. **Support for Oracle Cloud Infrastructure (OCI) blob storage**

Streaming now supports OCI blob storage as a storage backend. Pass the remote path to `StreamingDataset` as either `oci://<bucket_name>@<namespace>/<folder_name>/<filename>` or `oci://<bucket_name>/<folder_name>/<filename>`. For example:

```python
from streaming import StreamingDataset

# Remote OCI location of the dataset and the local cache directory.
remote = 'oci://<bucket>@<namespace>/<path>'
local = '/tmp/dataset/'

train_dataset = StreamingDataset(local=local, remote=remote, split='train')
```
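To show how the two remote forms differ, the sketch below splits a remote string using only the standard library. `split_oci_remote` is a hypothetical helper, not part of Streaming, and it assumes the namespace (when present) follows the bucket name after an `@`.

```python
from urllib.parse import urlparse


def split_oci_remote(remote):
    """Split an oci:// remote into (bucket, namespace or None, object path)."""
    parsed = urlparse(remote)
    if parsed.scheme != 'oci':
        raise ValueError(f'Expected an oci:// remote, got: {remote}')
    # Assumption: the namespace, if given, follows the bucket after '@'.
    bucket, _, namespace = parsed.netloc.partition('@')
    return bucket, namespace or None, parsed.path.lstrip('/')


with_namespace = split_oci_remote('oci://my-bucket@my-namespace/data/train')
without_namespace = split_oci_remote('oci://my-bucket/data/train')
assert with_namespace == ('my-bucket', 'my-namespace', 'data/train')
assert without_namespace == ('my-bucket', None, 'data/train')
```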


Streaming expects the OCI credentials to be present in `~/.oci/config`.

5. **Support for public AWS S3 buckets**

Streaming now supports public AWS S3 buckets, which can be accessed without credentials, in addition to the already supported private AWS S3 buckets. Instantiate `StreamingDataset` with an S3 remote as follows:


```python
from streaming import StreamingDataset

remote = 's3://<bucket>/<path>'
local = '/tmp/dataset/'

train_dataset = StreamingDataset(local=local, remote=remote, split='train')
```


API changes
- The class `Dataset` has been renamed as class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- Similarly, the built-in classes for the most popular datasets have been renamed:
  - `C4` renamed as `StreamingC4`
  - `EnWiki` renamed as `StreamingEnWiki`
  - `Pile` renamed as `StreamingPile`
  - `ADE20K` renamed as `StreamingADE20K`
  - `CIFAR10` renamed as `StreamingCIFAR10`
  - `COCO` renamed as `StreamingCOCO`
  - `ImageNet` renamed as `StreamingImageNet`
- The parameter `prefetch` in class `Dataset` has been renamed as `predownload` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `retry` in class `Dataset` has been renamed as `download_retry` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `timeout` in class `Dataset` has been renamed as `download_timeout` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
- The parameter `hash` in class `Dataset` has been renamed as `validate_hash` in class `StreamingDataset` (https://github.com/mosaicml/streaming/pull/37).
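When migrating existing call sites, the parameter renames listed above can be applied mechanically. The helper below is a minimal sketch; `RENAMED_ARGS` and `migrate_kwargs` are hypothetical names, not part of Streaming. It only renames keys and assumes each value keeps its meaning under the new name, which is worth checking against the `StreamingDataset` documentation before relying on it.

```python
# Old `Dataset` keyword -> new `StreamingDataset` keyword, per the list above.
RENAMED_ARGS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs):
    """Return a copy of old_kwargs with renamed parameters mapped to new names."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_ARGS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were passed; refuse to guess.
            raise ValueError(f'Duplicate argument after renaming: {new_name}')
        new_kwargs[new_name] = value
    return new_kwargs


old = {'local': '/tmp/dataset/', 'prefetch': 1000, 'retry': 2}
new = migrate_kwargs(old)
assert new == {'local': '/tmp/dataset/', 'predownload': 1000, 'download_retry': 2}
```

The result can then be passed as `StreamingDataset(**new)` once the class name itself has been updated.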

What's Changed
* Bump nbsphinx from 0.8.9 to 0.8.10 by dependabot in https://github.com/mosaicml/streaming/pull/73
* Bump sphinx-argparse from 0.3.2 to 0.4.0 by dependabot in https://github.com/mosaicml/streaming/pull/74
* The Pile (conversion + streaming dataset) by knighton in https://github.com/mosaicml/streaming/pull/71
* [Docs] Switch back to RTD search by bandish-shah in https://github.com/mosaicml/streaming/pull/83
* make pyright precommit check actually run by dblalock in https://github.com/mosaicml/streaming/pull/84
* Fixed stale URL references by bandish-shah in https://github.com/mosaicml/streaming/pull/85
* Bump sphinx-copybutton from 0.5.0 to 0.5.1 by dependabot in https://github.com/mosaicml/streaming/pull/78
* Bump pandoc from 2.2 to 2.3 by dependabot in https://github.com/mosaicml/streaming/pull/79
* Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by dependabot in https://github.com/mosaicml/streaming/pull/80
* Bump sphinxext-opengraph from 0.7.2 to 0.7.3 by dependabot in https://github.com/mosaicml/streaming/pull/81
* Support for concat option in C4 Dataset by karan6181 in https://github.com/mosaicml/streaming/pull/77
* Elastic world size deterministic shuffle with mid-epoch resumption by knighton in https://github.com/mosaicml/streaming/pull/37
* Support for S3 public bucket by karan6181 in https://github.com/mosaicml/streaming/pull/88
* Add OCI Cloud Storage support by karan6181 in https://github.com/mosaicml/streaming/pull/86
* Make StreamingDataset state_dict() more flexible by knighton in https://github.com/mosaicml/streaming/pull/90
* Bump version to 0.2.0 by karan6181 in https://github.com/mosaicml/streaming/pull/92


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.1.2...v0.2.0

0.1.2

Not secure
What's Changed
* Fixed contributing page link by karan6181 in https://github.com/mosaicml/streaming/pull/61
* Add Distributed test and supported multi device unittest by karan6181 in https://github.com/mosaicml/streaming/pull/57
* Added template and adhere to standard coding practice by karan6181 in https://github.com/mosaicml/streaming/pull/62
* Bump pytest from 7.1.3 to 7.2.0 by dependabot in https://github.com/mosaicml/streaming/pull/63
* Bump pypandoc from 1.9 to 1.10 by dependabot in https://github.com/mosaicml/streaming/pull/65
* Add code coverage report and moved scripts outside of src by karan6181 in https://github.com/mosaicml/streaming/pull/66
* Bump sphinxext-opengraph from 0.6.3 to 0.7.2 by dependabot in https://github.com/mosaicml/streaming/pull/67
* Add Google Cloud Storage support by karan6181 in https://github.com/mosaicml/streaming/pull/68
* Create and push release branch as part of workflow by karan6181 in https://github.com/mosaicml/streaming/pull/69
* Add test CI badge in README by karan6181 in https://github.com/mosaicml/streaming/pull/70
* Add unit test for download, encodings, hashing, and others by karan6181 in https://github.com/mosaicml/streaming/pull/72
* Bump version to 0.1.2 by karan6181 in https://github.com/mosaicml/streaming/pull/75


**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.1.1...v0.1.2

0.1.1

Not secure
What's Changed
* Streaming datasets V2 by knighton in https://github.com/mosaicml/streaming/pull/2
* Initial Docs Site by bandish-shah in https://github.com/mosaicml/streaming/pull/3
* Added a ADE20K and COCO2017 data conversion scripts by karan6181 in https://github.com/mosaicml/streaming/pull/5
* Added pre-commit config by karan6181 in https://github.com/mosaicml/streaming/pull/6
* Added pre-commit config for a License Header by karan6181 in https://github.com/mosaicml/streaming/pull/7
* Convert relative imports to absolute imports by karan6181 in https://github.com/mosaicml/streaming/pull/8
* C4 dataset by knighton in https://github.com/mosaicml/streaming/pull/4
* Add a ADE20K streaming dataset class by karan6181 in https://github.com/mosaicml/streaming/pull/9
* PyPi mods for setup.py by bandish-shah in https://github.com/mosaicml/streaming/pull/10
* Disable local shard deletion by knighton in https://github.com/mosaicml/streaming/pull/12
* Add a COCO streaming dataset class by karan6181 in https://github.com/mosaicml/streaming/pull/13
* Add docstrings. by knighton in https://github.com/mosaicml/streaming/pull/14
* Added unittest for Writer and Reader by karan6181 in https://github.com/mosaicml/streaming/pull/16
* added new streaming logos by ejyuen in https://github.com/mosaicml/streaming/pull/15
* Update package version code for unification by karan6181 in https://github.com/mosaicml/streaming/pull/17
* Fix wait-for-unzip race by knighton in https://github.com/mosaicml/streaming/pull/18
* Added algolia search to streaming docs site by nqn in https://github.com/mosaicml/streaming/pull/19
* Add a pre-commit GitHub workflow by karan6181 in https://github.com/mosaicml/streaming/pull/21
* Added pydocstyle and docformatter in pre-commit config by karan6181 in https://github.com/mosaicml/streaming/pull/20
* Improve algorithmic complexity of sample-to-shard lookup from O(log N) to O(1) by knighton in https://github.com/mosaicml/streaming/pull/22
* Add enwiki-20200101 streaming dataset by knighton in https://github.com/mosaicml/streaming/pull/23
* Add submodules to api reference doc by karan6181 in https://github.com/mosaicml/streaming/pull/24
* Initial Docs site content by bandish-shah in https://github.com/mosaicml/streaming/pull/11
* Add unittest for compression by karan6181 in https://github.com/mosaicml/streaming/pull/25
* Fix hang when compression is used but compressed files are not retained by knighton in https://github.com/mosaicml/streaming/pull/26
* Add long_description for packaging by bandish-shah in https://github.com/mosaicml/streaming/pull/29
* Update tutorial notebooks to have it run end-to-end by karan6181 in https://github.com/mosaicml/streaming/pull/30
* Adjustment for last partition bug by knighton in https://github.com/mosaicml/streaming/pull/27
* Fix preprocessing for English Wikipedia dataset by knighton in https://github.com/mosaicml/streaming/pull/28
* Fix enwiki dataset by dskhudia in https://github.com/mosaicml/streaming/pull/31
* Skip pre-commit check for enwiki convert skip to have code parity by karan6181 in https://github.com/mosaicml/streaming/pull/32
* Update doc and fixed reference links by karan6181 in https://github.com/mosaicml/streaming/pull/33
* Parallel tfrecord creation, validate sample counts vs MDS by knighton in https://github.com/mosaicml/streaming/pull/34
* Bump up the version to 0.0.1b by karan6181 in https://github.com/mosaicml/streaming/pull/35
* Add NLP synthetic dataset jupyter notebook tutorial by karan6181 in https://github.com/mosaicml/streaming/pull/36
* Add README and CONTRIBUTING guide by karan6181 in https://github.com/mosaicml/streaming/pull/38
* Typos + copy editing in README by dblalock in https://github.com/mosaicml/streaming/pull/40
* Re-factor docs tutorials to top-level examples by bandish-shah in https://github.com/mosaicml/streaming/pull/39
* Fixed typos and update documentation by karan6181 in https://github.com/mosaicml/streaming/pull/42
* Add CodeQL security scanner and Dependabot workflow by karan6181 in https://github.com/mosaicml/streaming/pull/43
* Bump gitpython from 3.1.28 to 3.1.29 by dependabot in https://github.com/mosaicml/streaming/pull/46
* Bump myst-parser from 0.16.1 to 0.18.1 by dependabot in https://github.com/mosaicml/streaming/pull/47
* Add bug report and feature request template by karan6181 in https://github.com/mosaicml/streaming/pull/48
* mlperf enwiki conversion code mild cleanup by knighton in https://github.com/mosaicml/streaming/pull/41
* Add Build publish to PyPI and create GitHub release workflow by karan6181 in https://github.com/mosaicml/streaming/pull/50
* Added writer unittest and update existing test by karan6181 in https://github.com/mosaicml/streaming/pull/52
* Bump version to 0.1.0 by karan6181 in https://github.com/mosaicml/streaming/pull/53
* Fixed dead image link in pypi home page by karan6181 in https://github.com/mosaicml/streaming/pull/54
* Add TorchVision VisionDataset inheritance. by knighton in https://github.com/mosaicml/streaming/pull/55
* bump version to 0.1.1b0 by karan6181 in https://github.com/mosaicml/streaming/pull/56
* Fixed rendering of pypi image by karan6181 in https://github.com/mosaicml/streaming/pull/59
* Bump version to 0.1.1 by karan6181 in https://github.com/mosaicml/streaming/pull/60

New Contributors
* knighton made their first contribution in https://github.com/mosaicml/streaming/pull/2
* bandish-shah made their first contribution in https://github.com/mosaicml/streaming/pull/3
* karan6181 made their first contribution in https://github.com/mosaicml/streaming/pull/5
* ejyuen made their first contribution in https://github.com/mosaicml/streaming/pull/15
* nqn made their first contribution in https://github.com/mosaicml/streaming/pull/19
* dskhudia made their first contribution in https://github.com/mosaicml/streaming/pull/31
* dblalock made their first contribution in https://github.com/mosaicml/streaming/pull/40
* dependabot made their first contribution in https://github.com/mosaicml/streaming/pull/46

**Full Changelog**: https://github.com/mosaicml/streaming/commits/v0.1.1
