Mosaicml-streaming

Latest version: v0.11.0

Page 3 of 5

0.7.0

Not secure

📈 Better Defaults for `StreamingDataset` (479)
- The default values for `StreamingDataset` have been updated to be more performant and are applicable for most use cases, detailed below:

| Parameter | Old Value | New Value | Benefit |
|-----------------------|------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------|
| `shuffle_algo` | `py1s` | `py1e` | Better shuffle and balanced downloading |
| `num_canonical_nodes` | `64 * physical nodes` | if `py1s` or `py2s`, `64 * physical_nodes`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algos |
| `shuffle_block_size` | `262,144` | `4,000,000 / num_canonical_nodes` | Consistently good shuffle for all `num_canonical_nodes` values |
| `predownload` | `max(batch_size, 256 * batch_size // num_canonical_nodes)` | `8 * batch_size` | Better balanced downloading |
| `partition_algo` | `orig` | `relaxed` | More flexible deterministic resumptions on nodes |
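
The table above can be read as a small set of formulas. The sketch below restates the "Old Value" and "New Value" columns in plain Python so the effect on a given cluster is easy to compute. The function names are hypothetical, and the use of floor division for `shuffle_block_size` is an assumption (the table only says `4,000,000 / num_canonical_nodes`); this is not the library's own code.

```python
def old_defaults(physical_nodes: int, batch_size: int) -> dict:
    """Defaults before 0.7.0, taken from the 'Old Value' column."""
    ncn = 64 * physical_nodes
    return {
        'shuffle_algo': 'py1s',
        'num_canonical_nodes': ncn,
        'shuffle_block_size': 262_144,
        'predownload': max(batch_size, 256 * batch_size // ncn),
        'partition_algo': 'orig',
    }


def new_defaults(physical_nodes: int, batch_size: int,
                 shuffle_algo: str = 'py1e') -> dict:
    """Defaults from the 'New Value' column."""
    # Only py1s and py2s keep the 64x multiplier.
    if shuffle_algo in ('py1s', 'py2s'):
        ncn = 64 * physical_nodes
    else:
        ncn = physical_nodes
    return {
        'shuffle_algo': shuffle_algo,
        'num_canonical_nodes': ncn,
        # Assumption: integer (floor) division.
        'shuffle_block_size': 4_000_000 // ncn,
        'predownload': 8 * batch_size,
        'partition_algo': 'relaxed',
    }


if __name__ == '__main__':
    print(old_defaults(physical_nodes=2, batch_size=32))
    print(new_defaults(physical_nodes=2, batch_size=32))
```

Any of these values can still be passed explicitly to `StreamingDataset` to keep the previous behavior.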

:gem: New Features

🤖 Streaming Simulator: Easily simulate the performance of training configurations. (385)
- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.
- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations.
- Easily de-risk runs and find performant parameter settings.
- Check out the [docs](https://docs.mosaicml.com/projects/streaming/en/stable/fundamentals/simulator.html) for more information!

🔢 More flexible deterministic training and resumption (476)
- Deterministic training and resumption are now possible on a wider range of node counts!
- Previously, the `num_canonical_nodes` parameter had to divide or be a multiple of the number of physical nodes for determinism.
- Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.
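
As a rough illustration, the two rules described above can be written as simple divisibility checks. This is only a paraphrase of the release note; the function names are hypothetical, and the library's actual validation may involve further conditions.

```python
def deterministic_before(num_canonical_nodes: int, physical_nodes: int) -> bool:
    """Old rule: one value had to divide the other."""
    return (num_canonical_nodes % physical_nodes == 0
            or physical_nodes % num_canonical_nodes == 0)


def deterministic_now(global_batch_size: int, physical_nodes: int) -> bool:
    """New rule as stated: the node count evenly divides the global batch size."""
    return global_batch_size % physical_nodes == 0


if __name__ == '__main__':
    # 8 canonical nodes on 6 physical nodes, global batch size 48.
    print(deterministic_before(8, 6))  # neither divides the other
    print(deterministic_now(48, 6))    # 6 divides 48
```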

🐛 Bug Fixes

- Check for invalid hash algorithm names (486)

What's Changed
* Bump fastapi from 0.103.2 to 0.104.0 by dependabot in https://github.com/mosaicml/streaming/pull/480
* Bump gitpython from 3.1.37 to 3.1.40 by dependabot in https://github.com/mosaicml/streaming/pull/481
* Bump sphinx-tabs from 3.4.1 to 3.4.4 by dependabot in https://github.com/mosaicml/streaming/pull/482
* do not remove local directory when out is local by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/477
* Update __init__.py by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/484
* Check for invalid hash algorithm name by karan6181 in https://github.com/mosaicml/streaming/pull/486
* Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by snarayan21 in https://github.com/mosaicml/streaming/pull/476
* Better default values for StreamingDataset args by snarayan21 in https://github.com/mosaicml/streaming/pull/479
* Update release yaml to not write anything to GitHub by karan6181 in https://github.com/mosaicml/streaming/pull/487
* Bump pypandoc from 1.11 to 1.12 by dependabot in https://github.com/mosaicml/streaming/pull/490
* Bump pytest from 7.4.2 to 7.4.3 by dependabot in https://github.com/mosaicml/streaming/pull/491
* Bumping version for streaming v0.7.0 by snarayan21 in https://github.com/mosaicml/streaming/pull/495

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.1...v0.7.0

0.6.1

Not secure

:gem: New Features

:railway_car: Merge meta-data information from sub-directories dataset to form one unified dataset. (449)
- Added the `merge_index()` utility method to merge the index files of an MDS dataset's subdirectories. The subdirectories can be local or on any supported cloud provider URL path.
- Check out the [dataset conversion](https://docs.mosaicml.com/projects/streaming/en/stable/examples/multiprocess_dataset_conversion.html) and [Spark Dataframe to MDS](https://docs.mosaicml.com/projects/streaming/en/stable/examples/spark_dataframe_to_MDS.html) Jupyter notebooks for examples in action.
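
To make the idea of merging index files concrete, here is a local-only sketch. It assumes each subdirectory holds an `index.json` with a `shards` list whose entries carry a `raw_data` (and optionally `zip_data`) object with a `basename` field, and that the merged index refers to shards by a path relative to the parent directory. Both the layout and the helper name `merge_index_sketch` are assumptions for illustration; this is not the `merge_index()` implementation.

```python
import json
import os


def merge_index_sketch(root: str) -> str:
    """Combine <root>/<subdir>/index.json files into <root>/index.json."""
    shards = []
    for sub in sorted(os.listdir(root)):
        index_path = os.path.join(root, sub, 'index.json')
        if not os.path.isfile(index_path):
            continue  # Skip anything that is not a dataset subdirectory.
        with open(index_path) as f:
            index = json.load(f)
        for shard in index['shards']:
            # Prefix shard file names so they resolve from the parent directory.
            for key in ('raw_data', 'zip_data'):
                if shard.get(key):
                    shard[key]['basename'] = f"{sub}/{shard[key]['basename']}"
            shards.append(shard)
    merged_path = os.path.join(root, 'index.json')
    with open(merged_path, 'w') as f:
        json.dump({'version': 2, 'shards': shards}, f)
    return merged_path
```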

:repeat: Retry uploading a file to a cloud provider path. (448)
- Added upload retry logic with backoff and jitter during dataset conversion as part of parameter `retry` in [Writer](https://github.com/mosaicml/streaming/blob/v0.6.1/streaming/base/format/base/writer.py#L65).
```python
from streaming import MDSWriter

with MDSWriter(
    ...,
    retry=3) as out:
    for sample in dataset:
        out.write(sample)
```
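
For readers unfamiliar with the pattern, retrying with backoff and jitter generally looks like the sketch below. It is a generic illustration using exponential backoff with full jitter, not the `Writer` implementation; `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def retry_with_backoff(fn, retries=3, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn(); on failure, wait and retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # Out of retries: surface the last error.
            # Exponential backoff, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter: sleep a random fraction of the delay so that
            # concurrent uploaders do not retry in lockstep.
            sleep(delay * rng())
            attempt += 1
```

The `sleep` and `rng` arguments exist only so the behavior can be exercised without real waiting.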

🐛 Bug Fixes

- Validate [Writer](https://github.com/mosaicml/streaming/blob/v0.6.1/streaming/base/format/base/writer.py#L32) arguments and raise a `ValueError` if any argument is invalid. (434)
- Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (448)

🔧 Improvements

- Better-balanced inter-node downloading for the `py1e` shuffling algorithm by varying shard sample ranges, helping to reduce throughput drops at scale. (442)

What's Changed
* Validate writer arguments by karan6181 in https://github.com/mosaicml/streaming/pull/434
* Bump pytest from 7.4.1 to 7.4.2 by dependabot in https://github.com/mosaicml/streaming/pull/428
* Bump gitpython from 3.1.34 to 3.1.36 by dependabot in https://github.com/mosaicml/streaming/pull/435
* Fix stylistic issues (mostly 100col, docstring conventions) by knighton in https://github.com/mosaicml/streaming/pull/439
* Bump pytest-codeblocks from 0.16.1 to 0.17.0 by dependabot in https://github.com/mosaicml/streaming/pull/436
* py1e randomized by snarayan21 in https://github.com/mosaicml/streaming/pull/442
* Bump gitpython from 3.1.36 to 3.1.37 by dependabot in https://github.com/mosaicml/streaming/pull/446
* Fix BatchFeature of Transformers not handled by StreamingDataloader by Hubert-Bonisseur in https://github.com/mosaicml/streaming/pull/450
* Add a retry logic with backoff and jitter by karan6181 in https://github.com/mosaicml/streaming/pull/448
* Fix broken bibtext by Skylion007 in https://github.com/mosaicml/streaming/pull/452
* Update integration test to include sample order comparison by karan6181 in https://github.com/mosaicml/streaming/pull/456
* Bump pydantic from 2.3.0 to 2.4.2 by dependabot in https://github.com/mosaicml/streaming/pull/455
* Update MCLI credential page for Databricks by karan6181 in https://github.com/mosaicml/streaming/pull/466
* Add merge index file utility by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/449
* Add py1e warning when Shuffle block size is smaller than shard size by snarayan21 in https://github.com/mosaicml/streaming/pull/463
* Fix doc strings by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/469
* Bump fastapi from 0.103.1 to 0.103.2 by dependabot in https://github.com/mosaicml/streaming/pull/454
* Maintain order for merge_index_from_list by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/472
* Fixed codeql out of disk space issue by karan6181 in https://github.com/mosaicml/streaming/pull/473
* Bump version to 0.6.1 by karan6181 in https://github.com/mosaicml/streaming/pull/474

New Contributors
* Hubert-Bonisseur made their first contribution in https://github.com/mosaicml/streaming/pull/450

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.6.0...v0.6.1

0.6.0

Not secure

New Features

**🆕** Databricks File System and Databricks Unity Catalog (362)

Support for reading and writing data from and to the Databricks File System (DBFS) and Unity Catalog (UC) Volumes. This means that you can now use DBFS and UC Volumes as a source or sink for your streaming data pipelines or model training. Below is the path structure:

**Databricks File System (DBFS)**

The DBFS path structure is a hierarchical namespace organized into directories and files. A DBFS path must start with `dbfs:/`.

**UC Volumes**

The path structure for UC Volumes is similar to the path structure for DBFS, but with a few key differences.

The root of the UC Volumes namespace is `dbfs:/Volumes/<catalog>/<schema>/<volume>`, where:

- `<catalog>` is the name of the catalog where the volume is created.
- `<schema>` is the name of the schema where the volume is created.
- `<volume>` is the name of the volume.

Hence, use a `dbfs:/Volumes` prefix to specify a UC Volumes path.
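
The path rules in this section can be summarized with a small classifier. The sketch below relies only on the `dbfs:/` prefix and the `dbfs:/Volumes/<catalog>/<schema>/<volume>` root given above; the function name and return labels are hypothetical and not part of the library.

```python
def classify_databricks_path(path: str) -> str:
    """Return 'uc_volume' or 'dbfs' based on the path prefix."""
    volumes_prefix = 'dbfs:/Volumes/'
    if path.startswith(volumes_prefix):
        parts = path[len(volumes_prefix):].split('/')
        # A UC Volumes path needs catalog, schema, and volume components.
        if len(parts) < 3 or not all(parts[:3]):
            raise ValueError(f'Incomplete UC Volumes path: {path}')
        return 'uc_volume'
    if path.startswith('dbfs:/'):
        return 'dbfs'
    raise ValueError(f'Not a Databricks path: {path}')


if __name__ == '__main__':
    print(classify_databricks_path('dbfs:/Volumes/main/default/data/train'))
    print(classify_databricks_path('dbfs:/datasets/train'))
```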

💽 Spark Dataframe to MDS converter (363)

The new `DataFrameToMDS` API converts Spark DataFrames into MDS datasets, so Spark can be used to handle diverse datasets in various formats. Output can be written to both local and cloud storage, and index files are optionally merged. Users can also add data preprocessing steps by defining custom iterator functions and arguments. All of this runs as a single Spark job. An example [notebook](https://github.com/mosaicml/streaming/blob/main/examples/spark_dataframe_to_MDS.ipynb) is provided to help users get started.

🔀 Randomize and offset shuffle blocks algorithm (373)

The new `py1br` shuffle algorithm helps mitigate download spikes that occur when using the `py1b` algorithm. With `py1b`, shuffle blocks are all the same size, so when progressing through training, nodes will have to download many shards at the same time. In contrast, with `py1br`, shuffle blocks are offset from each other and are variably sized. This results in more balanced downloads over time. The `py1br` algorithm is a replacement for the `py1b` algorithm, which will be deprecated soon.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1br',
    ...
)
```
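
To visualize the difference, the sketch below contrasts fixed-size block boundaries with variably sized ones. The size range of 0.75x to 1.25x of the nominal block size is an assumption made for illustration, as are the function names; this is not the `py1br` implementation. Using a different seed per node stands in for blocks being offset from each other.

```python
import random


def fixed_blocks(num_samples: int, block_size: int) -> list:
    """Equal-sized blocks: every node hits a block boundary at the same time."""
    return [(start, min(start + block_size, num_samples))
            for start in range(0, num_samples, block_size)]


def randomized_blocks(num_samples: int, block_size: int, seed: int) -> list:
    """Variably sized blocks; different seeds give different boundaries."""
    rng = random.Random(seed)
    # Assumed size range for illustration only.
    low = max(1, int(0.75 * block_size))
    high = max(low, int(1.25 * block_size))
    blocks, start = [], 0
    while start < num_samples:
        size = rng.randint(low, high)  # Inclusive on both ends.
        end = min(start + size, num_samples)
        blocks.append((start, end))
        start = end
    return blocks


if __name__ == '__main__':
    print(fixed_blocks(1000, 100)[:3])
    for node in range(2):
        print(node, randomized_blocks(1000, 100, seed=node)[:3])
```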

🔀 Expanded range shuffle algorithm (394)

The new `py1e` shuffle algorithm helps reduce the minimum cache limit needed for training, and results in much smoother downloads than both `py1b` and `py1br`. However, its shuffle quality is slightly lower. Rather than shuffling all samples in blocks of size `shuffle_block_size`, it instead spreads the samples of each shard over a range of maximum size `shuffle_block_size`, retaining most of the shuffle quality from `py1b` and `py1br` while reducing download spikes across the duration of training.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    shuffle_algo='py1e',
    ...
)
```

🔥 Per-Stream Batching (407)

Users are now able to ensure that each batch has samples from only a single stream. You can now set the new parameter `batching_method` to `per_stream` to access this functionality. Per-stream batching will still take into account upsampling and downsampling of streams, set by `proportion`, `repeat`, or `choose`. To make batches contain only samples from a group of streams, merge the streams’ `index.json` files to create a single one for each group.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='per_stream',
    ...
)
```
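
Conceptually, per-stream batching means a batch never mixes streams. The toy generator below shows that property on an in-memory sequence of `(stream_id, sample)` pairs. It is only a sketch of the idea with a hypothetical function name; the library arranges batches during partitioning rather than with a generator like this.

```python
from collections import defaultdict


def per_stream_batches(samples, batch_size):
    """Yield (stream_id, batch) pairs where each batch holds one stream only.

    Incomplete trailing batches are dropped in this sketch.
    """
    pending = defaultdict(list)
    for stream_id, sample in samples:
        bucket = pending[stream_id]
        bucket.append(sample)
        if len(bucket) == batch_size:
            yield stream_id, list(bucket)
            bucket.clear()


if __name__ == '__main__':
    data = [('a', 0), ('b', 1), ('a', 2), ('b', 3), ('a', 4)]
    for stream_id, batch in per_stream_batches(data, batch_size=2):
        print(stream_id, batch)
```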

🔥 Stratified Batching (408)

Users are now able to ensure that each batch has a consistent number of samples from every stream. Previously, stream proportions were satisfied in the aggregate but not at the batch level. You can now set the new parameter `batching_method` to `stratified` to access this functionality. Stratified batching will still take into account upsampling and downsampling of streams, set by `proportion`, `repeat`, or `choose`.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    batching_method='stratified',
    ...
)
```
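
One way to picture stratified batching is as a fixed per-batch quota for each stream. The sketch below derives such quotas from stream proportions using largest-remainder rounding. The rounding scheme and the function name are assumptions for illustration; the library may distribute fractional samples differently.

```python
def stratified_counts(proportions: dict, batch_size: int) -> dict:
    """Split batch_size across streams in proportion to their weights."""
    total = sum(proportions.values())
    exact = {k: batch_size * v / total for k, v in proportions.items()}
    counts = {k: int(x) for k, x in exact.items()}
    # Hand out the slots lost to rounding down, largest remainder first.
    leftover = batch_size - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


if __name__ == '__main__':
    print(stratified_counts({'code': 0.75, 'text': 0.25}, batch_size=8))
```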

💪 Download-Efficient Sparse Sampling (391)

Previous versions of StreamingDataset implemented downsampling/upsampling by giving each sample an equal probability of being selected (plus or minus one when sampling is fractional), without regard to which shard a sample is in. This means that no matter how small your desired downsampling is, StreamingDataset will still use each shard at as equal a rate as possible. This is problematic for downloading performance.

In this version of Streaming, we have added a new optional StreamingDataset argument `sampling_granularity` which can be used to configure how sampling is done. It is an integer, defaulting to 1, that determines how many samples are to be drawn at a time from a single random shard until we have enough samples.

Note that the default setting of 1 is equivalent to the old non-shard-aware behavior. Setting it high, e.g. the number of samples in a full shard or more, means it will draw all the samples in a randomly chosen (without replacement) shard until it has enough samples. This is much more download-efficient, but results in the samples of each shard always being seen close together in training, which may have implications for convergence depending on your workload. Setting sampling granularity to half a shard means, roughly speaking, you'll see half the samples of a shard at a time during training.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    sampling_granularity=1,
    ...
)
```
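
The effect of the granularity setting on how many shards get touched can be seen in a small simulation. The sketch below handles downsampling only and draws up to `granularity` samples at a time from a randomly chosen shard, as the text describes. It is an illustration with hypothetical names, not the library's sampling code.

```python
import random


def sample_shard_aware(shard_sizes, num_to_draw, granularity, seed=0):
    """Return (shard, index_in_shard) pairs, drawn in shard-sized chunks."""
    rng = random.Random(seed)
    # Pre-shuffle each shard's sample ids; repeat visits continue from there.
    remaining = {shard: rng.sample(range(size), size)
                 for shard, size in enumerate(shard_sizes)}
    drawn = []
    while len(drawn) < num_to_draw:
        candidates = [s for s, ids in remaining.items() if ids]
        if not candidates:
            raise ValueError('Requested more samples than the dataset holds')
        shard = rng.choice(candidates)
        take = min(granularity, num_to_draw - len(drawn), len(remaining[shard]))
        for _ in range(take):
            drawn.append((shard, remaining[shard].pop()))
    return drawn


if __name__ == '__main__':
    sizes = [100] * 10
    for granularity in (1, 100):
        picked = sample_shard_aware(sizes, num_to_draw=200, granularity=granularity)
        print(granularity, len({shard for shard, _ in picked}), 'shards touched')
```

With a granularity of a full shard, 200 samples come from exactly two shards, whereas a granularity of 1 spreads the same 200 samples across many shards.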

📑 Reusable local directory (406)

Users can now instantiate more than one StreamingDataset with the same `local` directory and `remote=None`. This is useful when high-speed storage is mounted on a node and multiple users want to read the dataset directly from the mounted storage on the same node without copying the data to local disk.

```python
from streaming import StreamingDataset

local = '<local disk directory or a mount point directory>'
dataset_0 = StreamingDataset(local=local, remote=None)
dataset_1 = StreamingDataset(local=local, remote=None)
```

🐛 Bug Fixes

- Terminate the worker threads when process terminates to avoid deadlock. (425)
- Raise an exception if `cache_limit` is lower than the size of a single shard file to avoid deadlock. (420)
- Fixed an issue with a `predownload` value of zero; users can now pass `predownload=0` to `StreamingDataset`. (383)

🔧 Improvements

- Add Google Application Default Credentials (376).
  - The order of authentication has changed, and new App Engine and Compute Engine authentication channels are used when available. The order of authentication is as follows:
    1. HMAC
    2. Google service account
    3. App Engine
    4. Compute Engine
    5. Raise an error
- Check if `index.json` exists locally before downloading to avoid duplicate downloads (372).
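
The authentication order listed above amounts to a fallback chain. The sketch below models it as a pure function so the precedence is easy to see. The environment variable names `GCS_KEY`, `GCS_SECRET`, and `GOOGLE_APPLICATION_CREDENTIALS`, the boolean flags, and the function name are assumptions for illustration; the library's detection logic is not reproduced here.

```python
def pick_gcs_auth(env: dict, app_engine: bool = False,
                  compute_engine: bool = False) -> str:
    """Return the first available authentication method, in priority order."""
    # 1. HMAC key pair (assumed variable names).
    if 'GCS_KEY' in env and 'GCS_SECRET' in env:
        return 'hmac'
    # 2. Service account key file (assumed variable name).
    if 'GOOGLE_APPLICATION_CREDENTIALS' in env:
        return 'service_account'
    # 3 and 4. Credentials supplied by the hosting environment.
    if app_engine:
        return 'app_engine'
    if compute_engine:
        return 'compute_engine'
    # 5. Nothing usable was found.
    raise ValueError('No Google Cloud Storage credentials found')


if __name__ == '__main__':
    print(pick_gcs_auth({'GCS_KEY': 'k', 'GCS_SECRET': 's'}))
    print(pick_gcs_auth({}, compute_engine=True))
```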

What's Changed
* Bump fastapi from 0.100.0 to 0.101.0 by dependabot in https://github.com/mosaicml/streaming/pull/367
* Bump uvicorn from 0.23.1 to 0.23.2 by dependabot in https://github.com/mosaicml/streaming/pull/368
* Check if index.json exists locally before downloading by karan6181 in https://github.com/mosaicml/streaming/pull/372
* Bench/plot sample access times across data and across formats by knighton in https://github.com/mosaicml/streaming/pull/365
* Apply ruff pre-commit hook by Skylion007 in https://github.com/mosaicml/streaming/pull/364
* Add a regression test for shuffling sample order by b-chu in https://github.com/mosaicml/streaming/pull/359
* Epoch size default behavior by snarayan21 in https://github.com/mosaicml/streaming/pull/374
* Stream unspecified docstring change by snarayan21 in https://github.com/mosaicml/streaming/pull/377
* fixed comments by snarayan21 in https://github.com/mosaicml/streaming/pull/378
* Add google Application Default Credentials to download by fgerzer in https://github.com/mosaicml/streaming/pull/376
* Fixed fake AWS credentials by karan6181 in https://github.com/mosaicml/streaming/pull/382
* Fixed predownload value to zero issue by karan6181 in https://github.com/mosaicml/streaming/pull/383
* Bump fastapi from 0.101.0 to 0.101.1 by dependabot in https://github.com/mosaicml/streaming/pull/387
* Bump pydantic from 2.1.1 to 2.2.1 by dependabot in https://github.com/mosaicml/streaming/pull/389
* Add a regression test for mixing of different dataset streams by b-chu in https://github.com/mosaicml/streaming/pull/375
* Add support for Databricks File System backend by maddiedawson in https://github.com/mosaicml/streaming/pull/362
* Add support for downloading from Unity Catalog volumes by maddiedawson in https://github.com/mosaicml/streaming/pull/361
* Fix MosaicML platform credential setup links by karan6181 in https://github.com/mosaicml/streaming/pull/396
* Plug hole in MDS type system: add arbitrary-precision decimal by knighton in https://github.com/mosaicml/streaming/pull/390
* Bump fastapi from 0.101.1 to 0.103.0 by dependabot in https://github.com/mosaicml/streaming/pull/402
* Bump pydantic from 2.2.1 to 2.3.0 by dependabot in https://github.com/mosaicml/streaming/pull/403
* Bump databricks-sdk from 0.3.1 to 0.6.0 by dependabot in https://github.com/mosaicml/streaming/pull/404
* Py1br algorithm implementation by snarayan21 in https://github.com/mosaicml/streaming/pull/373
* Benchmarking partitioning by knighton in https://github.com/mosaicml/streaming/pull/379
* Expanded range shuffle by snarayan21 in https://github.com/mosaicml/streaming/pull/394
* Reusable local directory when remote is None by karan6181 in https://github.com/mosaicml/streaming/pull/406
* Bump gitpython from 3.1.32 to 3.1.34 by dependabot in https://github.com/mosaicml/streaming/pull/410
* Bump pytest from 7.4.0 to 7.4.1 by dependabot in https://github.com/mosaicml/streaming/pull/411
* Bump fastapi from 0.103.0 to 0.103.1 by dependabot in https://github.com/mosaicml/streaming/pull/413
* Bump databricks-sdk from 0.6.0 to 0.8.0 by dependabot in https://github.com/mosaicml/streaming/pull/414
* Per Stream Batching by snarayan21 in https://github.com/mosaicml/streaming/pull/407
* Update Databricks download and upload functionality using new Databricks python sdk by karan6181 in https://github.com/mosaicml/streaming/pull/418
* Add delta to mds converter by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/363
* Stratified Batching by snarayan21 in https://github.com/mosaicml/streaming/pull/408
* Raise an exception if cache limit is too low by karan6181 in https://github.com/mosaicml/streaming/pull/420
* Remove torchtext by mvpatel2000 in https://github.com/mosaicml/streaming/pull/423
* Fix nb by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/422
* Fixed python version by karan6181 in https://github.com/mosaicml/streaming/pull/424
* Improve shard efficiency of sampling for fractional stream repeats. by knighton in https://github.com/mosaicml/streaming/pull/391
* Optimize dataframe writer (small change) by Skylion007 in https://github.com/mosaicml/streaming/pull/426
* Fix deadlock by acutkosky in https://github.com/mosaicml/streaming/pull/425
* changed choose to epoch_size in stream proportion docstring by snarayan21 in https://github.com/mosaicml/streaming/pull/432
* Bump version to 0.6.0 by XiaohanZhangCMU in https://github.com/mosaicml/streaming/pull/433

New Contributors
* Skylion007 made their first contribution in https://github.com/mosaicml/streaming/pull/364
* fgerzer made their first contribution in https://github.com/mosaicml/streaming/pull/376
* maddiedawson made their first contribution in https://github.com/mosaicml/streaming/pull/362
* XiaohanZhangCMU made their first contribution in https://github.com/mosaicml/streaming/pull/363
* mvpatel2000 made their first contribution in https://github.com/mosaicml/streaming/pull/423
* acutkosky made their first contribution in https://github.com/mosaicml/streaming/pull/425

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.5.2...v0.6.0

0.5.2

Not secure

New features
- Allow authentication with GCS for service accounts 315
- human-readable suffixes for size_limit and epoch_size 333
- static sampling 348

Documentation changes
- Update contribution guide and improved unittest logic 343
- static sampling 348

Testing
- Add a regression test for StreamingDataset instantiation and iteration 318
- Fixed accidental shard delete test 341
- Add a regression test for StreamingDataset using cloud providers 319
- Add iteration time test as part of regression testing 358

Bug fix
- Fix init local dir zip-only shard handling 330
- added default behavior if no streams and epoch_size specified 348

What's Changed
* Bump myst-parser from 1.0.0 to 2.0.0 by dependabot in https://github.com/mosaicml/streaming/pull/309
* Added files to support azure datalake storage by shivshandilya in https://github.com/mosaicml/streaming/pull/311
* Add secrets check as part of pre-commit by karan6181 in https://github.com/mosaicml/streaming/pull/312
* Bump pytest from 7.3.2 to 7.4.0 by dependabot in https://github.com/mosaicml/streaming/pull/313
* Bump fastapi from 0.97.0 to 0.98.0 by dependabot in https://github.com/mosaicml/streaming/pull/314
* Add GCS authentication for service accounts by b-chu in https://github.com/mosaicml/streaming/pull/315
* Bump fastapi from 0.98.0 to 0.100.0 by dependabot in https://github.com/mosaicml/streaming/pull/322
* Bump uvicorn from 0.22.0 to 0.23.0 by dependabot in https://github.com/mosaicml/streaming/pull/327
* Bump gitpython from 3.1.31 to 3.1.32 by dependabot in https://github.com/mosaicml/streaming/pull/329
* Bump pydantic from 1.10.9 to 1.10.11 by dependabot in https://github.com/mosaicml/streaming/pull/328
* Sync tmp directory by b-chu in https://github.com/mosaicml/streaming/pull/316
* Add a regression test for StreamingDataset instantiation and iteration by b-chu in https://github.com/mosaicml/streaming/pull/318
* human-readable suffixes for size_limit and epoch_size by snarayan21 in https://github.com/mosaicml/streaming/pull/333
* Updated pre commit packages by snarayan21 in https://github.com/mosaicml/streaming/pull/340
* Fix init local dir zip-only shard handling by knighton in https://github.com/mosaicml/streaming/pull/330
* Fixed accidental shard delete test by karan6181 in https://github.com/mosaicml/streaming/pull/341
* Bump uvicorn from 0.23.0 to 0.23.1 by dependabot in https://github.com/mosaicml/streaming/pull/338
* Download the index.json file as tmp extension until it finishes by karan6181 in https://github.com/mosaicml/streaming/pull/346
* Update contribution guide and improved unittest logic by karan6181 in https://github.com/mosaicml/streaming/pull/343
* Bump fastapi from 0.100.0 to 0.100.1 by dependabot in https://github.com/mosaicml/streaming/pull/351
* Bump uvicorn from 0.23.1 to 0.23.2 by dependabot in https://github.com/mosaicml/streaming/pull/352
* Bump furo from 2023.5.20 to 2023.7.26 by dependabot in https://github.com/mosaicml/streaming/pull/354
* Bump pydantic from 1.10.11 to 2.1.1 by dependabot in https://github.com/mosaicml/streaming/pull/353
* added default behavior if no streams and epoch_size specified by snarayan21 in https://github.com/mosaicml/streaming/pull/348
* Add a regression test for StreamingDataset using cloud providers by b-chu in https://github.com/mosaicml/streaming/pull/319
* Fixed sampling by snarayan21 in https://github.com/mosaicml/streaming/pull/356
* mds ndarray int conversion by snarayan21 in https://github.com/mosaicml/streaming/pull/357
* Add iteration time test as part of regression testing by karan6181 in https://github.com/mosaicml/streaming/pull/358
* Bump pydantic from 1.10.11 to 2.1.1 by dependabot in https://github.com/mosaicml/streaming/pull/366
* Fixed CI test to perform proper directory cleanup by karan6181 in https://github.com/mosaicml/streaming/pull/369
* version bump to 0.5.2 by snarayan21 in https://github.com/mosaicml/streaming/pull/370

New Contributors
* shivshandilya made their first contribution in https://github.com/mosaicml/streaming/pull/311
* b-chu made their first contribution in https://github.com/mosaicml/streaming/pull/315
* snarayan21 made their first contribution in https://github.com/mosaicml/streaming/pull/333

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.5.1...v0.5.2

0.5.1

Not secure

What's Changed
* Improved shard eviction test execution time by karan6181 in https://github.com/mosaicml/streaming/pull/291
* Bump fastapi from 0.96.0 to 0.97.0 by dependabot in https://github.com/mosaicml/streaming/pull/294
* Bump pytest from 7.3.1 to 7.3.2 by dependabot in https://github.com/mosaicml/streaming/pull/295
* Bump pydantic from 1.10.8 to 1.10.9 by dependabot in https://github.com/mosaicml/streaming/pull/296
* Terminate the main process if thread died unexpectedly by karan6181 in https://github.com/mosaicml/streaming/pull/297
* Improved existing exception and exception messages by karan6181 in https://github.com/mosaicml/streaming/pull/298
* Round drop_first to be divisible by num_physical_nodes. by knighton in https://github.com/mosaicml/streaming/pull/301
* Added a utility method to clean stale shared memory by karan6181 in https://github.com/mosaicml/streaming/pull/299
* Propagate exception between threads and processes and improved error message by karan6181 in https://github.com/mosaicml/streaming/pull/304
* Fix LocalDataset (property size for fancy __getitem__). by knighton in https://github.com/mosaicml/streaming/pull/305
* Natively support encoding and decoding ndarrays in MDS by knighton in https://github.com/mosaicml/streaming/pull/82
* Bump version to 0.5.1 by karan6181 in https://github.com/mosaicml/streaming/pull/308

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.5.0...v0.5.1

0.5.0

Not secure

New Features

🆕 Cold Shard Eviction. ( 219 )

Dynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument `cache_limit`. See the [shuffling](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md) guide for more details.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    cache_limit='100gb',
    ...
)
```
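
The eviction policy described here is least-recently-used under a byte budget. The sketch below shows that policy on shard sizes alone, with no files involved. The `ShardCache` class is hypothetical and only illustrates the bookkeeping, not how StreamingDataset tracks shards on disk.

```python
from collections import OrderedDict


class ShardCache:
    """Track shard usage and evict least recently used shards over a limit."""

    def __init__(self, cache_limit: int):
        self.cache_limit = cache_limit
        self.usage = 0
        self.shards = OrderedDict()  # shard_id -> size, oldest access first

    def access(self, shard_id: int, size: int) -> list:
        """Record an access; return the ids of any shards evicted."""
        if size > self.cache_limit:
            raise ValueError('cache_limit is smaller than a single shard')
        if shard_id in self.shards:
            self.shards.move_to_end(shard_id)  # Mark as most recently used.
            return []
        evicted = []
        while self.usage + size > self.cache_limit:
            old_id, old_size = self.shards.popitem(last=False)
            self.usage -= old_size
            evicted.append(old_id)
        self.shards[shard_id] = size
        self.usage += size
        return evicted


if __name__ == '__main__':
    cache = ShardCache(cache_limit=100)
    for shard_id in (0, 1, 0, 2):
        print(shard_id, cache.access(shard_id, size=40))
```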

🤙 Fetch sample using NumPy style indexing. ( 120 )

Users can now randomly access samples using NumPy-style indexing with `StreamingDataset`. For example,

```python
import numpy as np
from streaming import StreamingDataset

# `local` and `remote` are the dataset's local cache directory and remote path.
dataset = StreamingDataset(local=local, remote=remote)

dataset[0]                 # Fetch sample 0
dataset[-1]                # Fetch last sample
dataset[[10, 20]]          # Fetch sample 10 and 20
dataset[slice(1, 10, 2)]   # Fetch sample 1, 3, 5, 7, and 9
dataset[5:0:-1]            # Fetch sample 5, 4, 3, 2, 1
dataset[np.array([4, 7])]  # Fetch sample 4 and 7
```

🦾 Any S3 compatible object store. ( 265 )

Support for any S3-compatible object store, meaning an object store that uses the S3 API to communicate with any connected device or system. Some S3-compatible object stores are [Cloudflare R2](https://www.cloudflare.com/products/r2/), [Coreweave](https://docs.coreweave.com/storage/object-storage), and [Backblaze B2](https://www.backblaze.com/b2/cloud-storage.html). Users need to set the environment variable `S3_ENDPOINT_URL` to match the object store they are using. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#any-s3-compatible-object-store).
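
A minimal sketch of the setup, assuming the variable is set from Python before the dataset is created (it can equally be exported in the shell). The endpoint value, bucket, and paths below are placeholders, not real locations.

```python
import os

# Placeholder endpoint; substitute the URL your object store provides.
os.environ['S3_ENDPOINT_URL'] = 'https://<account-id>.r2.cloudflarestorage.com'

# With the endpoint set, an `s3://` remote is used as usual, for example:
#   from streaming import StreamingDataset
#   dataset = StreamingDataset(local='/tmp/cache', remote='s3://my-bucket/my-dataset')
```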

🦾 Azure cloud blob storage. ( 256 )

Support for Azure cloud blob storage. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#azure-blob-storage).

Bug Fixes

- Wait for download and ready thread to finish before terminating job. ( 286 )
- Fixed length calculation to use resampled epoch size, not underlying num samples. ( 278 )
- Fixed mypy errors by adding a py.typed marker file. ( 245 )
- Create a new boto3 session per thread to avoid sharing resources. ( 251 )

🔧 **API changes**

- The argument `samples_per_epoch` has been renamed to `epoch_size` in `StreamingDataset` to better distinguish the actual number of underlying samples as serialized and the number of observed samples when iterating (which may be different due to weighting sub-datasets).
- The argument `samples` has been renamed to `choose` in `Stream` to better distinguish the underlying sample vs resampled data.
- The argument `keep_raw` has been removed in `StreamingDataset` in the process of finalizing the design for shard eviction (see the newly-added `cache_limit` parameter).
- The default value of `predownload` in `StreamingDataset` was updated; it is now derived using batch size and number of canonical nodes instead of previous constant value of `100_000`. This is to prevent predownloaded shards from getting evicted before ever being used.
- The default value of `num_canonical_nodes` in `StreamingDataset` was updated to 64 times the number of nodes of the initial run instead of number of nodes of the initial run to increase data source diversity and improve convergence.
- The default value of `shuffle_algo` in `StreamingDataset` was changed from `py1b` to `py1s` as it requires fewer shards to be downloaded during iteration. More details about different shuffling algorithms can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md).
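
When upgrading existing code, the renames and the removal listed above can be applied mechanically to keyword arguments. The helpers below are a hypothetical migration aid written for this note, not something shipped with the library; they cover only the renamed and removed arguments, not the changed defaults.

```python
import warnings


def migrate_dataset_kwargs(kwargs: dict) -> dict:
    """Translate pre-0.5.0 StreamingDataset keyword arguments."""
    new = dict(kwargs)
    if 'samples_per_epoch' in new:
        new['epoch_size'] = new.pop('samples_per_epoch')
    if 'keep_raw' in new:
        # Removed in 0.5.0; see the `cache_limit` parameter instead.
        new.pop('keep_raw')
        warnings.warn('keep_raw was removed; consider cache_limit')
    return new


def migrate_stream_kwargs(kwargs: dict) -> dict:
    """Translate pre-0.5.0 Stream keyword arguments."""
    new = dict(kwargs)
    if 'samples' in new:
        new['choose'] = new.pop('samples')
    return new


if __name__ == '__main__':
    print(migrate_dataset_kwargs({'samples_per_epoch': 1000, 'shuffle': True}))
    print(migrate_stream_kwargs({'samples': 500}))
```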

What's Changed
* Redesign shard index by knighton in https://github.com/mosaicml/streaming/pull/236
* Propagate an exception raise by a thread to its caller by karan6181 in https://github.com/mosaicml/streaming/pull/241
* Raise descriptive error message when index.json is corrupted by karan6181 in https://github.com/mosaicml/streaming/pull/242
* Rename "samples" to "choose" (distinguish underlying vs resampled) by knighton in https://github.com/mosaicml/streaming/pull/243
* Added py.typed to indicate that the repository has typing annotations by karan6181 in https://github.com/mosaicml/streaming/pull/245
* Add "Array" base class, which provides numpy-style indexing. by knighton in https://github.com/mosaicml/streaming/pull/120
* Better organize code by knighton in https://github.com/mosaicml/streaming/pull/246
* Update readthedocs python version to 3.9 by karan6181 in https://github.com/mosaicml/streaming/pull/249
* Create a new boto3 session per thread by karan6181 in https://github.com/mosaicml/streaming/pull/251
* Bump uvicorn from 0.21.1 to 0.22.0 by dependabot in https://github.com/mosaicml/streaming/pull/253
* Add support for Cloudflare R2 cloud storage by hlky in https://github.com/mosaicml/streaming/pull/255
* Fix typo in documentation's conversion `pile.py` link by ouhenio in https://github.com/mosaicml/streaming/pull/259
* Add support for Azure cloud storage by hlky in https://github.com/mosaicml/streaming/pull/256
* Fix slack link in readme by growlix in https://github.com/mosaicml/streaming/pull/262
* Bugfix in user_guide.md sample code by tginart in https://github.com/mosaicml/streaming/pull/263
* Add `Stream` usage example to README by hanlint in https://github.com/mosaicml/streaming/pull/266
* Update Stream documentation by karan6181 in https://github.com/mosaicml/streaming/pull/267
* Update README.md - slack by ejyuen in https://github.com/mosaicml/streaming/pull/273
* Bump fastapi from 0.95.1 to 0.95.2 by dependabot in https://github.com/mosaicml/streaming/pull/269
* Cold shard eviction by knighton in https://github.com/mosaicml/streaming/pull/219
* Update slack link with a URL shortener by karan6181 in https://github.com/mosaicml/streaming/pull/274
* Bump pydantic from 1.10.7 to 1.10.8 by dependabot in https://github.com/mosaicml/streaming/pull/276
* Bump yamllint from 1.31.0 to 1.32.0 by dependabot in https://github.com/mosaicml/streaming/pull/277
* Fix SD length calculation when resampling by knighton in https://github.com/mosaicml/streaming/pull/278
* Fixed performance degradation when not doing shard eviction by karan6181 in https://github.com/mosaicml/streaming/pull/279
* Derived predownload value using batch size and NCN by karan6181 in https://github.com/mosaicml/streaming/pull/280
* Support any S3-compatible object store (R2, Coreweave, Backblaze, etc.) by abhi-mosaic in https://github.com/mosaicml/streaming/pull/265
* Update docs pypi package and Improved documentation by karan6181 in https://github.com/mosaicml/streaming/pull/281
* Change the default number of canonical nodes by karan6181 in https://github.com/mosaicml/streaming/pull/282
* Set predownload value correctly for all usecase by karan6181 in https://github.com/mosaicml/streaming/pull/283
* Add documentation for MDSWriter, conversion scripts, and supported format by karan6181 in https://github.com/mosaicml/streaming/pull/232
* Ensure int64 by knighton in https://github.com/mosaicml/streaming/pull/284
* Wait for thread job to finish and Fixed filelock directory structure by karan6181 in https://github.com/mosaicml/streaming/pull/286
* Bump fastapi from 0.95.2 to 0.96.0 by dependabot in https://github.com/mosaicml/streaming/pull/287
* Bump version to 0.5.0 by karan6181 in https://github.com/mosaicml/streaming/pull/289
* Remove github action workflow concurrency check by karan6181 in https://github.com/mosaicml/streaming/pull/290

New Contributors
* hlky made their first contribution in https://github.com/mosaicml/streaming/pull/255
* ouhenio made their first contribution in https://github.com/mosaicml/streaming/pull/259
* growlix made their first contribution in https://github.com/mosaicml/streaming/pull/262
* tginart made their first contribution in https://github.com/mosaicml/streaming/pull/263
* hanlint made their first contribution in https://github.com/mosaicml/streaming/pull/266
* abhi-mosaic made their first contribution in https://github.com/mosaicml/streaming/pull/265

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.4.1...v0.5.0
