* Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. A detailed tutorial can be found [here](https://pytorch.org/data/0.5/tutorial.html#working-with-cloud-storage-providers)
* AWS S3 Benchmarking [result](https://github.com/pytorch/data/blob/main/benchmarks/cloud/aws_s3_results.md)
* Consolidated API for `DataLoader2` and provided a few `ReadingServices`, with detailed documentation now [available here](https://pytorch.org/data/0.5/dataloader2.html)
* Provided more comprehensive `DataPipe` operations, e.g., `random_split`, `repeat`, `set_length`, and `prefetch`.
* Provided pre-compiled torchdata binaries for arm64 Apple Silicon
## Backwards Incompatible Change
### DataPipe
Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (https://github.com/pytorch/pytorch/pull/83202)
An `IterDataPipe` is used to preserve data order
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">MapDataPipe.shuffle</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
</pre></sub></td>
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
</pre></sub></td>
</tr>
</table>
</p>
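The return type changed because shuffling is now performed lazily over a stream rather than by random access. A rough pure-Python sketch of buffer-based stream shuffling (`buffered_shuffle` is an illustrative helper under assumed semantics, not the torchdata implementation):

```python
import random
from typing import Iterable, Iterator


def buffered_shuffle(source: Iterable, buffer_size: int = 4, seed: int = 0) -> Iterator:
    """Yield items from `source` in shuffled order using a fixed-size buffer.

    Items are drawn lazily from the stream, which is why the result behaves
    like an IterDataPipe (ordered stream) rather than a MapDataPipe
    (random-access container).
    """
    rng = random.Random(seed)
    buffer = []
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit a random element once the buffer is full
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain the remaining buffered items in random order
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))
```

Because the output is produced one element at a time, it can only be consumed in iteration order, matching the `IterDataPipe` contract.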
`on_disk_cache` no longer accepts generator functions for the `filepath_fn` argument (https://github.com/pytorch/data/pull/810)
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">on_disk_cache</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
</pre></sub></td>
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
AssertionError
</pre></sub></td>
</tr>
</table>
</p>
### DataLoader2
Imposed single iterator constraint on `DataLoader2` (https://github.com/pytorch/data/pull/700)
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataLoader2 with a single iterator</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
</pre></sub></td>
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises an exception, since it1 is no longer valid
</pre></sub></td>
</tr>
</table>
</p>
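The constraint can be pictured with a toy loader. The class below is a hypothetical sketch of the idea (creating a new iterator resets iteration and invalidates the previous one), not the actual `DataLoader2` code:

```python
class SingleIteratorLoader:
    """Toy loader enforcing a single-iterator constraint: creating a new
    iterator resets iteration and invalidates any earlier iterator."""

    def __init__(self, source):
        self._source = list(source)
        self._token = 0  # identifies the currently valid iterator

    def __iter__(self):
        self._token += 1  # invalidate any previously created iterator
        return self._generate(self._token)

    def _generate(self, token):
        for item in self._source:
            if token != self._token:
                raise RuntimeError(
                    "This iterator has been invalidated because a newer "
                    "iterator was created."
                )
            yield item
```

With this sketch, creating `it2 = iter(dl)` makes `next(it2)` start from `0` again, while resuming the older iterator raises `RuntimeError`.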
Deep copy `DataPipe` during `DataLoader2` initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)
Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of them. In some cases, that raised an exception due to the single iterator constraint; in other cases, behavior could change because of the adapters (e.g., shuffling) applied by another DataLoader.
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">Deep copy DataPipe during DataLoader2 constructor</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
0 0
1 1
2 2
3 3
4 4
</pre></sub></td>
</tr>
</table>
</p>
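The design choice can be sketched in plain Python: deep-copying the pipeline in the constructor gives each loader its own state. `StatefulPipe` and `IsolatedLoader` below are hypothetical stand-ins for illustration, not torchdata classes:

```python
import copy


class StatefulPipe:
    """Minimal stand-in for a DataPipe whose iteration state lives on the object."""

    def __init__(self, items):
        self.items = list(items)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= len(self.items):
            raise StopIteration
        item = self.items[self.pos]
        self.pos += 1
        return item


class IsolatedLoader:
    """Deep-copies the pipeline at construction, so loaders built from the
    same pipe object cannot alter each other's state."""

    def __init__(self, pipeline):
        self._pipeline = copy.deepcopy(pipeline)

    def __iter__(self):
        return iter(self._pipeline)


pipe = StatefulPipe([0, 1, 2, 3, 4])
dl1 = IsolatedLoader(pipe)
dl2 = IsolatedLoader(pipe)
pairs = list(zip(dl1, dl2))  # each loader sees the full sequence
```

Without the `deepcopy`, both loaders would advance the same `pos` counter and interleave (or invalidate) each other's iteration.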
## Deprecations
### DataLoader2
Deprecated `traverse` function and `only_datapipe` argument (https://github.com/pytorch/pytorch/pull/85667)
Please use `traverse_dps` instead; it behaves the same as `traverse` with `only_datapipe=True` (https://github.com/pytorch/data/pull/793)
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataPipe traverse function</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
</pre></sub></td>
</tr>
</table>
</p>
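`traverse_dps` returns a nested dictionary describing the `DataPipe` graph. A simplified sketch of that traversal, following each pipe's `source_datapipe` attribute (the classes and `traverse_graph` helper here are hypothetical illustrations; the real function also handles collections of sources and unhashable pipes):

```python
class Wrapper:
    """Leaf pipe with no upstream source."""

    def __init__(self, items):
        self.items = items


class Mapper:
    """Pipe with one upstream source reachable via `source_datapipe`."""

    def __init__(self, source_datapipe, fn):
        self.source_datapipe = source_datapipe
        self.fn = fn


def traverse_graph(pipe):
    """Return {id(pipe): (pipe, children)}, where `children` maps each
    upstream pipe the same way, mirroring the nested-dict shape of the
    graph returned by `traverse_dps`."""
    children = {}
    source = getattr(pipe, "source_datapipe", None)
    if source is not None:
        children.update(traverse_graph(source))
    return {id(pipe): (pipe, children)}
```

For a two-stage chain `Wrapper -> Mapper`, the result has the `Mapper` at the top level with the `Wrapper` nested as its only child.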
## New Features
### DataPipe
* Added AIStore DataPipe (https://github.com/pytorch/data/pull/545, https://github.com/pytorch/data/pull/667)
* Added support for `IterDataPipe` to trace DataFrame operations (https://github.com/pytorch/pytorch/pull/71931)
* Added support for `DataFrameMakerIterDataPipe` to accept `dtype_generator` to solve unserializable `dtype` (https://github.com/pytorch/data/pull/537)
* Added graph snapshotting by counting number of successful yields for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79479, https://github.com/pytorch/pytorch/pull/79657)
* Implemented `drop` operation for `IterDataPipe` to drop column(s) (https://github.com/pytorch/data/pull/725)
* Implemented `FullSyncIterDataPipe` to synchronize distributed shards (https://github.com/pytorch/data/pull/713)
* Implemented `slice` and `flatten` operations for `IterDataPipe` (https://github.com/pytorch/data/pull/730)
* Implemented `repeat` operation for `IterDataPipe` (https://github.com/pytorch/data/pull/748)
* Added `LengthSetterIterDataPipe` (https://github.com/pytorch/data/pull/747)
* Added `RandomSplitter` (without buffer) (https://github.com/pytorch/data/pull/724)
* Added `padden_tokens` to `max_token_bucketize` to bucketize samples based on total padded token length (https://github.com/pytorch/data/pull/789)
* Implemented thread based `PrefetcherIterDataPipe` (https://github.com/pytorch/data/pull/770, https://github.com/pytorch/data/pull/818, https://github.com/pytorch/data/pull/826, https://github.com/pytorch/data/pull/842)
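The idea behind a thread-based prefetcher can be sketched in pure Python: a background thread fills a bounded queue so the consumer rarely waits on the source. This `prefetch` helper is an illustrative sketch, not the `PrefetcherIterDataPipe` implementation:

```python
import queue
import threading


def prefetch(source, buffer_size=2):
    """Yield items from `source` while a background thread keeps up to
    `buffer_size` items ready ahead of the consumer."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in source:
            q.put(item)  # blocks when the buffer is full
        q.put(sentinel)

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        item = q.get()
        if item is sentinel:
            break
        yield item
    worker.join()
```

The bounded queue is the key design point: it overlaps data loading with consumption while capping memory use at `buffer_size` items.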
### DataLoader2
* Added `CacheTimeout` `Adapter` to redefine cache timeout of the `DataPipe` graph (https://github.com/pytorch/data/pull/571)
* Added `DistributedReadingService` to support uneven data sharding (https://github.com/pytorch/data/pull/727)
* Added `PrototypeMultiProcessingReadingService`
    * Added prefetching (https://github.com/pytorch/data/pull/826)
    * Fixed process termination (https://github.com/pytorch/data/pull/837)
    * Enabled deterministic training in distributed/non-distributed environments (https://github.com/pytorch/data/pull/827)
    * Handled empty queue exception properly (https://github.com/pytorch/data/pull/785)
### Releng
* Provided pre-compiled torchdata binaries for arm64 Apple Silicon (https://github.com/pytorch/data/pull/692)
## Improvements
### DataPipe
* Fixed error message coming from single iterator constraint (https://github.com/pytorch/pytorch/pull/79547)
* Enabled profiler record context in `__next__` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79757)
* Raised warning for unpicklable local function (https://github.com/pytorch/pytorch/pull/80232, https://github.com/pytorch/data/pull/547)
* Cleaned up opened streams on the best effort basis (https://github.com/pytorch/data/pull/560, https://github.com/pytorch/pytorch/pull/78952)
* Used streaming reading mode for unseekable streams in `TarArchiveLoader` (https://github.com/pytorch/data/pull/653)
* Improved GDrive 'content-disposition' error message (https://github.com/pytorch/data/pull/654)
* Added `as_tuple` argument for `CSVParserIterDataPipe` to convert output from list to tuple (https://github.com/pytorch/data/pull/646)
* Raised an error when `HTTPReader` gets a 404 response (160) (https://github.com/pytorch/data/pull/569)
* Added default no-op behavior for `flatmap` (https://github.com/pytorch/data/pull/749)
* Added support to validate `input_col` with the provided map function for `DataPipe` (https://github.com/pytorch/pytorch/pull/80267, https://github.com/pytorch/data/pull/755, https://github.com/pytorch/pytorch/pull/84279)
* Made `ShufflerIterDataPipe` support snapshotting ([83535](https://github.com/pytorch/pytorch/pull/83535))
* Unified implementations between `in_batch_shuffle` with `shuffle` for `IterDataPipe` (https://github.com/pytorch/data/pull/745)
* Made `IterDataPipe.to_map_datapipe` load data lazily (https://github.com/pytorch/data/pull/765)
* Added `kwargs` to open files for `FSSpecFileLister` and `FSSpecSaver` (https://github.com/pytorch/data/pull/804)
* Added missing functional name for `FileLister` ([86497](https://github.com/pytorch/pytorch/pull/86497))
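The lazy `to_map_datapipe` change can be illustrated with a map-style view that defers materializing its source until the first access. `LazyMapView` is a hypothetical sketch of the pattern, not the torchdata class:

```python
class LazyMapView:
    """Map-style view over an iterable; the source is materialized only on
    the first indexed access, not at construction time."""

    def __init__(self, source):
        self._source = source
        self._data = None  # filled in lazily

    def _materialize(self):
        if self._data is None:
            self._data = dict(enumerate(self._source))
        return self._data

    def __getitem__(self, index):
        return self._materialize()[index]

    def __len__(self):
        return len(self._materialize())
```

Construction is now cheap; the cost of consuming the source iterable is paid only if (and when) random access is actually needed.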
### DataLoader
* Controlled shuffle option for all `DataPipes` with `set_shuffle` API (https://github.com/pytorch/pytorch/pull/83741)
* Made distributed process group lazily initialized & share seed via the process group (https://github.com/pytorch/pytorch/pull/85279)
### DataLoader2
* Improved graph traverse function
    * Added support for unhashable `DataPipe` (https://github.com/pytorch/pytorch/pull/80509, https://github.com/pytorch/data/pull/559)
    * Added support for all Python collection objects (https://github.com/pytorch/pytorch/pull/84079, https://github.com/pytorch/data/pull/773)
* Ensured `finalize` and `finalize_iteration` are called during shutdown or exception (https://github.com/pytorch/data/pull/846)
### Releng
* Enabled conda release to support GLIBC_2.27 (https://github.com/pytorch/data/pull/859)
## Bug Fixes
### DataPipe
* Fixed error for static typing (https://github.com/pytorch/data/pull/572, https://github.com/pytorch/data/pull/645, https://github.com/pytorch/data/pull/651, https://github.com/pytorch/pytorch/pull/81275, https://github.com/pytorch/data/pull/758)
* Fixed `fork` and `unzip` operations for the case of a single child (https://github.com/pytorch/pytorch/pull/81502)
* Corrected the type of exception that is being raised by `ShufflerMapDataPipe` (https://github.com/pytorch/pytorch/pull/82666)
* Fixed buffer overflow for `unzip` when `columns_to_skip` is specified (https://github.com/pytorch/data/pull/658)
* Fixed `TarArchiveLoader` to skip `open` for opened TarFile stream (https://github.com/pytorch/data/pull/679)
* Fixed mishandling of exception message in `IterDataPipe` (https://github.com/pytorch/pytorch/pull/84676)
* Fixed interface generation in `setup.py` ([87081](https://github.com/pytorch/pytorch/pull/87081))
## Performance
### DataLoader2
* Added benchmarking for `DataLoader2`
    * Added AWS cloud configurations (https://github.com/pytorch/data/pull/680)
    * Added benchmark from torchvision training references (https://github.com/pytorch/data/pull/714)
## Documentation
### DataPipe
* Added examples for data loading with `DataPipe`
    * Read Criteo TSV and Parquet files and apply TorchArrow operations (https://github.com/pytorch/data/pull/561)
    * Read caltech256 and coco with `AIStoreDataPipe` (https://github.com/pytorch/data/pull/582)
    * Read from TigerGraph database (https://github.com/pytorch/data/pull/783)
* Improved docstrings for `DataPipe`
    * `DataPipe` converters (https://github.com/pytorch/data/pull/710)
    * `S3` DataPipe (https://github.com/pytorch/data/pull/784)
    * `FileOpenerIterDataPipe` (https://github.com/pytorch/pytorch/pull/81407)
    * `buffer_size` for `MaxTokenBucketizer` (https://github.com/pytorch/data/pull/834)
    * `Prefetcher` (https://github.com/pytorch/data/pull/835)
* Added a tutorial on loading from cloud storage providers, including AWS S3, Google Cloud Platform, and Azure Blob Storage (https://github.com/pytorch/data/pull/812, https://github.com/pytorch/data/pull/836)
* Improved tutorial
    * Fixed tutorial for newline on Windows in `generate_csv` (https://github.com/pytorch/data/pull/675)
    * Improved note on shuffling behavior (https://github.com/pytorch/data/pull/688)
    * Fixed tutorial about shuffling before sharding (https://github.com/pytorch/data/pull/715)
    * Added `random_split` example (https://github.com/pytorch/data/pull/843)
* Simplified long type names for online doc (https://github.com/pytorch/data/pull/838)
### DataLoader2
* Improved docstring for `DataLoader2` (https://github.com/pytorch/data/pull/581, https://github.com/pytorch/data/pull/817)
* Added training examples using `DataLoader2`, `ReadingService` and `DataPipe` (https://github.com/pytorch/data/pull/563, https://github.com/pytorch/data/pull/664, https://github.com/pytorch/data/pull/670, https://github.com/pytorch/data/pull/787)
### Releng
* Added contribution guide for third-party library (https://github.com/pytorch/data/pull/663)
## Future Plans
We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also keep making `DataLoader2` and the related `ReadingService`s more stable, and provide more features such as snapshotting the data pipeline and restoring it from a serialized state. Stay tuned, and we welcome any feedback.
## Beta Usage Note
This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance needs. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.