* DataPipe graph is now backward compatible with `DataLoader` regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial [here](https://pytorch.org/data/0.4.0/tutorial.html#working-with-dataloader).
* [`AWSSDK`](https://github.com/aws/aws-sdk-cpp) is integrated to support listing/loading files from AWS S3.
* Added support for reading from `TFRecord` and Hugging Face Hub.
* `DataLoader2` is now available in prototype mode. For more details, please check our [future plans](Future-Plans).
## Backwards Incompatible Change
### DataPipe
Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` as soon as the shortest one is exhausted (https://github.com/pytorch/pytorch/pull/77145)
Please use `MultiplexerLongest` (functional API `mux_longest`) to achieve the previous functionality.
<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
</pre></sub></td>
</tr>
</table>
</p>
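The behavioral change can be sketched in plain Python with `itertools` (a conceptual analogy under stated assumptions, not the actual `Multiplexer` implementation): round-robin until the shortest input is exhausted versus continuing until the longest is.

```python
from itertools import zip_longest

_SENTINEL = object()  # marks an exhausted input

def mux(*iterables):
    # 0.4.0 behavior: round-robin, stop entirely once any input runs out.
    iterators = [iter(it) for it in iterables]
    while True:
        round_ = []
        for it in iterators:
            value = next(it, _SENTINEL)
            if value is _SENTINEL:
                return  # shortest input exhausted -> stop merging
            round_.append(value)
        yield from round_

def mux_longest(*iterables):
    # 0.3.0 behavior (now MultiplexerLongest): skip exhausted inputs,
    # keep going until every input runs out.
    for round_ in zip_longest(*iterables, fillvalue=_SENTINEL):
        yield from (v for v in round_ if v is not _SENTINEL)

print(list(mux(range(3), range(10, 15), range(20, 25))))
# -> [0, 10, 20, 1, 11, 21, 2, 12, 22]
print(list(mux_longest(range(3), range(10, 15), range(20, 25))))
# -> [0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
```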
Enforcing a single valid iterator for `IterDataPipes`, with or without multiple outputs (https://github.com/pytorch/pytorch/pull/70479, https://github.com/pytorch/pytorch/pull/75995)
If you need to reference the same `IterDataPipe` multiple times, please apply `.fork()` to the `IterDataPipe` instance.
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">IterDataPipe with a single output</th>
</tr>
</thead>
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]
</pre></sub></td>
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)  # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)  # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused by multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
# Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused by multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
</pre></sub></td>
</tr>
</table>
</p>
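The enforcement can be approximated in plain Python with a generation counter: creating a new iterator bumps the counter, and stale iterators notice on their next `next()` call. This is a hypothetical sketch, not TorchData's actual implementation; the class and method names are illustrative only.

```python
class SingleIteratorSource:
    """Toy source that, like an IterDataPipe in 0.4.0, keeps only its
    most recently created iterator valid."""

    def __init__(self, iterable):
        self._iterable = iterable
        self._generation = 0  # bumped each time a new iterator is created

    def __iter__(self):
        self._generation += 1  # invalidate all previously created iterators
        return self._make_iterator(self._generation)

    def _make_iterator(self, my_generation):
        for value in self._iterable:
            if self._generation != my_generation:
                raise RuntimeError(
                    "This iterator has been invalidated because another "
                    "iterator has been created from the same source."
                )
            yield value

source = SingleIteratorSource(range(10))
it1 = iter(source)
print(next(it1))    # -> 0
it2 = iter(source)  # creating it2 invalidates it1
print(next(it2))    # -> 0
try:
    next(it1)
except RuntimeError as error:
    print(error)    # it1 is stale, so it raises
```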
<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">IterDataPipe with multiple outputs</th>
</tr>
</thead>
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
# Basically shares the same reference as `it1` and
# doesn't reset because `cdp1` hasn't been read since the reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
# The next line resets all ChildDataPipes
# because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
</pre></sub></td>
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
# The next line should not invalidate anything, as there was no new iterator created
# for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
</pre></sub></td>
</tr>
</table>
</p>
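Conceptually, `.fork()` plays the same role as `itertools.tee` in plain Python: it buffers a single pass over the source so that several consumers can read independently. This is an analogy only; the real `ForkerIterDataPipe` adds buffer-size limits and the reset rules shown above.

```python
from itertools import tee

source = range(10)

# Analogue of: cdp1, cdp2 = source_dp.fork(num_instances=2)
branch1, branch2 = tee(iter(source), 2)

# Analogue of: zip_dp = cdp1.zip(cdp2) -- instead of zipping the same
# IterDataPipe with itself, zip the two forked branches.
print(list(zip(branch1, branch2)))
# -> [(0, 0), (1, 1), ..., (9, 9)]
```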
## Deprecations
### DataPipe
Deprecated the functional APIs `open_file_by_fsspec` and `open_file_by_iopath` of `IterDataPipe` (https://github.com/pytorch/pytorch/pull/78970, https://github.com/pytorch/pytorch/pull/79302)
Please use `open_files_by_fsspec` and `open_files_by_iopath` instead.
<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()  # No warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()  # No warning
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
</pre></sub></td>
</tr>
</table>
</p>
Argument `drop_empty_batches` of `Filter` (functional API `filter`) is deprecated and will be removed in a future release (https://github.com/pytorch/pytorch/pull/76060)
<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
</pre></sub></td>
</tr>
</table>
</p>
## New Features
### DataPipe
* Added utility to visualize `DataPipe` graphs (https://github.com/pytorch/data/pull/330)
### IterDataPipe
* Added `Bz2FileLoader` with functional API of `load_from_bz2` (https://github.com/pytorch/data/pull/312)
* Added `BatchMapper` (functional API: `map_batches`) and `FlatMapper` (functional API: `flat_map`) (https://github.com/pytorch/data/pull/359)
* Added support for WebDataset-style archives (https://github.com/pytorch/data/pull/367)
* Added `MultiplexerLongest` with functional API of `mux_longest` (https://github.com/pytorch/data/pull/372)
* Added `ZipperLongest` with functional API of `zip_longest` (https://github.com/pytorch/data/pull/373)
* Added `MaxTokenBucketizer` with functional API of `max_token_bucketize` (https://github.com/pytorch/data/pull/283)
* Added `S3FileLister` (functional API: `list_files_by_s3`) and `S3FileLoader` (functional API: `load_files_by_s3`) integrated with the native AWSSDK (https://github.com/pytorch/data/pull/165)
* Added `HuggingFaceHubReader` (https://github.com/pytorch/data/pull/490)
* Added `TFRecordLoader` with functional API of `load_from_tfrecord` (https://github.com/pytorch/data/pull/308)
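The semantics of the new `flat_map` and `map_batches` operators can be sketched in plain Python (a conceptual analogy, not the actual `FlatMapper`/`BatchMapper` implementations): `flat_map` applies a function that returns an iterable per element and flattens the results, while `map_batches` applies the function to whole batches.

```python
def flat_map(iterable, fn):
    # fn returns an iterable per element; results are flattened
    # into one continuous stream.
    for element in iterable:
        yield from fn(element)

def map_batches(iterable, fn, batch_size):
    # fn is applied to whole batches at once; its output is
    # flattened back into the stream.
    batch = []
    for element in iterable:
        batch.append(element)
        if len(batch) == batch_size:
            yield from fn(batch)
            batch = []
    if batch:  # trailing partial batch
        yield from fn(batch)

print(list(flat_map(range(3), lambda x: [x, x * 10])))
# -> [0, 0, 1, 10, 2, 20]
print(list(map_batches(range(5), lambda b: [sum(b)], batch_size=2)))
# -> [1, 5, 4]
```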
### MapDataPipe
* Added `UnZipper` with functional API of `unzip` (https://github.com/pytorch/data/pull/325)
* Added `MapToIterConverter` with functional API of `to_iter_datapipe` (https://github.com/pytorch/data/pull/327)
* Added `InMemoryCacheHolder` with functional API of `in_memory_cache` (https://github.com/pytorch/data/pull/328)
### Releng
* Added nightly releases for TorchData. Users should be able to install nightly TorchData via
* `pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu`
* `conda install -c pytorch-nightly torchdata`
* Added support of AWSSDK enabled `DataPipes`. See: [README](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/README.md)
* AWSSDK was pre-compiled and bundled into TorchData for both the nightly and 0.4.0 releases
## Improvements
### DataPipe
* Added optional `encoding` argument to `FileOpener` (https://github.com/pytorch/pytorch/pull/72715)
* Renamed `BucketBatcher` argument to avoid name collision (https://github.com/pytorch/data/pull/304)
* Removed default parameter of `ShufflerIterDataPipe` (https://github.com/pytorch/pytorch/pull/74370)
* Made the profiler wrapper delegate function calls to the `DataPipe` iterator (https://github.com/pytorch/pytorch/pull/75275)
* Added `input_col` argument to `flatmap` for applying `fn` to the specific column(s) (https://github.com/pytorch/data/pull/363)
* Improved debug message when exceptions are raised within `IterDataPipe` (https://github.com/pytorch/pytorch/pull/75618)
* Improved debug message when argument is a tuple/list of `DataPipes` (https://github.com/pytorch/pytorch/pull/76134)
* Added functional APIs to `FileOpener` (functional API: `open_files`) and `StreamReader` (functional API: `read_from_stream`) (https://github.com/pytorch/pytorch/pull/76233)
* Enabled graph traversal for `MapDataPipe` (https://github.com/pytorch/pytorch/pull/74851)
* Added `input_col` argument to `filter` for applying `filter_fn` to the specific column(s) (https://github.com/pytorch/pytorch/pull/76060)
* Added functional APIs for `OnlineReaders` (https://github.com/pytorch/data/pull/369)
  * `HTTPReaderIterDataPipe`: `read_from_http`
  * `GDriveReaderDataPipe`: `read_from_gdrive`
  * `OnlineReaderIterDataPipe`: `read_from_remote`
* Cleared buffer for `DataPipe` during `__del__` (https://github.com/pytorch/pytorch/pull/76345)
* Overrode the incorrect Python HTTPS proxy on Windows (https://github.com/pytorch/data/pull/371)
* Exposed the functional API `to_map_datapipe` from `IterDataPipe`'s pyi interface (https://github.com/pytorch/data/pull/326)
* Moved buffer for `IterDataPipe` from iterator to instance (self) (https://github.com/pytorch/data/pull/388)
* Improved `DataPipe` serialization:
  * Enabled serialization of `ForkerIterDataPipe` (https://github.com/pytorch/pytorch/pull/73118)
  * Fixed issue with `DataPipe` serialization with dill (https://github.com/pytorch/pytorch/pull/72896)
  * Applied special serialization when dill is installed (https://github.com/pytorch/pytorch/pull/74958)
  * Applied dill serialization for `demux` and added a cache to graph traversal (https://github.com/pytorch/pytorch/pull/75034)
  * Revamped serialization logic of `DataPipes` (https://github.com/pytorch/pytorch/pull/74984)
* Prevented automatic reset after state is restored (https://github.com/pytorch/pytorch/pull/77774)
* Moved `IterDataPipe` buffers from `__iter__` to instance (`self`) (https://github.com/pytorch/pytorch/pull/76999)
* Refactored the buffer of `Multiplexer` from `__iter__` to instance (`self`) (https://github.com/pytorch/pytorch/pull/77775)
* Made `GDriveReader` handle the Virus Scan Warning (https://github.com/pytorch/data/pull/442)
* Added `**kwargs` arguments to `HttpReader` to specify extra parameters for HTTP requests (https://github.com/pytorch/data/pull/392)
* Updated `FSSpecFileLister` and `IoPathFileLister` to support multiple root paths, and updated `FSSpecFileLister` to support S3 URLs (https://github.com/pytorch/data/pull/383)
* Fixed race condition issues with writing files in multiprocessing:
  * Added `filelock` to `IoPathSaver` to prevent race conditions (https://github.com/pytorch/data/pull/413)
  * Added a lock mechanism to prevent `on_disk_cache` from downloading twice (https://github.com/pytorch/data/pull/409)
  * Added instructions about `ImportError` for portalocker (https://github.com/pytorch/data/pull/506)
* Added an 's' to the functional names of open/list `DataPipes` (https://github.com/pytorch/data/pull/479)
* Added `list_file` functional API to `FSSpecFileLister` and `IoPathFileLister` (https://github.com/pytorch/data/pull/463)
* Added `list_files` functional API to `FileLister` (https://github.com/pytorch/pytorch/pull/78419)
* Improved FSSpec `DataPipes` to accept extra keyword arguments (https://github.com/pytorch/data/pull/495)
* Passed through `kwargs` to the `json.loads` call in `JsonParser` (https://github.com/pytorch/data/pull/518)
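Since those extra keyword arguments are forwarded to the standard-library `json.loads`, anything `json.loads` accepts can be used. For example, `parse_float` (shown here with plain `json.loads`, independent of TorchData):

```python
import json
from decimal import Decimal

line = '{"price": 19.99, "qty": 3}'

# Default: JSON numbers with a fractional part become Python floats
default = json.loads(line)

# Forwarding parse_float (a json.loads keyword argument) decodes
# those numbers as exact Decimals instead
exact = json.loads(line, parse_float=Decimal)

print(type(default["price"]).__name__)  # -> float
print(exact["price"])                   # -> 19.99 (as a Decimal)
```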
### DataLoader
* Added the ability to use `dill` to pass `DataPipes` in multiprocessing (https://github.com/pytorch/pytorch/pull/77288)
* Made `DataLoader` automatically apply sharding to the `DataPipe` graph in single-process, multiprocessing, and distributed environments (https://github.com/pytorch/pytorch/pull/78762, https://github.com/pytorch/pytorch/pull/78950, https://github.com/pytorch/pytorch/pull/79041, https://github.com/pytorch/pytorch/pull/79124, https://github.com/pytorch/pytorch/pull/79524)
* Made `ShufflerDataPipe` deterministic with `DataLoader` in single-process, multiprocessing, and distributed environments (https://github.com/pytorch/pytorch/pull/77741, https://github.com/pytorch/pytorch/pull/77855, https://github.com/pytorch/pytorch/pull/78765, https://github.com/pytorch/pytorch/pull/79829)
* Prevented overriding shuffle settings in `DataLoader` for `DataPipe` (https://github.com/pytorch/pytorch/pull/75505)
### Releng
* Made `requirements.txt` the single source of truth for the TorchData version (https://github.com/pytorch/data/pull/414)
* Prevented release GHA workflows from running on forked branches (https://github.com/pytorch/data/pull/361)
## Performance
### DataPipe
* Lazily generated exception messages for performance (https://github.com/pytorch/pytorch/pull/78673)
  * This fixes a regression introduced by the PRs related to the single-iterator constraint.
* Disabled the profiler for `IterDataPipe` by default (https://github.com/pytorch/pytorch/pull/78674)
  * By skipping the record function when the profiler is not enabled, the speedup is up to [5-6x](https://github.com/pytorch/pytorch/pull/78674#issuecomment-1146233729) for `DataPipes` whose internal operations are very simple (e.g., `IterableWrapper`)
## Documentation
### DataPipe
* Fixed typo in TorchVision example (https://github.com/pytorch/data/pull/311)
* Updated `DataPipe` naming guidelines (https://github.com/pytorch/data/pull/428)
* Updated documentation references from `DataSet` to PyTorch `Dataset` (https://github.com/pytorch/data/pull/292)
* Added examples for graphs, meshes and point clouds using `DataPipe` (https://github.com/pytorch/data/pull/337)
* Added examples for semantic segmentation and time series using `DataPipe` (https://github.com/pytorch/data/pull/340)
* Expanded the contribution guide, especially including instructions to add a new `DataPipe` (https://github.com/pytorch/data/pull/354)
* Updated tutorial about placing `sharding_filter` (https://github.com/pytorch/data/pull/487)
* Improved graph visualization documentation (https://github.com/pytorch/data/pull/504)
* Added instructions about `ImportError` for portalocker (https://github.com/pytorch/data/pull/506)
* Updated examples to avoid lambdas (https://github.com/pytorch/data/pull/524)
* Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)
* Updated links for tutorial (https://github.com/pytorch/data/pull/543)
### IterDataPipe
* Fixed documentation for `IterToMapConverter`, `S3FileLister` and `S3FileLoader` (https://github.com/pytorch/data/pull/381)
### MapDataPipe
* Updated the contributing guide and added guidance for `MapDataPipe` (https://github.com/pytorch/data/pull/379)
  * Rather than re-implementing the same functionality twice for both `IterDataPipe` and `MapDataPipe`, we encourage users to use the built-in functionality of `IterDataPipe` and convert to `MapDataPipe` as needed.
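The conversion idea can be sketched in plain Python (a conceptual analogy using a dict; the real converter materializes key/value pairs from an `IterDataPipe` into a `MapDataPipe`):

```python
def iter_to_map(pairs):
    # One sequential pass over an iterable of (key, value) pairs
    # builds a structure with random access, mirroring what the
    # iter-to-map conversion does for DataPipes.
    return dict(pairs)

samples = [("img_0", 0), ("img_1", 1), ("img_2", 2)]
mapped = iter_to_map(samples)
print(mapped["img_1"])  # -> 1
```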
### DataLoader/DataLoader2
* Fixed tutorial about `DataPipe` working with `DataLoader` (https://github.com/pytorch/data/pull/458)
* Updated examples and tutorial after automatic sharding has landed (https://github.com/pytorch/data/pull/505)
* Added a README for `DataLoader2` (https://github.com/pytorch/data/pull/526, https://github.com/pytorch/data/pull/541)
### Releng
* Added nightly documentation for TorchData at https://pytorch.org/data/main/
* Fixed instruction to install TorchData (https://github.com/pytorch/data/pull/455)
## Future Plans
For `DataLoader2`, we are introducing new ways for `DataPipes`, the data-loading API, and backends (aka `ReadingServices`) to interact. The feature is stable in terms of API, but not yet functionally complete. We welcome early adopters, feedback, and potential contributors.
## Beta Usage Note
This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance needs. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.