* Graduation of `MultiProcessingReadingService` from prototype to beta
  * This is the default `ReadingService` that we expect most users to use; it closely aligns with the functionality of the old `DataLoader`, with improvements
  * With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.
* Introduction of Sequential ReadingService
  * Enables the usage of multiple `ReadingService`s at the same time (see the sketch after this list)
* Addition of a comprehensive tutorial of `DataLoader2` and its subcomponents
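The snippet below is a minimal sketch of composing two `ReadingService`s with `SequentialReadingService`. It assumes the classes are importable from `torchdata.dataloader2`, that services are applied to the graph in the order they are passed, and that a distributed process group is already initialized for `DistributedReadingService`.

```python
from torchdata.dataloader2 import (
    DataLoader2,
    DistributedReadingService,
    MultiProcessingReadingService,
    SequentialReadingService,
)
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(1000)).shuffle().sharding_filter()

# Compose two ReadingServices: shard across distributed ranks first,
# then across worker processes within each rank.
dist_rs = DistributedReadingService()
mp_rs = MultiProcessingReadingService(num_workers=2)
rs = SequentialReadingService(dist_rs, mp_rs)

dl = DataLoader2(dp, reading_service=rs)
for batch in dl:
    ...
dl.shutdown()
```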
## Backwards Incompatible Change
### DataLoader2
* Officially graduate `PrototypeMultiProcessingReadingService` to `MultiProcessingReadingService` ([1009](https://github.com/pytorch/data/pull/1009))
  * The APIs of `MultiProcessingReadingService`, as well as its internal implementation, have changed. Overall, this should provide a better user experience. A usage sketch with the new signature follows the comparison table below.
  * Please refer to [our documentation](https://pytorch.org/data/0.6/dataloader2.html#readingservice) for details.
<p align="center">
<table align="center">
<tr><th>0.5.0</th><th>0.6.0</th></tr>
<tr valign="top">
<td><sub> It previously took the following arguments:
<pre lang="python">
MultiProcessingReadingService(
num_workers: int = 0,
pin_memory: bool = False,
timeout: float = 0,
worker_init_fn: Optional[Callable[[int], None]] = None,
multiprocessing_context=None,
prefetch_factor: Optional[int] = None,
persistent_workers: bool = False,
)
</pre></sub></td>
<td><sub> The new version takes these arguments: <pre lang="python">
MultiProcessingReadingService(
num_workers: int = 0,
multiprocessing_context: Optional[str] = None,
worker_prefetch_cnt: int = 10,
main_prefetch_cnt: int = 10,
worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,
worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,
)
</pre></sub></td>
</tr>
</table>
</p>
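The snippet below is a minimal usage sketch with the 0.6.0 signature. The argument values are illustrative; prefetching now appears to be configured via `worker_prefetch_cnt` and `main_prefetch_cnt` rather than the old `prefetch_factor`.

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(100)).shuffle().sharding_filter()

rs = MultiProcessingReadingService(
    num_workers=2,           # spawn two worker processes
    worker_prefetch_cnt=10,  # prefetch buffer inside each worker
    main_prefetch_cnt=10,    # prefetch buffer in the main process
)

dl = DataLoader2(dp, reading_service=rs)
for x in dl:
    ...
dl.shutdown()
```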
* Deep copy ReadingService during `DataLoader2` initialization ([746](https://github.com/pytorch/data/pull/746))
  * Within `DataLoader2`, a deep copy of the passed-in `ReadingService` object is created during initialization and used thereafter.
  * This prevents multiple `DataLoader2`s from accidentally sharing state when the same `ReadingService` object is passed into them.
<p align="center">
<table align="center">
<tr><th>0.5.0</th><th>0.6.0</th></tr>
<tr valign="top">
<td><sub> Previously, a ReadingService object used across multiple DataLoader2 instances shared state among them.
<pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4
</pre></sub></td>
<td><sub> DataLoader2 now deep copies the ReadingService object during initialization and the ReadingService state is no longer shared.
<pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
# Note that we are still examining `dl1.reading_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2
</pre></sub></td>
</tr>
</table>
</p>
## Deprecations
### DataPipe
#### In PyTorch Core
* Remove previously deprecated `FileLoaderDataPipe` ([89794](https://github.com/pytorch/pytorch/pull/89794))
* Mark imports from ``torch.utils.data.datapipes.iter.grouping`` as deprecated ([94527](https://github.com/pytorch/pytorch/pull/94527))
#### TorchData
* Remove certain deprecated functional APIs as previously scheduled (890)
### Releng
* Drop support for Python 3.7 as aligned with PyTorch core library ([974](https://github.com/pytorch/data/pull/974))
## New Features
### DataLoader2
* Add graph function to list DataPipes from DataPipe graphs (888)
* Add functions to set seeds to DataPipe graphs (894)
* Add `worker_init_fn` and `worker_reset_fn` to MultiProcessingReadingService (907); a hook sketch follows this list
* Add round robin sharding to support non-replicable DataPipe for MultiProcessing (919)
* Guarantee that DataPipes execute `reset_iterator` when all loops have received reset request in the dispatching process (994)
* Add initial support for randomness control within `DataLoader2` (801)
* Add support for Sequential ReadingService ([commit](https://github.com/pytorch/data/commit/807db8f8c7282b2f48b48b1e07439c119a2ba12f#diff-d5ce955f25b587c0cbadcc87ad1b22b6027053e46e2920a8de7abbf5312cc24c))
* Enable SequentialReadingService to support MultiProcessing + Distributed (985)
* Add `limit`, `pause`, `resume` operations to halt DataPipes in `DataLoader2` (879)
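A hedged sketch of the new per-worker hooks, following the signatures shown in the comparison table above; it assumes `WorkerInfo` exposes `worker_id` and `num_workers`, and that both hooks receive the worker's DataPipe and must return a (possibly modified) DataPipe.

```python
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService
from torchdata.datapipes.iter import IterableWrapper

def init_fn(datapipe, worker_info):
    # Runs once in each worker before iteration starts; `worker_id` and
    # `num_workers` are assumed attributes of WorkerInfo.
    print(f"worker {worker_info.worker_id}/{worker_info.num_workers} ready")
    return datapipe

def reset_fn(datapipe, worker_info, seed_generator):
    # Runs in each worker at the start of every epoch; the SeedGenerator can
    # be used to derive per-epoch, per-worker randomness.
    return datapipe

dp = IterableWrapper(range(100)).shuffle().sharding_filter()
rs = MultiProcessingReadingService(
    num_workers=2, worker_init_fn=init_fn, worker_reset_fn=reset_fn
)
dl = DataLoader2(dp, reading_service=rs)
```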
### DataPipe
* Add `ShardExpander` IterDataPipe (405)
* Add `RoundRobinDemux` IterDataPipe (903)
* Implement `PinMemory` IterDataPipe (1014)
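A minimal sketch of the `PinMemory` DataPipe added above, assuming its functional form is `pin_memory()` (see the documentation item under Docs) and that pinning only takes effect when CUDA is available.

```python
import torch
from torchdata.datapipes.iter import IterableWrapper

# Assumes the functional form of the PinMemory IterDataPipe is `pin_memory()`.
dp = IterableWrapper([torch.randn(4) for _ in range(10)]).pin_memory()

if torch.cuda.is_available():
    for tensor in dp:
        assert tensor.is_pinned()
```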
### Releng
* Add conda Python 3.11 builds (1010)
* Enable Python 3.11 conda builds for Mac/Windows (1026)
* Update C++ standard to 17 (1051)
## Improvements
### DataLoader2
#### In PyTorch Core
* Fix `apply_sharding` to accept one `sharding_filter` per branch ([90769](https://github.com/pytorch/pytorch/pull/90769))
#### TorchData
* Consolidate checkpoint contract with checkpoint component ([867](https://github.com/pytorch/data/pull/867))
* Update `load_state_dict()` signature to align with `TorchSnapshot` ([887](https://github.com/pytorch/data/pull/887))
* Apply sharding based on priority and combine `DistInfo` and `ExtraInfo` (used to store distributed metadata) ([916](https://github.com/pytorch/data/pull/916))
* Prevent reset iteration message from being sent to workers twice ([917](https://github.com/pytorch/data/pull/917))
* Add support to keep non-replicable DataPipe in the main process ([950](https://github.com/pytorch/data/pull/950))
* Safeguard `DataLoader2Iterator`'s `__getattr__` method ([1004](https://github.com/pytorch/data/pull/1004))
* Forward worker exceptions and have `DataLoader2` exit with them ([1003](https://github.com/pytorch/data/pull/1003))
* Attach traceback to Exception and test dispatching process ([1036](https://github.com/pytorch/data/pull/1036))
### DataPipe
#### In PyTorch Core
* Add auto-completion to DataPipes in REPLs (e.g. Jupyter notebook) ([86960](https://github.com/pytorch/pytorch/pull/86960))
* Add group support to `sharding_filter` ([88424](https://github.com/pytorch/pytorch/pull/88424))
* Add `keep_key` option to `Grouper` ([92532](https://github.com/pytorch/pytorch/pull/92532))
#### TorchData
* Add a masks option to filter files in S3 DataPipe ([880](https://github.com/pytorch/data/pull/880))
* Make HeaderIterDataPipe with `limit=None` a no-op ([908](https://github.com/pytorch/data/pull/908))
* Update `fsspec` DataPipe to be compatible with the latest version of `fsspec` ([957](https://github.com/pytorch/data/pull/957))
* Expand the possible input options for HuggingFace DataPipe ([952](https://github.com/pytorch/data/pull/952))
* Improve exception handling/skipping in online DataPipes ([968](https://github.com/pytorch/data/pull/968))
* Allow the option to place key in output in `MapKeyZipper` ([1042](https://github.com/pytorch/data/pull/1042))
* Allow single key option for `Slicer` ([1041](https://github.com/pytorch/data/pull/1041))
### Releng
* Add pure Python platform-agnostic wheel ([988](https://github.com/pytorch/data/pull/988))
## Bug Fixes
### DataLoader2
#### In PyTorch Core
* Change serialization wrapper implementation to be an iterator ([87459](https://github.com/pytorch/pytorch/pull/87459))
### DataPipe
#### In PyTorch Core
* Fix type checking to accept both Iter and Map DataPipe ([87285](https://github.com/pytorch/pytorch/pull/87285))
* Fix: Make ``__len__`` of datapipes dynamic ([88302](https://github.com/pytorch/pytorch/pull/88302))
* Properly cleanup unclosed files within generator function ([89973](https://github.com/pytorch/pytorch/pull/89973))
* Remove iterator depletion in `Zipper` ([89974](https://github.com/pytorch/pytorch/pull/89974))
#### TorchData
* Fix `to_graph` DataPipeGraph visualization function (872)
* Make lengths of DataPipe dynamic (873)
* Fix `max_token_bucketize` to accept incomparable data (883)
* Fix `S3FileLoader` local file clobbering (895)
* Fix `fsspec` DataPipe for paths starting with `az://` (849)
* Properly cleanup unclosed files within generator function (910)
## Performance
### DataLoader2
* Add minimal, reproducible AWS S3 benchmark ([847](https://github.com/pytorch/data/pull/847))
## Docs
### DataLoader2
* Add Distributed ReadingService `DataLoader2` training loop example ([863](https://github.com/pytorch/data/pull/863))
* Update README and documentation with latest changes ([954](https://github.com/pytorch/data/pull/954))
* Update Colab example with `DataLoader2` content ([979](https://github.com/pytorch/data/pull/979))
* Add initial `DataLoader2` Tutorial ([980](https://github.com/pytorch/data/pull/980))
* Add LAION-5B Example with `DataLoader2` ([1034](https://github.com/pytorch/data/pull/1034))
* Add Round Robin Sharding documentation ([1050](https://github.com/pytorch/data/pull/1050))
### DataPipe
* Add `pin_memory` to documentation (1046)
### Releng
* Fix links in README ([995](https://github.com/pytorch/data/pull/995))
* Fix links in contribution guide ([1053](https://github.com/pytorch/data/pull/1053))
## Devs
### DataPipe
#### In PyTorch Core
* Add container template for _Fork and _Demux ([89216](https://github.com/pytorch/pytorch/pull/89216))
* Refactor sharding data pipe into a separate file ([94095](https://github.com/pytorch/pytorch/pull/94095))
* Fix interface generation in setup.py ([87081](https://github.com/pytorch/pytorch/pull/87081))
#### TorchData
* Add tests to validate iteration over combining DataPipe with infinite input (912)
### Releng
* Update GHA version to utilize Node16 ([830](https://github.com/pytorch/data/pull/830))
* Enable usage of `sphinx` doctest ([850](https://github.com/pytorch/data/pull/850))
* Update submodule ([955](https://github.com/pytorch/data/pull/955))
* Make `portalocker` optional dependency ([1007](https://github.com/pytorch/data/pull/1007))
## Future Plans
For `DataLoader2`, we are actively developing new features, such as checkpointing and the ability to execute part of the DataPipe graph on a single process before dispatching the outputs to worker processes. You may begin to see some of these features in nightly builds, and we expect them to be part of the next release.
We welcome feedback and feature requests (let us know your use cases!). We always welcome potential contributors.
## Beta Usage Note
This library is currently in the Beta stage and does not yet have a fully stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like covered, please open a GitHub issue. We'd love to hear thoughts and feedback.