Torchdata

Latest version: v0.7.1

Safety actively analyzes 625610 Python packages for vulnerabilities to keep your Python projects secure.

Page 1 of 2

0.7.0

Current status

**:warning: As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you suggestions or comments (please use [1196](https://github.com/pytorch/data/issues/1196) for feedback).**

Bug Fixes

- MPRS request/response cycle for workers (https://github.com/pytorch/data/commit/40dd648bdd2b7b9c078ba3d2f47316b6dd4446d3)
- Sequential reading service checkpointing (https://github.com/pytorch/data/commit/8d452cf4d0688fdce478089fe77cba52fc27e1c3)
- Cancel future object and always run callback in FullSync during shutdown (1171)
- DataPipe, Ensures Prefetcher shuts down properly (1166)
- DataPipe, Fix FullSync shutdown hanging issue while paused (1153)
- DataPipe, Fix a word in WebDS DataPipe (1156)
- DataPipe, Add handler argument to iopath DataPipes (1154)
- Prevent in_memory_cache from yielding from source_dp when it's fully cache (1160)
- Fix pin_memory to support single-element batch (1158)
- DataLoader2, Removing delegation for 'pause', 'limit', and 'resume' (1067)
- DataLoader2, Handle MapDataPipe by converting to IterDataPipe internally by default (1146)

New Features

- Implement InProcessReadingService (1139)
- Enable miniepoch for MultiProcessingReadingService (1170)
- DataPipe, Implement pause/resume for FullSync (1130)
- DataLoader2, Saving and restoring initial seed generator (998)
- Add ThreadPoolMapper (1052)

0.6.1

Highlights

This minor release is aligned with PyTorch 2.0.1 and primarily fixes bugs that are introduced in the 0.6.0 release. We sincerely thank our users and contributors for spotting various bugs and helping us to fix them.

Bug Fixes

DataLoader2

* Properly clean up processes and queues for MPRS and Fix pause for prefetch (1096)
* Fix DataLoader2 `seed = 0` bug (1098)
* Previously, if `seed = 0` was passed into `DataLoader2`, the `seed` value in `DataLoader2` would not be set and the seed would be unused. This change fixes that and allow `seed = 0` to be used normally.
* Fix `worker_init_fn` to update DataPipe graph and move worker prefetch to the end of Worker pipeline (1100)

DataPipe

* Fix `pin_memory_fn` to support `namedtuple` (1086)
* Fix typo for `portalocker` at import time (1099)

Improvements

DataPipe

* Skip `FullSync` operation when `world_size == 1` (1065)

Docs

* Add long project description to `setup.py` for display on PyPI (1094)

Beta Usage Note

This library is currently in the Beta stage and currently does not have a fully stable release. The API may change based on user feedback or performance. We are committed to bring this library to stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback. As always, we welcome new contributors to our repo.

0.6.0

* Graduation of `MultiProcessingReadingService` from prototype to beta
* This is the default `ReadingService` that we expect most users to use; it closely aligns with the functionalities of old `DataLoader` with improvements
* With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.
* Introduction of Sequential ReadingService
* Enables the usage of multiple `ReadingService`s at the same time
* Adding comprehensive tutorial of `DataLoader2` and its subcomponents

Backwards Incompatible Change

DataLoader2

* Officially graduate PrototypeMultiProcessingReadingService to MultiProcessingReadingService ([1009]([https://github.com/pytorch/data/pull/1009](https://github.com/pytorch/data/pull/1009)))
* The APIs of `MultiProcessingReadingService` as well as the internal implementation have changed. Overall, this should provide a better user experience.
* Please refer to [our documentation]([https://pytorch.org/data/0.6/dataloader2.html#readingservice](https://pytorch.org/data/0.6/dataloader2.html#readingservice)) for details.

<p align="center">
<table align="center">
<tr><th>0.5.0</th><th>0.6.0</th></tr>
<tr valign="top">
<td><sub> It previously took the following arguments:
<pre lang="python">
MultiProcessingReadingService(
num_workers: int = 0,
pin_memory: bool = False,
timeout: float = 0,
worker_init_fn: Optional[Callable[[int], None]] = None,
multiprocessing_context=None,
prefetch_factor: Optional[int] = None,
persistent_workers: bool = False,
)
</pre></sub></td>
<td><sub> The new version takes these arguments: <pre lang="python">
MultiProcessingReadingService(
num_workers: int = 0,
multiprocessing_context: Optional[str] = None,
worker_prefetch_cnt: int = 10,
main_prefetch_cnt: int = 10,
worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,
worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,
)
</pre></sub></td>
</tr>
</table>
</p>

* Deep copy ReadingService during `DataLoader2` initialization ([746]([https://github.com/pytorch/data/pull/746](https://github.com/pytorch/data/pull/746)))
* Within `DataLoader2`, a deep copy of the passed-in `ReadingService` object is created during initialization and will be subsequently used.
* This prevents multiple `DataLoader2`s from accidentally sharing states when the same `ReadingService` object is passed into them.

<p align="center">
<table align="center">
<tr><th>0.5.0</th><th>0.6.0</th></tr>
<tr valign="top">
<td><sub> Previously, a ReadingService object that is used in multiple DataLoader2 shared state among them.
<pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
Note that we are still examining `dl1.read_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 4
</pre></sub></td>
<td><sub> DataLoader2 now deep copies the ReadingService object during initialization and the ReadingService state is no longer shared.
<pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> rs = MultiProcessingReadingService(num_workers=2)
>>> dl1 = DataLoader2(dp, reading_service=rs)
>>> dl2 = DataLoader2(dp, reading_service=rs)
>>> next(iter(dl1))
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
>>> next(iter(dl2))
Note that we are still examining `dl1.read_service` below
>>> print(f"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}")
Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2
</pre></sub></td>
</tr>
</table>
</p>

Deprecations

DataPipe

In PyTorch Core

* Remove previously deprecated `FileLoaderDataPipe` ([89794](https://github.com/pytorch/pytorch/pull/89794))
* Mark imports from ``torch.utils.data.datapipes.iter.grouping`` as deprecated ([94527](https://github.com/pytorch/pytorch/pull/94527))

TorchData

* Remove certain deprecated functional APIs as previously scheduled (890)

Releng

* Drop support for Python 3.7 as aligned with PyTorch core library ([974]([https://github.com/pytorch/data/pull/974](https://github.com/pytorch/data/pull/974)))

New Features

DataLoader2

* Add graph function to list DataPipes from DataPipe graphs (888)
* Add functions to set seeds to DataPipe graphs (894)
* Add `worker_init_fn` and `worker_reset_fn` to MultiProcessingReadingService (907)
* Add round robin sharding to support non-replicable DataPipe for MultiProcessing (919)
* Guarantee that DataPipes execute `reset_iterator` when all loops have received reset request in the dispatching process (994)
* Add initial support for randomness control within `DataLoader2` (801)
* Add support for Sequential ReadingService ([commit](https://github.com/pytorch/data/commit/807db8f8c7282b2f48b48b1e07439c119a2ba12f#diff-d5ce955f25b587c0cbadcc87ad1b22b6027053e46e2920a8de7abbf5312cc24c))
* Enable SequentialReadingService to support MultiProcessing + Distributed (985)
* Add `limit`, `pause`, `resume` operations to halt DataPipes in `DataLoader2` (879)

DataPipe

* Add `ShardExpander` IterDataPipe (405)
* Add `RoundRobinDemux` IterDataPipe (903)
* Implement `PinMemory` IterDataPipe (1014)

Releng

* Add conda Python 3.11 builds (1010)
* Enable Python 3.11 conda builds for Mac/Windows (1026)
* Update C++ standard to 17 (1051)

Improvements

DataLoader2

In PyTorch Core

* Fix `apply_sharding` to accept one `sharding_filter` per branch ([90769](https://github.com/pytorch/pytorch/pull/90769))

TorchData

* Consolidate checkpoint contract with checkpoint component ([867](https://github.com/pytorch/data/pull/867))
* Update `load_state_dict()` signature to align with `TorchSnapshot` ([887](https://github.com/pytorch/data/pull/887))
* Apply sharding based on priority and combine `DistInfo` and `ExtraInfo` (used to store distributed metadata) ([916](https://github.com/pytorch/data/pull/916))
* Prevent reset iteration message from being sent to workers twice ([917](https://github.com/pytorch/data/pull/917))
* Add support to keep non-replicable DataPipe in the main process ([950](https://github.com/pytorch/data/pull/950))
* Safeguard `DataLoader2Iterator`'s `__getattr__` method ([1004](https://github.com/pytorch/data/pull/1004))
* Forward worker exceptions and have `DataLoader2` exit with them ([1003](https://github.com/pytorch/data/pull/1003))
* Attach traceback to Exception and test dispatching process ([1036](https://github.com/pytorch/data/pull/1036))

DataPipe

In PyTorch Core

* Add auto-completion to DataPipes in REPLs (e.g. Jupyter notebook) ([86960](https://github.com/pytorch/pytorch/pull/86960))
* Add group support to `sharding_filter` ([88424](https://github.com/pytorch/pytorch/pull/88424))
* Add `keep_key` option to `Grouper` ([92532](https://github.com/pytorch/pytorch/pull/92532))

TorchData

* Add a masks option to filter files in S3 DataPipe ([880](https://github.com/pytorch/data/pull/880))
* Make HeaderIterDataPipe with `limit=None` a no-op ([908](https://github.com/pytorch/data/pull/908))
* Update `fsspec` DataPipe to be compatible with the latest version of `fsspec` ([957](https://github.com/pytorch/data/pull/957))
* Expand the possible input options for HuggingFace DataPipe ([952](https://github.com/pytorch/data/pull/952))
* Improve exception handling/skipping in online DataPipes ([968]([https://github.com/pytorch/data/pull/968](https://github.com/pytorch/data/pull/968)))
* Allow the option to place key in output in `MapKeyZipper` ([1042]([https://github.com/pytorch/data/pull/1042](https://github.com/pytorch/data/pull/1042)))
* Allow single key option for `Slicer` ([1041]([https://github.com/pytorch/data/pull/1041](https://github.com/pytorch/data/pull/1041)))

Releng

* Add pure Python platform-agnostic wheel ([988](https://github.com/pytorch/data/pull/988))

Bug Fixes

DataLoader2

In PyTorch Core

* Change serialization wrapper implementation to be an iterator ([87459](https://github.com/pytorch/pytorch/pull/87459))

DataPipe

In PyTorch Core

* Fix type checking to accept both Iter and Map DataPipe ([87285](https://github.com/pytorch/pytorch/pull/87285))
* Fix: Make ``__len__`` of datapipes dynamic ([88302](https://github.com/pytorch/pytorch/pull/88302))
* Properly cleanup unclosed files within generator function ([89973](https://github.com/pytorch/pytorch/pull/89973))
* Remove iterator depletion in `Zipper` ([89974](https://github.com/pytorch/pytorch/pull/89974))

TorchData

* Fix `to_graph` DataPipeGraph visualization function (872)
* Make lengths of DataPipe dynamic (873)
* Fix `max_token_bucketize` to accept incomparable data (883)
* Fix `S3FileLoader` local file clobbering (895)
* Fix `fsspec` DataPipe for paths starting with `az://` (849)
* Properly cleanup unclosed files within generator function (910)

Performance

DataLoader2

* Add minimal, reproducible AWS S3 benchmark ([847](https://github.com/pytorch/data/pull/847))

Docs

DataLoader2

* Add Distributed ReadingService `DataLoader2` training loop example ([863]([https://github.com/pytorch/data/pull/863](https://github.com/pytorch/data/pull/863)))
* Update README and documentation with latest changes ([954]([https://github.com/pytorch/data/pull/954](https://github.com/pytorch/data/pull/954)))
* Update Colab example with `DataLoader2` content ([979]([https://github.com/pytorch/data/pull/979](https://github.com/pytorch/data/pull/979)))
* Add initial `DataLoader2` Tutorial ([980](https://github.com/pytorch/data/pull/980))
* Add LAION-5B Example with `DataLoader2` ([1034]([https://github.com/pytorch/data/pull/1034](https://github.com/pytorch/data/pull/1034)))
* Add Round Robin Sharding documentation ([1050]([https://github.com/pytorch/data/pull/1050](https://github.com/pytorch/data/pull/1050)))

DataPipe

* Add `pin_memory` to documentation (1046)

Releng

* Fix links in README ([995]([https://github.com/pytorch/data/pull/995](https://github.com/pytorch/data/pull/995)))
* Fix links in contribution guide ([995](https://github.com/pytorch/data/pull/1053))

Devs

DataPipe

In PyTorch Core

* Add container template for _Fork and _Demux ([89216](https://github.com/pytorch/pytorch/pull/89216))
* Refactor sharding data pipe into a separate file ([94095](https://github.com/pytorch/pytorch/pull/94095))
* Fix interface generation in setup.py ([87081](https://github.com/pytorch/pytorch/pull/87081))

TorchData

* Add tests to validate iteration over combining DataPipe with infinite input (912)

Releng

* Update GHA version to utilize Node16 ([830](https://github.com/pytorch/data/pull/830))
* Enable usage of `sphinx` doctest ([850](https://github.com/pytorch/data/pull/850))
* Update submodule ([955]([https://github.com/pytorch/data/pull/955](https://github.com/pytorch/data/pull/955)))
* Make `portalocker` optional dependency ([1007](https://github.com/pytorch/data/pull/1007))

Future Plans

For `DataLoader2`, we are actively developing new features such as the checkpointing and the ability to execute part of the DataPipe graph on a single process before dispatching the outputs to worker processes. You may begin to see some of these features in nightly builds. We expect them to be part of the next release.

We welcome feedback and feature requests (let us know your use cases!). We always welcome potential contributors.

Beta Usage Note

This library is currently in the Beta stage and currently does not have a fully stable release. The API may change based on user feedback or performance. We are committed to bring this library to stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

0.5.1

This is a minor release to update PyTorch dependency from `1.13.0` to `1.13.1`. Please check the [release note](https://github.com/pytorch/data/releases/tag/v0.5.0) of TorchData `0.5.0` major release for more detail.

0.5.0

* Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. Detailed tutorial can be found [here](https://pytorch.org/data/0.5/tutorial.html#working-with-cloud-storage-providers)
* AWS S3 Benchmarking [result](https://github.com/pytorch/data/blob/main/benchmarks/cloud/aws_s3_results.md)
* Consolidated API for `DataLoader2` and provided a few `ReadingServices`, with detailed documentation now [available here](https://pytorch.org/data/0.5/dataloader2.html)
* Provided more comprehensive `DataPipe` operations, e.g., `random_split`, `repeat`, `set_length`, and `prefetch`.
* Provided pre-compiled torchdata binaries for arm64 Apple Silicon

Backwards Incompatible Change

DataPipe

Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (https://github.com/pytorch/pytorch/pull/83202)
`IterDataPipe` is used to to preserve data order

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">MapDataPipe.shuffle</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
</pre></sub></td>
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
</pre></sub></td>
</tr>
</table>
</p>

`on_disk_cache` now doesn’t accept generator functions for the argument of `filename_fn` (https://github.com/pytorch/data/pull/810)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">on_disk_cache</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
… yield from [url + f”/{i}” for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
</pre></sub></td>
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
… yield from [url + f”/{i}” for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
AssertionError
</pre></sub></td>
</tr>
</table>

DataLoader2

Imposed single iterator constraint on `DataLoader2` (https://github.com/pytorch/data/pull/700)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataLoader2 with a single iterator</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
</pre></sub></td>
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl) DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
Raises exception, since it1 is no longer valid
</pre></sub></td>
</tr>
</table>
</p>

Deep copy `DataPipe` during `DataLoader2` initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)
Previously, if a DataPipe is being passed to multiple DataLoaders, the DataPipe's state can be altered by any of those DataLoaders. In some cases, that may raise an exception due to the single iterator constraint; in other cases, some behaviors can be changed due to the adapters (e.g. shuffling) of another DataLoader.

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">Deep copy DataPipe during DataLoader2 constructor</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
… print(x, y)
0 0
1 1
2 2
3 3
4 4
</pre></sub></td>
</tr>
</table>
</p>

Deprecations

DataLoader2

Deprecated `traverse` function and `only_datapipe` argument (https://github.com/pytorch/pytorch/pull/85667)
Please use `traverse_dps` with the behavior the same as `only_datapipe=True`. (https://github.com/pytorch/data/pull/793)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataPipe traverse function</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
</pre></sub></td>
</tr>
</table>
</p>

New Features

DataPipe

* Added AIStore DataPipe (https://github.com/pytorch/data/pull/545, https://github.com/pytorch/data/pull/667)
* Added support for `IterDataPipe` to trace DataFrames operations (https://github.com/pytorch/pytorch/pull/71931,
* Added support for `DataFrameMakerIterDataPipe` to accept `dtype_generator` to solve unserializable `dtype` (https://github.com/pytorch/data/pull/537)
* Added graph snapshotting by counting number of successful yields for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79479, https://github.com/pytorch/pytorch/pull/79657)
* Implemented `drop` operation for `IterDataPipe` to drop column(s) (https://github.com/pytorch/data/pull/725)
* Implemented `FullSyncIterDataPipe` to synchronize distributed shards (https://github.com/pytorch/data/pull/713)
* Implemented `slice` and `flatten` operations for `IterDataPipe` (https://github.com/pytorch/data/pull/730)
* Implemented `repeat` operation for `IterDataPipe` (https://github.com/pytorch/data/pull/748)
* Added `LengthSetterIterDataPipe` (https://github.com/pytorch/data/pull/747)
* Added `RandomSplitter` (without buffer) (https://github.com/pytorch/data/pull/724)
* Added `padden_tokens` to `max_token_bucketize` to bucketize samples based on total padded token length (https://github.com/pytorch/data/pull/789)
* Implemented thread based `PrefetcherIterDataPipe` (https://github.com/pytorch/data/pull/770, https://github.com/pytorch/data/pull/818, https://github.com/pytorch/data/pull/826, https://github.com/pytorch/data/pull/842)

DataLoader2

* Added `CacheTimeout` `Adapter` to redefine cache timeout of the `DataPipe` graph (https://github.com/pytorch/data/pull/571)
* Added `DistribtuedReadingService` to support uneven data sharding (https://github.com/pytorch/data/pull/727)
* Added `PrototypeMultiProcessingReadingService`
* Added prefetching (https://github.com/pytorch/data/pull/826)
* Fixed process termination (https://github.com/pytorch/data/pull/837)
* Enabled deterministic training in distributed/non-distributed environment (https://github.com/pytorch/data/pull/827)
* Handled empty queue exception properly (https://github.com/pytorch/data/pull/785)

Releng

* Provided pre-compiled torchdata binaries for arm64 Apple Silicon (https://github.com/pytorch/data/pull/692)

Improvements

DataPipe

* Fixed error message coming from singler iterator constraint (https://github.com/pytorch/pytorch/pull/79547)
* Enabled profiler record context in `__next__` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79757)
* Raised warning for unpickable local function (547) (https://github.com/pytorch/pytorch/pull/80232, https://github.com/pytorch/data/pull/547)
* Cleaned up opened streams on the best effort basis (https://github.com/pytorch/data/pull/560, https://github.com/pytorch/pytorch/pull/78952)
* Used streaming reading mode for unseekable streams in `TarArchiveLoader` (https://github.com/pytorch/data/pull/653)
Improved GDrive 'content-disposition' error message (https://github.com/pytorch/data/pull/654)
* Added `as_tuple` argument for CSVParserIterDataPipe` to convert output from list to tuple (https://github.com/pytorch/data/pull/646)
* Raised Error when `HTTPReader` get 404 Response (160) (https://github.com/pytorch/data/pull/569)
* Added default no-op behavior for `flatmap` (https://github.com/pytorch/data/pull/749)
* Added support to validate `input_col` with the provided map function for `DataPipe` (https://github.com/pytorch/pytorch/pull/80267, https://github.com/pytorch/data/pull/755, https://github.com/pytorch/pytorch/pull/84279)
* Made `ShufflerIterDataPipe` support snapshotting ([83535](https://github.com/pytorch/pytorch/pull/83535))
* Unified implementations between `in_batch_shuffle` with `shuffle` for `IterDataPipe` (https://github.com/pytorch/data/pull/745)
* Made `IterDataPipe.to_map_datapipe` loading data lazily (https://github.com/pytorch/data/pull/765)
* Added `kwargs` to open files for `FSSpecFileLister` and `FSSpecSaver` (https://github.com/pytorch/data/pull/804)
* Added missing functional name for `FileLister` ([86497](https://github.com/pytorch/pytorch/pull/86497))

DataLoader

* Controlled shuffle option to all `DataPipes` with `set_shuffle` API https://github.com/pytorch/pytorch/pull/83741)
* Made distributed process group lazily initialized & share seed via the process group (https://github.com/pytorch/pytorch/pull/85279)

DataLoader2

* Improved graph traverse function
* Added support for unhashable `DataPipe` (https://github.com/pytorch/pytorch/pull/80509, https://github.com/pytorch/data/pull/559)
* Added support for all python collection objects (https://github.com/pytorch/pytorch/pull/84079, https://github.com/pytorch/data/pull/773)
* Ensured `finalize` and `finalize_iteration` are called during shutdown or exception (https://github.com/pytorch/data/pull/846)

Releng

* Enabled conda release to support GLIBC_2.27 (https://github.com/pytorch/data/pull/859)

Bug Fixes

DataPipe

* Fixed error for static typing (https://github.com/pytorch/data/pull/572, https://github.com/pytorch/data/pull/645, https://github.com/pytorch/data/pull/651, https://github.com/pytorch/pytorch/pull/81275, https://github.com/pytorch/data/pull/758)
* Fixed `fork` and `unzip` operations for the case of a single child (https://github.com/pytorch/pytorch/pull/81502)
* Corrected the type of exception that is being raised by `ShufflerMapDataPipe` (https://github.com/pytorch/pytorch/pull/82666)
* Fixed buffer overflow for `unzip` when `columns_to_skip` is specified (https://github.com/pytorch/data/pull/658)
* Fixed `TarArchiveLoader` to skip `open` for opened TarFile stream (https://github.com/pytorch/data/pull/679)
* Fixed mishandling of exception message in `IterDataPipe` (https://github.com/pytorch/pytorch/pull/84676)
* Fixed interface generation in setup.py ([87081](https://github.com/pytorch/pytorch/pull/87081))

Performance

DataLoader2

* Added benchmarking for `DataLoader2`
* Added AWS cloud configurations (https://github.com/pytorch/data/pull/680)
* Added benchmark from torchvision training references (https://github.com/pytorch/data/pull/714)

Documentation

DataPipe

* Added examples for data loading with `DataPipe`
* Read Criteo TSV and Parquet files and apply TorchArrow operations (https://github.com/pytorch/data/pull/561)
* Read caltech256 and coco with `AIStoreDataPipe` (https://github.com/pytorch/data/pull/582)
* Read from tigergraph database (https://github.com/pytorch/data/pull/783)
* Improved docstring for `DataPipe`
* `DataPipe` converters (https://github.com/pytorch/data/pull/710)
* `S3` DataPipe (https://github.com/pytorch/data/pull/784)
* `FileOpenerIterDataPipe` (https://github.com/pytorch/pytorch/pull/81407)
* `buffer_size` for `MaxTokenBucketizer` (https://github.com/pytorch/data/pull/834)
* `Prefetcher` (https://github.com/pytorch/data/pull/835)
* Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (https://github.com/pytorch/data/pull/812, https://github.com/pytorch/data/pull/836)
* Improved tutorial
* Fixed tutorial for newline on Windows in `generate_csv` (https://github.com/pytorch/data/pull/675)
* Improved note on shuffling behavior (https://github.com/pytorch/data/pull/688)
* Fixed tutorial about shuffing before sharding (https://github.com/pytorch/data/pull/715)
* Added `random_split` example (https://github.com/pytorch/data/pull/843)
* Simplified long type names for online doc (https://github.com/pytorch/data/pull/838)

DataLoader2

* Improved docstring for `DataLoader2` (https://github.com/pytorch/data/pull/581, https://github.com/pytorch/data/pull/817)
* Added training examples using `DataLoader2`, `ReadingService` and `DataPipe` (https://github.com/pytorch/data/pull/563, https://github.com/pytorch/data/pull/664, https://github.com/pytorch/data/pull/670, https://github.com/pytorch/data/pull/787)

Releng

* Added contribution guide for third-party library (https://github.com/pytorch/data/pull/663)

Future Plans

We will continue benchmarking over datasets on local disk and cloud storage using TorchData. And, we will continue making `DataLoader2` and related `ReadingService` more stable and provide more features like snapshotting the data pipeline and restoring it from the serialized state. Stay tuned and welcome any feedback.

Beta Usage Note

This library is currently in the Beta stage and currently does not have a stable release. The API may change based on user feedback or performance. We are committed to bring this library to stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback.

0.4.1

Bug fixes
- Fixed `DataPipe` working with `DataLoader` in the distributed environment (https://github.com/pytorch/pytorch/pull/80348, https://github.com/pytorch/pytorch/pull/81071, https://github.com/pytorch/pytorch/pull/81071)

Documentation
- Updated TorchData tutorial (675, 688, 715)

Releng
- Provided pre-compiled `torchdata` binaries for arm64 Apple Silicon (692)
- Python [3.8~3.10]

Page 1 of 2

Releases

Has known vulnerabilities

Torchdata

Page 1 of 2

0.7.0

0.6.1

0.6.0

0.5.1

0.5.0

0.4.1

Page 1 of 2

Links

Releases