Torchdata

Latest version: v0.10.0


0.5.1

This is a minor release to update PyTorch dependency from `1.13.0` to `1.13.1`. Please check the [release note](https://github.com/pytorch/data/releases/tag/v0.5.0) of TorchData `0.5.0` major release for more detail.

0.5.0

* Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. A detailed tutorial can be found [here](https://pytorch.org/data/0.5/tutorial.html#working-with-cloud-storage-providers)
  * AWS S3 benchmarking [results](https://github.com/pytorch/data/blob/main/benchmarks/cloud/aws_s3_results.md)
* Consolidated API for `DataLoader2` and provided a few `ReadingServices`, with detailed documentation now [available here](https://pytorch.org/data/0.5/dataloader2.html)
* Provided more comprehensive `DataPipe` operations, e.g., `random_split`, `repeat`, `set_length`, and `prefetch`.
* Provided pre-compiled torchdata binaries for arm64 Apple Silicon

Backwards Incompatible Change

DataPipe

Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (https://github.com/pytorch/pytorch/pull/83202)
`IterDataPipe` is used to preserve data order.

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">MapDataPipe.shuffle</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
True
>>> isinstance(dp, IterDataPipe)
False
</pre></sub></td>
<td><sub><pre lang="python">
>>> from torch.utils.data import IterDataPipe, MapDataPipe
>>> from torch.utils.data.datapipes.map import SequenceWrapper
>>> dp = SequenceWrapper(list(range(10))).shuffle()
>>> isinstance(dp, MapDataPipe)
False
>>> isinstance(dp, IterDataPipe)
True
</pre></sub></td>
</tr>
</table>
</p>

`on_disk_cache` no longer accepts generator functions for the `filepath_fn` argument (https://github.com/pytorch/data/pull/810)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">on_disk_cache</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
</pre></sub></td>
<td><sub><pre lang="python">
>>> url_dp = IterableWrapper(["https://path/to/filename", ])
>>> def filepath_gen_fn(url):
...     yield from [url + f"/{i}" for i in range(3)]
>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)
AssertionError
</pre></sub></td>
</tr>
</table>
</p>
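
One way to tell ahead of time whether a `filepath_fn` would be rejected is the standard library's `inspect.isgeneratorfunction`. The sketch below is only an illustration: it assumes the new check behaves like that call (the notes above only show an `AssertionError`), and `filepath_list_fn` is a hypothetical rewrite that returns a list instead of yielding. Whether `on_disk_cache` accepts a list return value should be confirmed against the TorchData documentation.

```python
import inspect


def filepath_gen_fn(url):
    # Generator function from the example above: it uses `yield`, so calling
    # it returns a lazy generator rather than concrete paths.
    yield from [url + f"/{i}" for i in range(3)]


def filepath_list_fn(url):
    # Hypothetical rewrite: builds and returns the paths eagerly.
    return [url + f"/{i}" for i in range(3)]


# Only the first function is a generator function.
print(inspect.isgeneratorfunction(filepath_gen_fn))   # True
print(inspect.isgeneratorfunction(filepath_list_fn))  # False
```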

DataLoader2

Imposed single iterator constraint on `DataLoader2` (https://github.com/pytorch/data/pull/700)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataLoader2 with a single iterator</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # No reset here
>>> print(next(it2))
1
>>> print(next(it1))
2
</pre></sub></td>
<td><sub><pre lang="python">
>>> dl = DataLoader2(IterableWrapper(range(10)))
>>> it1 = iter(dl)
>>> print(next(it1))
0
>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator
>>> print(next(it2))
0
>>> print(next(it1))
# Raises exception, since it1 is no longer valid
</pre></sub></td>
</tr>
</table>
</p>

Deep copy `DataPipe` during `DataLoader2` initialization or restoration (https://github.com/pytorch/data/pull/786, https://github.com/pytorch/data/pull/833)
Previously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of them. In some cases this raised an exception because of the single iterator constraint; in others, behavior changed because of the adapters (e.g. shuffling) of another DataLoader.

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">Deep copy DataPipe during DataLoader2 constructor</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe...
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([0, 1, 2, 3, 4])
>>> dl1 = DataLoader2(dp)
>>> dl2 = DataLoader2(dp)
>>> for x, y in zip(dl1, dl2):
...     print(x, y)
0 0
1 1
2 2
3 3
4 4
</pre></sub></td>
</tr>
</table>
</p>
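
The effect of the deep copy can be illustrated without TorchData. In the sketch below, `CountingSource` and `TinyLoader` are hypothetical stand-ins, not library classes. Each loader copies the pipeline with `copy.deepcopy` at construction, so iterating one loader does not disturb state held by the other loader or by the original object.

```python
import copy


class CountingSource:
    """Hypothetical stateful pipeline: remembers how many items it has yielded."""

    def __init__(self, data):
        self.data = list(data)
        self.num_yielded = 0

    def __iter__(self):
        self.num_yielded = 0  # state is reset when iteration starts
        for item in self.data:
            self.num_yielded += 1
            yield item


class TinyLoader:
    """Hypothetical loader that owns a private deep copy of its pipeline."""

    def __init__(self, pipeline):
        self.pipeline = copy.deepcopy(pipeline)

    def __iter__(self):
        return iter(self.pipeline)


source = CountingSource([0, 1, 2, 3, 4])
loader1 = TinyLoader(source)
loader2 = TinyLoader(source)

pairs = list(zip(loader1, loader2))
print(pairs)               # [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
print(source.num_yielded)  # 0, because only the copies were iterated
```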

Deprecations

DataLoader2

Deprecated `traverse` function and `only_datapipe` argument (https://github.com/pytorch/pytorch/pull/85667)
Please use `traverse_dps` instead; it behaves the same as `traverse` with `only_datapipe=True`. (https://github.com/pytorch/data/pull/793)

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">DataPipe traverse function</th>
</tr>
</thead>
<tr><th>0.4.1</th><th>0.5.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp_graph = torch.utils.data.graph.traverse(datapipe, only_datapipe=False)
FutureWarning: `traverse` function and only_datapipe argument will be removed after 1.13.
</pre></sub></td>
</tr>
</table>
</p>

New Features

DataPipe

* Added AIStore DataPipe (https://github.com/pytorch/data/pull/545, https://github.com/pytorch/data/pull/667)
* Added support for `IterDataPipe` to trace DataFrames operations (https://github.com/pytorch/pytorch/pull/71931,
* Added support for `DataFrameMakerIterDataPipe` to accept `dtype_generator` to solve unserializable `dtype` (https://github.com/pytorch/data/pull/537)
* Added graph snapshotting by counting number of successful yields for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79479, https://github.com/pytorch/pytorch/pull/79657)
* Implemented `drop` operation for `IterDataPipe` to drop column(s) (https://github.com/pytorch/data/pull/725)
* Implemented `FullSyncIterDataPipe` to synchronize distributed shards (https://github.com/pytorch/data/pull/713)
* Implemented `slice` and `flatten` operations for `IterDataPipe` (https://github.com/pytorch/data/pull/730)
* Implemented `repeat` operation for `IterDataPipe` (https://github.com/pytorch/data/pull/748)
* Added `LengthSetterIterDataPipe` (https://github.com/pytorch/data/pull/747)
* Added `RandomSplitter` (without buffer) (https://github.com/pytorch/data/pull/724)
* Added `padden_tokens` to `max_token_bucketize` to bucketize samples based on total padded token length (https://github.com/pytorch/data/pull/789)
* Implemented thread based `PrefetcherIterDataPipe` (https://github.com/pytorch/data/pull/770, https://github.com/pytorch/data/pull/818, https://github.com/pytorch/data/pull/826, https://github.com/pytorch/data/pull/842)
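
The idea behind a thread-based prefetcher can be sketched in plain Python. This is not the `PrefetcherIterDataPipe` implementation; `prefetch` and `_END` are hypothetical names, and the sketch assumes a simple design in which one background thread fills a bounded queue while the consumer drains it.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the source


def prefetch(source, buffer_size=10):
    """Yield items from `source` while a background thread reads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for item in source:
                buffer.put((item, None))  # blocks while the buffer is full
            buffer.put((_END, None))
        except Exception as exc:
            # Forward errors from the source to the consuming thread.
            buffer.put((_END, exc))

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating before the source is exhausted.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item, error = buffer.get()
        if item is _END:
            if error is not None:
                raise error
            return
        yield item


print(list(prefetch(range(5), buffer_size=2)))  # [0, 1, 2, 3, 4]
```

A bounded queue keeps memory use predictable: the producer stays at most `buffer_size` items ahead of the consumer.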

DataLoader2

* Added `CacheTimeout` `Adapter` to redefine cache timeout of the `DataPipe` graph (https://github.com/pytorch/data/pull/571)
* Added `DistributedReadingService` to support uneven data sharding (https://github.com/pytorch/data/pull/727)
* Added `PrototypeMultiProcessingReadingService`
  * Added prefetching (https://github.com/pytorch/data/pull/826)
  * Fixed process termination (https://github.com/pytorch/data/pull/837)
  * Enabled deterministic training in distributed/non-distributed environment (https://github.com/pytorch/data/pull/827)
  * Handled empty queue exception properly (https://github.com/pytorch/data/pull/785)

Releng

* Provided pre-compiled torchdata binaries for arm64 Apple Silicon (https://github.com/pytorch/data/pull/692)

Improvements

DataPipe

* Fixed error message coming from single iterator constraint (https://github.com/pytorch/pytorch/pull/79547)
* Enabled profiler record context in `__next__` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/79757)
* Raised warning for unpicklable local function (547) (https://github.com/pytorch/pytorch/pull/80232, https://github.com/pytorch/data/pull/547)
* Cleaned up opened streams on a best-effort basis (https://github.com/pytorch/data/pull/560, https://github.com/pytorch/pytorch/pull/78952)
* Used streaming reading mode for unseekable streams in `TarArchiveLoader` (https://github.com/pytorch/data/pull/653)
* Improved GDrive 'content-disposition' error message (https://github.com/pytorch/data/pull/654)
* Added `as_tuple` argument for `CSVParserIterDataPipe` to convert output from list to tuple (https://github.com/pytorch/data/pull/646)
* Raised Error when `HTTPReader` gets a 404 response (160) (https://github.com/pytorch/data/pull/569)
* Added default no-op behavior for `flatmap` (https://github.com/pytorch/data/pull/749)
* Added support to validate `input_col` with the provided map function for `DataPipe` (https://github.com/pytorch/pytorch/pull/80267, https://github.com/pytorch/data/pull/755, https://github.com/pytorch/pytorch/pull/84279)
* Made `ShufflerIterDataPipe` support snapshotting ([83535](https://github.com/pytorch/pytorch/pull/83535))
* Unified implementations between `in_batch_shuffle` with `shuffle` for `IterDataPipe` (https://github.com/pytorch/data/pull/745)
* Made `IterDataPipe.to_map_datapipe` load data lazily (https://github.com/pytorch/data/pull/765)
* Added `kwargs` to open files for `FSSpecFileLister` and `FSSpecSaver` (https://github.com/pytorch/data/pull/804)
* Added missing functional name for `FileLister` ([86497](https://github.com/pytorch/pytorch/pull/86497))

DataLoader

* Controlled shuffle option to all `DataPipes` with `set_shuffle` API (https://github.com/pytorch/pytorch/pull/83741)
* Made distributed process group lazily initialized & share seed via the process group (https://github.com/pytorch/pytorch/pull/85279)

DataLoader2

* Improved graph traverse function
  * Added support for unhashable `DataPipe` (https://github.com/pytorch/pytorch/pull/80509, https://github.com/pytorch/data/pull/559)
  * Added support for all python collection objects (https://github.com/pytorch/pytorch/pull/84079, https://github.com/pytorch/data/pull/773)
* Ensured `finalize` and `finalize_iteration` are called during shutdown or exception (https://github.com/pytorch/data/pull/846)

Releng

* Enabled conda release to support GLIBC_2.27 (https://github.com/pytorch/data/pull/859)

Bug Fixes

DataPipe

* Fixed error for static typing (https://github.com/pytorch/data/pull/572, https://github.com/pytorch/data/pull/645, https://github.com/pytorch/data/pull/651, https://github.com/pytorch/pytorch/pull/81275, https://github.com/pytorch/data/pull/758)
* Fixed `fork` and `unzip` operations for the case of a single child (https://github.com/pytorch/pytorch/pull/81502)
* Corrected the type of exception that is being raised by `ShufflerMapDataPipe` (https://github.com/pytorch/pytorch/pull/82666)
* Fixed buffer overflow for `unzip` when `columns_to_skip` is specified (https://github.com/pytorch/data/pull/658)
* Fixed `TarArchiveLoader` to skip `open` for opened TarFile stream (https://github.com/pytorch/data/pull/679)
* Fixed mishandling of exception message in `IterDataPipe` (https://github.com/pytorch/pytorch/pull/84676)
* Fixed interface generation in setup.py ([87081](https://github.com/pytorch/pytorch/pull/87081))

Performance

DataLoader2

* Added benchmarking for `DataLoader2`
  * Added AWS cloud configurations (https://github.com/pytorch/data/pull/680)
  * Added benchmark from torchvision training references (https://github.com/pytorch/data/pull/714)

Documentation

DataPipe

* Added examples for data loading with `DataPipe`
  * Read Criteo TSV and Parquet files and apply TorchArrow operations (https://github.com/pytorch/data/pull/561)
  * Read caltech256 and coco with `AIStoreDataPipe` (https://github.com/pytorch/data/pull/582)
  * Read from tigergraph database (https://github.com/pytorch/data/pull/783)
* Improved docstring for `DataPipe`
  * `DataPipe` converters (https://github.com/pytorch/data/pull/710)
  * `S3` DataPipe (https://github.com/pytorch/data/pull/784)
  * `FileOpenerIterDataPipe` (https://github.com/pytorch/pytorch/pull/81407)
  * `buffer_size` for `MaxTokenBucketizer` (https://github.com/pytorch/data/pull/834)
  * `Prefetcher` (https://github.com/pytorch/data/pull/835)
* Added tutorial to load from Cloud Storage Provider including AWS S3, Google Cloud Platform and Azure Blob Storage (https://github.com/pytorch/data/pull/812, https://github.com/pytorch/data/pull/836)
* Improved tutorial
  * Fixed tutorial for newline on Windows in `generate_csv` (https://github.com/pytorch/data/pull/675)
  * Improved note on shuffling behavior (https://github.com/pytorch/data/pull/688)
  * Fixed tutorial about shuffling before sharding (https://github.com/pytorch/data/pull/715)
  * Added `random_split` example (https://github.com/pytorch/data/pull/843)
* Simplified long type names for online doc (https://github.com/pytorch/data/pull/838)

DataLoader2

* Improved docstring for `DataLoader2` (https://github.com/pytorch/data/pull/581, https://github.com/pytorch/data/pull/817)
* Added training examples using `DataLoader2`, `ReadingService` and `DataPipe` (https://github.com/pytorch/data/pull/563, https://github.com/pytorch/data/pull/664, https://github.com/pytorch/data/pull/670, https://github.com/pytorch/data/pull/787)

Releng

* Added contribution guide for third-party library (https://github.com/pytorch/data/pull/663)

Future Plans

We will continue benchmarking over datasets on local disk and cloud storage using TorchData. We will also continue making `DataLoader2` and the related `ReadingService` more stable, and provide more features such as snapshotting the data pipeline and restoring it from the serialized state. Stay tuned; we welcome any feedback.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.

0.4.1

Bug fixes
- Fixed `DataPipe` working with `DataLoader` in the distributed environment (https://github.com/pytorch/pytorch/pull/80348, https://github.com/pytorch/pytorch/pull/81071)

Documentation
- Updated TorchData tutorial (675, 688, 715)

Releng
- Provided pre-compiled `torchdata` binaries for arm64 Apple Silicon (692)
- Python [3.8~3.10]

0.4.0

* DataPipe graph is now backward compatible with `DataLoader` regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. Please check the tutorial [here](https://pytorch.org/data/0.4.0/tutorial.html#working-with-dataloader).
* [`AWSSDK`](https://github.com/aws/aws-sdk-cpp) is integrated to support listing/loading files from AWS S3.
* Added support for reading from `TFRecord` and Hugging Face Hub.
* `DataLoader2` became available in prototype mode. For more details, please check our [future plans](Future-Plans).

Backwards Incompatible Change

DataPipe

Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` whenever the shortest one is exhausted (https://github.com/pytorch/pytorch/pull/77145)
Please use `MultiplexerLongest` (functional API `mux_longest`) to achieve the previous functionality.

<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22, 13, 23, 14, 24]
>>> len(output_dp)
13
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp1 = IterableWrapper(range(3))
>>> dp2 = IterableWrapper(range(10, 15))
>>> dp3 = IterableWrapper(range(20, 25))
>>> output_dp = dp1.mux(dp2, dp3)
>>> list(output_dp)
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> len(output_dp)
9
</pre></sub></td>
</tr>
</table>
</p>
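
The difference between the two multiplexing behaviors in the table above can be sketched with the standard library alone. The functions `mux_shortest` and `mux_longest` below are hypothetical plain-Python stand-ins, not the `Multiplexer` or `MultiplexerLongest` implementations; they only mimic the round-robin order on the same inputs.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for inputs that have run out


def mux_shortest(*iterables):
    """Interleave inputs round-robin; stop at the first exhausted input."""
    for round_items in zip(*iterables):
        yield from round_items


def mux_longest(*iterables):
    """Interleave inputs round-robin; skip exhausted inputs until all are done."""
    for round_items in zip_longest(*iterables, fillvalue=_MISSING):
        for item in round_items:
            if item is not _MISSING:
                yield item


inputs = (range(3), range(10, 15), range(20, 25))
print(list(mux_shortest(*inputs)))      # [0, 10, 20, 1, 11, 21, 2, 12, 22]
print(len(list(mux_longest(*inputs))))  # 13
```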

Enforced a single valid iterator for `IterDataPipes` with or without multiple outputs (https://github.com/pytorch/pytorch/pull/70479, https://github.com/pytorch/pytorch/pull/75995)
If you need to reference the same `IterDataPipe` multiple times, please apply `.fork()` on the `IterDataPipe` instance.

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">IterDataPipe with a single output</th>
</tr>
</thead>
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)
0
>>> next(it1)
1
>>> # Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
[(0, 0), ..., (9, 9)]
</pre></sub></td>
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> it1 = iter(source_dp)
>>> list(it1)
[0, 1, ..., 9]
>>> it1 = iter(source_dp)  # This doesn't raise any warning or error
>>> next(it1)
0
>>> it2 = iter(source_dp)
>>> next(it2)  # Invalidates `it1`
0
>>> next(it1)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> # Multiple references of DataPipe
>>> source_dp = IterableWrapper(range(10))
>>> zip_dp = source_dp.zip(source_dp)
>>> list(zip_dp)
RuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))
This may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
</pre></sub></td>
</tr>
</table>
</p>
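
One way such a constraint could be enforced is with a counter that records which iterator is currently valid. The sketch below is a plain-Python illustration under that assumption; `SingleIteratorSource` and its attributes are hypothetical and do not reproduce the actual `IterDataPipe` mechanism or its error message.

```python
class SingleIteratorSource:
    """Hypothetical iterable that keeps only its newest iterator valid."""

    def __init__(self, data):
        self.data = list(data)
        self.valid_iterator_id = 0

    def __iter__(self):
        # Creating a new iterator bumps the ID, invalidating older iterators.
        self.valid_iterator_id += 1
        return self._iterate(self.valid_iterator_id)

    def _iterate(self, iterator_id):
        for item in self.data:
            # Each step checks that this iterator is still the newest one.
            if iterator_id != self.valid_iterator_id:
                raise RuntimeError("This iterator has been invalidated.")
            yield item


source = SingleIteratorSource(range(10))
it1 = iter(source)
print(next(it1))  # 0
it2 = iter(source)  # invalidates it1
print(next(it2))  # 0
try:
    next(it1)
except RuntimeError as err:
    print(err)  # This iterator has been invalidated.
```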

<p align="center">
<table align="center">
<thead>
<tr>
<th colspan="2">IterDataPipe with multiple outputs</th>
</tr>
</thead>
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)
>>> # Basically shares the same reference as `it1`
>>> # and doesn't reset because `cdp1` hasn't been read since reset
>>> next(it1)
0
>>> next(it2)
0
>>> next(it3)
1
>>> # The next line resets all ChildDataPipes
>>> # because `cdp2` has started reading
>>> it4 = iter(cdp2)
>>> next(it3)
0
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
</pre></sub></td>
<td><sub><pre lang="python">
>>> source_dp = IterableWrapper(range(10))
>>> cdp1, cdp2 = source_dp.fork(num_instances=2)
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> list(it1)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(it2)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> it1, it2 = iter(cdp1), iter(cdp2)
>>> it3 = iter(cdp1)  # This invalidates `it1` and `it2`
>>> next(it1)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it2)
RuntimeError: This iterator has been invalidated, because a new iterator has been created from one of the ChildDataPipes of _ForkerIterDataPipe(buffer_size=1000, num_instances=2).
For feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https://github.com/pytorch/data/issues/45.
>>> next(it3)
0
>>> # The next line should not invalidate anything, as there was no new iterator created
>>> # for `cdp2` after `it2` was invalidated
>>> it4 = iter(cdp2)
>>> next(it3)
1
>>> list(it4)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
</pre></sub></td>
</tr>
</table>
</p>

Deprecations

DataPipe

Deprecated functional APIs of `open_file_by_fsspec` and `open_file_by_iopath` for `IterDataPipe` (https://github.com/pytorch/pytorch/pull/78970, https://github.com/pytorch/pytorch/pull/79302)
Please use `open_files_by_fsspec` and `open_files_by_iopath`

<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()  # No Warning
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()  # No Warning
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_fsspec()
FutureWarning: `FSSpecFileOpener()`'s functional API `.open_file_by_fsspec()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_fsspec()` instead.
>>> dp = IterableWrapper([file_path, ])
>>> dp = dp.open_file_by_iopath()
FutureWarning: `IoPathFileOpener()`'s functional API `.open_file_by_iopath()` is deprecated since 0.4.0 and will be removed in 0.6.0.
See https://github.com/pytorch/data/issues/163 for details.
Please use `.open_files_by_iopath()` instead.
</pre></sub></td>
</tr>
</table>
</p>

The `drop_empty_batches` argument of `Filter` (functional API `filter`) is deprecated and will be removed in a future release (https://github.com/pytorch/pytorch/pull/76060)

<p align="center">
<table align="center">
<tr><th>0.3.0</th><th>0.4.0</th></tr>
<tr valign="top">
<td><sub><pre lang="python">
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
</pre></sub></td>
<td><sub><pre lang="python">
>>> dp = IterableWrapper([(1, 1), (2, 2), (3, 3)])
>>> dp = dp.filter(lambda x: x[0] > 1, drop_empty_batches=True)
FutureWarning: The argument `drop_empty_batches` of `FilterIterDataPipe()` is deprecated since 1.12 and will be removed in 1.14.
See https://github.com/pytorch/data/issues/163 for details.
</pre></sub></td>
</tr>
</table>
</p>

New Features

DataPipe

* Added utility to visualize `DataPipe` graphs (https://github.com/pytorch/data/pull/330)

IterDataPipe

* Added `Bz2FileLoader` with functional API of `load_from_bz2` (https://github.com/pytorch/data/pull/312)
* Added `BatchMapper` (functional API: `map_batches`) and `FlatMapper` (functional API: `flatmap`) (https://github.com/pytorch/data/pull/359)
* Added support for WebDataset-style archives (https://github.com/pytorch/data/pull/367)
* Added `MultiplexerLongest` with functional API of `mux_longest` (https://github.com/pytorch/data/pull/372)
* Added `ZipperLongest` with functional API of `zip_longest` (https://github.com/pytorch/data/pull/373)
* Added `MaxTokenBucketizer` with functional API of `max_token_bucketize` (https://github.com/pytorch/data/pull/283)
* Added `S3FileLister` (functional API: `list_files_by_s3`) and `S3FileLoader` (functional API: `load_files_by_s3`) integrated with the native AWSSDK (https://github.com/pytorch/data/pull/165)
* Added `HuggingFaceHubReader` (https://github.com/pytorch/data/pull/490)
* Added `TFRecordLoader` with functional API of `load_from_tfrecord` (https://github.com/pytorch/data/pull/308)

MapDataPipe

* Added `UnZipper` with functional API of `unzip` (https://github.com/pytorch/data/pull/325)
* Added `MapToIterConverter` with functional API of `to_iter_datapipe` (https://github.com/pytorch/data/pull/327)
* Added `InMemoryCacheHolder` with functional API of `in_memory_cache` (https://github.com/pytorch/data/pull/328)

Releng

* Added nightly releases for TorchData. Users should be able to install nightly TorchData via
  * `pip install --pre torchdata -f https://download.pytorch.org/whl/nightly/cpu`
  * `conda install -c pytorch-nightly torchdata`
* Added support of AWSSDK enabled `DataPipes`. See: [README](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/README.md)
  * AWSSDK was pre-compiled and assembled in TorchData for both nightly and 0.4.0 releases

Improvements

DataPipe

* Added optional `encoding` argument to `FileOpener` (https://github.com/pytorch/pytorch/pull/72715)
* Renamed `BucketBatcher` argument to avoid name collision (https://github.com/pytorch/data/pull/304)
* Removed default parameter of `ShufflerIterDataPipe` (https://github.com/pytorch/pytorch/pull/74370)
* Made the profiler wrapper delegate function calls to the `DataPipe` iterator (https://github.com/pytorch/pytorch/pull/75275)
* Added `input_col` argument to `flatmap` for applying `fn` to the specific column(s) (https://github.com/pytorch/data/pull/363)
* Improved debug message when exceptions are raised within `IterDataPipe` (https://github.com/pytorch/pytorch/pull/75618)
* Improved debug message when argument is a tuple/list of `DataPipes` (https://github.com/pytorch/pytorch/pull/76134)
* Added functional APIs to `StreamReader` (functional API: `read_from_stream`) and `FileOpener` (functional API: `open_files`) (https://github.com/pytorch/pytorch/pull/76233)
* Enabled graph traversal for `MapDataPipe` (https://github.com/pytorch/pytorch/pull/74851)
* Added `input_col` argument to `filter` for applying `filter_fn` to the specific column(s) (https://github.com/pytorch/pytorch/pull/76060)
* Added functional APIs for `OnlineReaders` (https://github.com/pytorch/data/pull/369)
  * `HTTPReaderIterDataPipe`: `read_from_http`
  * `GDriveReaderDataPipe`: `read_from_gdrive`
  * `OnlineReaderIterDataPipe`: `read_from_remote`
* Cleared buffer for `DataPipe` during `__del__` (https://github.com/pytorch/pytorch/pull/76345)
* Overrode wrong python https proxy on Windows (https://github.com/pytorch/data/pull/371)
* Exposed functional API of `to_map_datapipe` from `IterDataPipe`'s pyi interface (https://github.com/pytorch/data/pull/326)
* Moved buffer for `IterDataPipe` from iterator to instance (self) (https://github.com/pytorch/data/pull/388)
* Improved `DataPipe` serialization:
  * Enabled serialization of `ForkerIterDataPipe` (https://github.com/pytorch/pytorch/pull/73118)
  * Fixed issue with `DataPipe` serialization with dill (https://github.com/pytorch/pytorch/pull/72896)
  * Applied special serialization when dill is installed (https://github.com/pytorch/pytorch/pull/74958)
  * Applied dill serialization for `demux` and added cache to graph traverse (https://github.com/pytorch/pytorch/pull/75034)
  * Revamped serialization logic of `DataPipes` (https://github.com/pytorch/pytorch/pull/74984)
* Prevented automatic reset after state is restored (https://github.com/pytorch/pytorch/pull/77774)
* Moved `IterDataPipe` buffers from `__iter__` to instance (self) ([76999](https://github.com/pytorch/pytorch/pull/76999))
* Refactored buffer of `Multiplexer` from `__iter__` to instance (self) (https://github.com/pytorch/pytorch/pull/77775)
* Made `GDriveReader` handle the Virus Scan Warning (https://github.com/pytorch/data/pull/442)
* Added `**kwargs` arguments to `HttpReader` to specify extra parameters for HTTP requests (https://github.com/pytorch/data/pull/392)
* Updated `FSSpecFileLister` and `IoPathFileLister` to support multiple root paths and updated `FSSpecFileLister` to support S3 urls (https://github.com/pytorch/data/pull/383)
* Fixed race condition issue with writing files in multiprocessing
  * Added `filelock` to `IoPathSaver` to prevent race conditions (https://github.com/pytorch/data/pull/413)
  * Added lock mechanism to prevent `on_disk_cache` downloading twice (https://github.com/pytorch/data/pull/409)
  * Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
* Added a 's' to the functional names of open/list `DataPipes` (https://github.com/pytorch/data/pull/479)
* Added `list_file` functional API to `FSSpecFileLister` and `IoPathFileLister` (https://github.com/pytorch/data/pull/463)
* Added `list_files` functional API to `FileLister` (https://github.com/pytorch/pytorch/pull/78419)
* Improved FSSpec `DataPipes` to accept extra keyword arguments (https://github.com/pytorch/data/pull/495)
* Passed `kwargs` through to the `json.loads` call in `JsonParser` (https://github.com/pytorch/data/pull/518)

DataLoader

* Added ability to use `dill` to pass `DataPipes` in multiprocessing (https://github.com/pytorch/pytorch/pull/77288)
* `DataLoader` now automatically applies sharding to the `DataPipe` graph in single-process, multi-process and distributed environments (https://github.com/pytorch/pytorch/pull/78762, https://github.com/pytorch/pytorch/pull/78950, https://github.com/pytorch/pytorch/pull/79041, https://github.com/pytorch/pytorch/pull/79124, https://github.com/pytorch/pytorch/pull/79524)
* Made `ShufflerDataPipe` deterministic with `DataLoader` in single-process, multi-process and distributed environments (https://github.com/pytorch/pytorch/pull/77741, https://github.com/pytorch/pytorch/pull/77855, https://github.com/pytorch/pytorch/pull/78765, https://github.com/pytorch/pytorch/pull/79829)
* Prevented overriding shuffle settings in `DataLoader` for `DataPipe` (https://github.com/pytorch/pytorch/pull/75505)
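
Why deterministic shuffling matters for sharding can be shown with a small plain-Python sketch. It assumes round-robin sharding (each worker keeps every n-th item) and a seed shared by all workers; `shuffled_copy` and `shard` are hypothetical helpers, not TorchData or PyTorch APIs.

```python
import random
from itertools import islice


def shuffled_copy(data, seed):
    """Return a shuffled list; the same seed always gives the same order."""
    items = list(data)
    random.Random(seed).shuffle(items)
    return items


def shard(iterable, num_shards, shard_id):
    """Keep every `num_shards`-th item, starting at offset `shard_id`."""
    return islice(iterable, shard_id, None, num_shards)


num_workers = 3
# Every worker shuffles with the same seed, then keeps only its own slice.
shards = [
    list(shard(shuffled_copy(range(10), seed=0), num_workers, worker_id))
    for worker_id in range(num_workers)
]

# Together the shards cover each item exactly once.
print(sorted(item for s in shards for item in s))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

If each worker shuffled with a different seed, the slices would come from different orderings, so some items could be seen twice and others not at all.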

Releng

* Made `requirements.txt` the single source of truth for TorchData version (https://github.com/pytorch/data/pull/414)
* Prohibited Release GHA workflows from running on forked branches. (https://github.com/pytorch/data/pull/361)

Performance

DataPipe

* Lazily generated exception message for performance (https://github.com/pytorch/pytorch/pull/78673)
  * Fixes regression introduced from single iterator constraint related PRs.
* Disabled profiler for `IterDataPipe` by default (https://github.com/pytorch/pytorch/pull/78674)
  * By skipping over the record function when the profiler is not enabled, the speedup is up to [5-6x](https://github.com/pytorch/pytorch/pull/78674#issuecomment-1146233729) for `DataPipes` when their internal operations are very simple (e.g. `IterableWrapper`)

Documentation

DataPipe

* Fixed typo in TorchVision example (https://github.com/pytorch/data/pull/311)
* Updated `DataPipe` naming guidelines (https://github.com/pytorch/data/pull/428)
* Updated documents from `DataSet` to PyTorch `Dataset` (https://github.com/pytorch/data/pull/292)
* Added examples for graphs, meshes and point clouds using `DataPipe` (https://github.com/pytorch/data/pull/337)
* Added examples for semantic segmentation and time series using `DataPipe` (https://github.com/pytorch/data/pull/340)
* Expanded the contribution guide, especially including instructions to add a new `DataPipe` (https://github.com/pytorch/data/pull/354)
* Updated tutorial about placing `sharding_filter` (https://github.com/pytorch/data/pull/487)
* Improved graph visualization documentation (https://github.com/pytorch/data/pull/504)
* Added instructions about ImportError for portalocker (https://github.com/pytorch/data/pull/506)
* Updated examples to avoid lambdas (https://github.com/pytorch/data/pull/524)
* Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)
* Updated links for tutorial (https://github.com/pytorch/data/pull/543)

IterDataPipe

* Fixed documentation for `IterToMapConverter`, `S3FileLister` and `S3FileLoader` (https://github.com/pytorch/data/pull/381)
* Updated documentation for S3 DataPipes (https://github.com/pytorch/data/pull/534)

MapDataPipe

* Updated contributing guide and added guidance for `MapDataPipe` (https://github.com/pytorch/data/pull/379)
    * Rather than re-implementing the same functionalities twice for both `IterDataPipe` and `MapDataPipe`, we encourage users to use the built-in functionalities of `IterDataPipe` and use the converter to `MapDataPipe` as needed.
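
The recommendation above (process with iterable-style stages, convert to map-style only when random access is needed) can be sketched in plain Python. This is an illustration only: `parse_lines` and `to_map` are hypothetical helpers, not TorchData APIs, and the sketch assumes the iterable yields `(key, value)` pairs that fit in memory.

```python
def parse_lines(lines):
    """Iterable-style stage: turns 'key = value' lines into (key, value) pairs."""
    for line in lines:
        key, value = line.split("=", 1)
        yield key.strip(), value.strip()


def to_map(pairs):
    """Hypothetical converter: eagerly materializes (key, value) pairs into a dict.

    Raises ValueError on duplicate keys so that no sample is silently dropped.
    """
    mapping = {}
    for key, value in pairs:
        if key in mapping:
            raise ValueError(f"duplicate key: {key!r}")
        mapping[key] = value
    return mapping


# All processing happens in the iterable-style stage; conversion is the last step.
lookup = to_map(parse_lines(["a = 1", "b = 2"]))
print(lookup["b"])
```

The conversion reads the whole iterable once, so it suits datasets small enough to hold in memory; for larger data, staying iterable-style avoids that cost.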

DataLoader/DataLoader2

* Fixed tutorial about `DataPipe` working with `DataLoader` (https://github.com/pytorch/data/pull/458)
* Updated examples and tutorial after automatic sharding has landed (https://github.com/pytorch/data/pull/505)
* Added README for `DataLoader2` (https://github.com/pytorch/data/pull/526, https://github.com/pytorch/data/pull/541)

Releng

* Added nightly documentation for TorchData in https://pytorch.org/data/main/
* Fixed instruction to install TorchData (https://github.com/pytorch/data/pull/455)

Future Plans

For `DataLoader2`, we are introducing new ways for `DataPipes`, the DataLoading API, and backends (aka `ReadingServices`) to interact. The feature is stable in terms of API, but it is not yet functionally complete. We welcome early adopters and feedback, as well as potential contributors.

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.

0.3.0

We are delighted to present the Beta release of [TorchData](https://github.com/pytorch/data). This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. Based on community feedback, we have found that the existing DataLoader bundles too many features together and can be difficult to extend. Moreover, different use cases often require rewriting the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called [“DataPipes”](https://github.com/pytorch/data#what-are-datapipes) that work well out of the box with PyTorch’s [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).

* Highlights
* What are DataPipes?
* Usage Example
* New Features
* Documentation
* Usage in Domain Libraries
* Future Plans
* Beta Usage Note

Highlights

We are releasing DataPipes in two styles: the Iterable-style DataPipe ([`IterDataPipe`](https://pytorch.org/data/0.3.0/torchdata.datapipes.iter.html)) and the Map-style DataPipe ([`MapDataPipe`](https://pytorch.org/data/0.3.0/torchdata.datapipes.map.html)).

What are DataPipes?

Early on, we observed widespread confusion between the PyTorch `DataSets` which represented reusable loading tooling (e.g. [TorchVision's `ImageFolder`](https://github.com/pytorch/vision/blob/main/torchvision/datasets/folder.py#L272)), and those that represented pre-built iterators/accessors over actual data corpora (e.g. [TorchVision's `ImageNet`](https://github.com/pytorch/vision/blob/main/torchvision/datasets/imagenet.py#L21)). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.

`DataPipe` is simply a renaming and repurposing of the PyTorch `DataSet` for composed usage. A `DataPipe` takes in some access function over Python data structures, `__iter__` for `IterDataPipes` and `__getitem__` for `MapDataPipes`, and returns a new access function with a slight transformation applied. For example, take a look at this `JsonParser`, which accepts an `IterDataPipe` over file names and raw streams, and produces a new iterator over the file names and deserialized data:

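```py
import json

from torchdata.datapipes.iter import IterDataPipe


class JsonParserIterDataPipe(IterDataPipe):
    def __init__(self, source_datapipe, **kwargs) -> None:
        self.source_datapipe = source_datapipe
        # Extra keyword arguments are forwarded to json.loads.
        self.kwargs = kwargs

    def __iter__(self):
        # Each upstream item is a (file_name, stream) pair.
        for file_name, stream in self.source_datapipe:
            data = stream.read()
            yield file_name, json.loads(data, **self.kwargs)

    def __len__(self):
        # Parsing is one-to-one, so the length matches the source.
        return len(self.source_datapipe)
```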

You can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.
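
To make the chaining concrete, here is a minimal sketch that runs without TorchData installed. It uses plain Python iterables rather than `IterDataPipe` subclasses: `ListSource` and `KeyFilter` are hypothetical stand-ins written for this illustration, and `JsonParser` mirrors the parser above in simplified form.

```python
import io
import json


class ListSource:
    """Hypothetical source stage: yields (file_name, stream) pairs."""

    def __init__(self, items):
        self.items = items

    def __iter__(self):
        for file_name, text in self.items:
            # A fresh stream per pass keeps the source re-iterable.
            yield file_name, io.StringIO(text)


class JsonParser:
    """Simplified counterpart of the JsonParser above, as a plain iterable."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        for file_name, stream in self.source:
            yield file_name, json.loads(stream.read())


class KeyFilter:
    """Hypothetical downstream stage: keeps records that contain `key`."""

    def __init__(self, source, key):
        self.source = source
        self.key = key

    def __iter__(self):
        for file_name, record in self.source:
            if self.key in record:
                yield file_name, record


# Each stage wraps the previous one; nothing is read until iteration starts.
source = ListSource([("a.json", '{"label": 1}'), ("b.json", '{"other": 2}')])
pipeline = KeyFilter(JsonParser(source), key="label")
result = list(pipeline)
print(result)
```

Because every stage only holds a reference to its upstream stage, records stream through one at a time instead of being loaded all at once.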

Under this naming convention, `DataSet` simply refers to a graph of `DataPipes`, and a dataset module like `ImageNet` can be rebuilt as a factory function returning the requisite composed `DataPipes`.

Usage Example

In this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, and parse and return the CSV content. The full example with detailed explanation is [included in the example folder](https://github.com/pytorch/data/blob/release/0.3.0/examples/text/amazonreviewpolarity.py).

```py
from torchdata.datapipes.iter import FileOpener, GDriveReader, IterableWrapper

# Excerpt from a dataset function: `URL`, `split`, `_EXTRACTED_FILES`, and the
# elided caching steps are defined in the full source file.
url_dp = IterableWrapper([URL])
cache_compressed_dp = GDriveReader(cache_compressed_dp)
cache_decompressed_dp = ...  # See source file for full code example
# Opens and loads the content of the TAR archive file.
cache_decompressed_dp = FileOpener(cache_decompressed_dp, mode="b").load_from_tar()
# Filters for specific files based on the file name.
cache_decompressed_dp = cache_decompressed_dp.filter(
    lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]
)
# Saves the decompressed file onto disk.
cache_decompressed_dp = cache_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
data_dp = FileOpener(cache_decompressed_dp, mode="b")
# Parses content of the decompressed CSV file and returns the result line by line.
return data_dp.parse_csv().map(fn=lambda t: (int(t[0]), " ".join(t[1:])))
```
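
The excerpt depends on Google Drive and on code elided from the release notes, so it cannot be run as shown. As a rough, standard-library-only sketch of the same stages (load from a TAR archive, filter by file name, parse CSV, map each row to a label and text), the following builds a tiny archive in memory. The helper names and the sample data are hypothetical, and downloading and on-disk caching are left out.

```python
import csv
import io
import tarfile


def make_archive(files):
    """Builds an in-memory TAR archive from a {name: text} dict (demo fixture)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def load_from_tar(stream):
    """Yields (file_name, binary_stream) for each regular file in the archive."""
    with tarfile.open(fileobj=stream, mode="r") as archive:
        for member in archive:
            if member.isfile():
                yield member.name, archive.extractfile(member)


def parse_csv(named_streams):
    """Yields CSV rows (lists of strings) from each (file_name, stream) pair."""
    for _, stream in named_streams:
        text = stream.read().decode("utf-8")
        yield from csv.reader(io.StringIO(text, newline=""))


def to_label_and_text(row):
    """Same mapping as the excerpt: first column is the label, the rest is text."""
    return int(row[0]), " ".join(row[1:])


archive = make_archive({
    "data/train.csv": "2,Great,Loved it\n1,Bad,Broke quickly\n",
    "data/readme.txt": "not a csv\n",
})
# Filter by file name before parsing, as in the excerpt.
wanted = ((name, s) for name, s in load_from_tar(archive) if "train.csv" in name)
samples = [to_label_and_text(row) for row in parse_csv(wanted)]
print(samples)
```

Generators keep each stage lazy, so a file is only read when a downstream stage asks for its rows.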


New Features

[Beta] [IterDataPipe](https://pytorch.org/data/0.3.0/torchdata.datapipes.iter.html)

We have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the [fsspec and iopath DataPipes](https://pytorch.org/data/0.3.0/torchdata.datapipes.iter.html#io-datapipes) will allow you to do so. The documentation provides detailed explanations and usage examples of each `IterDataPipe`.

[Beta] [MapDataPipe](https://pytorch.org/data/0.3.0/torchdata.datapipes.map.html)

As with `IterDataPipe`, various `MapDataPipes` are available for different functionalities, although there are fewer of them. Support for more `MapDataPipes` will come later. If the existing ones do not meet your needs, you can write a custom DataPipe.
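
As a rough illustration of what a custom map-style stage involves, the sketch below implements the `__getitem__` and `__len__` access pattern in plain Python. `SequenceSource` and `MapperMap` are hypothetical names chosen for this example; in TorchData itself the class would subclass `MapDataPipe`, which this sketch deliberately avoids so that it runs without the library.

```python
class SequenceSource:
    """Hypothetical map-style source backed by a list."""

    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class MapperMap:
    """Hypothetical map-style stage: applies `fn` to an item when it is accessed."""

    def __init__(self, source, fn):
        self.source = source
        self.fn = fn

    def __getitem__(self, index):
        # The transformation is applied lazily, per lookup.
        return self.fn(self.source[index])

    def __len__(self):
        return len(self.source)


def square(x):
    # A named function rather than a lambda, which is easier to pickle for workers.
    return x * x


dp = MapperMap(SequenceSource(range(5)), square)
values = [dp[i] for i in range(len(dp))]
print(values)
```

Random access by index is what distinguishes this from the iterable style: any item can be fetched without consuming the ones before it.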

Documentation

The [documentation for TorchData](https://pytorch.org/data) is now live. It contains a tutorial that covers [how to use DataPipes](https://pytorch.org/data/0.3.0/tutorial.html#using-datapipes), [use them with DataLoader](https://pytorch.org/data/0.3.0/tutorial.html#working-with-dataloader), and [implement custom ones](https://pytorch.org/data/0.3.0/tutorial.html#implementing-a-custom-datapipe).

Usage in Domain Libraries

In this release, some of the PyTorch domain libraries have migrated their datasets to use DataPipes. In TorchText, the [popular datasets provided by the library](https://github.com/pytorch/text/tree/release/0.12/torchtext/datasets) are implemented using DataPipes, and a [section of its SST-2 binary text classification tutorial](https://pytorch.org/text/0.12.0/tutorials/sst2_classification_non_distributed.html#dataset) demonstrates how you can use DataPipes to preprocess data for your model. There are also other prototype implementations of datasets with DataPipes in [TorchVision](https://github.com/pytorch/vision/tree/main/torchvision/prototype/datasets/_builtin) (available in nightly releases) and in [TorchRec](https://pytorch.org/torchrec/torchrec.datasets.html). You can find more [specific examples here](https://pytorch.org/data/0.3.0/examples.html).

Future Plans

There will be a new version of DataLoader in the next release. At a high level, the plan is that DataLoader V2 will only be responsible for multiprocessing, distributed, and similar functionalities, not data processing logic. All data processing features, such as shuffling and batching, will be moved out of DataLoader and into DataPipe. At the same time, the current version of DataLoader will remain available, and you can use DataPipes with it as well.
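
One way to picture shuffling and batching living in the pipeline rather than in the loader is the plain-Python sketch below. It is not the DataLoader V2 design: `shuffle` and `batch` are hypothetical generator functions written for this illustration, and the buffer-based shuffle is just one common streaming approach.

```python
import random


def shuffle(source, buffer_size, seed):
    """Streaming shuffle: keeps a fixed-size buffer and emits random entries from it."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            # Emit a random buffered item and put the new one in its place.
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Flush whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


def batch(source, batch_size):
    """Groups consecutive items into lists of at most `batch_size` items."""
    current = []
    for item in source:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        # The final batch may be smaller than batch_size.
        yield current


# The loader would only need to iterate; ordering and grouping are pipeline stages.
batches = list(batch(shuffle(range(10), buffer_size=4, seed=0), batch_size=3))
print(batches)
```

With the processing expressed as stages, whatever consumes the pipeline only has to iterate over it and distribute the work.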

Beta Usage Note

This library is currently in the Beta stage and does not yet have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.
