Torchtext

Latest version: v0.18.0


Page 2 of 6

0.16.1

This is a patch release, which is compatible with [PyTorch 2.1.1](https://github.com/pytorch/pytorch/releases/tag/v2.1.1). There are no new features added.

0.16.0

Current status

As of September 2023 we have paused active development of TorchText because our focus has shifted away from building out this library offering. We will continue to release new versions but do not anticipate any new feature development as we figure out future investments in this space.

Bug Fixes

- Update links to multi30k dataset since original servers are down (2194)
- Use filelock to block on concurrent model downloads (2166)

New Features

- Add support for `__contains__` for Vectors class (2144)
- Add generation utility support to T5Bundle (2146)
- Add option to ignore UTF-8 decoding error to scripted tokenizer (2134)
- Add shift-right method to T5 model (2131)
- Add XLMR and RoBERTa transforms as factory functions (2102)
- Make sure to include padding mask in generation (2096)
- (Prototype) Add top-p and top-k sampling (2137)
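The top-k and top-p (nucleus) sampling mentioned in the last item can be illustrated with a small pure-Python sketch. This is not TorchText's implementation: the function names `top_k_filter`, `top_p_filter`, and `sample` are hypothetical, and the sketch works on a plain list of probabilities rather than on model logits.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize (hypothetical helper)."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable ids whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def sample(filtered, rng):
    """Draw one token id from the filtered, renormalized distribution."""
    ids = list(filtered)
    weights = [filtered[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Toy next-token distribution over a four-token vocabulary (illustrative values).
probs = [0.5, 0.25, 0.125, 0.125]
top2 = top_k_filter(probs, 2)        # only ids 0 and 1 survive
nucleus = top_p_filter(probs, 0.7)   # ids 0 and 1 together cover 0.75 >= 0.7
token = sample(nucleus, random.Random(0))
```

Both filters restrict sampling to the high-probability part of the distribution; top-k fixes the number of candidates, while top-p lets that number vary with how peaked the distribution is.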

0.15.2

This is a minor release, which is compatible with [PyTorch 2.0.1](https://github.com/pytorch/pytorch/releases/tag/v2.0.1). There are no new features added.

0.15.1

Highlights

In this release, we add a new model architecture along with pre-trained weights, increase flexibility in our tokenizers, and improve the overall stability of the library.

* Added T5 & Flan-T5 model architecture with pre-trained weights
* Added DistilRoBERTa
* Added tutorial showing T5 in action
* Added prototype `GenerationUtils`

Models
Torchtext expanded its models to include [T5](https://arxiv.org/abs/1910.10683), [Flan-T5](https://arxiv.org/abs/2210.11416), and [DistilRoBERTa](https://huggingface.co/roberta-base), along with the corresponding pre-trained model weights. These additions include both the smallest and the largest models available in Torchtext to date, and T5 is the library's first encoder/decoder model. As usual, all models are TorchScriptable.

Utils
Because TorchText now has encoder/decoder models available, we prototyped `GenerationUtils`, which provides generic decoding capabilities for encoder/decoder and decoder-only models.
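As a rough illustration of what generic decoding means here, the following is a minimal greedy decoding loop in plain Python. It is a sketch under stated assumptions, not the `GenerationUtils` API: `greedy_decode`, `toy_scores`, `VOCAB_SIZE`, and `EOS_ID` are hypothetical names, and the toy scoring function stands in for a real model.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_scores: Callable[[Sequence[int]], Sequence[float]],
    start_ids: Sequence[int],
    eos_id: int,
    max_len: int,
) -> List[int]:
    """Append the highest-scoring next token until EOS is produced or max_len is reached."""
    ids = list(start_ids)
    while len(ids) < max_len:
        scores = next_token_scores(ids)
        best = max(range(len(scores)), key=lambda i: scores[i])
        ids.append(best)
        if best == eos_id:
            break
    return ids


# Toy stand-in for a model (assumption): it always prefers the token id
# one greater than the last token, capped at the last vocabulary id.
VOCAB_SIZE = 5
EOS_ID = 4


def toy_scores(ids: Sequence[int]) -> List[float]:
    target = min(ids[-1] + 1, VOCAB_SIZE - 1)
    return [1.0 if i == target else 0.0 for i in range(VOCAB_SIZE)]


decoded = greedy_decode(toy_scores, start_ids=[0], eos_id=EOS_ID, max_len=10)
```

The loop only needs a function that maps the tokens generated so far to next-token scores, which is why the same routine can serve encoder/decoder and decoder-only models alike.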

Improvements

Features
* Add DistilRoBERTa to OSS (1998)
* Beginning of GenerationUtils (2011)
* Add Flan-T5 architecture (2027)
* Optimize T5 for sequence generation (2054)
* Add bundles for FLAN-T5 (2061)
* Promote T5 and variants (2064)
* Fixup generation utils for prototype release (2065)

CI (Migrate from CircleCI to GitHub Actions)
* Remove CUDA binary builds (1994)
* Remove Linux and MacOS unit tests from CircleCI (1993)
* Validate binaries for nightly/release testing (2010)
* Rename variable to avoid conflict with PIP system variable PIP_PREFIX (2015, 2016)
* Refactor validation using MATRIX vars (2021)
* Migrate validation workflows to test-infra (2022)
* 3.11 Windows Wheels Support in CircleCI (2053)
* Adding RC triggers for all build jobs (2057)
* Add windows 3.11 conda (2063)
* Channel=test for build matrix generation (2066)
* Turn off CircleCI 3.11 unit tests (2078)
* Fix validation workflow for test channel (2071)
* Modify integration test workflow to use PyTorch generic CI job (2051)

Bug Fixes
* Change `read_from_tar` call to `load_from_tar` (1997)
* Update Multi30k test dataset hash (2003)
* Fix device setting for T5 Model (2007)
* Fix `overwite` typo (2006)
* Fix linting error (2019)
* Fix memory leak with C++ RegEx operator (2024)
* Fix CodeQL workflow failure (2046)
* Fix UTF8 decoding error in GPT2BPETokenizer `decode` method (2092)

Examples
* Update T5 tutorial for 2.0 release (2080)

Documentation
* Added min version req + readme instructions for torchdata (2048)
* Update README w/ 3.11 (2062)

Testing
* Replaced tabs w/ spaces to fix CodeMod (1999)
* Add GPU testing for RoBERTa models (2025)
* Add TorchData version to smoke tests (2034)
* Update integration-test.yml (2038)
* Update CUDA version on GPU test (2040)
* Add prototype GPU tests for T5 (2055)
* Install portalocker for testing (2056)
* Test newly uploaded Flan-T5 weights (2074)

Dependencies
* Add TorchData as a hard dependency (1985)

Others
* Drop support for Python 3.7 (2037)
* Add logo (2050)
* Version Bumps and Update channels (2067)

0.14.1

This is a minor release, which is compatible with [PyTorch 1.13.1](https://github.com/pytorch/pytorch/releases/tag/v1.13.1). There are no new features added.

0.14.0

Highlights
In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

* Added CNN-DM dataset
* Added support for RegexTokenizer
* Added TorchArrow based examples for training RoBERTa model on SST2 classification dataset

Datasets
We increased the number of datasets in TorchText from 30 to 31 by adding the [CNN-DM](https://github.com/abisee/cnn-dailymail) ([paper](https://aclanthology.org/K16-1028.pdf)) dataset. The datasets supported by TorchText use datapipes from the [TorchData project](https://pytorch.org/data/beta/index.html), which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`. For more details, refer to https://pytorch.org/text/stable/datasets.html

Tokenizers
TorchText has extended its support for TorchScriptable tokenizers by adding a RegexTokenizer that splits text based on regular expressions. TorchScriptability allows users to embed the RegexTokenizer natively in C++ without needing a Python runtime. Because TorchText now supports the CMake build system for natively linking TorchText binaries with application code, users can easily integrate regex tokenizers for deployment needs.
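The idea of regex-based splitting can be sketched with Python's standard `re` module. This is an illustration only, not the TorchText RegexTokenizer, which is TorchScriptable and can run without Python: the function `regex_tokenize` and the (pattern, replacement) pair format used here are assumptions made for the example.

```python
import re
from typing import List, Tuple


def regex_tokenize(text: str, patterns: List[Tuple[str, str]]) -> List[str]:
    """Apply each (pattern, replacement) pair in order, then split on whitespace."""
    for pattern, replacement in patterns:
        text = re.sub(pattern, replacement, text)
    return text.split()


# Illustrative patterns (assumption): strip basic punctuation, then
# collapse any run of whitespace into a single space.
patterns = [
    (r"[.,!?]", " "),
    (r"\s+", " "),
]
tokens = regex_tokenize("Hello, world! Regex   splitting.", patterns)
```

Keeping the rules as an ordered list of pairs makes the tokenizer's behavior easy to inspect and to reproduce in another runtime.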

New Features

Transforms, Tokenizers, Ops
* Migrate RegexTokenizer from experimental/transforms.py to transforms.py (1763)
* Migrate MaskTransform from internal to experimental/transforms.py (1775)
* Graduate MaskTransform from prototype (1882)

Datasets
* Add CNN-DM dataset to torchtext (1789)
* Resolve inconsistency in IMDB label output (1914)
* Cache CNNDM extraction and optimize reading in filenames (1809)
* Allow CNNDM to be imported from torchtext.datasets (1884)

Improvements
Features
* Convert TA transform module to preproc function (1854)
* Use TA functional for adding tokens to the beginning and end of input (1820)
* Add TA Tensor creation operation to the benchmark (1836)
* Add never_split feature to BERTTokenizer (1898)
* Adding benchmarks for add tokens operator (1807)
* Add benchmark for RoBERTa preproc pipelines (1684)
* Adding Benchmark for TA ops (1801)
* Make BERT benchmark code more robust (1871)
* Define TORCHTEXT_API macro for visibility control (1806)
* Modify get_local_asset_path to take overwrite option and use it in BERTTokenizer (1839)

Testing
* Add test to compare encoder inference on input with and without padding (1770)
* Add m1 tagged build for TorchText (1776)
* Refactor TorchText version handling and adding first version of M1 builds (1773)
* Fix test execution in torchtext (1889)
* Add torchdata to testing requirements in requirements.txt (1874)
* Add missing None type hint to tests (1868)
* Create pytest fixture to auto delete model checkpoints within integration tests (1886)
* Disable test_vocab_from_raw_text_file on Linux (1901)

Examples
* Add libtorchtext cpp example (1817)
* Torcharrow based training using RoBERTa model and SST2 classification dataset (1808)

Documentation
* Add Datasets contribution guidelines (1798)
* Correct typo in SST-2 tutorial (1865)
* Update doc theme to the latest (1899)
* Tutorial on using T5 model for text summarization (1864)
* Fix docstring type (1867)

Bug fixes
* Fixing incorrect inputs to add eos and bos operators (1810)
* Add missing type hints (1782)
* Fix typo in nightly branch ref (1783)
* Sharing -> sharding (1787)
* Remove padding mask for input embeddings (1799)
* Fixed on_disk_cache issues (1957)
* Fix Multi30k dataset urls (1816)
* Add missing CMake file in tokenizer dir (1908)
* Fix OBO error for vocab files with empty lines (1841)
* Fixing build when CUDA enabled torch is installed (1814)
* Make comment paths dynamic (1894)
* Turn off mask checking for torchtext which is known to have a legal mask (1906)
* Fix push on release reference name (1792)

Dependencies
* Remove future dep from windows (1838)
* Remove dependency on the torch::jit::script::Module for mobile builds (1885)
* Add Torchdata as a requirement and remove conditional imports of Torchdata (1962)
* Remove sphinx_rtd_theme from requirements.txt (1837)
* Fix Sphinx-gallery display and pin sphinx-related packages (1907)

Others
* Resolve and remove TODO comments (1912)
* Refactor TorchText version handling and adding first version of M1 builds (1773)
* Update xcode version to 14.0 in CI (1881)
* CI: Use self hosted runners for build (1851)
* Move Spacy from Pip dependencies to Conda dependencies (1890)
* Update compatibility matrix for 0.13 release (1802)
* Update CircleCI Xcode image (1818)
* Avoid looping through the whole counter in bleu_score method (1913)
* Rename build_tools dir to tools dir (1804)
* Use setup-miniconda action for m1 build (1897)
* Making sure we build correctly against release branch (1790)
* Adding the conda builds for m1 (1794)
* Automatically initialize submodule (1805)
* Set MACOSX_DEPLOYMENT_TARGET=10.9 for binary job (1835)
