Highlights
In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.
* Added CNN-DM dataset
* Added support for RegexTokenizer
* Added TorchArrow based examples for training RoBERTa model on SST2 classification dataset
Datasets
We increased the number of datasets in TorchText from 30 to 31 by adding the [CNN-DM](https://github.com/abisee/cnn-dailymail) ([paper](https://aclanthology.org/K16-1028.pdf)) dataset. The datasets supported by TorchText use datapipes from the [TorchData project](https://pytorch.org/data/beta/index.html), which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect many of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`. For more details, refer to the [datasets documentation](https://pytorch.org/text/stable/datasets.html).
Tokenizers
TorchText has extended support for TorchScriptable tokenizers by adding a RegexTokenizer that enables splitting based on regular expressions. TorchScriptability allows users to embed the RegexTokenizer natively in C++ without needing a Python runtime. Because TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate the RegexTokenizer into their deployments.
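To illustrate the idea of regex-based splitting, here is a minimal pure-Python sketch. It is not the TorchText implementation and does not use the TorchText API: the function name `regex_tokenize` and the example patterns are hypothetical, and the sketch assumes a tokenizer that applies an ordered list of (pattern, replacement) pairs and then splits the result on whitespace.

```python
import re
from typing import List, Tuple


def regex_tokenize(line: str, patterns: List[Tuple[str, str]]) -> List[str]:
    """Hypothetical stand-in for a regex tokenizer (not the TorchText class).

    Applies each (pattern, replacement) pair in order, then splits on
    whitespace, which also drops empty tokens.
    """
    for pattern, replacement in patterns:
        line = re.sub(pattern, replacement, line)
    return line.split()


# Illustrative patterns only: pad apostrophes and basic punctuation with
# spaces so that they become separate tokens.
patterns = [
    (r"'", " ' "),
    (r"([.,!?])", r" \1 "),
]

tokens = regex_tokenize("Hello, world! It's TorchText.", patterns)
print(tokens)
```

The order of the pairs matters, because each replacement runs on the output of the previous one. The sketch uses Python's `re` module, so pattern syntax may differ from the engine used by the library's tokenizer.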
New Features
Transforms, Tokenizers, Ops
* Migrate RegexTokenizer from experimental/transforms.py to transforms.py (1763)
* Migrate MaskTransform from internal to experimental/transforms.py (1775)
* Graduate MaskTransform from prototype (1882)
Datasets
* Add CNN-DM dataset to torchtext (1789)
* Resolve inconsistency in IMDB label output (1914)
* Cache CNNDM extraction and optimize reading in filenames (1809)
* Allow CNNDM to be imported from torchtext.datasets (1884)
Improvements
Features
* Convert TA transform module to preproc function (1854)
* Use TA functional for adding tokens to the beginning and end of input (1820)
* Add TA Tensor creation operation to the benchmark (1836)
* Add never_split feature to BERTTokenizer (1898)
* Adding benchmarks for add tokens operator (1807)
* Add benchmark for RoBERTa preproc pipelines (1684)
* Adding Benchmark for TA ops (1801)
* Make BERT benchmark code more robust (1871)
* Define TORCHTEXT_API macro for visibility control (1806)
* Modify get_local_asset_path to take overwrite option and use it in BERTTokenizer (1839)
Testing
* Add test to compare encoder inference on input with and without padding (1770)
* Add M1 tagged build for TorchText (1776)
* Refactor TorchText version handling and adding first version of M1 builds (1773)
* Fix test execution in torchtext (1889)
* Add torchdata to testing requirements in requirements.txt (1874)
* Add missing None type hint to tests (1868)
* Create pytest fixture to auto delete model checkpoints within integration tests (1886)
* Disable test_vocab_from_raw_text_file on Linux (1901)
Examples
* Add libtorchtext cpp example (1817)
* TorchArrow based training using RoBERTa model and SST2 classification dataset (1808)
Documentation
* Add Datasets contribution guidelines (1798)
* Correct typo in SST-2 tutorial (1865)
* Update doc theme to the latest (1899)
* Tutorial on using T5 model for text summarization (1864)
* Fix docstring type (1867)
Bug fixes
* Fixing incorrect inputs to add eos and bos operators (1810)
* Add missing type hints (1782)
* Fix typo in nightly branch ref (1783)
* Sharing -> sharding (1787)
* Remove padding mask for input embeddings (1799)
* Fixed on_disk_cache issues (1957)
* Fix Multi30k dataset urls (1816)
* Add missing CMake file in tokenizer dir (1908)
* Fix OBO error for vocab files with empty lines (1841)
* Fixing build when CUDA enabled torch is installed (1814)
* Make comment paths dynamic (1894)
* Turn off mask checking for torchtext which is known to have a legal mask (1906)
* Fix push on release reference name (1792)
Dependencies
* Remove future dep from windows (1838)
* Remove dependency on the torch::jit::script::Module for mobile builds (1885)
* Add Torchdata as a requirement and remove conditional imports of Torchdata (1962)
* Remove sphinx_rtd_theme from requirements.txt (1837)
* Fix Sphinx-gallery display and pin sphinx-related packages (1907)
Others
* Resolve and remove TODO comments (1912)
* Refactor TorchText version handling and adding first version of M1 builds (1773)
* Update xcode version to 14.0 in CI (1881)
* CI: Use self hosted runners for build (1851)
* Move Spacy from Pip dependencies to Conda dependencies (1890)
* Update compatibility matrix for 0.13 release (1802)
* Update CircleCI Xcode image (1818)
* Avoid looping through the whole counter in bleu_score method (1913)
* Rename build_tools dir to tools dir (1804)
* Use setup-miniconda action for M1 build (1897)
* Making sure we build correctly against release branch (1790)
* Adding the conda builds for M1 (1794)
* Automatically initialize submodule (1805)
* Set MACOSX_DEPLOYMENT_TARGET=10.9 for binary job (1835)