Torchtext

Latest version: v0.18.0


0.13.1

This is a minor release, which is compatible with [PyTorch 1.12.1](https://github.com/pytorch/pytorch/releases/tag/v1.12.1) and includes small bug fixes, improvements, and documentation updates. There are no new features in this release.

Bug Fix
- Fixed build when CUDA-enabled torch is installed (1814)

For the full set of features in v0.13, please refer to [the v0.13.0 release notes](https://github.com/pytorch/text/releases/tag/v0.13.0).

0.13.0

Highlights
In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

* Added all 9 [GLUE benchmark](https://gluebenchmark.com/) datasets (#1710): CoLA, MRPC, QQP, STS-B, SST-2, MNLI, QNLI, RTE, WNLI
* Added support for BERTTokenizer
* Created native C++ binaries using a CMake-based build system (1644)

Datasets
We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:
* [CoLA](https://nyu-mll.github.io/CoLA/) ([paper](https://arxiv.org/pdf/1805.12471.pdf)): Single sentence binary classification acceptability task
* [SST-2](https://nlp.stanford.edu/sentiment/) ([paper](https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf)): Single sentence binary classification sentiment task
* [MRPC](https://metatext.io/datasets/microsoft-research-paraphrase-corpus-(mrpc)) ([paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/I05-50025B15D.pdf)): Dual sentence binary classification paraphrase task
* [QQP](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs): Dual sentence binary classification paraphrase task
* [STS-B](https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) ([paper](https://aclanthology.org/S17-2001.pdf)): Single sentence to float regression sentence similarity task
* [MNLI](https://cims.nyu.edu/~sbowman/multinli/) ([paper](https://cims.nyu.edu/~sbowman/multinli/paper.pdf)): Sentence ternary classification NLI task
* [QNLI](https://gluebenchmark.com/) ([paper](https://arxiv.org/pdf/1804.07461.pdf)): Sentence binary classification QA and NLI tasks
* [RTE](https://aclweb.org/aclwiki/Recognizing_Textual_Entailment) ([paper](https://arxiv.org/pdf/2010.03061.pdf)): Dual sentence binary classification NLI task
* [WNLI](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html) ([paper](http://commonsensereasoning.org/2011/papers/Levesque.pdf)): Dual sentence binary classification coreference and NLI tasks

The datasets supported by TorchText use datapipes from the [TorchData project](https://pytorch.org/data/beta/index.html), which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of `DataLoaderV2` from `torchdata`. For more details, refer to https://pytorch.org/text/stable/datasets.html
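
As a quick illustration, here is a minimal sketch of iterating one of the newly added GLUE datasets; it assumes `torchdata` is installed, and the exact tuple layout of each sample is documented at the datasets page linked above:

```python
from torchtext.datasets import CoLA  # one of the newly added GLUE datasets

# Each dataset returns a DataPipe of raw samples; consult the docs for the
# exact fields yielded by each GLUE dataset.
train_dp = CoLA(split="train")
sample = next(iter(train_dp))
print(sample)
```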

Tokenizers
TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units and was introduced in [Japanese and Korean Voice Search (Schuster et al., 2012)](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf).

TorchScriptability support allows users to embed BERT text pre-processing natively in C++ without needing a Python runtime. Since TorchText now supports a CMake build system to natively link TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.

For usage details, please refer to the corresponding [documentation](https://pytorch.org/text/main/transforms.html#torchtext.transforms.BERTTokenizer).
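
As a rough usage sketch (adapted from the linked documentation; the vocab file URL points to a publicly hosted WordPiece vocabulary and can be swapped for any compatible `vocab.txt`):

```python
from torchtext.transforms import BERTTokenizer

VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)

tokenizer("Hello World, How are you!")      # single sentence input
tokenizer(["Hello World", "How are you!"])  # batch input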

CMake Build System
TorchText has migrated its build system for C++ extensions and third-party libraries to use CMake rather than PyTorch’s [`CppExtension`](https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.CppExtension) module. This allows end-users to integrate TorchText C++ binaries in their applications without a dependency on `libpython`, thus allowing them to use TorchText operators in a non-Python environment.

Refer to the [GitHub issue](https://github.com/pytorch/text/issues/1644) for more details.

Backward Incompatible Changes
The `RobertaModelBundle` introduced in the 0.12 release, which gets pre-trained RoBERTa/XLM-R models and builds custom models with a similar architecture, has been renamed to `RobertaBundle` (1653).

The default caching location (`cache_dir`) has been changed from `os.path.expanduser("~/.torchtext/cache")` to `os.path.expanduser("~/.cache/torch/text")`. Furthermore, the default root directory of datasets is now `cache_dir/datasets` (1740). Users can now control the default cache location via the `TORCH_HOME` environment variable (1741).
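
For illustration, a minimal sketch of overriding the cache location via `TORCH_HOME` (the directory below is arbitrary, and setting the variable before importing TorchText is an assumption made to be safe, since the default `cache_dir` is resolved from it):

```python
import os

# Assumption: set TORCH_HOME before torchtext resolves its default cache_dir.
os.environ["TORCH_HOME"] = "/tmp/my_torch_home"

from torchtext.datasets import SST2

# Datasets are then downloaded under <cache_dir>/datasets by default.
train_dp = SST2(split="train")
```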


New Features

Models
* [fbsync] BetterTransformer support for TorchText (1690) (1694)
* [fbsync] Killed to_better by having native load_from_state_dict and init (1695)
* [fbsync] Removed unneeded modules after using nn.Module for BetterTransformer (1696)
* [fbsync] Replaced TransformerEncoder in TorchText with better transformer (1703)

Transforms, Tokenizers, Ops
* Added pad transform, string to int transform (1683); see the sketch after this list
* Added support for Scriptable BERT tokenizer (1707)
* Added support for batch input in BERT Tokenizer with perf benchmark (1745)
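
A small sketch of the pad and string-to-int transforms mentioned above (names and signatures follow the transforms documentation; double-check them against your installed version):

```python
import torch
import torchtext.transforms as T

# PadTransform pads a token-id tensor out to max_length with pad_value;
# StrToIntTransform converts (lists of) numeric strings to ints.
pad = T.PadTransform(max_length=8, pad_value=1)
to_int = T.StrToIntTransform()

ids = torch.tensor([5, 9, 13])
print(pad(ids))                 # padded out to length 8 with 1s
print(to_int(["0", "1", "1"]))  # [0, 1, 1]
```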

Datasets
Support for [GLUE benchmark](https://gluebenchmark.com/)’s datasets added:
* CoLA (1711)
* MRPC (1712)
* QQP (1713)
* STS-B (1714)
* MNLI (1715)
* QNLI (1717)
* RTE (1721)
* WNLI (1724)
*Note:* SST2 was added previously (1538)

Others
* Prepared datasets for new encoding kwarg. (1616)
* Added Shuffle and sharding datapipes to datasets (1729)
* For Datasets, refactored local functions to be global so that they can be pickled (1726)
* Updated TorchData DataPipe API usages (1663)
* Replaced lambda functions with regular functions in all datasets (1718)

CMake Build System
* [CMake 1/3] Updated C++ includes to use imports relative to root directory (1666)
* [CMake 2/3] Added CMake build to TorchText to create a single `_torchtext` library (1673)
* [CMake 3/3] Split source files with Python dependency into a separate library (1660)

Improvements

Features
* [BC-breaking] Renamed Roberta Bundle (1635)
* Modified CLIPTokenizer to either infer number of merges from encoder json or take it in constructor (1622)
* Provided option to return split tokens (1698)
* Updated dataset code to avoid creating multiple iterators from a DataPipe (1708)

Testing
* Added unicode generation to IWSLT tests (followup to 1608) (1642)
* Added MacOS unit tests on CircleCI (1672)
* Added parameterized dataset pickling tests (1732)
* Added test to compare encoder inference on input with and without padding (1770)
* Added test for shuffle before shard (1738)
* Added more test coverage (1653)
* Enabled model testing in FBCode (1720)
* Fixed Windows builds with Python 3.10 by getting rid of ssize_t (1627)
* Built and tested py3.10 (1625)
* Made sure we build correctly against the release branch (1790)
* Removed caching artifacts for datasets and fixed it for vectors (1674)
* Installed torchdata from nightly release in CI (1664)
* Added m1 tagged build for TorchText (1776)
* Refactored TorchText version handling and added first version of M1 builds (1773)
* Removed MACOSX_DEPLOYMENT_TARGET (1728)

Examples
* Added data pipelines for Roberta pre-processing (1637)
* Updated sst2 tutorial to replace lambda usage (1722)

Documentation
* Removed _add_docstring_header decorator from amazon review polarity (1611)
* Added missing quotation marks to CLIPTokenizer docs (1610)
* Updated README around installing LTS version (1665)
* Added contributing guidelines for third party and custom C++ operators (1742)
* Added recommendations regarding use of datapipes for multi-processing, shuffling, DDP, etc. (1755)
* Fixed roberta bundle example doc (1648)
* Updated doc conf (1634)
* Removed install instructions (1641)
* Updated README (1652)
* Updated requirements (1675)
* Fixed typo sharing -> sharding (1787)
* Fixed docs build (1730)
* Replaced git+git with git+https in requirements.txt (1658)
* Added header info for BERT tokenizer (1754)
* Fixed docstring for Tokenizers (1739)
* Fixed doc js initialization (1736)
* Added missing type hints (1782)
* Fixed SentencePiece Tokenizer doc-string (1706)

Bug fixes
* Fixed missed mask arg in TorchText transformer (1758)
* Fixed bug in RTE and WNLI testing (1759)
* Fixed bug in QNLI dataset and corresponding test (1760)
* Fixed STSB and WikiTexts tests (1737)
* Fixed smoke tests for linux (1687)
* Removed redundant dataname in test_shuffle_shard_wrapper (1733)
* Fixed non-deterministic test failures for IWSLT (1699)
* Fixed typo in nightly branch ref (1783)
* Fixed windows utils test (1761)
* Fixed test utils (1757)
* Fixed pad transform test (1688)
* Resolved issues in 1653 + sanitize test names generated by nested_params (1667)
* Fixed mock tests due to change in datasets directory (1749)
* Deleted prints in test_qqp.py (1734)
* Fixed logger issue (1656)


Others
* Pinned Jinja2 version to fix broken doc build (1669)
* Fixed formatting for all files using pre-commit (1670)
* Pinned setuptools to 58.0.4 on Windows (1746)
* Added post install script for pywin32 (1748)
* Pinned Utf8proc version (1771)
* Removed models from experimental (1643)
* Cleaned examples folder (1647)
* Cleaned stale code (1654)
* Took TORCH_HOME env variable into account while setting the cache dir (1741)
* Updated download hooks and datasets to import HttpReader and GDriveReader from download hooks (1657)
* Added Model benchmark (1697)
* Changed root directory for datasets (1740)
* Used _get_torch_home standard utility from torch hub (1752)
* Removed ticks (``) from the url under is_module_available (1753)
* Prepared repo for auto-formatters (1546)
* Fixed flake8 issues introduced from adding auto formatter (1617)

0.12.0

Highlights

In this release, we have revamped the library to provide a more comprehensive experience for users to do NLP modeling using TorchText and PyTorch.
* Migrated datasets to build on top of [TorchData](https://github.com/pytorch/data#readme) DataPipes
* Added support for RoBERTa and XLM-RoBERTa pre-trained models
* Added support for Scriptable tokenizers
* Added support for composable transforms and functionals

Datasets
TorchText has modernized its datasets by migrating from older-style iterable datasets to TorchData’s DataPipes. TorchData is a library that provides modular, composable primitives, allowing users to load and transform data in performant data pipelines. These DataPipes work out-of-the-box with the PyTorch DataLoader and enable new functionality such as automatic sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional programming style. Datasets backed by DataPipes also support standard flow control such as batching, collation, shuffling, and bucketizing. Collectively, DataPipes provide a comprehensive experience for data preprocessing and tensorization needs in a Pythonic and flexible way for model training.

```python
from functools import partial
import torchtext.functional as F
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torch.utils.data import DataLoader
from torchtext.datasets import SST2

# Tokenizer to split input text into tokens
encoder_json_path = "https://download.pytorch.org/models/text/gpt2_bpe_encoder.json"
vocab_bpe_path = "https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe"
tokenizer = T.GPT2BPETokenizer(encoder_json_path, vocab_bpe_path)

# Vocabulary converting tokens to IDs
vocab_path = "https://download.pytorch.org/models/text/roberta.vocab.pt"
vocab = T.VocabTransform(load_state_dict_from_url(vocab_path))

# Add BOS token to the beginning of the sentence
add_bos = T.AddToken(token=0, begin=True)
# Add EOS token to the end of the sentence
add_eos = T.AddToken(token=2, begin=False)

# Create SST2 dataset datapipe and apply pre-processing
batch_size = 32
train_dp = SST2(split="train")
train_dp = train_dp.batch(batch_size).rows2columnar(["text", "label"])
train_dp = train_dp.map(tokenizer, input_col="text", output_col="tokens")
train_dp = train_dp.map(partial(F.truncate, max_seq_len=254), input_col="tokens")
train_dp = train_dp.map(vocab, input_col="tokens")
train_dp = train_dp.map(add_bos, input_col="tokens")
train_dp = train_dp.map(add_eos, input_col="tokens")
train_dp = train_dp.map(partial(F.to_tensor, padding_value=1), input_col="tokens")
train_dp = train_dp.map(F.to_tensor, input_col="label")

# Create DataLoader
dl = DataLoader(train_dp, batch_size=None)
batch = next(iter(dl))
model_input = batch["tokens"]
target = batch["label"]
```

TorchData is required in order to use these datasets. Please install it following the instructions at https://github.com/pytorch/data.

Models
We have added support for pre-trained RoBERTa and XLM-R models. The models are TorchScriptable and hence can be employed for production use-cases. The modeling APIs let users attach custom task-specific heads to pre-trained encoders. The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.

```python
import torch, torchtext
from torchtext.functional import to_tensor

xlmr_base = torchtext.models.XLMR_BASE_ENCODER
model = xlmr_base.get_model()
transform = xlmr_base.transform()
input_batch = ["Hello world", "How are you!"]
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape  # torch.Size([2, 6, 768])

# add classification head
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.output_layer = nn.Linear(input_dim, num_classes)

    def forward(self, features):
        # get features from cls token
        x = features[:, 0, :]
        return self.output_layer(x)

binary_classifier = xlmr_base.get_model(head=ClassificationHead(input_dim=768, num_classes=2))
output = binary_classifier(model_input)
output.shape  # torch.Size([2, 2])
```


Transforms and tokenizers
We have revamped our transforms to provide composable building blocks to do text pre-processing. They support both batched and non-batched inputs. Furthermore, we have added support for a number of commonly used tokenizers including SentencePiece, GPT-2 BPE and CLIP.

```python
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url

padding_idx = 1
bos_idx = 0
eos_idx = 2
max_seq_len = 256
xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"

text_transform = T.Sequential(
    T.SentencePieceTokenizer(xlmr_spm_model_path),
    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
    T.Truncate(max_seq_len - 2),
    T.AddToken(token=bos_idx, begin=True),
    T.AddToken(token=eos_idx, begin=False),
)

text_transform(["Hello World", "How are you"])
```



Tutorial
We have added an end-to-end [tutorial](https://pytorch.org/text/main/tutorials/sst2_classification_non_distributed.html) that performs SST-2 binary text classification with the pre-trained XLM-R base architecture and demonstrates the usage of the new datasets, transforms, and models.

Backward Incompatible changes
We have removed the legacy folder in this release, which provided access to legacy datasets and abstractions. For additional information, please refer to the corresponding GitHub issue (1422) and PR (1437).


New Features

Models
* Add XLMR Base and Large pre-trained models and corresponding transformations (1407)
* Added option to specify whether to load pre-trained weights (1424)
* Added Option for freezing encoder weights (1428)
* Enable optional return of all states in transformer encoder (1430)
* Added support for RobertaModel to accept model configuration (1431)
* Allow inferred scaling in MultiheadSelfAttention for head_dim != 64 (1432)
* Added attention mask to transformer encoder modules (1435)
* Added builder method in Model Bundler to facilitate model creation with user-defined configuration and checkpoint (1442)
* Cleaned up Model API (1452)
* Fixed bool attention mask in transformer encoder (1454)
* Removed xlmr transform class and instead used sequential for model transforms composition (1482)
* Added support for pre-trained Roberta encoder for base and large architecture 1491




Transforms, Tokenizers, Ops
* Added ToTensor and LabelToIndex transformations (1415)
* Added Truncate Transform (1458)
* Updated input annotation type to `Any` to support torch-scriptability during transform composability (1453)
* Added AddToken transform (1463)
* Added GPT-2 BPE pre-tokenizer operator leveraging re2 regex library (1459)
* Added Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models (1462)
* Migrated GPT-2 BPE tokenizer logic to C++ (1469)
* fix optionality of default arg in to_tensor (1475)
* added scriptable sequential transform (1481)
* Removed optionality of dtype in ToTensor (1492)
* Fixed max sequence length for xlmr transform (1495)
* add max_tokens kwarg to vocab factory (1525)
* Refactor vocab factory method to accept special tokens as a keyword argument (1436)
* Implemented ClipTokenizer that builds on top of GPT2BPETokenizer (1541)


Datasets

Migration of datasets on top of datapipes
* AG_NEWS (1498)
* AmazonReviewFull (1499)
* AmazonReviewPolarity (1490)
* DBpedia (1500)
* SogouNews (1503)
* YelpReviewFull (1507)
* YelpReviewPolarity (1509)
* YahooAnswers (1508)
* CoNLL2000Chunking (1515)
* UDPOS (1535)
* IWSLT2016 (1545)
* IWSLT2017 (1547)
* Multi30K (1536)
* SQuAD1 (1513)
* SQuAD2 (1514)
* PennTreebank (1511)
* WikiText103 (1518)
* WikiText2 (1519)
* EnWik9 (1512)
* IMDB (1531)

Newly added datasets
* SST2 (1538)
* CC-100 (1562)

Misc
* Fixed split filter logic in AmazonReviewPolarity (1505)
* Used os.path.join for consistency (1506)
* Fixed dataset test failures due to incorrect caching mode in AG_NEWS (1517)
* Added caching for extraction datapipe for AmazonReviewPolarity (1527)
* Added caching for extraction datapipe for Yahoo (1528)
* Added caching for extraction datapipe for yelp full (1529)
* Added caching for extraction datapipe for yelp polarity (1530)
* Added caching for extraction datapipe for DBPedia (1571)
* Added caching for extraction datapipe for SogouNews and AmazonReviewFull (1594)
* Fixed issues with extraction caching (1550, 1551, 1552)
* Updating Conll2000Chunking dataset to be consistent with other datasets (1590)
* [BC-breaking] removed unnecessary split argument from datasets (1591)


Improvements

Testing

Revamp TorchText dataset testing to use mocked data
* AG_NEWS (1553)
* AmazonReviewFull (1561)
* AmazonReviewPolarity (1532)
* DBpedia (1566)
* SogouNews (1576)
* YelpReviewFull (1567)
* YelpReviewPolarity (1567)
* YahooAnswers (1577)
* CoNLL2000Chunking (1570)
* UDPOS (1569)
* IWSLT2016 (1563)
* IWSLT2017 (1598)
* Multi30K (1554)
* SQuAD1 (1574)
* SQuAD2 (1575)
* PennTreebank (1578)
* WikiText103 (1592)
* WikiText2 (1592)
* EnWik9 (1560)
* IMDB (1579)
* SST2 (1542)
* CC-100 (1583)

Others
* Fixed attention mask testing (1439)
* Fixed CircleCI download failures on windows for XLM-R unit tests (1441)
* Added unit tests for testing model training (1449)
* Parameterized XLMR and Roberta model integration tests (1496)
* Removed redundant get asset functions from parameterized_utils file (1501)
* Parameterize jit and non-jit model integration tests (1502)
* fixed cache logic to work with datapipes (1522)
* Convert get_mock_dataset fn in AmazonReviewPolarity to be private (1543)
* Removing unused TEST_MODELS_PARAMETERIZED_ARGS constant from model test (1544)
* Removed real dataset caching and testing in favor of mocked dataset testing (1587)
* fixed platform-dependent expectation for Multi30k mocked test. (1593)
* Fixed Conll2000Chunking Test (1595)
* Updated IWSLT testing to start from compressed file (1596)
* Used unicode strings to test utf-8 handling for all non-IWSLT dataset tests. (1599)
* Parameterize tests for similar datasets (1600)

Examples
* non-distributed training example for SST-2 binary text classification data using XLM-Roberta model (1468)


Documentation

Dataset Documentation
* Updated docs for text classification and language modeling datasets (1603)
* Updated docs for Machine Translation, Sequence Tagging, Question Answer, Unsupervised Learning datasets (1597)
* Updated docs for CC100 and SST2 (1604)
* Update sphinx version, added rst files for models, transforms and functionals (1434)
* Removed experimental documentation (1457)
* Fix links in README (1461)
* Added sphinx based tutorial for SST-2 binary classification task using XLM-R model (1468)
* pointed to pytorch.org docs instead of outdated rtd link (1480)
* Added documentation describing XLM-R, the datasets it was trained on, and relevant license information (1497)
* Fixed CI doc build (1504)
* Remove example using next(...) from README (1516)

Misc

* Hide symbols when building third party code (1467)
* Add .DS_Store files to gitignore (1470)
* Remove Python 3.6 support as it has reached EOL (1484)
* Added .gitattributes file to hide generated circleci files in PRs (1485)
* Switched to use FileOpener from FileLoader (1488)
* Update python_requires in setup.py to reflect support for non-EOL python versions (1521)
* Added auto-formatters (1545)
* fix typo in torchtext/vocab/vocab_factory.py (1565)
* Formatted datasets and tests (1601, 1602)

0.11.2

This is a minor release compatible with [PyTorch 1.10.2](https://github.com/pytorch/pytorch/releases/tag/v1.10.2).

There are no feature changes in torchtext from 0.11.1. For the full set of features in v0.11.1, please refer to the [v0.11.1 release notes](https://github.com/pytorch/text/releases/tag/v0.11.0-rc3).

0.11.0

This is a relatively lightweight release while we are working on revamping the library. Users are encouraged to check various developments on the main branch.

Improvements
* Refactored C++ codebase to fix clang-tidy warnings and use emplace_back for improved performance (1327)
* Updated sentencepiece to v0.1.95 to make it compilable on M1 (1336)
* Up the priority of numpy array comparison in self.assertEqual (1341)
* Removed mentions of conda-forge as it is no longer necessary to build on python 3.9 (1345)
* Separated experimental tests to help remove them easily during release cycles (1348)
* Split the pybind and torchbind registrations into separate files and refactored Vocab modules to allow vocab to be used in a pure C++ environment (1352)
* Changed the default root directory for downloaded datasets to avoid dirtying the working directory (1361)
* Added method for logging module usage in fbcode (1367)
* Updated bug report file (1377)
* Renamed default branch to main (1378)
* Enabled the torchtext extension to work seamlessly between fbcode and open-source (1382)
* Migrated CircleCI docker image (1393)

Docs
* Fixed tag build so that adding a tag will trigger a documentation build-and-upload (1332)
* Minor doc-string fix in Multi30K dataset (1351)
* Fixed example in doc-string of get_vec_by_tokens (1383)
* Updated docs to point to main instead of deprecated master branch (1387)
* Changed various README.md links to point to main instead of master branch (1392)

Bug fix
* Fixed benchmark code that compares performance of vocab (1339)
* Fixed text classification example broken due to the removal of experimental datasets (1347)
* Fixed issue in IMDB dataset that resulted in all samples being positive depending on directory path (1354)
* Fixed doc building (1365)

0.11.0rc3
