Torchtext

Latest version: v0.18.0


0.10.1

0.10.0

Highlights

In this release, we introduce a new Vocab module that replaces the current Vocab class. The new Vocab provides common functional APIs for NLP workflows. This module is backed by an efficient C++ implementation that reduces look-up time by up to ~85% for batch look-up (refer to the summaries of 1248 and 1290 for further information on benchmarks) and provides support for TorchScript. We provide accompanying factory functions that can be used to build the Vocab object either from a Python ordered dictionary or from an iterator that yields lists of tokens.

Creating a Vocab from a text file:

```python
import io
from torchtext.vocab import build_vocab_from_iterator

# generator that yields lists of tokens
def yield_tokens(file_path):
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()

# get Vocab object (file_path is an illustrative path to a tokenized text file)
file_path = 'vocab.txt'
vocab_obj = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
```


Creating a Vocab through an ordered dict:

```python
from torchtext.vocab import vocab
from collections import Counter, OrderedDict

counter = Counter(["a", "a", "b", "b", "b"])
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab_obj = vocab(ordered_dict)
```


Common API usage:

```python
# look up the index of a single token
vocab_obj["a"]

# batch look-up of indices
vocab_obj.lookup_indices(["a", "b"])
# the forward API of PyTorch nn Modules is also supported
vocab_obj(["a", "b"])

# batch look-up of tokens
vocab_obj.lookup_tokens([0, 1])

# set the default index to return when a token is not found
vocab_obj.set_default_index(0)
vocab_obj["out_of_vocabulary"]  # prints 0

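The Highlights above also mention TorchScript support. As a brief, hedged sketch (not taken from the release notes), the new Vocab module should be scriptable like any other nn.Module:

```python
import torch

# script the Vocab module for use inside TorchScript pipelines
scripted_vocab = torch.jit.script(vocab_obj)
scripted_vocab(["a", "b"])  # same indices as the eager vocab_obj
```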

Backward Incompatible Changes

* We have retired the old Vocab class into the legacy folder (1289). Users relying on this class should be able to access it from torchtext.legacy. The Vocab module that replaces this class is not backward compatible. The most notable difference is that the Vectors object is not an attribute of the new Vocab object. We recommend that users use the build_vocab_from_iterator factory function to construct the new Vocab module; it provides initialization capabilities similar to those of the retired Vocab class.

```python
# retired Vocab class
from torchtext.legacy.vocab import Vocab as retired_vocab
from collections import Counter

tokens_list = ["a", "a", "b", "b", "b"]
counter = Counter(tokens_list)
vocab_obj = retired_vocab(counter, specials=["<unk>", "<pad>"], specials_first=True)

# new Vocab module
from torchtext.vocab import build_vocab_from_iterator
vocab_obj = build_vocab_from_iterator([tokens_list], specials=["<unk>", "<pad>"], specials_first=True)
```


* Removed the legacy batch from the torchtext.data package (1307) that was kept around for backward compatibility reasons. Users can still access it from the torchtext.legacy.data package.

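A minimal migration sketch for the relocated class (assuming this release):

```python
# before this release: from torchtext.data import Batch
# the legacy Batch class now lives in the legacy namespace
from torchtext.legacy.data import Batch
```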


New Features

* Introduced a functional to convert iterable-style datasets to map-style datasets (1299)

```python
from torchtext.datasets import IMDB
from torchtext.data import to_map_style_dataset

train_iter = IMDB(split='train')
# convert the iterable-style dataset to a map-style dataset
train_dataset = to_map_style_dataset(train_iter)
```

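A map-style dataset supports len() and integer indexing, which the shuffling samplers in the PyTorch DataLoader rely on. A brief illustrative follow-up (not from the release notes):

```python
print(len(train_dataset))  # number of examples in the IMDB train split
print(train_dataset[0])    # random access to the first (label, text) pair
```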

* Introduced a functional to filter raw Wikipedia XML dumps (1292)

```python
from torchtext.data.functional import filter_wikipedia_xml
from torchtext.datasets import EnWik9

data_iter = EnWik9(split='train')
# filter data according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
filter_data_iter = filter_wikipedia_xml(data_iter)
```


* Introduced the Multi30k dataset ([1306](https://github.com/pytorch/text/pull/1306))

```python
# dataset for http://www.statmt.org/wmt16/multimodal-task.html#task1
from torchtext.datasets import Multi30k

train_data, valid_data, test_data = Multi30k()
next(train_data)
# prints the following:
# ('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
#  'Two young, White males are outside near many bushes.\n')
```


* Introduced new Vocab module and associated factory functions (1289, 1297, 1302, 1304, 1308, 1309, 1310)

Improvements

* Separated experimental and legacy tests into separate subfolders (1285)
* Stored the md5 hash instead of the raw text data for built-in dataset testing (1261)
* Cleaned up CircleCI cache handling and optimized the daily cache (1236, 1238)
* Fixed a CircleCI caching issue that occurred when a new dataset was added (1314)
* Organized datasets by name in the root folder and moved common file-reading functions into dataset_utils (1233)
* Added a unit test to verify the name property of the raw datasets (1234)
* Enabled autoescape in the jinja2 environment for select extensions (1277)
* Used yaml.safe_load instead of yaml.load (1278)
* Added defusedxml to parse untrusted XML data (1279)
* Added CodeQL and Bandit security checks as GitHub Actions (1266)
* Added benchmark code to compare the Vocab module with a Python dict for batch look-up time (1290); a toy illustration follows this list
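
As a toy illustration of the kind of comparison in 1290 (the token data and repetition counts here are illustrative, not the benchmark's):

```python
import timeit

# batch look-up on a plain Python dict vs. the new Vocab module
tokens = ["a", "b"] * 1000
py_dict = {"a": 0, "b": 1}
print(timeit.timeit(lambda: [py_dict[t] for t in tokens], number=100))
print(timeit.timeit(lambda: vocab_obj.lookup_indices(tokens), number=100))
```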

Documentation

* Fixed docs for nn modules (1267)
* Stored artifacts of rendered docs so that they can be checked on each PR (1288)
* Added Google Analytics support (1287)

Bug Fix

* Fixed import issue in text classification example (1256)
* Fixed and re-organized data pipeline example (1250)

Performance

* Used c10::string_view and the fastText dictionary inside the C++ kernel of the Vocab module (1248)

0.9.1rc1

Highlights

This is a minor release following PyTorch 1.8.1. Please refer to the torchtext 0.9.0 release notes for more details.

0.9.0rc5

Highlights

In this release, we're updating torchtext's datasets to be compatible with the PyTorch DataLoader and deprecating torchtext's own data-loading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text 664. These new datasets are simple string-by-string iterators over the data, rather than the previous custom set of abstractions such as `Field`. The legacy datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance on migrating from the legacy abstractions to the modern PyTorch data utilities, please refer to our migration guide ([link](https://github.com/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb)).

The following raw text datasets are available as replacements for the legacy datasets. These datasets are iterators that yield the raw text data line by line. To apply them in NLP workflows, please refer to the end-to-end tutorial for the text classification task ([link](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html)); a minimal DataLoader sketch also follows the list below.
- Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
- Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
- Sequence tagging: UDPOS, CoNLL2000Chunking
- Translation: IWSLT2016, IWSLT2017
- Question answer: SQuAD1, SQuAD2
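
Since the raw datasets are plain iterators over the data, they can be passed directly to a standard PyTorch DataLoader. A minimal sketch, assuming the AG_NEWS dataset from this release (the batch size and the identity collate function are illustrative choices):

```python
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

train_iter = AG_NEWS(split='train')
# each yielded element is a raw (label, text) pair
dataloader = DataLoader(train_iter, batch_size=8, collate_fn=list)
first_batch = next(iter(dataloader))  # a list of eight (label, text) pairs
```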

We also add Python 3.9 support in this release.

Backwards Incompatible

Current users of the legacy code will experience BC breakage, as we have retired the legacy code (1172, 1181, 1183). The legacy components are placed in the `torchtext.legacy.data` folder as follows:
- `torchtext.data.Pipeline` -> `torchtext.legacy.data.Pipeline`
- `torchtext.data.Batch` -> `torchtext.legacy.data.Batch`
- `torchtext.data.Example` -> `torchtext.legacy.data.Example`
- `torchtext.data.Field` -> `torchtext.legacy.data.Field`
- `torchtext.data.Iterator` -> `torchtext.legacy.data.Iterator`
- `torchtext.data.Dataset` -> `torchtext.legacy.data.Dataset`

This means all features are still available, but within `torchtext.legacy` instead of `torchtext`.

Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release

Category | Legacy | 0.9.0 release
-- | -- | --
Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2
  | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103
  | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank
  | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9
Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS
  | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews
  | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia
  | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity
  | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull
  | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers
  | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity
  | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull
  | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB
  | torchtext.legacy.datasets.SST | deferred
  | torchtext.legacy.datasets.TREC | deferred
Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS
  | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking
Translation | torchtext.legacy.datasets.WMT14 | deferred
  | torchtext.legacy.datasets.Multi30k | deferred
  | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017
Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred
  | torchtext.legacy.datasets.SNLI | deferred
  | torchtext.legacy.datasets.MultiNLI | deferred
Question Answer | torchtext.legacy.datasets.BABI20 | deferred

Improvements
- Enable importing `metrics`/`utils`/`functional` from `torchtext.legacy.data` (1229)
- Set up daily caching mechanism with Master job (1219)
- Make the functions in datasets_utils.py private (1224)
- Resolve the download folder for some raw datasets (1213)
- Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (1204)
- Fix the total number of lines in doc strings of the datasets (1200)
- Extend CI tests to cover all the datasets (1197, 1201, 1171)
- Document the number of lines in the dataset splits (1196)
- Add hashes to skip the slow extraction if the extracted files are available (1195)
- Use decorator to loop over the split argument in the datasets (1194)
- Remove offset option from `torchtext.datasets`, and move `torchtext.datasets.common` to `torchtext.data.dataset_utils` (1188, 1145)
- Remove the step to clean up the cache in `test_iwslt()` (1192)
- Split IWSLT dataset into IWSLT2016 and IWSLT2017 dataset and re-organize the parameters in the constructors (1191, 1209)
- Move the prototype datasets in `torchtext.experimental.datasets.raw` folder to `torchtext.datasets` folder (1182, 1202, 1207, 1211, 1212)
- Add a decorator `add_docstring_header()` to generate docstring (1185)
- Add the EnWik9 dataset (1184)
- Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (1178)
- Split raw datasets into individual files (1156, 1173, 1174, 1175, 1176)
- Extend the unittest coverage for all the raw datasets (1157, 1149)
- Define the relative path of the datasets in the `download_from_url()` func and skip unnecessary download if the downloaded files are detected (1158, 1155)
- Add `MD5` and `NUM_LINES` as the meta information in the `__init__` file of `torchtext.datasets` folder (1155)
- Standardize the text dataset doc strings and argument order. (1151)
- Report the “exceeds quota” error for the datasets using Google drive links (1150)
- Add support for the string-typed split values to the text datasets (1147)
- Rename the argument data_select to split in the dataset constructor (1143)
- Add Python 3.9 support across Linux, MacOS, and Windows platforms (1139)
- Switch to the new URL for the IWSLT dataset (1115)
- Extend the language shortcut in the `torchtext.data.utils.get_tokenizer` func to the full model name when spaCy tokenizers are loaded (1140); see the sketch after this list
- Fix broken CI tests caused by the spaCy 3.0 release (1138)
- Pass an embedding layer to the constructor of the BertModel class in the BERT example (1135)
- Fix test warnings by switching to `assertEqual()` in PyTorch TestCase class (1086)
- Improve CircleCI tests and conda package (1128, 1121, 1120, 1106)
- Simplify TorchScript registration by adopting `TORCH_LIBRARY_FRAGMENT` macro (1102)
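
A brief sketch of the spaCy shortcut expansion mentioned above, assuming spaCy and its en_core_web_sm model are installed (the sample sentence is illustrative):

```python
from torchtext.data.utils import get_tokenizer

# "spacy" tokenizers are now loaded via the full spaCy model name
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
tokens = tokenizer("The quick brown fox jumps over the lazy dog.")
```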

Bug Fixes
- Fix the total number of returned lines in `setup_iter()` func in `RawTextIterableDataset` (1142)

Docs
- Add number of classes to doc strings for text classification data (1230)
- Remove Lato font for `pytorch/text` website (1227)
- Add the migration tutorial (1203, 1216, 1222)
- Remove the legacy examples on pytorch/text website (1206)
- Update README file for 0.9.0 release (1198)
- Add CI check to detect undocumented parameters (1167)
- Add a static text link for the package version in the doc website (1161)
- Fix sphinx warnings and turn warnings into errors (1163)
- Add the text datasets to torchtext website (1153)
- Add the constructor document for IMDB and SST datasets (1118)
- Fix typos in the README file (1089)
- Rename "Arguments" to "Args" in the doc strings (1110)
- Build docs and push to gh-pages on nightly basis (1105, 1111, 1112)

0.8.1

Highlights

Updated pinned PyTorch version to 1.7.1 and added Python 3.9 support.

Improvement
* Added Python 3.9 support (1088)
* Added certifi for the Windows unittest environment (1077)
* Added a setup version to pin the torch dependency (1067)

Docs
* Updated the doc strings for torchtext.nn.InProjContainer (1083)
* Updated the doc strings for torchtext.nn.MultiheadAttentionContainer (1057)

0.8.0rc2

This is a relatively light release while we are working on revamping the library. According to the [PyTorch feature classification changes](https://pytorch.org/blog/pytorch-feature-classification-changes/), the new building blocks and datasets in the experimental folder are classified as **Prototype** and are available in the nightly release only. Once the prototype building blocks are mature enough, we will release them together with all the relevant commits in a beta release. In the meantime, users are encouraged to take a look at those building blocks and give us feedback. An easy way to send your feedback is to open an issue in the pytorch/text repo or to comment in [Issue #664](https://github.com/pytorch/text/issues/664). For details regarding the revamp execution, see [Issue #985](https://github.com/pytorch/text/issues/985).

The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command.

```bash
pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
```

For more detailed instructions, please refer to [Install PyTorch](https://pytorch.org/get-started/locally/). It should be noted that the new building blocks are still under development and the APIs have not been solidified.

This stable release includes a few feature improvements and documentation updates. Compiled against the PyTorch 1.7.0 release, the stable release packages are available via Pip and Conda for Windows, Linux, and Mac.
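
A minimal install command for the stable packages (a sketch; the explicit version pins are an illustrative choice):

```bash
pip install torch==1.7.0 torchtext==0.8.0
```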


Improvements

* Updated the BERT pipeline to improve the question-answer task score (950)
* Fixed the order of the datasets used in the BERT example (1040)
* Skipped requests.get in the `download_from_url` function if the path exists (922)
* Used Ninja to build extensions and disabled the C++11 ABI when necessary for libtorch compatibility (931)
* Removed SentencePiece from the setup.py file. The SentencePiece source code is now used as a third-party library in torchtext (1055)
* Improved the CircleCI setup for better engineering
  * Switched the PyTorch binary location for CI unittests (1044)
  * Parameterized UPLOAD_CHANNEL (1037)
  * Installed binaries for the CI tests directly from the CPU channel (1025, 981)
  * Added dataclasses to the dependencies for environment.yml (964)
  * Bumped Xcode workers to 9.4.1 (951)
  * Disabled glove tests due to URL breakage (920)
  * Used the specific channel for the CI tests (907)

Docs

* Added a test and updated the error message for the `load_sp_model` function in `torchtext.data.functional` (984)
* Updated the README file in the BERT example (899)
* Updated the legacy retirement message (1047)
* Updated the index page to include links to PyTorch libraries and describe feature classification (1048)
* Cleaned up the doc strings (1049)
* Fixed the clang-format version to match what PyTorch uses (1052)
* Added OSX environment variables to the README file (1054)
* Updated the README file for the prototype in the nightly release (1050)

Bug Fixes

* Fixed the order of the datasets used in the BERT example (1040)
