Highlights
As PyTorch has continued to evolve, some code in torchtext has grown out of date with modern PyTorch modules (for example `torch.utils.data.DataLoader` and `torchscript`). In the 0.7.0 release, we are taking big steps toward modernizing torchtext and adding warning messages to these legacy components, which will be retired in the October 0.8.0 release. We are also introducing a host of new features, including:
1. A generalized `MultiheadAttentionContainer` for flexible attention behavior (see the sketch below this list)
2. Torchscript support for SentencePiece models
3. An end-to-end BERT example pipeline, including pretrained weights and a question answering fine-tuning example
4. The SQuAD1 and SQuAD2 question answering datasets
5. Windows support
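For item 1, the container splits attention into orthogonal, swappable pieces (input projections, an attention layer, and an output projection) instead of hard-coding them together. A minimal sketch of composing it, assuming the `torchtext.nn` import path and constructor arguments of the refactored container:

```python
import torch
from torchtext.nn import InProjContainer, MultiheadAttentionContainer, ScaledDotProduct

embed_dim, num_heads, bsz = 10, 5, 64

# Compose the container from three swappable building blocks.
in_proj = InProjContainer(torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim))
mha = MultiheadAttentionContainer(num_heads,
                                  in_proj,
                                  ScaledDotProduct(dropout=0.1),
                                  torch.nn.Linear(embed_dim, embed_dim))

query = torch.rand(21, bsz, embed_dim)        # (target length, batch, embed dim)
key = value = torch.rand(16, bsz, embed_dim)  # (source length, batch, embed dim)
attn_output, attn_weights = mha(query, key, value)
```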
Legacy code and issues
For a period of time (ending around June 2019), torchtext lacked active maintenance and fell out of step with current SOTA research and PyTorch features. We have committed to bringing the library fully up to date and have identified a few core issues:
* Several components and functionals were unclear and difficult to adopt. For example, the `Field` class coupled tokenization, vocabularies, splitting, batching and sampling, padding, and numericalization all together, and was opaque and confusing to users. We determined that these components should be divided into separate orthogonal building blocks. For example, it was difficult to use HuggingFace's tokenizers with the `Field` class (issue 609). Modular pipeline components would allow a third-party tokenizer to be swapped into the pipeline easily (see the sketch after this list).
* torchtext’s datasets were incompatible with [DataLoader](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) and [Sampler](https://pytorch.org/docs/master/data.html#torch.utils.data.Sampler) in [torch.utils.data](https://pytorch.org/docs/master/data.html#module-torch.utils.data), or even duplicated that code (e.g. `torchtext.data.Iterator`, `torchtext.data.Batch`). Basic inconsistencies confused users. For example, many struggled to fix the data order while using `Iterator` (issue 828), whereas with `DataLoader`, users can simply set `shuffle=False` to fix the data order.
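As a rough illustration of the modular design described above (the `tokenizer` keyword on the experimental datasets is an assumption here and may differ from the final API), a custom or third-party tokenizer can be passed directly into the new experimental datasets:

```python
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import AG_NEWS

# Any callable that maps a string to a list of tokens can be dropped in,
# e.g. a thin wrapper around a HuggingFace or spaCy tokenizer.
my_tokenizer = get_tokenizer("basic_english")
train, test = AG_NEWS(tokenizer=my_tokenizer)
```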
We’ve addressed these issues in this release, and several legacy components are now ready to be retired:
* `torchtext.data.Batch` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/batch.py#L4))
* `torchtext.data.Field` ([link](https://github.com/pytorch/text/blob/master/torchtext/data/field.py))
* `torchtext.data.Iterator` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/iterator.py))
* `torchtext.data.Example` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/example.py#L5))
In the 0.7.0 release we are adding deprecation warnings, and in the October 0.8.0 release these components will be retired to the `torchtext.legacy` directory.
New dataset abstraction
Since the 0.4.0 release, we've been working on a new common interface for the torchtext datasets (inheriting from `torch.utils.data.Dataset`) to address the issues above, and have completed it for this release. For standard usage, we've created a map-style dataset that materializes the text iterator. A default dataset processing pipeline, including a tokenizer and vocabulary, is added to the map-style datasets to support one-command data loading.
```python
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
```
For those who want more flexibility, the raw text is still available as a `torch.utils.data.IterableDataset` by simply inserting `.raw` into the module path as follows.
```python
import torchtext.experimental.datasets.raw

train, test = torchtext.experimental.datasets.raw.AG_NEWS()
```
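For instance, the raw iterable can drive a fully custom preprocessing pipeline. A minimal sketch that builds a vocabulary with your own tokenizer, assuming the raw items are `(label, text)` pairs as in the classification example below:

```python
from collections import Counter

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import raw
from torchtext.vocab import Vocab

tokenizer = get_tokenizer("basic_english")
train_raw, test_raw = raw.AG_NEWS()

# Count tokens over the raw training split and build a custom vocabulary from them.
counter = Counter()
for label, line in train_raw:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=1)
```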
Instead of maintaining the `Batch` and `Iterator` functionality in torchtext, the new dataset abstraction is fully compatible with `torch.utils.data.DataLoader`, as shown below. A `collate_fn` is used to process the data batches generated by `DataLoader`.
```python
from torch.utils.data import DataLoader

def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
```
With the new dataset abstraction in place, we worked together with the OSS community to rewrite the legacy datasets in torchtext. Here is a brief summary of the progress:
* Word language modeling datasets (WikiText2, WikiText103, PennTreeBank) 661, 774
* Text classification datasets (AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull) 701, 775, 776
* Sentiment analysis dataset (IMDb) 651
* Translation datasets (Multi30k, IWSLT, WMT14) 751, 821, 851
* Question-answer datasets (SQuAD1, SQuAD2) 773
* Sequence tagging datasets (UDPOS, CoNLL2000Chunking) 805
These new datasets live in the `torchtext.experimental.datasets` directory. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in. In the 0.8.0 release, the old datasets will be moved to the `torchtext.legacy` directory.
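For example, one of the new question answering datasets can be loaded with the same one-command pattern; a minimal sketch, assuming it mirrors the other experimental datasets by returning the train and dev splits with a default tokenizer and vocabulary:

```python
from torchtext.experimental.datasets import SQuAD1

# Assumed to download SQuAD 1.1 and return the train and dev splits.
train, dev = SQuAD1()
```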
To learn how to apply the new dataset abstraction with `DataLoader` and SOTA PyTorch compatibility (such as Distributed Data Parallel), we created a full example that uses the new torchtext datasets (WikiText103, SQuAD1, etc.) to train a BERT model. A BERT model is pretrained on the masked language modeling and next sentence prediction tasks, and then fine-tuned on the question answering task. The example is available in the torchtext repo ([here](https://github.com/pytorch/text/tree/master/examples/BERT)).
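Because the map-style datasets are ordinary `torch.utils.data.Dataset` objects, they also compose with the standard distributed training utilities. A minimal sketch that reuses the AG_NEWS `train` dataset and `collate_fn` from the earlier example, assuming a `torch.distributed` process group has already been initialized:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# DistributedSampler shards the map-style dataset across processes; each
# replica then trains a model wrapped in DistributedDataParallel as usual.
sampler = DistributedSampler(train)
dataloader = DataLoader(train, batch_size=8, sampler=sampler, collate_fn=collate_fn)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for texts, labels in dataloader:
        pass  # forward/backward pass on the DDP-wrapped model goes here
```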
Backwards Incompatible Changes
* Remove code specific to Python2 732
New Features
* Refactor nn.MultiheadAttention as MultiheadAttentionContainer in torchtext 720, 839, 883
* Pre-train BERT pipeline and fine-tune question-answer task 767
* Experimental datasets in torchtext.experimental.datasets (See New Dataset Abstraction section above for the full list) 701, 773, 774, 775, 776, 805, 821, 851
* Add Windows support for torchtext 772, 781, 789, 796, 807, 810, 829
* Add torchscript support to SentencePiece 755, 771, 786, 798, 799
Improvements
* Integrate [pytorch-probot](https://github.com/pytorch/pytorch-probot) into the repo 877
* Switch to PyTorch TestCase for built-in datasets 822
* Switch experimental ngrams_func to data.utils.ngrams_iterator 813
* Create root directory automatically for download_from_url if it does not exist 797
* Add shebang line to suppress the lint warning 787
* Switch to CircleCI and improve torchtext CI tests 744, 766, 768, 777, 783, 784, 794, 800, 801, 803, 809, 832, 837, 881, 888
* Put sacremoses tokenizer test back 782
* Update installation directions 763, 764, 769, 795
* Add CCI cache for test data 748
* Disable travis tests except for RUN_FLAKE8 747
* Disable Travis tests of which equivalent run on CCI 746
* Use 'cpu' instead of None for Iterator 745
* Remove the allow to fail statement in travis test 743
* Add explicit test results to text classification datasets 738
Docs
* Bump nightlies to 0.8.0 847
* Update README.rst file 735, 817
* Update the labels of docs in text classification datasets 734
Bug Fixes
None
Deprecations
Add deprecation warnings to legacy code 863. The following legacy components are ready to be retired:
* `torchtext.data.Batch` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/batch.py#L4))
* `torchtext.data.Field` ([link](https://github.com/pytorch/text/blob/master/torchtext/data/field.py))
* `torchtext.data.Iterator` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/iterator.py))
* `torchtext.data.Example` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/example.py#L5))
* `torchtext.datasets` ([link](https://github.com/pytorch/text/tree/master/torchtext/datasets))
In the 0.7.0 release we are adding deprecation warnings, and in the October 0.8.0 release these components will be retired to the `torchtext.legacy` directory.