Highlights
As PyTorch has continued to evolve, some code in torchtext has grown out of date with modern PyTorch modules (for example `torch.utils.data.DataLoader` and `torchscript`). In the 0.7.0 release, we are taking big steps toward modernizing torchtext and adding warning messages to these legacy components, which will be retired in the October 0.8.0 release. We are also introducing a host of new features, including:
1. A generalized `MultiheadAttentionContainer` for flexible attention behavior (see the sketch below this list)
2. Torchscript support for SentencePiece models
3. An end-to-end BERT example pipeline, including pretrained weights and a question answering fine-tuning example
4. The SQuAD1 and SQuAD2 question answering datasets
5. Windows support
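For item 1, the container splits attention into orthogonal, swappable pieces (input projections, an attention layer, and an output projection) instead of hard-coding them together. A minimal sketch of composing it, assuming the `torchtext.nn` import path and constructor arguments of the refactored container:

```python
import torch
from torchtext.nn import InProjContainer, MultiheadAttentionContainer, ScaledDotProduct

embed_dim, num_heads, bsz = 10, 5, 64

# Compose the container from three swappable building blocks.
in_proj = InProjContainer(torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim))
mha = MultiheadAttentionContainer(num_heads,
                                  in_proj,
                                  ScaledDotProduct(dropout=0.1),
                                  torch.nn.Linear(embed_dim, embed_dim))

query = torch.rand(21, bsz, embed_dim)        # (target length, batch, embed dim)
key = value = torch.rand(16, bsz, embed_dim)  # (source length, batch, embed dim)
attn_output, attn_weights = mha(query, key, value)
```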
Legacy code and issues
For a period of time (ending around June 2019), torchtext lacked active maintenance and fell out of step with current SOTA research and PyTorch features. We have committed to bringing the library fully up to date and have identified a few core issues:
* Several components and functionals were unclear and difficult to adopt. For example, the `Field` class coupled tokenization, vocabularies, splitting, batching and sampling, padding, and numericalization all together, and was opaque and confusing to users. We determined that these components should be divided into separate orthogonal building blocks. For example, it was difficult to use HuggingFace's tokenizers with the `Field` class (issue 609). Modular pipeline components would allow a third-party tokenizer to be swapped into the pipeline easily (see the sketch after this list).
* torchtext’s datasets were incompatible with [DataLoader](https://pytorch.org/docs/master/data.html#torch.utils.data.DataLoader) and [Sampler](https://pytorch.org/docs/master/data.html#torch.utils.data.Sampler) in [torch.utils.data](https://pytorch.org/docs/master/data.html#module-torch.utils.data), or even duplicated that code (e.g. `torchtext.data.Iterator`, `torchtext.data.Batch`). Basic inconsistencies confused users. For example, many struggled to fix the data order while using `Iterator` (issue 828), whereas with `DataLoader`, users can simply set `shuffle=False` to fix the data order.
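As a rough illustration of the modular design described above (the `tokenizer` keyword on the experimental datasets is an assumption here and may differ from the final API), a custom or third-party tokenizer can be passed directly into the new experimental datasets:

```python
from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import AG_NEWS

# Any callable that maps a string to a list of tokens can be dropped in,
# e.g. a thin wrapper around a HuggingFace or spaCy tokenizer.
my_tokenizer = get_tokenizer("basic_english")
train, test = AG_NEWS(tokenizer=my_tokenizer)
```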
We’ve addressed these issues in this release, and several legacy components are now ready to be retired:
* `torchtext.data.Batch` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/batch.py#L4))
* `torchtext.data.Field` ([link](https://github.com/pytorch/text/blob/master/torchtext/data/field.py))
* `torchtext.data.Iterator` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/iterator.py))
* `torchtext.data.Example` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/example.py#L5))
In the 0.7.0 release we are adding deprecation warnings, and in the October 0.8.0 release these components will be retired to the `torchtext.legacy` directory.
New dataset abstraction
Since the 0.4.0 release, we've been working on a new common interface for the torchtext datasets (inheriting from `torch.utils.data.Dataset`) to address the issues above, and have completed it for this release. For standard usage, we've created a map-style dataset that materializes the text iterator. A default dataset processing pipeline, including a tokenizer and vocabulary, is added to the map-style datasets to support one-command data loading.
```python
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
```
For those who want more flexibility, the raw text is still available as a `torch.utils.data.IterableDataset` by simply inserting `.raw` into the module path as follows.
```python
import torchtext.experimental.datasets.raw

train, test = torchtext.experimental.datasets.raw.AG_NEWS()
```
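For instance, the raw iterable can drive a fully custom preprocessing pipeline. A minimal sketch that builds a vocabulary with your own tokenizer, assuming the raw items are `(label, text)` pairs as in the classification example below:

```python
from collections import Counter

from torchtext.data.utils import get_tokenizer
from torchtext.experimental.datasets import raw
from torchtext.vocab import Vocab

tokenizer = get_tokenizer("basic_english")
train_raw, test_raw = raw.AG_NEWS()

# Count tokens over the raw training split and build a custom vocabulary from them.
counter = Counter()
for label, line in train_raw:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=1)
```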
Instead of maintaining the `Batch` and `Iterator` functionality in torchtext, the new dataset abstraction is fully compatible with `torch.utils.data.DataLoader`, as shown below. A `collate_fn` is used to process the data batches generated by `DataLoader`.
```python
from torch.utils.data import DataLoader

def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
```
With the new dataset abstraction in place, we worked together with the OSS community to rewrite the legacy datasets in torchtext. Here is a brief summary of the progress:
* Word language modeling datasets (WikiText2, WikiText103, PennTreeBank) 661, 774
* Text classification datasets (AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull) 701, 775, 776
* Sentiment analysis dataset (IMDb) 651
* Translation datasets (Multi30k, IWSLT, WMT14) 751, 821, 851
* Question-answer datasets (SQuAD1, SQuAD2) 773
* Sequence tagging datasets (UDPOS, CoNLL2000Chunking) 805
These new datasets live in the `torchtext.experimental.datasets` directory. The old versions of the datasets are still available in `torchtext.datasets`, and the new datasets are opt-in. In the 0.8.0 release, the old datasets will be moved to the `torchtext.legacy` directory.
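For example, one of the new question answering datasets can be loaded with the same one-command pattern; a minimal sketch, assuming it mirrors the other experimental datasets by returning the train and dev splits with a default tokenizer and vocabulary:

```python
from torchtext.experimental.datasets import SQuAD1

# Assumed to download SQuAD 1.1 and return the train and dev splits.
train, dev = SQuAD1()
```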
To learn how to apply the new dataset abstraction with `DataLoader` and SOTA PyTorch compatibility (such as Distributed Data Parallel), we created a full example that uses the new torchtext datasets (WikiText103, SQuAD1, etc.) to train a BERT model. A BERT model is pretrained on the masked language modeling and next sentence prediction tasks, and then fine-tuned on the question answering task. The example is available in the torchtext repo ([here](https://github.com/pytorch/text/tree/master/examples/BERT)).
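Because the map-style datasets are ordinary `torch.utils.data.Dataset` objects, they also compose with the standard distributed training utilities. A minimal sketch that reuses the AG_NEWS `train` dataset and `collate_fn` from the earlier example, assuming a `torch.distributed` process group has already been initialized:

```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# DistributedSampler shards the map-style dataset across processes; each
# replica then trains a model wrapped in DistributedDataParallel as usual.
sampler = DistributedSampler(train)
dataloader = DataLoader(train, batch_size=8, sampler=sampler, collate_fn=collate_fn)

for epoch in range(2):
    sampler.set_epoch(epoch)  # reshuffle the shards each epoch
    for texts, labels in dataloader:
        pass  # forward/backward pass on the DDP-wrapped model goes here
```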
Backwards Incompatible Changes
* Remove code specific to Python2 732
New Features
* Refactor nn.MultiheadAttention as MultiheadAttentionContainer in torchtext 720, 839, 883
* Pre-train BERT pipeline and fine-tune question-answer task 767
* Experimental datasets in torchtext.experimental.datasets (See New Dataset Abstraction section above for the full list) 701, 773, 774, 775, 776, 805, 821, 851
* Add Windows support for torchtext 772, 781, 789, 796, 807, 810, 829
* Add torchscript support to SentencePiece 755, 771, 786, 798, 799
Improvements
* Integrate [pytorch-probot](https://github.com/pytorch/pytorch-probot) into the repo 877
* Switch to PyTorch TestCase for built-in datasets 822
* Switch experimental ngrams_func to data.utils.ngrams_iterator 813
* Create root directory automatically for download_from_url if it does not exist 797
* Add shebang line to suppress the lint warning 787
* Switch to CircleCI and improve torchtext CI tests 744, 766, 768, 777, 783, 784, 794, 800, 801, 803, 809, 832, 837, 881, 888
* Put sacremoses tokenizer test back 782
* Update installation directions 763, 764, 769, 795
* Add CCI cache for test data 748
* Disable travis tests except for RUN_FLAKE8 747
* Disable Travis tests of which equivalent run on CCI 746
* Use 'cpu' instead of None for Iterator 745
* Remove the allow to fail statement in travis test 743
* Add explicit test results to text classification datasets 738
Docs
* Bump nightlies to 0.8.0 847
* Update README.rst file 735, 817
* Update the labels of docs in text classification datasets 734
Bug Fixes
None
Deprecations
Add deprecation warnings to legacy code 863. The following legacy components are ready to be retired:
* `torchtext.data.Batch` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/batch.py#L4))
* `torchtext.data.Field` ([link](https://github.com/pytorch/text/blob/master/torchtext/data/field.py))
* `torchtext.data.Iterator` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/iterator.py))
* `torchtext.data.Example` ([link](https://github.com/pytorch/text/blob/63129947e8c826ad91771c7310cac2f36040afae/torchtext/data/example.py#L5))
* `torchtext.datasets` ([link](https://github.com/pytorch/text/tree/master/torchtext/datasets))
In the 0.7.0 release we are adding deprecation warnings, and in the October 0.8.0 release these components will be retired to the `torchtext.legacy` directory.