## Major Features and Improvements
* New APIs proposed in [RFC: End-to-end text preprocessing with TF.Text (#283)](https://github.com/tensorflow/community/pull/283) have been added, including:
- `Splitter`
- `RegexSplitter`
- `StateBasedSentenceBreaker`
- `Trimmer`
- `WaterfallTrimmer`
- `RoundRobinTrimmer`
- `ItemSelector`
- `RandomItemSelector`
- `FirstNItemSelector`
- `MaskValuesChooser`
- `mask_language_model()`
- `combine_segments()`
- `pad_model_inputs()`
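As a rough illustration of how the two trimmer strategies above differ, here is a pure-Python sketch of the waterfall vs. round-robin budgeting idea (this is an illustrative reimplementation, not the TF.Text code; the segment contents and budget are made up):

```python
# Pure-Python sketch of the two trimming strategies: a "trimmer" reduces a
# list of token segments so their combined length fits a total budget.

def waterfall_trim(segments, max_tokens):
    """Spend the budget left-to-right: earlier segments keep as much as possible."""
    out, remaining = [], max_tokens
    for seg in segments:
        keep = min(len(seg), remaining)
        out.append(seg[:keep])
        remaining -= keep
    return out

def round_robin_trim(segments, max_tokens):
    """Distribute the budget one token at a time across all segments."""
    counts = [0] * len(segments)
    remaining = max_tokens
    while remaining > 0:
        progressed = False
        for i, seg in enumerate(segments):
            if remaining > 0 and counts[i] < len(seg):
                counts[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every segment is exhausted
            break
    return [seg[:c] for seg, c in zip(segments, counts)]

a = ["the", "quick", "brown", "fox"]
b = ["hello", "world"]
print(waterfall_trim([a, b], 5))    # a keeps 4 tokens, b keeps 1
print(round_robin_trim([a, b], 5))  # a keeps 3 tokens, b keeps 2
```

The library versions operate on ragged tensors of segments per batch element, but the budgeting logic they implement is the same shape as this sketch.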
* Windows support!
* Released our first TF Hub module for Chinese segmentation! Please visit the hub module page [here](https://tfhub.dev/google/zh_segmentation/1) for more info including instructions on how to use the model.
* Added `Splitter` / `SplitterWithOffsets` abstract base classes. These are meant to replace the current `Tokenizer` / `TokenizerWithOffsets` base classes. The `Tokenizer` base classes will continue to work and will implement these new `Splitter` base classes. The reasoning behind the change is to prevent confusion when future splitting operations that use this interface do not tokenize into words (e.g., they split into sentences or subwords).
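The relationship between the base classes can be sketched as follows (a minimal illustration of the hierarchy described above, not the actual TF.Text class bodies; the `WhitespaceTokenizer` here is a toy concrete example):

```python
import abc

class Splitter(abc.ABC):
    """Base class for ops that split text into pieces of any granularity."""
    @abc.abstractmethod
    def split(self, text):
        """Split text into pieces (words, sentences, subwords, ...)."""

class Tokenizer(Splitter):
    """Tokenizers keep working: tokenizing is one kind of splitting."""
    def split(self, text):
        return self.tokenize(text)

    @abc.abstractmethod
    def tokenize(self, text):
        """Split text specifically into tokens."""

class WhitespaceTokenizer(Tokenizer):  # toy concrete example
    def tokenize(self, text):
        return text.split()

print(WhitespaceTokenizer().split("hello world"))  # ['hello', 'world']
```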
* With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that `offset_end` is a positional value rather than a length.
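The distinction can be shown with a small example (the offset values here are illustrative): an "end" offset is a position into the original string, so slicing with it recovers the token directly.

```python
# End offsets are positions, not lengths: token i spans
# text[offset_start[i]:offset_end[i]] in the original string.
text = "hello world"
tokens = [("hello", 0, 5), ("world", 6, 11)]  # (token, offset_start, offset_end)
for tok, start, end in tokens:
    assert text[start:end] == tok
print("all offsets check out")
```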
* Added new `HubModuleSplitter` that helps handle ragged tensor inputs and outputs for hub modules which implement the Splitter class.
* Added new `SplitMergeFromLogitsTokenizer` which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
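The split/merge idea can be sketched in pure Python (an illustrative toy, not the `SplitMergeFromLogitsTokenizer` implementation; the real op consumes per-character logits from a model, and the logit values below are made up):

```python
# For each character, a model emits a logit: positive means "start a new
# word here" (split), non-positive means "attach to the previous word" (merge).

def segment_from_logits(text, logits):
    words = []
    for ch, logit in zip(text, logits):
        if logit > 0 or not words:  # split: begin a new word
            words.append(ch)
        else:                       # merge: extend the current word
            words[-1] += ch
    return words

# Toy example: segment a 4-character string into two 2-character "words".
print(segment_from_logits("abcd", [1.0, -1.0, 1.0, -1.0]))  # ['ab', 'cd']
```

This per-character split/merge formulation is what makes the approach a good fit for Chinese segmentation, where word boundaries are not marked by whitespace.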
* Added `normalize_utf8_with_offsets` and `find_source_offsets` ops.
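The idea behind these ops can be sketched as follows (a toy pure-Python version, not the TF.Text implementation: the real ops track UTF-8 byte offsets, while this sketch tracks character offsets and applies a single normalization rule, "ß" → "ss"):

```python
# Normalize a string while recording, for each position in the normalized
# output, the position in the original string it came from. That mapping is
# what lets offsets computed on normalized text be traced back to the source.

def normalize_with_offsets(text):
    normalized, source_offsets = [], []
    for i, ch in enumerate(text):
        expansion = "ss" if ch == "ß" else ch  # toy single-rule normalization
        for out_ch in expansion:
            normalized.append(out_ch)
            source_offsets.append(i)  # every output char maps back to char i
    return "".join(normalized), source_offsets

norm, offsets = normalize_with_offsets("straße")
print(norm)                     # strasse
print(offsets[4], offsets[5])   # both map back to position 4 (the "ß")
```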
* Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
* Added `string_to_id` to `SentencepieceTokenizer`.
* Added support for Android builds.
* RegexSplit op now caches regular expressions between calls.
## Bug Fixes and Other Changes
* Add a minimal `count_words` function to `wordpiece_vocabulary_learner`.
* Test cleanup: use `assertAllEqual(expected, actual)` instead of `(actual, expected)` for better error messages.
* Add dep on `tensorflow_hub` in `pip_package/setup.py`.
* Add filegroup BUILD target for test_data segmentation Hub module.
* Extend documentation for class HubModuleSplitter.
* Read SP model file in bytes mode in tests.
* Update intro.ipynb colab.
* Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
* Update StateBasedSentenceBreaker handling of text input tensors.
* Reduce over-broad dependencies in regex_split library.
* Fix broken builds.
* Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
* Update README regarding versions.
* Fixed a bug in `WordpieceTokenizer` so the end offset is preserved when a long unknown token is found.
* Convert non-tensor inputs to tensors in the pad-along-dimension op.
* Note in the build instructions that coreutils must be installed when building on macOS.
* Add `long` and `long long` overloads for RegexSplit so the C++ API is TF-agnostic.
* Add `Splitter` / `SplitterWithOffsets` abstract base classes.
* Update `setup.py`: TensorFlow has switched to making GPU support the default package, with users explicitly opting in when they want CPU-only.
* Change variable names for token offsets: "limit" -> "end".
* Fix presubmit failures on macOS.
* Allow dense tensor inputs for RegexSplit.
* Fix imports in tools/.
* BertTokenizer: Error out if the user passes a `normalization_form` that will be ignored.
* Update documentation for Sentencepiece.tokenize_with_offsets.
* Let WordpieceTokenizer read vocabulary files.
* Numerous build improvements / adjustments (mostly to support Windows):
  - Patch out googletest & glog dependencies from Sentencepiece.
  - Switch to using Bazel's internal patching.
  - Build ICU data statically for Windows.
  - Remove reliance on tf_kernel_library.
  - Patch TF to fix problematic Python executable searching.
  - Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.
## Thanks to our Contributors
This release contains contributions from many people at Google, as well as:
Pranay Joshi, Siddharths8212376, Vincent Bodin