## Major Features and Improvements
* New APIs proposed in [RFC: End-to-end text preprocessing with TF.Text (#283)](https://github.com/tensorflow/community/pull/283) have been added, including:
- `Splitter`
- `RegexSplitter`
- `StateBasedSentenceBreaker`
- `Trimmer`
- `WaterfallTrimmer`
- `RoundRobinTrimmer`
- `ItemSelector`
- `RandomItemSelector`
- `FirstNItemSelector`
- `MaskValuesChooser`
- `mask_language_model()`
- `combine_segments()`
- `pad_model_inputs()`
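As a rough illustration of how the two trimmer strategies above differ, here is a pure-Python sketch of the waterfall vs. round-robin budgeting idea (this is an illustrative reimplementation, not the TF.Text code; the segment contents and budget are made up):

```python
# Pure-Python sketch of the two trimming strategies: a "trimmer" reduces a
# list of token segments so their combined length fits a total budget.

def waterfall_trim(segments, max_tokens):
    """Spend the budget left-to-right: earlier segments keep as much as possible."""
    out, remaining = [], max_tokens
    for seg in segments:
        keep = min(len(seg), remaining)
        out.append(seg[:keep])
        remaining -= keep
    return out

def round_robin_trim(segments, max_tokens):
    """Distribute the budget one token at a time across all segments."""
    counts = [0] * len(segments)
    remaining = max_tokens
    while remaining > 0:
        progressed = False
        for i, seg in enumerate(segments):
            if remaining > 0 and counts[i] < len(seg):
                counts[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every segment is exhausted
            break
    return [seg[:c] for seg, c in zip(segments, counts)]

a = ["the", "quick", "brown", "fox"]
b = ["hello", "world"]
print(waterfall_trim([a, b], 5))    # a keeps 4 tokens, b keeps 1
print(round_robin_trim([a, b], 5))  # a keeps 3 tokens, b keeps 2
```

The library versions operate on ragged tensors of segments per batch element, but the budgeting logic they implement is the same shape as this sketch.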
* Windows support!
* Released our first TF Hub module for Chinese segmentation! Please visit the hub module page [here](https://tfhub.dev/google/zh_segmentation/1) for more info including instructions on how to use the model.
* Added `Splitter` / `SplitterWithOffsets` abstract base classes. These are meant to replace the current `Tokenizer` / `TokenizerWithOffsets` base classes. The `Tokenizer` base classes will continue to work and will implement these new `Splitter` base classes. The reasoning behind the change is to prevent confusion when future splitting operations that use this interface do not tokenize into words (e.g., they split into sentences or subwords).
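The relationship between the base classes can be sketched as follows (a minimal illustration of the hierarchy described above, not the actual TF.Text class bodies; the `WhitespaceTokenizer` here is a toy concrete example):

```python
import abc

class Splitter(abc.ABC):
    """Base class for ops that split text into pieces of any granularity."""
    @abc.abstractmethod
    def split(self, text):
        """Split text into pieces (words, sentences, subwords, ...)."""

class Tokenizer(Splitter):
    """Tokenizers keep working: tokenizing is one kind of splitting."""
    def split(self, text):
        return self.tokenize(text)

    @abc.abstractmethod
    def tokenize(self, text):
        """Split text specifically into tokens."""

class WhitespaceTokenizer(Tokenizer):  # toy concrete example
    def tokenize(self, text):
        return text.split()

print(WhitespaceTokenizer().split("hello world"))  # ['hello', 'world']
```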
* With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that `offset_end` is a positional value rather than a length.
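The distinction can be shown with a small example (the offset values here are illustrative): an "end" offset is a position into the original string, so slicing with it recovers the token directly.

```python
# End offsets are positions, not lengths: token i spans
# text[offset_start[i]:offset_end[i]] in the original string.
text = "hello world"
tokens = [("hello", 0, 5), ("world", 6, 11)]  # (token, offset_start, offset_end)
for tok, start, end in tokens:
    assert text[start:end] == tok
print("all offsets check out")
```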
* Added new `HubModuleSplitter` that helps handle ragged tensor inputs and outputs for hub modules which implement the Splitter class.
* Added new `SplitMergeFromLogitsTokenizer` which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
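The split/merge idea can be sketched in pure Python (an illustrative toy, not the `SplitMergeFromLogitsTokenizer` implementation; the real op consumes per-character logits from a model, and the logit values below are made up):

```python
# For each character, a model emits a logit: positive means "start a new
# word here" (split), non-positive means "attach to the previous word" (merge).

def segment_from_logits(text, logits):
    words = []
    for ch, logit in zip(text, logits):
        if logit > 0 or not words:  # split: begin a new word
            words.append(ch)
        else:                       # merge: extend the current word
            words[-1] += ch
    return words

# Toy example: segment a 4-character string into two 2-character "words".
print(segment_from_logits("abcd", [1.0, -1.0, 1.0, -1.0]))  # ['ab', 'cd']
```

This per-character split/merge formulation is what makes the approach a good fit for Chinese segmentation, where word boundaries are not marked by whitespace.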
* Added `normalize_utf8_with_offsets` and `find_source_offsets` ops.
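The idea behind these ops can be sketched as follows (a toy pure-Python version, not the TF.Text implementation: the real ops track UTF-8 byte offsets, while this sketch tracks character offsets and applies a single normalization rule, "ß" → "ss"):

```python
# Normalize a string while recording, for each position in the normalized
# output, the position in the original string it came from. That mapping is
# what lets offsets computed on normalized text be traced back to the source.

def normalize_with_offsets(text):
    normalized, source_offsets = [], []
    for i, ch in enumerate(text):
        expansion = "ss" if ch == "ß" else ch  # toy single-rule normalization
        for out_ch in expansion:
            normalized.append(out_ch)
            source_offsets.append(i)  # every output char maps back to char i
    return "".join(normalized), source_offsets

norm, offsets = normalize_with_offsets("straße")
print(norm)                     # strasse
print(offsets[4], offsets[5])   # both map back to position 4 (the "ß")
```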
* Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
* Added `string_to_id` to `SentencepieceTokenizer`.
* Added support for Android builds.
* RegexSplit op now caches regular expressions between calls.
## Bug Fixes and Other Changes
* Add a minimal `count_words` function to `wordpiece_vocabulary_learner`.
* Test cleanup: use `assertAllEqual(expected, actual)` instead of `(actual, expected)` for better error messages.
* Add dep on `tensorflow_hub` in `pip_package/setup.py`.
* Add filegroup BUILD target for test_data segmentation Hub module.
* Extend documentation for class HubModuleSplitter.
* Read SP model file in bytes mode in tests.
* Update intro.ipynb colab.
* Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
* Update StateBasedSentenceBreaker handling of text input tensors.
* Reduce over-broad dependencies in regex_split library.
* Fix broken builds.
* Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
* Update README regarding versions.
* Fixed a bug in `WordpieceTokenizer` so the end offset is preserved when a long unknown token is found.
* Convert non-tensor inputs to tensors in the pad-along-dimension op.
* Note in the build instructions that coreutils must be installed when building on macOS.
* Add `long` and `long long` overloads for RegexSplit so the C++ API is TF-agnostic.
* Add `Splitter` / `SplitterWithOffsets` abstract base classes.
* Update `setup.py`: TensorFlow has switched to making GPU support the default package, with users explicitly opting in when they want CPU-only.
* Change variable names for token offsets: "limit" -> "end".
* Fix presubmit failures on macOS.
* Allow dense tensor inputs for RegexSplit.
* Fix imports in tools/.
* BertTokenizer: Error out if the user passes a `normalization_form` that will be ignored.
* Update documentation for Sentencepiece.tokenize_with_offsets.
* Let WordpieceTokenizer read vocabulary files.
* Numerous build improvements / adjustments (mostly to support Windows):
  - Patch out googletest & glog dependencies from Sentencepiece.
  - Switch to using Bazel's internal patching.
  - Build ICU data statically for Windows.
  - Remove reliance on tf_kernel_library.
  - Patch TF to fix problematic Python executable searching.
  - Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.
## Thanks to our Contributors
This release contains contributions from many people at Google, as well as:
Pranay Joshi, Siddharths8212376, Vincent Bodin