tensorflow-text

Latest version: v2.16.1

2.5.0rc0

Major Features and Improvements

* Add a subwords tokenizer tutorial to text/examples.
* Add a function to generate a BERT vocab from a tf.data.Dataset.
* Add `detokenize` methods for `BertTokenizer` and `WordpieceTokenizer` (see the sketch after this list).
* Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
* Enable NFD and NFKD in the NormalizeWithOffset op.
* Add an i18n-friendly BasicTokenizer that can preserve accents.
* Create guide for tokenizers.
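
To illustrate the new `detokenize` methods, here is a minimal sketch of a `BertTokenizer` round trip; the toy vocabulary and the `vocab.txt` path are hypothetical, used only to keep the sketch self-contained.

```python
import tensorflow_text as text

# Write a toy WordPiece vocabulary so the sketch is self-contained.
vocab = ["[UNK]", "[CLS]", "[SEP]", "hello", "world", "##s"]
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = text.BertTokenizer("vocab.txt", lower_case=True)
token_ids = tokenizer.tokenize(["hello worlds"])  # RaggedTensor of wordpiece ids
words = tokenizer.detokenize(token_ids)           # new: ids back to words
print(words)  # expected: [[b'hello', b'worlds']]
```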

Breaking Changes

Bug Fixes and Other Changes

* For Windows, always include ICU data files since they must be built in statically.
* Patch TF so Windows builds do not look for a python3 executable.
* Rename the documentation file WordShape.md to WordShape_cls.md. On macOS (and possibly Windows) this filename collides with wordshape.md because the filesystem does not differentiate case. This is purely a quality-of-life change for anyone checking out the library on a non-Linux platform. Fixes #361.
* Convert input to tensor to allow numpy inputs to the state-based sentence breaker.
* Add classifiers to py packages and fix header image.
* Fix bad rendering of the `add_eos` and `add_bos` descriptions in SentencepieceTokenizer.md.
* Fix for the model server test. Make sure our test tensors have the expected
* Update regression test for break_sentences_with_offsets.
* Add a `shape` attribute to the `ToDense` Keras layer (see the sketch after this list).
* Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker
* Fix for the model server test. The result of the tokenize() method of
* Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.
* Update documentation for SplitMergeFromLogitsTokenizer
* Add regression test for Find Source Offsets
* Fix `unselectable_ids` shape check in ItemSelector.
* Change two tests to debug a failure on the Kokoro Windows build.
* Switch out architecture image in tf.Text documentation.
* Fix regression test for state_based_sentence_breaker_v2
* Update run_build with enable_runfiles flag.
* Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
* Simplify tf-text WORKSPACE, by relying on tf_workspace().
* Update transformer.ipynb to use a saved `text.BertTokenizer`
* Fix typos.
* Update mobile targets to use :mobile rather than separate :android & :ios targets.
* Make tools part of the `tensorflow_text` pip package.
* Import tools from the tf-text package, instead of cloning the git repo.
* Minor cleanups to make some code compile on the android build system.
* Fix pip install command in readme
* Fix `tools` pip package inclusion.
* Clear outputs
* A tensorflow.org-compatible docs generator for tf-text.
* Formatting fixes for tensorflow.org
* Sample random tokens correctly during MLM.
* Internal repo change
* Treat Sentencepiece ops as stateful in tf.data pipelines.
* Reduce the critical section range. Because the options are
* Replace use of TFT's deprecated `dataset_schema.from_feature_spec` with its replacement, `schema_utils.schema_from_feature_spec`.
* Update the guide with the new template.
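
As a quick illustration of the `shape` attribute noted above, here is a minimal sketch; the layer path follows the tf.text Keras API, and the `shape` value chosen here is an assumption for demonstration.

```python
import tensorflow as tf
import tensorflow_text as text

# ToDense converts ragged (or sparse) inputs to padded dense tensors inside
# a Keras model; the `shape` attribute pins the padded output shape.
layer = text.keras.layers.ToDense(pad_value=0, shape=[None, 4])  # shape value assumed
ragged = tf.ragged.constant([[1, 2, 3], [4]])
print(layer(ragged))  # dense [2, 4] tensor, short rows padded with 0
```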


Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Rens, Samuel Marks, thuang513

2.4.3

Bug Fixes and Other Changes

* Fix export as saved model of hub_module_splitter
* Fix bug in regex_split_with_offsets when input.ragged_rank > 1
* Convert input to tensor to allow numpy inputs in the state-based sentence breaker.
* Add more classifiers to py packages.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

fsx950223

2.4.2

Major Features and Improvements

* We are now building a nightly package - `tensorflow-text-nightly`. This is available for Linux immediately, with other platforms to be added soon.

Bug Fixes and Other Changes

* Fixes a bug which prevented the sentence_fragmenter from being able to process tensors with a rank > 1.
* Update documentation filenames to prevent collisions when checking out the code on filesystems that do not have case sensitivity.

2.4.1

Major Features and Improvements

* New APIs proposed in [RFC: End-to-end text preprocessing with TF.Text 283](https://github.com/tensorflow/community/pull/283) have been added (see the sketch after this list), including:
- `Splitter`
- `RegexSplitter`
- `StateBasedSentenceBreaker`
- `Trimmer`
- `WaterfallTrimmer`
- `RoundRobinTrimmer`
- `ItemSelector`
- `RandomItemSelector`
- `FirstNItemSelector`
- `MaskValuesChooser`
- `mask_language_model()`
- `combine_segments()`
- `pad_model_inputs()`
* Windows support!
* Released our first TF Hub module for Chinese segmentation! Please visit the hub module page [here](https://tfhub.dev/google/zh_segmentation/1) for more info including instructions on how to use the model.
* Added `Splitter` / `SplitterWithOffsets` abstract base classes. These are meant to replace the current `Tokenizer` / `TokenizerWithOffsets` base classes. The `Tokenizer` base classes will continue to work and will implement these new `Splitter` base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface split into units other than words (sentences, subwords, etc.).
* With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that `offset_end` is a positional value rather than a length.
* Added new `HubModuleSplitter` that helps handle ragged tensor inputs and outputs for hub modules which implement the Splitter class.
* Added new `SplitMergeFromLogitsTokenizer` which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
* Added `normalize_utf8_with_offsets` and `find_source_offsets` ops.
* Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
* Added string_to_id to SentencepieceTokenizer.
* Support Android build.
* RegexSplit op now caches regular expressions between calls.
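
Below is a minimal sketch of two of the helpers listed above, `combine_segments()` and `pad_model_inputs()`; the [CLS]/[SEP] ids (101/102) are assumed BERT-style values, not anything the API requires.

```python
import tensorflow as tf
import tensorflow_text as text

# Two already-tokenized ragged segments (e.g. sentence pairs).
seg_a = tf.ragged.constant([[10, 11, 12], [20, 21]])
seg_b = tf.ragged.constant([[31], [41, 42]])

# Join segments with [CLS]/[SEP] ids (101/102 assumed); also returns segment ids.
ids, segment_ids = text.combine_segments(
    [seg_a, seg_b], start_of_sequence_id=101, end_of_segment_id=102)

# Pad or truncate to a fixed length; the second result is the attention mask.
padded_ids, mask = text.pad_model_inputs(ids, max_seq_length=8)
```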

Bug Fixes and Other Changes

* Add a minimal count_words function to wordpiece_vocabulary_learner.
* Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
* Add dep on tensorflow_hub in pip_package/setup.py
* Add filegroup BUILD target for test_data segmentation Hub module.
* Extend documentation for class HubModuleSplitter.
* Read SP model file in bytes mode in tests.
* Update intro.ipynb colab.
* Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
* Update StateBasedSentenceBreaker handling of text input tensors.
* Reduce over-broad dependencies in regex_split library.
* Fix broken builds.
* Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
* Update README regarding versions.
* Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
* Convert non-tensor inputs in pad along dimension op.
* Add a note to the build instructions that coreutils must be installed when building on macOS.
* Add `long` and `long long` overloads for RegexSplit so the C++ API stays TF-version agnostic.
* Add `Splitter` / `SplitterWithOffsets` abstract base classes.
* Update setup.py. TensorFlow has switched to making the GPU build the default package, with users explicitly opting in when they want CPU-only.
* Change variable names for token offsets: "limit" -> "end".
* Fix presubmit failures on macOS.
* Allow dense tensor inputs for RegexSplit.
* Fix imports in tools/.
* BertTokenizer: Error out if the user passes a `normalization_form` that will be ignored.
* Update documentation for Sentencepiece.tokenize_with_offsets.
* Let WordpieceTokenizer read vocabulary files (see the sketch after this list).
* Numerous build improvements / adjustments (mostly to support Windows):
  * Patch out googletest & glog dependencies from Sentencepiece.
  * Switch to using Bazel's internal patching.
  * ICU data is built statically for Windows.
  * Remove reliance on tf_kernel_library.
  * Patch TF to fix problematic Python executable searching.
  * Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.
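
A minimal sketch of the vocabulary-file support mentioned above; the toy vocab and `vocab.txt` path are hypothetical, used only to keep the sketch self-contained.

```python
import tensorflow as tf
import tensorflow_text as text

# Hypothetical vocab file; previously a lookup table had to be built by hand.
with open("vocab.txt", "w") as f:
    f.write("\n".join(["[UNK]", "hello", "world", "##s"]))

tokenizer = text.WordpieceTokenizer("vocab.txt", token_out_type=tf.string)
print(tokenizer.tokenize(["hello", "worlds"]))
# expected: [[b'hello'], [b'world', b'##s']]
```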

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

2.4.0rc1

Major Features and Improvements

* Windows support!
* Released our first TF Hub module for Chinese segmentation! Please visit the hub module page [here](https://tfhub.dev/google/zh_segmentation/1) for more info including instructions on how to use the model.
* Added `Splitter` / `SplitterWithOffsets` abstract base classes. These are meant to replace the current `Tokenizer` / `TokenizerWithOffsets` base classes. The `Tokenizer` base classes will continue to work and will implement these new `Splitter` base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface split into units other than words (sentences, subwords, etc.).
* With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that `offset_end` is a positional value rather than a length.
* Added new `HubModuleSplitter` that helps handle ragged tensor inputs and outputs for hub modules which implement the Splitter class.
* Added new `SplitMergeFromLogitsTokenizer` which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
* Added `normalize_utf8_with_offsets` and `find_source_offsets` ops (see the sketch after this list).
* Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
* Added string_to_id to SentencepieceTokenizer.
* Support Android build.
* RegexSplit op now caches regular expressions between calls.
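
A minimal sketch of the offset-mapping ops mentioned above; in the Python API the normalize op surfaces as `normalize_utf8_with_offsets_map`, whose offsets map feeds `find_source_offsets`.

```python
import tensorflow as tf
import tensorflow_text as text

# Normalize and keep an offsets map for projecting post-normalization
# offsets back onto the original strings.
normalized, offsets_map = text.normalize_utf8_with_offsets_map(
    ["ｔｅｘｔ"], normalization_form="NFKC")

# Map offsets 0..2 in the normalized text back to source byte offsets.
source_offsets = text.find_source_offsets(
    offsets_map, tf.ragged.constant([[0, 1, 2]]))
print(normalized, source_offsets)
```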

Bug Fixes and Other Changes

* Add a minimal count_words function to wordpiece_vocabulary_learner.
* Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
* Add dep on tensorflow_hub in pip_package/setup.py
* Add filegroup BUILD target for test_data segmentation Hub module.
* Extend documentation for class HubModuleSplitter.
* Read SP model file in bytes mode in tests.
* Update intro.ipynb colab.
* Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
* Update StateBasedSentenceBreaker handling of text input tensors.
* Reduce over-broad dependencies in regex_split library.
* Fix broken builds.
* Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
* Update README regarding versions.
* Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
* Convert non-tensor inputs in pad along dimension op.
* Add a note to the build instructions that coreutils must be installed when building on macOS.
* Add `long` and `long long` overloads for RegexSplit so the C++ API stays TF-version agnostic.
* Add `Splitter` / `SplitterWithOffsets` abstract base classes.
* Update setup.py. TensorFlow has switched to making the GPU build the default package, with users explicitly opting in when they want CPU-only.
* Change variable names for token offsets: "limit" -> "end".
* Fix presubmit failures on macOS.
* Allow dense tensor inputs for RegexSplit.
* Fix imports in tools/.
* BertTokenizer: Error out if the user passes a `normalization_form` that will be ignored.
* Update documentation for Sentencepiece.tokenize_with_offsets.
* Let WordpieceTokenizer read vocabulary files.
* Numerous build improvements / adjustments (mostly to support Windows):
  * Patch out googletest & glog dependencies from Sentencepiece.
  * Switch to using Bazel's internal patching.
  * ICU data is built statically for Windows.
  * Remove reliance on tf_kernel_library.
  * Patch TF to fix problematic Python executable searching.
  * Various other updates to .bazelrc, build_pip_package, and configuration to support Windows.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin

2.4.0rc0

Major Features and Improvements

* Released our first TF Hub module for Chinese segmentation! Please visit the hub module page [here](https://tfhub.dev/google/zh_segmentation/1) for more info, including instructions on how to use the model (see the sketch after this list).
* Added `Splitter` / `SplitterWithOffsets` abstract base classes. These are meant to replace the current `Tokenizer` / `TokenizerWithOffsets` base classes. The `Tokenizer` base classes will continue to work and will implement these new `Splitter` base classes. The reasoning behind the change is to prevent confusion when future splitting operations that also use this interface split into units other than words (sentences, subwords, etc.).
* With this cleanup of terminology, we've also updated the documentation and internal variable names for token offsets to use "end" instead of "limit". This is purely a documentation change and doesn't affect any current APIs, but we feel it more clearly expresses that `offset_end` is a positional value rather than a length.
* Added new `HubModuleSplitter` that helps handle ragged tensor inputs and outputs for hub modules which implement the Splitter class.
* Added new `SplitMergeFromLogitsTokenizer` which is a narrowly focused tokenizer that splits text based on logits from a model. This is used with the newly released Chinese segmentation model.
* Added `normalize_utf8_with_offsets` and `find_source_offsets` ops.
* Added benchmarking for tokenizers and other ops. Allows for comparisons of dense vs ragged and TF1 vs TF2.
* Added string_to_id to SentencepieceTokenizer.
* Support Android build.
* Support Windows build (Py3.6 & Py3.7 this release).
* RegexSplit op now caches regular expressions between calls.
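
A minimal sketch of driving the released segmentation module through `HubModuleSplitter`; this downloads the Hub model and requires `tensorflow_hub` to be installed, and the printed output is illustrative.

```python
import tensorflow_text as text  # tensorflow_hub must also be installed

# Load the Chinese segmentation Hub module and split raw strings into words.
segmenter = text.HubModuleSplitter("https://tfhub.dev/google/zh_segmentation/1")
tokens = segmenter.split(["新华社北京"])
print(tokens)  # a RaggedTensor of segmented words, e.g. [[b'新华社', b'北京']]
```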

Bug Fixes and Other Changes

* Test cleanup - use assertAllEqual(expected, actual), instead of (actual, expected), for better error messages.
* Add dep on tensorflow_hub in pip_package/setup.py
* Add filegroup BUILD target for test_data segmentation Hub module.
* Extend documentation for class HubModuleSplitter.
* Read SP model file in bytes mode in tests.
* Update intro.ipynb colab.
* Track the Sentencepiece model resource via a TrackableResource so it can be saved within Keras layers.
* Update StateBasedSentenceBreaker handling of text input tensors.
* Reduce over-broad dependencies in regex_split library.
* Fix broken builds.
* Fix comparison between signed and unsigned int in FindNextFragmentBoundary.
* Update README regarding versions.
* Fixed bug in WordpieceTokenizer so end offset is preserved when an unknown token of long size is found.
* Convert non-tensor inputs in pad along dimension op.
* Add a note to the build instructions that coreutils must be installed when building on macOS.
* Add `long` and `long long` overloads for RegexSplit so the C++ API stays TF-version agnostic.
* Add `Splitter` / `SplitterWithOffsets` abstract base classes.
* Update setup.py. TensorFlow has switched to making the GPU build the default package, with users explicitly opting in when they want CPU-only.
* Change variable names for token offsets: "limit" -> "end".
* Fix presubmit failures on macOS.
* Allow dense tensor inputs for RegexSplit.
* Fix imports in tools/.
* BertTokenizer: Error out if the user passes a `normalization_form` that will be ignored.
* Update documentation for Sentencepiece.tokenize_with_offsets.
* Let WordpieceTokenizer read vocabulary files.

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Pranay Joshi, Siddharths8212376, Vincent Bodin
