Major Features and Improvements
* New `tft.word_count` mapper to compute the number of tokens in each row
  (for pre-tokenized strings).
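  A minimal sketch, assuming a raw string feature `'text'` that is tokenized
  first (names hypothetical):

  ```python
  import tensorflow as tf
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # tft.word_count expects pre-tokenized input, so split the raw strings
    # into a SparseTensor of tokens first.
    tokens = tf.compat.v1.string_split(inputs['text'])
    return {'num_tokens': tft.word_count(tokens)}
  ```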
* All `tft.scale_to_*` mappers now have per-key variants, along with analyzers
for `mean_and_var_per_key` and `min_and_max_per_key`.
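  A minimal sketch of a per-key scaler, with hypothetical `'price'` and
  `'store_id'` features:

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Scale 'price' to a z-score computed within each 'store_id' group.
    return {
        'price_scaled': tft.scale_to_z_score_per_key(
            inputs['price'], key=inputs['store_id']),
    }
  ```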
* New `tft_beam.AnalyzeDatasetWithCache` allows analyzing ranges of data while
  producing and consuming cache. `tft.analyzer_cache` can help read and write
  such cache to a filesystem between runs. This caching is especially useful
  when repeatedly analyzing a rolling range of data in a continuous pipeline.
  This is an experimental feature.
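  A rough, hypothetical sketch of the intended flow. All names here are
  placeholders, and the exact input structure and helper signatures should be
  taken from the `tft_beam.analyzer_cache` documentation:

  ```python
  import apache_beam as beam
  import tensorflow_transform.beam as tft_beam

  with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir='/tmp/tft_tmp'):
      # Read cache written by previous runs for the given dataset keys.
      input_cache = pipeline | tft_beam.analyzer_cache.ReadAnalysisCacheFromFS(
          cache_base_dir, dataset_keys)
      # Analyze, reusing cached results where possible and emitting new cache.
      transform_fn, output_cache = (
          ((per_dataset_pcolls, input_metadata), input_cache)
          | tft_beam.AnalyzeDatasetWithCache(preprocessing_fn))
      # Persist the updated cache for the next run.
      _ = output_cache | tft_beam.analyzer_cache.WriteAnalysisCacheToFS(
          pipeline, cache_base_dir)
  ```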
* Added `reduce_instance_dims` support to `tft.quantiles` and `elementwise`
  support to `tft.bucketize`, while avoiding separate Beam calls for each
  feature.
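  A minimal sketch of elementwise bucketization, assuming a multi-dimensional
  numeric feature `'x'` (name hypothetical):

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Bucketize every dimension of a multi-dimensional feature independently,
    # in a single Beam pass rather than one per dimension.
    return {'x_bucketized': tft.bucketize(
        inputs['x'], num_buckets=10, elementwise=True)}
  ```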
Bug Fixes and Other Changes
* `sparse_tensor_to_dense_with_shape` now accepts an optional `default_value`
parameter.
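  A minimal sketch, with a hypothetical sparse feature and shape:

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Densify a sparse feature, filling missing positions with -1.
    dense = tft.sparse_tensor_to_dense_with_shape(
        inputs['sparse_feature'], shape=[None, 10], default_value=-1)
    return {'dense_feature': dense}
  ```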
* `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support
`fingerprint_shuffle` to sort the vocabularies by fingerprint instead of
counts. This is useful for load balancing the training parameter servers.
This is an experimental feature.
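  A minimal sketch, assuming a hypothetical string feature `'terms'`:

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Order vocabulary entries by fingerprint instead of frequency, which can
    # spread hot entries across training parameter servers (experimental).
    ids = tft.compute_and_apply_vocabulary(
        inputs['terms'], fingerprint_shuffle=True)
    return {'term_ids': ids}
  ```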
* Fix numerical instability in `tft.vocabulary` mutual information calculations.
* `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support computing
vocabularies over integer categoricals and multivalent input features, and
computing mutual information for non-binary labels.
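  A minimal sketch, assuming hypothetical integer `'category'` and `'label'`
  features:

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # 'category' may be an integer feature; passing a (possibly non-binary)
    # label ranks vocabulary entries by mutual information with that label.
    tft.vocabulary(inputs['category'], labels=inputs['label'],
                   vocab_filename='category_vocab')
    return {'category': inputs['category']}
  ```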
* New numeric normalization method available:
`tft.apply_buckets_with_interpolation`.
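  A minimal sketch combining `tft.quantiles` with the new mapper (feature name
  hypothetical):

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Compute quantile boundaries, then linearly interpolate each value
    # within its bucket to produce a normalized value in [0, 1].
    boundaries = tft.quantiles(inputs['x'], num_buckets=10, epsilon=0.01)
    return {'x_normalized': tft.apply_buckets_with_interpolation(
        inputs['x'], boundaries)}
  ```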
* Changes to make this library more compatible with TensorFlow 2.0.
* Fix sanitization of vocabulary filenames.
* Emit a friendly error message when context isn't set.
* Analyzer output dtypes are now enforced to be TensorFlow dtypes; by
  extension, `ptransform_analyzer`'s `output_dtypes` must be a list of
  TensorFlow dtypes.
* Make `tft.apply_buckets_with_interpolation` support SparseTensors.
* Added an experimental API for analyzers to annotate the post-transform
  schema.
* `TFTransformOutput.transform_raw_features` now accepts an optional
`drop_unused_features` parameter to exclude unused features in output.
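  A minimal sketch; `transform_output_dir` and `raw_features` are
  placeholders:

  ```python
  import tensorflow_transform as tft

  # 'transform_output_dir' points at the output of a prior analysis run, and
  # 'raw_features' is a dict of raw input tensors (both placeholders).
  tft_output = tft.TFTransformOutput(transform_output_dir)
  transformed = tft_output.transform_raw_features(
      raw_features, drop_unused_features=True)
  ```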
* If not specified, the `min_diff_from_avg` parameter of `tft.vocabulary` now
  defaults to a reasonable value based on the size of the dataset (relevant
  only when computing vocabularies using mutual information).
* Convert some `tf.contrib` functions to be compatible with TF 2.0.
* New `tft.bag_of_words` mapper to compute the unique set of ngrams for each row
(for pre-tokenized strings).
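  A minimal sketch, assuming a raw string feature `'text'` (names
  hypothetical):

  ```python
  import tensorflow as tf
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    tokens = tf.compat.v1.string_split(inputs['text'])
    # Unique unigrams and bigrams per row, joined with a space separator.
    return {'bow': tft.bag_of_words(tokens, ngram_range=(1, 2), separator=' ')}
  ```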
* Fixed a bug in `tf_utils.reduce_batch_count_mean_and_var` (and, by
  extension, the `mean_and_var` analyzer) that miscalculated variance in the
  sparse `elementwise=True` case.
* Added a test utility, `tft_unit.cross_named_parameters`, for creating
  parameterized tests that involve the Cartesian product of various
  parameters.
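  A hypothetical sketch; the exact import path for `tft_unit` may differ:

  ```python
  import tensorflow as tf
  from tensorflow_transform.beam import tft_unit  # import path may differ

  # Cross two lists of named parameter dicts into their cartesian product;
  # each resulting case combines one dict from each list.
  CASES = tft_unit.cross_named_parameters(
      [dict(testcase_name='int64', input_dtype=tf.int64),
       dict(testcase_name='float32', input_dtype=tf.float32)],
      [dict(testcase_name='dense', is_sparse=False),
       dict(testcase_name='sparse', is_sparse=True)])
  ```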
* Depends on `tensorflow-metadata>=0.14,<0.15`.
* Depends on `apache-beam[gcp]>=2.14,<3`.
* Depends on `numpy>=1.16,<2`.
* Depends on `absl-py>=0.7,<2`.
* Allow `preprocessing_fn` to emit a `tf.RaggedTensor`. In this case, the
  output `Schema` proto cannot be converted to a feature spec, so the output
  data cannot be materialized with `tft.coders`.
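  A minimal sketch, assuming a raw string feature `'text'`:

  ```python
  import tensorflow as tf

  def preprocessing_fn(inputs):
    sparse_tokens = tf.compat.v1.string_split(inputs['text'])
    # Emitting a RaggedTensor is allowed, but the resulting Schema has no
    # feature-spec equivalent, so tft.coders can't materialize this output.
    return {'tokens': tf.RaggedTensor.from_sparse(sparse_tokens)}
  ```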
* Added the ability to directly set an exact `num_buckets` with the new
  `always_return_num_quantiles` parameter for `analyzers.quantiles` and
  `mappers.bucketize`; it defaults to `False` in general but `True` when
  `reduce_instance_dims` is `False`.
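  A minimal sketch, assuming a numeric feature `'x'` (name hypothetical):

  ```python
  import tensorflow_transform as tft

  def preprocessing_fn(inputs):
    # Guarantee exactly num_buckets buckets, even if the data would
    # otherwise produce fewer distinct quantile boundaries.
    return {'x_bucketized': tft.bucketize(
        inputs['x'], num_buckets=10, always_return_num_quantiles=True)}
  ```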
Breaking Changes
* `tf_utils.reduce_batch_count_mean_and_var`, which feeds into
`tft.mean_and_var`, now returns 0 instead of inf for empty columns of a
sparse tensor.
* `tensorflow_transform.tf_metadata.dataset_schema.Schema` class is removed.
Wherever a `dataset_schema.Schema` was used, users should now provide a
`tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. For backwards
compatibility, `dataset_schema.Schema` is now a factory method that produces
a `Schema` proto. Updating code should be straightforward because the
`dataset_schema.Schema` class was already a wrapper around the `Schema` proto.
* Only explicitly public analyzers are exported to the `tft` module; for
  example, combiners are no longer exported and must be accessed directly
  through `tft.analyzers`.
* Requires pre-installed TensorFlow >=1.14,<2.
Deprecations
* `DatasetSchema` is now a deprecated factory method (see above).
* `tft.tf_metadata.dataset_schema.from_feature_spec` is now deprecated.
Equivalent functionality is provided by
`tft.tf_metadata.schema_utils.schema_from_feature_spec`.
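  A minimal migration sketch using a hypothetical feature spec:

  ```python
  import tensorflow as tf
  from tensorflow_transform.tf_metadata import schema_utils

  # Build a tensorflow_metadata Schema proto directly from a feature spec,
  # replacing the deprecated dataset_schema.from_feature_spec.
  schema = schema_utils.schema_from_feature_spec(
      {'x': tf.io.FixedLenFeature([], tf.float32),
       'label': tf.io.FixedLenFeature([], tf.int64)})
  ```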