Major Features and Improvements
* Performance improvement due to optimizing inner loops.
* Add support for time semantic domain related statistics.
* Performance improvement due to batching accumulators before merging.
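The accumulator-batching optimization can be sketched in plain Python. This is an illustrative combiner, not TFDV's actual implementation; `merge_in_batches` and `merge_counts` are hypothetical names:

```python
def merge_in_batches(accumulators, merge_fn, batch_size=10):
    """Merge accumulators a batch at a time instead of pairwise.

    Handing several accumulators to each merge call amortizes the
    per-merge overhead, which is the idea behind the optimization above.
    """
    merged = accumulators[0]
    for start in range(1, len(accumulators), batch_size):
        batch = accumulators[start:start + batch_size]
        merged = merge_fn([merged] + batch)
    return merged

# Example: accumulators that are running (count, total) pairs.
def merge_counts(accs):
    return (sum(a[0] for a in accs), sum(a[1] for a in accs))

total = merge_in_batches([(1, i) for i in range(100)], merge_counts)
# total == (100, 4950)
```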
* Add utility method `validate_examples_in_tfrecord`, which identifies anomalous
examples in TFRecord files containing TFExamples and generates statistics for
those anomalous examples.
* Add utility method `validate_examples_in_csv`, which identifies anomalous
examples in CSV files and generates statistics for those anomalous examples.
* Add fast TF example decoder written in C++.
* Make `BasicStatsGenerator` take an Apache Arrow table as input. Example
  batches are converted to Arrow tables internally, which lets the computation
  use vectorized numpy functions. This improved the performance of
  `BasicStatsGenerator` by ~40x.
* Make `TopKUniquesStatsGenerator` and `TopKUniquesCombinerStatsGenerator`
  take an Arrow table as input.
* Add `update_schema` API which updates the schema to conform to statistics.
* Add support for validating changes in the number of examples between the
current and previous spans of data (using the existing `validate_statistics`
function).
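Conceptually, the cross-span check compares the current span's example count against the previous span's. The sketch below only illustrates that idea in plain Python; TFDV performs the actual check inside `validate_statistics` when previous-span statistics are supplied, and `num_examples_changed_too_much` and `max_ratio_change` are hypothetical names:

```python
def num_examples_changed_too_much(curr_n, prev_n, max_ratio_change=2.0):
    """Flag when the example count grows or shrinks beyond a ratio threshold.

    `max_ratio_change` is an illustrative knob, not a TFDV option.
    """
    if prev_n == 0:
        return curr_n != 0
    ratio = curr_n / float(prev_n)
    return ratio > max_ratio_change or ratio < 1.0 / max_ratio_change

flagged = num_examples_changed_too_much(5000, 1000)  # True: 5x growth
```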
* Support building a manylinux2010 compliant wheel in docker.
* Add support for cross feature statistics.
Bug Fixes and Other Changes
* Expand unit test coverage.
* Update natural language stats generator to generate stats when the actual
  ratio equals `match_ratio`.
* Use `__slots__` in accumulators.
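Declaring `__slots__` drops the per-instance `__dict__`, shrinking the many small accumulator objects created during combining. A minimal sketch (illustrative only; `CountAccumulator` is a hypothetical class, not TFDV's accumulator):

```python
class CountAccumulator(object):
    """Accumulator using __slots__ to avoid a per-instance __dict__."""
    __slots__ = ['count', 'total']

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

acc = CountAccumulator()
acc.update(3.0)
# With __slots__, no instance __dict__ is allocated:
has_dict = hasattr(acc, '__dict__')  # False
```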
* Fix overflow warning when generating numeric stats for large integers.
* Set the max value count in the schema when the feature has the same valency
  in every example, thereby inferring the shape of multivalent required
  features.
* Fix divide by zero error in natural language stats generator.
* Add `load_anomalies_text` and `write_anomalies_text` utility functions.
* Define `ReasonFeatureNeeded` proto.
* Add support for Windows OS.
* Make semantic domain stats generators take an Arrow column as input.
* Fix error in number of missing examples and total number of examples
computation.
* Make `FeaturesNeeded` serializable.
* Fix memory leak in fast example decoder.
* Add `semantic_domain_stats_sample_rate` option to compute semantic domain
statistics over a sample.
* Increment refcount of None in fast example decoder.
* Add `compression_type` option to `generate_statistics_from_*` methods.
* Add link to SysML paper describing some technical details behind TFDV.
* Add Python types to the source code.
* Make `GenerateStatistics` generate a `DatasetFeatureStatisticsList`
  containing a dataset with `num_examples == 0` instead of an empty proto if
  there are no examples in the input.
* Depends on `absl-py>=0.7,<1`.
* Depends on `apache-beam[gcp]>=2.14,<3`.
* Depends on `numpy>=1.16,<2`.
* Depends on `pandas>=0.24,<1`.
* Depends on `pyarrow>=0.14.0,<0.15.0`.
* Depends on `scikit-learn>=0.18,<0.21`.
* Depends on `tensorflow-metadata>=0.14,<0.15`.
* Depends on `tensorflow-transform>=0.14,<0.15`.
Breaking Changes
* Change `examples_threshold` to `values_threshold` and update documentation
  to clarify that, in semantic domain stats generators, the counted units are
  values rather than examples.
* Refactor `IdentifyAnomalousExamples` to remove sampling and to output
  (anomaly reason, example) tuples.
* Rename `anomaly_proto` parameter in anomalies utilities to `anomalies` to
make it more consistent with proto and schema utilities.
* `FeatureNameStatistics` produced by `GenerateStatistics` is now identified
  by its `.path` field instead of the `.name` field. For example:

      feature {
        name: "my_feature"
      }

  becomes:

      feature {
        path {
          step: "my_feature"
        }
      }
* Change `validate_instance` API to accept an Arrow table instead of a dict.
* Change `GenerateStatistics` API to accept Arrow tables as input.
Deprecations