Major Features and Improvements
* Performance improvement due to optimizing inner loops.
* Add support for time semantic domain related statistics.
* Performance improvement due to batching accumulators before merging.
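The accumulator-batching optimization can be sketched in plain Python. This is an illustrative combiner, not TFDV's actual implementation; `merge_in_batches` and `merge_counts` are hypothetical names:

```python
def merge_in_batches(accumulators, merge_fn, batch_size=10):
    """Merge accumulators a batch at a time instead of pairwise.

    Handing several accumulators to each merge call amortizes the
    per-merge overhead, which is the idea behind the optimization above.
    """
    merged = accumulators[0]
    for start in range(1, len(accumulators), batch_size):
        batch = accumulators[start:start + batch_size]
        merged = merge_fn([merged] + batch)
    return merged

# Example: accumulators that are running (count, total) pairs.
def merge_counts(accs):
    return (sum(a[0] for a in accs), sum(a[1] for a in accs))

total = merge_in_batches([(1, i) for i in range(100)], merge_counts)
# total == (100, 4950)
```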
* Add utility method `validate_examples_in_tfrecord`, which identifies anomalous
examples in TFRecord files containing TFExamples and generates statistics for
those anomalous examples.
* Add utility method `validate_examples_in_csv`, which identifies anomalous
examples in CSV files and generates statistics for those anomalous examples.
* Add fast TF example decoder written in C++.
* Make `BasicStatsGenerator` take an Apache Arrow table as input. Example
  batches are converted to Arrow tables internally, which lets the computation
  use vectorized numpy functions. This improved the performance of
  `BasicStatsGenerator` by ~40x.
* Make `TopKUniquesStatsGenerator` and `TopKUniquesCombinerStatsGenerator`
  take an Arrow table as input.
* Add `update_schema` API which updates the schema to conform to statistics.
* Add support for validating changes in the number of examples between the
current and previous spans of data (using the existing `validate_statistics`
function).
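Conceptually, the cross-span check compares the current span's example count against the previous span's. The sketch below only illustrates that idea in plain Python; TFDV performs the actual check inside `validate_statistics` when previous-span statistics are supplied, and `num_examples_changed_too_much` and `max_ratio_change` are hypothetical names:

```python
def num_examples_changed_too_much(curr_n, prev_n, max_ratio_change=2.0):
    """Flag when the example count grows or shrinks beyond a ratio threshold.

    `max_ratio_change` is an illustrative knob, not a TFDV option.
    """
    if prev_n == 0:
        return curr_n != 0
    ratio = curr_n / float(prev_n)
    return ratio > max_ratio_change or ratio < 1.0 / max_ratio_change

flagged = num_examples_changed_too_much(5000, 1000)  # True: 5x growth
```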
* Support building a manylinux2010 compliant wheel in docker.
* Add support for cross feature statistics.
Bug Fixes and Other Changes
* Expand unit test coverage.
* Update natural language stats generator to generate stats when the actual
  ratio equals `match_ratio`.
* Use `__slots__` in accumulators.
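Declaring `__slots__` drops the per-instance `__dict__`, shrinking the many small accumulator objects created during combining. A minimal sketch (illustrative only; `CountAccumulator` is a hypothetical class, not TFDV's accumulator):

```python
class CountAccumulator(object):
    """Accumulator using __slots__ to avoid a per-instance __dict__."""
    __slots__ = ['count', 'total']

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value

acc = CountAccumulator()
acc.update(3.0)
# With __slots__, no instance __dict__ is allocated:
has_dict = hasattr(acc, '__dict__')  # False
```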
* Fix overflow warning when generating numeric stats for large integers.
* Set the max value count in the schema when the feature has the same valency
  in every example, thereby inferring the shape of multivalent required
  features.
* Fix divide by zero error in natural language stats generator.
* Add `load_anomalies_text` and `write_anomalies_text` utility functions.
* Define `ReasonFeatureNeeded` proto.
* Add support for Windows OS.
* Make semantic domain stats generators take an Arrow column as input.
* Fix error in number of missing examples and total number of examples
computation.
* Make `FeaturesNeeded` serializable.
* Fix memory leak in fast example decoder.
* Add `semantic_domain_stats_sample_rate` option to compute semantic domain
statistics over a sample.
* Increment refcount of None in fast example decoder.
* Add `compression_type` option to `generate_statistics_from_*` methods.
* Add link to SysML paper describing some technical details behind TFDV.
* Add Python types to the source code.
* Make `GenerateStatistics` generate a `DatasetFeatureStatisticsList`
  containing a dataset with `num_examples == 0` instead of an empty proto if
  there are no examples in the input.
* Depends on `absl-py>=0.7,<1`.
* Depends on `apache-beam[gcp]>=2.14,<3`.
* Depends on `numpy>=1.16,<2`.
* Depends on `pandas>=0.24,<1`.
* Depends on `pyarrow>=0.14.0,<0.15.0`.
* Depends on `scikit-learn>=0.18,<0.21`.
* Depends on `tensorflow-metadata>=0.14,<0.15`.
* Depends on `tensorflow-transform>=0.14,<0.15`.
Breaking Changes
* Change `examples_threshold` to `values_threshold` and update documentation
  to clarify that, in semantic domain stats generators, the counted units are
  values rather than examples.
* Refactor `IdentifyAnomalousExamples` to remove sampling and to output
  (anomaly reason, example) tuples.
* Rename `anomaly_proto` parameter in anomalies utilities to `anomalies` to
make it more consistent with proto and schema utilities.
* `FeatureNameStatistics` produced by `GenerateStatistics` is now identified
  by its `.path` field instead of the `.name` field. For example:

      feature {
        name: "my_feature"
      }

  becomes:

      feature {
        path {
          step: "my_feature"
        }
      }
* Change `validate_instance` API to accept an Arrow table instead of a dict.
* Change `GenerateStatistics` API to accept Arrow tables as input.
Deprecations