Dataprofiler

Latest version: v0.12.0

Page 8 of 10

0.5.0

Runtime Changes

Major release: unstructured profiles can now be generated.

Profiler

* Unstructured Profiler enabled; profiles can now be generated from the `TextData` class
* Factory class automatically selects `UnstructuredProfiler` or `StructuredProfiler`
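
The factory behavior described above can be pictured with a small sketch. This is not the library's code: `StructuredProfilerSketch`, `UnstructuredProfilerSketch`, and `make_profiler` are hypothetical stand-ins, and the rule "a string is unstructured text, anything else is rows" is an assumption made only to show the selection idea.

```python
from dataclasses import dataclass


@dataclass
class StructuredProfilerSketch:
    """Hypothetical stand-in for a profiler over tabular rows."""
    rows: list

    def report(self) -> dict:
        return {"kind": "structured", "row_count": len(self.rows)}


@dataclass
class UnstructuredProfilerSketch:
    """Hypothetical stand-in for a profiler over free text."""
    text: str

    def report(self) -> dict:
        return {"kind": "unstructured", "word_count": len(self.text.split())}


def make_profiler(data):
    """Pick a profiler class from the input type (the factory idea only)."""
    if isinstance(data, str):
        return UnstructuredProfilerSketch(data)
    return StructuredProfilerSketch(list(data))
```

With this sketch, `make_profiler("some free text")` yields the unstructured variant and `make_profiler([{"a": 1}])` yields the structured one, so callers never name a concrete class.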

0.4.6

Bug fixes

* Fix histogram index out of range (#217)
* Lock TensorFlow requirement to < 2.5.0; TensorFlow 2.5.0 has an issue (#220)
* Remove deprecated AVRO file formats (#220)
* Fix padding issue related to NumPy (#225)
* Remove pad in output of labeler (#226)

Other changes

* Histogram utils now use the built-in NumPy functions (#213)

0.4.5

Runtime Changes

Minor release: fixes bugs around null counts.

0.4.4

Runtime Changes

Minor release: fixes bugs and adds save & load of profiles.

Profiler

* Enables saving & loading a Profile

Bug fixes

* Handled the case where data is `None` when checking length
* Corrected `row_has_null` and `row_is_null` on update / adding
* Ensured row statistics are appropriately calculated when subsampled
* Minor bug fixes

0.4.3

Runtime Changes

Migrating from v0.4.2 to v0.4.3 should result in a **30-90% reduction in profiling time**, depending largely on system resources and data size.

Notes

* Remove requirement for tensorflow-addons
* Library now works with tensorflow nightly (Python 3.9)
* Added example on generating a new data labeler

Profiler

* Multiprocessing data preprocessing
* Improved histogram accuracy
* Reduced histogram generation runtime
* Option to set the bin count for histogram
* Expanded precision and switched to precision estimation (as opposed to exact calculation)
* Limit pool size based on cpu and memory limitations
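
One way to picture "limit pool size based on cpu and memory limitations" is the sketch below. The function name `pick_pool_size`, the `copies_per_worker` factor, and the idea that each worker needs a fixed multiple of the data size are all assumptions for illustration; the release notes do not state the library's actual rule.

```python
import os


def pick_pool_size(data_bytes, available_bytes, cpu_count=None, copies_per_worker=2):
    """Return a worker count bounded by both CPU count and memory headroom.

    Assumption: each worker needs roughly copies_per_worker * data_bytes of memory.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    per_worker = max(1, data_bytes * copies_per_worker)
    by_memory = available_bytes // per_worker
    # Never return fewer than one worker, even when memory is very tight.
    return max(1, min(cpus, by_memory))
```

The result could then be passed as the `processes` argument of `multiprocessing.Pool`, so that a large input on a small machine falls back to fewer workers rather than exhausting memory.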

Data

* Improved JSON detection method
* Option (default) pulls metadata and data separately (`data.meta` and `data.data`)
  * `data.meta` is the part of the JSON which contains no records
  * `data.data` is the part of the JSON which contains records
* Added option to select keys which represent records
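
The metadata/records split can be sketched with the standard library alone. This is an illustration, not the library's detection method: `split_json` and its `record_keys` parameter are hypothetical names, and the heuristic "a non-empty list of objects is a set of records" is an assumption.

```python
import json


def split_json(payload, record_keys=None):
    """Split a parsed JSON object into (meta, data).

    Assumption: a key holds records when its value is a non-empty list of
    objects, unless record_keys names the record-bearing keys explicitly.
    """
    meta, data = {}, []
    for key, value in payload.items():
        if record_keys is not None:
            is_records = key in record_keys
        else:
            is_records = (
                isinstance(value, list)
                and len(value) > 0
                and all(isinstance(item, dict) for item in value)
            )
        if is_records:
            data.extend(value)
        else:
            meta[key] = value
    return meta, data


raw = '{"source": "demo", "count": 2, "rows": [{"id": 1}, {"id": 2}]}'
meta, data = split_json(json.loads(raw))
```

Here `meta` keeps the keys that carry no records and `data` collects the records, mirroring the `data.meta` / `data.data` separation described above. Passing `record_keys=["rows"]` corresponds to selecting the record keys by hand.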

Report

* Precision report now contains additional details

    "precision": {
        "min": int,
        "max": int,
        "mean": float,
        "var": float,
        "std": float,
        "sample_size": int,
        "margin_of_error": float,
        "confidence_level": float
    },
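
To show how fields such as `sample_size`, `margin_of_error`, and `confidence_level` can arise from estimation on a sample, here is a minimal sketch. It is not the library's implementation: `significant_digits`, `estimate_precision`, the default sample ratio, the default confidence level, and the normal-approximation margin of error are all assumptions.

```python
import math
import random
import statistics


def significant_digits(text):
    """Count significant digits in a plain decimal string such as '0.0120'."""
    digits = text.strip().lstrip("+-").replace(".", "").lstrip("0")
    return len(digits)


def estimate_precision(values, sample_ratio=0.5, confidence_level=0.999, seed=0):
    """Estimate digit-count statistics from a random sample of the values.

    Needs at least two values so that a sample standard deviation exists.
    """
    rng = random.Random(seed)
    size = min(len(values), max(2, math.ceil(len(values) * sample_ratio)))
    sample = rng.sample(values, size)
    digits = [significant_digits(v) for v in sample]
    std = statistics.stdev(digits)
    # Two-sided z-score for the requested confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return {
        "min": min(digits),
        "max": max(digits),
        "mean": statistics.mean(digits),
        "var": statistics.variance(digits),
        "std": std,
        "sample_size": size,
        "margin_of_error": z * std / math.sqrt(size),
        "confidence_level": confidence_level,
    }
```

Sampling trades exactness for speed: the reported mean is an estimate, and the margin of error indicates how far it may sit from the value an exact pass over all rows would give.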


Bug fixes

* Fixed error in merging options
* Fixed issue related to merging DateTimeColumns
* Fixed multiprocessing on OSX
* Fixed row calculations if `min_true_samples` is greater than zero

0.4.2

Runtime Changes

Notes

This update reduces runtime by 50% on average.

Profiler
* Add support for HistogramOptions
* Add multiprocessing support
* Reduced runtime for shuffling indices
* Vectorized precision function
* Improved unique set & vocab merging
* By default histogram only runs 'auto' bin edge detection

Data
* Add length attribute to the data class: `data.length()` or `len(data)`

Report
* Added optional `omit_keys` to the report options function; the listed keys are removed from the final report
* Added `row_has_null_count` (global): the number of rows with one or more nulls
* Added `row_is_null_count` (global): the number of rows that are entirely null
* Rename `total_samples` (global) -> `row_count`
* Rename label `BACKGROUND` -> `UNKNOWN` (column)
* Removed `covariance` (global)
* Removed `data_classification` (global)
* Removed `data_label_probability` (column)
* Removed `median` (column)
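
The effect of an `omit_keys` option can be sketched as a recursive filter over a nested report. This is only an illustration: `drop_keys` is a hypothetical helper, it matches bare key names at any depth, and the format the library actually accepts for `omit_keys` is not given in these notes.

```python
def drop_keys(report, omit_keys):
    """Return a copy of a nested report with every listed key removed."""
    if isinstance(report, dict):
        return {
            key: drop_keys(value, omit_keys)
            for key, value in report.items()
            if key not in omit_keys
        }
    if isinstance(report, list):
        return [drop_keys(item, omit_keys) for item in report]
    return report


# Illustrative report shape; the field values are made up for the example.
sample_report = {
    "global_stats": {"row_count": 3, "row_is_null_count": 0},
    "data_stats": [{"column_name": "a", "statistics": {"min": 1}}],
}
trimmed = drop_keys(sample_report, {"statistics", "row_is_null_count"})
```

Because the function builds a new structure, the original report stays intact and can be trimmed differently for different consumers.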

Bug fixes

* Accurate null count and total_samples on profile updates
* Each column now receives the same sampled indices, which enables `row_is_null_count`
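
Why shared sampled indices matter for the row-level counts can be seen in a small sketch. `row_null_counts` is a hypothetical helper, and treating `None` as the only null marker is an assumption; the point is that a row can only be judged "has a null" or "is entirely null" when every column is read at the same positions.

```python
def row_null_counts(columns, sampled_indices):
    """Count rows with any null and rows that are entirely null.

    columns maps a column name to its list of values; None marks a null.
    Every column is read at the same sampled_indices so rows stay aligned.
    """
    has_null = 0
    is_null = 0
    for i in sampled_indices:
        nulls = [values[i] is None for values in columns.values()]
        has_null += any(nulls)
        is_null += all(nulls)
    return {
        "row_count": len(sampled_indices),
        "row_has_null_count": has_null,
        "row_is_null_count": is_null,
    }


table = {"a": [1, None, None, 4], "b": ["x", "y", None, None]}
full = row_null_counts(table, [0, 1, 2, 3])
subsampled = row_null_counts(table, [0, 2])
```

If each column drew its own sample instead, index 2 of column `a` might be paired with index 3 of column `b`, and the per-row counts would no longer describe real rows.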
