Automunge

Latest version: v8.33


5.38

- new max_column_count parameter accepted for hash transform
- since the number of returned columns is based on the longest entry's word count
- there may be scenarios with extreme outliers that result in excessive dimensionality
- max_column_count caps the number of returned columns
- defaulting to False for no cap
- when word extraction reaches the threshold the remainder of the string is treated as a single word
- e.g. for string entry "one two three four" and max_column_count = 3
- hashing would be based on the extracted words ['one', 'two', 'three four'] (see the sketch after this list)
- also updated defaults under automation for high cardinality categoric sets
- now when number of unique entries exceeds numbercategoryheuristic parameter (which currently defaults to 127)
- it is treated with hsh2 which is a hashing similar to ordinal
- unless unique entry count exceeds 75% of train set row count
- in which case it is treated with hash which extracts the words from within entries
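- a rough sketch of the capping behavior (illustrative only, not the library's internal implementation; the helper name split_with_cap is hypothetical):

def split_with_cap(entry, max_column_count=False):
    # split an entry into words, merging any overflow beyond
    # max_column_count into the final extracted word prior to hashing
    words = entry.split(' ')
    if max_column_count and len(words) > max_column_count:
        words = words[:max_column_count - 1] + [' '.join(words[max_column_count - 1:])]
    return words

print(split_with_cap("one two three four", max_column_count=3))
# ['one', 'two', 'three four']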

5.37

- found and fixed small bug from 5.36 missed in initial testing

5.36

- ran some additional tests on hashing algorithm speed
- and came to the convincing conclusion that md5 wasn't the best default for hash transforms
- so new default is the native python hash function
- md5 hash is still available with new 'hash_alg' parameter for hash transforms
- which defaults to 'hash' but can be passed as 'md5' to revert to original basis
- note that salt is still supported for both cases
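- a rough sketch of the two hashing bases described above (illustrative only; the helper name, the vocab_size modulo, and the default of 256 are assumptions rather than the library's internal implementation):

import hashlib

def hash_entry(entry, salt='', hash_alg='hash', vocab_size=256):
    # appending salt perturbs the hashing while remaining consistent
    # between train and test data
    salted = entry + salt
    if hash_alg == 'md5':
        return int(hashlib.md5(salted.encode()).hexdigest(), 16) % vocab_size
    # default 'hash' basis: the native python hash function
    # (note that in this sketch hash() on strings varies across interpreter
    # sessions unless PYTHONHASHSEED is set)
    return hash(salted) % vocab_size

print(hash_entry('one', salt='my salt', hash_alg='md5'))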

5.35

- new parameter accepted for hash family of transforms
- 'salt' can be passed as an arbitrary string, defaulting to the empty string ''
- salt perturbs the hashing to ensure privacy of encoding basis
- which is consistently applied between train and test data for internal consistency
- quick fix to suffix appender assembly for hash transform from 'column_hash' to 'column_hash_'
- added edge case support for catboost associated with very small data sets
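- a usage sketch for passing salt (assumes the assigncat / assignparam interfaces and import convention from the read me; the column name and salt string are placeholders):

import pandas as pd
from Automunge import *
am = AutoMunge()

# toy train set with a hypothetical 'description' column
df_train = pd.DataFrame({'description': ['one two', 'three four five', 'six']})

# salt is passed to the hash transform through assignparam
returned_sets = am.automunge(
    df_train,
    assigncat={'hash': ['description']},
    assignparam={'hash': {'description': {'salt': 'an arbitrary string'}}})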

5.34

- new autoML option for ML infill using CatBoost library
- requires installing CatBoost with
pip install catboost
- available by passing ML_cmnd as
ML_cmnd = {'autoML_type':'catboost'}
- uses early stopping by default for regression and no early stopping by default for classifier
- (to avoid a potential error for the classifier when all samples of a label category land in the validation set)
- can turn on early stopping for classifier by passing
ML_cmnd = {'autoML_type':'catboost', 'MLinfill_cmnd' : {'catboost_classifier_fit' : {'eval_ratio' : value}}}
- where value is a float between 0 and 1 designating the validation ratio (defaults to 0.15 for the regressor)
- in general can pass parameters to model initialization and fit operation as
ML_cmnd = {'autoML_type' : 'catboost',
           'MLinfill_cmnd' : {'catboost_classifier_model' : {'parameter1' : 'value'},
                              'catboost_classifier_fit'   : {'parameter2' : 'value'},
                              'catboost_regressor_model'  : {'parameter3' : 'value'},
                              'catboost_regressor_fit'    : {'parameter4' : 'value'}}}
- in general, accuracy performance of the autoML options is expected to rank as AutoGluon > CatBoost > Random Forest
- in general, latency performance of the autoML options is expected to rank as Random Forest > CatBoost > AutoGluon
- in general, memory performance of the autoML options is expected to rank as Random Forest > CatBoost > AutoGluon
- and Random Forest and CatBoost are more portable than AutoGluon since they don't require a local model repository saved to the hard drive
- for now retaining random forest as the default
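- a usage sketch activating ML infill with the CatBoost option (assumes the MLinfill and labels_column parameters and import convention from the read me; the toy dataframe is a placeholder):

import numpy as np
import pandas as pd
from Automunge import *
am = AutoMunge()

# toy train set with a few missing cells for ML infill to impute
df_train = pd.DataFrame({'feature1' : [1.0, 2.0, np.nan, 4.0],
                         'feature2' : ['a', 'b', 'a', None],
                         'labels'   : [0, 1, 0, 1]})

returned_sets = am.automunge(
    df_train,
    labels_column='labels',
    MLinfill=True,
    ML_cmnd={'autoML_type': 'catboost'})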

5.33

- added parameter support for AutoGluon
- these can be passed to the fit command through the ML_cmnd parameter
- further documented in the read me
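- a rough sketch of the intended form (the 'autogluon' type string, the 'AutoGluon_fit' key name, and the fit parameter shown are hypothetical placeholders; consult the read me for the documented keys):

# 'AutoGluon_fit' is a hypothetical key name for illustration only
# time_limit is an example AutoGluon fit argument
ML_cmnd = {'autoML_type' : 'autogluon',
           'MLinfill_cmnd' : {'AutoGluon_fit' : {'time_limit' : 60}}}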
