Automunge

Latest version: v8.33


5.38

- new max_column_count parameter accepted for hash transform
- since the number of returned columns is based on the longest entry's word count
- there may be scenarios with extreme outliers that result in excessive dimensionality
- max_column_count caps the number of returned columns
- defaulting to False for no cap
- when word extraction reaches the threshold the remainder of the string is treated as a single word
- e.g. for string entry "one two three four" and max_column_count = 3
- hashing would be based on the extracted words ['one', 'two', 'three four'] (see the sketch after this list)
- also updated defaults under automation for high cardinality categoric sets
- now when number of unique entries exceeds numbercategoryheuristic parameter (which currently defaults to 127)
- it is treated with hsh2 which is a hashing similar to ordinal
- unless unique entry count exceeds 75% of train set row count
- in which case it is treated with hash which extracts the words from within entries
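- a rough sketch of the capping behavior (illustrative only, not the library's internal implementation; the helper name split_with_cap is hypothetical):

def split_with_cap(entry, max_column_count=False):
    # split an entry into words, merging any overflow beyond
    # max_column_count into the final extracted word prior to hashing
    words = entry.split(' ')
    if max_column_count and len(words) > max_column_count:
        words = words[:max_column_count - 1] + [' '.join(words[max_column_count - 1:])]
    return words

print(split_with_cap("one two three four", max_column_count=3))
# ['one', 'two', 'three four']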

5.37

- found and fixed small bug from 5.36 missed in initial testing

5.36

- ran some additional tests on hashing algorithm speed
- and came to the convincing conclusion that md5 wasn't the best default for hash transforms
- so new default is the native python hash function
- md5 hash is still available with new 'hash_alg' parameter for hash transforms
- which defaults to 'hash' but can be passed as 'md5' to revert to original basis
- note that salt is still supported for both cases
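- a rough sketch of the two hashing bases described above (illustrative only; the helper name, the vocab_size modulo, and the default of 256 are assumptions rather than the library's internal implementation):

import hashlib

def hash_entry(entry, salt='', hash_alg='hash', vocab_size=256):
    # appending salt perturbs the hashing while remaining consistent
    # between train and test data
    salted = entry + salt
    if hash_alg == 'md5':
        return int(hashlib.md5(salted.encode()).hexdigest(), 16) % vocab_size
    # default 'hash' basis: the native python hash function
    # (note that in this sketch hash() on strings varies across interpreter
    # sessions unless PYTHONHASHSEED is set)
    return hash(salted) % vocab_size

print(hash_entry('one', salt='my salt', hash_alg='md5'))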

5.35

- new parameter accepted for hash family of transforms
- 'salt' can be passed as an arbitrary string, defaulting to the empty string ''
- salt perturbs the hashing to ensure privacy of encoding basis
- which is consistently applied between train and test data for internal consistency
- quick fix to suffix appender assembly for hash transform from 'column_hash' to 'column_hash_'
- added edge case support for catboost associated with very small data sets
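- a usage sketch for passing salt (assumes the assigncat / assignparam interfaces and import convention from the read me; the column name and salt string are placeholders):

import pandas as pd
from Automunge import *
am = AutoMunge()

# toy train set with a hypothetical 'description' column
df_train = pd.DataFrame({'description': ['one two', 'three four five', 'six']})

# salt is passed to the hash transform through assignparam
returned_sets = am.automunge(
    df_train,
    assigncat={'hash': ['description']},
    assignparam={'hash': {'description': {'salt': 'an arbitrary string'}}})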

5.34

- new autoML option for ML infill using CatBoost library
- requires installing CatBoost with
pip install catboost
- available by passing ML_cmnd as
ML_cmnd = {'autoML_type':'catboost'}
- uses early stopping by default for regression and no early stopping by default for classifier
- (to avoid a potential error for the classifier when all samples of a label category land in the validation set)
- can turn on early stopping for classifier by passing
ML_cmnd = {'autoML_type':'catboost', 'MLinfill_cmnd' : {'catboost_classifier_fit' : {'eval_ratio' : value}}}
- where value is a float between 0 and 1 designating the validation ratio (defaults to 0.15 for the regressor)
- in general can pass parameters to model initialization and fit operation as
ML_cmnd = {'autoML_type' : 'catboost',
           'MLinfill_cmnd' : {'catboost_classifier_model' : {'parameter1' : 'value'},
                              'catboost_classifier_fit'   : {'parameter2' : 'value'},
                              'catboost_regressor_model'  : {'parameter3' : 'value'},
                              'catboost_regressor_fit'    : {'parameter4' : 'value'}}}
- in general, accuracy performance of the autoML options is expected to rank as AutoGluon > CatBoost > Random Forest
- in general, latency performance of the autoML options is expected to rank as Random Forest > CatBoost > AutoGluon
- in general, memory performance of the autoML options is expected to rank as Random Forest > CatBoost > AutoGluon
- and Random Forest and CatBoost are more portable than AutoGluon since they don't require a local model repository saved to the hard drive
- for now retaining random forest as the default
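- a usage sketch activating ML infill with the CatBoost option (assumes the MLinfill and labels_column parameters and import convention from the read me; the toy dataframe is a placeholder):

import numpy as np
import pandas as pd
from Automunge import *
am = AutoMunge()

# toy train set with a few missing cells for ML infill to impute
df_train = pd.DataFrame({'feature1' : [1.0, 2.0, np.nan, 4.0],
                         'feature2' : ['a', 'b', 'a', None],
                         'labels'   : [0, 1, 0, 1]})

returned_sets = am.automunge(
    df_train,
    labels_column='labels',
    MLinfill=True,
    ML_cmnd={'autoML_type': 'catboost'})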

5.33

- added parameter support for AutoGluon
- these can be passed to the fit command through the ML_cmnd parameter
- further documented in the read me
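- a rough sketch of the intended form (the 'autogluon' type string, the 'AutoGluon_fit' key name, and the fit parameter shown are hypothetical placeholders; consult the read me for the documented keys):

# 'AutoGluon_fit' is a hypothetical key name for illustration only
# time_limit is an example AutoGluon fit argument
ML_cmnd = {'autoML_type' : 'autogluon',
           'MLinfill_cmnd' : {'AutoGluon_fit' : {'time_limit' : 60}}}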
