Automunge

Latest version: v8.33

7.58

- new automunge(.) and postmunge(.) parameters supported for noise injection (a usage sketch follows this list)
- entropy_seeds: defaults to False, accepts an integer or list of integers which may serve as a supplemental source of entropy for numpy.random sampling
- random_generator: defaults to False, accepts input of a numpy.random.Generator formatted random sampler
- the default generator is np.random.PCG64, which applies the PCG pseudorandom number generator (and is also the default applied by numpy)
- an example of an alternate generator is np.random.MT19937 for the Mersenne Twister
- in both cases these generators are not truly random on their own, they rely on entropy seedings provided by the operating system which are then extended through their use
- note that alternate libraries with numpy.random formatted generators can also be accessed for this purpose, for example libraries that sample with support of quantum circuits
- an example of a library that can serve as an alternate resource for np.random generators is QRAND
- or if an alternate library does not have numpy.random support, its output can be channeled as entropy_seeds for a similar benefit
- the two parameters can also be passed in tandem, for sampling with a custom generator using custom supplemental entropy seeds
- entropy_seeds and random_generator are specific to an automunge(.) or postmunge(.) call, in other words they are not returned in the populated postprocess_dict
- also found and fixed a bug in DPrt associated with test_mu specification
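
For example, a minimal sketch of passing these parameters to an automunge(.) call (the import convention and the abbreviated return signature are assumptions for illustration, df_train is a user pandas dataframe of training data):

import numpy as np
from Automunge import *
am = AutoMunge()

# supplemental entropy seeds and an alternate Mersenne Twister generator
# (per the notes above, the default generator is np.random.PCG64)
returned_sets = am.automunge(df_train,
                             entropy_seeds = [141, 592, 653],
                             random_generator = np.random.MT19937)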

7.57

- found that an immaterial family tree was interfering with a validation test
- updated DPo7 family tree
- (DPo7 primarily intended for use as a downstream tree category)

7.56

- new root categories DPqt and DPbx
- DPqt is for quantile transform with noise injection
- (quantile transform is built on top of sklearn QuantileTransformer)
- DPbx is for box-cox transform with noise injection
- (box-cox is built on top of scipy.stats boxcox)
- both of these are options for translating numeric distributions to more closely resemble a normal distribution
- as may be beneficial in the presence of fat tails (an assignment sketch follows this list)
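
For example, a hedged sketch of assigning the new root categories to target columns with the assigncat parameter (the column names here are hypothetical):

assigncat = {'DPqt' : ['fat_tail_column_1'], 'DPbx' : ['fat_tail_column_2']}
# then passed to an automunge(.) call, e.g. am.automunge(df_train, assigncat = assigncat)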

7.55

- primarily a cleanup to the read me
- also updated a function definition in the code base to align
- the demonstrations of the assigncat parameter are now streamlined
- this demonstration had started as a comprehensive set of transforms
- and as the library was built out it gradually became excessive
- now streamlined to a few core root categories, selected for potential common use
- first row are categoric encodings
- second row are corresponding categoric encodings with noise injection
- third row are numeric normalizations / corresponding normalization with noise
- fourth row are examples of binning options (as could be added to a normalization family tree)
- fifth row is misc, including integer encoding, search, string parsing, tlbn for explainability support, and passthrough transforms (excl is direct passthrough, exc2 is passthrough with force to numeric)
assigncat = {'1010':[], 'onht':[], 'ordl':[], 'bnry':[], 'hash':[], 'hsh2':[],
             'DP10':[], 'DPoh':[], 'DPod':[], 'DPbn':[], 'DPhs':[], 'DPh2':[],
             'nmbr':[], 'mnmx':[], 'retn':[], 'DPnb':[], 'DPmm':[], 'DPrt':[],
             'bins':[], 'pwr2':[], 'bnep':[], 'bsor':[], 'por2':[], 'bneo':[],
             'ntgr':[], 'srch':[], 'or19':[], 'tlbn':[], 'excl':[], 'exc2':[]}

7.54

- new ML_cmnd options available for XGBoost tuning with optuna
- optuna_kfolds: defaults to 1, can pass an integer to select a number of cross validation folds for tuning (may help with overfitting)
- optuna_early_stop: defaults to 50, can pass an integer to select a max number of tuning cycles without an improved performance metric to trigger early stopping
- optuna_max_depth_tuning_stepsize: defaults to 2 based on an optuna demonstration; we expect setting it to 1 could be beneficial with longer tuning durations
- here are the current optuna tuning options shown with their defaults:
ML_cmnd = {'autoML_type' : 'xgboost',
           'hyperparam_tuner' : 'optuna_XG1',
           'optuna_n_iter' : 100,
           'optuna_timeout' : 600,
           'optuna_kfolds' : 1,
           'optuna_early_stop' : 50,
           'optuna_max_depth_tuning_stepsize' : 2,
           }
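
This ML_cmnd specification is then passed to an automunge(.) call, for example (a hedged sketch, with an abbreviated return signature shown for illustration):

# MLinfill activates ML infill, which trains models of the specified autoML_type
returned_sets = am.automunge(df_train, MLinfill = True, ML_cmnd = ML_cmnd)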

7.53

- new ppd_append automunge(.) parameter
* ppd_append: defaults to False, accepts as input a prior populated postprocess_dict for purposes of adding new features to a prior trained model. Basically the intent is that there are some specialized workflows where models in decision tree paradigms may have new features incorporated without retraining the model on the prior training data. In such cases a user may desire to add new features to a prior populated postprocess_dict to enable pushbutton preprocessing on the original training data basis coupled with the basis of the newly added features. In order to do so, automunge(.) should be called with just the new features passed as df_train and the prior populated postprocess_dict passed to ppd_append. This will result in the newly populated postprocess_dict being saved as a new subentry in the returned original postprocess_dict, such that to prepare additional data including both the original features and the new features, the combined features can be collectively passed as df_test to postmunge(.) (with the new features appended on the right side of the original features). postmunge(.) will prepare the original features and new features separately, including a separate basis for ML infill, Binary, etc., and will return a combined prepared test data. Includes inversion support and support for performing more than one round of appending new features. Note that newly added features are limited to training features, labels and ID input should be excluded. Note that numpy inversion support is not available with combined features and test feature inversion support is limited to the inversion='test' case. (If it is desired to include the new features in the prior features' ML infill basis and vice versa, instead of applying ppd_append just pass everything to automunge(.) and populate a new postprocess_dict, noting this might justify retraining the original model due to a new ML infill basis for the original features.) A workflow sketch follows below.
- inspired by a comment by Jensen Huang in the Nvidia keynote regarding potential for decision tree paradigms to incorporate new features into a prior trained model
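
A hedged sketch of the intended workflow (the dataframe names, import convention, and abbreviated return signatures are assumptions for illustration):

from Automunge import *
am = AutoMunge()

# original training data prepared in a prior automunge(.) call,
# with the populated postprocess_dict returned as the final entry of the returned sets
*_, postprocess_dict = am.automunge(df_train_original)

# to incorporate new features, pass just the new training feature columns as df_train
# and the prior postprocess_dict to ppd_append; the returned postprocess_dict
# carries the new feature basis as a subentry of the original
*_, postprocess_dict = am.automunge(df_new_features, ppd_append = postprocess_dict)

# additional data containing the original features with the new features appended
# on the right can then be prepared collectively in postmunge(.)
test, test_ID, test_labels, postreports_dict = \
  am.postmunge(postprocess_dict, df_test_combined)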
