Automunge

7.46

- simplified convention for use of postprocess_dict in postmunge to log validation results in support functions
- now using the same postprocess_dict entry as used in automunge for this purpose, temp_miscparameters_results
- previously we had struck this entry at the conclusion of automunge(.); now it is simply reset after both automunge and postmunge
- which is now used in place of the previous postmunge temporary entry temp_pm_miscparameters_results
- the benefit of this approach is that the same support functions from automunge can now be used in postmunge without needing to accommodate a different location for validation results
- as is relevant, for example, to the new validation results logged in model training from 7.45
- found and fixed an edge case in NArw aggregation for the binary MLinfilltype associated with features containing only one unique value
- comprehensive spell check of read me
- in the process identified a few small spelling typos in code comment documentation for transformdict and processdict

7.45

- today's theme was edge case mitigations
- for cases where df_train has only one feature, ML infill is now routed around
- (previously this resulted in a printout and halt; now it is just a printout without ML infill)
- in the process found a material edge case associated with automated leakage detection carveouts from ML infill
- now for cases where, after leakage carveouts, an ML infill model has fewer than two input features serving as basis, the ML infill model is not trained
- which otherwise could have been a halt channel
- also returns a corresponding validation result as 'not_enough_samples_or_features_for_MLinfill_result'
- recorded as a boolean for each ML infill target column as {column : boolean}
- found an edge case associated with one of the sorting methods used in evaluating entry frequencies in a few places
- associated with a column header found in df_train overlapping with use of an arbitrary string
- now circumvented by adding a suffix to the arbitrary string until the overlap is resolved
- a few cleanups to postmunge noise_augment
- now passing postprocess_dict to internal postmunge call directly instead of a deepcopy for reduced memory overhead
- just needed to accommodate postmunge-specific temporary entries that were struck as part of the internal call
- in the process found a snafu with noise_augment
- associated with calling am.postmunge instead of self.postmunge
- which, if a user had used a different import procedure, could have been an error channel
- performed an audit of index extractions for ID sets
- including various scenarios of index mismatch between df_train and df_test
- found that there was an edge case associated with dataframes with multiple indexes
- now resolved
- put some additional thought into the renaming of an unnamed non-range index for extraction to ID sets
- originally this was renamed to 'Orig_index_' appended with the 12 digit application number associated with the automunge(.) call
- realized this was kind of impractical; now defaults to 'Orig_index', unless that results in an overlap with existing columns for the index or ID sets, in which case the prior convention applies
- (and in cases where the prior convention is inadequate, additional comma suffixes are added until resolved; a minimal sketch follows this list)
- in the process made a few cleanups to the support function for deriving the 'Automunge_index' string, which has a similar convention
- note that when these strings deviate from default a validation result is reported to indexcolumn_valresult and origindexcolumn_valresult
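A minimal sketch of the overlap-resolution convention noted above, collapsing the intermediate application-number fallback for brevity (the helper name and exact suffix handling are illustrative assumptions, not the library's internals):

    def resolve_header_overlap(candidate, reserved_headers, suffix=','):
        """Append a suffix to candidate until it no longer overlaps
        with headers reserved for the index or ID sets."""
        resolved = candidate
        while resolved in reserved_headers:
            resolved += suffix
        return resolved

    # e.g. when 'Orig_index' already appears as a column header,
    # the derived name picks up suffixes until the overlap resolves
    print(resolve_header_overlap('Orig_index', {'Orig_index'}))  # prints 'Orig_index,'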

7.44

- found another validation oversight from 7.42 (it was a complex rollout)
- now the float(int) scenario for noise_augment is working as intended

7.43

- 7.42 had noted that mlti now has support for assigning specific assignparam specifications to each of the normcategory column applications in the mlti multi column set
- an additional code review identified that this aspect was insufficiently validated
- now working as intended

7.42

- New parameter accepted to both automunge(.) and postmunge(.) as noise_augment
- Accepts type int or float(int) >=0. Defaults to 0
- Used to specify a count of additional duplicates of the training data prepared and concatenated with the original train set.
- Intended for use in conjunction with noise injection, such that the increased size of training corpus can be a form of data augmentation.
- Note that injected noise will be uniquely randomly sampled with each duplicate.
- When noise_augment is received as a dtype of int, one of the duplicates will be prepared without noise. When noise_augment is received as a dtype of float(int), all of the duplicates will be prepared with noise.
- When shuffletrain is activated the duplicates are collectively shuffled, and one can distinguish between duplicates by comparing the original df_train.shape to the ID set's Automunge_index.
- Please be aware that with large dataframes a large duplicate count may run into memory constraints, in which case additional duplicates can be prepared separately in postmunge(.).
- The postmunge(.) noise_augment option takes into account the traindata parameter for distinguishing whether to treat the duplicates as train or test data (a usage sketch follows this list).
- Additional updates:
- mlti transform now takes into account assignparam specifications in parameters passed to normcategory
- this is relevant to mlti's use in the context of noise injection to multi column hashing, as trainnoise or testnoise can now be updated through global_assignparam in similar fashion to other noise injection transforms
- this also has the benefit that specific assignparam specifications can now be assigned to each of the normcategory column applications (mlti takes multi column sets as input)
- new scenarios supported for postmunge(.) traindata parameter as 'train_no_noise' and 'test_no_noise'
- used as you would expect, treating data passed to postmunge as train or test data but without noise injections
- this setting is primarily intended to support workflow with new noise_augment parameter
- added dupl_rows parameter support for validation data prep in automunge (treated consistently to training data)
- added random seeding support for validation data prep, matched to automunge randomseed when it is manually specified
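A minimal usage sketch for noise_augment, assuming a pandas dataframe df_train with a hypothetical 'label' column (noise transform assignments are omitted for brevity; return signatures follow the read me convention):

    from Automunge import *
    am = AutoMunge()

    # prepare training data plus two additional duplicates;
    # with int dtype, one duplicate is prepared without noise
    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = am.automunge(
        df_train,
        labels_column='label',  # hypothetical label column
        noise_augment=2,
        shuffletrain=True)

    # additional duplicates can also be prepared separately in postmunge(.),
    # with traindata distinguishing treatment as train or test data
    augment, augment_ID, augment_labels, \
    postreports_dict = am.postmunge(
        postprocess_dict, df_train,
        traindata='train_no_noise',  # treat as train data without noise injections
        noise_augment=1)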

7.41

- ML infill now available with XGBoost library and bayesian hyperparameter tuning by the Optuna library
- can activate XGBoost by passing ML_cmnd['autoML_type'] = 'xgboost'
- tuning can be activated by passing ML_cmnd['hyperparam_tuner'] = 'optuna_XGB1'
- defaults to no tuning with default parameters associated with XGBoost's scikit api (XGBClassifier and XGBRegressor)
- parameters can be passed to model training via ML_cmnd['MLinfill_cmnd']['xgboost_classifier_fit'] and ML_cmnd['MLinfill_cmnd']['xgboost_regressor_fit'] (a configuration sketch follows this list)
- gpu support available by passing a CUDA gpu device id to ML_cmnd['xgboost_gpu_id'], which automatically activates tree_method to gpu_hist
- if you have one gpu the device id is probably the integer 0
- (we recommend using ML_cmnd['xgboost_gpu_id'] instead of passing gpu parameters through ML_cmnd['MLinfill_cmnd'] so as to ensure consistent treatment between tuning and training)
- note the read me demonstrates how to designate gpu training with cpu inference, as may be desired when intending to put the postprocess_dict into production
- tuning applied via Bayesian optimization, making use of the Optuna library
- tuning supported by parameters ML_cmnd['optuna_n_iter'] and ML_cmnd['optuna_timeout'], which set the maximum number of iterations and the maximum tuning time in seconds for each feature
- when not specified defaulting to ML_cmnd['optuna_n_iter'] = 100, ML_cmnd['optuna_timeout'] = 600 (i.e. 100 tuning iterations or 600 seconds for each feature, whichever comes first)
- tuning implementation was partly inspired by the optuna tutorial at https://github.com/optuna/optuna-examples/blob/main/xgboost/xgboost_simple.py
- currently based on a fixed validation split (possibly pending further research into incorporating cross validation / early stopping)
- in the process of drafting xgboost wrappers, identified a potential inconsistency between customML implementation and documentation
- we noted that customML receives ordinal encoded labels as str(int) with fully represented sequential range integer encodings from 0 to max of encoding space (which xgboost likes)
- turned out there was a scenario where this conversion wasn't being performed, now resolved
- *Please note we were having trouble validating the xgboost GPU support, possibly due to an issue with local hardware. For now please consider GPU support experimental, pending further validation.
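Pulling the above together, a configuration sketch assuming a pandas dataframe df_train (the optuna values shown are the stated defaults; the fit parameters are hypothetical placeholders, not recommendations):

    from Automunge import *
    am = AutoMunge()

    # ML infill by XGBoost with Optuna bayesian hyperparameter tuning
    ML_cmnd = {'autoML_type'      : 'xgboost',
               'hyperparam_tuner' : 'optuna_XGB1',
               'optuna_n_iter'    : 100,  # max tuning iterations per feature (default)
               'optuna_timeout'   : 600,  # max tuning seconds per feature (default)
               'xgboost_gpu_id'   : 0,    # CUDA device id, activates gpu_hist tree_method
               'MLinfill_cmnd'    : {'xgboost_classifier_fit' : {'max_depth' : 6},
                                     'xgboost_regressor_fit'  : {'max_depth' : 6}}}

    train, train_ID, labels, \
    val, val_ID, val_labels, \
    test, test_ID, test_labels, \
    postprocess_dict = am.automunge(
        df_train, MLinfill=True, ML_cmnd=ML_cmnd)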
