
Latest version: v8.33

- simplified convention for use of postprocess_dict in postmunge to log validation results in support functions
- now using same postprocess_dict entry as used in automunge for this purpose, as temp_miscparameters_results
- previously we had struck this at conclusion of automunge(.), now it is simply reset after automunge and postmunge
- which is now used in place of previous postmunge temporary entry temp_pm_miscparameters_results
- the benefit of this approach is that we can now use the same support functions from automunge in postmunge without needing to accommodate a different location for validation results
- as for example is relevant to the new validation results logged in model training from 7.45
- found and fixed an edge case for NArw aggregation for binary MLinfilltype associated with cases of only one unique value in feature
- comprehensive spell check of read me
- in the process identified a few small spelling typos in code comment documentation for transformdict and processdict


- today's theme was edge case mitigations
- for cases where df_train has only one feature, routed around ML infill
- (previously this resulted in a printout and halt, now it is just a printout without ML infill)
- in the process found a material edge case associated with automated leakage detection carveouts from ML infill
- now for cases where after leakage carveouts an ML infill model has fewer than two input features serving as basis, the ML infill model is not trained
- which otherwise could have been a halt channel
- also returns a corresponding validation result as 'not_enough_samples_or_features_for_MLinfill_result'
- recorded as boolean for each ML infill target column as {column : boolean}
- found an edge case associated with one of the sorting methods used in evaluating entry frequencies in a few places
- associated with column header found in df_train overlapping with use of an arbitrary string
- now circumvented by adding suffix to the arbitrary string until overlap resolved
- a few cleanups to postmunge noise_augment
- now passing postprocess_dict to internal postmunge call directly instead of a deepcopy for reduced memory overhead
- just needed to accommodate postmunge specific temporary entries that were struck as part of the internal call
- in the process found a snafu with noise_augment
- associated with calling am.postmunge instead of self.postmunge
- which if user used different import procedure could have been error channel
- performed an audit of index extractions for ID sets
- including various scenarios of index mismatch between df_train and df_test
- found that there was an edge case associated with dataframes with multiple indexes
- now resolved
- put some additional thought into renaming of unnamed non-range index for extraction to ID sets
- originally had renamed to 'Orig_index_' followed by the 12 digit application number associated with the automunge(.) call
- realized this was kind of impractical, now defaults to 'Orig_index', unless that results in an overlap with existing columns for index or ID set, in which case the prior convention applies
- (and in cases where prior convention is inadequate additional comma suffixes added until resolved)
- in the process a few cleanups to the support function for deriving 'Automunge_index' string which has a similar convention
- note that when these strings deviate from default a validation result is reported to indexcolumn_valresult and origindexcolumn_valresult
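
The fallback order for naming the extracted index can be sketched as below. This is only an illustration of the convention described in these notes, not the library's internal code: the function name `resolve_index_name`, its arguments, and the sample application number are hypothetical.

```python
def resolve_index_name(existing_headers, application_number, default='Orig_index'):
    """Return (name, deviated), where name does not overlap existing_headers.

    deviated is True when the returned name differs from the default,
    which is the case where a validation result would be reported.
    """
    taken = set(existing_headers)

    # first choice: the plain default string
    if default not in taken:
        return default, False

    # second choice: the prior convention, default plus application number
    candidate = default + '_' + str(application_number)

    # last resort: append comma suffixes until the overlap is resolved
    while candidate in taken:
        candidate += ','

    return candidate, True


# hypothetical headers and a made-up 12 digit application number
print(resolve_index_name(['age', 'Orig_index'], 123456789012))
# ('Orig_index_123456789012', True)
```

The same loop shape applies to the arbitrary string used in the frequency sorting fix: keep appending a suffix until the string no longer collides with a column header.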


- found another validation oversight from 7.42. Was a complex rollout.
- now the float(int) scenario for noise_augment working as intended


- 7.42 had noted that mlti now has support for assigning specific assignparam specifications to each of the normcategory column applications to the mlti multi column set
- an additional code review identified that this aspect was insufficiently validated
- now working as intended


- New parameter accepted to both automunge(.) and postmunge(.) as noise_augment
- Accepts type int or float(int) >=0. Defaults to 0
- Used to specify a count of additional duplicates of training data prepared and concatenated with the original train set.
- Intended for use in conjunction with noise injection, such that the increased size of training corpus can be a form of data augmentation.
- Note that injected noise will be uniquely randomly sampled with each duplicate.
- When noise_augment is received as a dtype of int, one of the duplicates will be prepared without noise. When noise_augment is received as a dtype of float(int), all of the duplicates will be prepared with noise.
- When shuffletrain is activated the duplicates are collectively shuffled, and duplicates can be distinguished by comparing the original df_train.shape to the ID set's Automunge_index.
- Please be aware that with large dataframes a large duplicate count may run into memory constraints, in which case additional duplicates can be prepared separately in postmunge(.).
- The postmunge(.) noise_augment option takes into account traindata parameter for distinguishing whether to treat the duplicates as train or test data.
- Additional updates:
- mlti transform now takes into account assignparam specifications in parameters passed to normcategory
- this is relevant to mlti's use in context of noise injection to multi column hashing, as now can update trainnoise or testnoise to global_assignparam in similar fashion to other noise injection transforms
- this also has benefit that now can assign specific assignparam specifications to each of the normcategory column applications (mlti takes as input multi column sets)
- new scenarios supported for postmunge(.) traindata parameter as 'train_no_noise' and 'test_no_noise'
- used as you would expect, treating data passed to postmunge as train or test data but without noise injections
- this setting is primarily intended to support workflow with new noise_augment parameter
- added dupl_rows parameter support for validation data prep in automunge (treated consistently to training data)
- added random seeding support for validation data prep, matched to automunge randomseed when it is manually specified
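
The int versus float(int) distinction for noise_augment can be sketched as a small validation helper. This is a minimal sketch of the type rules as worded above, not the library's implementation: the function name `plan_noise_augment` and its return format are hypothetical.

```python
def plan_noise_augment(noise_augment=0):
    """Return (duplicate_count, all_with_noise) for a noise_augment value.

    int        -> one of the duplicates is prepared without noise
    float(int) -> all of the duplicates are prepared with noise
    """
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(noise_augment, bool) or not isinstance(noise_augment, (int, float)):
        raise TypeError('noise_augment must be int or float(int)')

    # floats must hold an integer value, e.g. 2.0 (this also rejects nan and inf)
    if isinstance(noise_augment, float) and not noise_augment.is_integer():
        raise ValueError('float noise_augment must be an integer value, e.g. 2.0')

    if noise_augment < 0:
        raise ValueError('noise_augment must be >= 0')

    return int(noise_augment), isinstance(noise_augment, float)


print(plan_noise_augment(2))    # (2, False)
print(plan_noise_augment(2.0))  # (2, True)
```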


- ML infill now available with XGBoost library and bayesian hyperparameter tuning by the Optuna library
- can activate XGBoost by passing ML_cmnd['autoML_type'] = 'xgboost'
- tuning can be activated by passing ML_cmnd['hyperparam_tuner'] = 'optuna_XGB1'
- defaults to no tuning with default parameters associated with XGBoost's scikit api (XGBClassifier and XGBRegressor)
- parameters can be passed to model training via ML_cmnd['MLinfill_cmnd']['xgboost_classifier_fit'] and ML_cmnd['MLinfill_cmnd']['xgboost_regressor_fit']
- gpu support available by passing a CUDA gpu device id to ML_cmnd['xgboost_gpu_id'], which automatically activates tree_method to gpu_hist
- if you have one gpu the device id is probably the integer 0
- (we recommend using ML_cmnd['xgboost_gpu_id'] instead of passing gpu parameters through ML_cmnd['MLinfill_cmnd'] so as to ensure consistent treatment between tuning and training)
- note the read me demonstrates how to designate gpu training with cpu inference, as may be desired when intending to put the postprocess_dict into production
- tuning applied via Bayesian optimization, making use of the Optuna library
- tuning supported by parameters ML_cmnd['optuna_n_iter'] and ML_cmnd['optuna_timeout'], which set the maximum number of iterations and the maximum tuning time in seconds for each feature
- when not specified defaulting to ML_cmnd['optuna_n_iter'] = 100, ML_cmnd['optuna_timeout'] = 600 (i.e. 100 tuning iterations or 600 seconds for each feature, whichever comes first)
- tuning implementation was partly inspired by the optuna tutorial at
- currently based on a fixed validation split (possibly pending further research into incorporating cross validation / early stopping)
- in the process of drafting xgboost wrappers, identified a potential inconsistency between customML implementation and documentation
- we noted that customML receives ordinal encoded labels as str(int) with fully represented sequential range integer encodings from 0 to max of encoding space (which xgboost likes)
- turned out there was a scenario where this conversion wasn't being performed, now resolved
- *Please note we were having trouble validating the xgboost GPU support, possibly due to an issue with local hardware. For now please consider GPU support experimental, pending further validation.
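
Collecting the entries above, an ML_cmnd specification for XGBoost with Optuna tuning might look like the following. The keys and the default tuning values are those named in these notes. The empty fit dictionaries are placeholders, and the gpu entry is left commented out since GPU support is noted as experimental.

```python
# sketch of an ML_cmnd dictionary for ML infill with XGBoost and Optuna tuning
ML_cmnd = {
    'autoML_type': 'xgboost',           # activates XGBoost for ML infill
    'hyperparam_tuner': 'optuna_XGB1',  # omit this entry for no tuning
    'optuna_n_iter': 100,               # max tuning iterations per feature
    'optuna_timeout': 600,              # max tuning seconds per feature
    # 'xgboost_gpu_id': 0,              # experimental: CUDA device id
    'MLinfill_cmnd': {
        # placeholders: populate with parameters to pass to model training
        'xgboost_classifier_fit': {},
        'xgboost_regressor_fit': {},
    },
}
```

The dictionary would then be passed as the ML_cmnd parameter of the automunge(.) call.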

