Automunge

Latest version: v8.33


6.58

- new early stopping criteria available for iterations of ML infill applied under infilliterate
- ML infill still defaults to 1 iteration with increased specification available by the infilliterate parameter
- the infilliterate parameter will serve as the maximum number of iterations when the early stopping criteria are not reached
- user can activate early stopping criteria by passing an ML_cmnd entry as ML_cmnd['halt_iterate'] = True (see the sketch following this list)
- which will evaluate imputation derivations in comparison to the previous iteration and halt iterations when a sufficiently low tolerance is reached
- the evaluation separately considers categoric features in aggregate and numeric features in aggregate
- the categoric halting criterion compares a categoric tolerance value against the ratio of the number of unequal imputations between iterations to the total number of imputations across categoric features
- the numeric halting criterion computes for each numeric feature the ratio of max(abs(delta)) between imputation iterations to mean(abs(entries)) of the current iteration; these ratios are then weighted between features by the quantity of imputations associated with each feature and compared to a numeric tolerance value
- the tolerance values default to categoric_tol = 0.05 and numeric_tol = 0.01, each user-specifiable as float entries to ML_cmnd['categoric_tol'] and ML_cmnd['numeric_tol']
- early stopping is applied when both the numeric features in aggregate and the categoric features in aggregate fall below their associated tolerances as evaluated for the current iteration in comparison to the preceding iteration
- and the resulting number of iterations is then the infilliterate value applied in postmunge
- please note that our numeric early stopping criteria were informed by review of the scikit-learn IterativeImputer approach, and our categoric early stopping criteria were informed by review of the MissForest early stopping criteria, however there are some fundamental differences with our approach in both cases
- for instance, the formula of our numeric criterion is unique, and the tolerance evaluation approach of our categoric criterion is unique
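- for reference, a minimal sketch of activating early stopping with the default tolerances (keys per the entries described above):
ML_cmnd = {'halt_iterate': True,    # activate early stopping evaluation
           'categoric_tol': 0.05,   # default categoric tolerance
           'numeric_tol': 0.01}     # default numeric tolerance
- and a hedged sketch of the halting evaluation as described above (illustrative only, not the library internals; here prior and current are per-feature arrays of imputations from consecutive iterations, and counts is the quantity of imputations per numeric feature):
import numpy as np

def numeric_halt_check(prior, current, counts, numeric_tol=0.01):
    # per feature: max(abs(delta)) between iterations relative to
    # mean(abs(entries)) of the current iteration
    ratios = [np.max(np.abs(c - p)) / np.mean(np.abs(c))
              for p, c in zip(prior, current)]
    # weight features by their quantity of imputations
    return np.average(ratios, weights=counts) < numeric_tol

def categoric_halt_check(prior, current, categoric_tol=0.05):
    # ratio of unequal imputations between iterations to the total
    # number of imputations across categoric features
    unequal = sum((p != c).sum() for p, c in zip(prior, current))
    total = sum(len(c) for c in current)
    return unequal / total < categoric_tol
- early stopping would then apply when both checks return True for the current iteration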

6.57

- new gradient boosting option for ML infill via XGBoost
- available by ML_cmnd = {'autoML_type':'xgboost'}
- defaults to verbosity 0
- can pass model initialization and fit parameters by e.g.
ML_cmnd = {'autoML_type': 'xgboost',
           'MLinfill_cmnd': {'xgboost_classifier_model': {'verbosity': 0},
                             'xgboost_classifier_fit': {'learning_rate': 0.1},
                             'xgboost_regressor_model': {'verbosity': 0},
                             'xgboost_regressor_fit': {'learning_rate': 0.1}}}
- hyperparameter tuning available by grid search or random search, tuned for each feature
- can activate grid search tuning by passing any of the fit parameters as a list or range of values
- parameters received as static values (e.g. a string, integer, or float) are untuned, e.g.
ML_cmnd = {'autoML_type': 'xgboost',
           'MLinfill_cmnd': {'xgboost_classifier_fit': {'learning_rate': [0.1, 0.2],
                                                        'max_depth': 12},
                             'xgboost_regressor_fit': {'learning_rate': [0.1, 0.2],
                                                       'max_depth': 12}}}
- random search also accepts scipy stats distributions as fit parameters
- to activate random search, set the hyperparam_tuner entry to 'randomCV'
- and set the desired number of iterations with randomCV_n_iter (defaults to 100), e.g.
from scipy import stats
ML_cmnd = {'autoML_type': 'xgboost',
           'hyperparam_tuner': 'randomCV',
           'randomCV_n_iter': 5,
           'MLinfill_cmnd': {'xgboost_classifier_fit': {'learning_rate': [0.1, 0.2],
                                                        'max_depth': stats.randint(12, 15)},
                             'xgboost_regressor_fit': {'learning_rate': [0.1, 0.2],
                                                       'max_depth': stats.randint(12, 15)}}}
- also revisited random forest tuning and consolidated a redundant training operation

6.56

- formalized conventions for data returned from inversion
- we had previously noted in the read me that the general convention is that data not successfully recovered is recorded as the reserved string 'zzzinfill'
- a survey of the various transforms found that there were actually multiple conventions in use
- for transforms that had some kind of imputation in the forward pass, the convention is that data returned from inversion corresponds to that imputation
- for transforms without imputation, such as one-hot encoding without ML infill, the revised convention is that entries not recovered are now recorded as NaN, which we believe is more in line with common practice
- there is still a scenario where data returned from inversion records entries without recovery as the reserved string 'zzzinfill': specifically, transforms that return from inversion a column of pandas object dtype where empty entries may not correlate with missing data (such as string parsing and search functions), since a pandas object dtype column would otherwise convert NaN to the string 'nan' (see the illustration below)
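- for illustration (plain pandas behavior, not library code), the object dtype conversion motivating the reserved string:
import numpy as np
import pandas as pd

# casting an object dtype column with missing entries to str
# silently converts NaN to the string 'nan'
s = pd.Series(['abc', np.nan], dtype=object)
print(s.astype(str).tolist())  # ['abc', 'nan']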

6.55

- added inplace support to search functions srch/src2/src3/src4
- added defaultinfill support to search functions srch/src4
- consolidated a redundant defaultinfill application to search functions src2/src3
- added case assignparam support to search functions src2/src3 (previously included with srch/src4)
- added support for aggregated search terms to src3 (previously included with srch/src2/src4; a usage sketch follows this list)
- found and fixed edge case in srch/src2/src4 for cases where an aggregated search term returns an empty set
- please note that our search functions are patent pending
- added boolean entry support to the one-hot encoding support function rolled out in 6.54
- boolean entries are sorted differently than by pd.get_dummies, which means there will be a small impact to backward compatibility for transforms that previously applied the pandas version to boolean entries
- added defaultinfill support for received columns of pandas category dtype
- consolidated a redundancy in defaultinfill code so as to use single common support function in both dual/single/post process and custom_train conventions
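- for reference, a hedged sketch of specifying search terms through assignparam (the 'search' parameter name and the embedded list convention for aggregated terms are recalled from the read me rather than verified here, and 'comments' is a hypothetical column header):
assignparam = {'srch': {'comments': {'search': ['cat', ['dog', 'puppy']]}}}
# the embedded list ['dog', 'puppy'] would be aggregated
# to a single returned activation column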

6.54

- added a clarification to the read me that under automation numeric data received as a pandas categoric data type will be treated as categoric for binarization instead of normalization
- new temporary postprocess_dict entry temp_miscparameters_results added at initialization
- temp_miscparameters_results is for storing validation results received in various support functions that might not have access to miscparameters_results
- which is then consolidated with miscparameters_results and the temporary entry struck
- new validation result reported in miscparameters_results as treecategory_with_empty_processing_functions_valresult
- which is for the scenario of a tree category without populated processing functions that we noted in 6.50
- validation result is populated with a unique entry for each occurrence, recording the tree category without processing functions and the associated root category whose generations included the tree category
- populated with key as integer i incremented with each occurrence as:
treecategory_with_empty_processing_functions_valresult = \
    {i: {'treecategory': treecategory,
         'rootcategory': rootcategory}}
- identified a superfluous copy operation within the transformation functions associated with populating returned column_dict data structure
- so globally struck this copy operation
- created a new support function for one hot encoding as _onehot_support
- this function returns a column order comparable to pd.get_dummies
- the index retention convention is also comparable
- part of the reason for creating this function is so we will be able to experiment with variations for potential use in different transformation function scenarios
- new trigonometric family of transforms sint/cost/tant/arsn/arcs/artn
- built on top of the numpy trigonometric operations np.sin, np.cos, np.tan, np.arcsin, np.arccos, np.arctan
- inversion supported with partial information recovery
- defaults to a defaultinfill of adjinfill (since these are intended for periodic sets, a static imputation makes less sense; see the sketch following this list)
- the inspiration for incorporating trigonometric functions came from looking at a calculator
- audited family trees for cases with omitted NArw_marker support, found a few entries whose omission I believe was accidental
- in the process identified from memory a case where I may not have provided sufficient citation at the time of rollout
- for clarity, our use of adjacent sin and cos transformations for periodic datetime data, as well as binarization as an alternative to one-hot encoding,
- were directly inspired by github issue suggestions submitted by github user "solalatus"
- (I had noted his input in the acknowledgements of one of our papers, just thought it might be worth reiterating)
- this user provided links to blog posts associated with both of these concepts, which were rolled out in 2.47
- the datetime blog post was an article by Ian London titled "Encoding cyclical continuous features - 24-hour time"
- the binarization blog was one of a few articles, I'm not sure which, though I think it was a blog post by Rakesh Ravi titled "One-Hot Encoding is making your Tree-Based Ensembles worse, here’s why?"
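- for illustration, a minimal sketch of a sint-style transform under the adjinfill convention (not the library implementation; forward/backward fill stands in for adjacent cell infill):
import numpy as np
import pandas as pd

def sint_sketch(column):
    # adjacent infill: fill missing entries from neighboring cells,
    # since a static imputation makes less sense for periodic data
    filled = column.ffill().bfill()
    return np.sin(filled)

df = pd.DataFrame({'theta': [0.0, np.pi / 2, np.nan, np.pi]})
df['theta_sint'] = sint_sketch(df['theta'])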

6.53

- a walkthrough of the evalcategory function identified a snafu for numeric sets passed as pandas categoric type
- it looks like we had accidentally inserted an intermediate if statement between a previously paired if/else combo, which resulted in categoric sets with integer or float entries getting treated as numeric under automation
- (that I believe is the most likely explanation, although it is also possible I just poorly implemented the first if statement, if I am remembering the order of these code edits incorrectly; this function has gradually evolved over a long period)
- the intent was that received columns of pandas type 'category' get treated to a default categoric encoding (bnry or 1010), even in cases where their entries are numeric
- the intermediate if statement I just struck; I think from a while back I was trying to get a little too creative, treating numeric sets with 3 unique entries as a one-hot encoding instead of normalization (for reasons that I'm now having a hard time identifying, in other words I don't think there was a good reason), and to make matters worse, along the way the 3 state one-hot got converted to a binarization, and yeah, long story short (too late), etc.
- so now the corrected (and simplified) convention is that all majority numeric sets (with integers or floats) under automation are normalized, unless the set is received as a pandas 'category' data type, in which case it is treated to a binarization by bnry or 1010 based on its unique entry count (see the sketch following this list)
- also removed an unused code snippet in same function
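- for clarity, a hedged sketch of the simplified convention (illustrative only, not the actual evalcategory code; dtype checks stand in for the library's majority-entry evaluation, and 'nmbr' stands in for the default normalization category):
import pandas as pd
from pandas.api import types as pdtypes

def eval_sketch(column):
    if isinstance(column.dtype, pd.CategoricalDtype):
        # pandas 'category' dtype: default categoric encoding,
        # bnry for two unique entries, otherwise 1010 binarization
        return 'bnry' if column.nunique() <= 2 else '1010'
    if pdtypes.is_numeric_dtype(column):
        # majority numeric sets under automation are normalized
        return 'nmbr'
    return '1010'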
