Automunge

6.64

- removed collections.Counter import that is no longer used after 6.63 rewrite of _evalcategory
- performed an audit of leakage_sets rolled out in 6.61
- identified an opportunity for improved specification granularity
- specifically, leakage_sets as implemented were for specifying bidirectional ML infill basis exclusions
- i.e. for a list of features, every feature in list was excluded from basis of every other feature in list
- realized there may be scenarios where a unidirectional exclusion is preferred
- such as to exclude feature2 from feature1 basis but include feature1 in feature2 basis
- so now in addition to ML_cmnd['leakage_sets'] for bidirectional specification, the user can also specify unidirectional exclusions in ML_cmnd['leakage_dict'], which accepts dictionaries in the form of a column header key with a value of a set of column headers (a combined usage sketch follows this list):
leakage_dict = {feature1 : {feature2}}
- where headers can be specified in input or returned header conventions or combinations thereof (where returned headers include suffix appenders)
- note that as part of this update, ML infill exclusions derived as a result of leakage_tolerance specification are now captured in a unidirectional capacity as opposed to bidirectional
- which is more in line with our description provided as part of conference review
- note that in returned postprocess_dict['ML_cmnd'], the prior returned entries of leakage_sets_orig and leakage_sets_derived are now replaced with leakage_dict_orig and leakage_dict_derived
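
A quick hedged illustration (with hypothetical column headers 'feature1' through 'feature4'), combining the bidirectional and unidirectional specifications in a single ML_cmnd:

    ML_cmnd = {
        # bidirectional: feature3 and feature4 each excluded from the other's basis
        'leakage_sets': [['feature3', 'feature4']],
        # unidirectional: feature2 excluded from feature1's basis,
        # while feature1 remains eligible for feature2's basis
        'leakage_dict': {'feature1': {'feature2'}},
    }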

6.63

- corrected the function name validate_allvalidnumeric to include a leading underscore, for consistency with the internal function naming convention applied in the rest of the library
- reconfigured "leakage_dict" data structure population associated with leakage sets to be based on set aggregation
- instead of the prior configuration of searching within lists
- which will benefit automunge(.) latency associated with this operation
- significant rewrite for speed and clarity of _evalcategory
- revised the derivations of most common data types to a much more efficient method (a sketch of the general approach follows this list)
- resulting in a material improvement to automunge(.) latency
- the rewrite also resulted in a much cleaner code presentation, which we believe will make this function easier to understand
- (this function was one of the first ones that we wrote :)
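
For illustration, a hedged sketch (not the library's internal code) of the kind of vectorized pandas approach that can replace a collections.Counter loop for deriving a column's most common data type:

    import pandas as pd

    def most_common_dtype(column):
        # map each entry to its type, then take the modal type from the counts
        return column.map(type).value_counts().idxmax()

    s = pd.Series([1, 2.5, 'a', 3, 'b', 4])
    print(most_common_dtype(s))  # <class 'int'>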

6.62

- new suite of parameter validations associated with the ML_cmnd parameter
- which is the data structure used to set options and pass parameters to predictive models associated with ML infill, feature importance, or PCA
- as well as various other ML infill options like early stopping through iterations, stochastic noise injections, hyperparameter tuning, leakage assessment, etc. (an illustrative aggregation follows this list)
- also moved all of the default ML_cmnd initializations and validations into a single location for code clarity, now all performed in _check_ML_cmnd
- should make future developments on this data structure much easier to maintain
- in parallel, improved some corresponding internal documentation
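
For orientation, a hedged aggregation of ML_cmnd entries that appear elsewhere in these notes (values shown are the documented defaults where stated; per our understanding 'autoML_type' is the entry for selecting among the randomforest, catboost, and xgboost autoML options):

    ML_cmnd = {
        'autoML_type': 'randomforest',         # predictive model library for ML infill
        'stochastic_training_seed': True,      # stochastic per-model random seeds (6.59)
        'stochastic_impute_numeric': False,    # noise injection to numeric imputations (6.59)
        'stochastic_impute_categoric': False,  # noise injection to categoric imputations (6.59)
        'leakage_tolerance': 0.85,             # automated leakage assessment threshold (6.61)
    }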

6.61

- a few cleanups to support functions _createMLinfillsets and _createpostMLinfillsets
- including replacement of a few kind of hacky concatenate and drop operations with a much cleaner pandas .iloc
- (this function was first implemented very early in development)
- new ML_cmnd option as can be passed to ML_cmnd['leakage_sets']
- leakage_sets can be passed as either a list of input columns or a list of lists of input columns
- user can also pass returned column headers if a subset of features derived from a common input feature are to be included in a leakage set
- where each list of input columns is for specification of features that are to be excluded from each other's ML infill basis
- in other words, features with known data cross leakage issues can now be specified by the user for exclusion from each other's ML infill basis
- new ML_cmnd option as can be passed to ML_cmnd['leakage_tolerance']
- leakage tolerance is associated with a new automated evaluation for a potential source of data leakage across features in their respective imputation model basis
- compares aggregated NArw activations from a target feature in a train set to the surrounding features in the train set, and for cases where separate features share a high correlation of missing data based on the shown formula, excludes those surrounding features from the imputation model basis for the target feature (a code sketch of this evaluation follows this list)
- ((NArw1 + NArw2) == 2).sum() / NArw1.sum() > leakage_tolerance
- where target features are those input columns with some returned column serving as target for ML infill
- leakage_tolerance defaults to 0.85 when not specified, and can be set as 1 or False to deactivate the assessment
- to perform the operation, set ML_cmnd['leakage_tolerance'] to a float between 0-1
- where the lower the value, the more likely features are to be excluded from each other's basis
- leakage_tolerance is implemented by adding results of evaluation to any user specified leakage_sets
- where sets are collected in three forms in the returned postprocess_dict['ML_cmnd']: 'leakage_sets_orig' (user-passed sets prior to derivations), 'leakage_sets_derived' (derived sets), and 'leakage_sets' (combination of orig and derived)
- please note that there is a small latency penalty associated with this operation in automunge(.) and no meaningful penalty in postmunge(.)
- new postprocess_dict entry ['ML_cmnd_orig'], which is a dictionary recording (for informational purposes) the original form of ML_cmnd as passed to automunge(.), prior to any initializations and updates such as those based on leakage_tolerance or _check_ML_cmnd
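
A hedged sketch of the leakage_tolerance evaluation (not the library's internal implementation), operating on numpy arrays of NArw activations where 1 marks a missing entry:

    import numpy as np

    def exclude_from_basis(NArw1, NArw2, leakage_tolerance=0.85):
        # ratio of the target feature's missing entries that are also
        # missing in the surrounding feature
        shared = ((NArw1 + NArw2) == 2).sum()
        return shared / NArw1.sum() > leakage_tolerance

    NArw1 = np.array([1, 0, 1, 1, 0])  # target feature activations
    NArw2 = np.array([1, 0, 1, 0, 0])  # surrounding feature activations
    print(exclude_from_basis(NArw1, NArw2))  # 2/3 ≈ 0.67 -> False at the 0.85 default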

6.60

- reverting the updates associated with 6.57
- upon some reflection we do not feel we have sufficient comfort in our hyperparameter tuning implementation to justify gradient boosting from an autoML standpoint
- and don't want to distract our users with an option that has a tendency to overfit when not tuned

6.59

- introduced a default convention of applying a stochastically derived random seed to model training for each ML infill model, including each model across features and each model across iterations
- to deactivate in favor of deterministic training based on the automunge(.) randomseed parameter, pass ML_cmnd['stochastic_training_seed'] = False
- please note that currently a randomseed is only inspected in the randomforest, catboost, and xgboost autoML options
- new option for incorporating stochasticity into imputations derived through ML infill
- can activate for numeric and/or categoric features by passing ML_cmnd['stochastic_impute_categoric'] = True and/or ML_cmnd['stochastic_impute_numeric'] = True
- stochastic imputations bear some similarity to DP family of noise injection transforms
- in that noise sampled with numpy.random is injected into the imputations prior to insertion
- for numeric stochastic imputation we applied a similar method to our DPmm transform
- here we convert the imputation set into a min/max scaled representation to ensure the noise distribution parameters align with the range of the data, with the scaling based on the min/max found in the training data
- we sample noise from a gaussian distribution, or optionally from a laplace distribution
- with noise scaling defaulting to mu=0, sigma=0.03, and flip_prob=0.06 (where flip_prob is the ratio of a feature set's imputations receiving injections)
- note that in order to ensure the range of resulting imputations is consistent with the range in df_train, we cap outlier noise entries at +/- 0.5, scale negative noise when the min/max representation is below the midpoint, and scale positive noise when the min/max representation is above the midpoint, resulting in a consistent returned range independent of noise sampling (see the sketch following this list for the full numeric flow)
- after noise injection the imputation set is converted back by an inversion of the min/max representation
- noise injections to categoric features are based on a parameter defaulting to flip_prob=0.03 (where flip_prob is the ratio of a feature set's imputations receiving injections)
- injections are conducted by randomly flipping the entries in a target row to a random draw from the set of unique activation sets (which may include 1 or more columns) based on a draw from a uniform distribution
- please note that this includes the possibility that an injection entry will retain the original representation based on the random draw
- please note that the associated parameters can be configured by ML_cmnd entries to 'stochastic_impute_categoric_flip_prob', 'stochastic_impute_numeric_mu', 'stochastic_impute_numeric_sigma', 'stochastic_impute_numeric_flip_prob', 'stochastic_impute_numeric_noisedistribution'
- (where these entries all accept floats, except 'stochastic_impute_numeric_noisedistribution', which accepts one of {'normal', 'laplace'})
- please note that we suspect stochastic imputations may have potential to interfere with the infill iterations early stopping criteria rolled out in 6.58, based on the scale of injections
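
A hedged sketch of the numeric stochastic imputation flow described above (parameter names and defaults are from these notes; the implementation details are an approximation rather than the library's internal code):

    import numpy as np

    def stochastic_impute_numeric(imputations, train_min, train_max,
                                  mu=0.0, sigma=0.03, flip_prob=0.06,
                                  noisedistribution='normal', rng=None):
        if rng is None:
            rng = np.random.default_rng()
        # convert the imputation set to a min/max scaled representation
        # based on the min/max found in the training data
        scaled = (imputations - train_min) / (train_max - train_min)
        # select the flip_prob ratio of entries receiving injections
        mask = rng.random(scaled.shape) < flip_prob
        # sample noise from a gaussian or optionally a laplace distribution
        if noisedistribution == 'laplace':
            noise = rng.laplace(mu, sigma, scaled.shape)
        else:
            noise = rng.normal(mu, sigma, scaled.shape)
        # cap outlier noise entries at +/- 0.5
        noise = np.clip(noise, -0.5, 0.5)
        # scale negative noise below the midpoint and positive noise above it,
        # so injected entries stay within the training data's min/max range
        noise = np.where((noise < 0) & (scaled < 0.5), noise * scaled / 0.5, noise)
        noise = np.where((noise > 0) & (scaled > 0.5), noise * (1 - scaled) / 0.5, noise)
        scaled = scaled + noise * mask
        # invert the min/max representation back to the original range
        return scaled * (train_max - train_min) + train_min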
