- quality control audit / walkthrough of: ML infill, stochastic impute, halting criteria, leakage tolerance, parameter assignment precedences, infill assignments
- found a process flaw in noise injections to ML infill imputations for categoric features, resolved by replacing a passed dataframe df_train with df_train_filllabel
- and also some consolidation to a single convention for infill and label headers returned from predictinfill
- so basically the new convention is that derived infill, in all scenarios and for all infill types, is now passed to insertinfill with column header(s) consistent with the target feature
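- as an illustration of the idea, here is a minimal sketch assuming a hypothetical helper (not Automunge internals) that aligns derived infill headers to the target feature before insertion:

```python
import pandas as pd

# hypothetical helper (not Automunge internals): derived infill is given
# column header(s) matching the target feature's categorylist before being
# passed to the insertion routine, so insertion can rely on consistent
# headers for every infill type
def align_infill_headers(df_infill: pd.DataFrame, categorylist: list) -> pd.DataFrame:
    df_infill = df_infill.copy()
    # positionally rename the derived infill columns to the target feature headers
    df_infill.columns = categorylist
    return df_infill
```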
- added a copy operation to the df_train_filllabel received by a few of the autoML wrappers to ensure column header retention outside of the function
- updated the column header convention for infill returned from predictinfill and predictpostinfill, so that the headers of returned infill now match the headers of the labels, which in turn are matched to the categorylist (or the adjusted categorylist for concurrent)
- this resolves a newly identified issue with the recently revised insertinfill being applied to inferred onehot categoric imputations
- updated derived infill header for meaninfill and medianinfill
- fixed a variable address bug for naninfill in postmunge
- found a few cases of for loops iterating over lists that were being edited inside the loop; replaced the iteration target with a deep copy, which is better Python practice
- (deepcopy and .copy() for lists are a little different, and it's something of a grey area when the list is derived from dictionary keys, so went ahead with a deepcopy as a precautionary measure)
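- for illustration, a small generic Python example of the pattern that was replaced (not a specific Automunge excerpt):

```python
from copy import deepcopy

columns = ['a', 'b', 'c', 'd']

# iterating over the original list while removing entries would skip items
# after each removal; iterating over a deep copy avoids that
for column in deepcopy(columns):
    if column in ('b', 'c'):
        columns.remove(column)

print(columns)  # ['a', 'd']
```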
- new noise distribution options supported for the DP family of transforms targeting numeric sets, including DPnb, DPmm, DPrt, DLnb, DLmm, DLrt, available by passing the parameter noisedistribution as one of {'abs_normal', 'negabs_normal', 'abs_laplace', 'negabs_laplace'}, where the prefix 'abs' refers to injecting only positive noise by taking the absolute value of sampled noise, and the prefix 'negabs' refers to injecting only negative noise by taking the negative absolute value of sampled noise
- this may be suitable for applications where noise should only shift entries in one direction
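- as a rough illustration of the sampling convention, a generic numpy sketch (not Automunge internals; the mu and sigma defaults here are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def sample_noise(noisedistribution, size, mu=0.0, sigma=0.03):
    # 'abs_*' variants inject only positive noise by taking the absolute
    # value of sampled noise, 'negabs_*' variants inject only negative noise
    # by taking the negative of the absolute value
    if noisedistribution == 'abs_normal':
        return np.abs(rng.normal(mu, sigma, size))
    if noisedistribution == 'negabs_normal':
        return -np.abs(rng.normal(mu, sigma, size))
    if noisedistribution == 'abs_laplace':
        return np.abs(rng.laplace(mu, sigma, size))
    if noisedistribution == 'negabs_laplace':
        return -np.abs(rng.laplace(mu, sigma, size))
    # default: two-sided gaussian noise
    return rng.normal(mu, sigma, size)
```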
- comparable noise distribution options also added for stochastic_impute, available by specification to ML_cmnd['stochastic_impute_numeric_noisedistribution']
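- a hedged usage sketch, assuming an initialized AutoMunge instance am and placeholder inputs df_train and 'target', with the returned sets abbreviated to a single tuple:

```python
# ML_cmnd entry as described above; 'abs_laplace' chosen arbitrarily
ML_cmnd = {'stochastic_impute_numeric_noisedistribution': 'abs_laplace'}

returned_sets = am.automunge(df_train,
                             labels_column='target',
                             MLinfill=True,
                             ML_cmnd=ML_cmnd)
```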
- lifted the restriction that inversion is not supported with privacy_encode='private'; now that encryption is available, the user has an avenue to restrict inversion as needed
- updated the convention for privacy_encode: returned ID sets no longer receive anonymization, only row shuffling and an index reset are applied to match the other sets in the privacy scenario
- removing anonymization for ID sets makes sense for a few reasons. First, inversion is not available for ID sets. Second, it provides a channel for recovering row information that wouldn't otherwise be recoverable even with inversion, with that information kept in a separate set from the returned features and labels. So if row anonymization is desired, the user can simply withhold the ID sets. It also makes sense since the ID columns specified for automunge may differ from the ID columns specified for postmunge. I like this convention.
- update to the convention for privacy_encode: label anonymization is now only performed in the privacy_encode='private' scenario
- basically the distinction between privacy_encode=True and 'private' is that 'private' adds row shuffling, an index reset, and label anonymization, and, when encryption is performed, the True scenario returns a public resource for label inversion while the 'private' scenario only allows inversion with the encryption key
- gradually honing in on a cohesive strategy for privacy_encode
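- a hedged sketch of the two scenarios, again assuming an initialized AutoMunge instance am with placeholder inputs and the returned sets abbreviated:

```python
# privacy_encode=True: base privacy scenario; when encryption is performed
# a public resource for label inversion is returned
returned_public = am.automunge(df_train,
                               labels_column='target',
                               privacy_encode=True)

# privacy_encode='private': additionally applies row shuffling, index reset,
# and label anonymization; inversion then requires the encryption key
returned_private = am.automunge(df_train,
                                labels_column='target',
                                privacy_encode='private')

# in both cases returned ID sets are not anonymized, only row shuffled and
# index reset to stay aligned with the other returned sets
```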
- a correction to automunge ML infill targeting test data so that infill is conditional on the test NArw instead of the train NArw (this was a relic from earlier iterations where we didn't train an ML infill model when the train set didn't have missing data; our current convention is that when ML infill is activated we train an imputation model for all features with supported MLinfilltypes)
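- for illustration, a minimal sketch of the corrected condition using a hypothetical helper (not Automunge internals):

```python
import pandas as pd

# hypothetical helper (not Automunge internals): test set imputations are
# applied at the rows flagged by the test set's own NArw activations rather
# than the train set's
def apply_test_infill(df_test, target_column, test_NArw, infill):
    df_test = df_test.copy()
    mask = test_NArw.astype(bool)
    # infill is expected to carry one entry per flagged row
    df_test.loc[mask, target_column] = list(infill)
    return df_test
```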
- a tweak to the postmunge ML infill stochastic impute flow to better align with the treatment of test data in automunge
- updated the writeup for the encryption options rolled out in the last update to clarify that they are built on top of the pycrypto library