Automunge

Latest version: v8.33

7.16

- new encrypt_key parameter now available for automunge(.) and postmunge(.)
- automunge(.) accepts encrypt_key as one of {False, 16, 24, 32, bytes}
- where bytes means a bytes type object with length of 16, 24, or 32
- encrypt_key defaults to False; all other scenarios result in encryption of the returned postprocess_dict
- 16, 24, and 32 refer to the key size in bytes, where a key size of 16 aligns with 128 bit encryption and 32 aligns with 256 bit
- when encrypt_key is passed as an integer, an encryption key is derived and returned in the closing printouts
- this returned printout should be copied and saved for use with the postmunge(.) encrypt_key parameter
- in other words, without this encryption key, the user will not be able to prepare additional data in postmunge(.) with the returned postprocess_dict
- when encrypt_key is passed as a bytes object (of length 16, 24, or 32), it is treated as a user specified encryption key and not returned in printouts
- when data is encrypted, the postprocess_dict returned from automunge(.) is still a dictionary that can be downloaded and uploaded with pickle
- and based on which scenario was selected by the privacy_encode parameter, the returned postprocess_dict may still contain some public entries that are not encrypted, such as ['columntype_report', 'label_columntype_report', 'privacy_encode', 'automungeversion', 'labelsencoding_dict', 'FS_sorted']
- where FS_sorted is omitted when privacy_encode is not False
- and all public entries are omitted when privacy_encode = 'private'
- the encryption key, as either returned in printouts or based on user specification, can then be passed to the postmunge(.) encrypt_key parameter to prepare additional data
- thus privacy_encode may now be fully private, and a user with access to the returned postprocess_dict will not be able to invert training data without the encryption key
- small deviation for privacy_encode == 'private' scenario
- we are keeping convention that train and test data have their rows shuffled and dataframe index reset
- decided it would be better to have some channel to recover index position in the private scenario if needed
- so in the private scenario, the Automunge_index column returned in the ID sets is retained
- since ID sets are returned as a separate dataframe, if the user wishes data to remain fully row-wise anonymous they can share just the train/test/labels data but keep the ID sets private
- found an oversight in the privacy encoded versions of columntype_report: had thought columntype_report contained both training features and labels, but labels are broken out into a separate label_columntype_report; now both are anonymized for privacy_encode
- found an oversight for privacy_encode associated with information channels in postmunge
- postmunge returned postreports_dict now omits entries ['featureimportance', 'FS_sorted', 'driftreport', 'rowcount_basis', 'sourcecolumn_drift']
- postmunge no longer supports non-default entries for the following parameters when privacy_encode was activated in postmunge: [featureeval, driftreport]
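
The accepted forms of encrypt_key listed above can be sketched as a small validation routine. This is an illustration only: `resolve_encrypt_key` is a hypothetical helper rather than part of the Automunge API, and deriving a key with `secrets.token_bytes` is an assumption, since these notes do not state how Automunge derives the key it prints.

```python
import secrets

# Key lengths in bytes accepted by encrypt_key per the notes above
VALID_KEY_LENGTHS = (16, 24, 32)


def resolve_encrypt_key(encrypt_key=False):
    """Hypothetical helper: return (key, generated) for an encrypt_key value.

    key is None when encryption is off, otherwise a bytes object.
    generated is True when the key was derived from an integer length,
    in which case the caller must save it for later postmunge(.) use.
    """
    message = "encrypt_key must be one of False, 16, 24, 32, or bytes of that length"
    # False is the default and means no encryption
    if encrypt_key is False:
        return None, False
    # bool is a subclass of int, so True is rejected before the int check
    if isinstance(encrypt_key, bool):
        raise ValueError(message)
    # An integer requests a freshly derived key of that many bytes (assumed mechanism)
    if isinstance(encrypt_key, int):
        if encrypt_key not in VALID_KEY_LENGTHS:
            raise ValueError(message)
        return secrets.token_bytes(encrypt_key), True
    # A bytes object is treated as a user specified key and is not regenerated
    if isinstance(encrypt_key, bytes):
        if len(encrypt_key) not in VALID_KEY_LENGTHS:
            raise ValueError(message)
        return encrypt_key, False
    raise ValueError(message)


key, generated = resolve_encrypt_key(16)
print(len(key), generated)  # prints: 16 True
```

The `generated` flag mirrors the distinction drawn in the notes: a derived key has to be shown to the user once so it can be saved, while a user specified key never needs to be printed.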

7.15

- new option for privacy_encode parameter as 'private'
- previously privacy_encode accepted a boolean defaulting to False; when True, the column names and order of columns are anonymized, as well as some of the returned data structures like columntype_report
- in the new 'private' option, these measures are supplemented by shuffling the rows of all datasets in automunge and postmunge, consistent with what is otherwise available with the shuffletrain parameter
- additionally, as measures to further anonymize, inversion is not supported for privacy_encode=='private', dataframe indexes are reset, and index columns returned in ID sets are reset
- we recommend privacy_encode=='private' primarily as a resource for unsupervised learning applications
- or otherwise for scenarios of allowing model training / hyperparameter experiments by external party without needing to pair resulting inferences to specific input rows
- note that if you want to match the privacy_encode form in a separate automunge(.) call with corresponding data, you can do so by matching the automunge(.) randomseed
- and if you want to retain a row identifier without sharing with external party you can populate your own ID set
- we thought about some additional measures, like having postmunge require a minimum number of unique rows in order to prepare additional data, but for now since running postmunge means user has access to postprocess_dict there is no added benefit
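
The note above about matching randomseed relies on a seeded shuffle being reproducible. The sketch below illustrates that principle with the standard library only. `shuffle_rows` is a hypothetical helper and the rows are made-up data; Automunge's own shuffling operates on pandas dataframes and is not reproduced here.

```python
import random


def shuffle_rows(rows, randomseed):
    """Return a shuffled copy of rows using a seeded generator.

    Returning a new list discards the original positions, which is the
    plain-Python analogue of resetting a dataframe index after shuffling.
    """
    shuffled = list(rows)
    random.Random(randomseed).shuffle(shuffled)
    return shuffled


train = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]

first_call = shuffle_rows(train, randomseed=42)
second_call = shuffle_rows(train, randomseed=42)

# The same seed yields the same row order in separate calls
assert first_call == second_call
# Shuffling reorders rows without adding or dropping any
assert sorted(first_call) == sorted(train)
```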

7.14

- updated the ML_cmnd address to pass parameters to customML training and inference from ML_cmnd['MLinfill_cmnd']['customClassifier'] to ML_cmnd['MLinfill_cmnd']['customML_Classifier'], and from ML_cmnd['MLinfill_cmnd']['customRegressor'] to ML_cmnd['MLinfill_cmnd']['customML_Regressor']
- this was to better align on terminology by referring to custom ML operations as "customML"
- new validation result returned with Binary application as postprocess_dict['miscparameters_results']['Binary_columnspresent_valresult']
- Binary_columnspresent_valresult activates when a Binary specification includes a column header not found in the input or returned sets
- added a mitigation to leakage_dict specification for cases where it was specified with a key not found in the set
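
For a specification written against the earlier addresses, the ML_cmnd rename described above could be applied with a small helper along these lines. `migrate_ML_cmnd` is hypothetical and not part of Automunge, and the nested parameter entries are placeholders.

```python
import copy

# Old address -> new address under ML_cmnd['MLinfill_cmnd'], per the 7.14 notes
RENAMED_KEYS = {
    "customClassifier": "customML_Classifier",
    "customRegressor": "customML_Regressor",
}


def migrate_ML_cmnd(ML_cmnd):
    """Return a copy of ML_cmnd with the pre-7.14 customML keys renamed."""
    migrated = copy.deepcopy(ML_cmnd)
    infill_cmnd = migrated.get("MLinfill_cmnd", {})
    for old_key, new_key in RENAMED_KEYS.items():
        # Only rename when the new key is absent, so an explicit new entry wins
        if old_key in infill_cmnd and new_key not in infill_cmnd:
            infill_cmnd[new_key] = infill_cmnd.pop(old_key)
    return migrated


# Placeholder parameters for illustration
old_spec = {"MLinfill_cmnd": {"customClassifier": {"example_param": 1}}}
new_spec = migrate_ML_cmnd(old_spec)
print(new_spec)
# prints: {'MLinfill_cmnd': {'customML_Classifier': {'example_param': 1}}}
```

Working on a deep copy leaves the caller's original dictionary untouched.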

7.13

- new option for ML infill, user can now define and integrate custom functions for model training and inference
- documented in read me in section Custom ML Infill Functions (final section before conclusion)
- trying to clean up postprocess_dict a little bit
- new convention is postprocess_dict['autoMLer'] only returns entries that will be inspected in postmunge
- meaning if custom ML infill functions are passed they only need to be reinitialized prior to uploading postprocess_dict when they were applied
- postprocess_dict['orig_noinplace'] recast from a list to a set which should slightly benefit postmunge latency
- found and mitigated a remote Binary edge case associated with improper specification
- replaced an operation to reset column headers from use of numpy conversion to a pandas method in a few of the autoML training functions
- new entries returned in postprocess_dict['columntype_report'] as postprocess_dict['columntype_report']['all_categoric'] and postprocess_dict['columntype_report']['all_numeric']
- these are list aggregations of all returned numeric features and all returned categoric features
- (columntype_report already included more granular detail such as specific feature types and groupings)
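
The orig_noinplace recast from a list to a set noted above relies on set membership being a hash lookup, while a list is scanned element by element. A minimal illustration, using made-up column headers rather than actual postprocess_dict contents:

```python
import timeit

# Made-up column headers standing in for the entries of orig_noinplace
headers = [f"column_{i}" for i in range(10_000)]
as_list = list(headers)
as_set = set(headers)

target = "column_9999"  # worst case for the list: the last element

# Both containers give the same membership answer
assert (target in as_list) == (target in as_set)

# Time repeated membership checks against each container
list_seconds = timeit.timeit(lambda: target in as_list, number=1_000)
set_seconds = timeit.timeit(lambda: target in as_set, number=1_000)
print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```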

7.12

- 7.11 introduced a bug for downloading and uploading postprocess_dict's through pickle; we didn't catch it in testing because we skipped the backward compatibility check, since 7.11 was itself a backward compatibility breaking update
- resolved by recasting functions directly stored in postprocess_dict as public functions (by removing a leading underscore from function name)
- which includes transformation functions and wrappers for training and inference
- fully resolved
- also, put some thought into privacy preservation associated with inversion operation
- inversion passed as list or set specification now halts when applied in conjunction with privacy_encode
- when privacy_encode is activated, inversion can be passed to postmunge(.) as one of {'test', 'labels', 'denselabels'}
- new convention is that the dataframe returned from inversion only includes recovered features
- also in process improved inversion so that the order of recovered features in the returned dataframe matches order of features as passed to automunge(.) through df_train
- found and fixed bug for inversion passed as list or set in privacy_encode scenario
- inversion denselabels option now preserves transformation privacy in returned headers
- now with privacy_encode inversion the returned inversion_info_dict masks the recovery path, replacing with boolean True

7.11

- major backward compatibility impacting update
- meaning postprocess_dict's populated in prior versions will require re-fitting to the train set using this or a later version or running an earlier version for postmunge(.) to prepare additional data
- this update was to align with the intent that all operations are to be channeled through the interface of two master functions: automunge(.) and postmunge(.)
- all internal support functions other than automunge(.) and postmunge(.) are now private functions, not accessible outside of the class
- took this backward compatibility impact as an opportunity to clean up all postmunge and postprocess operations that had dual configurations to accommodate backward compatibility
- also, an audit of the insertinfill function identified opportunity for a more efficient application
- the previous, somewhat hacky pandas replace application is now replaced with an operation built on top of loc
- we expect this will benefit latency of this function which is used throughout ML infill and other assigninfill options
- struck some unused variables initialized in inversion
- new convention: returned postprocess_dict omits entries for transformdict and processdict which were copies of user passed parameters
- new convention: returned postprocess_dict entries for transform_dict and process_dict only record transformation categories that were inspected as part of the automunge(.) call
- the thought was that this will benefit privacy in scenario where user has developed their own library of custom transformations, such that if they want to publish a populated postprocess_dict publicly, it will only reveal those portions of their library that were applied in derivation
- as further clarification on last update
- the concurrent ML infill process flaw was associated with the categorylist passed to inference
- although it did not show up in testing since categorylist isn't inspected for the default random forest implementation
- it is inspected for other learning libraries in inference
- was resolved by reframing categorylist passed to inference in concurrent scenario
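
The convention noted earlier in this release, that the returned transform_dict and process_dict only record transformation categories inspected during the automunge(.) call, amounts to filtering a library of definitions down to the categories actually used. A minimal sketch, where `filter_to_inspected`, the category names, and the dictionary contents are all hypothetical placeholders rather than Automunge internals:

```python
def filter_to_inspected(library, inspected_categories):
    """Keep only the entries of library whose key was inspected."""
    return {
        category: definition
        for category, definition in library.items()
        if category in inspected_categories
    }


# Placeholder library of custom transformation definitions
user_process_dict = {
    "custom_scale": {"note": "placeholder definition"},
    "custom_bins": {"note": "placeholder definition"},
    "custom_text": {"note": "placeholder definition"},
}

# Categories assumed to have been applied in this particular run
inspected = {"custom_scale"}

published = filter_to_inspected(user_process_dict, inspected)
print(sorted(published))  # prints: ['custom_scale']
```

Only the applied portion of the library appears in the published result, which is the privacy benefit the notes describe.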
