Automunge

Latest version: v8.33


8.15

- new assignparam option support for 1010 binarization: max_zero
- max_zero ensures that the unique entries with the highest frequency in the train set have the highest prevalence of zero entries in their encoding set
- the thought was that there might be cases where categoric entries are fed to a quantum computer, in which case a binarization with more zeros would have less occurrence of relaxation noise, since 0 is a low energy state
- this is kind of like how typewriter designers took letter occurrence frequency into account
- don't know yet if this will have a significant benefit, and it is primarily for use at initialization, but still, every little bit helps
- new root category '10mz', comparable to '1010' but with default max_zero = True (see the usage sketch below this list)
- max_zero was partly inspired by reading through "Quantum Computing with Python and IBM Quantum Experience" by Robert Loredo
- also a cleanup to data structures returned from running automodel with customML
- now when running automodel with customML the user doesn't have to reinitialize the training functions for postmunge or autoinference
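- as a quick usage sketch, max_zero can be reached either by assignparam specification to a 1010 application or by assigning the new 10mz root category (column names and data here are hypothetical, and this assumes the documented import convention and ten-value automunge(.) return signature)


import pandas as pd
from Automunge import *
am = AutoMunge()

df_train = pd.DataFrame({'column1' : ['a', 'b', 'c', 'a'],
                         'column2' : ['d', 'e', 'f', 'd']})

train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
  am.automunge(df_train,
               assigncat = {'10mz' : ['column1']},
               assignparam = {'1010' : {'column2' : {'max_zero' : True}}})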

8.14

- replaced a few instances throughout of initializing a new column set to a constant with a .loc method, to reduce fragmentation warning printouts (pandas can be annoying sometimes)
- added encrypt_key support to automodel(.) and autoinference(.)
- new tutorial 14 added to demonstrate encoding data sets exceeding local memory capacity by applying postmunge(.) in chunks (see the sketch below this list)
- added a still experimental integration of autoinference(.) directly into a postmunge(.) call: if a postprocess_dict has a final model populated from automodel(.) and the df_test set passed to postmunge(.) does not include a labels column, the autoinference function is automatically called on the test data to generate inference output, which is returned in place of the labels set test_labels
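- for instance, a minimal sketch of the chunked approach (hypothetical file name and chunk size, assuming the documented four-value postmunge(.) return signature, with postprocess_dict as returned from a prior automunge(.) call and am as an initialized AutoMunge class)


import pandas as pd

encoded_chunks = []
for chunk in pd.read_csv('df_test.csv', chunksize = 100000):
    test, test_ID, test_labels, postreports_dict = \
      am.postmunge(postprocess_dict, chunk)
    encoded_chunks.append(test)

df_test_encoded = pd.concat(encoded_chunks)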

8.13

- replaced remaining instances of randomseed sampling with a max of (2 ** 32 - 1) with the recently corrected range of (2 ** 31 - 1). (Apparently these had been missed when the correction was rolled out since they were represented by the integer instead of the formula.)
- integrated a few experimental functions associated with final model training and inference. Don't consider these fully audited or polished at this point; they are for experiments and a seed for iterations. Since they are accessed outside of the traditional interface of automunge(.) or postmunge(.), we assume users won't mind. They are well suited for training a final model in conjunction with our optuna_XG1 hyperparameter tuner, for example, using the same ML_cmnd API to select tuning options. Note that this option can apply a different model architecture or tuning options than those used for ML infill.
- automodel(.) accepts a training set and a postprocess_dict as returned from automunge(.) to automatically train a model, which is saved in the postprocess_dict
- autoinference(.) accepts a test set prepared in automunge(.) or postmunge(.) and a postprocess_dict which has been populated by automodel(.), and returns the results of inference (see the workflow sketch below this list)
- an encrypted postprocess_dict (by the encrypt_key parameter) is not yet supported
- to reiterate, we don't consider these new functions fully audited or polished just yet, but we have validated that they are functional in limited testing
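- here's a rough sketch of the intended workflow based on the descriptions above, where the argument lists and the automodel(.) return convention are assumptions rather than verified signatures (train, labels, and test as returned from automunge(.) or postmunge(.))


# train a final model, which is saved in the postprocess_dict
postprocess_dict = am.automodel(train, labels, postprocess_dict)

# run inference on a test set prepared by automunge(.) or postmunge(.)
inference = am.autoinference(test, postprocess_dict)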

8.12

- found an edge case scenario that was interfering with the automunge(.) pandasoutput='dataframe' case for validation labels, resulting in single column validation labels returned as a series instead of a dataframe, which did not align with the documentation
- now resolved
- a tweak to code comments here and there for clarity

8.11

- added support for training the feature importance model with xgboost in both automunge(.) and postmunge(.), and with customML in automunge(.)
- (needed to align the conversion applied for fully represented range integer labels)
- customML doesn't have access to training functions in postmunge so defaults to random forest
- fixed the inversion process for the messy data convention; now all of the input columns from an assigncat set specification are retained when recovered
- new root categories cnsl/cns2/cns3
- cnsl stands for consolidate, and is used to aggregate column sets in the messy data convention that are specified by assigncat set bracket specification, returning a single categoric encoding: cnsl returns a 1010, cns2 returns a onht, and cns3 returns an ord3 (see the sketch after this list)
- this is similar functionality to what was previously available through Binary dimensionality reduction; now, since it is incorporated into the family tree, it can be applied directly to input columns instead of requiring a preceding encoding
- another benefit is that now missing data infill is supported for consolidated sets
- where missing marker aggregation is based on the first column of the messy set, alphabetically
- small tweak to GPS5/GPS6 family trees to differentiate a default parameter setting from a different one applied by the new cnsl trees
- flipped the order of an if statement from 8.10, which should make it run a tiny bit faster
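- a quick sketch of the cnsl set bracket specification, consolidating three hypothetical messy-convention input columns into a single 1010 encoding (same assumed return signature as the earlier sketch)


train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
  am.automunge(df_train,
               assigncat = {'cnsl' : [{'column1', 'column2', 'column3'}]})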

8.10

- added support for injecting noise into one-hot encoded sets included in a dataframe
- lifted the requirement that input features need to meet the "tidy data" convention
- new convention is that input features can encompass multiple columns, although retaining requirement of one row per sample
- this was added for use with DPmc for categoric noise injection to pass-through sets with weighted activation flips
- new root category DPmp for passthrough, excluded from ML infill and NArw aggregation, although note that since DPmc requires all valid entries, a default adjinfill is performed for DPmp
- available by assigncat specification by replacing a string header entry with a set of string header entries (or, for a list of string header entries, by replacing one of the strings in the list with a set of aggregated string header entries)
- for example, both of these would be valid specifications:


assigncat = {'DPmp' : {'column1', 'column2'}}
assigncat = {'DPmp' : [{'column1', 'column2'}, 'column3']}


- where in the second case DPmp would be applied first to the set {'column1', 'column2'} and separately to the single input column 'column3'
- this might be useful for injecting noise to one-hot encoded columns (see the sketch at the end of this section)
- for now DPmp is the transform that makes the most sense for this capability (weighted categoric activation set flips), along with DPse for swap noise
- possibly further transformation support pending, need to think about it
- also found and fixed a small implementation snafu interfering with protected_feature support for the DPmc transform, which is in the family trees for DP10 and DPoh
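- and a sketch of the one-hot use case noted above, injecting noise to a one-hot encoded set already present in df_train (headers are hypothetical, same assumed return signature as the earlier sketches)


train, train_ID, labels, \
val, val_ID, val_labels, \
test, test_ID, test_labels, \
postprocess_dict = \
  am.automunge(df_train,
               assigncat = {'DPmp' : [{'onehot_a', 'onehot_b', 'onehot_c'}]})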
