Automunge

7.28

- identified a subtle but impactful bug with customML
- turned out to be really simple, had just passed parameters in the wrong order to the training operation
- but because inference has a bypass for ValueError, it did not trigger a halt, so it wasn't spotted until now
- now resolved
- updated catboost to receive single column label sets as a pandas Series instead of a DataFrame (to be consistent with the other autoML_types)
- updated customML to receive single column label sets as a pandas Series instead of a DataFrame
- note that this convention of single column label sets as a pandas Series is consistent with the form of label sets returned from automunge(.); other instances of single column sets are preserved as DataFrames (i.e. for features and ID sets)
- which is similar to the convention of flattening single column numpy arrays with ravel (see the sketch below)
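- as a standalone sketch of the received form (the function name, signature, and data below are illustrative placeholders, not the documented customML template):

    import pandas as pd

    # illustrative only: a single column label set arrives as a pandas Series,
    # while single column feature sets remain DataFrames
    def customML_train_classifier(train, labels, parameters):
        assert isinstance(train, pd.DataFrame)
        assert isinstance(labels, pd.Series)
        # ...fit and return a model object here
        return None

    customML_train_classifier(
        pd.DataFrame({'feature': [1.0, 2.0, 3.0]}),
        pd.Series([0, 1, 1], name='labels'),
        {},
    )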

7.27

- reverted conditional imports for ML infill and encryption
- it appears this functionality did not work as expected for cases of redundantly called functions
- (it would successfully import on the first call, but then on subsequent application the module was present in sys.modules even though the function didn't have access to it, resulting in no import)
- appears to be another case of insufficient validation prior to rollout
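- for illustration, this kind of pitfall can be reproduced in a standalone sketch (the fractions module below is just a stand-in, not one of the library's dependencies):

    import sys

    sys.modules.pop('fractions', None)  # ensure a clean slate for the demonstration

    # guarding an import on sys.modules leaves the function's local name unbound
    # on calls where the import statement is skipped
    def train_with_conditional_import():
        if 'fractions' not in sys.modules:
            import fractions  # binds the local name only when this line runs
        return fractions.Fraction(1, 2)

    train_with_conditional_import()      # first call: module imported, name bound
    try:
        train_with_conditional_import()  # module now in sys.modules, import skipped
    except NameError as error:           # UnboundLocalError is a subclass of NameError
        print('conditional import pitfall:', error)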

7.26

- a few code comment cleanups to evalcategory associated with the application of bnry to numeric sets with 2 unique values, added a clarification that this also applies to numeric labels
- updated the convention for temporary columns initialized as part of transformations
- now non-returned temporary columns are named with integer column headers, which is an improvement as it eliminates a suffix overlap channel since other columns will be strings
- relevant to transforms tmsc, pwrs, pwor, DPod
- added a code comment to the custom_train_template in the read me as:
we recommend naming non-returned temporary columns with integer headers since other headers will be strings
- converted the support function _df_split to a private function (I think I may have been using it in an experiment, which was why it was not previously private; not positive)
- made imports conditional for encryption support functions
- added printouts to final suffix overlap results associated with cases where PCA, Binary, or index columns were adjusted to avoid overlap (not an error channel, just an FYI)
- new PCA option to retain columns in returned data that served as basis
- similar to the Binary retain options
- can be activated by passing ML_cmnd['PCA_retain'] = True (see the PCA retain sketch following this version's notes)
- an update to the support function associated with ML infill data type conversions to eliminate editing the dataframe serving as input to __convert_onehot_to_singlecolumn
- in 7.24 had added a note to the customML writeup in the read me for the customML_train_classifier template: "label entries will be non-negative str(int) with possible exception for the string '-1'"
- this convention is now updated to: label entries will be non-negative str(int)
- I think this also fixes a snafu in our flaml autoML_type implementation, since it was using integer labels for classification instead of strings
- updated automunge and postmunge WorkflowBlock addresses to be based on a unique string (previously there were some redundant string addresses between automunge and postmunge)
- realized I wasn't applying ravel flattening to test_labels returned from automunge in the pandasoutput = False scenario, which was an oversight
- used as an opportunity to rethink aspects of single column conventions for other sets
- we already had in place that all single column pandas sets are converted to series
- but only single column numpy arrays were flattened
- which was not well aligned
- decided that, for the benefit of applying common pandas operations to features and ID sets independent of the single or multi column case, it made sense to limit series conversions to label sets
- the now aligned convention is: with pandas output only single column label sets are converted to series, and with numpy output only single column label sets are flattened with ravel (see the single column sketch following this version's notes)
- updated the empty set scenario for numpy output in postmunge for ID and label sets to be returned as numpy arrays consistent with the form returned from automunge (previously they were returned as empty lists)
- found a simplification opportunity for the support function __populate_columnkey_dict associated with populating a data structure
- updated convention for normalization_dict entries found in postprocess_dict['column_dict']
- previously we redundantly saved this dictionary for each column returned from a transform
- since we only inspect it for one column for use in postmunge, decided to eliminate the redundancy to reduce memory overhead
- (as in some cases, as with high cardinality categoric sets, these may have material size)
- now only saved for the column that is the first entry in categorylist
- updated postmunge driftreport to only report for first column in a categorylist
- (this was for compatibility with the new normalization_dict convention but had the side effect of making for cleaner reporting by eliminating redundancy in output)
- updated a variable naming in infill application support function for clarity (from "boolcolumn" to "incompatible_MLinfilltype")
- formalized convention that any previously reported drift stats only reported in a single normalization_dict out of a multi column set are now aggregated to a single reported form
- ML_cmnd returned in postprocess_dict now records the version number of application for use in postmunge
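- PCA retain sketch: activating the new retain option noted above, assuming the import and calling conventions from the read me (other automunge(.) parameters and the full returned tuple are omitted for brevity):

    from Automunge import *
    am = AutoMunge()

    # retain in the returned data the columns that served as the PCA basis
    ML_cmnd = {'PCA_retain': True}

    # applied as e.g.:
    # ... = am.automunge(df_train, ML_cmnd=ML_cmnd)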
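- single column sketch: a standalone illustration of the aligned single column convention noted above (pandas operations only, not the library internals):

    import pandas as pd

    labels_df = pd.DataFrame({'labels': [0, 1, 1]})

    # pandas output: single column label sets are returned as a Series
    labels_pandas = labels_df.squeeze(axis=1)

    # numpy output: single column label sets are flattened with ravel
    labels_numpy = labels_df.to_numpy().ravel()

    assert isinstance(labels_pandas, pd.Series)
    assert labels_numpy.shape == (3,)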

7.25

- today's theme was all about improved clarity of code
- we performed a full codebase walkthrough for purposes of aggregating navigation support
- we've introduced a convention that each set of function definitions is grouped by theme
- which in most cases was already in place; in a handful of cases we moved a few functions around to better align with this grouping coherence
- we've introduced a kind of table of contents at the start of the AutoMunge class definition
- listing what we're referring to as "FunctionBlock" entries
- which are basically titles for the themes of sets of function definitions, with each table entry including the list of associated functions
- code base navigation can now more easily be performed by using these FunctionBlock titles as a key for a control F search
- similarly, we performed a more detailed walkthrough of the two master functions for interface: automunge(.) and postmunge(.)
- and for each introduced a kind of table of contents with what we're referring to as "WorkflowBlock" entries
- the WorkflowBlocks are for cataloging segments of the lines of code by key themes
- which also can be navigated by a control F search
- and their entries in the code include code comments with a high level summary of key operations for the block
- we expect this update will significantly benefit code navigation
- which, since we group everything in a single file, was probably long overdue

7.24

- alternate autoML libraries are now given a conditional import instead of applying the import by default
- since imports are conducted internal to their support function, the import is reinitialized with each ML infill model training etc. By conducting a conditional import, the user now has the option to perform associated imports external to automunge(.) or postmunge(.) (instead of an automated import with each function call), which in some cases will benefit latency. Or when the external import is omitted, the support functions conduct internal imports as prior.
- update to evalcategory, now majority numeric data with nunique == 2 defaults to the bnry root category instead of nmbr (useful in the scenario where numeric labels may be a classification target, which is not uncommon)
- added note to the customML writeup in read me for customML_train_classifier template: "label entries will be non-negative str(int) with possible exception for the string '-1'"
- realized the last update's claim of "only need to reinitialize the inference functions" was not sufficiently validated, had missed that customML functions are also saved in the ML_cmnd postprocess_dict entry, now resolved
- update to customML convention, now the library has a suite of internally defined inference functions for a range of libraries, including {'tensorflow', 'xgboost', 'catboost', 'flaml', 'autogluon', 'randomforest'}
- this was inspired by the realization that we only needed to reinitialize customML inference functions in a new notebook prior to pickle upload of postprocess_dict, and since most libraries' inference operations are fairly simple and common, by giving the option of an internally defined inference function the user can now apply customML and share the postprocess_dict publicly without a need to distribute the custom inference function in parallel
- user defined customML inference functions are specified through ML_cmnd['customML']['customML_Classifier_predict'] and ML_cmnd['customML']['customML_Regressor_predict']
- user can now alternatively populate these entries as a string of one of {'tensorflow', 'xgboost', 'catboost', 'flaml', 'autogluon', 'randomforest'} to apply the default inference function associated with that library
- Please note we do not yet consider these default inference functions fully audited - pending further validations. As implemented is intended as a proof of concept.
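- for example, the inference entries can be populated either way (a sketch assuming the ML_cmnd keys quoted above; the predict signature shown is illustrative, not the documented template):

    # option 1: a user defined inference function
    def my_classifier_predict(model, features):
        return model.predict(features)

    ML_cmnd = {'customML': {
        'customML_Classifier_predict': my_classifier_predict,
        # option 2: a string selecting one of the internally defined defaults
        'customML_Regressor_predict': 'catboost',
    }}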

7.23

- added upstream primitive entry support for mlti
- corrected empty set scenario inplace accommodation for mlti postprocess
- corrected mlti norm_category inplace_option inspection to align with convention that unspecified inplace_option in process_dict is interpreted as True
- found and fixed an edge case for mlti transform associated with norm_category without inplace support
- new noise injection transforms for intended use as downstream tree categories
mlhs: for categoric noise injection targeting multicolumn sets with concurrent MLinfilltype (e.g. for use downstream of concurrent_act or concurrent_ordl)
DPmc: for categoric noise injection targeting multicolumn sets (e.g. for use downstream of multirt and 1010)
- note that DPmc differs from DPod in that it doesn't require an ordinal encoded feature as input
- new root categories for noise injection to hashing transforms
DPhs: hash with downstream noise injection (with support for multicolumn hashing with word extraction)
DPh2: hsh2 with downstream noise injection (single column case comparable to hsh2)
DPh1: hs10 with downstream noise injection
- update to conventions for powertransform scenarios of 'DP1' and 'DP2', replacing hash and hsh2 scenarios with DPhs and DPh2 respectively
- new DPod parameter upstream_hsh2 for use when DPod is applied downstream of a hashing transform (see the sketch following this version's notes)
- new mlti dtype parameter scenario mlhs for use when mlti applied downstream of a multicolumn hashing
- update to customML convention, now only the inference functions are stored in postprocess_dict for access in postmunge
- benefit of this convention is that if a user downloads postprocess_dict with pickle and wants to upload it in a new notebook, now they only need to reinitialize the inference functions
- which especially makes sense for QML
- which benefits privacy for special training configurations when postprocess_dict is shared publicly
- the only tradeoff is losing access to customML for training postmunge feature importance, the compromise is we apply the default autoML_type instead
- small tweak to stochastic_impute_categoric for a better temp support column convention as integers to ensure no overlap with otherwise string column headers
- updated convention for DPbn root category, now noise is injected with DPod function instead of DPbn function
- previously they would have been equivalent, but now that DPod supports weighted samples it is a better resource
- in other words, DPbn now supports weighted sampling for noise injection
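- a sketch of applying the new root categories and the new DPod parameter, assuming the assigncat / assignparam conventions from the read me (column names are placeholders; other automunge(.) parameters and returned sets omitted):

    assigncat = {
        'DPhs': ['text_column'],      # hash with downstream noise injection
        'DPh2': ['category_column'],  # hsh2 with downstream noise injection
    }

    assignparam = {
        'DPod': {'ordinal_column': {'upstream_hsh2': True}},  # DPod downstream of a hashing transform
    }

    # applied as e.g.:
    # ... = am.automunge(df_train, assigncat=assigncat, assignparam=assignparam)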
