Automunge

Latest version: v8.33


6.77

- added parameter support to ordl/ord3
- now accepting parameter 'null_activation', boolean defaults to True
- when False, missing data does not receive a distinct activation
- the purpose of this parameter is to support encoding categoric sets where a missing data representation is not desired, as may be the case when an ordinal encoding is applied to a label set for instance
- and when missing data is found, it is grouped into the 0 bucket of the encoding
- which in the base configuration (when frequency_sort = True) will be the encoding associated with the most frequent entry found in the train set
- or when frequency_sort = False, will fall on the first entry in alphabetic sorting
- note that when ordered_overide parameter is in default True state, if a column is received as pandas ordered category dtype, the first entry will be per that sorting
- note that when receiving data as pandas category dtype, we have a known range of potential values, and can thus be assured that missing data won't be present at least in the train data
- (although subsequent test data may not be consistent pandas category dtype, since this parameter is for label sets shouldn't be an issue)
- null_activation parameter now reset as False for root category lbor, which is our default categoric encoding to label sets under automation
- lbor also changed sorting basis from alphabetic (like ordl) to frequency (like ord3)
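As an illustrative sketch (not the internal Automunge implementation), a frequency-sorted ordinal encoding with the null_activation behavior described above might look like the following, where the function name and signature are assumptions for demonstration:

```python
import numpy as np
import pandas as pd

def ordinal_encode(train, frequency_sort=True, null_activation=True):
    # sketch only: derive an integer encoding from the train set,
    # ordered by entry frequency (or alphabetically when
    # frequency_sort=False)
    entries = train.dropna()
    if frequency_sort:
        ordering = list(entries.value_counts().index)
    else:
        ordering = sorted(entries.unique())
    # when null_activation, reserve integer 0 as a distinct missing
    # data activation; otherwise missing data shares the 0 bucket
    offset = 1 if null_activation else 0
    encoding = {entry: i + offset for i, entry in enumerate(ordering)}
    encoded = train.map(encoding).fillna(0).astype(int)
    return encoded, encoding

s = pd.Series(['b', 'a', 'b', np.nan, 'c'])
encoded, encoding = ordinal_encode(s, null_activation=False)
# 'b' is the most frequent entry, so it lands in bucket 0,
# which the NaN entry shares when null_activation=False
```

With null_activation=True the NaN entry would instead keep the distinct 0 activation and 'b' would shift to 1.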

6.76

- significant backward compatibility impacting update
- after this update we intend to be much more intentional about future impacts to backward compatibility
- we have rolled out a new validation test which we are confident will help us identify future backward compatibility impacts prior to rollout
- struck prior dual/single/post convention implementations of suite of categoric encodings
- including text/onht/ordl/ord3/1010
- in order to reduce code bloat
- this impacts backward compatibility for most categoric encodings, as well as Binary option
- also backward compatibility impacts to power of ten binning via pwrs and pwor
- tweaked bnry transform resulting in lifted reserved string
- tweaked ucct transform resulting in lifted reserved string
- tweaked strn transform resulting in lifted reserved string
- rewrote Binary dimensionality reduction to be based on custom_train versions of 1010 or ordl
- replaced all use of support function _postprocess_textsupport
- now consolidated to much cleaner support function _onehot_support
- struck support function _check_for_zzzinfill
- a big cleanup to the pwrs transform for power of ten binning
- added pwrs parameter support 'zeroset' (boolean defaults as False) to include an activation column for received zero entries, returned with the suffix column_suffix_zero
- similar cleanup to pwor for ordinal power of ten binning, also added zeroset parameter support
- these updates carry through to other categories built on top of the pwrs or pwor transformation functions
- this update has eliminated all cases of reserved string for entries of an input column to categoric encodings
- referring to our use of the 'zzzinfill' string as a placeholder for infill in categoric encodings
- we still use the string 'zzzinfill' as a placeholder in a few places for string parsing functions, intent is to audit these instances for possible future update
- this update results in a 5% macro reduction in lines of code
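For reference, power of ten binning groups numeric entries by order of magnitude; a minimal sketch of the ordinal variant, with a zeroset-style option (semantics assumed from the notes above, not taken from the Automunge source) giving zero entries their own bucket, could look like:

```python
import numpy as np
import pandas as pd

def power_of_ten_bins(series, zeroset=False):
    # sketch only: bucket positive entries by floor(log10(x)),
    # e.g. 5 -> '0', 50 -> '1', 500 -> '2'
    bins = pd.Series('zero', index=series.index, dtype='object')
    positive = series > 0
    bins.loc[positive] = (
        np.floor(np.log10(series.loc[positive])).astype(int).astype(str))
    if not zeroset:
        # without a distinct zero bucket, fold zeros into the 10^0 bin
        bins.loc[~positive] = '0'
    return bins

s = pd.Series([0.0, 5.0, 50.0, 500.0])
binned = power_of_ten_bins(s, zeroset=True)
# -> ['zero', '0', '1', '2']
```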

6.75

- consolidated to a single NaN representation making use of np.nan to be consistent with other data types
- (we had used float("NaN") in a few places, turns out there are some scenarios where the two forms aren't equivalent)
- now all received NaN values in df_train and df_test are consolidated to the form of np.nan
- streamlined the suffix convention for custom_train transforms: the optional suffix parameters are no longer accepted, and the suffix is always cast as the tree category (if you need a different suffix you can define a new processdict entry)
- thus transforms ported to custom_train in recent updates (1010, ordl, and now onht) no longer have suffix parameter support
- added a clarification to custom_train_template in read me that "(automunge(.) may externally consider normalization_dict keys of 'inplace' or 'newcolumns_list')"
- consolidated one-hot encoding transforms text and onht into a single transform now distinguishable by parameter
- using same function just rolled out for onht in custom_train convention as the basis
- by adding to the onht transform a suffix_convention parameter as one of {'text', 'onht'}, defaulting to 'text', which distinguishes between the suffix conventions
- note that when suffix_convention cast as text, str_convert is reset to True and null_activation is reset to False
- resulted in a few improvements to text, now with support for the ordered_override parameter and a lifted reserved string
- new parameter accepted to one hot encodings onht and text as frequency_sort, boolean defaults to True
- when True the order of returned columns is sorted by the frequency of entries as found in the train set, False is for alphabetic sorting
- note that when ordered_override is activated if received as a pandas ordered categoric set that order takes precedence over frequency_sort
- also added support for ordered encodings in conjunction with the activation parameters (all_activations / add_activations / less_activations / consolidated_activations)
- similar updates to ordinal encodings
- now our two primary variants (ordl, ord3) are consolidated into a single function differentiated by parameter frequency_sort
- where frequency_sort is boolean defaulting to True indicating that the order of integer encodings will be based on the entries sorted by frequency as found in the training set (consistent with ord3)
- and frequency_sort=False is integer encodings sorted by an alphabetic sort of entries, which is default for ordl, lbor, lbos
- (noting that lbor is our default categoric label set encoding under automation)
- and other similar conventions updated as just discussed for onht/text, including support for ordered encodings in conjunction with the activation parameters
- rewrite of the label smoothing transform smth
- split the operation into separate transformation categories, now smth is applied downstream of a separate one-hot encoding
- variants available with root categories smth/smt0/fsmh/fsm0/lbsm/lbfs
- please consider this implementation a demonstration of the long existing functionality for specifying family trees with transforms applied downstream of transforms that return multicolumn sets
- this new configuration enables full parameter support for the upstream one-hot encoding consistent with onht
- and also means label smoothing and fitted label smoothing can be applied downstream of any multirt ML infilltype transform by specifying a family tree
- this update impacts backward compatibility for label smoothing transforms
- please note we intend for our next update to have several additional impacts to backward compatibility, as we intend to consolidate our full range of categoric encodings to the custom_train convention in order to mitigate some recent bloat in lines of code from porting transforms to custom_train
- and after this next update we intend to be much more intentional about future impacts to backward compatibility
- we have rolled out a new validation test which we are confident will help us identify future backward compatibility impacts prior to rollout
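To illustrate the label smoothing operation itself, sketched here with a standard convention independent of the Automunge internals: applied downstream of a one-hot encoding, the active entry is reduced to 1 - alpha and the remainder spread across the other K - 1 classes:

```python
import numpy as np

def label_smooth(onehot, alpha=0.1):
    # standard label smoothing over a one-hot (n_samples, K) array:
    # active entries become 1 - alpha, inactive alpha / (K - 1),
    # so each row still sums to 1
    K = onehot.shape[1]
    return onehot * (1 - alpha) + (1 - onehot) * (alpha / (K - 1))

y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
smoothed = label_smooth(y, alpha=0.1)
# first row -> [0.9, 0.05, 0.05]
```

A fitted variant would replace the uniform alpha / (K - 1) allocation with weights fit to the train set distribution.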

6.74

- rewrite of onht transform for one hot encoding, ported to the custom_train convention
- similar simplifications as with 6.73, should benefit latency
- added ordered_overide parameter support, similar to use with ordl, for setting order of activation columns when a column is received as pandas ordered categoric, accepts boolean to deactivate
- onht missing data is by default returned with no activation; new convention is that when the null_activation parameter is activated, missing data is now encoded in the final column of the onht set (as opposed to based on alphabetic sorting)
- struck some troubleshooting printouts that had been missed with the last update
- rewrote the evaluations performed for powertransform == 'infill', including fixed a few derivation edge cases and consolidated to pandas operations
- identified an edge case for np.where, which we were using extensively
- edge case was associated with dtype drift
- for example, if a target column starts with a uniform dtype (including NaN, int, float), when np.where inserts a string it converts all other dtypes to str, although it does not do so when the target column starts with mixed dtypes
- so created an alternative to np.where as "autowhere", which is built on top of pandas.loc
- it is kind of a compromise between np.where and pd.where with a few distinctions versus each
- global replacement of np.where throughout codebase with autowhere
- a small update to the onehot support function in how we access index
- found and fixed a bug with None entry conversion to NaN
- decided to make stochastic_impute_categoric and stochastic_impute_numeric on by default
- this is a nontrivial update, the justification is comments from conference review on potential for deterministic imputations to contribute to bias
- hope to perform some form of validation down the road, will take some creativity for experiment design
- stochastic impute for ML infill can be deactivated in ML_cmnd if desired, i.e.
ML_cmnd = {'stochastic_impute_numeric': False,
           'stochastic_impute_categoric': False}
- found and fixed an edge case for postmunge drift report associated with excl suffix convention
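The dtype drift edge case can be reproduced directly, and a pandas .loc based replacement in the spirit of autowhere (the signature below is assumed for illustration, not the Automunge source) avoids it:

```python
import numpy as np
import pandas as pd

# np.where unifies the result dtype, so inserting a string into a
# uniform float array converts every other entry to a string
arr = np.array([1.0, 2.0, np.nan])
drifted = np.where(np.isnan(arr), 'missing', arr)
# drifted now has a string dtype: 1.0 became the string '1.0'

def autowhere(condition, series, value):
    # pandas .loc based alternative: only the targeted entries are
    # replaced, the rest keep their original types (held as object)
    result = series.astype('object').copy()
    result.loc[condition] = value
    return result

s = pd.Series([1.0, 2.0, np.nan])
replaced = autowhere(s.isna(), s, 'missing')
# replaced keeps 1.0 and 2.0 as floats alongside the inserted string
```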

6.73

- some clarifications provided to README valpercent parameter writeup associated with use of automunge(.) in context of a cross-validation operation
- a few small code comment cleanups and a small tweak associated with activation parameters to _custom_train_1010
- ported the ordl transform for ordinal encoding to the custom_train convention
- similar to 1010, we believe this will benefit latency
- new form has consistent functionality and parameter support
- lifted 'zzzinfill' reserved string requirement
- the primary deviation from a user standpoint is that missing data prior to ML infill is now by default encoded as integer 0
- also trimmed a few stored entries carrying redundant information in the normalization_dict saved in postprocess_dict['column_dict'] to reduce memory overhead, as there are scenarios where ordl sets may have a large number of unique entries
- parallel small update to the ordered_overide parameter convention for ordinal encodings ordl and ord3 as implemented in dualprocess convention, now ordered treatment is based only on train set instead of both train and test (consistent with the custom_train version)
- found an assigninfill scenario where we were piggybacking off of the normalization_dict populated in transformation functions to store 'infillvalue' as a derived infill value (associated with mean, median, mode, and lc infill from assigninfill)
- realized this was in effect resulting in a reserved string for normalization_dict populated in user defined transformation functions
- so as a simple solution, we moved the infill value to the first tier of the column_dict associated with the column instead of populating it in the normalization_dict, and renamed it to 'assigninfill_infillvalue'
- which in the process also resolved an issue I think originating from saved normalization_dict's across columns in a categorylist sharing the same address in memory, such that saving infill to one column's normalization_dict was overwriting the entry in other columns from the categorylist
- small tweak to Binary dimensionality reduction, now when aggregating activations from boolean integer columns, the activations are recast as integers, which addresses an edge case when negzeroinfill is applied with assigninfill to a boolean integer column resulting in dtype drift to float
- as a note, the various pending portings of transformation functions to the custom_train convention will result in some bloat to lines of code; the intent is sometime (not too far down the road) to consolidate any redundancies to just the custom_train form, which will impact backwards compatibility, so we are saving this step for once we have a bulk of consolidations ready to roll out in one fell swoop, like ripping a bandaid off
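The normalization_dict overwrite issue traces to a common Python aliasing pitfall, sketched here with hypothetical keys and values:

```python
# when multiple column keys reference the same dict object, a write
# through one key is visible through all of them
shared_normalization_dict = {'maxcount': 5}
column_dict = {'col_0': shared_normalization_dict,
               'col_1': shared_normalization_dict}

column_dict['col_0']['infillvalue'] = 3
# column_dict['col_1'] now also reports the infill value

# storing the value one tier up, or copying the dict per column,
# avoids the overwrite
independent = {col: dict(shared_normalization_dict)
               for col in ('col_0', 'col_1')}
independent['col_0']['infillvalue'] = 7
# independent['col_1'] is unaffected by the write to col_0
```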

6.72

- reintroduced the custom_train convention for 1010 (default categoric encoding under automation)
- identified the source of the edge case for ported 1010 to the custom_train convention noted in 6.71
- it was associated with our removal of the 'zzzinfill' reserved string for NaN missing data representation
- found that NaN had potential to interfere with operations in a few edge cases related to both set operations and pandas column dtype drift
- for set operations, NaN demonstrated at times inconsistent behavior, for example with 'in' detection or use of '|'
- also in some cases set operations resulted in duplicate inclusions of a NaN entry to a set
- additionally, NaN inclusions in replace operations had potential to result in pandas column dtype drift
- which could in edge case replace our string representations of binarization sets to integers, stripping the leading zeros
- settled on a convention that every pandas.replace operation is now applied along with a .astype('object'), which helps mitigate dtype drift
- also for set management surrounding NaN inclusions, we've applied a different method to remove NaN entries, now relying on a set comprehension taking into account that NaN != NaN
- we still like using set operations to manage the unique entries and encodings as they are much more efficient than list methods; now that we can accommodate these further identified NaN entry edge cases we can lift the 'zzzinfill' reserved string requirement
- as noted in 6.70, the intent going forward is to continue porting a few more foundational transforms to the custom_train convention and lift reserved string requirements where possible
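The NaN set behaviors described above are easy to reproduce, and the comprehension exploiting NaN != NaN gives a reliable removal method:

```python
import numpy as np

# np.nan is a single float object, so repeated references dedup,
# but distinct NaN objects are never equal and can accumulate
assert len({np.nan, np.nan}) == 1
assert len({np.nan, float('NaN')}) == 2

# 'in' detection checks identity before equality, so the result
# depends on which NaN object is used as the probe
entries = {np.nan, 'a'}
assert np.nan in entries
assert float('NaN') not in entries

# a set comprehension relying on NaN != NaN strips every NaN variant
cleaned = {x for x in {np.nan, float('NaN'), 'a'} if x == x}
# cleaned == {'a'}
```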
