Automunge

Latest version: v8.33


Page 70 of 99

3.91

- updated spl2 family tree for downstream ord3 instead of ordl
- updated ors6 processdict for use of spl2 transformation function instead of spl5
- fixed "returned column" printouts to show columns in the same order as they are returned in the data sets
- new transform src4, comparable to srch but activations returned in an ordinal encoded column
- note that for cases of multiple search activations, encoding priority is given to entries at the end of the search parameter list over those at the beginning
- new processing function strn, which extracts the longest length string set from categoric entries whose values do not contain any numeric entries
- note that since strn returns non-numeric entries, its family tree is populated with a downstream ord3 encoding
- found and fixed edge case bug associated with our reserved special character set 'zzzinfill'
- so now when raw data containing that entry is passed, it is treated as infill instead of halting operation (as had been the case for a few categoric transforms prior to this update)
- new option for the srch function's search parameter, previously supported as a list of search terms, to now embed lists within that list of search terms
- these embedded lists are aggregated into a common activation
- for example, if parsing a list of names, one could pass the search parameter as ['Mr', ['Miss', 'Ms', 'Mrs']] to aggregate the female titles into a common activation for the embedded list terms
- From a nuts and bolts standpoint, the returned column for aggregated activations is named after the final entry in the embedded list, so here the columns would be returned as column_srch_Mr and column_srch_Mrs
- Note that the intent is to carry this aggregated-activation search parameter option to the other srch functions (src2, src3, src4), saving that for another update
- Added inversion function for UPCS transform (simply a pass-through function, original lower cases not retained)
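To illustrate the embedded-list convention described above, here is a minimal pure-pandas sketch of srch-style activations; `srch_activations` is a hypothetical helper for demonstration, not the actual Automunge implementation (note also that this naive substring match lets 'Mr' activate on 'Mrs' entries):

```python
import pandas as pd

def srch_activations(column, search):
    """Sketch of srch-style boolean activation columns. Entries in
    `search` may be strings or embedded lists of strings; an embedded
    list aggregates its terms into one activation column named after
    the list's final entry (per the 3.91 naming convention)."""
    frame = pd.DataFrame({column.name: column})
    for entry in search:
        terms = entry if isinstance(entry, list) else [entry]
        name = f"{column.name}_srch_{terms[-1]}"
        # activate when any of the (possibly aggregated) terms appears
        frame[name] = column.apply(
            lambda x: int(any(t in str(x) for t in terms)))
    return frame

names = pd.Series(['Mr Smith', 'Mrs Jones', 'Ms Doe'], name='column')
result = srch_activations(names, ['Mr', ['Miss', 'Ms', 'Mrs']])
# aggregated column is named after the final embedded entry: column_srch_Mrs
```

Here the 'Miss'/'Ms'/'Mrs' terms all feed the single column_srch_Mrs activation, matching the aggregation behavior described in the bullets above.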

3.90

- corrected typo in datatype assignment for inversion from smoothed labels (corrected "int8" to "np.int8")
- small cleanup for floatprecision methods to only inspect for 64bit precision in single location for clarity
- populated a new data structure in postprocess_dict as postprocess_dict['inputcolumn_dict']
- don't have a particular use for this right now, but thought it might be of potential benefit for downstream use

3.89

- new postmunge parameter inversion
- to recover formatting of source columns
- such as for example to invert predictions to original formatting of labels
- this method is intended as an improvement on the label set normalization dictionaries previously returned in labelsencoding_dict for this purpose
- note that for cases where columns were originally returned in multiple configurations
- inversion selects a path of inversion transformations based on heuristic of shortest depth
- giving priority to those paths of full information retention
- method supported by new optional processdict entries inverseprocess and info_retention
- inversion transformations now available for transformation categories: nmbr, nbr2, nbr3, mean, mea2, mea3, MADn, MAD2, MAD3, mnmx, mnm2, retn, text, txt2, ordl, ord2, ord3, bnry, bnr2, 1010, pwrs, pwr2, pwor, por2, bins, bint, boor, bnwd, bnwK, bnwM, bnwo, bnKo, bnMo, bene, bne7, bne9, bneo, bn7o, bn9o, log0, log1, logn, lgnm, sqrt, addd, sbtr, mltp, divd, rais, absl, bkt1, bkt2, bkt3, bkt4
- inversion can be passed to postmunge as one of {False, 'test', 'labels'}, where default is False
- where 'test' accepts a test set consistent in form to the train set returned from automunge(.)
- and 'labels' accepts a label set consistent in form to the label set returned from automunge(.)
- note that inversion operation also inspects parameters LabelSmoothing, pandasoutput and printstatus
- note that inversion operation is pending support for train sets upon which dimensionality reduction techniques were performed (such as PCA, feature importance, or Binary).
- recovery from smoothed labels is supported.
- note that categorical encodings (bnry, 1010, text, etc.) previously had a few different conventions for plug values in the various scenarios requiring them; we are now standardizing on the arbitrary string 'zzzinfill' as the plug value in categorical set encodings for cases where infill is required
- so to be clear, 'zzzinfill' is now a reserved special character set for categorical encodings
- (this item becomes more customer-facing with inversion)
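As a sketch of what full-information-retention inversion means for a normalization like mnmx, the pair of hypothetical functions below records the parameters needed to exactly recover the source column; this loosely mirrors the idea of the inverseprocess/info_retention processdict entries, not the actual Automunge internals:

```python
import pandas as pd

def mnmx_forward(column):
    """Sketch of a min-max (mnmx-style) normalization that records
    the parameters needed for later inversion."""
    cmin, cmax = column.min(), column.max()
    normalized = (column - cmin) / (cmax - cmin)
    return normalized, {'min': cmin, 'max': cmax}

def mnmx_invert(normalized, params):
    # full information retention: the source values are recovered exactly
    return normalized * (params['max'] - params['min']) + params['min']

source = pd.Series([2.0, 4.0, 6.0])
encoded, params = mnmx_forward(source)
recovered = mnmx_invert(encoded, params)
```

Transforms without full information retention (e.g. binned encodings) can only approximate the source, which is why the heuristic above prioritizes full-retention paths.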

3.88

- found a few small efficiency improvement opportunities in bnry and bnr2 transforms
- replaced pandas isin calls with np.where
- (this item was a relic from some early experiments)
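A minimal sketch of the kind of substitution described, showing a pandas isin membership test against a single activation value replaced with an equivalent np.where call (illustrative only, not the Automunge source):

```python
import numpy as np
import pandas as pd

column = pd.Series(['cat', 'dog', 'cat', 'dog'])

# isin-based version: per-element set membership test
isin_encoding = column.isin(['cat']).astype(np.int8)

# np.where-based version: single vectorized comparison
where_encoding = pd.Series(np.where(column == 'cat', 1, 0), dtype=np.int8)
```

Both produce the same 0/1 activations; np.where sidesteps the set-membership machinery when only one target value is involved.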

3.87

- replaced all instances of scikit train_test_split and shuffle functions with streamlined pandas methods
- trying to reduce the number of imports for simplicity; pandas is preferred for consistency of infrastructure

3.86

- new transform logn, comparable to log0 but natural log instead of base 10
- new normalization lgnm intended for lognormal distributed sets
- where lgnm is achieved by a logn performed upstream of a nmbr normalization
- reintroduced the original srch configuration as src3
- where expectation is srch is preferred for unbounded range of unique values
- including for scaling with increasing number of search terms
- src2 preferred when there is a bounded range of unique values for both train & test
- and (speculating) that src3 may be beneficial when there is a bounded range of unique values and a high number of search terms, but capacity is still wanted to handle values in the test set not found in the train set
- (leaving validation of this point for future inquiry)
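The lgnm composition described above (a natural log upstream of a z-score normalization) can be sketched in a few lines; these hypothetical `logn`/`lgnm` helpers illustrate the convention and are not the Automunge transformation functions themselves:

```python
import numpy as np
import pandas as pd

def logn(column):
    # natural-log counterpart to the base-10 log0 transform
    return np.log(column)

def lgnm(column):
    """Sketch of the lgnm convention: logn applied upstream of an
    nmbr-style z-score normalization, intended for lognormally
    distributed sets."""
    logged = logn(column)
    return (logged - logged.mean()) / logged.std()

data = pd.Series(np.exp([1.0, 2.0, 3.0]))  # exactly lognormal in this toy case
normalized = lgnm(data)
```

Taking the log first moves a lognormal distribution to a normal one, so the downstream z-score normalization behaves as intended.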



© 2024 Safety CLI Cybersecurity Inc. All Rights Reserved.