Automunge

Latest version: v8.33


7.82

- each of the DP root categories (e.g. DPnb, DPmm, DP**, etc), which default to injecting noise to train data and not to test data (i.e. trainnoise=True, testnoise=False) now have otherwise equivalent variations as DT root categories (e.g. DTnb, DTmm, DT**, etc) which default to injecting to test data and not to train data (i.e. trainnoise=False, testnoise=True), or as DB root categories (e.g. DBnb, DBmm, DB**, etc) which default to injecting to both train and test data (i.e. trainnoise=True, testnoise=True).
- added a clarifying code comment to the np.inf -> nan conversion explaining why the internal pandas method (use_inf_as_na) is not used: avoiding it preserves the property of NaN != NaN
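
The default pairing of trainnoise and testnoise across the three root category families can be summarized in a short sketch. The lookup table and the `default_noise_flags` helper are hypothetical names written only for illustration; they are not part of the Automunge API and simply encode the defaults listed in the first item above.

```python
# Hypothetical lookup, not an Automunge function.
# It encodes the default noise flags described in the 7.82 notes.
NOISE_DEFAULTS = {
    "DP": {"trainnoise": True, "testnoise": False},   # noise to train data only
    "DT": {"trainnoise": False, "testnoise": True},   # noise to test data only
    "DB": {"trainnoise": True, "testnoise": True},    # noise to both
}


def default_noise_flags(category):
    """Return the default noise flags for a root category such as 'DPnb' or 'DTmm'."""
    prefix = category[:2]
    if prefix not in NOISE_DEFAULTS:
        raise ValueError(f"not a noise root category: {category!r}")
    # Return a copy so callers cannot mutate the shared table.
    return dict(NOISE_DEFAULTS[prefix])


print(default_noise_flags("DTnb"))
```

Under this reading, swapping the prefix (e.g. DPmm to DBmm) changes only where noise is injected by default, not the underlying transform.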

7.81

- numeric noise transforms (DPnb, DPmm, DPrt) now have options for sampling from a uniform distribution
- available by noisedistribution or test_noisedistribution parameters
- uniform sampling activated by passing as one of {'uniform', 'abs_uniform', 'negabs_uniform'}
- where the abs and negabs scenarios are for all positive noise or all negative noise
- note that the high and low of uniform range are derived as high=(sigma-mu) and low=(mu-sigma)
- found and fixed a variable naming snafu with abs/negabs test_noisedistribution scenarios in DPnb
- found and fixed a variable naming snafu with noisedistribution for traindata=True scenario in DPnb postprocess
- also new option for str_convert parameter in categoric encodings (bnry, onht, ordl/ord3, 1010)
- previously str_convert was available as a boolean; 7.79 changed the default to True to accommodate an edge case for some of these categoric variants when bytes type entries are present
- where str_convert=True results in common encodings between e.g. int and str(int) (2=='2')
- new scenario is str_convert='conditional_on_bytes'
- which resets to str_convert=True when bytes entries are present in train set (to avoid the edge case), otherwise defaults to resetting to str_convert=False (which has benefit of distinct encodings between int and str(int) )
- leaving str_convert=True as the default since it is easier to describe
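
The uniform sampling options above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name `sample_uniform_noise` is hypothetical, the range bounds follow the formulas exactly as stated in these notes, and deriving the abs and negabs variants by taking the absolute value of a draw is an assumption.

```python
import random


def sample_uniform_noise(mu, sigma, noisedistribution="uniform", rng=random):
    """Draw one noise sample (hypothetical helper, not part of Automunge)."""
    # Range bounds exactly as stated in the 7.81 notes.
    high = sigma - mu
    low = mu - sigma
    sample = rng.uniform(low, high)
    if noisedistribution == "uniform":
        return sample
    if noisedistribution == "abs_uniform":
        # Assumption: all positive noise obtained as the absolute value of the draw.
        return abs(sample)
    if noisedistribution == "negabs_uniform":
        # Assumption: all negative noise obtained as the negated absolute value.
        return -abs(sample)
    raise ValueError(f"unsupported noisedistribution: {noisedistribution!r}")


rng = random.Random(0)
print(sample_uniform_noise(0.0, 0.06, "abs_uniform", rng))
```

With mu=0 the stated bounds give a range symmetric about zero, from -sigma to sigma.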

7.80

- revisited the ID column extraction scenarios via trainID_column and testID_column parameters
- ID columns were based on identifying training/test set columns to extract from the feature sets and return in ID sets consistently shuffled and partitioned
- thus if the df_train/df_test had features not intended for passing to a downstream ML model, they could be extracted and returned with retained coordination to the other features
- note that the ID sets are also used to return the "Automunge_index" and any other non-range integer index columns that were included in df_train/df_test dataframes
- the new convention adds an option to redundantly include columns both in the feature sets for encoding, and the ID sets for retention of original form
- the thought was that there might be scenarios where original form would be desired for comparison to encodings while retaining order and partitionings from any shuffling or validation data splits
- this option available by passing trainID_column or testID_column as a list of two lists, e.g. [list1, list2]
- where the first list may include ID columns to be struck from the features and the second list may include ID columns to be retained in the features
- also improved the convention that if testID_column not specified and trainID columns present in df_test they are automatically extracted for ID sets
- previously this only worked when the full ID set was present in df_test, now even if ID sets are partially populated in df_test, they will be automatically extracted based on their inclusion in the trainID sets
- so basically we only suggest using the testID_column specification for cases where df_test includes ID columns not present in df_train
- also added some code comments clarifying interpretation of various postprocess_dict entries
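
The two-list form of trainID_column described above can be sketched as a column partition. The helper `split_id_columns` and the column names are hypothetical, written only to illustrate the convention; treating a flat list as "strike everything from the features" is an assumption based on the earlier behavior described in these notes.

```python
def split_id_columns(columns, trainID_column):
    """Partition column headers into feature columns and ID columns.

    Hypothetical helper, not part of Automunge. Accepts either a flat list of
    ID columns or a list of two lists, [struck_from_features, retained_in_features].
    """
    is_two_list_form = (
        len(trainID_column) == 2
        and all(isinstance(item, list) for item in trainID_column)
    )
    if is_two_list_form:
        struck, retained = trainID_column
    else:
        # Assumption: a flat list removes every listed column from the features.
        struck, retained = list(trainID_column), []

    # Struck columns leave the feature set; retained columns stay for encoding.
    feature_columns = [c for c in columns if c not in struck]
    # Both groups are returned in the ID set in their original form.
    id_columns = [c for c in columns if c in struck or c in retained]
    return feature_columns, id_columns


columns = ["customer_id", "timestamp", "age", "income"]
features, ids = split_id_columns(columns, [["customer_id"], ["timestamp"]])
print(features)
print(ids)
```

Here "timestamp" appears in both outputs, which is the redundant inclusion the new convention allows.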

7.79

- found a data type scenario missing from our validation tests associated with bytes type entries
- categoric encodings (e.g. 1010, onht, ordl, bnry, etc) have updated default configuration for str_convert from False to True to accommodate edge case for bytes type entries
- str_convert means that strings and integer equivalents have common encodings (e.g. '2' == 2)
- note that this results in inversion returning str(int) entries corresponding to int input
- for distinct encodings between str(int) and int, the str_convert parameter can still be deactivated, noting that this results in an edge case for bytes type entries
- also updated the evaluations under automation to treat majority bytes entries as targets for categoric encodings
- also small bit of minutia, defaultinfill for lcinfill now converts to str if any bytes are present
- the result of this update is that library now has global support for bytes type entries
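
The effect of str_convert on encodings and on inversion can be sketched with a toy ordinal encoder. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, string conversion is done with Python's built-in `str`, and integer codes are assigned in order of first appearance purely for simplicity.

```python
def fit_ordinal(entries, str_convert=True):
    """Build a toy ordinal mapping (hypothetical helper, not part of Automunge)."""
    mapping = {}
    for entry in entries:
        # With str_convert active, every entry is keyed by its string form.
        key = str(entry) if str_convert else entry
        if key not in mapping:
            mapping[key] = len(mapping)
    return mapping


def encode(entry, mapping, str_convert=True):
    """Look up the integer code for one entry."""
    key = str(entry) if str_convert else entry
    return mapping[key]


def invert(code, mapping):
    """Recover the stored key for a code; under str_convert this is a string."""
    inverse = {value: key for key, value in mapping.items()}
    return inverse[code]


mapping = fit_ordinal([2, "2", b"x"], str_convert=True)
print(encode(2, mapping) == encode("2", mapping))
print(repr(invert(encode(2, mapping), mapping)))
```

In the sketch, inverting the code for the integer 2 returns the string form, mirroring the note above about inversion returning str(int) entries.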

7.78

- resolved a printout associated with receiving bytes dtype entries
- originating from an inspection of entries for NaN conversion from np.inf

7.77

- noise injection transforms now have assignparam option as retain_basis
- retain_basis is associated with passing other noise distribution parameters as lists of candidates or as scipy stats distributions
- in the default retain_basis = False case a unique sampling is applied in both automunge(.) and postmunge(.)
- in the retain_basis = True case the sampling from automunge(.) is retained as basis for postmunge(.)
- also postmunge(.) now accepts df_test as a set of labels without features
- (requires df_test as dataframe with label column headers populated)
- we expect the case of features without labels will be more common, this just provided to be thorough
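
The retain_basis behavior described above can be sketched for the case where a noise parameter is passed as a list of candidates. The two functions are hypothetical stand-ins for the automunge(.) and postmunge(.) stages and are not the library's API; choosing uniformly among candidates is an assumption made for illustration.

```python
import random


def fit_noise_parameter(candidates, retain_basis=False, rng=random):
    """Stand-in for the automunge(.) stage: sample a value and record state."""
    sampled = rng.choice(candidates)
    state = {
        "candidates": list(candidates),
        "retain_basis": retain_basis,
        "basis": sampled,  # value sampled at the automunge(.) stage
    }
    return sampled, state


def apply_noise_parameter(state, rng=random):
    """Stand-in for the postmunge(.) stage: reuse the basis or sample again."""
    if state["retain_basis"]:
        # retain_basis=True: the earlier sampling is kept as the basis.
        return state["basis"]
    # retain_basis=False: a unique sampling is applied at this stage.
    return rng.choice(state["candidates"])


rng = random.Random(1)
sigma_train, state = fit_noise_parameter([0.03, 0.06, 0.09], retain_basis=True, rng=rng)
sigma_test = apply_noise_parameter(state, rng=rng)
print(sigma_train == sigma_test)
```

With retain_basis=True the later stage always reproduces the earlier draw, while retain_basis=False may yield a different candidate on each call.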
