Automunge

Latest version: v8.33


7.34

- found an inconsistency in workflow order of operations between automunge and postmunge
- associated with label set extraction and index extraction
- basically the convention is that when df_train receives a non-range integer index, it is extracted and returned in the ID sets
- the issue was that in automunge we extracted the non-range integer index prior to extracting the label set
- but in postmunge we had these in reverse order
- resulting in potential scenario where the postmunge labels had an index not matching the features
- which is an error channel
- resolved by simply swapping the order between postmunge workflow blocks "postmunge ID set populated" and "postmunge labels and other variable initializations"
- (now the ID set gets populated first)
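
To illustrate why the order matters, here is a minimal pandas sketch. It is not Automunge's internal code: the helper `split_labels_and_ids` and its `ids_first` flag are hypothetical, and it assumes that index extraction resets the features to a range index.

```python
import pandas as pd

def split_labels_and_ids(df, label_col, ids_first):
    """Hypothetical helper: split labels and a non-range index out of df."""
    df = df.copy()
    ids = None
    labels = None
    if not ids_first:
        # labels are pulled out while df still carries the original index
        labels = df.pop(label_col)
    if not isinstance(df.index, pd.RangeIndex):
        # non-range index is moved to an ID set, features get a range index
        ids = pd.DataFrame({"original_index": df.index})
        df = df.reset_index(drop=True)
    if ids_first:
        # labels are pulled out after the reset, so they share the new index
        labels = df.pop(label_col)
    return df, labels, ids

df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0, 1, 0]}, index=[10, 20, 30])

features_a, labels_a, ids_a = split_labels_and_ids(df, "y", ids_first=True)
features_b, labels_b, ids_b = split_labels_and_ids(df, "y", ids_first=False)

consistent = features_a.index.equals(labels_a.index)
mismatched = not features_b.index.equals(labels_b.index)
```

With the ID set populated first, features and labels end up on the same index; with the reverse order, the labels keep the original index while the features do not.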

7.33

- found a (remote) edge case for drift report, now mitigated
- associated with the new categorylist[0] convention for storing normalization_dict
- associated with cases where orig categorylist[0] != new categorylist[0]
- also a few other remote edge cases for drift report identified from running comprehensive validations
- including edge cases associated with None / nan conversion and an edge case for null transform
- in the process performed some variable naming cleanup for drift report assembly
- had aggregated lists of returnedcolumns and newreturnedcolumns, but was referring to an entry in newreturnedcolumns as returncolumn, which was kind of confusing
- now referring to entry in newreturnedcolumns as newreturnedcolumn
- also revised the source column drift stat assembly heuristic for categoric features, increasing the heuristic threshold from 500 to 5000

7.32

- added a source column drift stat for categoric root categories as unique_ratio
- (recorded in cases of root category with justNaN NArowtype)
- unique_ratio records a dictionary mapping unique entries to a float representing percent of entries in the feature with that value
- this is sort of similar to the derived column drift stat activation_ratios previously recorded for a few categoric and binning transforms
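
As a rough picture of what such a stat looks like, here is a sketch using pandas. The function name `unique_ratio` mirrors the stat's name but the implementation is an assumption, as is the treatment of missing values (excluded here) and the use of fractions in the range 0 to 1.

```python
import pandas as pd

def unique_ratio(column):
    """Map each unique entry to the fraction of non-null rows holding it."""
    # normalize=True returns relative frequencies instead of counts
    return column.value_counts(normalize=True, dropna=True).to_dict()

feature = pd.Series(["a", "b", "a", "a"])
ratios = unique_ratio(feature)
```

Comparing the dictionary recorded for the training data against one computed on later data gives a simple view of how a categoric feature's distribution has drifted.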

7.31

- a slight reshuffling of postmunge drift reporting to improve flow of printouts
- (source column drift stats are now collected prior to derived column drift stats)
- in the process identified a driftreport scenario that wasn't collecting source column drift stats as intended
- (associated with driftreport = True)
- a cleanup to returned validation results in cases of privacy encoding to remove leakage channel
- technically this wasn't really a leakage channel since it was part of the encrypted entries
- still if user wanted to do their own privacy management this probably would have been a hard one to spot
- so went ahead and stripped suffix validation results for this purpose
- note that the aggregated suffix overlap results are still available as suffixoverlap_aggregated_result

7.30

- new trainnoise parameter accepted for DP family of noise injection transforms
- trainnoise accepts boolean defaulting to True, indicating that noise injection will be applied to training data
- note that for postmunge application, when a DP transform inspects the traindata parameter, it now also takes into account the trainnoise parameter for determination of whether to inject
- i.e. traindata postmunge parameter selects whether a data set is to be treated as test or train data
- and the trainnoise assignparam option selects whether for a given transform data considered training data will receive injection
- in the process fixed a snafu for test set noise injection in the DPod automunge transform associated with weighted sampling
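
The interaction between the two parameters can be sketched as a small decision function. This is an illustration of the logic described above, not the library's code: the function name `injects_noise` is hypothetical, and the `testnoise` argument is an assumed stand-in for whatever governs injection into data treated as test data.

```python
def injects_noise(traindata, trainnoise=True, testnoise=False):
    """Decide whether a DP transform injects noise into a data set.

    traindata:  whether the data set is treated as training data
    trainnoise: whether data treated as training data receives injection
    testnoise:  assumed counterpart for data treated as test data
    """
    if traindata:
        return trainnoise
    return testnoise

# data treated as train data, default trainnoise: noise is injected
default_train = injects_noise(traindata=True)
# data treated as train data, trainnoise deactivated: no injection
quiet_train = injects_noise(traindata=True, trainnoise=False)
# data treated as test data: trainnoise has no bearing
test_case = injects_noise(traindata=False, trainnoise=True)
```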

7.29

- identified a channel for customML error associated with some quirks of the xgboost library
- specifically, xgboost requires labels as fully represented integers within the set of sequential range from 0 to max of the encoding space
- our customML implementation previously had potential for gaps in the encoding space
- resolved by introducing a new convention for conversion of gaps for customML labels to meet this criterion
- in other words, customML now provides labels as str(int) type with non-negative entries as a fully represented set of the sequential ranged integers from 0 to max of encoding space
- implementation supported by a new data structure returned in postprocess_dict['ML_cmnd']['customML_inference_support']
- struck some redundant ML_cmnd default initializations applied at random forest training (ML_cmnd validations and default initializations all applied in __check_ML_cmnd)
- struck the value error bypass for customML training and inference. Figured that user needs to know if their model is not training.
