
Latest version: v8.33


- found an inconsistency in workflow order of operations between automunge and postmunge
- associated with label set extraction and index extraction
- basically the convention is that when df_train receives a non-range integer index, it is extracted and returned in the ID sets
- the issue was that in automunge we extracted the non-range integer index prior to extracting the label set
- but in postmunge we had these in reverse order
- resulting in potential scenario where the postmunge labels had an index not matching the features
- which is an error channel
- resolved by simply swapping the order between postmunge workflow blocks "postmunge ID set populated" and "postmunge labels and other variable initializations"
- (now the ID set gets populated first)
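
The ordering issue above can be sketched with plain pandas. This is a minimal illustration rather than the library's internal code: the helper name `extract_id_then_labels` and the fallback column name `"index"` are assumptions made for the example. It pulls a non-range index into an ID set first, and only then splits off the labels, so features and labels share the same reset index.

```python
import pandas as pd


def extract_id_then_labels(df, label_column):
    """Hypothetical helper: populate the ID set first, then extract labels."""
    df = df.copy()
    if not isinstance(df.index, pd.RangeIndex):
        # Non-range index: move it into an ID set and reset to a range index.
        index_name = df.index.name or "index"  # assumed fallback name
        id_set = pd.DataFrame({index_name: df.index.to_numpy()})
        df = df.reset_index(drop=True)
    else:
        # Range index: nothing to extract, return an empty ID set.
        id_set = pd.DataFrame(index=df.index)
    # Labels are split off after the index is settled, so they inherit it.
    labels = df.pop(label_column)
    return df, labels, id_set


raw = pd.DataFrame({"a": [1.0, 2.0, 3.0], "y": [0, 1, 0]}, index=[10, 20, 30])
features, labels, id_set = extract_id_then_labels(raw, "y")
```

If the labels were popped before the index reset, they would keep the original index (10, 20, 30) while the features moved to 0, 1, 2, which is the mismatch described above.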


- found a (remote) edge case for drift report, now mitigated
- associated with the new categorylist[0] convention for storing normalization_dict
- associated with cases where orig categorylist[0] != new categorylist[0]
- also a few other remote edge cases for drift report identified from running comprehensive validations
- including edge cases associated with None / nan conversion and an edge case for null transform
- in the process performed some variable naming cleanup for drift report assembly
- had aggregated lists of returnedcolumns and newreturnedcolumns, but was referring to an entry in newreturnedcolumns as returncolumn, which was kind of confusing
- now referring to entry in newreturnedcolumns as newreturnedcolumn
- also revised the source column drift stat assembly heuristic for categoric features, increasing the heuristic threshold from 500 to 5000


- added a source column drift stat for categoric root categories as unique_ratio
- (recorded in cases of root category with justNaN NArowtype)
- unique_ratio records a dictionary mapping unique entries to a float representing percent of entries in the feature with that value
- this is sort of similar to the derived column drift stat activation_ratios previously recorded for a few categoric and binning transforms
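
As a rough illustration of what a unique_ratio style statistic looks like, the sketch below maps each unique entry to its share of the feature. The function name `unique_ratio` mirrors the stat name, but the implementation is an assumption for illustration only. In particular it expresses the share as a fraction between 0 and 1 and does nothing special for NaN entries.

```python
from collections import Counter


def unique_ratio(values):
    """Map each unique entry to the fraction of entries holding that value.

    Illustrative sketch only; not the library's implementation.
    """
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {entry: count / total for entry, count in counts.items()}


ratios = unique_ratio(["circle", "square", "circle", "circle"])
# ratios == {"circle": 0.75, "square": 0.25}
```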


- a slight reshuffling of postmunge drift reporting to improve flow of printouts
- (source column drift stats are now collected prior to derived column drift stats)
- in the process identified a driftreport scenario that wasn't collecting source column drift stats as intended
- (associated with driftreport = True)
- a cleanup to returned validation results in cases of privacy encoding to remove leakage channel
- technically this wasn't really a leakage channel since it was part of encrypted entries
- still if user wanted to do their own privacy management this probably would have been a hard one to spot
- so went ahead and stripped suffix validation results for this purpose
- note that the aggregated suffix overlap results are still available as suffixoverlap_aggregated_result


- new trainnoise parameter accepted for DP family of noise injection transforms
- trainnoise accepts boolean defaulting to True, indicating that noise injection will be applied to training data
- note that for postmunge application, when a DP transform inspects the traindata parameter, it now also takes into account the trainnoise parameter for determination of whether to inject
- i.e. traindata postmunge parameter selects whether a data set is to be treated as test or train data
- and the trainnoise assignparam option selects whether for a given transform data considered training data will receive injection
- in the process fixed a snafu for test set noise injection in the DPod automunge transform associated with weighted sampling
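
The interaction between the two settings for data treated as training data can be summarized as a small truth table. The function below is a hypothetical sketch, not a library call: `train_path_injects` is an invented name, and it covers only the training-data path described above. How data treated as test data is handled is outside this sketch.

```python
def train_path_injects(traindata: bool, trainnoise: bool = True) -> bool:
    """Hypothetical: does the training-data path inject noise for this transform?

    traindata  -- postmunge parameter: is the data set treated as train data
    trainnoise -- assignparam option: does train data receive injection
    """
    return traindata and trainnoise


for traindata in (True, False):
    for trainnoise in (True, False):
        print(traindata, trainnoise, train_path_injects(traindata, trainnoise))
```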


- identified a channel for customML error associated with some quirks of the xgboost library
- specifically, xgboost requires labels as fully represented integers within the sequential range from 0 to max of the encoding space
- our customML implementation previously had potential for gaps in the encoding space
- resolved by introducing a new convention for conversion of gaps for customML labels to meet these criteria
- in other words, customML now provides labels as str(int) type with non-negative entries as a fully represented set of the sequential range integers from 0 to max of the encoding space
- implementation supported by a new data structure returned in postprocess_dict['ML_cmnd']['customML_inference_support']
- struck some redundant ML_cmnd default initializations applied at random forest training (ML_cmnd validations and default initializations all applied in __check_ML_cmnd)
- struck the value error bypass for customML training and inference. Figured that user needs to know if their model is not training.
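
The gap conversion idea can be sketched as a pair of mappings: one that closes gaps before training, and an inverse that restores the original encodings after inference. This is a minimal sketch under assumptions: the helper name `close_label_gaps` is hypothetical, it works on plain integers rather than str(int) entries, and it does not reflect the actual layout of the customML_inference_support data structure, which is not described here.

```python
def close_label_gaps(labels):
    """Hypothetical helper: re-encode labels onto 0..n-1 with no gaps.

    Returns the encoded labels plus forward and inverse mappings, so that
    predictions can be converted back to the original encoding space.
    """
    unique_sorted = sorted(set(labels))
    forward = {original: position for position, original in enumerate(unique_sorted)}
    inverse = {position: original for original, position in forward.items()}
    encoded = [forward[label] for label in labels]
    return encoded, forward, inverse


# Original encoding has gaps (1, 2, 4, 5, 6 are unused).
encoded, forward, inverse = close_label_gaps([0, 3, 3, 7, 0])
# encoded == [0, 1, 1, 2, 0]

# After inference, predictions map back through the inverse.
restored = [inverse[value] for value in encoded]
# restored == [0, 3, 3, 7, 0]
```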
