Automunge

Latest version: v8.33


6.40

- important update, deprecated postmunge parameter labelscolumn
- which originally was used to specify when a labels column was included in df_test
- eventually the convention evolved that the presence of labels is automatically detected independent of labelscolumn
- so this parameter is no longer needed
- also deprecated boolean True option for parameter testID_column for both automunge(.) and postmunge(.)
- originally testID_column=True was used to indicate that test set has same ID columns as train set
- later introduced the convention that the presence of trainID_column entries in the test set is detected automatically when testID_column=False
- so testID_column=True scenario no longer needed
- note user can still pass testID_column as list if ID columns for test set are different than for train set
- as is needed if test ID columns are a subset of train ID columns or are otherwise different
- moved the support function _list_replace in code base to location with other automunge support functions
- (in general, the code base is organized as:
  - initialize transform_dict and process_dict
  - processfamily functions
  - dualprocess / singleprocess functions
  - column evaluations via evalcategory
  - ML infill support functions
  - misc automunge support functions
  - automunge(.)
  - postprocessfamily functions
  - postprocess functions
  - postmunge ML infill support functions
  - misc postmunge support functions
  - postmunge(.)
  - inverseprocess functions
  - inversion support functions)
- there is a method to the madness :)
- also, we had a convention of duplicating column_dict entries for excl columns so they could be accessed easily both with and without suffix appenders
- now that we've simplified the suffix conversion scheme with _list_replace decided to strike
- as the redundant entries kind of muddied the waters
- as the Zen of Python says, there should be one and preferably only one way to do it
- so went ahead and scrubbed these redundant column_dict entries for excl without suffix
- also found one more excl suffix relic in postmunge missed in last rollout, converted to the new _list_replace convention
- struck two unused variables in postmunge labelsprocess_dict and labelstransform_dict
- also consolidated to a single location the accounting for parameters passed as lists, which are copied into internal state
- in the process found a few more parameters where this was needed
- (now in addition to trainID_column and testID_column also perform for Binary and PCAexcl and inversion in postmunge)
- new validation check twofeatures_for_featureselect_valresult performed for featureselection to confirm a minimum of two features included in train set (a remote edge case)
- new automunge(.) validation check trainID_column_subset_of_df_train_valresult performed to identify cases where user specified trainID_column entries not present in df_train
- similar new automunge(.) validation check testID_column_subset_of_df_test_valresult performed to identify cases where user specified testID_column entries not present in df_test
- similar new postmunge(.) validation check pm_testID_column_subset_of_df_test_valresult to identify cases where user specified testID_column entries not present in df_test
- new postprocess_dict entry PCAn_components_orig which is the PCAn_components as passed to automunge(.) (which may be different than the recorded PCAn_components entry if was reset based on heuristic)
- finally, a few more cleanups to PCA
- eliminated a redundant call of support function _evalPCA
- in the process made a few tweaks for clarity to support functions _PCAfunction and _postPCAfunction
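
The consolidation of list-parameter copying described above might look something like the following (a conceptual sketch only, with a hypothetical function name; not Automunge's actual code):

```python
def _copy_list_params(params):
    """Copy any parameters passed as lists into internal state,
    so later mutation of the caller's lists can't affect processing.
    Conceptual sketch only; the name and structure are hypothetical."""
    return {key: list(value) if isinstance(value, list) else value
            for key, value in params.items()}

# e.g. for parameters like trainID_column, testID_column, Binary, PCAexcl
external = {'trainID_column': ['id1', 'id2'], 'Binary': False}
internal = _copy_list_params(external)
external['trainID_column'].append('id3')  # caller mutates their list
# the internal copy is unaffected
```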

6.39

- reconsidered treatment of one of primary edge cases in the library
- which is for the excl passthrough category (direct passthrough with no transforms or infill)
- we have convention that for excl transform a suffix is appended to populate data structures
- and then the suffix is scrubbed from the returned columns
- unless user elects to maintain suffix with the excl_suffix parameter
- (as suffix retention makes navigating data structures much easier for excl columns)
- so we had a few places in the library, particularly in inversion, where we were manipulating column header strings to accommodate this convention
- which was somewhat inelegant
- so populated two new data structures returned in postprocess_dict as excl_suffix_conversion_dict and excl_suffix_inversion_dict
- where excl_suffix_conversion_dict maps from columns without suffix to columns with suffix
- and excl_suffix_inversion_dict maps from columns with suffix to columns without
- these are now the primary means of conversion to accommodate the excl suffix convention
- coupled with a short support function _list_replace which replaces items in list based on a conversion dictionary
- used the opportunity to clean up the initial scrubbing of suffix appenders and improved code comments
- extensive cleanups to the inversion code to simplify excl edge case accommodation
- which in the process fixed a newly identified edge case for inversion of excl columns
- rewrote _inverseprocess_excl making use of excl_suffix_inversion_dict
- and finally a cleanup (streamlined with equivalent functionality) to _df_inversion_meta
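
A minimal sketch of a helper like _list_replace, with its behavior assumed from the description above (illustrative column headers; not Automunge's actual implementation):

```python
def _list_replace(items, conversion_dict):
    # return a new list where any item found as a key in
    # conversion_dict is replaced by its mapped value,
    # leaving other items unchanged
    return [conversion_dict.get(item, item) for item in items]

# e.g. converting excl column headers to a suffixed form
# (the mapping entries here are hypothetical)
excl_suffix_conversion_dict = {'column1': 'column1_excl'}
columns = ['column1', 'column2']
converted = _list_replace(columns, excl_suffix_conversion_dict)
```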

6.38

- ok an additional audit found that 6.37 introduced a small bug in automunge label processing
- associated with inconsistent variable name in conjunction with category assignment to one of ['eval', 'ptfm']
- aligned label processing to use a single variable name
- also removed an unused variable for train data processing
- added a code comment about process_dict inspection for label processing as a potential future extension
- (somewhat similar to comment that was struck in 6.37)

6.37

- extensive walkthrough of automunge(.) and postmunge(.) functions
- several code comment cleanups and clarifications for both
- a few reconfigurations to order of operations for clarity
- struck some unused variables
- several improvements along these lines
- new validation test reported as familytree_for_offspring_result
- performed in _check_haltingproblem, _check_offspring
- familytree_for_offspring_result is triggered when a transformation category is entered as an entry to a family tree primitive with offspring but doesn't have its own family tree defined for accessing downstream offspring
- gave some added attention to the PCA implementation, which was kind of inelegant
- a few cleanups and clarifications to the PCA workflow in automunge
- fixed PCA heuristic (when PCAn_components passed as None) so that it can be applied with other PCA types other than default
- where PCA type is based on ML_cmnd['PCA_type']
- rewrote support function _evalPCA
- now with improved function description and much clearer implementation
- found and fixed bug for returning numpy sets when pandasoutput set as False associated with an uninitialized variable
- removed placeholders for deprecated postmunge parameters LabelSmoothing and LSfit

6.36

- rewrote bxcx transformation functions to closer align with conventions of rest of library and for clarity of code
- new transform available as qttf
- which is built on top of Scikit-Learn sklearn.preprocessing.QuantileTransformer
- qttf supports inversion, inplace not currently supported
- qttf defaults to returning a normal output distribution
- which differs from scikit default of uniform
- the thought is that in general we expect surrounding variables will more closely adhere (by degree) to a normal rather than a uniform distribution
- and thus this default will better approximate i.i.d. in aggregate
- alternate category available as qtt2 which defaults to uniform output distribution
- more info on quantile transform available in scikit documentation
- currently qttf is the only transform in library built on top of a scikit-learn preprocessing implementation
- and bxcx is only transform in library built on top of a scipy stats transformation
- (we also use scikit-learn for predictive models / tuning / PCA and scipy stats for distributions / measurement)
- the incorporation of a quantile transform option to library was partly inspired by a comment in paper “Revisiting Deep Learning Models for Tabular Data” by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, Artem Babenko
- also a few cleanups to imports for clarity of presentation
- struck numpy import for format_float_scientific which wasn't used
- struck some redundant scipy imports for statistics measurement
- a few other cleanups and clarifications to imports
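
Since qttf is built on top of sklearn.preprocessing.QuantileTransformer, the underlying scikit-learn behavior (normal output distribution, invertibility) can be demonstrated directly; this is the scikit-learn API, not Automunge's qttf interface:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(42)
X = rng.exponential(size=(500, 1))  # a skewed input feature

# output_distribution='normal' mirrors the qttf default;
# scikit-learn's own default is 'uniform' (as in qtt2)
qt = QuantileTransformer(n_quantiles=100,
                         output_distribution='normal',
                         random_state=0)
Xt = qt.fit_transform(X)

# the transform is invertible, consistent with qttf supporting inversion
Xr = qt.inverse_transform(Xt)
```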

6.35

- updated PCAexcl default from False to empty list [] to address an edge case with prior convention
- new default convention: boolean and ordinal columns are excluded from PCA unless otherwise specified in ML_cmnd
- new scenarios added to Binary dimensionality reduction as can be specified by the Binary parameter
- whose original implementation was to consolidate categoric features into a single common binarization
- either with or without replacement of the consolidated features
- which in cases of redundancies/correlations between features may result in reducing column count of returned set
- as well as help to alleviate overfit that may result from highly redundant features
- the new scenarios, instead of consolidating into a common binarization, consolidate into a common ordinal encoding
- which may be useful to feed into an entity embedding layer, for instance
- Binary='ordinal' => to replace the consolidated categoric sets with a single column with ordinal encoding (via ord3)
- Binary='ordinalretain' => to supplement the categoric sets with a single consolidated column with ordinal encoding (via ord3)
- so Binary already had convention that could pass as a list of column headers to only consolidate a subset instead of all categoric features, with first entry optionally as a boolean False to trigger retain option
- so added more special data types for the first entry when Binary is passed as a list to accommodate the ordinal options
- so now if Binary is passed as a list, the first entry signals the scenario:
  - Binary[0] is True => 'ordinalretain'
  - Binary[0] is False => 'retain'
  - Binary[0] is None => 'ordinal'
  - otherwise, when Binary[0] is a string column header, Binary is treated as the default: a 1010 encoding replacing the consolidated columns
- a few cleanups to Binary and Binary inversion implementation in the process
- finally, a small tweak to default transformation category under automation applied to numeric label sets
- reverting to a prior convention from the library of treating numeric labels with z-score normalization instead of leaving data unscaled
- have seen kind of conflicting input on this matter in the literature on whether there is ever any benefit to scaling numeric labels
- did a little more digging and found some valid discussions on Stack Overflow offering scenarios where scaling labels may be of benefit
- as well as finding domain experts following the practice such as in implementation of experiments in paper “Revisiting Deep Learning Models for Tabular Data” by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, Artem Babenko
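
Z-score normalization of numeric labels, as reverted to here, is the standard (x - mean) / std scaling; a minimal sketch (not Automunge's implementation, function name hypothetical):

```python
import numpy as np

def zscore_labels(labels):
    # scale to zero mean / unit variance, returning the parameters
    # needed to invert the scaling on downstream predictions
    mean, std = labels.mean(), labels.std()
    return (labels - mean) / std, mean, std

labels = np.array([10.0, 20.0, 30.0, 40.0])
scaled, mean, std = zscore_labels(labels)
# predictions can be returned to original units via scaled * std + mean
recovered = scaled * std + mean
```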
