Automunge

Latest version: v8.33

6.71

- new form of input accepted for automunge(.) valpercent parameter
- previously valpercent was accepted as float in range 0-1
- used to specify the ratio of the validation split for partitioning from the training data
- where validation set was either partitioned based on a random sampling of rows when shuffletrain was activated
- or otherwise partitioned from bottom sequential rows of the training set when shuffletrain was deactivated
- new convention is that valpercent can optionally be specified as a tuple in the form valpercent=(start, end)
- where start is a float in the range 0<=start<1
- and end is a float in the range 0<end<=1
- and where start < end
- the tuple option allows user to designate specific portions of the training set for partitioning
- for example, if specified as valpercent=(0.2, 0.4), the returned training data would consist of the first 20% of rows and the last 60% of rows, while the validation set would consist of the remaining rows
- note that if shuffletrain is activated (as either True or as 'traintest'), the returned train set and validation set rows will be individually shuffled after partitioning
- please note that automunge(.) already had support for simultaneous preparation of training and validation data, where validation data was partitioned from the received training data and prepared separately on the train set basis
- however in prior configuration user only had options for validation partitioning from random sampled rows or bottom sequential
- with the new configuration user can now specify specific partitions of the training data to segregate for validation sets
- the purpose of this new valpercent tuple option is to support integration into a cross validation operation
- also revised the prior function for partitioning validation sets which should result in reduced memory overhead
- also, further validation identified a scenario where the new porting of 1010 to custom_train (from 6.70) had an edge case. It’s a very remote edge case, but an edge case nonetheless. Going to revert 1010 to the prior transform convention until this is resolved. (The edge case only manifested when 1010 was performed downstream of a string operation on a particular testing feature. It was missed in testing with the last rollout because we didn’t think to validate application of 1010 as a downstream transform. Lesson learned to adhere to comprehensive validations with each rollout.)
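
To make the valpercent tuple convention above concrete, here is a minimal sketch in plain Python. It is not Automunge's implementation: the helper names `split_by_fraction` and `fold_tuples` are hypothetical, and the rule for rounding fractions to row indices is an assumption. It shows how a (start, end) tuple carves a validation slice out of the middle of the training rows, and how a set of tuples could drive a cross validation loop.

```python
def split_by_fraction(rows, start, end):
    """Partition rows into (train, validation) from a (start, end) fraction tuple."""
    if not (0 <= start < end <= 1):
        raise ValueError("require 0 <= start < end <= 1")
    n = len(rows)
    lo = round(n * start)  # first validation row (rounding rule is an assumption)
    hi = round(n * end)    # one past the last validation row
    validation = rows[lo:hi]
    train = rows[:lo] + rows[hi:]
    return train, validation


def fold_tuples(k):
    """Return k contiguous (start, end) tuples that together cover the full set."""
    return [(i / k, (i + 1) / k) for i in range(k)]


rows = list(range(10))
train, validation = split_by_fraction(rows, 0.2, 0.4)
print(train)       # [0, 1, 4, 5, 6, 7, 8, 9]
print(validation)  # [2, 3]

# Each tuple could be passed as valpercent in a separate automunge(.) call.
for start, end in fold_tuples(5):
    fold_train, fold_val = split_by_fraction(rows, start, end)
```

With valpercent=(0.2, 0.4) on ten rows, the first two rows and the last six stay in the training set, matching the 20% / 60% description in the notes.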

6.70

- rewrite of 1010 binarization transform, which is the default categoric encoding under automation
- ported to the custom_train convention
- with new functions _custom_train_1010 / _custom_test_1010 / _custom_inversion_1010
- resulting in much cleaner code presentation, easier to understand
- while retaining consistent functionality and accepted parameters
- the primary updated convention is that now the missing data representation prior to ML infill defaults to all 0's
- as opposed to prior where missing data may have had different activation sets per application
- in the process trimmed a little fat
- such as removed the 'zzzinfill' reserved string requirement
- as well as eliminated what we referred to as an overlap replace operation
- (which was there to accommodate an assumed edge case with pd.replace which I now am unable to duplicate)
- expecting this revision will benefit latency for this transform
- which, since it is the default under automation, we thought was a worthy goal
- in the process I think I identified the edge case that was the original rationale for the 'zzzinfill' convention
- which turns out can be accommodated in a much simpler fashion by simply resetting dtype to 'object'
- note the prior defined 1010 functions are retained in library for backward compatibility purposes
- which are also used for Binary dimensionality reduction
- also small tweak to the wrapper function for custom_train
- now the recorded categorylist is guaranteed to have its entries in the same order as the columns found in the returned data
- intend going forward to continue porting a few more foundational transforms to the custom_train convention and to lift reserved string requirements where possible
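
As an illustration of the binarization idea described above, the following sketch encodes categoric entries as multi-column binary activations with missing data represented as all 0's. It is a simplified stand-in, not the library's _custom_train_1010 code: the function name `binarize_1010`, the alphabetical ordering of categories, and the treatment of unseen entries as missing are all assumptions.

```python
def binarize_1010(train_values):
    """Fit a 1010-style binary encoding on train_values; return (encode, width)."""
    categories = sorted({v for v in train_values if v is not None})
    # Codes start at 1 so that code 0 (all zeros) stays free for missing data.
    codes = {cat: i + 1 for i, cat in enumerate(categories)}
    width = max(1, len(categories).bit_length())

    def encode(value):
        code = codes.get(value, 0)  # missing or unseen entries map to all zeros
        return [(code >> shift) & 1 for shift in reversed(range(width))]

    return encode, width


encode, width = binarize_1010(['a', 'b', None, 'c'])
print(width)         # 2
print(encode('a'))   # [0, 1]
print(encode('c'))   # [1, 1]
print(encode(None))  # [0, 0]
```

Three categories plus the reserved all-zero pattern fit in two columns, where one-hot encoding would need three.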

6.69

- ok just realized committed a non-trivial error in both 6.65 and 6.68
- with respect to the new parameter support in text/onht/ordl/1010
- specifically, the new configuration interfered with backward compatibility
- in scenario where a user populated a postprocess_dict in earlier version
- and tried to use with postmunge in new version
- tbh had just forgotten about this error channel
- going to give it some thought how to formalize this scenario in testing to avoid the risk going forward
- additionally, new policy to help mitigate this risk
- new policy: significant transformation function deviations are now by policy conducted by defining a new transformation function to eliminate this backward compatibility scenario

6.68

- added activation parameter support to ordinal and binarization categoric encodings via ordl and 1010
- specifically referring to some of the activation parameters rolled out for one-hot in 6.65: 'all_activations', 'add_activations', 'less_activations', and 'consolidated_activations'
- all_activations defaults to False, user can pass as a list of all entries that will be targets for activations (which may have fewer or more entries than the set of unique values found in the train set, including entries not found in the train set)
- add_activations defaults to False, user can pass as a list of entries that will be added as targets for activations (resulting in extra returned columns if those entries aren't present in the train set)
- less_activations defaults to False, user can pass as a list of entries that won't be treated as targets for activation (these entries will instead receive no activation)
- consolidated_activations defaults to False, user can pass a list of entries (or a list of lists of entries) that will have their activations consolidated to a single common activation
- note that 'null_activation' is not supported because ordinal and 1010 by definition require a distinct encoding for missing data (in one-hot encoding the missing data has the option of being represented by no activation)
- note that for ordinal encoding activation parameter support limited to ordl transform (as opposed to other ordinal transform ord3)
- reason is that ord3 bases integer sorting on a different method from frequency of entries and would make this a little more complex, saving that for when I have a spare weekend, I don't consider it high priority since we have ordl supported for an ordinal option with activation parameter support
- note that this effort has helped us to identify possibly another opportunity for improved latency associated with these categoric encodings, is a little complex but this is an intended direction to further refine going forward

6.67

- new validation test performed for categoric encodings with reserved string 'zzzinfill'
- where string is reserved with respect to usage among entries of received column for a few categoric transforms
- note that this does not cause error, entries are simply treated consistent with NaN for the transform
- now when string found present a printout is returned and a validation result logged as
  zzzinfill_valresult = {i : {'column' : column,
                              'traintest' : traintest}}
- where i is an integer incremented with each occurrence
- traintest is one of {'train', 'test'} indicating where entry was found i.e. df_train vs df_test (singleprocess functions log both as 'train')
- recorded in postprocess_dict['temp_miscparameters_results']['zzzinfill_valresult']
- or for postmunge recorded in postreports_dict['pm_miscparameters_results']['zzzinfill_valresult']
- this validation test performed in relevant transforms text/onht/smth/strn/ordl/ord3/ucct/1010
- note that to support this, added a temporary entry to postprocess_dict in postmunge to log support function validation results as 'temp_pm_miscparameters_results', which is subsequently consolidated with the returned validation results in postreports_dict, after which the temporary entry is struck
- found and fixed a small defaultinfill process flow snafu in strn
- added inplace support for 1010
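
The reserved string check described above can be sketched as follows. This is only an illustration of the logged structure, not the library's code: the function name `check_reserved_string` and the dict-of-lists stand-in for a dataframe are hypothetical, and it assumes that i starts at 0 and increments once per column in which the string is found.

```python
RESERVED = 'zzzinfill'


def check_reserved_string(columns, traintest):
    """Return a zzzinfill_valresult-style dict for columns containing RESERVED."""
    valresult = {}
    i = 0
    for column, values in columns.items():
        if RESERVED in values:
            # Not an error: the entry is simply treated consistent with NaN.
            print(f"reserved string {RESERVED!r} found in column {column!r} ({traintest})")
            valresult[i] = {'column': column, 'traintest': traintest}
            i += 1
    return valresult


df_train = {'color': ['red', 'zzzinfill', 'blue'], 'size': ['S', 'M', 'L']}
zzzinfill_valresult = check_reserved_string(df_train, 'train')
print(zzzinfill_valresult)  # {0: {'column': 'color', 'traintest': 'train'}}
```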

6.65

- a big cleanup to both text and onht transforms for one-hot encoding
- in some portions may be considered a full rewrite
- note that in addition to the suffix convention, onht has a few subtle distinctions vs. text
- in text numbers are converted to strings prior to encoding, so 2 == '2' for instance (needed for suffix convention)
- whereas in onht numbers and strings receive a distinct activation (unless str_convert parameter activated)
- also in text missing data is represented prior to ML infill as no activation, whereas prior onht missing data was given distinct activation
- new convention for onht, missing data is returned without activation to be consistent with text
- new set of parameters accepted for both text and onht as 'null_activation', 'all_activations', 'add_activations', 'less_activations', and 'consolidated_activations'
- null_activation defaults to False, when True missing data is returned with distinct activation as per prior convention for onht
- all_activations defaults to False, user can pass as a list of all entries that will be targets for activations (which may have fewer or more entries than the set of unique values found in the train set, including entries not found in the train set)
- add_activations defaults to False, user can pass as a list of entries that will be added as targets for activations (resulting in extra returned columns if those entries aren't present in the train set)
- less_activations defaults to False, user can pass as a list of entries that won't be treated as targets for activation (these entries will instead receive no activation)
- consolidated_activations defaults to False, user can pass a list of entries (or a list of lists of entries) that will have their activations consolidated to a single common activation
- the returned activation reported as the first entry in each consolidation list
- also found and fixed edge case with pickle download operation to save a populated postprocess_dict associated with internal processdict manipulations editing exterior object (appeared to manifest when passing the same processdict to multiple automunge(.) calls without reinitializing)
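
The activation parameters listed above can be illustrated with a small one-hot sketch in plain Python. It is not the text/onht implementation: the function name `fit_onehot` is hypothetical, columns are ordered alphabetically as an assumption, and how the library resolves interactions between parameters (for example an entry appearing in both less_activations and a consolidation list) is not covered here.

```python
def fit_onehot(train_values, null_activation=False, all_activations=False,
               add_activations=False, less_activations=False,
               consolidated_activations=False):
    """Return (columns, encode) for a one-hot encoding with activation options."""
    # Map each consolidated entry to the first entry of its list.
    groups = consolidated_activations or []
    if groups and not isinstance(groups[0], list):
        groups = [groups]  # a single flat list is treated as one consolidation
    alias = {entry: group[0] for group in groups for entry in group}

    if all_activations:
        targets = list(all_activations)
    else:
        targets = sorted({v for v in train_values if v is not None})
    targets += [e for e in (add_activations or []) if e not in targets]
    targets = [e for e in targets if e not in (less_activations or [])]

    columns = []
    for entry in targets:
        entry = alias.get(entry, entry)
        if entry not in columns:
            columns.append(entry)
    if null_activation:
        columns.append(None)  # distinct column for missing data

    def encode(value):
        value = alias.get(value, value)
        return [1 if value == col else 0 for col in columns]

    return columns, encode


columns, encode = fit_onehot(['cat', 'dog', None, 'bird'],
                             add_activations=['fish'],
                             less_activations=['bird'],
                             consolidated_activations=['dog', 'cat'])
print(columns)         # ['dog', 'fish']
print(encode('cat'))   # [1, 0]  (consolidated into 'dog')
print(encode('bird'))  # [0, 0]  (excluded by less_activations)
print(encode(None))    # [0, 0]  (missing data, no activation by default)
```

Passing null_activation=True in this sketch appends one extra column that is activated only for missing data, mirroring the prior onht convention described in the notes.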
