Automunge

Latest version: v8.33

3.31

- added new derived column to the returned ID sets: automatic population with a column titled 'Automunge_index_' which contains integers corresponding to the rows in the original order that they were presented in the corresponding train or test sets
- (the column title is followed by a 12 digit number associated with the specific automunge(.) call, a kind of serial stamp associated with each function call)
- this 'Automunge_index_' column is included in the ID sets returned for train, validation, and test data from automunge(.) as well as the test data returned from postmunge(.), as sketched in the example following this list
- (note that dataframe index columns or any other desired columns may also be designated for segregation and returned in the ID sets, consistently partitioned and shuffled but otherwise unaltered, by passing column header strings to automunge(.) parameters trainID_column and testID_column)
- (note also that any non-range index columns passed in a dataframe are already automatically placed into the ID sets)
- also found and fixed an edge case bug for feature selection methods causing under-represented columns to be dropped (basically for very small datasets the feature importance evaluation may only see a subset of values in a categorical column, causing the analysis to not return results for those category entries not seen)
- changed default feature importance evaluation validation set ratio from 0.33 to 0.2 to lower the risk of this edge case
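
For the new ID column, here's a minimal sketch of restoring the original row order after shuffling, under the assumption that train and train_ID stand for the aligned train set and ID set returned from automunge(.) (since the actual header carries a 12 digit serial suffix, it is located by prefix rather than hard-coded):

```python
import numpy as np

# hedged sketch: 'train' / 'train_ID' stand for the aligned train set and
# ID set returned from automunge(.); the actual column header carries a
# 12 digit serial suffix, so we locate it by prefix
index_col = [c for c in train_ID.columns if c.startswith('Automunge_index_')][0]

# argsort over the retained original-order integers recovers the input order
order = np.argsort(train_ID[index_col].to_numpy())
train_original_order = train.iloc[order]
```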

3.30

- added 'traintest' option for automunge(.) shuffletrain parameter to shuffle both train and test sets
- (previously shuffletrain only shuffled the training data; it occurred to me there is a workflow scenario in which the "test" set is applied for training as well, so now there is an option to shuffle both train and test sets within automunge(.))
- added shuffletrain parameter to postmunge(.) to allow shuffling returned test set
- (for alternate workflow scenario where the postmunge "test" data is used in a training operation)
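
As a usage sketch (hedged in that the import path and the full automunge(.) return signature vary across versions; postprocess_dict is the last object returned from automunge(.)):

```python
import pandas as pd
from Automunge import Automunger  # import path for the 3.x series

am = Automunger.AutoMunge()

df_train = pd.DataFrame({'feature': [1, 2, 3, 4], 'label': [0, 1, 0, 1]})
df_test = pd.DataFrame({'feature': [5, 6]})

# 'traintest' shuffles both the returned train and test sets
*returned_sets, postprocess_dict = am.automunge(
    df_train, df_test=df_test, labels_column='label',
    shuffletrain='traintest')

# postmunge(.) now accepts shuffletrain to shuffle its returned test set
test_results = am.postmunge(postprocess_dict, df_test, shuffletrain=True)
```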

3.29

- a code quality audit revealed that the "process_dict" data structure for "labelctgy" entry was poorly documented and used kind of awkwardly in two ways - one as a category identifier and one as a suffix identifier, which obviously doesn't scale well
- so revised the use of process_dict entry for labelctgy to only use as a category identifier, via updates to feature selection methods and label frequency levelizer for oversampling methods
- updated READ ME for improved documentation of this item
- also decided to update another case where we were relying on column suffix appenders for transformations, which was in the label frequency levelizer; updated methods for better universality, and in the process now have support across all categories offered in the library for numerical set one-hot encoded bins
- in the process realized that many of the transformation categories' processing functions contained an empty relic from earlier methods which inspected column strings; scrubbed these to avoid confusion
- and having gone this far, it occurred to me there's really no need to base any methods at all on the column header strings, as it turned out there were a few more straggler instances from earlier methods. So did a full audit for cases of basing methods on column header string suffix appenders and replaced all instances with methods making use of the postprocess_dict data store.
- collectively these improvements are expected to make the tool much more robust to edge case bug scenarios such as from passed column headers with coincident strings (a toy illustration of that failure mode follows this list)
- found and fixed edge case bug for bxcx transform (design philosophy is that every transform supports any kind of passed data)
- found and fixed two edge case bugs for bkt4 (ordinal encoding of custom buckets)
- found and fixed edge case bug for drift property collection when nunique <3
- updated feature selection methods to default to shuffling training data for the evaluation model training
- a few small cleanups for formatting to printouts
- improved documentation for powertransform options in default transforms section of read me
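
As a toy illustration (not library code) of the coincident string failure mode that motivated moving from suffix parsing to the postprocess_dict data store:

```python
# toy illustration only: routing transforms by suffix parsing misclassifies
# a raw user column whose header happens to coincide with a transform suffix
user_columns = ['sales_bxcx', 'price']   # 'sales_bxcx' is a raw passed column
generated_columns = ['price_bxcx']       # produced by a bxcx transform

def looks_like_bxcx_output(header):
    return header.endswith('_bxcx')      # false positive on 'sales_bxcx'

print([c for c in user_columns + generated_columns if looks_like_bxcx_output(c)])
# ['sales_bxcx', 'price_bxcx']  <- the raw column is swept in by mistake

# recording provenance in a data store (analogous to postprocess_dict, with
# hypothetical keys shown) avoids any reliance on header strings
column_records = {'price_bxcx': {'source': 'price', 'category': 'bxcx'}}
print([c for c in user_columns + generated_columns if c in column_records])
# ['price_bxcx']
```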

3.28

- revised date-time processing functions mdsn and mdcs for tailored monthly periodicity based on the number of days in the given month instead of the average days in a month (including accounting for leap years); a sketch of the idea follows this list
- updated default label column processing category for numeric sets to exc2 (to match documentation)
- updated default of shuffletrain parameter from False to True, meaning automunge(.) returned train sets are now by default returned shuffled, and validation sets are randomly selected (taking this step was inspired by review of a few papers that had noted shuffling tabular training data was generally beneficial to training, including "fastai: A Layered API for Deep Learning" by Howard and Gugger and I believe I also saw this point in "Entity Embeddings of Categorical Variables" by Guo and Berkhahn). Note that returned "test" sets are not shuffled with this parameter.
- Family of sequential data transformations were all extended from three tier options to six tiers (e.g. dxdt/d2dt/d3dt now have comparable extensions to d4dt/d5dt/d6dt, and comparable extensions for series dxd2/nmdx/mmdx/dddt/dedt)
- Corrected auntsuncles primitive entry for mmdx transform_dict from mmmx to nbr2 to be consistent with other downstream normalizations
- added a few clarifications to READ ME documentation
- slight tweak to printouts for processing label columns
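
A rough sketch of the tailored monthly periodicity idea (illustrative of the concept, not the library's exact implementation):

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(pd.Series(['2020-02-15 12:00', '2021-02-15 12:00']))

# days_in_month is month-specific and leap-year aware (29 vs 28 here)
days_in_month = dates.dt.days_in_month

# fraction of the way through the month based on the actual month length,
# rather than dividing by an average month length
fraction = (dates.dt.day - 1 + dates.dt.hour / 24) / days_in_month

mdsn = np.sin(2 * np.pi * fraction)  # sine encoding of monthly periodicity
mdcs = np.cos(2 * np.pi * fraction)  # cosine encoding of monthly periodicity
```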

3.27

- revisited family of transforms for sequential data (eg time series or cumulative data streams)
- created new normalization method "retain" as 'retn', somewhat similar to a min/max scaling but retaining the +/- signs of the original set, such that if max > 0 and min < 0, x = x / (max - min), else it just defaults to traditional min/max scaling x = (x - min) / (max - min) (a sketch follows this list)
- incorporated 'retn' normalization into all outputs of the previously configured sequential transforms 'dxdt', 'd2dt', 'd3dt', 'dxd2', 'd2d2', 'd3d2', 'nmdx', 'nmd2', 'nmd3'
- (as a reminder dxdt/d2dt/d3dt are approximations of velocity/acceleration/jerk based on deltas for a user designated time step, dxd2/d2d2/d3d2 are comparable but for smoothed values based on an averaged range of user designated time step, nmdx/nmd2/nmd3 are comparable to dxdt but performed downstream of a string parsing for numeric entries)
- repurposed the transforms mmdx/mmd2/mmd3 to be comparable to the dxdt set but with use of z-score normalization instead of retn (comparable to the video demonstration)
- also created new variants of the dxdt transforms as dddt/ddd2/ddd3 and dedt/ded2/ded3, where the dddt set is comparable to dxdt but with no normalizations performed, and the dedt set is comparable to the dxd2 set with no normalizations performed
- also new parameters allowed for mnm3 transform (which is a min-max scaling with cap and floor based on quantiles), user can now pass qmin and qmax through assignparam to designate the quantile values for a given column
- also a few new driftreport metrics for the lngt transform rolled out in 3.26
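
A minimal sketch of the 'retn' scaling rule described above (an illustration of the formula, not the library's internal code):

```python
import numpy as np

def retn_scale(x):
    """Sketch of the 'retn' rule: when a set spans zero, divide by the range
    so the +/- signs of original entries are retained; otherwise fall back
    to traditional min/max scaling."""
    x = np.asarray(x, dtype=float)
    xmax, xmin = np.nanmax(x), np.nanmin(x)
    if xmax > 0 and xmin < 0:
        return x / (xmax - xmin)          # retains +/- of original entries
    return (x - xmin) / (xmax - xmin)     # traditional min/max scaling

print(retn_scale([-2.0, 0.0, 2.0]))  # [-0.5  0.   0.5]
print(retn_scale([1.0, 2.0, 3.0]))   # [0.  0.5 1. ]
```

And for the new mnm3 parameters, an assignparam example in the nested {category: {column: {parameter: value}}} form (the column header shown is hypothetical):

```python
# cap and floor the mnm3 min/max scaling at the 1% / 99% quantiles
# for the hypothetical column 'somecolumn'
assignparam = {'mnm3': {'somecolumn': {'qmin': 0.01, 'qmax': 0.99}}}
```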

3.26

- new transform categories: lngt, lnlg
- lngt calculates string length of categorical entries, followed by a min/max scaling
- lnlg calculates string length of categorical entries, followed by a log transform
- think of this as a rough heuristic for information content of a categorical string entry
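
A rough sketch of the heuristic (illustrative only, not the library's implementation):

```python
import numpy as np
import pandas as pd

column = pd.Series(['a', 'bb', 'cccc'])
lengths = column.astype(str).str.len()

# lngt: string length followed by a min/max scaling
lngt = (lengths - lengths.min()) / (lengths.max() - lengths.min())

# lnlg: string length followed by a log transform
lnlg = np.log(lengths)
```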
