Automunge

Latest version: v8.33

5.26

- extension of hash transforms rolled out in 5.24 and 5.25
- new root categories hs10, Uh10
- hs10 differs from hsh2 in that returned encodings are translated from integers to a binary encoding
- with a column count as determined by the vocab_size parameter defaulting to 128 for 7 returned columns
- in other words encodings are returned in a set of boolean integer columns, where each entry may be represented by zero, one, or more simultaneous activations
- so hs10 differs from 1010 in that no conversion dictionary is recorded
- which is a tradeoff in that inversion is not supported for hs10
- also as with hash there is a possibility of redundant activation sets for different entries
- as the range of supported activations is a function of vocab_size parameter
- Uh10 performs an upstream UPCS uppercase conversion for consistent encoding between different case configurations
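
The hs10 notes above can be illustrated with a minimal sketch. This is not Automunge's implementation: the function name `hash_binary` is hypothetical, and the use of md5 is an assumption carried over from the 5.24 notes for the hash transform. It shows how a hashed integer in the range of vocab_size can be spread across a fixed number of 0/1 columns, with the default of 128 giving 7 columns.

```python
import hashlib


def hash_binary(entry, vocab_size=128):
    """Hash an entry and return its code as a list of 0/1 column activations.

    Hypothetical helper for illustration only, not part of Automunge.
    """
    # number of bits needed to represent codes 0 .. vocab_size - 1
    n_cols = max(1, (vocab_size - 1).bit_length())
    # md5 digest -> integer -> remainder from division by vocab_size
    digest = hashlib.md5(str(entry).encode("utf-8")).hexdigest()
    code = int(digest, 16) % vocab_size
    # most significant bit first
    return [(code >> i) & 1 for i in reversed(range(n_cols))]


bits = hash_binary("some category entry")
print(len(bits))  # 7 columns for vocab_size=128
```

Since no conversion dictionary is kept, nothing in this sketch could map a set of activations back to the original entry, and two different entries may land on the same activation set.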

5.25

- extension of hash transform rolled out in 5.24
- new root categories hsh2, Uhs2
- hsh2 differs from hash in that the space separator for separately encoding distinct words found in entries is discarded
- in other words encodings are returned in a single column with a single encoding for each entry
- hsh2 also differs in that special characters are not scrubbed
- so hsh2 closer resembles traditional categoric encodings like text, 1010, etc
- Uhs2 performs an upstream UPCS uppercase conversion for consistent encoding between different case configurations
- also small bug fix for space parameter in hash transform
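
As a rough sketch of the hsh2 and Uhs2 behavior described above, the following hashes each entry as a whole, without splitting on spaces or scrubbing special characters. The function names are hypothetical and the md5 step is an assumption carried over from the 5.24 notes, so treat this as an illustration rather than the library's code.

```python
import hashlib


def hash_entry(entry, vocab_size):
    """Return one integer encoding per entry (hypothetical hsh2-style helper)."""
    digest = hashlib.md5(str(entry).encode("utf-8")).hexdigest()
    return int(digest, 16) % vocab_size


def hash_entry_upper(entry, vocab_size):
    """Uppercase first so case variants share an encoding (Uhs2-style)."""
    return hash_entry(str(entry).upper(), vocab_size)


column = ["New York", "new york", "St. Louis"]
print([hash_entry_upper(e, 1000) for e in column])
```

With the upstream uppercase step, "New York" and "new york" receive the same integer; without it they would generally differ.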

5.24

- new 'hash' transform intended for high cardinality categoric sets
- applies what is known as "the hashing trick"
- works by segregating entries into a list of words based on space separator
- stripping any special characters
- and hashing each word with hashlib md5 hashing algorithm
- which is converted to an integer and reduced to the remainder from a division by vocab_size
- where vocab_size is a passed parameter intended to align with vocabulary size
- note that if vocab_size is not large enough some words may be returned with encoding overlap
- returns set of columns containing integer word representations
- with suffix appenders '_hash_#' where # is an integer
- note that entries with fewer words than max word count are padded out with 0
- also accepts parameter for excluded_characters, space
- uppercase conversion if desired is performed externally by the UPCS transform
- ('hash' root category doesn't include UPCS, 'Uhsh' root category does)
- hash transform was inspired by some discussions in "Machine Learning Design Patterns" by Valliappa Lakshmanan, Sara Robinson, and Michael Munn
- also added inplace support for UPCS transforms
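
The steps listed for the hash transform can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's implementation: the function names and the default set of excluded characters are hypothetical, and vocab_size is passed explicitly rather than assuming a default.

```python
import hashlib


def hash_words(entry, vocab_size, space=" ",
               excluded_characters=(",", ".", "?", "!", "(", ")")):
    """Split an entry into words and hash each to an integer below vocab_size."""
    # strip special characters (this default set is an assumption)
    for char in excluded_characters:
        entry = entry.replace(char, "")
    # segregate into words based on the space separator
    words = [w for w in entry.split(space) if w]
    # md5 digest -> integer -> remainder from division by vocab_size
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % vocab_size
            for w in words]


def hash_column(entries, vocab_size):
    """Encode a list of entries, padding shorter rows with 0."""
    encoded = [hash_words(e, vocab_size) for e in entries]
    max_words = max(len(row) for row in encoded)
    return [row + [0] * (max_words - len(row)) for row in encoded]


rows = hash_column(["red apple pie", "green apple"], vocab_size=500)
print(rows)  # two rows of three integers, the second padded with a trailing 0
```

Each position in a row would correspond to one returned column. In this sketch a word can also hash to 0, which is indistinguishable from padding, and a small vocab_size raises the chance that two different words share an integer.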

5.23

- updated ML infill hyperparameter tuning metric for classification from accuracy to weighted F1 score
- which we understand does a better job of balancing bias variance tradeoff and takes into account class imbalance
- updated the default parameter values of 'flip_prob' ratio for DPnb, DPmm, DPrt from 1.0 to 0.03
- which means for noise injection the noise is injected to only 3% of entries instead of full set
- which based on our experiments we believe makes for a better default for data augmentation
- updated the default numbercategoryheuristic automunge(.) parameter from 63 to 127
- which is the threshold for number of unique values where default categoric encoding under automation changes from '1010' binary to ordinal
- also updated the default ordinal encoding under automation from 'ord3' to 'ord5', which applies the ordl transform and is excluded from ML infill
- this update was based on some experiments with very high cardinality sets and finding that ML infill models were impacted
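
To make the flip_prob change above concrete, here is a minimal sketch of noise injection applied to only a fraction of entries. It is not the DPnb, DPmm, or DPrt implementation: the function name, the Gaussian noise, and the sigma value are assumptions chosen for illustration.

```python
import random


def inject_noise(values, flip_prob=0.03, sigma=0.06, seed=None):
    """Add Gaussian noise to each entry with probability flip_prob.

    Hypothetical helper; sigma and the noise distribution are assumptions.
    """
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) if rng.random() < flip_prob else v
            for v in values]


data = [0.5] * 1000
noised = inject_noise(data, flip_prob=0.03, seed=0)
changed = sum(1 for a, b in zip(data, noised) if a != b)
print(changed)  # roughly 3% of the 1000 entries
```

Setting flip_prob to 1.0 in this sketch perturbs every entry, which corresponds to the earlier default described above.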

5.22

- a small cleanup to some errant printouts associated with code demonstrations for assignparam parameter
- now printouts muted for read me demonstration plug value '(category)'

5.21

- some new options incorporated into assignnan
- which is the parameter to designate received entries that will be targets for infill by nan conversion
- assignnan now supports stochastic and range-based nan injections
- such as to inject infill points into specific segments of a set's distribution
- further documented in read me
- this isn't expected to come up often in mainstream use
- primarily intended to support some experiments on missing data infill
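
The two kinds of injection mentioned above can be pictured with a short sketch. The function names and arguments here are hypothetical and do not reflect the actual assignnan specification, which is documented in the read me; the sketch only shows the idea of converting entries to nan either at random or within a chosen segment of the distribution.

```python
import math
import random


def inject_nan_random(values, ratio, seed=None):
    """Replace each entry with nan with probability ratio (stochastic injection)."""
    rng = random.Random(seed)
    return [float("nan") if rng.random() < ratio else v for v in values]


def inject_nan_range(values, low, high, ratio=1.0, seed=None):
    """Replace entries falling in [low, high] with nan with probability ratio."""
    rng = random.Random(seed)
    return [float("nan") if low <= v <= high and rng.random() < ratio else v
            for v in values]


ages = [1, 5, 10, 15]
print(inject_nan_range(ages, low=4, high=11))  # [1, nan, nan, 15]
```

The resulting nan points would then be targets for whatever infill method is being evaluated.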
