Automunge

Latest version: v8.33

5.26

- extension of hash transforms rolled out in 5.24 and 5.25
- new root categories hs10, Uh10
- hs10 differs from hsh2 in that returned encodings are translated from integers to a binary encoding
- with a column count as determined by the vocab_size parameter defaulting to 128 for 7 returned columns
- in other words encodings are returned in a set of boolean integer columns, with each entry represented by zero, one, or more simultaneous activations
- so hs10 differs from 1010 in that no conversion dictionary is recorded
- which is a tradeoff in that inversion is not supported for hs10
- also as with hash there is a possibility of redundant activation sets for different entries
- as the range of supported activations is a function of vocab_size parameter
- Uh10 performs an upstream UPCS uppercase conversion for consistent encoding between different case configurations
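
The hs10 behavior described above can be sketched roughly as follows. This is a minimal illustration rather than Automunge's actual implementation: the function name hs10_sketch and the uppercase flag are hypothetical, the md5 hashing step is assumed to carry over from the 5.24 hash transform, and the most-significant-bit-first column order is an assumption.

```python
import hashlib

def hs10_sketch(entry, vocab_size=128, uppercase=False):
    """Hash an entry to an integer, then return it as a list of 0/1 columns."""
    text = str(entry)
    if uppercase:
        # stands in for the upstream UPCS conversion of Uh10
        text = text.upper()
    # md5 digest -> integer -> remainder from division by vocab_size
    code = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % vocab_size
    # number of binary columns needed to represent codes 0..vocab_size-1
    n_cols = max(1, (vocab_size - 1).bit_length())
    # most significant bit first (ordering is an assumption of this sketch)
    return [(code >> i) & 1 for i in reversed(range(n_cols))]

bits = hs10_sketch("some category")
print(len(bits))  # 7 columns for vocab_size=128
```

Since no conversion dictionary is kept, two different entries that land on the same remainder produce identical activation sets, and the original entry cannot be recovered from the columns.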

5.25

- extension of hash transform rolled out in 5.24
- new root categories hsh2, Uhs2
- hsh2 differs from hash in that the space separator for separately encoding distinct words found in entries is discarded
- in other words encodings are returned in a single column with a single encoding for each entry
- hsh2 also differs in that special characters are not scrubbed
- so hsh2 closer resembles traditional categoric encodings like text, 1010, etc
- Uhs2 performs an upstream UPCS uppercase conversion for consistent encoding between different case configurations
- also small bug fix for space parameter in hash transform
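
As a rough sketch of the hsh2 behavior described above (not the library's own code): the whole entry is hashed as one string, with no word splitting and no scrubbing of special characters. The name hsh2_sketch and the uppercase flag are hypothetical, and md5 is assumed from the 5.24 notes.

```python
import hashlib

def hsh2_sketch(entry, vocab_size, uppercase=False):
    """Return a single integer encoding for the full entry."""
    text = str(entry)
    if uppercase:
        # stands in for the upstream UPCS conversion of Uhs2
        text = text.upper()
    # the entry is hashed whole: spaces and special characters are kept
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % vocab_size

code = hsh2_sketch("New York, NY", vocab_size=1000)
same = hsh2_sketch("new york", 1000, uppercase=True) == hsh2_sketch("NEW YORK", 1000, uppercase=True)
print(code, same)
```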

5.24

- new 'hash' transform intended for high cardinality categoric sets
- applies what is known as "the hashing trick"
- works by segregating entries into a list of words based on space separator
- stripping any special characters
- and hashing each word with hashlib md5 hashing algorithm
- which is converted to an integer, taking the remainder from division by vocab_size
- where vocab_size is a passed parameter intended to align with vocabulary size
- note that if vocab_size is not large enough some words may be returned with encoding overlap
- returns set of columns containing integer word representations
- with suffix appenders '_hash_#' where # is integer
- note that entries with fewer words than max word count are padded out with 0
- also accepts parameter for excluded_characters, space
- uppercase conversion if desired is performed externally by the UPCS transform
- ('hash' root category doesn't include UPCS, 'Uhsh' root category does)
- hash transform was inspired by some discussions in "Machine Learning Design Patterns" by Valliappa Lakshmanan, Sara Robinson, and Michael Munn
- also added inplace support for UPCS transforms
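
The steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Automunge's code: the helper names hash_entry and hash_column are hypothetical, as is the default list of excluded characters. Only the md5 hashing, the remainder by vocab_size, and the 0 padding come from the notes.

```python
import hashlib

def hash_entry(entry, vocab_size, space=" ",
               excluded_characters=(",", ".", "?", "!", "(", ")")):
    """Return a list of integer word codes for one entry."""
    text = str(entry)
    # strip special characters (this default list is an assumption)
    for ch in excluded_characters:
        text = text.replace(ch, "")
    # split into words on the space separator, dropping empty strings
    words = [w for w in text.split(space) if w]
    # md5 digest -> integer -> remainder from division by vocab_size
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % vocab_size
            for w in words]

def hash_column(entries, vocab_size):
    """Encode every entry and pad shorter rows with 0 to the max word count."""
    rows = [hash_entry(e, vocab_size) for e in entries]
    width = max((len(r) for r in rows), default=0)
    return [r + [0] * (width - len(r)) for r in rows]

rows = hash_column(["red car", "blue", "big red truck!"], vocab_size=1000)
for row in rows:
    print(row)  # three integers per row, one per word position
```

In this sketch a word can also hash to 0, so a padded position is not distinguishable from such a word. A larger vocab_size lowers the chance of two words sharing an encoding at the cost of a wider integer range.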

5.23

- updated ML infill hyperparameter tuning metric for classification from accuracy to weighted F1 score
- which we understand does a better job of balancing bias variance tradeoff and takes into account class imbalance
- updated the default parameter values of 'flip_prob' ratio for DPnb, DPmm, DPrt from 1.0 to 0.03
- which means for noise injection the noise is injected to only 3% of entries instead of full set
- which based on our experiments we believe makes for a better default for data augmentation
- updated the default numbercategoryheuristic automunge(.) parameter from 63 to 127
- which is the threshold for number of unique values where default categoric encoding under automation changes from '1010' binary to ordinal
- also updated the default ordinal encoding under automation from 'ord3' to 'ord5', which applies the ordl transform and is excluded from ML infill
- this update was based on some experiments with very high cardinality sets and finding that ML infill models were impacted
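
To illustrate why weighted F1 can be a more informative tuning metric than accuracy under class imbalance, here is a small standalone sketch. The helper names are hypothetical and this is not how ML infill computes the metric internally; it only applies the standard definition (per-class F1 averaged with weights proportional to class support).

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (count / total) * f1
    return score

# imbalanced labels: nine of class 0, one of class 1
y_true = [0] * 9 + [1]
# a degenerate model that always predicts the majority class
y_pred = [0] * 10

print(accuracy(y_true, y_pred))     # 0.9
print(weighted_f1(y_true, y_pred))  # about 0.853, penalized for missing class 1
```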

5.22

- a small cleanup to some errant printouts associated with code demonstrations for assignparam parameter
- now printouts muted for read me demonstration plug value '(category)'

5.21

- some new options incorporated into assignnan
- which is parameter to designate received entries that will be targets for infill by nan conversion
- assignnan now supports stochastic and range-based nan injections
- such as to inject infill points into specific segments of a set's distribution
- further documented in read me
- this isn't expected to come up often in mainstream use
- primarily intended to support some experiments on missing data infill
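
The two kinds of nan injection mentioned above can be pictured with a short standalone sketch. The helper names and arguments here are hypothetical and do not reflect the assignnan parameter's actual format, which is documented in the read me.

```python
import math
import random

def inject_nan_range(values, low, high):
    """Range-based: convert entries with low <= value <= high to nan."""
    return [math.nan if low <= v <= high else v for v in values]

def inject_nan_stochastic(values, ratio, seed=None):
    """Stochastic: convert each entry to nan with probability `ratio`."""
    rng = random.Random(seed)
    return [math.nan if rng.random() < ratio else v for v in values]

data = [1.0, 5.0, 10.0, 15.0]

# target a specific segment of the distribution for infill
ranged = inject_nan_range(data, low=4, high=11)
print(ranged)  # [1.0, nan, nan, 15.0]

# target a random subset of entries for infill
sampled = inject_nan_stochastic(data, ratio=0.5, seed=0)
print(sampled)
```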
