- new 'hash' transform intended for high cardinality categoric sets
- applies what is known as "the hashing trick"
- works by segregating entries into a list of words based on space separator
- stripping any special characters
- and hashing each word with hashlib md5 hashing algorithm
- the hash is converted to an integer, which is then reduced to its remainder from division by vocab_size
- where vocab_size is a passed parameter intended to align with vocabulary size
- note that if vocab_size is not large enough, some words may be returned with encoding overlap
- returns set of columns containing integer word representations
- with suffix appenders '_hash_' followed by an integer
- note that entries with fewer words than max word count are padded out with 0
- also accepts parameters excluded_characters and space
- uppercase conversion if desired is performed externally by the UPCS transform
- ('hash' root category doesn't include UPCS, 'Uhsh' root category does)
- hash transform was inspired by some discussions in "Machine Learning Design Patterns" by Valliappa Lakshmanan, Sara Robinson, and Michael Munn
- also added inplace support for UPCS transforms
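
The steps of the 'hash' transform described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's actual implementation: the function names `hash_words` and `hash_column`, the default list of excluded characters, and the dict-of-lists return format are all hypothetical.

```python
import hashlib


def hash_words(entry, vocab_size, excluded_characters=(",", ".", "?", "!", "(", ")"), space=" "):
    """Hypothetical helper: encode one string entry as a list of integers."""
    encoded = []
    # Segregate the entry into words based on the space separator.
    for word in entry.split(space):
        # Strip special characters (this default list is an assumption).
        for character in excluded_characters:
            word = word.replace(character, "")
        if not word:
            continue
        # Hash with md5, convert the hex digest to an integer,
        # and take the remainder from division by vocab_size.
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        encoded.append(int(digest, 16) % vocab_size)
    return encoded


def hash_column(entries, vocab_size, column="text"):
    """Hypothetical helper: return one '<column>_hash_<i>' list per word position."""
    rows = [hash_words(entry, vocab_size) for entry in entries]
    width = max((len(row) for row in rows), default=0)
    # Entries with fewer words than the max word count are padded with 0.
    # In this sketch a word can also hash to 0, so padding and such a word look alike.
    padded = [row + [0] * (width - len(row)) for row in rows]
    return {f"{column}_hash_{i}": [row[i] for row in padded] for i in range(width)}


result = hash_column(["red apple", "green"], vocab_size=100, column="fruit")
print(list(result))  # column names: fruit_hash_0, fruit_hash_1
```

Because the md5 digest is deterministic, the same word always maps to the same integer, and a larger vocab_size lowers the chance that two distinct words share an encoding.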