- corrected the np.inf to np.nan conversion from version 3.80
- to be clear, by default Automunge does not recognize np.inf values; they are treated as np.nan for purposes of infill
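
The effect of treating infinite values as missing can be illustrated with a minimal pandas sketch. This is not Automunge's internal code: the helper name `inf_to_nan` is hypothetical, and handling negative infinity the same way as positive infinity is an assumption.

```python
import numpy as np
import pandas as pd


def inf_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with +/- infinity replaced by NaN."""
    # Assumption: -np.inf is converted as well as np.inf.
    return df.replace([np.inf, -np.inf], np.nan)


df = pd.DataFrame({"x": [1.0, np.inf, -np.inf, 4.0]})
cleaned = inf_to_nan(df)

# After conversion the former infinities are ordinary missing values,
# so any downstream infill logic sees them as NaN.
n_missing = int(cleaned["x"].isna().sum())
```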
- added new 'retain' option for Binary parameter
- which can now be passed as one of {True, False, 'retain'}
- as prior, False does no conversion, True collectively applies a Binary transform to all boolean encoded sets as a replacement (such as for improved memory bandwidth)
- in the new 'retain' option, the returned collective Binary encoding acts as a supplement instead of a replacement to the columns serving as basis (such as a means of presenting boolean sets collectively in alternate configuration)
- I suspect this may prove a very useful option
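
The difference between the three Binary settings can be sketched roughly as follows. This is only an illustration of the idea of a collective binary encoding as a replacement versus a supplement; the function `collective_binary`, the `Binary_<n>` column names, and the way combinations are numbered are all hypothetical and are not taken from Automunge.

```python
import pandas as pd


def collective_binary(df: pd.DataFrame, bool_cols: list, mode=True) -> pd.DataFrame:
    """Hypothetical sketch of a collective binary encoding.

    mode=False    -> no conversion
    mode=True     -> encoded columns replace bool_cols
    mode='retain' -> encoded columns are appended alongside bool_cols
    """
    if mode is False:
        return df.copy()

    # Represent each row's combination of boolean values as a string key.
    combos = df[bool_cols].astype(int).astype(str).apply("".join, axis=1)

    # Give every distinct combination an integer id (assumed ordering: sorted).
    ids = {combo: i for i, combo in enumerate(sorted(combos.unique()))}
    width = max(1, (len(ids) - 1).bit_length())
    codes = combos.map(ids)

    # Spread the integer id across `width` binary columns, most significant first.
    encoded = pd.DataFrame(
        {f"Binary_{b}": (codes // 2 ** (width - 1 - b)) % 2 for b in range(width)},
        index=df.index,
    )

    base = df if mode == "retain" else df.drop(columns=bool_cols)
    return pd.concat([base, encoded], axis=1)


df = pd.DataFrame({"a": [1, 0, 1, 0], "b": [1, 1, 1, 0], "c": [0, 1, 0, 0]})
replaced = collective_binary(df, ["a", "b", "c"], mode=True)
retained = collective_binary(df, ["a", "b", "c"], mode="retain")
```

In this toy case three boolean columns with three distinct combinations collapse to two columns under replacement, while the 'retain' setting returns the original three plus the two encoded columns.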
- found and fixed an edge case for the spl9 and sp10 transforms previously missed in testing
- associated with string conversion of numerical entries to test data
- performed a walkthrough of postmunge(.) labelscolumn parameter
- found a code snippet that had been carried over from automunge(.) that was inconsistent with documentation, now struck
- moved the postmunge(.) initialization of empty label sets a little earlier for clarity
- added a marker to returned dictionary noting cases when df_train is a single column designated as labels, just in case that might come in handy
- new transformation category ucct
- in same neighborhood as ord3 which is an ordinal integer categorical encoding sorted by frequency
- ucct counts, in the train set, the number of occurrences of each category class and returns that count divided by the total row count in place of the category
- e.g. for a train set with 10 rows, if we have two cases of category "circle", those instances would be returned with the value 0.2
- and then the test set conversion would return the same value, independent of test set row count
- ucct inspired by review of the ICLR paper "Weakly Supervised Clustering by Exploiting Unique Class Count" by Mustafa Umit Oner, Hwee Kuan Lee, Wing-Kin Sung
- additional new transform category Ucct, which performs an uppercase character conversion prior to encoding (e.g. the strings usa, Usa, and USA are treated consistently)
- followed by a downstream pair of offspring ucct and ord3
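
The ucct arithmetic described above, and the uppercase step that Ucct applies first, can be sketched in pandas as below. This is a minimal illustration, not Automunge's implementation: the function names are hypothetical, the ord3 offspring is omitted, and returning NaN for categories unseen in the train set is an assumption.

```python
import pandas as pd


def fit_ucct(train: pd.Series) -> dict:
    """Hypothetical: map each category to its train-set count / train row count."""
    return (train.value_counts() / len(train)).to_dict()


def apply_ucct(col: pd.Series, mapping: dict) -> pd.Series:
    """Apply the train-derived mapping; unseen categories become NaN (assumption)."""
    return col.map(mapping)


def to_upper(col: pd.Series) -> pd.Series:
    """Uppercase conversion applied before encoding, as in the Ucct description."""
    return col.astype(str).str.upper()


# Train set with 10 rows: 2 x circle, 5 x square, 3 x triangle.
train = pd.Series(["circle"] * 2 + ["square"] * 5 + ["triangle"] * 3)
mapping = fit_ucct(train)

# The test set reuses the train-derived values regardless of its own row count.
test = pd.Series(["circle", "square", "circle"])
test_encoded = apply_ucct(test, mapping)

# Ucct-style usage: normalize case first so usa, Usa and USA share one class.
train_upper = to_upper(pd.Series(["usa", "Usa", "USA", "uk"]))
upper_mapping = fit_ucct(train_upper)
```

Here "circle" maps to 0.2 in both train and test data, matching the example in the notes, and the three spellings of USA are counted as a single class.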