Automunge

Latest version: v8.33


Page 21 of 99

6.95

- Binary now also accepts ordinal encoded categoric columns as targets for consolidation
- new validation test performed for labels_column parameter

6.94

- quick fix
- found a bug introduced in 6.88 with the new 'binary' NArowtype
- I don't know why this didn't show up in testing, will put some thought into that
- bug originated from the new NArw aggregations returning integers instead of boolean
- which was inconsistent with other NArowtypes
- resolved

6.93

- postmunge(.) inversion now records and returns validation results in inversion_info_dict

6.92

- updated Binary onehot scenario so that test activation sets not found in train are returned with all zeros to be consistent with other Binary options by way of tweaks to the null_activation='Binary' scenario
- aligned Binary postmunge(.) printouts with automunge(.)
- added Binary support for specifying multiple consolidation subsets by passing Binary as list of lists
- where first entry in each sublist can optionally serve to pass specification for that sublist by embedding specification in set
- thus Binary can now be applied to separately consolidate multiple non-overlapping categoric sets
- corrected parameter list copying at start of automunge(.) and postmunge(.)
- formalized the convention that we follow version numbers using float equivalent strings
- to support backward compatibility checks
- going forward, when reaching a round integer, the next version will be selected as int + 0.10 instead of 0.01
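The version numbering convention above can be sketched as follows. This is only an illustration, not code from the library: the helper names `next_version` and `is_older` are hypothetical, and the sketch assumes that "reaching a round integer" means the release after e.g. 7.00 is 7.10.

```python
def next_version(current: str) -> str:
    """Return the next version string under the assumed reading of the convention."""
    # Work in integer hundredths to avoid floating point drift (e.g. 6.92 + 0.01).
    hundredths = round(float(current) * 100)
    if hundredths % 100 == 0:
        # Current version is a round integer such as 7.00: step by 0.10.
        hundredths += 10
    else:
        # Otherwise step by 0.01.
        hundredths += 1
    return f"{hundredths // 100}.{hundredths % 100:02d}"


def is_older(version_a: str, version_b: str) -> bool:
    """Backward compatibility style check using the float equivalent of each string."""
    return float(version_a) < float(version_b)
```

For example, `next_version("6.92")` gives `"6.93"`, while `next_version("7.00")` gives `"7.10"`.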

6.91

- with last rollout had put some thought into accommodating all potential edge cases for column header overlaps
- including a clarification added to the documentation in section for other reserved strings
- realized the Binary and PCA dimensionality reductions deserve some additional treatment due to their unique convention
- referring to how they apply new column headers by means other than suffix appending, which is the case for the rest of the library
- so have added convention that in cases where header overlaps are identified for PCA or Binary, the new column header is adjusted by addition of a string of the application number, which is a 12 digit random integer sampled for each automunge(.) call
- e.g. for PCA, 'PCA__' would be replaced by 'PCA__'
- this is similar to what has previously been done for the Automunge_index column returned in the ID sets
- additionally, to ensure we are being comprehensive, in very remote edge case where even the adjusted header has overlap, we add comma characters until resolved
- a validation result is logged for both Binary and PCA for cases where header is adjusted as set_Binary_column_valresult and PCA_columns_valresult
- added a note to the Binary writeup associated with how stochastic_impute_categoric may reduce the extent of contraction
- found an oversight in the Binary dimensionality reduction
- most of our rollout validations are conducted to ensure prepared train data matches prepared consistent test data
- in Binary dimensionality reduction, there is also a complication where test data activation sets may not match activation sets found in train data
- which, since we have to accommodate inversion, needs to be handled a little differently than the 1010 transform it is built on top of
- so accommodated by new null_activation scenario to 1010 transform as null_activation = 'Binary'
- which results in test data activation sets not matching train data activation sets being both returned from Binary as all zero activations as well as being returned from inversion as all zero activations
- returning from inversion as all zeros ensures upstream transforms will be able to handle inversion since 1010 transform encodings start from the all zero case, and in multirt the all zeros are treated as missing data
- similarly, new null_activation scenario to ordl and onht transform as null_activation = 'Binary' to support Binary options
- new convention for specifying list of Binary target by passing column headers as list
- previously a few entries such as boolean False/True, None, etc. were reserved for this use in these lists; realized it is much cleaner to use the same specification entries as the master Binary parameter. Now when specifying a subset of columns for Binary, the first entry in the list can be passed as a Binary parameter value embedded in set brackets. For example, to consolidate any boolean integer sets derived from columns 'c1', 'c2', 'c3' into an ordinal Binary encoding, pass the automunge(.) Binary parameter as Binary = [{'ordinal'}, 'c1', 'c2', 'c3']
- new options for Binary dimensionality reduction as Binary = 'onehot' or 'onehotretain'.
- these are comparable to 'ordinal' and 'ordinalretain' except the returned consolidation is onehot encoded instead of ordinal encoded as you would expect
- the reason for this extension was to align with general convention in the library that all of the fundamental categoric transforms are available in corresponding forms differing as ordinal, one hot, or binarized representations.
- and even when a transform is only available in one of these forms can apply a downstream transform to translate
- such as can make use of an intermediate bnst to translate multicolumn forms or can otherwise apply directly
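The header overlap handling described in this release for PCA and Binary can be sketched roughly as below. This is a minimal illustration and not the library's implementation: the function names and the example header `'PCA__0'` are hypothetical, and since the exact placement of the application number within the header is not shown in these notes, the sketch simply appends it.

```python
import random


def sample_application_number() -> int:
    """Sample a 12 digit random integer, as is done once per automunge(.) call."""
    return random.randint(10**11, 10**12 - 1)


def adjust_header(header: str, existing_headers: set, application_number: int) -> str:
    """Return a header that does not overlap with existing_headers."""
    if header not in existing_headers:
        # No overlap identified, so the header is used as is.
        return header
    # Assumption: the application number is appended to the header.
    adjusted = f"{header}{application_number}"
    # Remote edge case: even the adjusted header overlaps, so add commas until resolved.
    while adjusted in existing_headers:
        adjusted += ","
    return adjusted
```

A caller could log a validation result whenever the returned header differs from the one passed in, which is comparable in spirit to the set_Binary_column_valresult and PCA_columns_valresult entries noted above.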

6.90

- lngt family tree revised to omit downstream mnmx scaling
- new root category lngm, comparable to prior configuration of lngt
- lnlg now has downstream logn instead of log0
- (lngt returns string length of categoric entries)
- new root category GPS5, comparable to GPS3 (with GPS_convention of nonunique and assumption of test entries same or subset of train entries), but with downstream ordinal encoding instead of numeric scaling, with latitude and longitude separately encoded
- GPS5 may be appropriate when there are a fixed range of GPS coordinates and they are wished to be treated as categoric
- note that alternate categoric encodings may be applied by passing norm_category parameter to the downstream mlto
- note that if a single categoric encoding is desired representing the combined latitude and longitude, the string representation can be passed directly to a categoric transform without a GPS1 parsing
- new root category GPS6, comparable to GPS5 but performs both a downstream normalization and a downstream ordinal encoding, allowing latitude and longitude to be evaluated both as categoric and continuous numeric features. This is probably a better default than GPS3 or GPS5 for sets with a fixed range of entries.
- updated validation tests so that there is a category assignment representing each of the MLinfilltype options and a corresponding inversion excluding PCA
- fixed the feature selection carveouts associated with totalexclude MLinfilltype that had been incorporated in 6.84 (we had missed a few edge cases resulting from the exclusions)
- new transform bnst, intended for use downstream of multicolumn categoric encodings, such as with 1010 or multirt MLinfilltype
- bnst aggregates multicolumn representations into a single column categoric string representation
- accepts parameter upstreaminteger defaulting to True to indicate that the upstream encodings are integers
- new root categories bnst and bnso, where bnst returns the string representation, bnso performs a downstream ordinal encoding
- inversion supported
- bnst or bnso may be useful in scenario where a multicolumn categoric transform is desired for label encoding targeting a downstream library that doesn't accept multicolumn representations for labels
- update to mlti transform to take into account dtype_convert processdict entry associated with the normcategory
- (so that if normcategory is in custom_train convention and dtype is conditionalinteger dtype conversion only applied when dtype_convert is not False, consistent with basis for custom_train otherwise)
- audited defaultinfill and dtype_convert, identified missing dtype_convert accommodation associated with floatprecision application to label sets, resolved
- updated tutorials to include link to data sets
- found and fixed bug in Binary inversion
- slight tweak to the returned column header conventions associated with Binary and PCA dimensionality reduction. Added an extra underscore to align with convention that received column headers that omit the underscore character are ensured of no suffix overlap edge cases
- now Binary returned with form 'Binary__' and PCA returned with form 'PCA__'
- conducted a walkthrough of the opening automunge code surrounding initial dataframe preparations, repositioned a few snippets for clarity
- moved a few spots up for automunge inplace parameter inspection so that it is performed prior to column header conversions
- moved assign_param variable initialization to a more reasonable spot
- moved list copying to internal state next to where we do same for dictionaries
- moved application of _check_assignnan
- this walkthrough was partly motivated by ensuring inplace parameter inspection performed prior to any header conversions, and turned out to result in a much cleaner layout
- a few similar repositionings at opening of postmunge
- new validation results reported as check_df_train_type_result, check_df_test_type_result
- these results are from a validation that df_train is received as one of np.array, pd.Series, or pd.DataFrame and df_test is received as one of same or False
- similar validation results reported in postmunge for check_df_test_type_result
- the omission of this validation prior was an oversight
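The bnst aggregation introduced in this release can be pictured with the sketch below. It is a plain Python illustration rather than the library's code: the function names are hypothetical, rows are represented as lists instead of dataframe columns, and the handling of upstreaminteger (casting entries to int before forming the string) is an assumption about what that parameter implies.

```python
def bnst_aggregate(rows, upstreaminteger=True):
    """Aggregate each multicolumn row, e.g. [1, 0, 1], into a single string, e.g. '101'."""
    aggregated = []
    for row in rows:
        if upstreaminteger:
            # Assumption: integer upstream encodings are cast to int so 1.0 becomes '1'.
            row = [int(value) for value in row]
        aggregated.append("".join(str(value) for value in row))
    return aggregated


def bnst_invert(strings):
    """Recover the multicolumn rows; valid here only for single digit integer encodings."""
    return [[int(character) for character in string] for string in strings]
```

Under these assumptions, `bnst_aggregate([[1, 0, 1], [0, 0, 0]])` returns `['101', '000']`, and `bnst_invert` maps those strings back to the original rows. A bnso style result would then apply an ordinal encoding to the returned strings.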
