Automunge

Latest version: v8.33

Page 23 of 99

6.83

- we had noted with 6.76 an intent to audit use of reserved string in string parsing functions
- turns out the bulk of the usage was in binarized string parsing functions sp19 and sbs3
- and then also used in inversion functions
- and then in a few other places as an arbitrary string
- the sp19 and sbs3 usages are now replaced with the np.nan convention to be consistent with rest of library
- the inversion usages are also replaced with np.nan
- noting that we had a mistaken code comment on the matter: we were attributing a data type drift issue to pandas that we now believe originated from numpy
- now resolved with the autowhere function rolled out in 6.74
- otherwise struck the reserved string throughout just out of spite for all of the trauma it put me through
- although left it in one place as a memorial for all of the times that we shared together
- reserved string now fully lifted, automunge accepts all the data, all the time, anywhere, anyhow

6.82

- small rewrite of GPS1 transforms, superseding the version rolled out yesterday, impacting backward compatibility with 6.81
- GPS1 now accepts additional GPS_convention parameter scenario of 'nonunique'
- 'nonunique' encodes comparably to 'default', but instead of parsing each row individually, only parses unique values, as may benefit latency when the data set contains a lot of redundant entries
- note that we expect most GPS applications will have primarily unique measurements, making them suitable for the default applied with GPS1, so nonunique is really just here to support a particular alternate use case
- new root categories GPS3 and GPS4, comparable to GPS1 (i.e. with downstream normalization via mlti), but applying GPS_convention = 'nonunique' as the default
- GPS3 differs from GPS4 in that GPS3 parses unique entries both in the train and test data, while GPS4 only parses entries in the train data, relying on the assumption that the test data entries will be the same or a subset of train data entries (as may benefit latency for this scenario)
- we recommend defaulting to GPS1 unless latency is an important factor, and otherwise experimenting based on the penetration of unique entries in your data to compare between GPS1/3/4
- new ML_cmnd specification now supported as ML_cmnd['full_exclude']
- full_exclude accepts a list of columns in input or returned column header convention, which are to be excluded from all model training, including for ML infill, feature importance, and PCA
- full_exclude may be useful when transforms may return non-numeric data or data without infill and you still want to apply ML infill on the other features
- new postprocess_dict entry 'PCA_transformed_columns' logging columns serving as input to PCA
- a tweak to PCA printouts, now displaying PCA_transformed_columns instead of column exclusions
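The latency benefit of the 'nonunique' convention comes from parsing each unique entry once and broadcasting the result by lookup. A minimal sketch of the principle, not the library's internal implementation; parse_entry is a hypothetical stand-in for the per-row GPS string parsing:

```python
import pandas as pd

def parse_entry(s):
    # stand-in for the per-row GPS string parsing
    return float(s.split(',')[2])

def parse_nonunique(column):
    # parse each unique entry once, then broadcast via dictionary lookup,
    # which is where the latency benefit comes from when entries repeat
    uniques = pd.Series(column).unique()
    lookup = {u: parse_entry(u) for u in uniques}
    return [lookup[x] for x in column]

data = ['a,b,4807.038', 'a,b,4807.038', 'c,d,0131.000']
print(parse_nonunique(data))  # [4807.038, 4807.038, 131.0]
```

With mostly unique entries the lookup adds overhead without saving parses, which is why the default row-wise convention remains the recommendation for typical GPS data.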
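For reference, the full_exclude specification is passed as a list under the ML_cmnd dictionary; the column headers shown are placeholders:

```python
# hypothetical column headers, in either input or returned header convention
ML_cmnd = {'full_exclude': ['column1', 'column2']}
```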

6.81

- new transformation category available as GPS1
- GPS1 is for converting sets of GPS coordinates to normalized latitude and longitude
- accepts parameter GPS_convention, which currently only supports the base configuration of 'default'
- which in future extensions may allow selection between alternate GPS reporting conventions
- 'default' is based on structure of the "$GPGGA message" which was output from an RTK GPS receiver
- which follows NMEA conventions, and has latitude between commas 2-3, and longitude between 4-5
- reference description available at https://www.gpsworld.com/what-exactly-is-gps-nmea-data/
- allows for variations in the precision of reported coordinates (i.e. number of significant figures)
- or variations in degree magnitude, such as between DDMM. or DDDMM.
- relies on comma separated inputs
- accepts parameter comma_addresses as a list of four integers to designate locations for latitude/direction/longitude/direction
- which consistent with the demonstration defaults to [2,3,4,5]
- i.e. latitude is after comma 2, direction after comma 3, longitude after 4, direction after 5
- assumes the latitude will precede the longitude in reporting, which appears to be a general convention
- also accepts parameter comma_count, defaulting to 14, which is used for inversion to pad out to format convention
- returns latitude and longitude coordinates as +/- floats in units of arc minutes
- in the base root category definition GPS1, this transform is followed by a mlti transform for independent normalization of the latitude and longitude sets
- in the alternate root category GPS2, the two columns are returned in units of arc minutes
- GPS1 returns two columns with suffix as column_GPS1_latt_mlti_nmbr and column_GPS1_long_mlti_nmbr
- also, moved naninfill application to following ML infill iterations to avoid interference
- new parameters accepted for power of ten binning transforms such as pwrs/pwr2/pwor/por2: cap and floor
- cap and floor default to False; when passed as an integer or float they cap or set a floor on values in the set
- for example if a feature distribution is mostly in the range 0-100, you may not want a distinct bin encoding for outlier values over 1000
- found a flaw in our backward compatibility validation test, working now as intended
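As a rough illustration of the 'default' GPS_convention described above, the sketch below extracts latitude and longitude in arc minutes from a $GPGGA sentence. The function names are hypothetical (not the library's API), and the sketch omits much of the precision/magnitude flexibility the actual transform handles:

```python
def parse_gpgga(message, comma_addresses=(2, 3, 4, 5)):
    # hypothetical sketch: split a comma separated NMEA sentence and pull
    # latitude/direction/longitude/direction from the designated addresses
    fields = message.split(',')
    lat_str = fields[comma_addresses[0]]   # DDMM.MMMM, e.g. '4807.038'
    lat_dir = fields[comma_addresses[1]]   # 'N' or 'S'
    lon_str = fields[comma_addresses[2]]   # DDDMM.MMMM, e.g. '01131.000'
    lon_dir = fields[comma_addresses[3]]   # 'E' or 'W'

    def to_arcminutes(value, negative):
        # the last two digits before the decimal point are whole minutes;
        # everything preceding them is whole degrees (1 degree = 60 arc minutes)
        whole, _, frac = value.partition('.')
        degrees = int(whole[:-2] or 0)
        minutes = float(whole[-2:] + '.' + (frac or '0'))
        total = degrees * 60 + minutes
        return -total if negative else total

    return (to_arcminutes(lat_str, lat_dir == 'S'),
            to_arcminutes(lon_str, lon_dir == 'W'))

# widely circulated example $GPGGA sentence from NMEA references
msg = '$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47'
print(parse_gpgga(msg))  # approximately (2887.038, 691.0) arc minutes
```

Note how the degrees/minutes split keys off the decimal point rather than a fixed string length, which is what accommodates the DDMM. versus DDDMM. magnitude variations.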
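The cap and floor behavior amounts to clipping values prior to binning, which can be sketched as follows. The function names are illustrative, and the bin rule shown is a simplified order-of-magnitude binning rather than the library's exact pwrs implementation:

```python
import numpy as np

def apply_cap_floor(values, cap=False, floor=False):
    # clip values prior to binning when cap/floor are passed as numbers;
    # False (the default) leaves that side unclipped
    values = np.asarray(values, dtype=float)
    if floor is not False:
        values = np.maximum(values, floor)
    if cap is not False:
        values = np.minimum(values, cap)
    return values

def power_of_ten_bins(values):
    # simplified order-of-magnitude bins: 1-9 -> 0, 10-99 -> 1, 100-999 -> 2, ...
    return np.floor(np.log10(np.maximum(values, 1))).astype(int)

data = [3, 42, 97, 1500]
# without a cap the outlier 1500 would receive its own bin 3;
# with cap=100 it is grouped into the hundreds bin
print(power_of_ten_bins(apply_cap_floor(data, cap=100)))  # [0 1 1 2]
```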

6.80

- fixed a variable assignment in evalcategory that was interfering with the hash scenario for all-unique under automation
- which was the scenario associated with providing multicolumn hashing to unstructured text under automation
- update to the mlti transform rolled out in 6.79
- new parameter supported as 'dtype', accepts one of {'float', 'conditionalinteger'}, defaults to float
- found and fixed an edge case associated with memory sharing in data structure causing an overwrite scenario
- also now any defaultparam entries stored in process_dict entry for norm_category are taken into account
- which is more consistent with intent that mlti can apply any transformation category
- the only caveat is that since mlti is defined as concurrent_nmbr MLinfilltype, norm_category should be based on a process_dict entry with numeric MLinfilltype
- new MLinfilltype now supported as 'concurrent_ordl'
- concurrent_ordl is for transforms that return multiple ordinal encoded columns (nonnegative integer classification)
- note that for ML infill, each ordinal column is predicted separately
- new process_dict entry available as mlto, which is built on top of the mlti transform but has concurrent_ordl MLinfilltype so allows norm_category specification with singlct MLinfilltype categories
- mlto defaults to a norm_category of ord3 and conditionalinteger dtype
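As a rough sketch of what the 'conditionalinteger' dtype option implies, assuming the intent is to return an integer dtype only when all values are whole (an interpretation for illustration, not the library's code):

```python
import pandas as pd

def conditional_integer(series):
    # cast to a nullable integer dtype only when every non-null value is whole,
    # otherwise leave the column as float
    s = pd.Series(series, dtype=float)
    if (s.dropna() % 1 == 0).all():
        return s.astype('Int64')   # pandas nullable integer dtype
    return s

print(conditional_integer([1.0, 2.0, 4.0]).dtype)  # Int64
print(conditional_integer([1.5, 2.0]).dtype)       # float64
```

A nullable integer dtype is the natural target here since imputation scenarios can still carry missing markers that a plain numpy int dtype cannot represent.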

6.79

- found and fixed an implementation snafu with the tlbn transform associated with the top bucket in an edge case where data does not have enough diversity to populate the full range of specified bincount buckets
- updated some data structure maintenance taking place in processparent that was interfering with ML infill in conjunction with transforms performed downstream of a transform returning a multi-column set
- new transformation category available for population in family trees as mlti
- mlti is intended to apply normalizations downstream of a concurrent_nmbr MLinfilltype which may have returned a multi-column set of independent continuous numeric sets
- and thus mlti normalizes each of the received columns on an independent basis
- mlti defaults to applying z-score normalization by the nmbr transform, but alternative normalizations may be specified by passing an alternate transformation category to parameter norm_category, such as e.g. assignparam = {'mlti' : {'(targetcolumn)' : {'norm_category' : 'mnmx'}}}
- where specified transforms are accessed by inspecting that category's process_dict entries
- where targetcolumn needs to be specified as either the input column received in df_train or the first column in the upstream categorylist with suffix appenders
- note that parameters may be passed to the normalization transform by passing to mlti through parameter norm_params, e.g. assignparam = {'mlti' : {'(targetcolumn)' : {'norm_params' : {(parameter) : (value)}}}}
- inplace, inversion, and ML infill are all supported
- note that if an alternate treatment is desired, in which a full family tree of transforms is applied to each column, the user should instead structure the upstream transform as a set of numeric MLinfilltype transforms
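The per-column independent normalization can be sketched as follows; this is a hypothetical illustration of the principle (z-score per received column, the nmbr default), not the mlti implementation:

```python
import pandas as pd

def mlti_sketch(df, columns):
    # normalize each received column independently, z-score style:
    # each column gets its own mean/std rather than shared statistics
    out = df.copy()
    for col in columns:
        mean, std = out[col].mean(), out[col].std()
        out[col] = (out[col] - mean) / (std if std else 1)
    return out

# e.g. a two-column set returned upstream by a concurrent_nmbr transform
df = pd.DataFrame({'a_0': [1.0, 2.0, 3.0], 'a_1': [10.0, 20.0, 30.0]})
normalized = mlti_sketch(df, ['a_0', 'a_1'])
```

The key design point is that the two columns are treated as independent distributions: each is centered and scaled by its own statistics rather than by statistics pooled across the set.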

6.78

- added null_activation parameter support to the 1010 transform
- null_activation accepts a boolean, defaulting to True for a distinct missing data encoding of all zeros
- when passed as False, missing data is grouped with the otherwise all zero encoding, which will be the first unique entry in an alphabetic sorting
- found and fixed a bug originating from 6.76 associated with replacing the support function _postprocess_textsupport
- it turns out the prior convention was not directly equivalent to the new support function, resulting in one-hot aggregations with no activations
- it appears I did not test this aspect sufficiently prior to rollout
- easy fix, added a new scenario to the alternative support function _onehot_support
- this impacted transformation categories bins, bnwd, bnep, tlbn, bkt1, bkt2, and also impacted ML infill to binarized categoric encodings via 1010
- issue resolved
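A rough sketch of the null_activation behavior described above (binarize_1010 is a hypothetical name, not the library's 1010 implementation): with null_activation=True the codes are offset by one so the all-zeros vector is reserved for missing data, while with False the first sorted unique entry shares the all-zeros encoding with missing data:

```python
import math
import pandas as pd

def binarize_1010(column, null_activation=True):
    # binary encode unique entries in alphabetic order
    uniques = sorted(x for x in set(column) if not pd.isna(x))
    offset = 1 if null_activation else 0      # shift codes so all-zeros stays free
    width = max(1, math.ceil(math.log2(len(uniques) + offset)))
    mapping = {u: i + offset for i, u in enumerate(uniques)}
    def encode(x):
        code = 0 if pd.isna(x) else mapping[x]
        return [int(b) for b in format(code, f'0{width}b')]
    return [encode(x) for x in column]

print(binarize_1010(['a', 'b', None, 'c']))
# [[0, 1], [1, 0], [0, 0], [1, 1]] -- missing gets a distinct all-zeros code

print(binarize_1010(['a', 'b', None], null_activation=False))
# [[0], [1], [0]] -- missing shares the all-zeros code with 'a'
```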
