Klib

Latest version: v1.3.2

Page 4 of 5

0.0.91

Changelog:

Additions
- **clean_column_names():**
Cleans the column names of the provided pandas DataFrame and optionally provides hints on duplicate and long column names. This functionality is also applied by data_cleaning() by default.
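
As a rough illustration of what column-name cleaning with hints can involve, here is a minimal standard-library sketch. It is not klib's implementation: the function `clean_names`, its `max_len` parameter, and the snake_case rules are assumptions made for this example.

```python
import re
from collections import Counter


def clean_names(names, max_len=25):
    """Return snake_case names plus hints about duplicate and long names.

    Hypothetical helper for illustration; klib's own rules may differ.
    """
    cleaned = []
    for name in names:
        # Insert "_" at lower-to-upper boundaries, e.g. firstName -> first_Name
        s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", str(name))
        # Collapse every run of non-alphanumeric characters into one "_"
        s = re.sub(r"[^0-9a-zA-Z]+", "_", s)
        cleaned.append(s.strip("_").lower())

    counts = Counter(cleaned)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    too_long = [n for n in cleaned if len(n) > max_len]
    return cleaned, duplicates, too_long


cleaned, duplicates, too_long = clean_names(
    ["First Name", "firstName", "Total $ Amount", "ID"]
)
print(cleaned)
print("duplicates:", duplicates)
print("too long:", too_long)
```

Note that cleaning can itself create duplicates ("First Name" and "firstName" collapse to the same name here), which is one reason a hint about duplicates is useful.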


Changes
- **small fixes and refinements**
Revert from split = {None, 'pos', 'neg', 'above', 'below'} to split = {None, 'pos', 'neg', 'high', 'low'} for all correlation functions.

- **increase test coverage**

- **update docstrings:**
Several updates to docstrings to improve clarity and conform with numpy style.

- **black formatting:**
Format the entire codebase with black.
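
To make the restored `split` options more concrete, the following sketch masks a pandas correlation matrix by mode. It is an illustration only: the helper `filter_corr`, the `threshold` default of 0.3, and the reading of 'high' and 'low' as comparisons on the absolute correlation are assumptions, not klib's documented behavior.

```python
import numpy as np
import pandas as pd


def filter_corr(corr, split=None, threshold=0.3):
    """Mask a correlation matrix according to a split mode (hypothetical helper)."""
    if split is None:
        return corr
    if split == "pos":
        return corr.where(corr > 0)                 # keep positive correlations
    if split == "neg":
        return corr.where(corr < 0)                 # keep negative correlations
    if split == "high":
        return corr.where(corr.abs() > threshold)   # keep strong correlations
    if split == "low":
        return corr.where(corr.abs() < threshold)   # keep weak correlations
    raise ValueError(f"unknown split: {split!r}")


df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 3, 2, 1]})
corr = df.corr()
neg = filter_corr(corr, split="neg")
print(neg)
```

Entries that do not match the selected mode become NaN, so a heatmap drawn from the result would leave those cells blank.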

0.0.86

Changelog:

Changes
- **data_cleaning():**
- Changed the default setting to do a shallow instead of a deep analysis of memory_usage.
- **Lowers function runtime compared to the previous version by about 70% - 80%**!

- **missingval_plot():**
- Minor changes to font size and spacing to accommodate very large datasets (40+ cols)

- **update docstrings:**
- Several updates to the readme, to the examples as well as to docstrings to improve clarity and formatting.
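
The difference between a shallow and a deep memory analysis, as mentioned for data_cleaning() above, can be seen directly with pandas' `DataFrame.memory_usage`. This sketch shows only the pandas behavior; how klib calls it internally is not shown here, and the example data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1000),
    # Explicit object dtype so the column stores references to Python strings
    "city": pd.Series(["Springfield"] * 1000, dtype=object),
})

# Shallow: counts only the references held in object columns (fast)
shallow = df.memory_usage(deep=False).sum()
# Deep: additionally inspects every referenced object (slower, more accurate)
deep = df.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")
```

The deep figure is larger for object columns because it includes the size of each string, and computing it requires visiting every element, which is where the extra runtime comes from.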

0.0.85

Changelog:

Additions
- **PipeInfo():**
Prints intermediary information about the dataset from within a pipeline. Can be included at any point in a Pipeline to print out the shape of the dataset at the specified point and to receive an indication of the progress.

- **optimize_ints():**
Adds an additional function call to convert_datatypes() to improve datatype conversion. This reduces int64, if possible, to a more parsimonious integer dtype such as int8 or int32.

- **optimize_floats():**
Adds an additional function call to convert_datatypes() to improve datatype conversion. This reduces float64, if possible, to a more parsimonious float dtype such as float32.
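
The kind of downcasting described for optimize_ints() and optimize_floats() can be sketched with pandas' `to_numeric` and its `downcast` argument. Whether klib uses this exact call is an assumption; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.array([1, 50, 120], dtype="int64"),
    "price": np.array([1.5, 2.25, 1000.0], dtype="float64"),
})

small = df.copy()
# Pick the smallest signed integer dtype that can hold all values
small["count"] = pd.to_numeric(small["count"], downcast="integer")
# Pick the smallest float dtype (float32 at minimum) that can hold all values
small["price"] = pd.to_numeric(small["price"], downcast="float")

print(df.dtypes)
print(small.dtypes)
```

Downcasting floats reduces precision, so it is worth checking that float32 is adequate for the values in question before applying it broadly.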


Changes
- **pool_duplicate_subsets():**
- Adds the possibility to explicitly specify columns, e.g. the label/target column, to be excluded from the subset pooling, which preserves the specified columns.
- Increases computational efficiency, speeding up calculation compared to the previous version by about 20% - 25%.

- **update docstrings:**
Several updates to docstrings to improve clarity and formatting.

0.0.84

Changelog:

Changes
- **mv_col_handling():**
Update to the default settings and default output and add flexibility to show affected columns.

- **pool_duplicate_subsets():**
Update to the default settings and default output and add flexibility to show affected columns.


Bug Fixes
- **dist_plot():**
Fix label identification when target is given as a list or np.array instead of a pd.Series.

- **mv_col_handling():**
Fix correlation computation when target is given as a list, np.array or pd.Series instead of a string.
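
Both fixes concern a target that may arrive in different forms. One common way to handle this is to normalize the target to a pd.Series up front. The helper below is a hypothetical sketch of that idea, not the code used in klib.

```python
import numpy as np
import pandas as pd


def resolve_target(data, target):
    """Return the target as a pd.Series, whatever form it was passed in.

    Hypothetical helper for illustration only.
    """
    if isinstance(target, str):
        return data[target]          # a column name in the DataFrame
    if isinstance(target, pd.Series):
        return target                # already in the desired form
    # Lists and arrays: align with the DataFrame's index
    return pd.Series(np.asarray(target), index=data.index, name="target")


data = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": [0, 1, 0]})

from_name = resolve_target(data, "label")
from_list = resolve_target(data, [0, 1, 0])
from_array = resolve_target(data, np.array([0, 1, 0]))
print(data["x"].corr(from_list))
```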

0.0.83

Changelog:

Additions
- **pool_duplicate_subsets():**
Checks for duplicates in subsets of columns and pools them. This reduces the number of columns in the data without losing any information. Suitable columns are combined into subsets and tested for duplicates. If sufficient duplicates are found, the respective columns are aggregated into a 'pooled_var' column. Identical numbers in the 'pooled_var' column indicate identical information in the respective rows.
- **cat_pipe():**
Includes MaxAbsScaler() in the pipeline for categorical data. This scaler does not center the data, thus preserving its sparsity. It is recommended to check for and deal with outliers before applying MaxAbsScaler().

- **Add further examples:**
- cat_plot()
- corr_plot(), with a target variable
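
The pooling idea behind pool_duplicate_subsets() can be sketched with plain pandas. This is a simplified illustration under stated assumptions: the subset of columns is chosen by hand here, whereas the function described above searches for suitable subsets itself, and its selection criteria are not reproduced.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2, 2, 1],
    "b": ["x", "x", "y", "y", "x"],
    "c": [0.1, 0.2, 0.3, 0.4, 0.5],
})

subset = ["a", "b"]  # hypothetical, hand-picked columns to pool

# Share of rows whose values in `subset` repeat an earlier row
dup_ratio = df.duplicated(subset=subset).mean()

# Replace the subset with one integer code per distinct value combination
pooled = df.drop(columns=subset)
pooled["pooled_var"] = df.groupby(subset, sort=False).ngroup()

print(dup_ratio)
print(pooled)
```

Rows that share the same number in 'pooled_var' carried identical values in the pooled columns, which matches the description of the column above.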

Changes
- **cat_plot():**
Speeds up calculation compared to the previous version by about 15%.

Bug Fixes
- **cat_plot():**
Removes an erroneous line of code which could cause the plot to fail in certain situations.
- **dist_plot():**
Fixes an erroneous input validation which limited the maximum number of bins in the histogram to the number of features.

0.0.80

**klib.describe** - functions for visualizing datasets
- klib.cat_plot() - returns a visualization of the number and frequency of categorical features.
- klib.corr_mat() - returns a color-encoded correlation matrix
- klib.corr_plot() - returns a color-encoded heatmap, ideal for correlations
- klib.dist_plot() - returns a distribution plot for every numeric feature
- klib.missingval_plot() - returns a figure containing information about missing values

**klib.clean** - functions for cleaning datasets
- klib.data_cleaning() - performs data cleaning (drop duplicates & empty rows/columns, adjust dtypes,...) on a dataset
- klib.convert_datatypes() - converts existing dtypes to more efficient ones, also called inside ".data_cleaning()"
- klib.drop_missing() - drops missing values, also called in ".data_cleaning()"
- klib.mv_col_handling() - drops features with a high ratio of missing values based on their informational content
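
The cleaning steps listed above (dropping empty rows and columns and duplicate rows) correspond to standard pandas operations. The sketch below shows those operations directly; it does not call klib, and the order and defaults klib uses are not implied.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 1.0, 3.0],
    "b": ["x", None, "x", "z"],
    "empty": [np.nan] * 4,
})

cleaned = (
    df.dropna(axis=1, how="all")   # drop columns that contain no values
      .dropna(axis=0, how="all")   # drop rows that contain no values
      .drop_duplicates()           # drop repeated rows
      .reset_index(drop=True)
)

print(cleaned)
```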

**klib.preprocess** - functions for data preprocessing (feature selection, scaling, ...)
- klib.train_dev_test_split() - splits a dataset and a label into train, optionally dev and test sets
- klib.feature_selection_pipe() - provides common operations for feature selection
- klib.num_pipe() - provides common operations for preprocessing of numerical data
- klib.cat_pipe() - provides common operations for preprocessing of categorical data
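
For readers unfamiliar with a train/dev/test split, here is a minimal standard-library sketch of the idea. The function `three_way_split` and its parameters are hypothetical and do not reflect the signature of klib.train_dev_test_split().

```python
import random


def three_way_split(rows, dev_size=0.1, test_size=0.1, seed=0):
    """Shuffle rows and cut them into train, dev, and test lists."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)          # reproducible shuffle

    n_test = round(len(rows) * test_size)
    n_dev = round(len(rows) * dev_size)

    test = [rows[i] for i in idx[:n_test]]
    dev = [rows[i] for i in idx[n_test:n_test + n_dev]]
    train = [rows[i] for i in idx[n_test + n_dev:]]
    return train, dev, test


rows = list(range(100))
train, dev, test = three_way_split(rows)
print(len(train), len(dev), len(test))
```

Setting `dev_size=0` in this sketch yields an empty dev list, which mirrors the "optionally dev" wording above.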

