Eda-toolkit

Latest version: v0.0.12


Page 1 of 4

0.08c

0.0.12

New Features

- Added `data_doctor` function: a versatile tool designed to facilitate detailed feature analysis, outlier detection, and data transformation within a DataFrame.

**Key Capabilities**:

- **Outlier Detection**:
  - Detects and highlights outliers visually using boxplots, histograms, and other visualization options.
  - Allows cutoffs to be applied directly, offering a configurable approach for handling extreme values.

- **Data Transformation**:
  - Supports a range of scaling transformations, including absolute, log, square root, min-max, robust, and Box-Cox transformations, among others.
  - Configurable via the `scale_conversion` and `scale_conversion_kws` parameters, so the transformation can be tailored to the feature at hand.

- **Visualization Options**:
  - Provides flexible visualization choices, including KDE plots, histograms, and box/violin plots.
  - Allows users to specify multiple plot types in a single call (e.g., `plot_type=["hist", "kde"]`), facilitating comprehensive visual exploration of feature distributions.

- **Customizable Display**:
  - Adds text annotations, such as cutoff values, below plots, and enables users to adjust styling parameters such as `label_fontsize`, `tick_fontsize`, and `figsize`.

- **Output Control**:
  - Offers options to save plots directly to PNG or SVG formats, with file names reflecting key transformations and cutoff information for easy identification.
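
To make the cutoff and transformation steps concrete, here is a minimal pure-Python sketch of the idea. It is not the library's implementation: the helper names `apply_cutoffs` and `scale`, the transformation keys, and the choice to clip values at the cutoffs (rather than drop them) are all assumptions made for illustration.

```python
import math


def apply_cutoffs(values, lower_cutoff=None, upper_cutoff=None):
    """Clip values to the given bounds (assumed behavior); None leaves a side unbounded."""
    clipped = []
    for v in values:
        if lower_cutoff is not None:
            v = max(v, lower_cutoff)
        if upper_cutoff is not None:
            v = min(v, upper_cutoff)
        clipped.append(v)
    return clipped


def scale(values, scale_conversion=None):
    """Apply one illustrative transformation; the keys here are hypothetical."""
    if scale_conversion is None:
        return list(values)
    if scale_conversion == "abs":
        return [abs(v) for v in values]
    if scale_conversion == "log":
        return [math.log(v) for v in values]  # requires every v > 0
    if scale_conversion == "sqrt":
        return [math.sqrt(v) for v in values]  # requires every v >= 0
    if scale_conversion == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) for v in values]  # requires hi > lo
    raise ValueError(f"Unsupported scale_conversion: {scale_conversion!r}")


# Cap the extreme value at 16, then take square roots.
capped = apply_cutoffs([1, 4, 9, 400], upper_cutoff=16)
transformed = scale(capped, "sqrt")
```

Applying the cutoff before the transformation keeps a single extreme value from dominating a range-based scaling such as min-max.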

0.0.11

Fix TypeError in `stacked_crosstab_plot` for `save_formats`

Description:
Fixes a `TypeError` in the `stacked_crosstab_plot` function when `save_formats` is `None`. The update ensures that `save_formats` defaults to an empty list, preventing iteration over a `NoneType` object.

Changes:
- Initializes `save_formats` as an empty list if not provided.
- Adds handling for string and tuple input types for `save_formats`.

Issue Fixed:
Resolves `TypeError` when `save_formats` is `None`.
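
The normalization described above can be sketched as follows. The helper name `normalize_save_formats` is hypothetical, and the actual code inside `stacked_crosstab_plot` may be structured differently; this only illustrates the None/string/tuple handling.

```python
def normalize_save_formats(save_formats=None):
    """Return save_formats as a list, accepting None, a string, a tuple, or a list."""
    if save_formats is None:
        return []  # nothing to save; safe to iterate over
    if isinstance(save_formats, str):
        return [save_formats]  # a single format such as "png"
    if isinstance(save_formats, (tuple, list)):
        return list(save_formats)
    raise TypeError("save_formats must be None, a string, a tuple, or a list")


# Iterating is now safe even when the caller passes nothing.
for fmt in normalize_save_formats(None):
    print(f"saving as {fmt}")
```

Checking for `str` before the sequence types matters: a string is itself iterable, so without that branch `"png"` would be treated as the three formats `"p"`, `"n"`, and `"g"`.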

0.0.11a2

Data Doctor Updates

1. Added `new_col_name` logic for the case where `scale_conversion` is `None` but cutoffs are to be applied to a new column. Such calls are now allowed through so that the new column is created.

2. Fix for `apply_as_new_col_to_df` logic

Updated the logic for generating the new column name when `apply_as_new_col_to_df=True`. This ensures that the column name is correctly assigned based on the applied transformation or cutoff.

**Original code**:

```python
# New column name options when apply_as_new_col_to_df == True
if apply_as_new_col_to_df == True and scale_conversion == None and apply_cutoff == True:
    new_col_name = feature_name + "_" + 'w_cutoff'
elif apply_as_new_col_to_df == True and scale_conversion != None:
    new_col_name = feature_name + "_" + scale_conversion
```

**Updated version**:

```python
# Default new column name in case no conditions are met
new_col_name = feature_name

# New column name options when apply_as_new_col_to_df == True
if apply_as_new_col_to_df:
    if scale_conversion is None and apply_cutoff:
        new_col_name = feature_name + "_w_cutoff"
    elif scale_conversion is not None:
        new_col_name = feature_name + "_" + scale_conversion
```


3. Custom `ValueError` for missing conditions
Added a custom `ValueError` to handle cases where the user sets `apply_as_new_col_to_df=True` but does not specify either a `scale_conversion` or enable `apply_cutoff`. This provides clearer feedback to users and avoids unexpected behavior.

4. New error-handling block:

```python
if apply_as_new_col_to_df:
    if scale_conversion is None and not apply_cutoff:
        raise ValueError(
            "When applying a new column with `apply_as_new_col_to_df=True`, "
            "you must specify either a `scale_conversion` or set `apply_cutoff=True`."
        )
```


Overall Changes
- Corrected the logic for generating new column names when transformations or cutoffs are applied.
- Added a custom `ValueError` when `apply_as_new_col_to_df=True` but neither a valid `scale_conversion` nor `apply_cutoff=True` is specified.
- Updated the docstring to reflect the new logic and error handling.
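
Taken together, the naming rules and the validation above can be exercised in isolation. The sketch below wraps them in a standalone function; `resolve_new_col_name` is a hypothetical name introduced here for illustration, since in the library this logic lives inside `data_doctor` itself.

```python
def resolve_new_col_name(
    feature_name,
    scale_conversion=None,
    apply_cutoff=False,
    apply_as_new_col_to_df=False,
):
    """Return the column name following the rules described in this release."""
    if not apply_as_new_col_to_df:
        return feature_name  # default: no new column requested

    if scale_conversion is None and not apply_cutoff:
        raise ValueError(
            "When applying a new column with `apply_as_new_col_to_df=True`, "
            "you must specify either a `scale_conversion` or set `apply_cutoff=True`."
        )

    if scale_conversion is None:
        return feature_name + "_w_cutoff"  # cutoff only
    return feature_name + "_" + scale_conversion  # transformation applied


cutoff_name = resolve_new_col_name("age", apply_cutoff=True, apply_as_new_col_to_df=True)
log_name = resolve_new_col_name("age", scale_conversion="log", apply_as_new_col_to_df=True)
```

Raising early, before any name is computed, means an under-specified call fails with a clear message instead of silently reusing the original column name.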

0.0.11a1

Plotting Changes

Added `histplot()` to the plot grid

```python
# Histplot
sns.histplot(
    x=feature_,
    ax=axes[1],
    **(hist_kws or {}),
)
axes[1].set_title(f"Histplot: {feature_name} (Scale: {scale_conversion})")
axes[1].set_xlabel(f"{feature_name}")  # Add x-axis label here
```


Additional changes to plotting:
Added flexibility for keyword arguments (`kde_kws`, `hist_kws`, and `box_kws`) in the `data_doctor` function to allow users to customize Seaborn plots directly. This enhancement enables users to pass additional parameters for KDE, histogram, and boxplot customization, making the function more adaptable to specific plotting requirements.

Changes:
- Added an x-axis label to `histplot()`.
- Introduced the following dictionary-based keyword argument inputs:
  - **`kde_kws`**: Allows customization of the KDE plot (e.g., color, fill, etc.).
  - **`hist_kws`**: Allows customization of the histogram plot (e.g., stat, color, etc.).
  - **`box_kws`**: Allows customization of the boxplot (e.g., palette, color, etc.).
- Updated docstrings to reflect these changes and improved the description of the plotting logic.

This should provide users with more control over the visual output without altering the core functionality of the `data_doctor` function.
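
The `**(hist_kws or {})` idiom used in the snippet above is what makes these arguments optional. A minimal sketch of the pattern, with a hypothetical `draw_hist` standing in for the real Seaborn call so that it runs without any plotting library:

```python
def draw_hist(values, bins=10, color="blue"):
    """Hypothetical stand-in for a plotting call; returns the options it received."""
    return {"n": len(values), "bins": bins, "color": color}


def plot_feature(values, hist_kws=None):
    # `hist_kws or {}` turns None into an empty dict, so ** unpacking never fails
    # and the plotting function's own defaults apply when nothing is passed.
    return draw_hist(values, **(hist_kws or {}))


defaults = plot_feature([1, 2, 3])
customized = plot_feature([1, 2, 3], hist_kws={"bins": 5, "color": "green"})
```

Defaulting the parameter to `None` rather than `{}` also avoids sharing one mutable dictionary across calls.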

0.0.11a

**Release Date:** October 2024

We are excited to announce the `0.0.11a` release of `eda_toolkit`, which brings an important new feature: the `data_doctor` function. This release focuses on providing enhanced data quality checks and improving your exploratory data analysis workflow.

🚀 New Features:

`data_doctor` Function

The `data_doctor` function has been added to assist with automated data health checks. It performs a series of diagnostics on your dataset to identify potential issues such as:

- **Missing Data**: It scans for any null values across columns and provides a summary.
- **Data Types**: Verifies the consistency of data types across each column.
- **Outliers**: Detects and highlights statistical outliers based on customizable thresholds (e.g., IQR method).
- **Duplicated Entries**: Identifies duplicate rows in the dataset.
- **Inconsistent Values**: Flags anomalies or inconsistent data entries (such as mixed types in categorical variables).
- **Unique Values**: Reports unique value counts for each feature, helping spot features with low variance.

This function helps users clean their data efficiently by pointing out key issues that may need attention before proceeding with analysis or model training.

Example usage:
```python
from eda_toolkit import data_doctor

# Run diagnostics on a DataFrame
report = data_doctor(df, outlier_method='iqr', display_full_report=True)
```
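
For readers unfamiliar with the IQR method mentioned in the outlier bullet above, here is a generic standard-library sketch. It is not taken from `eda_toolkit`: the function name `iqr_outliers`, the 1.5 multiplier default, and the inclusive quartile method are assumptions chosen for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 95]
flagged = iqr_outliers(readings)
```

A larger `k` flags fewer points; 1.5 is the conventional boxplot threshold.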

flex_corr_matrix fix

Fix: Set the default value of the `title` parameter in `flex_corr_matrix()` to `None`. It was previously hard-coded to `"Cervical Cancer Data: Correlation Matrix"`.

