Description
This release introduces a series of updates and fixes across multiple functions in the `eda_toolkit` library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.
Add `ValueError` for Insufficient Pool Size in `add_ids` and Enhance ID Deduplication
This update enhances the `add_ids` function by adding explicit error handling and improving the uniqueness guarantee for generated IDs. The following changes have been implemented:
**Key Changes**
- New `ValueError` for Insufficient Pool Size:
- Calculates the pool size ($9 \times 10^{d-1}$, where $d$ is the requested digit length) and compares it with the number of rows in the DataFrame.
- Behavior:
- Throws a `ValueError` if `n_rows > pool_size`.
- Prints a warning if `n_rows` approaches 90% of the pool size, suggesting an increase in digit length.
- Improved ID Deduplication:
- Introduced a set (`unique_ids`) to track generated IDs.
- IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
- Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
**Benefits**
- Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
- Guarantees unique IDs even for large DataFrames, improving reliability and scalability.
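For illustration, here is a minimal sketch of the pool-size check and set-based deduplication described above; the function name, parameters (`id_colname`, `d`, `seed`), and messages are illustrative and not the library's exact signature:

```python
import random

import pandas as pd


def add_ids_sketch(df: pd.DataFrame, id_colname: str = "ID", d: int = 9, seed=None) -> pd.DataFrame:
    """Minimal sketch of the pool-size check and set-based ID deduplication."""
    n_rows = len(df)
    pool_size = 9 * 10 ** (d - 1)  # number of d-digit integers with a nonzero leading digit

    if n_rows > pool_size:
        raise ValueError(
            f"Cannot generate {n_rows} unique {d}-digit IDs; the pool size is only {pool_size}."
        )
    if n_rows > 0.9 * pool_size:
        print("Warning: row count is close to the pool size; consider increasing the digit length.")

    if seed is not None:
        random.seed(seed)

    unique_ids = set()  # track generated IDs to guarantee uniqueness
    while len(unique_ids) < n_rows:  # only duplicates trigger a regeneration
        unique_ids.add(random.randint(10 ** (d - 1), 10 ** d - 1))

    out = df.copy()
    out[id_colname] = list(unique_ids)
    return out
```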
Enhance `strip_trailing_period` to Support Strings and Mixed Data Types
This enhances the `strip_trailing_period` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like `NaN`.
**Key Enhancements**
- Support for Strings with Trailing Periods:
- Removes trailing periods from string values, such as "123." or "test.".
- Mixed Data Types:
- Handles columns containing both numeric and string values seamlessly.
- Graceful Handling of `NaN`:
- Skips processing for `NaN` values, leaving them unchanged.
- Robust Type Conversion:
- Converts numeric strings (e.g., "123.") back to float where applicable.
- Retains strings if conversion to float is not possible.
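A minimal sketch of the intended handling, assuming a per-value helper applied to a single column; the function and parameter names are illustrative:

```python
import pandas as pd


def strip_trailing_period_sketch(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Minimal sketch: remove trailing periods while leaving NaN and numeric values untouched."""

    def fix_value(value):
        # Skip missing values and non-strings (numeric entries pass through unchanged)
        if pd.isna(value) or not isinstance(value, str):
            return value
        if value.endswith("."):
            value = value[:-1]
            # Convert numeric strings (e.g., "123.") back to float where possible
            try:
                return float(value)
            except ValueError:
                return value  # retain the string if conversion is not possible
        return value

    out = df.copy()
    out[column_name] = out[column_name].apply(fix_value)
    return out
```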
Changes in `stacked_crosstab_plot`
Remove `IPython` Dependency by Replacing `display` with `print`
This resolves an issue where the `eda_toolkit` library required `IPython` as a dependency due to the use of `display(crosstab_df)` in the `stacked_crosstab_plot` function. The dependency caused import failures in environments without `IPython`, especially in non-Jupyter terminal-based workflows.
**Changes Made**
1. **Replaced** `display` with `print`:
- The line `display(crosstab_df)` was replaced with `print(crosstab_df)` to eliminate the need for `IPython`. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
2. **Removed** `IPython` Import:
- The `from IPython.display import display` import statement was removed from the codebase.
**Updated Function Behavior:**
- Crosstabs are displayed using `print`, maintaining functionality in all runtime environments.
- The change ensures no loss in usability or user experience.
Root Cause and Fix
The issue arose from the reliance on `IPython.display.display` for rendering crosstab tables in Jupyter notebooks. Since `IPython` is not a core dependency of `eda_toolkit`, environments without `IPython` experienced a `ModuleNotFoundError`.
To address this, the `display(crosstab_df)` statement was replaced with `print(crosstab_df)`, simplifying the function while maintaining compatibility with both Jupyter and terminal environments.
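For reference, a minimal before/after view of the change (using the `crosstab_df` variable from the function):

```python
# Before: required IPython, causing ModuleNotFoundError in plain terminal environments
# from IPython.display import display
# display(crosstab_df)

# After: plain print works in both Jupyter and terminal sessions
print(crosstab_df)
```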
Testing
- Jupyter Notebook:
- Crosstabs are displayed as plain text via `print()`, rendered neatly in notebook outputs.
- Terminal Session:
- Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.
Add Environment Detection to `dataframe_columns` Function
This enhances the `dataframe_columns` function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.
**Changes Made**
1. **Environment Detection:**
- Added a check to determine if the function is running in a Jupyter Notebook or terminal:
```python
import sys

# True when running inside a Jupyter/IPython kernel
is_notebook_env = "ipykernel" in sys.modules
```
2. **Dynamic Output Behavior:**
- **Terminal Environment:**
- Returns a plain DataFrame (`result_df`) when running outside of a notebook or when `return_df=True`.
- **Jupyter Notebook:**
- Retains the styled DataFrame functionality when running in a notebook and `return_df=False`.
3. **Improved Compatibility:**
- The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
4. **Preserved Existing Features:**
- Maintains sorting behavior via `sort_cols_alpha`.
- Keeps the background color styling for specific columns (`unique_values_total`, `max_unique_value`, etc.) in notebook environments.
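A minimal sketch of the environment-aware output logic; the function name, summary columns, and styling details are illustrative stand-ins, not the actual `dataframe_columns` implementation:

```python
import sys

import pandas as pd


def report_columns_sketch(df: pd.DataFrame, return_df: bool = False):
    """Minimal sketch: return a plain DataFrame in terminals, a styled one in notebooks."""
    result_df = pd.DataFrame(
        {
            "column": df.columns,
            "dtype": [str(t) for t in df.dtypes],
            "null_total": df.isna().sum().values,
            "unique_values_total": df.nunique().values,
        }
    )

    is_notebook_env = "ipykernel" in sys.modules
    if not is_notebook_env or return_df:
        # Terminal environment (or explicit request): return the plain DataFrame
        return result_df

    # Notebook environment: keep the styled output with highlighted columns
    return result_df.style.set_properties(
        subset=["unique_values_total"], **{"background-color": "#ffff99"}
    )
```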
Add `tqdm` Progress Bar to `dataframe_columns` Function
This enhances the `dataframe_columns` function by incorporating a `tqdm` progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.
**Changes Made**
1. Added `tqdm` Progress Bar:
- Wrapped the column processing loop with a `tqdm` progress bar:
```python
from tqdm import tqdm

for col in tqdm(df.columns, desc="Processing columns"):
    ...
```
2. The progress bar is labeled with the description `"Processing columns"` for clarity.
3. The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.
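A minimal, self-contained sketch of wrapping the column loop with `tqdm`; the collected statistics are illustrative:

```python
from tqdm import tqdm
import pandas as pd


def column_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal sketch: gather per-column metadata with a tqdm progress bar."""
    rows = []
    for col in tqdm(df.columns, desc="Processing columns"):
        rows.append(
            {
                "column": col,
                "dtype": str(df[col].dtype),
                "null_total": int(df[col].isna().sum()),
                "unique_values_total": int(df[col].nunique()),
            }
        )
    return pd.DataFrame(rows)
```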
`box_violin_plot`: Fix Plot Display for Terminal Applications and Simplify `save_plot` Functionality
This addresses the following issues:
1. Removes `plt.close(fig)`
- Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
- Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
2. Simplifies `save_plot` Parameter
- Converts `save_plot` into a `boolean` for simplicity and better integration with the existing `show_plot` parameter.
- Automatically saves plots based on the value of `show_plot` (`"individual"`, `"grid"`, or `"both"`) when `save_plot=True`.
These changes improve the usability and flexibility of the plotting function across different environments.
**Changes Made**
- Removed `plt.close(fig)` to allow plots to remain open in non-Jupyter environments.
- Updated the `save_plot` parameter to be a `boolean`, streamlining the control logic with `show_plot`.
- Adjusted the relevant sections of the code to implement these changes.
- Updated `ValueError` check based on the new `save_plots` input:
```python
# Check for valid save_plots value
if not isinstance(save_plots, bool):
    raise ValueError("`save_plots` must be a boolean value (True or False).")
```
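A minimal sketch of how the boolean `save_plots` can drive which figures get saved based on `show_plot`; this is illustrative, not the function's exact code:

```python
def decide_saves_sketch(show_plot: str, save_plots: bool):
    """Return (save_individual, save_grid) derived from show_plot when save_plots is True."""
    if not isinstance(save_plots, bool):
        raise ValueError("`save_plots` must be a boolean value (True or False).")
    if not save_plots:
        return False, False
    save_individual = show_plot in ("individual", "both")
    save_grid = show_plot in ("grid", "both")
    return save_individual, save_grid


# Example: show_plot="both" with save_plots=True saves both sets -> (True, True)
```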
`scatter_fit_plot`: Render Plots Before Saving
This updates the `scatter_fit_plot` function to render all plots (via `plt.show()`) before saving, improving the user experience and making it easier to validate output quality.
**Changes**
- Added `plt.show()` to render individual and grid plots before saving.
- Integrated `tqdm` for progress tracking while saving individual and grid plots.
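A minimal sketch of the render-then-save flow with `tqdm` progress tracking; the `figures` mapping and output path are assumptions for illustration:

```python
import os

from tqdm import tqdm
import matplotlib.pyplot as plt


def render_then_save_sketch(figures: dict, image_path: str = "plots") -> None:
    """Minimal sketch: render open figures before saving them with a tqdm progress bar."""
    # Render everything first so output quality can be validated on screen
    plt.show()

    # Then save each figure, tracking progress with tqdm
    os.makedirs(image_path, exist_ok=True)
    for name, fig in tqdm(figures.items(), desc="Saving plots"):
        fig.savefig(os.path.join(image_path, f"{name}.png"), bbox_inches="tight")
```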
Add `tqdm` Progress Bar to `save_dataframes_to_excel`
This enhances the `save_dataframes_to_excel` function by integrating a `tqdm` progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.
**Changes Made**
- Added a `tqdm` Progress Bar:
- Tracks the progress of saving DataFrames to individual sheets.
- Ensures that the user sees an incremental update as each DataFrame is written.
- Updated Functionality:
- Incorporated the progress bar into the loop that writes DataFrames to sheets.
- Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).
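A minimal sketch of the `tqdm`-wrapped writing loop, assuming a dictionary mapping sheet names to DataFrames; the formatting features (auto-fit columns, numeric formats, header styles) are omitted here:

```python
from tqdm import tqdm
import pandas as pd


def save_dataframes_sketch(file_path: str, df_dict: dict) -> None:
    """Minimal sketch: write each DataFrame to its own sheet with a tqdm progress bar."""
    with pd.ExcelWriter(file_path, engine="xlsxwriter") as writer:
        for sheet_name, df in tqdm(df_dict.items(), desc="Saving DataFrames"):
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```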
Add Progress Tracking and Enhance Functionality for `summarize_all_combinations`
This enhances the `summarize_all_combinations` function by adding user-friendly progress tracking using `tqdm` and addressing usability concerns. The following changes have been implemented:
1. Progress Tracking with `tqdm`:
- Added a `tqdm` progress bar to provide user-friendly feedback on the function's progress.
2. Excel File Finalization:
- Addressed `UserWarning` messages related to `close()` being called on already closed files by explicitly managing file closure.
- Added a final confirmation message when the Excel file is successfully saved.
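A minimal sketch of the combined behavior, using an `ExcelWriter` context manager for explicit file closure, a `tqdm` bar over the combinations, and a final confirmation message; parameter names and the grouping logic are illustrative:

```python
from itertools import combinations

from tqdm import tqdm
import pandas as pd


def summarize_combinations_sketch(df, variables, data_path, data_name, min_length=2):
    """Minimal sketch: summarize every variable combination with tqdm progress tracking."""
    combos = [
        combo
        for r in range(min_length, len(variables) + 1)
        for combo in combinations(variables, r)
    ]
    output_file = f"{data_path}/{data_name}"
    # The context manager closes the file exactly once, avoiding duplicate close() warnings
    with pd.ExcelWriter(output_file, engine="xlsxwriter") as writer:
        for combo in tqdm(combos, desc="Summarizing combinations"):
            summary = (
                df.groupby(list(combo), observed=False).size().reset_index(name="Count")
            )
            sheet_name = "_".join(combo)[:31]  # Excel caps sheet names at 31 characters
            summary.to_excel(writer, sheet_name=sheet_name, index=False)
    print(f"Excel file saved: {output_file}")
```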
Fix Plot Display Logic in `plot_2d_pdp`
This resolves an issue in the `plot_2d_pdp` function where all plots (grid and individual) were being displayed unnecessarily when `save_plots="all"`. The function now adheres strictly to the `plot_type` parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.
**Changes Made:**
1. **Grid Plot Logic:**
- Grid plots are only displayed if `plot_type="grid"` or `plot_type="both"`.
- If `save_plots="all"` or `save_plots="grid"`, plots are saved without being displayed unless specified by `plot_type`.
2. **Individual Plot Logic:**
- Individual plots are only displayed if `plot_type="individual"` or `plot_type="both"`.
- If `save_plots="all"` or `save_plots="individual"`, plots are saved but not displayed unless specified by `plot_type`.
3. **Plot Closing:**
- Added `plt.close(fig)` after saving plots to release memory when plots are not intended for display.
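A minimal sketch of the grid-figure branch showing how `plot_type` and `save_plots` interact, with `plt.close(fig)` releasing figures that are not displayed; names and the file path are illustrative:

```python
import matplotlib.pyplot as plt


def handle_grid_figure_sketch(fig: plt.Figure, plot_type: str, save_plots: str,
                              file_path: str = "pdp_grid.png") -> None:
    """Minimal sketch: save and/or show the grid figure according to the parameters."""
    if save_plots in ("all", "grid"):
        fig.savefig(file_path, bbox_inches="tight")

    if plot_type in ("grid", "both"):
        plt.show()  # only display when grid output was requested
    else:
        plt.close(fig)  # release memory for figures that are not meant to be shown
```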
Additional Features and Enhancements
- **Environment Testing**: Successfully tested across multiple environments for compatibility.
- **New Features**:
- Added a streamlined and sweet **Makefile** for simplified project management.
- Implemented a new `__init__.py` for modularization and clarity.
- Introduced a robust and flexible ASCII art printing script.
- **Dependency Updates**: Refreshed `requirements.txt`, `pyproject.toml`, and `setup.py` to align with the latest changes and ensure seamless installation.