Description
This release introduces a series of updates and fixes across multiple functions in the `eda_toolkit` library to enhance error handling, improve cross-environment compatibility, streamline usability, and optimize performance. These changes address critical issues, add new features, and ensure consistent behavior in both terminal and notebook environments.
Add `ValueError` for Insufficient Pool Size in `add_ids` and Enhance ID Deduplication
This update enhances the `add_ids` function by adding explicit error handling and improving the uniqueness guarantee for generated IDs. The following changes have been implemented:
**Key Changes**
- New `ValueError` for Insufficient Pool Size:
- Calculates the pool size ($9 \times 10^{d-1}$, where $d$ is the requested digit length) and compares it with the number of rows in the DataFrame.
- Behavior:
- Throws a `ValueError` if `n_rows > pool_size`.
- Prints a warning if `n_rows` approaches 90% of the pool size, suggesting an increase in digit length.
- Improved ID Deduplication:
- Introduced a set (`unique_ids`) to track generated IDs.
- IDs are checked against this set to ensure uniqueness before being added to the DataFrame.
- Prevents collisions by regenerating IDs only for duplicates, minimizing retries and improving performance.
**Benefits**
- Ensures robust error handling, avoiding silent failures or excessive retries caused by small digit lengths.
- Guarantees unique IDs even for large DataFrames, improving reliability and scalability.
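For illustration, here is a minimal sketch of the pool-size check and set-based deduplication described above; the function name, parameters (`id_colname`, `d`, `seed`), and messages are illustrative and not the library's exact signature:

```python
import random

import pandas as pd


def add_ids_sketch(df: pd.DataFrame, id_colname: str = "ID", d: int = 9, seed=None) -> pd.DataFrame:
    """Minimal sketch of the pool-size check and set-based ID deduplication."""
    n_rows = len(df)
    pool_size = 9 * 10 ** (d - 1)  # number of d-digit integers with a nonzero leading digit

    if n_rows > pool_size:
        raise ValueError(
            f"Cannot generate {n_rows} unique {d}-digit IDs; the pool size is only {pool_size}."
        )
    if n_rows > 0.9 * pool_size:
        print("Warning: row count is close to the pool size; consider increasing the digit length.")

    if seed is not None:
        random.seed(seed)

    unique_ids = set()  # track generated IDs to guarantee uniqueness
    while len(unique_ids) < n_rows:  # only duplicates trigger a regeneration
        unique_ids.add(random.randint(10 ** (d - 1), 10 ** d - 1))

    out = df.copy()
    out[id_colname] = list(unique_ids)
    return out
```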
Enhance `strip_trailing_period` to Support Strings and Mixed Data Types
This enhances the `strip_trailing_period` function to handle trailing periods in both numeric and string values. The updated implementation ensures robustness for columns with mixed data types and gracefully handles special cases like `NaN`.
**Key Enhancements**
- Support for Strings with Trailing Periods:
- Removes trailing periods from string values, such as "123." or "test.".
- Mixed Data Types:
- Handles columns containing both numeric and string values seamlessly.
- Graceful Handling of `NaN`:
- Skips processing for `NaN` values, leaving them unchanged.
- Robust Type Conversion:
- Converts numeric strings (e.g., "123.") back to float where applicable.
- Retains strings if conversion to float is not possible.
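A minimal sketch of the intended handling, assuming a per-value helper applied to a single column; the function and parameter names are illustrative:

```python
import pandas as pd


def strip_trailing_period_sketch(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """Minimal sketch: remove trailing periods while leaving NaN and numeric values untouched."""

    def fix_value(value):
        # Skip missing values and non-strings (numeric entries pass through unchanged)
        if pd.isna(value) or not isinstance(value, str):
            return value
        if value.endswith("."):
            value = value[:-1]
            # Convert numeric strings (e.g., "123.") back to float where possible
            try:
                return float(value)
            except ValueError:
                return value  # retain the string if conversion is not possible
        return value

    out = df.copy()
    out[column_name] = out[column_name].apply(fix_value)
    return out
```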
Changes in `stacked_crosstab_plot`
Remove `IPython` Dependency by Replacing `display` with `print`
This resolves an issue where the `eda_toolkit` library required `IPython` as a dependency due to the use of `display(crosstab_df)` in the `stacked_crosstab_plot` function. The dependency caused import failures in environments without `IPython`, especially in non-Jupyter terminal-based workflows.
**Changes Made**
1. **Replaced** `display` with `print`:
- The line `display(crosstab_df)` was replaced with `print(crosstab_df)` to eliminate the need for `IPython`. This ensures compatibility across terminal and Jupyter environments without requiring additional dependencies.
2. **Removed** `IPython` Import:
- The `from IPython.display import display` import statement was removed from the codebase.
**Updated Function Behavior:**
- Crosstabs are displayed using `print`, maintaining functionality in all runtime environments.
- The change ensures no loss in usability or user experience.
Root Cause and Fix
The issue arose from the reliance on `IPython.display.display` for rendering crosstab tables in Jupyter notebooks. Since `IPython` is not a core dependency of `eda_toolkit`, environments without `IPython` experienced a `ModuleNotFoundError`.
To address this, the `display(crosstab_df)` statement was replaced with `print(crosstab_df)`, simplifying the function while maintaining compatibility with both Jupyter and terminal environments.
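For reference, a minimal before/after view of the change (using the `crosstab_df` variable from the function):

```python
# Before: required IPython, causing ModuleNotFoundError in plain terminal environments
# from IPython.display import display
# display(crosstab_df)

# After: plain print works in both Jupyter and terminal sessions
print(crosstab_df)
```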
Testing
- Jupyter Notebook:
- Crosstabs are displayed as plain text via `print()`, rendered neatly in notebook outputs.
- Terminal Session:
- Crosstabs are printed as expected, ensuring seamless use in terminal-based workflows.
Add Environment Detection to `dataframe_columns` Function
This enhances the `dataframe_columns` function to dynamically adjust its output based on the runtime environment (Jupyter Notebook or terminal). It resolves issues where the function's styled output was incompatible with terminal environments.
**Changes Made**
1. **Environment Detection:**
- Added a check to determine if the function is running in a Jupyter Notebook or terminal:
```python
import sys

# True when running inside a Jupyter/IPython kernel
is_notebook_env = "ipykernel" in sys.modules
```
2. **Dynamic Output Behavior:**
- **Terminal Environment:**
- Returns a plain DataFrame (`result_df`) when running outside of a notebook or when `return_df=True`.
- **Jupyter Notebook:**
- Retains the styled DataFrame functionality when running in a notebook and `return_df=False`.
3. **Improved Compatibility:**
- The function now works seamlessly in both terminal and notebook environments without requiring additional dependencies.
4. **Preserved Existing Features:**
- Maintains sorting behavior via `sort_cols_alpha`.
- Keeps the background color styling for specific columns (`unique_values_total`, `max_unique_value`, etc.) in notebook environments.
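A minimal sketch of the environment-aware output logic; the function name, summary columns, and styling details are illustrative stand-ins, not the actual `dataframe_columns` implementation:

```python
import sys

import pandas as pd


def report_columns_sketch(df: pd.DataFrame, return_df: bool = False):
    """Minimal sketch: return a plain DataFrame in terminals, a styled one in notebooks."""
    result_df = pd.DataFrame(
        {
            "column": df.columns,
            "dtype": [str(t) for t in df.dtypes],
            "null_total": df.isna().sum().values,
            "unique_values_total": df.nunique().values,
        }
    )

    is_notebook_env = "ipykernel" in sys.modules
    if not is_notebook_env or return_df:
        # Terminal environment (or explicit request): return the plain DataFrame
        return result_df

    # Notebook environment: keep the styled output with highlighted columns
    return result_df.style.set_properties(
        subset=["unique_values_total"], **{"background-color": "#ffff99"}
    )
```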
Add `tqdm` Progress Bar to `dataframe_columns` Function
This enhances the `dataframe_columns` function by incorporating a `tqdm` progress bar to track the processing of each column. This is particularly useful for analyzing large DataFrames, providing real-time feedback on the function's progress.
**Changes Made**
1. Added `tqdm` Progress Bar:
- Wrapped the column processing loop with a `tqdm` progress bar:
```python
from tqdm import tqdm

for col in tqdm(df.columns, desc="Processing columns"):
    ...
```
2. The progress bar is labeled with the description `"Processing columns"` for clarity.
3. The progress bar is non-intrusive and works seamlessly in both terminal and Jupyter Notebook environments.
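A minimal, self-contained sketch of wrapping the column loop with `tqdm`; the collected statistics are illustrative:

```python
from tqdm import tqdm
import pandas as pd


def column_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal sketch: gather per-column metadata with a tqdm progress bar."""
    rows = []
    for col in tqdm(df.columns, desc="Processing columns"):
        rows.append(
            {
                "column": col,
                "dtype": str(df[col].dtype),
                "null_total": int(df[col].isna().sum()),
                "unique_values_total": int(df[col].nunique()),
            }
        )
    return pd.DataFrame(rows)
```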
`box_violin_plot`: Fix Plot Display for Terminal Applications and Simplify `save_plot` Functionality
This addresses the following issues:
1. Removes `plt.close(fig)`
- Ensures plots display properly in terminal-based applications and IDEs outside Jupyter Notebooks.
- Fixes the incompatibility with non-interactive environments by leaving figures open after rendering.
2. Simplifies `save_plot` Parameter
- Converts `save_plot` into a `boolean` for simplicity and better integration with the existing `show_plot` parameter.
- Automatically saves plots based on the value of `show_plot` (`"individual"`, `"grid"`, or `"both"`) when `save_plot=True`.
These changes improve the usability and flexibility of the plotting function across different environments.
**Changes Made**
- Removed `plt.close(fig)` to allow plots to remain open in non-Jupyter environments.
- Updated the `save_plot` parameter to be a `boolean`, streamlining the control logic with `show_plot`.
- Adjusted the relevant sections of the code to implement these changes.
- Updated `ValueError` check based on the new `save_plots` input:
```python
# Check for valid save_plots value
if not isinstance(save_plots, bool):
    raise ValueError("`save_plots` must be a boolean value (True or False).")
```
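A minimal sketch of how the boolean `save_plots` can drive which figures get saved based on `show_plot`; this is illustrative, not the function's exact code:

```python
def decide_saves_sketch(show_plot: str, save_plots: bool):
    """Return (save_individual, save_grid) derived from show_plot when save_plots is True."""
    if not isinstance(save_plots, bool):
        raise ValueError("`save_plots` must be a boolean value (True or False).")
    if not save_plots:
        return False, False
    save_individual = show_plot in ("individual", "both")
    save_grid = show_plot in ("grid", "both")
    return save_individual, save_grid


# Example: show_plot="both" with save_plots=True saves both sets -> (True, True)
```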
`scatter_fit_plot`: Render Plots Before Saving
This updates the `scatter_fit_plot` function to render all plots (via `plt.show()`) before saving, improving the user experience and making it easier to validate output quality.
**Changes**
- Added `plt.show()` to render individual and grid plots before saving.
- Integrated `tqdm` for progress tracking while saving individual and grid plots.
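A minimal sketch of the render-then-save flow with `tqdm` progress tracking; the `figures` mapping and output path are assumptions for illustration:

```python
import os

from tqdm import tqdm
import matplotlib.pyplot as plt


def render_then_save_sketch(figures: dict, image_path: str = "plots") -> None:
    """Minimal sketch: render open figures before saving them with a tqdm progress bar."""
    # Render everything first so output quality can be validated on screen
    plt.show()

    # Then save each figure, tracking progress with tqdm
    os.makedirs(image_path, exist_ok=True)
    for name, fig in tqdm(figures.items(), desc="Saving plots"):
        fig.savefig(os.path.join(image_path, f"{name}.png"), bbox_inches="tight")
```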
Add `tqdm` Progress Bar to `save_dataframes_to_excel`
This enhances the `save_dataframes_to_excel` function by integrating a `tqdm` progress bar for improved tracking of the DataFrame saving process. Users can now visually monitor the progress of writing each DataFrame to its respective sheet in the Excel file.
**Changes Made**
- Added a `tqdm` Progress Bar:
- Tracks the progress of saving DataFrames to individual sheets.
- Ensures that the user sees an incremental update as each DataFrame is written.
- Updated Functionality:
- Incorporated the progress bar into the loop that writes DataFrames to sheets.
- Retained the existing formatting features (e.g., auto-fitting columns, numeric formatting, and header styles).
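A minimal sketch of the `tqdm`-wrapped writing loop, assuming a dictionary mapping sheet names to DataFrames; the formatting features (auto-fit columns, numeric formats, header styles) are omitted here:

```python
from tqdm import tqdm
import pandas as pd


def save_dataframes_sketch(file_path: str, df_dict: dict) -> None:
    """Minimal sketch: write each DataFrame to its own sheet with a tqdm progress bar."""
    with pd.ExcelWriter(file_path, engine="xlsxwriter") as writer:
        for sheet_name, df in tqdm(df_dict.items(), desc="Saving DataFrames"):
            df.to_excel(writer, sheet_name=sheet_name, index=False)
```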
Add Progress Tracking and Enhance Functionality for `summarize_all_combinations`
This enhances the `summarize_all_combinations` function by adding user-friendly progress tracking using `tqdm` and addressing usability concerns. The following changes have been implemented:
1. Progress Tracking with `tqdm`:
- Added a `tqdm` progress bar to provide user-friendly feedback on the function's progress.
2. Excel File Finalization:
- Addressed `UserWarning` messages related to `close()` being called on already closed files by explicitly managing file closure.
- Added a final confirmation message when the Excel file is successfully saved.
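A minimal sketch of the combined behavior, using an `ExcelWriter` context manager for explicit file closure, a `tqdm` bar over the combinations, and a final confirmation message; parameter names and the grouping logic are illustrative:

```python
from itertools import combinations

from tqdm import tqdm
import pandas as pd


def summarize_combinations_sketch(df, variables, data_path, data_name, min_length=2):
    """Minimal sketch: summarize every variable combination with tqdm progress tracking."""
    combos = [
        combo
        for r in range(min_length, len(variables) + 1)
        for combo in combinations(variables, r)
    ]
    output_file = f"{data_path}/{data_name}"
    # The context manager closes the file exactly once, avoiding duplicate close() warnings
    with pd.ExcelWriter(output_file, engine="xlsxwriter") as writer:
        for combo in tqdm(combos, desc="Summarizing combinations"):
            summary = (
                df.groupby(list(combo), observed=False).size().reset_index(name="Count")
            )
            sheet_name = "_".join(combo)[:31]  # Excel caps sheet names at 31 characters
            summary.to_excel(writer, sheet_name=sheet_name, index=False)
    print(f"Excel file saved: {output_file}")
```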
Fix Plot Display Logic in `plot_2d_pdp`
This resolves an issue in the `plot_2d_pdp` function where all plots (grid and individual) were being displayed unnecessarily when `save_plots="all"`. The function now adheres strictly to the `plot_type` parameter, showing only the intended plots. It also ensures unused plots are closed to prevent memory issues.
**Changes Made:**
1. **Grid Plot Logic:**
- Grid plots are only displayed if `plot_type="grid"` or `plot_type="both"`.
- If `save_plots="all"` or `save_plots="grid"`, plots are saved without being displayed unless specified by `plot_type`.
2. **Individual Plot Logic:**
- Individual plots are only displayed if `plot_type="individual"` or `plot_type="both"`.
- If `save_plots="all"` or `save_plots="individual"`, plots are saved but not displayed unless specified by `plot_type`.
3. **Plot Closing:**
- Added `plt.close(fig)` after saving plots to release memory when plots are not intended for display.
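A minimal sketch of the grid-figure branch showing how `plot_type` and `save_plots` interact, with `plt.close(fig)` releasing figures that are not displayed; names and the file path are illustrative:

```python
import matplotlib.pyplot as plt


def handle_grid_figure_sketch(fig: plt.Figure, plot_type: str, save_plots: str,
                              file_path: str = "pdp_grid.png") -> None:
    """Minimal sketch: save and/or show the grid figure according to the parameters."""
    if save_plots in ("all", "grid"):
        fig.savefig(file_path, bbox_inches="tight")

    if plot_type in ("grid", "both"):
        plt.show()  # only display when grid output was requested
    else:
        plt.close(fig)  # release memory for figures that are not meant to be shown
```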
Additional Features and Enhancements
- **Environment Testing**: Successfully tested across multiple environments for compatibility.
- **New Features**:
- Added a streamlined and sweet **Makefile** for simplified project management.
- Implemented a new `__init__.py` for modularization and clarity.
- Introduced a robust and flexible ASCII art printing script.
- **Dependency Updates**: Refreshed `requirements.txt`, `pyproject.toml`, and `setup.py` to align with the latest changes and ensure seamless installation.