Eda-report

Latest version: v2.8.1

2.8.1

What's New

* Drop `python-3.8`. It is no longer supported in *NumPy* and related packages.
* Update package config files to exclude tests in sdist.
* Bug Fix: Handle mixed-dtype datasets:

Mixed-dtype datasets were stored with the `object` dtype. This caused problems: for example, sorting raised a `TypeError` when both numbers and strings were present. Data with the `object` dtype is now converted to the `string` dtype.
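The failure mode is easy to reproduce without the library. The sketch below is a plain-Python analogue, not eda-report's actual code: ordering a mix of numbers and strings raises `TypeError`, while converting every value to a string first gives a well-defined (lexicographic) order. The sample values are made up for illustration.

```python
# Hypothetical mixed-type column: numbers and strings side by side.
mixed = [3, "apple", 1.5, "banana"]

try:
    sorted(mixed)
    sort_failed = False
except TypeError:
    # "<" is not defined between str and int/float.
    sort_failed = True

# Converting every value to a string makes the comparison well-defined.
as_strings = sorted(str(value) for value in mixed)

print(sort_failed)
print(as_strings)
```

Note that the resulting order is lexicographic, so numeric-looking strings are compared character by character rather than by value.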

**Full Changelog**: https://github.com/Tim-Abwao/eda-report/compare/v2.8.0...v2.8.1

2.8.0

Improvements

* Use `matplotlib.rc_context` to customize plots while avoiding modifying global matplotlib state.
* Avoid storing copies of data in `Variable` instances to save on memory.
* Dynamically allocate image widths in documents to ensure plots maintain aspect ratio and are of appropriate size.
* Strip trailing zeros from float values in tables.
* Make `Dataset` repr more concise. Remove the "overview" title. Indent content. Rename summary statistics: "mean" to "avg", "std" to "stddev". Convert "count" values to integers.
* Consider data consisting of {Yes, No} or {Y, N} as boolean.
* Consider numeric variables with <11 unique values as categorical.
* Update the `summarize` function. Return a `Variable` for 1D data; a `Dataset` otherwise.
* Return `Axes` instances instead of `Figures` in all plotting functions.
* Dynamically handle repr logic in `Variable.__repr__`. Drop the redundant `_NumericStats`, `_DatetimeStats` and `_CategoricalStats` classes.
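As an illustration of the "strip trailing zeros" item above, here is a minimal sketch of one way to do it. The helper name `strip_trailing_zeros` and the default precision of 4 are assumptions for the example, not the package's implementation.

```python
def strip_trailing_zeros(value: float, precision: int = 4) -> str:
    """Format a float with fixed precision, then drop uninformative zeros."""
    text = f"{value:.{precision}f}"
    # Only strip when there is a decimal point, so "100" never becomes "1".
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


print(strip_trailing_zeros(2.5))
print(strip_trailing_zeros(100.0))
```

Checking for the decimal point before stripping keeps whole numbers such as 100 intact.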

Additions

* Add the `ax` argument. Accept axes input in plotting functions. This is handy when axes instances already exist, e.g. in subplots.
* Add `marker_color` and `line_color` args to the `regression_plot` function.
* Add "(Normal)" to the `prob_plot` x-axis label to indicate the distribution.
* Add xlabel and legend title to kde-plots.
* Add ylabel in grouped box-plots.
* Add "Count" ylabel to bar-plots.
* Add `Variable._get_most_common_categories` method to provide additional details for categorical variable repr.
* Add separate analysis and content-creation modules. The analysis module will focus on the actual analysis. The content module will focus on processing results for display.

Renamings

* Mark the analysis, cli, content, read_file and validate modules as private.
* Rename `groupby_data` argument in `get_word_report`, `ReportDocument`, `_ReportContent` and `_AnalysisResult` to `groupby_variable`.
* Rename `ReportDocument._format_heading_spacing` method to `ReportDocument._format_paragraph_spacing`. Modify the method to work with any paragraph.
* Rename "numeric (<10 levels)" to "numeric (<=10 levels)", which is the accurate description.
* Rename the multivariate module to bivariate. Correlation & scatterplots are bivariate analysis methods. Likewise, rename `MultiVariable` to `Dataset`.
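The type-inference rules mentioned in this release (Yes/No data treated as boolean, numeric variables with at most 10 unique values treated as categorical) can be sketched roughly as follows. The function `infer_kind`, the labels "boolean", "numeric" and "categorical", and the exact-match rule for the boolean label sets are all assumptions made for illustration; the package's internals may differ.

```python
# Label sets treated as boolean (assumed to be matched exactly, case-sensitive).
BOOLEAN_LABEL_SETS = [{"Yes", "No"}, {"Y", "N"}]
MAX_CATEGORICAL_LEVELS = 10  # i.e. fewer than 11 unique values


def infer_kind(values):
    """Classify a sequence of values using the release-note heuristics."""
    unique = set(values)
    if unique in BOOLEAN_LABEL_SETS:
        return "boolean"
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in unique
    )
    if is_numeric:
        if len(unique) <= MAX_CATEGORICAL_LEVELS:
            return "numeric (<=10 levels)"
        return "numeric"
    return "categorical"


print(infer_kind(["Yes", "No", "Yes"]))
print(infer_kind([1, 2, 3, 2]))
print(infer_kind(range(11)))
```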

**Full Changelog**: https://github.com/Tim-Abwao/eda-report/compare/v2.7.3...v2.8.0

2.7.3

What's New

- Plot & display only the top 20 correlated pairs:

  - Reduce scatterplot threshold from 50 to 20.
  - Show correlation info in descending order of magnitude.

- Update contingency table logic:

Only create a contingency table if the data has fewer than 20 unique values, to avoid cluttering the report.

- Update `plot_correlation` function:

  - Darken and narrow bar edge-color.
  - Set x-axis range to [-1.1, 1.1] so that bars with width slightly over 1.0 are not cut off.
  - Return `None` when correlation info is missing.

- Refactor the multivariate module:

  - Add the `_compute_correlation`, `_describe_correlation` and `_select_dtypes` functions, for reusability and more comprehensive testing.
  - Add the `_correlation_values`, `_correlation_descriptions`, `_numeric_stats` and `_categorical_stats` attributes; and cut non-essential ones.
  - Add the `_get_summary_statistics` method, and cut non-essential methods.
  - Stop omitting numeric columns with less than 0.05% unique values from bivariate analysis. The omission was meant to reduce the number of resultant scatter-plots, but in retrospect it was not a good idea.

- Add Python 3.11 to the test workflow.
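The "top 20 correlated pairs, in descending order of magnitude" selection described in this release can be sketched as below. This is an illustrative helper, not the package's code: `top_correlated_pairs` and the sample coefficients are hypothetical, and the input is assumed to be a mapping from column pairs to correlation coefficients.

```python
def top_correlated_pairs(correlations, max_pairs=20):
    """Return the strongest pairs, ordered by descending absolute correlation."""
    ranked = sorted(
        correlations.items(), key=lambda item: abs(item[1]), reverse=True
    )
    return ranked[:max_pairs]


# Made-up coefficients for three column pairs.
sample = {("a", "b"): 0.35, ("a", "c"): -0.92, ("b", "c"): 0.10}
strongest = top_correlated_pairs(sample, max_pairs=2)
print(strongest)
```

Ranking by absolute value keeps strong negative correlations, such as the -0.92 above, at the top of the list.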

2.7.2

What's New

- Allow running in the CLI without `tkinter`:
  - If the input file and other args are provided, everything runs normally, with neither a `ModuleNotFoundError` nor an `ImportError` when `tkinter` is missing.
  - If no args are specified and `tkinter` is missing, show a friendly message and exit gracefully.

- Add contingency tables to the report:

If a valid group-by variable is provided, a contingency table will now be added to the univariate analysis results of categorical variables.

- Update table creation function:
  - Make column headers and index bold.
  - Improve logic for handling header and other rows.
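The optional-`tkinter` behaviour described in this release follows a common pattern: import the GUI toolkit lazily and only when it is needed. The sketch below shows that pattern under stated assumptions; `try_import_tkinter`, `choose_mode` and the returned strings are hypothetical and do not reflect the package's actual CLI code.

```python
def try_import_tkinter():
    """Return the tkinter module, or None if it is not installed."""
    try:
        import tkinter
    except ImportError:  # also catches ModuleNotFoundError, its subclass
        return None
    return tkinter


def choose_mode(cli_args):
    """Decide how to run, given the list of command-line arguments."""
    if cli_args:
        # An input file was supplied, so the GUI (and tkinter) is never needed.
        return "cli"
    if try_import_tkinter() is None:
        # Nothing to analyse and no GUI to ask with: report and stop.
        return "exit: tkinter is unavailable, please pass an input file"
    return "gui"


print(choose_mode(["data.csv"]))
```

A real command-line tool would print the message and call `sys.exit` rather than return a string; returning a value here just keeps the sketch easy to test.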

2.7.1

What's New

- Allow color selection in all plotting functions:

  - Add the `color` arg to `bar_plot`, `box_plot`, `kde_plot` and `regression_plot`.
  - Add `marker_color` and `line_color` args to `prob_plot`.
  - Add `color_neg` and `color_pos` args to `plot_correlation`.

- Replace `set_custom_palette` with `_get_color_shades_of`:

  - `_get_color_shades_of` generates shades of a desired color with no side-effects.
  - `set_custom_palette` modifies the default matplotlib color cycle, which has a residual effect on other plots where the modified color cycle is undesired.

- Add `max_pairs` arg to `plot_correlation`:

Sets the maximum number of numeric pairs to include.

- Remove redundant `hue` arg from `prob_plot`.
- Mark `savefig` as private. It's really just used internally.

2.7.0

What's New

- Add the `set_custom_palette`, `box_plot`, `kde_plot`, `probability_plot`, `bar_plot`, `regression_plot` and `plot_correlation` functions (See [Plotting Examples][plotting]).

- Rename `target_variable` to `groupby_data`:
  - Select a more intuitive name: `target_variable` is ambiguous.
  - Add the group-by (`-g`, `--groupby`) CLI argument.

- Update document layout:
  - Center-align images and tables.
  - Reduce unnecessary page-breaks.

- Replace correlation heatmap with a bar chart:

Show coloured & labeled bars of the top 20 correlated numeric variable pairs (by magnitude). Makes it much easier to notice highly correlated variables.

- Limit bivariate summaries & regression plots to 50.

This is necessary because the number of pairs grows quadratically with the number of columns. 50 numeric columns could easily result in a report of over 500 pages that takes ages to prepare (`combination(50_numeric_cols, 2) == 1225` pairs, at 2 pairs per page). Now only the top 50 pairs will be published (approx. 25 pages).
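The arithmetic behind this limit can be checked with the standard library. The figures come from the note above (50 numeric columns, 2 pairs per page, a cap of 50 pairs); the variable names are only for illustration.

```python
import math

numeric_columns = 50
pairs_per_page = 2
max_pairs = 50

# Number of distinct column pairs: "50 choose 2".
total_pairs = math.comb(numeric_columns, 2)

# Pages needed without and with the cap.
pages_uncapped = math.ceil(total_pairs / pairs_per_page)
pages_capped = math.ceil(min(total_pairs, max_pairs) / pairs_per_page)

print(total_pairs, pages_uncapped, pages_capped)
```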

- Configure color in each subprocess:

Update helper functions to accept color choice, and set custom palette. Spawned subprocesses (Windows & Mac currently) weren't getting the globally modified colors.

- Reduce graph image dpi from 250 to 150:

Results in smaller, but very decent images. Significantly reduces the size of report documents with many variables.

- Revise correlation interpretation:

Use R.H. Evans (1966) guide:

.00-.19 -> very weak
.20-.39 -> weak
.40-.59 -> moderate
.60-.79 -> strong
.80-1.0 -> very strong
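The bands above translate directly into a small lookup on the absolute value of the coefficient. The following is a sketch only: the function name `interpret_strength` is hypothetical, and treating each band's lower edge as inclusive (for example, 0.195 counts as "very weak" and 0.20 as "weak") is an assumption, since the table lists two-decimal ranges.

```python
def interpret_strength(r: float) -> str:
    """Map a correlation coefficient to a strength label by magnitude."""
    strength = abs(r)
    if not 0 <= strength <= 1:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if strength < 0.2:
        return "very weak"
    if strength < 0.4:
        return "weak"
    if strength < 0.6:
        return "moderate"
    if strength < 0.8:
        return "strong"
    return "very strong"


print(interpret_strength(-0.45))
print(interpret_strength(0.85))
```

The sign is ignored on purpose: the label describes strength, not direction.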

- Fix handling of int values for `groupby` specifier:
  - Integer input from the CLI and GUI is parsed as a string, so it failed the `isinstance(x, int)` test.
  - The `str.isdecimal` test is more suitable here.

- Optimize tests:
  - Add `conftest.py`.
  - Add a session-level `temp_data_dir` fixture.
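The group-by fix in this release hinges on the difference between `isinstance(x, int)` and `str.isdecimal`. A minimal sketch of the idea follows; `parse_groupby` is a hypothetical helper, and the convention that a decimal string means a column index while any other string means a column label is an assumption for the example.

```python
def parse_groupby(specifier):
    """Interpret a group-by specifier that may arrive as text from a CLI or GUI."""
    if isinstance(specifier, int):
        return specifier
    if isinstance(specifier, str) and specifier.isdecimal():
        # "2" typed on the command line is a str, so isinstance(..., int) is False.
        return int(specifier)
    # Anything else is kept as-is and treated as a column label.
    return specifier


print(isinstance("2", int))
print(parse_groupby("2"))
print(parse_groupby("species"))
```

`str.isdecimal` returns `False` for strings such as "-1" or "2.0", so those stay as labels in this sketch.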

[plotting]: https://eda-report.readthedocs.io/en/latest/eda_report.plotting.html#plotting-examples



**Full Changelog**: https://github.com/Tim-Abwao/eda-report/compare/v2.6.0...v2.7.0
