Handyspark

Latest version: v0.2.2a1

0.2.0a1

Performance Improvements

- summaries are no longer computed when a `HandyFrame` is created.
- column statistics (`q1`, `q3`, `median`, `percentile`) now accept a `precision` argument (default = 0.01) to compute approximate statistics faster
- `stratify` operations no longer use RDD methods; they rely on Spark's built-in DataFrame optimizer to deliver fast columnar statistics. Almost every `stratify` operation is substantially faster as a result.

Stratified transformers

The `HandyImputer` and `HandyFencer` transformers now store values for stratified operations with the ***column name*** as the first level of the dictionary and the ***filter clause*** as the second level, the inverse of the structure used in version 0.1.0a1.

- in version 0.1.0a1:

{'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
 'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
 'Pclass == "2" and Sex == "female"': {'Age': 28.722972972972972},
 'Pclass == "2" and Sex == "male"': {'Age': 30.74070707070707},
 'Pclass == "3" and Sex == "female"': {'Age': 21.75},
 'Pclass == "3" and Sex == "male"': {'Age': 26.507588932806325}}

- in version 0.2.0a1:

{'Age': {'Pclass == "1" and Sex == "female"': 34.61176470588235,
         'Pclass == "1" and Sex == "male"': 41.28138613861386,
         'Pclass == "2" and Sex == "female"': 28.722972972972972,
         'Pclass == "2" and Sex == "male"': 30.74070707070707,
         'Pclass == "3" and Sex == "female"': 21.75,
         'Pclass == "3" and Sex == "male"': 26.507588932806325}}
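
Code that still holds values in the 0.1.0a1 layout could be converted to the new layout by swapping the two dictionary levels. The following is a minimal sketch in plain Python; `invert_nested` is a hypothetical helper written for illustration and is not part of HandySpark.

```python
def invert_nested(old):
    """Swap the two levels of a nested dict.

    {filter_clause: {column: value}} -> {column: {filter_clause: value}}
    """
    new = {}
    for clause, by_column in old.items():
        for column, value in by_column.items():
            # Create the per-column dict on first use, then file the value
            # under its filter clause.
            new.setdefault(column, {})[clause] = value
    return new


# Two entries taken from the 0.1.0a1 example above.
old_style = {
    'Pclass == "1" and Sex == "female"': {'Age': 34.61176470588235},
    'Pclass == "1" and Sex == "male"': {'Age': 41.28138613861386},
}
new_style = invert_nested(old_style)
```

Because the operation only swaps levels, applying it twice returns the original layout.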

Outlier detection and removal

Two new methods, available on both `HandyFrame` and `HandyColumns` objects, detect and remove outliers based on Mahalanobis distance:

- `get_outliers`: returns a Spark DataFrame containing all rows considered outliers
- `remove_outliers`: returns a filtered Spark DataFrame where all outliers were removed

These methods consider only numeric columns. They use a threshold (default 99.9%) to compute the corresponding chi-square critical value, which is then used to filter the rows.
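
The general technique can be illustrated without Spark. The sketch below is not HandySpark's implementation: it assumes exactly two numeric columns (so the chi-square critical value has a closed form), uses the sample covariance, and compares the squared Mahalanobis distance against the critical value. The function names `chi2_critical_2dof`, `mahalanobis_sq` and `split_outliers` are hypothetical.

```python
import math


def chi2_critical_2dof(threshold=0.999):
    # For 2 degrees of freedom the chi-square CDF is 1 - exp(-x / 2),
    # so the critical value can be written in closed form.
    return -2.0 * math.log(1.0 - threshold)


def mahalanobis_sq(rows):
    """Squared Mahalanobis distance of each (x, y) row from the sample mean."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    # Entries of the sample covariance matrix (divisor n - 1).
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    det = var_x * var_y - cov_xy ** 2
    distances = []
    for x, y in rows:
        dx, dy = x - mean_x, y - mean_y
        # d^T S^-1 d, written out using the inverse of the 2x2 matrix S.
        distances.append(
            (var_y * dx * dx - 2.0 * cov_xy * dx * dy + var_x * dy * dy) / det
        )
    return distances


def split_outliers(rows, threshold=0.999):
    """Return (outliers, kept) using the chi-square critical value as cutoff."""
    critical = chi2_critical_2dof(threshold)
    d2 = mahalanobis_sq(rows)
    outliers = [r for r, d in zip(rows, d2) if d > critical]
    kept = [r for r, d in zip(rows, d2) if d <= critical]
    return outliers, kept


# A regular 8 x 5 grid of points plus one point far from the rest.
rows = [(float(x), float(y)) for x in range(8) for y in range(5)] + [(50.0, -50.0)]
outliers, kept = split_outliers(rows)
```

With more than two columns the critical value would normally come from a chi-square inverse CDF with as many degrees of freedom as there are columns.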

Binary classification metrics

The `BinaryClassificationMetrics` object was extended to take a Spark DataFrame (previously only an RDD), along with a `scoreCol` holding the vector of probabilities output by a classifier and a `labelCol` holding the true labels.

It exposes several methods that were not available in PySpark:
- `thresholds`
- `roc`
- `pr`
- `fMeasureByThreshold`
- `precisionByThreshold`
- `recallByThreshold`

It also implements some new methods:
- `getMetricsByThreshold`: returns a Spark DataFrame with all metrics (FPR, recall and precision) by threshold
- `confusionMatrix`: returns a DenseMatrix representing the confusion matrix for the given threshold
- `print_confusion_matrix`: returns a formatted pandas DataFrame with the confusion matrix
- `plot_roc_curve`
- `plot_pr_curve`
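
To make the per-threshold metrics concrete, here is a small plain-Python sketch of what a confusion matrix and a metrics-by-threshold table contain. It does not call HandySpark or Spark; the helper names `confusion_counts` and `metrics_by_threshold` are hypothetical, and it assumes a row is predicted positive when its score is greater than or equal to the threshold.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tn, fp, fn, tp) for binary labels at the given threshold."""
    tn = fp = fn = tp = 0
    for score, label in zip(scores, labels):
        # Assumption: score >= threshold means "predicted positive".
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def metrics_by_threshold(scores, labels):
    """One row of FPR, recall and precision per distinct score, descending."""
    rows = []
    for threshold in sorted(set(scores), reverse=True):
        tn, fp, fn, tp = confusion_counts(scores, labels, threshold)
        rows.append({
            "threshold": threshold,
            "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
            "recall": tp / (tp + fn) if (tp + fn) else 0.0,
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        })
    return rows


scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
table = metrics_by_threshold(scores, labels)
```

Plotting `fpr` against `recall` from such a table gives the points of a ROC curve, and `recall` against `precision` gives a precision-recall curve.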

Information Theory

`HandyColumns` objects now expose methods for computing entropy and mutual information:
- `entropy`: returns a pandas Series with the entropy of the given columns
- `mutual_info`: returns a pandas DataFrame with the mutual information between the given columns
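
For reference, both quantities can be computed from value counts alone. The sketch below uses plain Python lists rather than Spark columns and is not HandySpark's implementation; it assumes the natural logarithm (results in nats), which may differ from the base the library uses.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy of a sequence of discrete values, in nats."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_info(xs, ys):
    """Mutual information between two equally long discrete sequences, in nats."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


sex = ["female", "female", "male", "male"]
survived = [0, 1, 0, 1]  # independent of `sex` in this toy sample

h_sex = entropy(sex)
mi_self = mutual_info(sex, sex)
mi_independent = mutual_info(sex, survived)
```

A column's mutual information with itself equals its entropy, and two independent columns have zero mutual information, which is what the toy data above demonstrates.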
