Sklearn2pmml

Latest version: v0.116.2

0.105.2

Breaking changes

None.

New features

None.

Minor improvements and fixes

* Improved support for categorical encoding over mixed datatype column sets.

Scikit-Learn transformers such as `OneHotEncoder`, `OrdinalEncoder` and `TargetEncoder` can be applied to several columns in one go.
Previously it was assumed that all columns shared the same data type. If that assumption was violated in practice, they were all force-cast to the `string` data type.

The JPMML-SkLearn library now detects and maintains the data type on a single column basis.

* Made Category-Encoders classes directly exportable to PMML.

For example, training and exporting a `BaseNEncoder` transformer into a PMML document for manual analysis and interpretation purposes:

python
from category_encoders import BaseNEncoder
from sklearn2pmml import sklearn2pmml

transformer = BaseNEncoder(base = 3)
transformer.fit(X, y = None)

sklearn2pmml(transformer, "Base3Encoder.pmml")
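
For readers interpreting the exported document by hand, the general idea behind base-3 encoding can be sketched in plain Python. This is a minimal illustration only: the helper `to_base_n`, the example categories and the assumption that ordinal codes start at 1 are all hypothetical, and the exact column layout produced by Category-Encoders may differ.

```python
def to_base_n(index: int, base: int, width: int) -> list[int]:
    """Return the digits of a non-negative integer in the given base, most significant first."""
    if index < 0 or base < 2:
        raise ValueError("index must be non-negative and base must be at least 2")
    digits = []
    for _ in range(width):
        digits.append(index % base)
        index //= base
    if index:
        raise ValueError("width is too small for this index")
    return digits[::-1]

# Hypothetical categories; assumption: ordinal codes are assigned from 1 upwards
categories = ["red", "green", "blue", "yellow", "purple"]
codes = {category: code for code, category in enumerate(categories, start=1)}

# Two base-3 digits are enough for codes 0 through 8
encoded = {category: to_base_n(code, base=3, width=2) for category, code in codes.items()}
for category, digits in encoded.items():
    print(category, digits)
```

Each category ends up as a short vector of base-3 digits, which is why a base-N encoder needs far fewer columns than one-hot encoding for high-cardinality features.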

* Fixed support for `(category_encoders.utils.)BaseEncoder.feature_names_in_` attribute.

According to [SLEP007](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html), the value of a `feature_names_in_` attribute should be an array of strings.

Category-Encoders transformers are using a list of strings instead.

* Refactored `ExpressionClassifier` and `ExpressionRegressor` constructors.

The evaluatable object can now also be a string literal.

0.105.1

Breaking changes

None.

New features

* Added support for [`sklearn.preprocessing.TargetEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html) class.

* Added support for [`sklearn.preprocessing.SplineTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.SplineTransformer.html) class.

The `SplineTransformer` class computes a B-spline for a feature, which is then used to expand the feature into new features that correspond to B-spline basis elements.

This class is not suitable for simple feature and prediction scaling purposes (eg. calibration of computed probabilities).
Consider using the `sklearn2pmml.preprocessing.BSplineTransformer` class in such a situation.

* Added support for [`statsmodels.api.QuantReg`](https://www.statsmodels.org/dev/generated/statsmodels.regression.quantile_regression.QuantReg.html) class.

* Added `input_float` conversion option.

Scikit-Learn tree and tree ensemble models prepare their inputs by first casting them to `(numpy.)float32`, and then to `(numpy.)float64` (exactly so, even if the input value already happened to be of `(numpy.)float64` data type).

PMML does not provide effective means for implementing "chained casts"; the chain must be broken down into elementary cast operations, each of which is represented using a standalone `DerivedField` element.
For example, preparing the "Sepal.Length" field of the iris dataset:

xml
<PMML>
<DataDictionary>
<DataField name="Sepal.Length" optype="continuous" dataType="double">
<Interval closure="closedClosed" leftMargin="4.3" rightMargin="7.9"/>
</DataField>
</DataDictionary>
<TransformationDictionary>
<DerivedField name="float(Sepal.Length)" optype="continuous" dataType="float">
<FieldRef field="Sepal.Length"/>
</DerivedField>
<DerivedField name="double(float(Sepal.Length))" optype="continuous" dataType="double">
<FieldRef field="float(Sepal.Length)"/>
</DerivedField>
</TransformationDictionary>
</PMML>

Activating the `input_float` conversion option:

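python
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# iris_X and iris_y hold the iris dataset, loaded elsewhere
pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)

# Default mode
pipeline.configure(input_float = False)
sklearn2pmml(pipeline, "DecisionTree-default.pmml")

# "Input float" mode
pipeline.configure(input_float = True)
sklearn2pmml(pipeline, "DecisionTree-input_float.pmml")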

This conversion option updates the data type of the "Sepal.Length" data field from `double` to `float`, thereby eliminating the need for the first `DerivedField` element of the two:

xml
<PMML>
<DataDictionary>
<DataField name="Sepal.Length" optype="continuous" dataType="float">
<Interval closure="closedClosed" leftMargin="4.300000190734863" rightMargin="7.900000095367432"/>
</DataField>
</DataDictionary>
<TransformationDictionary>
<DerivedField name="double(Sepal.Length)" optype="continuous" dataType="double">
<FieldRef field="Sepal.Length"/>
</DerivedField>
</TransformationDictionary>
</PMML>

Changing the data type of a field may have side effects if the field contributes to more than one feature.
The effectiveness and safety of configuration options should be verified by integration testing.
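
The shifted interval margins (`4.3` becoming `4.300000190734863`) are a direct consequence of rounding a `double` value to `float` precision and widening it back. A minimal sketch using only the standard library reproduces this; the helper name `to_float32` is an assumption made for illustration and is not part of any library mentioned here.

```python
import struct

def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float, returned as a 64-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

left_margin = to_float32(4.3)
right_margin = to_float32(7.9)
print(left_margin, right_margin)

# The round trip is not an identity operation, so a split comparison
# can evaluate differently before and after the cast
print(4.3 <= 4.3, left_margin <= 4.3)
```

The last line shows why the cast chain matters for tree models: a value that satisfies `x <= 4.3` in `double` precision no longer does after the `float` cast.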

* Added `H2OEstimator.pmml_classes_` attribute.

This attribute allows customizing target category levels.
It comes in handy when working with ordinal targets, where the H2O.ai framework requires that target category levels are encoded from their original representation to integer index representation.

A fitted H2O.ai ordinal classifier predicts integer indices, which must be manually decoded in the application layer.
The JPMML-SkLearn library is able to "erase" this encode-decode helper step from the workflow, resulting in a clean and efficient PMML document:

python
from h2o.estimators import H2OGeneralizedLinearEstimator
from sklearn2pmml import sklearn2pmml

ordinal_classifier = H2OGeneralizedLinearEstimator(family = "ordinal")
ordinal_classifier.fit(...)

# Customize target category levels
# Note that the default lexicographic ordering of labels is different from their intended ordering
ordinal_classifier.pmml_classes_ = ["bad", "poor", "fair", "good", "excellent"]

sklearn2pmml(ordinal_classifier, "OrdinalClassifier.pmml")
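
The encode-decode helper step that this attribute makes unnecessary can be sketched in plain Python. The training labels and the predicted indices below are made up for illustration; only the list of category levels comes from the example above.

```python
# Intended (ordinal) ordering of the target category levels
intended = ["bad", "poor", "fair", "good", "excellent"]

# The default lexicographic ordering scrambles the ordinal scale
lexicographic = sorted(intended)
print(lexicographic)

# Manual encoding: labels to integer indices, following the intended ordering
encode = {label: index for index, label in enumerate(intended)}
y = ["good", "bad", "excellent", "fair"]  # hypothetical training labels
y_encoded = [encode[label] for label in y]

# Manual decoding: integer predictions back to labels in the application layer
predicted_indices = [3, 0, 4, 2]  # hypothetical model output
decoded = [intended[index] for index in predicted_indices]
print(y_encoded, decoded)
```

With `pmml_classes_` set, the PMML document predicts the labels directly, so neither mapping has to be maintained outside the model.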

Minor improvements and fixes

* Fixed the categorical encoding of missing values.

This bug manifested itself when the input column was mixing different data type values.
For example, a sparse string column, where non-missing values are strings, and missing values are floating-point `numpy.NaN` values.

Scikit-Learn documentation warns against mixing string and numeric values within a single column, but it can happen inadvertently when reading a sparse dataset into a Pandas' DataFrame using standard library functions (eg. the [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function).

* Added Pandas to package dependencies.

See [SkLearn2PMML-418](https://github.com/jpmml/sklearn2pmml/issues/418)

* Ensured compatibility with H2O.ai 3.46.0.1.

* Ensured compatibility with BorutaPy 0.3.post0 (92e4b4e).

0.105.0

Breaking changes

None.

New features

* Added `Domain.n_features_in_` and `Domain.feature_names_in_` attributes.

This brings domain decorators to conformance with "physical" Scikit-Learn input inspection standards such as [SLEP007](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html) and [SLEP010](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html).

Domain decorators are natively about "logical" input inspection (ie. establishing and enforcing model's applicability domain).

By combining these two complementary areas of functionality, they now make a great **first** step for any pipeline:

python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn2pmml.decoration import ContinuousDomain

iris_X, iris_y = load_iris(return_X_y = True, as_frame = True)

pipeline = Pipeline([
    # Collect column-oriented model's applicability domain
    ("domain", ContinuousDomain()),
    ("classifier", ...)
])
pipeline.fit(iris_X, iris_y)

# Dynamic properties, delegate to (the attributes of-) the first step
print(pipeline.n_features_in_)
print(pipeline.feature_names_in_)

* Added `MultiDomain.n_features_in_` and `MultiDomain.feature_names_in_` attributes.

* Added support for missing values in tree and tree ensemble models.

Scikit-Learn 1.3 extended the `Tree` data structure with a `missing_go_to_left` field.
This field indicates the default split direction for each split, and is always present and populated whether the training dataset actually contained any missing values or not.

As a result, Scikit-Learn 1.3 tree models are able to accept and make predictions on sparse datasets, even if they were trained on a fully dense dataset.
There is currently no mechanism for a data scientist to tag tree models as "can or cannot be used with missing values".

The JPMML-SkLearn library implements two `Tree` data structure conversion modes, which can be toggled using the `allow_missing` conversion option.
The default mode corresponds to Scikit-Learn 0.18 through 1.2 behaviour, where a missing input causes the evaluation process to immediately bail out with a missing prediction.
The "missing allowed" mode corresponds to Scikit-Learn 1.3 and newer behaviour, where a missing input is ignored, and the evaluation proceeds to the pre-defined child branch until a final non-missing prediction is reached.

Right now, the data scientist must activate the latter mode manually, by configuring `allow_missing = True`:

python
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("classifier", DecisionTreeClassifier())
])
pipeline.fit(X, y)

# Default mode
pipeline.configure(allow_missing = False)
sklearn2pmml(pipeline, "DecisionTree-default.pmml")

# "Missing allowed" mode
pipeline.configure(allow_missing = True)
sklearn2pmml(pipeline, "DecisionTree-missing_allowed.pmml")

Both conversion modes generate standard PMML markup.
However, the "missing allowed" mode results in slightly bigger PMML documents (say, up to 10-15%), because the default split direction is encoded using extra `Node@defaultChild` and `Node@id` attributes.
The size difference disappears when the tree model is compacted.
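
The difference between the two evaluation modes can be sketched with a toy tree in plain Python. The `Leaf` and `Split` classes and the `evaluate` function are hypothetical names invented for this illustration; they mirror the described semantics only, not the actual Scikit-Learn `Tree` structure or the JPMML-SkLearn converter.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Leaf:
    prediction: str

@dataclass
class Split:
    feature: str
    threshold: float
    left: "Node"
    right: "Node"
    missing_go_to_left: bool  # default split direction for a missing input

Node = Union[Leaf, Split]

def evaluate(node: Node, row: dict, allow_missing: bool = False) -> Optional[str]:
    """Walk the tree; a missing input is represented by None or an absent key."""
    while isinstance(node, Split):
        value = row.get(node.feature)
        if value is None:
            if not allow_missing:
                # Default mode: bail out with a missing prediction
                return None
            # "Missing allowed" mode: follow the pre-defined child branch
            node = node.left if node.missing_go_to_left else node.right
        else:
            node = node.left if value <= node.threshold else node.right
    return node.prediction

tree = Split("Petal.Length", 2.45,
    Leaf("setosa"),
    Split("Petal.Width", 1.75, Leaf("versicolor"), Leaf("virginica"), missing_go_to_left = False),
    missing_go_to_left = True)

row = {"Petal.Length": 5.0, "Petal.Width": None}
print(evaluate(tree, row))
print(evaluate(tree, row, allow_missing = True))
```

In the default mode the second split has nothing to compare, so the result is a missing prediction; in the "missing allowed" mode the same row is routed to the right-hand child.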

* Added support for nullable Pandas' scalar data types.

If the dataset contains sparse columns, then they should be cast from the default NumPy `object` data type to the most appropriate nullable Pandas' scalar data type. The cast may be performed using a data type object (eg. `pandas.BooleanDtype`, `pandas.Int64Dtype`, `pandas.Float32Dtype`) or its string alias (eg. `boolean`, `Int64`, `Float32`).

This kind of "type hinting" is instrumental to generating high(er) quality PMML documents.
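
A minimal sketch of such a cast, assuming a small made-up DataFrame with one sparse boolean column and one sparse integer column. Note that the Pandas string alias for `pandas.BooleanDtype` is the lowercase `boolean`.

```python
import pandas as pd

# Hypothetical dataset; None marks a missing value
df = pd.DataFrame({
    "flag": [True, None, False],
    "count": [1, None, 3],
})
# Without hints, the missing values degrade both columns to generic data types
print(df.dtypes)

# Cast using a data type object for one column and a string alias for the other
typed = df.astype({"flag": pd.BooleanDtype(), "count": "Int64"})
print(typed.dtypes)
```

After the cast, the columns keep their logical boolean and integer types while still carrying the missing values.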

Minor improvements and fixes

* Added `ExpressionRegressor.normalization_method` attribute.

This attribute allows performing the most common normalizations atomically.

The list of supported values is `none` and `exp`.

* Refactored `ExpressionClassifier.normalization_method` attribute.

The list of supported values is `none`, `logit`, `simplemax` and `softmax`.
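
What these normalization methods compute can be sketched in plain Python. The formulas below follow the usual PMML `normalizationMethod` definitions as commonly stated; the function names are hypothetical, and how `ExpressionClassifier` applies a method to individual classes is not specified in these notes, so the sketch simply treats its input as a list of raw class scores.

```python
import math

def normalize_regression(value: float, method: str = "none") -> float:
    """Normalization of a single raw regression value."""
    if method == "none":
        return value
    if method == "exp":
        return math.exp(value)
    raise ValueError(f"Unsupported method: {method}")

def normalize_classification(values: list[float], method: str = "none") -> list[float]:
    """Normalization of raw per-class values."""
    if method == "none":
        return list(values)
    if method == "logit":
        # Inverse logit, applied to each value independently
        return [1.0 / (1.0 + math.exp(-value)) for value in values]
    if method == "simplemax":
        total = sum(values)
        return [value / total for value in values]
    if method == "softmax":
        exps = [math.exp(value) for value in values]
        total = sum(exps)
        return [exp / total for exp in exps]
    raise ValueError(f"Unsupported method: {method}")

print(normalize_regression(0.0, "exp"))
print(normalize_classification([1.0, 3.0], "simplemax"))
print(normalize_classification([0.0, 0.0], "softmax"))
```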

* Fixed the formatting of non-finite tree split values.

It is possible that some tree splits perform comparisons against positive infinity to indicate "always true" and "always false" conditions (eg. `x <= +Inf` and `x > +Inf`, respectively).

Previously, infinite values were formatted using Java's default formatting method, which resulted in Java-style `-Infinity` and `Infinity` string literals.
They are now detected and replaced with PMML-style `-INF` and `INF` (case insensitive) string literals, respectively.
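
The replacement rule can be sketched as a small formatting helper. The function name `format_pmml_number` is an assumption made for illustration; the actual fix lives in the Java converter, not in Python code.

```python
import math

def format_pmml_number(value: float) -> str:
    """Format a float using XML Schema style literals for non-finite values."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    return repr(value)

for value in (1.5, float("inf"), float("-inf")):
    print(format_pmml_number(value))
```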

* Ensured compatibility with CHAID 5.4.1.

0.104.1

Breaking changes

* Removed `sklearn2pmml.ensemble.OrdinalClassifier` class.

Uses of this class should be replaced with the `sklego.meta.OrdinalClassifier` class (see below), which implements exactly the same algorithm, and offers extra functionality such as calibration and parallelized fitting.

New features

* Added support for `sklego.meta.OrdinalClassifier` class.

python
from pandas import CategoricalDtype, Series
from sklearn.linear_model import LogisticRegression
from sklego.meta import OrdinalClassifier

# A proper ordinal target
y_bin = Series(_bin(y), dtype = CategoricalDtype(categories = [...], ordered = True), name = "bin(y)")

classifier = OrdinalClassifier(LogisticRegression(), use_calibration = True, ...)
# Map categories from objects to integer codes
classifier.fit(X, (y_bin.cat).codes.values)

# Store the categories mapping:
# the `OrdinalClassifier.classes_` attribute holds integer codes,
# and the `OrdinalClassifier.pmml_classes_` holds the corresponding objects
classifier.pmml_classes_ = y_bin.dtype.categories

See [Scikit-Lego-607](https://github.com/koaning/scikit-lego/issues/607)

Minor improvements and fixes

* Removed the SkLearn-Pandas package from installation requirements.

The `sklearn_pandas.DataFrameMapper` meta-transformer is giving way to the `sklearn.compose.ColumnTransformer` meta-transformer in most common pipelines.

* Fixed the base-N encoding of missing values.

This bug manifested itself when missing values were assigned to a category of their own.

This bug was discovered when rebuilding integration tests with Category-Encoders 2.6(.3).
It is currently unclear if the base-N encoding algorithm had its behaviour changed between Category-Encoders 2.5 and 2.6 development lines.

In any case, when using SkLearn2PMML 0.104.1 or newer, it is advisable to upgrade to Category-Encoders 2.6.0 or newer.

* Ensured compatibility with Category-Encoders 2.6.3, Imbalanced-Learn 0.12.0, OptBinning 0.19.0 and Scikit-Lego 0.7.4.

0.104.0

Breaking changes

* Updated Scikit-Learn installation requirement from `0.18+` to `1.0+`.

This change helps the SkLearn2PMML package to better cope with breaking changes in Scikit-Learn APIs.
The underlying [JPMML-SkLearn](https://github.com/jpmml/jpmml-sklearn) library retains the maximum version coverage, because it is dealing with Scikit-Learn serialized state (Pickle/Joblib or Dill), which is considerably more stable.

New features

* Added support for Scikit-Learn 1.4.X.

The JPMML-SkLearn library integration tests were rebuilt with Scikit-Learn `1.4.0` and `1.4.1.post1` versions.
All supported transformers and estimators passed cleanly.

See [SkLearn2PMML-409](https://github.com/jpmml/sklearn2pmml/issues/409) and [JPMML-SkLearn-195](https://github.com/jpmml/jpmml-sklearn/issues/195)

* Added support for `BaseHistGradientBoosting._preprocessor` attribute.

This attribute gets initialized automatically if a `HistGradientBoostingClassifier` or `HistGradientBoostingRegressor` estimator is fitted on a dataset that contains categorical features.

In Scikit-Learn 1.0 through 1.3 it is necessary to pre-process categorical features manually.
The indices of (ordinally-) encoded columns must be tracked and passed to the estimator using the `categorical_features` parameter:

python
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain

mapper = DataFrameMapper(
    [([cont_col], ContinuousDomain()) for cont_col in cont_cols] +
    [([cat_col], [CategoricalDomain(), OrdinalEncoder()]) for cat_col in cat_cols]
)

regressor = HistGradientBoostingRegressor(categorical_features = [...])

pipeline = Pipeline([
    ("mapper", mapper),
    ("regressor", regressor)
])
pipeline.fit(X, y)

In Scikit-Learn 1.4, this workflow simplifies to the following:

python
# Activate full Pandas' support by specifying `input_df = True` and `df_out = True`
mapper = DataFrameMapper(
    [([cont_col], ContinuousDomain()) for cont_col in cont_cols] +
    [([cat_col], CategoricalDomain(dtype = "category")) for cat_col in cat_cols]
, input_df = True, df_out = True)

# Auto-detect categorical features by their data type
regressor = HistGradientBoostingRegressor(categorical_features = "from_dtype")

pipeline = Pipeline([
    ("mapper", mapper),
    ("regressor", regressor)
])
pipeline.fit(X, y)

# Print out feature type information
# This list should contain one or more `True` values
print(pipeline._final_estimator.is_categorical_)

Minor improvements and fixes

* Improved support for `ColumnTransformer.transformers` attribute.

Columns can now be selected using dense boolean arrays.

0.103.3

Breaking changes

* Refactored the `PMMLPipeline.customize(customizations: [str])` method into `PMMLPipeline.customize(command: str, xpath_expr: str, pmml_element: str)`.

This method may be invoked any number of times.
Each invocation appends a `sklearn2pmml.customization.Customization` object to the `pmml_customizations_` attribute of the final estimator step.

The `command` argument is one of SQL-inspired keywords `insert`, `update` or `delete` (to insert a new element, or to update or delete an existing element, respectively).
The `xpath_expr` is an XML Path (XPath) expression for pinpointing the action site. The XPath expression is evaluated relative to the main model element.
The `pmml_element` is a PMML fragment string.

For example, suppressing the secondary results by deleting the `Output` element:

python
pipeline = PMMLPipeline([
    ("classifier", ...)
])
pipeline.fit(X, y)
pipeline.customize(command = "delete", xpath_expr = "//:Output")
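
The effect of such a `delete` customization can be illustrated on a hand-written PMML fragment using the standard library. This is a sketch of the outcome only, not of how the customization is applied internally; the fragment, the `TreeModel` choice and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"

# Hypothetical, heavily abbreviated PMML document
document = f"""<PMML xmlns="{PMML_NS}" version="4.4">
  <TreeModel functionName="classification">
    <MiningSchema/>
    <Output>
      <OutputField name="probability(0)" optype="continuous" dataType="double" feature="probability" value="0"/>
    </Output>
    <Node/>
  </TreeModel>
</PMML>"""

root = ET.fromstring(document)
# The action site is located relative to the main model element
model = root.find(f"{{{PMML_NS}}}TreeModel")

# Collect (parent, child) pairs first, then remove, to avoid mutating while iterating
output_tag = f"{{{PMML_NS}}}Output"
targets = [(parent, child) for parent in model.iter() for child in parent if child.tag == output_tag]
for parent, child in targets:
    parent.remove(child)

print([child.tag.split("}")[1] for child in model])
```

The model element keeps its `MiningSchema` and `Node` children, while the secondary results declared under `Output` are gone.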

New features

* Added `sklearn2pmml.metrics` module.

This module provides high-level `BinaryClassifierQuality`, `ClassifierQuality` and `RegressorQuality` classes for the automated generation of [`PredictiveModelQuality`](https://dmg.org/pmml/v4-4-1/ModelExplanation.html#xsdElement_PredictiveModelQuality) elements for most common estimator types.

Refactoring the v0.103.0 code example:

python
from sklearn2pmml.metrics import ModelExplanation, RegressorQuality

pipeline = PMMLPipeline([
    ("regressor", ...)
])
pipeline.fit(X, y)

model_explanation = ModelExplanation()
predictive_model_quality = RegressorQuality(pipeline, X, y, target_field = y.name) \
    .with_all_metrics()
model_explanation.append(predictive_model_quality)

pipeline.customize(command = "insert", pmml_element = model_explanation.tostring())

* Added `sklearn2pmml.util.pmml` module.

Minor improvements and fixes

* Added `EstimatorProxy.classes_` property.

* Extracted `sklearn2pmml.configuration` and `sklearn2pmml.customization` modules.
