CatBoost

Latest version: v1.2.7


0.24

New functionality
* We've finally implemented MVS sampling for GPU training and switched the default bootstrap algorithm to MVS for the RMSE loss function when training on GPU
* Implemented near-zero cost model deserialization from a memory blob. Currently, if your model doesn't use categorical feature CTR counters or text features, you can deserialize the model from, for example, a memory-mapped file.
* Added the ability to load trained models from a binary string or a file-like stream. To load a model from a bytes string use `load_model(blob=b'....')`; to deserialize from a file-like stream use `load_model(stream=gzip.open('model.cbm.gz', 'rb'))` (see the sketch after this list)
* Fixed auto-learning rate estimation params for GPU
* Supported beta parameter for QuerySoftMax function on CPU and GPU
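
A minimal sketch of the new loading paths, assuming a model has previously been saved to the illustrative files `model.cbm` and `model.cbm.gz`:

```python
import gzip

from catboost import CatBoost

model = CatBoost()

# Load from an in-memory bytes blob (e.g. read from a memory-mapped file)
with open('model.cbm', 'rb') as f:
    model.load_model(blob=f.read())

# Load from a file-like stream, e.g. a gzip-compressed model file
model.load_model(stream=gzip.open('model.cbm.gz', 'rb'))
```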

New losses and metrics
* New loss function RMSEWithUncertainty - it allows estimating data uncertainty for trained regression models. The trained model gives a two-element vector for each object, with the first element as the regression model prediction and the second element as an estimation of data uncertainty for that prediction.
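
A minimal sketch of training with this loss on synthetic data; per the description above, predictions contain two values per object:

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = X[:, 0] + 0.1 * rng.standard_normal(1000)

model = CatBoostRegressor(loss_function='RMSEWithUncertainty',
                          iterations=200, verbose=False)
model.fit(X, y)

# Each row is [regression prediction, estimated data uncertainty]
preds = model.predict(X)
point_estimate, data_uncertainty = preds[:, 0], preds[:, 1]
```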

Speedups
* Major speedups for CPU training: kdd98 -9%, higgs -18%, msrank -28%. We would like to recognize the Intel software engineering team's contributions to the CatBoost project. This was a mutually beneficial activity, and we look forward to continuing joint cooperation.

Bugfixes:
* Fixed CatBoost model export as Python code
* Fixed AUC metric creation
* Added text features to `model.feature_names_` (issue 1314)
* Allowed using models trained on datasets with NaN values (Min treatment) together with models trained without NaNs in `model_sum()` or as the base model in `init_model=` (issue 1271)

Educational materials
* Published new [tutorial](https://github.com/catboost/catboost/blob/master/catboost/tutorials/categorical_features/categorical_features_parameters.ipynb) on categorical features parameters. Thanks garkavem

0.23.2

New functionality
* Added `plot_partial_dependence` method in the python-package (currently it works only for models with symmetric trees trained on datasets with numerical features only). Implemented by felixandrer.
* Allowed using the `boost_from_average` option together with the `model_shrink_rate` option. In this case shrinkage is applied to the starting value.
* Added new `auto_class_weights` option in the python-package, R-package and CLI with possible values `Balanced` and `SqrtBalanced`. For `Balanced`, every class is weighted `maxSumWeightInClass / sumWeightInClass`, where `sumWeightInClass` is the sum of weights of all samples in this class (if no weights are present, each sample has weight 1) and `maxSumWeightInClass` is the maximum of these sums over all classes. For `SqrtBalanced` the formula is `sqrt(maxSumWeightInClass / sumWeightInClass)`. This option is supported in binclass and multiclass tasks (see the sketch after this list). Implemented by egiby.
* Supported the `model_size_reg` option on GPU. Set to 0.5 by default (same as on CPU). This regularization works slightly differently on GPU: feature combinations are regularized more aggressively than on CPU. On CPU the cost of a combination equals the number of different values of this combination that are present in the training dataset. On GPU the cost of a combination equals the number of all possible different values of this combination. For example, if the combination contains two categorical features c1 and c2, the cost will be (number of categories in c1) * (number of categories in c2), even though many of the values of this combination might not be present in the dataset.
* Added calculation of exact Shapley values (see formula (2) from https://arxiv.org/pdf/1802.03888.pdf). By default the estimation from this paper (Algorithm 2) is calculated, which is much faster. To use the exact mode, specify the `shap_calc_type` parameter of the `CatBoost.get_feature_importance` function as `"Exact"`. Implemented by LordProtoss.
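
As an illustration of the `Balanced` and `SqrtBalanced` formulas above, here is a small sketch that computes the weights by hand on toy labels and passes the equivalent option to the classifier (the labels are illustrative):

```python
from collections import Counter

import numpy as np
from catboost import CatBoostClassifier

y = ['a', 'a', 'a', 'b', 'b', 'c']  # toy labels, all sample weights equal to 1

# Balanced: weight(class) = maxSumWeightInClass / sumWeightInClass
counts = Counter(y)
max_sum_weight = max(counts.values())
balanced = {cls: max_sum_weight / cnt for cls, cnt in counts.items()}
sqrt_balanced = {cls: np.sqrt(w) for cls, w in balanced.items()}
print(balanced)        # {'a': 1.0, 'b': 1.5, 'c': 3.0}
print(sqrt_balanced)

# The same weighting applied automatically during training
model = CatBoostClassifier(auto_class_weights='Balanced',
                           iterations=50, verbose=False)
```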

Bugfixes:
* Fixed ONNX converter for old ONNX versions.

0.23.1

New functionality
* A CatBoost model can now be easily converted into an ONNX object in Python with the `catboost.utils.convert_to_onnx_object` method (see the sketch after this list). Implemented by monkey0head
* We now print metric options together with metric names as the metric description in error logs by default. This allows you to distinguish between metrics of the same type with different parameters. For example, if the user sets the weighted-average `TotalF1` metric, CatBoost will print `TotalF1:average=Weighted` as the corresponding metric column header in error logs. Implemented by ivanychev
* Implemented PRAUC metric (issue 737). Thanks azikmsu
* It's now possible to write a custom multiregression objective in Python. Thanks azikmsu
* Supported nonsymmetric models export to PMML
* The `class_weights` parameter now accepts a dictionary mapping class names to class weights
* Added `_get_tags()` method for compatibility with sklearn (issue 1282). Implemented by crazyleg
* Lots of improvements in the .NET CatBoost library: implemented the IDisposable interface, split the ML.NET-compatible and basic prediction classes into separate libraries, added base UNIX compatibility, supported GPU model evaluation, fixed tests. Thanks khanova
* In addition to `first_feature_use_penalties` presented in the previous release, we added a new option `per_object_feature_penalties`, which considers feature usage on each object individually. For more details refer to the [tutorial](https://github.com/catboost/catboost/blob/master/catboost/tutorials/feature_penalties/feature_penalties.ipynb).
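
A minimal sketch of the ONNX conversion on toy data:

```python
import numpy as np
from catboost import CatBoostClassifier
from catboost.utils import convert_to_onnx_object

X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y)

# Convert the trained model into an in-memory ONNX model object,
# which can then be saved or served with standard ONNX tooling.
onnx_model = convert_to_onnx_object(model)
```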

Breaking changes
* From now on an explicit `loss_function` param is required in the python `cv` method (see the sketch below).
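
A minimal sketch of a `cv` call under the new requirement, with illustrative data and parameters:

```python
import numpy as np
from catboost import Pool, cv

X = np.random.rand(500, 6)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

params = {
    'loss_function': 'Logloss',  # must now be specified explicitly
    'iterations': 100,
    'verbose': False,
}
cv_results = cv(Pool(X, y), params, fold_count=3)
```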

Bugfixes:
* Fixed deprecation warning on import (issue 1269)
* Fixed saved models logging_level/verbose parameters conflict (issue 696)
* Fixed kappa metric: in some cases there was integer overflow; switched accumulation types to double
* Fixed per float feature quantization settings defaults

Educational materials
* Extended shap values [tutorial](https://github.com/catboost/tutorials/blob/master/model_analysis/shap_values_tutorial.ipynb) with summary plot examples. Thanks azanovivan02

0.23

New functionality

* It is now possible to train models on huge datasets that do not fit into CPU RAM.
This can be accomplished by storing only quantized data in memory (it is many times smaller). Use the `catboost.utils.quantize` function to create a quantized `Pool` this way. See a usage example in issue 1116.
Implemented by noxwell.
* Python Pool class now has a `save_quantization_borders` method that allows saving the resulting borders into a [file](https://catboost.ai/docs/concepts/output-data_custom-borders.html) and using them for quantization of other datasets. Quantization can be a bottleneck of training, especially on GPU. Doing quantization once for several trainings can significantly reduce running time. For large datasets it is recommended to perform quantization first, save the quantization borders, use them to quantize the validation dataset, and then use the quantized training and validation datasets for further training.
Use saved borders when quantizing other Pools by specifying the `input_borders` parameter of the `quantize` method (see the sketch after this list).
Implemented by noxwell.
* Text features are supported on CPU
* It is now possible to set `border_count` > 255 for GPU training. This might be useful if you have a "golden feature", see [docs](https://catboost.ai/docs/concepts/parameter-tuning.html#golden-features).
* Feature weights are implemented.
Specify weights for specific features by index or name like `feature_weights="FeatureName1:1.5,FeatureName2:0.5"`.
Scores for splits with these features will be multiplied by the corresponding weights.
Implemented by Taube03.
* Feature penalties can be used for cost-efficient gradient boosting.
Penalties are specified in a similar fashion to feature weights, using the parameter `first_feature_use_penalties`.
This parameter penalizes the first usage of a feature and should be used when calculating the feature is costly.
The penalty value (the cost of using a feature) is subtracted from the scores of splits on this feature if the feature has not been used in the model yet.
After the feature has been used once, it is considered free, so no further subtraction is done.
There is also a common multiplier for all `first_feature_use_penalties`; it can be specified by the `penalties_coefficient` parameter.
Implemented by Taube03 (issue 1155)
* `recordCount` attribute is added to PMML models (issue 1026).
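
A sketch of the quantization workflow described above (quantize once, save the borders, reuse them for the validation set); the data, the file name, and the use of the in-place `Pool.quantize()` method with default arguments are assumptions for illustration:

```python
import numpy as np
from catboost import Pool, CatBoostRegressor

X_train, y_train = np.random.rand(10000, 10), np.random.rand(10000)
X_valid, y_valid = np.random.rand(2000, 10), np.random.rand(2000)

# Quantize the training data and save the resulting borders
train_pool = Pool(X_train, y_train)
train_pool.quantize()
train_pool.save_quantization_borders('borders.tsv')

# Reuse the saved borders when quantizing the validation data
valid_pool = Pool(X_valid, y_valid)
valid_pool.quantize(input_borders='borders.tsv')

model = CatBoostRegressor(iterations=100, verbose=False)
model.fit(train_pool, eval_set=valid_pool)
```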

New losses and metrics

* New ranking objective 'StochasticRank', details in [paper](https://arxiv.org/abs/2003.02122).
* `Tweedie` loss is supported now. It can be a good solution for a right-skewed target with many zero values, see the [tutorial](https://github.com/catboost/tutorials/blob/master/regression/tweedie.ipynb) and the sketch after this list.
When using the `CatBoostRegressor.predict` function, the default `prediction_type` for this loss will be equal to `Exponent`. Implemented by ilya-pchelintsev (issue 577)
* Classification metrics now support a new parameter `proba_border`. With this parameter you can set the decision boundary for treating a prediction as negative or positive. Implemented by ivanychev.
* Metric `TotalF1` supports a new parameter `average` with possible values `weighted`, `micro`, `macro`. Implemented by ilya-pchelintsev.
* It is now possible to specify a custom multi-label metric in python. Note that it is only possible to calculate this metric and use it as `eval_metric`. It is not possible to use it as an optimization objective.
To write a multi-label metric, you need to define a python class which inherits from the `MultiLabelCustomMetric` class. Implemented by azikmsu.
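
A minimal sketch of using the Tweedie loss on a zero-inflated, right-skewed target; the `variance_power` value and the data are illustrative:

```python
import numpy as np
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.random((1000, 5))
# Right-skewed target with many zero values, as described above
y = np.where(rng.random(1000) < 0.6, 0.0, rng.gamma(2.0, 2.0, 1000))

model = CatBoostRegressor(loss_function='Tweedie:variance_power=1.5',
                          iterations=200, verbose=False)
model.fit(X, y)

# For this loss the default prediction_type is Exponent
preds = model.predict(X)
```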

Improvements of grid and randomized search

* `class_weights` parameter is now supported in grid/randomized search (see the sketch after this list). Implemented by vazgenk.
* Invalid option configurations are automatically skipped during grid/randomized search. Implemented by borzunov.
* `get_best_score` returns train/validation best score after grid/randomized search (in case of refit=False). Implemented by rednevaler.
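
A sketch of a grid search over `class_weights` with `refit=False`; the grid values and data are illustrative:

```python
import numpy as np
from catboost import CatBoostClassifier

X = np.random.rand(500, 4)
y = (X[:, 0] > 0.7).astype(int)

model = CatBoostClassifier(iterations=100, verbose=False)
grid = {
    'learning_rate': [0.03, 0.1],
    'class_weights': [[1.0, 1.0], [1.0, 3.0]],
}
result = model.grid_search(grid, X=X, y=y, refit=False)

# With refit=False, the best validation score is still available
print(model.get_best_score())
```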

Improvements of model analysis tools

* Added computation of SHAP interaction values for CatBoost models. You can pass `type=EFstrType.ShapInteractionValues` to `CatBoost.get_feature_importance` to get a matrix of SHAP values for every prediction (see the sketch after this list).
By default, SHAP interaction values are calculated for all features. You may specify features of interest using the `interaction_indices` argument.
Implemented by IvanKozlov98.
* SHAP values can be calculated approximately now which is much faster than default mode. To use this mode specify `shap_calc_type` parameter of `CatBoost.get_feature_importance` function as `"Approximate"`. Implemented by LordProtoss (issue 1146).
* `PredictionDiff` model analysis method can now be used with models that contain non symmetric trees. Implemented by felixandrer.
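
A sketch of both model-analysis additions above on a toy dataset:

```python
import numpy as np
from catboost import CatBoostClassifier, EFstrType, Pool

X = np.random.rand(300, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
pool = Pool(X, y)

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(pool)

# SHAP interaction values: a matrix of SHAP values for every prediction
interactions = model.get_feature_importance(pool, type=EFstrType.ShapInteractionValues)

# Approximate SHAP values, much faster than the default mode
shap_approx = model.get_feature_importance(pool, type=EFstrType.ShapValues,
                                           shap_calc_type='Approximate')
```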

New educational materials

* A [tutorial](https://github.com/catboost/tutorials/blob/master/regression/tweedie.ipynb) on Tweedie regression
* A [tutorial](https://github.com/catboost/tutorials/blob/master/regression/poisson.ipynb) on Poisson regression
* A detailed [tutorial](https://github.com/catboost/tutorials/blob/master/metrics/AUC_tutorial.ipynb) on different types of the AUC metric, which explains how different types of AUC can be used for binary classification, multiclass classification and ranking tasks.

Breaking changes

* When using `CatBoostRegressor.predict` function for models trained with `Poisson` loss, default `prediction_type` will be equal to `Exponent` (issue 1184). Implemented by garkavem.

This release also contains bug fixes and performance improvements, including a major speedup for sparse data on GPU.

0.22

New features:
- The main feature of the release is the support of non symmetric trees for training on CPU.
Using non symmetric trees might be useful if one-hot encoding is present, or data has little noise.
To try non symmetric trees, change the [``grow_policy`` parameter](https://catboost.ai/docs/concepts/parameter-tuning.html#tree-growing-policy) (see the sketch after this list).
Starting from this release non symmetric trees are supported for both CPU and GPU training.
- The next big feature improves CatBoost's text feature support.
Tokenization is now done during training; you don't have to do lowercasing, digit extraction and other tokenization on your own, CatBoost does it for you.
- Auto learning-rate is now supported in CPU MultiClass mode.
- CatBoost class supports ``to_regressor`` and ``to_classifier`` methods.
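
A minimal sketch of switching the tree growing policy; the parameter values are illustrative:

```python
from catboost import CatBoostClassifier

# Non-symmetric trees are enabled via the grow_policy parameter
model = CatBoostClassifier(
    grow_policy='Lossguide',  # or 'Depthwise'; the default 'SymmetricTree' keeps symmetric trees
    max_leaves=31,            # leaf limit used by the Lossguide policy
    iterations=100,
    verbose=False,
)
```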

The release also contains a list of bug fixes.

0.21

New features:
- The main feature of this release is the Stochastic Gradient Langevin Boosting (SGLB) mode that can improve the quality of your models with non-convex loss functions. To use it, specify the ``langevin`` option and tune ``diffusion_temperature`` and ``model_shrink_rate``. See [the corresponding paper](https://arxiv.org/abs/2001.07248) for details.
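
A minimal sketch of enabling SGLB; the parameter values are illustrative and would need tuning for a real task:

```python
import numpy as np
from catboost import CatBoostClassifier

X = np.random.rand(1000, 8)
y = (X[:, 0] > 0.5).astype(int)

model = CatBoostClassifier(
    langevin=True,                # enable Stochastic Gradient Langevin Boosting
    diffusion_temperature=10000,  # illustrative value
    model_shrink_rate=0.001,      # illustrative value
    iterations=200,
    verbose=False,
)
model.fit(X, y)
```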

Improvements:

- Automatic learning rate is applied by default not only for ``Logloss`` objective, but also for ``RMSE`` (on CPU and GPU) and ``MultiClass`` (on GPU).
- Class label type information is now stored in the model. Estimators in the python package return values of the proper type in the ``classes_`` attribute and for prediction functions with ``prediction_type=Class``. Issues 305, 999, 1017.
Note: Class labels loaded from datasets in [CatBoost dsv format](https://catboost.ai/docs/concepts/input-data_values-file.html) always have string type now.

Bug fixes:
- Fixed huge memory consumption for text features (issue 1107)
- Fixed crash on GPU on big datasets with groups (hundred million+ groups).
- Fixed class labels consistency check and merging in model sums (now class names in binary classification are properly checked and added to the result as well)
- Fix for confusion matrix (PR 1152), thanks to dmsivkov.
- Fixed shap values calculation when ``boost_from_average=True`` (issue 1125)
- Fixed use-after-free in fstr PredictionValuesChange with specified dataset
- Target border and class weights are now taken from model when necessary for feature strength, metrics evaluation, roc_curve, object importances and calc_feature_statistics calculations.
- Fixed that L2 regularization was not applied for non symmetric trees for binary classification on GPU.
- [R-package] Fixed the bug that ``catboost.get_feature_importance`` did not work after the model is loaded (issue 1064)
- [R-package] Fixed the bug that ``catboost.train`` did not work when called with a single dataset parameter (issue 1162)
- Fixed L2 score calculation on CPU

Other:

- Starting from this release, the Java applier is released simultaneously with other components and has the same version.

Compatibility:

- Models trained with this release require applier from this release or later to work correctly.

