Breaking changes
R package
- In the R package, the `target` parameter has been renamed to `label` in the [`save_pool()`](https://tech.yandex.com/catboost/doc/dg/concepts/r-reference_catboost-save_pool-docpage/) method.
Python package
- We don't support Python 3.4 anymore
- The CatBoostClassifier and CatBoostRegressor [`get_params()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_get_params-docpage/) method now returns only the params that were explicitly set when constructing the object. This means that CatBoostClassifier and CatBoostRegressor `get_params()` will not contain `loss_function` if it was not specified.
This also means that this code:
```python
model1 = CatBoostClassifier()
params = model1.get_params()
model2 = CatBoost(params)
```
will create `model2` with the default loss function (RMSE), not with Logloss.
This breaking change was made to conform to the sklearn interface, so that sklearn GridSearchCV can work. A possible workaround is sketched after this list.
- We've removed several attributes and turned them into functions; this was needed to avoid sklearn warnings (a short example follows this list):
`is_fitted_` => [`is_fitted()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_is_fitted-docpage/)
`metadata_` => [`get_metadata()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_catboostclassifier_metadata-docpage/)
- We removed the model file argument from the estimator constructor. This was also done to avoid sklearn warnings.
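If your code relied on the old `get_params()` behavior, a possible workaround (our suggestion, not part of the release itself) is to set the loss function explicitly so that it survives the round trip:

```python
from catboost import CatBoost, CatBoostClassifier

# Setting loss_function explicitly keeps it in get_params(), so copying the
# params into the generic CatBoost class preserves Logloss.
model1 = CatBoostClassifier(loss_function='Logloss')
params = model1.get_params()
model2 = CatBoost(params)  # trains with Logloss, not the RMSE default
```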
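A minimal sketch of the new function-style accessors (the toy data is only for illustration):

```python
from catboost import CatBoostClassifier

model = CatBoostClassifier(iterations=5, logging_level='Silent')
print(model.is_fitted())      # False; previously the is_fitted_ attribute

model.fit([[0, 1], [1, 0], [0, 0], [1, 1]], [0, 1, 0, 1])
print(model.is_fitted())      # True
print(model.get_metadata())   # previously the metadata_ attribute
```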
Educational materials
- We added a [tutorial](https://github.com/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb) for our ranking modes.
- We published our [slides](https://github.com/catboost/catboost/tree/master/slides); you are very welcome to use them.
Improvements
All
- It is now possible to save the model in JSON format (see the sketch after this list).
- We have added a Java interface for the CatBoost model.
- We now link statically with CUDA, so you don't have to install any particular version of CUDA to get CatBoost working on GPU.
- We implemented both multiclass modes on GPU; they are very fast.
- It is now possible to use multiclass with string labels; they will be inferred from the data.
- Added the `use_weights` parameter to [metrics](https://tech.yandex.com/catboost/doc/dg/concepts/loss-functions-docpage/). By default all metrics except AUC use weights, but you can disable this: to calculate a metric value without weights, set the parameter to false, for example `Accuracy:use_weights=false`. This can be done only for `custom_metric` or `eval_metric`, not for the objective function; the objective function always uses weights if they are present in the dataset. A usage sketch follows this list.
- We now save snapshots at time intervals. Training works much faster if you save a snapshot every 5 or 10 minutes instead of on every iteration.
- Reduced memory consumption by ranking modes.
- Added automatic feature importance evaluation after completion of GPU training.
- Nonexistent feature indices are now allowed in the ignored features list.
- Added [new metrics](https://tech.yandex.com/catboost/doc/dg/concepts/loss-functions-docpage/): `LogLikelihoodOfPrediction`, `RecallAt:top=k`, `PrecisionAt:top=k` and `MAP:top=k`.
- Improved quality for multiclass with weighted datasets.
- Pairwise modes now support automatic pairs generation (see [tutorial](https://github.com/catboost/tutorials/blob/master/ranking/ranking_tutorial.ipynb) for that).
- The `QueryAverage` metric has been renamed to the clearer `AverageGain`. This is a very important ranking metric: it shows the average target value in the top k documents of a group.
- Introduced the `best_model_min_trees` parameter: the minimal number of trees the best model should have.
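For the JSON model export mentioned above, a short sketch (assuming the `format` argument of `save_model`; the exact spelling may differ between versions):

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=10, logging_level='Silent')
model.fit([[1, 4], [2, 5], [3, 6], [4, 7]], [10, 20, 30, 40])

# Export the trained model as JSON instead of the default binary format.
model.save_model('model.json', format='json')
```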
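And a usage sketch for the `use_weights` flag (a toy setup; the metric spelling follows the docs linked above):

```python
from catboost import CatBoostClassifier, Pool

train = Pool([[0, 1], [1, 0], [0, 0], [1, 1]],
             label=[0, 1, 0, 1],
             weight=[1.0, 2.0, 1.0, 0.5])

# The Logloss objective always uses the weights; the additional Accuracy
# metric is asked to ignore them via use_weights=false.
model = CatBoostClassifier(iterations=10,
                           loss_function='Logloss',
                           custom_metric=['Accuracy:use_weights=false'],
                           logging_level='Silent')
model.fit(train)
```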
Python
- We now support sklearn GridSearchCV: you can pass categorical feature indices when constructing the estimator and then use it in GridSearchCV (a sketch follows this list).
- We added a new method to utils for building a ROC curve: [`get_roc_curve`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_get_roc_curve-docpage/).
- Added the [`get_gpu_device_count()`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_get_gpu_device_count-docpage/) method to the python package. This is a way to check whether your CUDA devices are available.
- We implemented automatic selection of the decision boundary using the ROC curve. You can select the best classification boundary given the maximum FPR or FNR that you allow for the model. Take a look at [`catboost.select_threshold(self, data=None, curve=None, FPR=None, FNR=None, thread_count=-1)`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_utils_select_threshold-docpage/). You can also calculate FPR and FNR for each boundary value. A combined sketch with `get_roc_curve` follows this list.
- We have added pool slicing: [`pool.slice(doc_indices)`](https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool_slice-docpage/) (a short sketch follows this list).
- GroupId and SubgroupId can now be specified as strings.
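A sketch of the GridSearchCV workflow on toy data (the data and parameter grid are placeholders):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import GridSearchCV

# Toy data: column 0 is a categorical feature encoded as integer ids.
X = np.array([[0, 10], [1, 20], [0, 30], [1, 40], [0, 50], [1, 60]])
y = np.array([0, 1, 0, 1, 0, 1])

# cat_features goes to the constructor, so GridSearchCV can clone the estimator.
estimator = CatBoostClassifier(cat_features=[0], iterations=10,
                               logging_level='Silent')
grid = GridSearchCV(estimator, param_grid={'depth': [4, 6]}, cv=2)
grid.fit(X, y)
print(grid.best_params_)
```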
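A sketch combining `get_roc_curve` and `select_threshold` from `catboost.utils` (toy data; exact argument names may vary slightly between versions):

```python
from catboost import CatBoostClassifier, Pool
from catboost.utils import get_roc_curve, select_threshold

train = Pool([[0, 1], [1, 0], [0, 0], [1, 1]], label=[0, 1, 0, 1])
model = CatBoostClassifier(iterations=20, logging_level='Silent')
model.fit(train)

# FPR, TPR and the boundary value for every threshold.
fpr, tpr, thresholds = get_roc_curve(model, train)

# Pick the boundary that keeps the false negative rate at or below 1%.
boundary = select_threshold(model, data=train, FNR=0.01)
print(boundary)
```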
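And a self-contained sketch of pool slicing:

```python
from catboost import Pool

pool = Pool([[1, 4], [2, 5], [3, 6]], label=[0, 1, 0])
subset = pool.slice([0, 2])   # new Pool with documents 0 and 2 only
print(subset.num_row())       # 2
```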
R package
- GPU support in the R package. Use the parameter `task_type='GPU'` to enable GPU training.
- Models in R can be saved and restored with R's native `save`/`load` or `saveRDS`/`readRDS`.
Speedups
- New way of loading data in Python using the [FeaturesData structure](https://tech.yandex.com/catboost/doc/dg/concepts/python-features-data__desc-docpage/). Using FeaturesData speeds up data loading for both training and prediction. It is especially important for prediction, because it gives around a 10 to 20 times Python prediction speedup (see the sketch after this list).
- Training multiclass on CPU: ~60% speedup
- Training of ranking modes on CPU: ~50% speedup
- Training of ranking modes on GPU: ~50% speedup for datasets with many features and not very many objects
- Speedups of metric calculation on GPU. Example of the speedup on one of our internal datasets: training with the AUC eval metric on a test dataset of 2 million objects sped up from 7 seconds to 0.2 seconds per iteration.
- Speedup of all training modes on CPU.
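A minimal sketch of the FeaturesData path mentioned in the first bullet (dtype requirements follow the linked docs: float32 for numerical features, bytes objects for categorical ones; details may vary by version):

```python
import numpy as np
from catboost import CatBoostClassifier, FeaturesData, Pool

# Numerical features as float32, categorical features as bytes objects.
train_data = FeaturesData(
    num_feature_data=np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0], [4.0, 7.0]],
                              dtype=np.float32),
    cat_feature_data=np.array([[b'a'], [b'b'], [b'a'], [b'b']], dtype=object)
)
train_pool = Pool(data=train_data, label=[0, 1, 0, 1])

model = CatBoostClassifier(iterations=10, logging_level='Silent')
model.fit(train_pool)

# Prediction also accepts FeaturesData directly, which is where the big speedup shows.
preds = model.predict(FeaturesData(
    num_feature_data=np.array([[2.5, 5.5]], dtype=np.float32),
    cat_feature_data=np.array([[b'a']], dtype=object)
))
print(preds)
```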
We also made many stability improvements, improved the usability of the library, added new parameter synonyms, and improved input data validation.
Thanks a lot to everyone who created issues on GitHub, and thanks a lot to our contributor pukhlyakova, who implemented many new useful metrics!