Rulekit


1.7.1

1.7

1. Manually initializing RuleKit is no longer necessary.

Prior to this version, RuleKit had to be manually initialized using the `rulekit.RuleKit.init` method.

```python
from rulekit import RuleKit
from rulekit.classification import RuleClassifier

RuleKit.init()

clf = RuleClassifier()
clf.fit(X, y)
```

This is no longer necessary; you can simply use any of the RuleKit operators directly.

```python
from rulekit.classification import RuleClassifier

clf = RuleClassifier()
clf.fit(X, y)
```

2. Introducing negated conditions for nominal attributes in rules.

The new `complementary_conditions` parameter enables the induction of negated conditions for nominal attributes. Such conditions take the form **attribute = !{value}**. The parameter has been added to all operator classes.

```python
import pandas as pd
from rulekit.classification import RuleClassifier

df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/mushrooms.csv')
X = df.drop('type', axis=1)
y = df['type']

clf = RuleClassifier(complementary_conditions=True)
clf.fit(X, y)

for rule in clf.model.rules:
    print(rule)
```

> IF stalk_surface_below_ring = !{y} AND spore_print_color = !{u} AND odor = !{n} AND stalk_root = !{c} THEN type = {p}
> IF bruises = {f} AND odor = !{n} THEN type = {p}
> IF stalk_surface_above_ring = {k} AND gill_spacing = {c} THEN type = {p}
> IF bruises = {f} AND stalk_surface_above_ring = !{f} AND stalk_surface_below_ring = !{f} AND ring_number = !{t} AND stalk_root = !{e} AND gill_attachment = {f} THEN type = {p}
> IF stalk_surface_below_ring = !{f} AND stalk_color_below_ring = !{n} AND spore_print_color = !{u} AND odor = !{a} AND gill_size = {n} AND cap_surface = !{f} THEN type = {p}
> IF cap_shape = !{s} AND cap_color = !{c} AND habitat = !{w} AND stalk_color_below_ring = !{g} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{n} AND gill_spacing = {c} AND gill_color = !{u} AND stalk_root = !{c} AND stalk_color_above_ring = !{g} AND ring_type = !{f} AND veil_color = {w} THEN type = {p}
> IF cap_shape = !{c} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
> IF cap_color = !{y} AND cap_shape = !{c} AND stalk_color_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
> IF spore_print_color = !{r} AND odor = !{f} AND stalk_color_above_ring = !{c} AND gill_size = {b} THEN type = {e}
> IF cap_color = !{p} AND cap_shape = !{c} AND habitat = !{u} AND stalk_color_below_ring = !{y} AND gill_color = !{b} AND spore_print_color = !{r} AND ring_number = !{n} AND odor = !{f} AND cap_surface = !{g} THEN type = {e}

3. Approximate induction for classification rulesets.

To reduce training time on classification datasets, so-called *approximate induction* can now be used. It stops the algorithm from checking every possible numerical condition during the rule induction phase; instead, you can configure the number of bins used as candidate splits to limit the computation.

To enable *approximate induction*, use the `approximate_induction` parameter. To configure the maximum number of bins, use the `approximate_bins_count` parameter. At the moment, *approximate induction* is only available for classification rule sets.

The following example shows how this feature can reduce training time without sacrificing predictive accuracy.

```python
import pandas as pd
from rulekit.classification import RuleClassifier
from sklearn.metrics import balanced_accuracy_score

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
df_test = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/test.parquet')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']
X_test = df_test.drop('class', axis=1)
y_test = df_test['class']

clf1 = RuleClassifier()
clf2 = RuleClassifier(approximate_induction=True, approximate_bins_count=20)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

pd.DataFrame([
    {
        'Variant': 'Without approximate induction',
        'Training time [s]': clf1.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf1.predict(X_test)),
    },
    {
        'Variant': 'With approximate induction',
        'Training time [s]': clf2.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf2.predict(X_test)),
    }
])
```

| Variant | Training time [s] | BAcc on test dataset |
|-------------------------------|-------------------|----------------------|
| Without approximate induction | 5.730046 | 0.688744 |
| With approximate induction | 0.142259 | 0.703959 |


4. Observing and stopping the training process

You can now observe the progress of the training process and stop it at a chosen point. To do this, create a class extending `events.RuleInductionProgressListener` and implement one or more of the following methods:
* `on_new_rule(self, rule)`: called when a new rule has been induced.
* `on_progress(self, total_examples_count: int, uncovered_examples_count: int)`: reports training progress as the number of examples still uncovered relative to the total number of training examples. The ratio **uncovered_examples_count / total_examples_count** can be taken as a rough approximation of the remaining work. Keep in mind, however, that the ruleset will not cover all training examples in every scenario, and progress will most likely not increase linearly.
* `should_stop(self) -> bool`: stops the training process at a given point. If it returns **True**, training is stopped and you can proceed to use the partially trained model.

Then register your listener on the operator instance using the `add_event_listener` method. All operators support this method.

An example of the use of this mechanism is shown below.

```python
import pandas as pd
from rulekit.events import RuleInductionProgressListener
from rulekit.rules import Rule
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']


class MyProgressListener(RuleInductionProgressListener):
    _uncovered_examples_count: int = None
    _should_stop = False

    def on_new_rule(self, rule: Rule):
        pass

    def on_progress(
        self,
        total_examples_count: int,
        uncovered_examples_count: int
    ):
        # Stop once fewer than 10% of the training examples remain uncovered.
        if uncovered_examples_count < total_examples_count * 0.1:
            self._should_stop = True

    def should_stop(self) -> bool:
        return self._should_stop


clf = RuleClassifier()
clf.add_event_listener(MyProgressListener())
clf.fit(X_train, y_train)

for rule in clf.model.rules:
    print(rule)
```

> IF number_customer_service_calls = (-inf, 3.50) AND account_length = (-inf, 224.50) AND total_day_minutes = (-inf, 239.95) AND international_plan = {no} THEN class = {no}
> IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 254.05) AND international_plan = {no} THEN class = {no}
> IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 255) AND international_plan = {no} THEN class = {no}
> IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 263.25) AND international_plan = {no} THEN class = {no}
> IF total_intl_calls = (-inf, 19.50) AND number_customer_service_calls = (-inf, 3.50) AND total_eve_minutes = (-inf, 346.20) AND total_intl_minutes = (-inf, 19.85) AND total_day_calls = (-inf, 154) AND total_day_minutes = (-inf, 263.25) THEN class = {no}
> IF number_customer_service_calls = (-inf, 4.50) AND total_day_minutes = <1.30, 254.05) AND international_plan = {no} THEN class = {no}
> IF voice_mail_plan = {no} AND total_eve_minutes = <175.35, inf) AND total_day_calls = (-inf, 149) AND total_day_minutes = <263.25, inf) AND total_night_minutes = <115.85, inf) THEN class = {yes}

To make it easier to stop learning after a certain number of rules have been induced, the `max_rule_count` parameter has been added to the operators.
> In classification, this parameter denotes the maximum number of rules for each class present in the training dataset.
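
A minimal sketch of using this parameter (the limit of 5 is illustrative; it assumes the churn training data loaded in the previous example):

```python
from rulekit.classification import RuleClassifier

# Induce at most 5 rules per class, then stop.
clf = RuleClassifier(max_rule_count=5)
clf.fit(X_train, y_train)
```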

5. Faster regression rule induction

Prior to version 1.7.0, regression rule induction in this package was inherently slow due to the calculation of median values. The new `mean_based_regression` parameter enables faster regression that uses mean values instead of medians. See the example below.

```python
import pandas as pd
from rulekit.regression import RuleRegressor

df_train = pd.read_csv('./housing.csv')

X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

reg = RuleRegressor(mean_based_regression=True)
reg.fit(X_train, y_train)
```


6. Contrast set mining

The package now includes an algorithm for contrast set (CS) identification [Gudyś et al., 2022](https://doi.org/10.48550/arXiv.2204.00497).

The following operators were introduced (a usage sketch follows the list):

* `classification.ContrastSetRuleClassifier`
* `regression.ContrastSetRuleRegressor`
* `survival.ContrastSetSurvivalRules`
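
A minimal sketch of how such an operator might be constructed and trained, assuming it exposes the same scikit-learn-like `fit` interface as the other RuleKit operators; the exact signature (for example, how the contrast group attribute is specified) may differ, so consult the package documentation:

```python
import pandas as pd
from rulekit.classification import ContrastSetRuleClassifier

# Reuse the mushrooms dataset from the complementary-conditions example.
df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/mushrooms.csv')
X = df.drop('type', axis=1)
y = df['type']

# Assumption: same fit/inspect interface as the other operators.
cs_clf = ContrastSetRuleClassifier()
cs_clf.fit(X, y)

for rule in cs_clf.model.rules:
    print(rule)
```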

Other changes

* New parameter `control_apriori_precision` added to the `classification.RuleClassifier` and `classification.ExpertRuleClassifier` operators. If enabled, the induction checks whether a candidate rule's precision is higher than the apriori precision of the class under test. Enabled by default (a sketch follows this list).

* Improved validation of parameter types.

* Fixed Issue [10](https://github.com/adaa-polsl/RuleKit-python/issues/10) - *predict_proba method does not work for ExpertRuleClassifier*

* Updated dependency versions.
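
A minimal sketch of disabling the apriori precision check mentioned above (the parameter is `True` by default):

```python
from rulekit.classification import RuleClassifier

# Disable the apriori precision check during classification rule induction.
clf = RuleClassifier(control_apriori_precision=False)
```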

Deprecations

* Renamed the `min_rule_covered` parameter to `minsupp_new`. The old name is deprecated and will remain as an alias until the next major version, producing only a warning, as shown below.
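
A minimal sketch of the rename (the value 5 is illustrative):

```python
from rulekit.classification import RuleClassifier

# Preferred: the new parameter name.
clf = RuleClassifier(minsupp_new=5)

# Deprecated alias: still accepted until the next major version,
# but emits a deprecation warning.
clf_old = RuleClassifier(min_rule_covered=5)
```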

1.7.0

* Classification:
  * Significantly faster growing (two orders of magnitude for sets with >100k instances) and faster pruning.
  * Added approximate mode (`approximate_induction` parameter). Note: this is an experimental feature - the results may change in future releases.
* Regression:
  * Mean-based growing set as the default (a few times faster than the median-based variant, with a non-significant impact on accuracy).
* Survival:
  * Faster growing and pruning (a few-fold improvement).

1.6.0

Added new model hyperparameters: `select_best_candidate` and `max_uncovered_fraction`.
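
A minimal sketch of setting them (the values are illustrative; `select_best_candidate` is assumed to be a boolean switch and `max_uncovered_fraction` a fraction in [0, 1]):

```python
from rulekit.classification import RuleClassifier

clf = RuleClassifier(
    select_best_candidate=True,   # assumed boolean switch
    max_uncovered_fraction=0.05,  # assumed maximum fraction of examples left uncovered
)
```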

1.5.7

1.5.2

Changes from the previous release:
* Added fast induction of regression rules based on the mean instead of the median (`mean_based_regression`).
* Added a boolean parameter for disabling apriori precision control (`control_apriori_precision`).
