1. Manually initializing RuleKit is no longer necessary.
Prior to this version, RuleKit had to be manually initialized using the `rulekit.RuleKit.init` method.
```python
from rulekit import RuleKit
from rulekit.classification import RuleClassifier

RuleKit.init()  # manual initialization, required before version 1.7.0

clf = RuleClassifier()
clf.fit(X, y)  # X, y: training features and labels
```
This step is no longer necessary; you can now use any of the RuleKit operators directly.
```python
from rulekit.classification import RuleClassifier

clf = RuleClassifier()
clf.fit(X, y)  # no RuleKit.init() call needed anymore
```
2. Introducing negated conditions for nominal attributes in rules.
The new `complementary_conditions` parameter enables the induction of negated conditions for nominal attributes. Such conditions take the form **attribute = !{value}** and match every example whose attribute value differs from *value*. The parameter has been added to all operator classes.
```python
import pandas as pd

from rulekit.classification import RuleClassifier

df = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/mushrooms.csv')
X = df.drop('type', axis=1)
y = df['type']

clf = RuleClassifier(complementary_conditions=True)
clf.fit(X, y)

for rule in clf.model.rules:
    print(rule)
```
```
IF stalk_surface_below_ring = !{y} AND spore_print_color = !{u} AND odor = !{n} AND stalk_root = !{c} THEN type = {p}
IF bruises = {f} AND odor = !{n} THEN type = {p}
IF stalk_surface_above_ring = {k} AND gill_spacing = {c} THEN type = {p}
IF bruises = {f} AND stalk_surface_above_ring = !{f} AND stalk_surface_below_ring = !{f} AND ring_number = !{t} AND stalk_root = !{e} AND gill_attachment = {f} THEN type = {p}
IF stalk_surface_below_ring = !{f} AND stalk_color_below_ring = !{n} AND spore_print_color = !{u} AND odor = !{a} AND gill_size = {n} AND cap_surface = !{f} THEN type = {p}
IF cap_shape = !{s} AND cap_color = !{c} AND habitat = !{w} AND stalk_color_below_ring = !{g} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{n} AND gill_spacing = {c} AND gill_color = !{u} AND stalk_root = !{c} AND stalk_color_above_ring = !{g} AND ring_type = !{f} AND veil_color = {w} THEN type = {p}
IF cap_shape = !{c} AND stalk_surface_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF cap_color = !{y} AND cap_shape = !{c} AND stalk_color_below_ring = !{y} AND spore_print_color = !{r} AND odor = {n} AND cap_surface = !{g} THEN type = {e}
IF spore_print_color = !{r} AND odor = !{f} AND stalk_color_above_ring = !{c} AND gill_size = {b} THEN type = {e}
IF cap_color = !{p} AND cap_shape = !{c} AND habitat = !{u} AND stalk_color_below_ring = !{y} AND gill_color = !{b} AND spore_print_color = !{r} AND ring_number = !{n} AND odor = !{f} AND cap_surface = !{g} THEN type = {e}
```
3. Approximate induction for classification rule sets.
To reduce training time on classification datasets, you can now use so-called *approximate induction*. Instead of checking all possible numerical conditions during the rule induction phase, the algorithm considers only a limited number of bins as candidate splits; you can configure how many bins are used.
To enable *approximate induction*, use the `approximate_induction` parameter. To configure the maximum number of bins, use the `approximate_bins_count` parameter. At the moment, *approximate induction* is only available for classification rule sets.
The following example shows how using this function can reduce training time without sacrificing predictive accuracy.
```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
df_test = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/test.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']
X_test = df_test.drop('class', axis=1)
y_test = df_test['class']

clf1 = RuleClassifier()
clf2 = RuleClassifier(approximate_induction=True, approximate_bins_count=20)
clf1.fit(X_train, y_train)
clf2.fit(X_train, y_train)

pd.DataFrame([
    {
        'Variant': 'Without approximate induction',
        'Training time [s]': clf1.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf1.predict(X_test)),
    },
    {
        'Variant': 'With approximate induction',
        'Training time [s]': clf2.model.total_time,
        'BAcc on test dataset': balanced_accuracy_score(y_test, clf2.predict(X_test)),
    }
])
```
| Variant | Training time [s] | BAcc on test dataset |
|-------------------------------|-------------------|----------------------|
| Without approximate induction | 5.730046 | 0.688744 |
| With approximate induction | 0.142259 | 0.703959 |
4. Observing and stopping the training process
You can now observe the progress of the training process and stop it at a chosen point. To do this, create a class extending the `events.RuleInductionProgressListener` class. Such a class should implement at least one of the following methods:
* `on_new_rule(self, rule)`: Called whenever a new rule is induced.
* `on_progress(self, total_examples_count: int, uncovered_examples_count: int)`: Called to report training progress, i.e. how many examples are still uncovered relative to the total number of training examples. The ratio **uncovered_examples_count / total_examples_count** can be taken as a rough (inverse) measure of progress. Keep in mind, however, that the rule set will not cover all training examples in every scenario, and progress is unlikely to increase linearly.
* `should_stop(self) -> bool`: Called to decide whether to stop training at a given point. If it returns **True**, training is stopped; you can then use the partially trained model.
Then register your listener on the operator instance using the `add_event_listener` method. All operators support this method.
An example of the use of this mechanism is shown below.
```python
import pandas as pd

from rulekit.events import RuleInductionProgressListener
from rulekit.rules import Rule
from rulekit.classification import RuleClassifier

df_train = pd.read_parquet('https://github.com/cezary986/classification_tabular_datasets/raw/main/churn/train_test/train.parquet')
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']


class MyProgressListener(RuleInductionProgressListener):
    _uncovered_examples_count: int = None
    _should_stop = False

    def on_new_rule(self, rule: Rule):
        pass

    def on_progress(
        self,
        total_examples_count: int,
        uncovered_examples_count: int
    ):
        # stop once less than 10% of the training examples remain uncovered
        if uncovered_examples_count < total_examples_count * 0.1:
            self._should_stop = True

    def should_stop(self) -> bool:
        return self._should_stop


clf = RuleClassifier()
clf.add_event_listener(MyProgressListener())
clf.fit(X_train, y_train)

for rule in clf.model.rules:
    print(rule)
```
```
IF number_customer_service_calls = (-inf, 3.50) AND account_length = (-inf, 224.50) AND total_day_minutes = (-inf, 239.95) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 254.05) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 255) AND international_plan = {no} THEN class = {no}
IF number_customer_service_calls = (-inf, 3.50) AND total_day_minutes = (-inf, 263.25) AND international_plan = {no} THEN class = {no}
IF total_intl_calls = (-inf, 19.50) AND number_customer_service_calls = (-inf, 3.50) AND total_eve_minutes = (-inf, 346.20) AND total_intl_minutes = (-inf, 19.85) AND total_day_calls = (-inf, 154) AND total_day_minutes = (-inf, 263.25) THEN class = {no}
IF number_customer_service_calls = (-inf, 4.50) AND total_day_minutes = <1.30, 254.05) AND international_plan = {no} THEN class = {no}
IF voice_mail_plan = {no} AND total_eve_minutes = <175.35, inf) AND total_day_calls = (-inf, 149) AND total_day_minutes = <263.25, inf) AND total_night_minutes = <115.85, inf) THEN class = {yes}
```
To make it easier to stop learning after a certain number of rules have been induced, a `max_rule_count` parameter has been added to the operators.
> For classification, this parameter denotes the maximum number of rules induced for each class present in the training dataset.
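For example, a minimal sketch limiting the classifier to at most 5 rules per class (`X_train` and `y_train` as in the example above; the value 5 is purely illustrative):
```python
from rulekit.classification import RuleClassifier

# At most 5 rules will be induced for each class present in y_train.
clf = RuleClassifier(max_rule_count=5)
clf.fit(X_train, y_train)
```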
5. Faster regression rule induction
Prior to version 1.7.0, regression rule induction in this package was inherently slow due to the calculation of median values. The new `mean_based_regression` parameter enables faster regression that uses mean values instead of medians. See the example below.
```python
import pandas as pd

from rulekit.regression import RuleRegressor

df_train = pd.read_csv('./housing.csv')  # local dataset file
X_train = df_train.drop('class', axis=1)
y_train = df_train['class']

reg = RuleRegressor(mean_based_regression=True)
reg.fit(X_train, y_train)
```
6. Contrast set mining
The package now includes an algorithm for contrast set (CS) identification [Gudyś et al., 2022](https://doi.org/10.48550/arXiv.2204.00497).
The following operators were introduced (a usage sketch follows the list):
* `classification.ContrastSetRuleClassifier`
* `regression.ContrastSetRuleRegressor`
* `survival.ContrastSetSurvivalRules`
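A rough usage sketch, assuming the operators follow the same scikit-learn-style `fit`/`predict` interface as the other RuleKit operators and that the column defining the contrast groups is passed to `fit` via a `contrast_attribute` argument (an assumption; consult the package documentation for the exact signature):
```python
import pandas as pd

from rulekit.classification import ContrastSetRuleClassifier

# Hypothetical dataset: 'group' is the nominal attribute defining the
# contrast sets, 'class' is the decision attribute.
df = pd.read_csv('./my_dataset.csv')  # placeholder path
X = df.drop('class', axis=1)
y = df['class']

clf = ContrastSetRuleClassifier()
# 'contrast_attribute' is assumed here, not confirmed by this changelog.
clf.fit(X, y, contrast_attribute='group')

for rule in clf.model.rules:
    print(rule)
```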
Other changes
* New parameter `control_apriori_precision` added to the `classification.RuleClassifier` and `classification.ExpertRuleClassifier` operators. When enabled, the algorithm checks during classification rule induction whether a candidate's precision is higher than the a priori precision of the class under test. Enabled by default; a minimal sketch of disabling it follows this list.
* Improved validation of parameter types.
* Fixed issue [#10](https://github.com/adaa-polsl/RuleKit-python/issues/10): *predict_proba method does not work for ExpertRuleClassifier*.
* Updated dependency versions.
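A minimal sketch of turning the a priori precision check off (`X_train` and `y_train` as in the earlier examples):
```python
from rulekit.classification import RuleClassifier

# Disable the a priori precision check introduced in this release;
# it is enabled by default.
clf = RuleClassifier(control_apriori_precision=False)
clf.fit(X_train, y_train)
```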
Deprecations
* Renamed the `min_rule_covered` parameter to `minsupp_new`. The old name is deprecated but will be kept as an alias until the next major version, producing only a warning.
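Both spellings are accepted for now; the old one emits a deprecation warning (the value 5 is purely illustrative):
```python
from rulekit.classification import RuleClassifier

clf = RuleClassifier(minsupp_new=5)       # new parameter name
clf = RuleClassifier(min_rule_covered=5)  # deprecated alias, emits a warning
```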