RuleXAI¶

In this notebook, the data from https://www.kaggle.com/c/titanic is analysed to show the advantages and possibilities of using the RuleXAI library for in-depth analysis of the dataset. It is a popular set, often used in various types of examples, therefore it was decided to use it in this analysis.

Overview¶

I. Initial data analysis and preprocesing
II. Use of a decision tree from sklearn
III. Analysis of the decision tree model from the previous point with RuleXAI
IV. Using the RuleKit library - a versatile tool for rule learning - to generate rule
V. Analysis with RuleXAI of rules derived with RuleKit
VI. Summary

I. Initial data analysis and preprocesing¶

1. Data load¶

The data used in this analysis comes from the kaggle competition (https://www.kaggle.com/c/titanic). Two datasets were published as part of this competition:
- training set (train.csv)
- test set (test.csv)

According to the competition rules: “The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features. The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.”

As the purpose of this analysis is to present the RuleXAI library not to take part in the competition, it was decided to use only the data contained in the training set in the further analysis. Therefore, the data from the train.csv file can be split into training and test data, so that it will be possible to evaluate the results obtained without participating in the competition.

[1]:

import pandas as pd

[4]:

dataset_path = "./data/titanic_kaggle.csv"
data = pd.read_csv(dataset_path)
data.head(5)

[4]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

2. Dataset overwiev¶

[3]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

[4]:

data[['PassengerId', 'Survived', 'Pclass']
     ] = data[['PassengerId', 'Survived', 'Pclass']].astype(str)

[5]:

numeric_data = data[['Age', 'SibSp', 'Parch', 'Fare']]
caterogical_data = data[['PassengerId', 'Survived', 'Pclass',
                         'Sex', 'Ticket', 'Cabin', 'Embarked']]

[6]:

numeric_data.describe()

[6]:

	Age	SibSp	Parch	Fare
count	714.000000	891.000000	891.000000	891.000000
mean	29.699118	0.523008	0.381594	32.204208
std	14.526497	1.102743	0.806057	49.693429
min	0.420000	0.000000	0.000000	0.000000
25%	20.125000	0.000000	0.000000	7.910400
50%	28.000000	0.000000	0.000000	14.454200
75%	38.000000	1.000000	0.000000	31.000000
max	80.000000	8.000000	6.000000	512.329200

[7]:

caterogical_data.describe()

[7]:

	PassengerId	Survived	Pclass	Sex	Ticket	Cabin	Embarked
count	891	891	891	891	891	204	889
unique	891	2	3	2	681	147	3
top	675	0	3	male	CA. 2343	C23 C25 C27	S
freq	1	549	491	577	7	4	644

3. Data preprocessing¶

In the first stage of data preprocessing it was decided to only remove the columns for PassengerId, Passenger Name, Ticket type and Cabin. Removing the PassengerId and Passenger Name columns is self-explanatory - in no way does PassengerId or Passenger Name have any bearing on whether a person survived. It would only be possible to derive passenger status from passenger name, as there are markings such as ‘Mr.’, ‘Mrs.’, ‘Miss.’, ‘Master.’. In case of tickets, the designations for most tickets vary - 681 unique values out of 891 occurrences. One could extract some information from the tickets from their designations (e.g., whether they begin with a number or a letter). However, historical data would need to be consulted to find out what the ticket designations mean. In the case of cabin designations, as many as 697 values are missing - for this reason it was decided to remove the entire column, as it carries too little information.

Of course, the preliminary data analysis and preprocessing stage itself could have been even more extensive - exploring the relationships between features, examining the impact of individual features, plotting graphs to better understand the data. However, the main purpose of this notebook is not to analyse a given set of data in detail, but only to show the possibilities of using the RuleXAI library. For this reason, some simplifications in the analysis have been decided.

[8]:

data.drop(["PassengerId", "Name", "Ticket", "Cabin"], axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
data.head(5)

[8]:

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked
0	0	3	male	22.0	1	7.2500	S
1	1	1	female	38.0	1	71.2833	C
2	1	3	female	26.0	0	7.9250	S
3	1	1	female	35.0	1	53.1000	S
4	0	3	male	35.0	0	8.0500	S

II. Use of a decision tree from sklearn¶

In the first stage it was decided to use the decision tree for classification, which is available in the sklearn package and which is also supported by the RuleXAI library.

1. Data preparation for decision tree¶

Since the decision tree algorithm does not support missing values and only operates on numeric data, it was necessary to fill in missing values (for numeric data the median was used, and for categorical data the mode, which is the most frequent value) and dummify. The numerical data could also be rescaled - however, it was decided not to do so to facilitate further analysis, which will be seen later.

[9]:

data.Age = data.Age.fillna(data.Age.median())
data.Embarked = data.Embarked.fillna(data.Embarked.mode())

data_dummies = pd.get_dummies(data.drop(["Survived"], axis=1))
data_dummies_scaled = data_dummies.copy()

X = data_dummies_scaled
y = data.Survived

2. Data split for training and test datasets¶

[10]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, )

3. Building and testing the model¶

A simple decision tree model with default parameters was used, since the main goal is not to get the best possible results, but only to show the use of the RuleXAI library.

[11]:

from sklearn.model_selection import cross_val_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=1, max_depth=5)
cv = cross_val_score(dt, X_train, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.69121658 0.82526142 0.73472757 0.72491639 0.8157748 ]
0.7583793541088122

[12]:

dt.fit(X_train, y_train)

[12]:

DecisionTreeClassifier(max_depth=5, random_state=1)

[13]:

balanced_accuracy_score(y_test, dt.predict(X_test))

[13]:

0.7807425259654559

4. Determination of the feature importance from the decision tree in the sklearn package¶

[14]:

features_importances = pd.DataFrame(columns=["Feature", "Importance"])
features_importances.Feature = X_train.columns
features_importances.Importance = dt.feature_importances_
features_importances = features_importances.sort_values(
    by=["Importance"], ascending=False)
features_importances.reset_index(inplace=True, drop=True)

features_importances

[14]:

	Feature	Importance
0	Sex_female	0.516327
1	Pclass_3	0.158578
2	Age	0.120043
3	Fare	0.119806
4	SibSp	0.041863
5	Embarked_C	0.028040
6	Pclass_1	0.006695
7	Pclass_2	0.006262
8	Parch	0.002385
9	Sex_male	0.000000
10	Embarked_Q	0.000000
11	Embarked_S	0.000000

Looking at the importance of the features obtained directly from the decision tree in sklearn, it can be seen that the feature that most distinguished whether someone survived or not was the gender of the person, and more precisely whether the person was a woman or not. However, there is no information on whether the fact that a person was a woman caused survival or death. Knowing the context of the data, it can be concluded that men were gentlemen and they let women go first, therefore it can be concluded that if the person was a woman, she had a better chance of survival. The second most important feature that distinguished between survivors and non-survivors was whether the person was traveling in 3rd class. Assuming that the women were saved first, the next selection criterion would be in which class someone traveled. If it was 3rd class, it can be concluded that he was saved last, so there is a high probability that he did not survive. Another feature distinguishing whether or not a person survived was the age of the person. It can be assumed that children and the elderly people were saved from people in their prime, who had a better chance of surviving in difficult conditions. Another important feature was the amount of fare - it can be concluded that people who paid more for the ticket, traveled in a better class and in better conditions. This coincides with the fact that the survivors and the non-survivors also distinguished very well whether they were traveling in 3rd class or not.

We can see that the obtained validities of the features coincide with what can be inferred from the historical data and the context of the event. However, without the context, it would be difficult to conclude which feature indicates which class - on the basis of the presented importance of features, one could only obtain information about the separation of classes, and not about which class a given feature indicates.

5. Model generation based on top 50% of features¶

[15]:

features_number = 0.5 * features_importances.shape[0]

[16]:

X_train_sklearn_features = X_train[features_importances.loc[0:features_number-1, "Feature"]]
X_test_sklearn_features = X_test[features_importances.loc[0:features_number-1, "Feature"]]

X_train_sklearn_features

[16]:

	Sex_female	Pclass_3	Age	Fare	SibSp	Embarked_C
445	0	0	4.0	81.8583	0	0
650	0	1	28.0	7.8958	0	0
172	1	1	1.0	11.1333	1	0
450	0	0	36.0	27.7500	1	0
314	0	0	43.0	26.2500	1	0
...	...	...	...	...	...	...
106	1	1	21.0	7.6500	0	0
270	0	0	28.0	31.0000	0	0
860	0	1	41.0	14.1083	2	0
435	1	0	14.0	120.0000	1	0
102	0	0	21.0	77.2875	0	0

623 rows × 6 columns

[20]:

dt_sklearn_features = DecisionTreeClassifier(random_state=1, max_depth=5)
cv = cross_val_score(dt_sklearn_features,
                     X_train_sklearn_features, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.74440807 0.83791965 0.73018712 0.7703456  0.79598662]
0.7757694124203101

[21]:

dt_sklearn_features.fit(X_train_sklearn_features, y_train)

[21]:

DecisionTreeClassifier(max_depth=5, random_state=1)

[22]:

balanced_accuracy_score(y_test, dt_sklearn_features.predict(X_test_sklearn_features))

[22]:

0.7839272393412521

Looking at the results obtained on the training set (in cross-validation) and the test set, one can see that selecting only the most important features according to the ranking obtained with sklearn improved the results obtained. Indeed, the selected features have the greatest impact in distinguishing whether a person survived or not.

III. Analysis of the decision tree model from the previous point with RuleXAI¶

1. RuleXAI initialisation¶

[23]:

from rulexai.explainer import RuleExplainer
explainer = RuleExplainer(model=dt, X=X_train, y=y_train, type="classification")
explainer.explain()

[23]:

<rulexai.explainer.RuleExplainer at 0x29d6719f8b0>

2. Presentation of the rules derived from the decision tree¶

[24]:

rules = explainer.get_rules()
for rule in rules:
    print(rule)

IF Sex_female = (-inf, 0.5> AND Age = (6.5, 77.0> AND Fare = (-inf, 52.277099609375> AND Pclass_1 = (-inf, 0.5> THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (2.5, 49.5> AND Fare = (-inf, 149.035400390625> THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (-inf, 15.372900009155273> AND Age = (-inf, 36.5> THEN Survived = {1}
IF Sex_female = (-inf, 0.5> AND Age = (6.5, 77.0> AND Fare = (-inf, 52.277099609375> AND Pclass_1 = (0.5, inf) THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (15.372900009155273, 23.350000381469727> AND Age = (-inf, 36.5> THEN Survived = {1}
IF Sex_female = (-inf, 0.5> AND Age = (6.5, inf) AND Fare = (59.08749961853027, inf) AND Embarked_C = (-inf, 0.5> THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (23.350000381469727, inf) AND Age = (7.0, inf) THEN Survived = {0}
IF Sex_female = (-inf, 0.5> AND Age = (6.5, inf) AND Fare = (59.08749961853027, inf) AND Embarked_C = (0.5, inf) THEN Survived = {1}
IF Sex_female = (-inf, 0.5> AND Age = (-inf, 6.5> AND SibSp = (-inf, 3.0> THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (49.5, inf) AND Fare = (-inf, 149.035400390625> THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (2.5, inf) AND Fare = (152.5062484741211, inf) THEN Survived = {1}
IF Sex_female = (-inf, 0.5> AND Age = (22.0, inf) AND Fare = (52.277099609375, 59.08749961853027> THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (-inf, 23.350000381469727> AND Age = (36.5, inf) THEN Survived = {0}
IF Sex_female = (-inf, 0.5> AND Age = (-inf, 6.5> AND SibSp = (3.0, inf) AND Parch = (-inf, 1.5> THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (23.350000381469727, inf) AND Age = (-inf, 3.5> THEN Survived = {0}
IF Sex_female = (-inf, 0.5> AND Age = (-inf, 6.5> AND SibSp = (3.0, inf) AND Parch = (1.5, inf) AND Fare = (31.331250190734863, inf) THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (-inf, 2.5> AND Pclass_2 = (0.5, inf) THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (0.5, inf) AND Fare = (23.350000381469727, inf) AND Age = (3.5, 7.0> THEN Survived = {1}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (2.5, inf) AND Fare = (149.035400390625, 152.5062484741211> THEN Survived = {0}
IF Sex_female = (-inf, 0.5> AND Age = (6.5, 22.0> AND Fare = (52.277099609375, 59.08749961853027> THEN Survived = {0}
IF Sex_female = (-inf, 0.5> AND Age = (77.0, inf) AND Fare = (-inf, 52.277099609375> THEN Survived = {1}
IF Sex_female = (-inf, 0.5> AND Age = (-inf, 6.5> AND SibSp = (3.0, inf) AND Parch = (1.5, inf) AND Fare = (-inf, 31.331250190734863> THEN Survived = {0}
IF Sex_female = (0.5, inf) AND Pclass_3 = (-inf, 0.5> AND Age = (-inf, 2.5> AND Pclass_2 = (-inf, 0.5> THEN Survived = {0}

3. Importance of features determined by RuleXAI¶

[25]:

explainer.feature_importances_

[25]:

	0 \| attributes	0 \| importances	1 \| attributes	1 \| importances
0	Pclass_3	0.895689	Sex_female	1.078293
1	Sex_female	0.844471	Age	0.973953
2	SibSp	0.683045	Fare	0.493457
3	Fare	0.584592	Pclass_3	0.140172
4	Age	0.396833	Pclass_2	0.126684
5	Pclass_2	0.186594	Embarked_C	0.11173
6	Embarked_C	0.101517	SibSp	0.089413
7	Parch	-0.051929	-	-
8	Pclass_1	-0.138521	-	-

Contrary to the importance of the features returned by the decision tree from sklearn, RuleXAI examines the importance of the conditions in the context of the class. In the case of this dataset, the importance of a feature tells us how much a given feature of a person contributed to the assignment of that person to this class. Analysing the ranking of the feature importance, it can be concluded that what most characterised the non-survivors was whether they traveled in grade 3 or not. Second important feature that had impact on non-survival was the gender of the person, and more precisely whether the person was a woman or not. We can also see that the number of siblings had an impact on survival - if someone had many siblings, the parents probably were not able to ensure the safety of all their children. Looking at the survivors, it can be seen that gender had the greatest impact on survival. We can also see that age had big impact on survival - we can draw the conclusions that probably at the beginning, children and the elderly were rescued, because people in their prime had a greater chance of surviving in difficult conditions. Next, whether someone survived depended on the fare - richer people were saved earlier than poorer.

From the feature ranking obtained with RuleXAI, similar conclusions can be drawn as with the feature ranking obtained with sklearn. The advantage of using RuleXAI, however, is that validity of the features is examined in the context of the class. Thanks to this, even without knowing the context, it would be known which feature influenced the assignment of a given class to a given object the most. An even more in-depth analysis can be performed using the ranking of conditions from the RuleXAI library, as will be shown later in the report

4. Model generation based on top 50% of features for each class from RuleXAI¶

[26]:

import numpy as np

features_importances_rulexai = explainer.feature_importances_

percent = 50
importances_TOP = []
for j in range(0, features_importances_rulexai.shape[1] + 0, 2):
    class_importances = (
        features_importances_rulexai.iloc[:, j]
        .replace("-", np.nan)
        .dropna()
    )
    class_importances_TOP_number = np.round(
        (percent / 100) * class_importances.shape[0]
    )

    if class_importances_TOP_number == 0:
        class_importances_TOP_number = 1

    class_importances_TOP = class_importances.loc[
        0: class_importances_TOP_number - 1
    ]
    importances_TOP.extend(list(class_importances_TOP))

importances_TOP_list = list(set(importances_TOP))

importances_TOP_list

[26]:

['Fare', 'Age', 'SibSp', 'Sex_female', 'Pclass_3']

[27]:

X_train_rulexai_features = X_train[importances_TOP_list]
X_test_rulexai_features = X_test[importances_TOP_list]

X_train_rulexai_features.head(5)

[27]:

	Fare	Age	SibSp	Sex_female	Pclass_3
445	81.8583	4.0	0	0	0
650	7.8958	28.0	0	0	1
172	11.1333	1.0	1	1	1
450	27.7500	36.0	1	0	0
314	26.2500	43.0	1	0	0

[28]:

dt_rulexai_features = DecisionTreeClassifier(random_state=1, max_depth=5)
cv = cross_val_score(dt_rulexai_features,
                     X_train_rulexai_features, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.74440807 0.83791965 0.72110622 0.79403567 0.78065775]
0.7756254727056493

[29]:

dt_rulexai_features.fit(X_train_rulexai_features, y_train)

[29]:

DecisionTreeClassifier(max_depth=5, random_state=1)

[30]:

balanced_accuracy_score(y_test, dt_rulexai_features.predict(X_test_rulexai_features))

[30]:

0.8017157284673209

By selecting 50% of the most important features from the ranking obtained with RuleXAI, it can be seen that compared to the basic set, the results obtained by the decision tree have improved. Comparing these results with the results obtained for the set containing 50% of the most important features from the ranking determined with the use of sklearn, we can see that the results also have improved.

5. Further analysis using RuleXAI¶

5.1 Rule condition importance¶

Below we present how the importance of the conditions from rules derived from a decision tree can be analysed.

[31]:

explainer.condition_importances_

[31]:

	0 \| conditions_names	0 \| importances	1 \| conditions_names	1 \| importances
0	Sex_female = (-inf, 0.5>	1.444185	Sex_female = (0.5, inf)	1.378759
1	Pclass_3 = (0.5, inf)	0.856801	Age = (77.0, inf)	0.404701
2	SibSp = (3.0, inf)	0.683045	Age = (-inf, 6.5>	0.339353
3	Fare = (149.035400390625, 152.5062484741211>	0.455265	Fare = (52.277099609375, 59.08749961853027>	0.285962
4	Age = (6.5, 22.0>	0.335094	Pclass_3 = (-inf, 0.5>	0.280239
5	Fare = (23.350000381469727, inf)	0.246165	Age = (3.5, 7.0>	0.254798
6	Age = (-inf, 2.5>	0.236955	Pclass_2 = (0.5, inf)	0.126684
7	Age = (36.5, inf)	0.204974	Fare = (59.08749961853027, inf)	0.120677
8	Pclass_2 = (-inf, 0.5>	0.186594	Embarked_C = (0.5, inf)	0.11173
9	Fare = (-inf, 31.331250190734863>	0.119378	SibSp = (-inf, 3.0>	0.089413
10	Fare = (-inf, 52.277099609375>	0.105219	Fare = (15.372900009155273, 23.350000381469727>	0.084646
11	Embarked_C = (-inf, 0.5>	0.101517	Fare = (152.5062484741211, inf)	0.039821
12	Age = (6.5, 77.0>	0.055288	Fare = (23.350000381469727, inf)	0.033452
13	Parch = (-inf, 1.5>	0.053057	Age = (-inf, 2.5>	0.031559
14	Pclass_1 = (-inf, 0.5>	0.042272	Age = (22.0, inf)	0.011474
15	Pclass_3 = (-inf, 0.5>	0.038888	Age = (-inf, 36.5>	0.011118
16	Fare = (-inf, 23.350000381469727>	0.038706	Age = (2.5, inf)	-0.003344
17	Age = (7.0, inf)	0.032743	Age = (2.5, 49.5>	-0.006889
18	Age = (6.5, inf)	0.030661	Age = (6.5, inf)	-0.008256
19	Age = (2.5, inf)	0.005461	Fare = (-inf, 149.035400390625>	-0.012183
20	Age = (-inf, 3.5>	-0.087280	Fare = (-inf, 52.277099609375>	-0.027228
21	Fare = (31.331250190734863, inf)	-0.103671	Fare = (-inf, 15.372900009155273>	-0.031689
22	Parch = (1.5, inf)	-0.104986	Age = (49.5, inf)	-0.060562
23	Fare = (59.08749961853027, inf)	-0.125345	Pclass_3 = (0.5, inf)	-0.140068
24	Fare = (52.277099609375, 59.08749961853027>	-0.151125	Sex_female = (-inf, 0.5>	-0.300466
25	Pclass_1 = (0.5, inf)	-0.180793	-	-
26	Age = (-inf, 6.5>	-0.417063	-	-
27	Sex_female = (0.5, inf)	-0.599715	-	-

Looking at the importance of individual conditions, we can see that the most important condition for a person not surviving is that the person was not a woman, that is, person was a man. On the other hand, the most important condition for a person to survive is that the person was a woman. In this way, we have an explicit confirmation of the hypothesis put forward on the basis of the data context during the analysis of the importance of the features. At this point, note that for categorical variables such as Sex_female, where the feature can be 0 or 1, the rules taken from the decision tree return the condition Sex_female = (-inf, 0.5> when the feature is 0 and Sex_female = (0.5, inf) when the feature takes the value 1.

We also see that the second condition determing that the person did not survive is that they traveled in 3rd grade. This also confirms the hypothesis put forward during the analysis of the feature importance. In the case of the feature importance analysis, we only had information on whether survival was affected by the fact that the person was travelling in 3rd grade. The ranking of conditions gives us an unambiguous confirmation of which decision is impacted by this feature.

The situation is similar with the number of siblings. We can see that if a person had more than 3 siblings, it was more likely that they were in the group of non-survivors. This confirms the hypothesis put forward earlier: with more children, the parents were not able to ensure the safety of all their children.

On the other hand, when looking at the conditions for survivors, it can be seen that in addition to being a woman, the following conditions rank high: Age = (77.0, inf), Age = (-inf, 6.5>. This confirms the hypothesis that children and the elderly people were saved first.

5.2 Local explainability¶

It is often interesting and important to know on what criteria the model made its decision for a given example. In general, thanks to the explanations obtained with XAI methods, the correctness of the model can be verified. Additionally, in some applications of AI it is important that thanks to the use of XAI people affected by the model’s decision better understand their situation and have more trust in the model.

This type of explanation is provided by RuleXAI. The explanations take the form of easy to understand and interpret rules, based on which the model makes a decision for a given example, and the importance of the conditions contained in them.

[32]:

example_X = X_train.iloc[1, :]
example_Y = pd.DataFrame(y_train).iloc[1, :]

explainer.local_explainability(example_X, example_Y, plot = True)

Example:
Age             28.0
SibSp            0.0
Parch            0.0
Fare          7.8958
Pclass_1         0.0
Pclass_2         0.0
Pclass_3         1.0
Sex_female       0.0
Sex_male         1.0
Embarked_C       0.0
Embarked_Q       0.0
Embarked_S       1.0
Survived           0
Name: 650, dtype: object

Rules that covers this example:
IF Sex_female = (-inf, 0.5> AND Age = (6.5, 77.0> AND Fare = (-inf, 52.277099609375> AND Pclass_1 = (-inf, 0.5> THEN Survived = {0}

Importances of the conditions from rules covering the example
             0 | conditions_names  0 | importances
0        Sex_female = (-inf, 0.5>         1.444185
1  Fare = (-inf, 52.277099609375>         0.105219
2               Age = (6.5, 77.0>         0.055288
3          Pclass_1 = (-inf, 0.5>         0.042272

../../_images/rst_tutorials_classification_52_1.png

[32]:

	0 \| conditions_names	0 \| importances
0	Sex_female = (-inf, 0.5>	1.444185
1	Fare = (-inf, 52.277099609375>	0.105219
2	Age = (6.5, 77.0>	0.055288
3	Pclass_1 = (-inf, 0.5>	0.042272

Looking at the explanation, we can see that the model classifies a person as a non-survivor based on the rule stating that the person not survived because: was male, was beetwen (6.5, 77.0> years old, paid little for the fare and did not travel in 1st class. The most important of these conditions was that this person was male.

5.3 Creation of a binary dataset¶

Another functionality provided by RuleXAI is the conversion of the input dataset into a set described by binary features that correspond to specific conditions determined by the model. If for a given condition the example takes the value 0, it means that it does not meet it. If, on the other hand, it takes 1, it means that it satisfies it. This dataset can be used to train other ML models. A significant advantage of such a set is that it has no missing values and only has one type of data (it can be considered as categorical or numerical). Thanks to this, it can be used with any of the available ML models.

[33]:

X_train_tranformed = explainer.fit_transform(X_train, selector=None)

X_train_tranformed.head(5)

[33]:

	Sex_female = (-inf, 0.5>	Age = (6.5, 77.0>	Fare = (-inf, 52.277099609375>	Pclass_1 = (-inf, 0.5>	Sex_female = (0.5, inf)	Pclass_3 = (-inf, 0.5>	Age = (2.5, 49.5>	Fare = (-inf, 149.035400390625>	Pclass_3 = (0.5, inf)	Fare = (-inf, 15.372900009155273>	...	Parch = (1.5, inf)	Fare = (31.331250190734863, inf)	Age = (-inf, 2.5>	Pclass_2 = (0.5, inf)	Age = (3.5, 7.0>	Fare = (-inf, 31.331250190734863>	Pclass_2 = (-inf, 0.5>
0	1	0	0	0	0	1	1	1	0	0	...	1	1	0	0	1	0	1
1	1	1	1	1	0	0	1	1	1	1	...	0	0	0	0	0	1	1
2	0	0	1	1	1	0	0	1	1	1	...	0	0	1	0	0	1	1
3	1	1	1	1	0	1	1	1	0	0	...	1	0	0	1	0	1	0
4	1	1	1	1	0	1	1	1	0	0	...	0	0	0	1	0	1	0

5 rows × 41 columns

[37]:

X_test_transformed = explainer.transform(X_test)

X_test_transformed.head(5)

[37]:

	Sex_female = (-inf, 0.5>	Age = (6.5, 77.0>	Fare = (-inf, 52.277099609375>	Pclass_1 = (-inf, 0.5>	Sex_female = (0.5, inf)	Pclass_3 = (-inf, 0.5>	Age = (2.5, 49.5>	Fare = (-inf, 149.035400390625>	Pclass_3 = (0.5, inf)	Fare = (-inf, 15.372900009155273>	...	Fare = (31.331250190734863, inf)	Pclass_2 = (0.5, inf)	Age = (3.5, 7.0>	Age = (6.5, 22.0>	Fare = (-inf, 31.331250190734863>	Pclass_2 = (-inf, 0.5>
0	1	1	1	1	0	0	1	1	1	1	...	0	0	0	0	1	1
1	1	1	1	1	0	1	1	1	0	1	...	0	1	0	0	1	0
2	1	1	1	1	0	0	1	1	1	1	...	0	0	0	1	1	1
3	0	0	1	1	1	1	1	1	0	0	...	1	1	1	0	0	0
4	0	1	1	1	1	0	1	1	1	1	...	0	0	0	1	1	1

5 rows × 41 columns

[35]:

dt_binary_dataset = DecisionTreeClassifier(random_state=1, max_depth=5)
cv = cross_val_score(dt_binary_dataset,
                     X_train_tranformed, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.76991271 0.83970831 0.74009356 0.7990524  0.84448161]
0.7986497169696036

[36]:

dt_binary_dataset.fit(X_train_tranformed, y_train)

[36]:

DecisionTreeClassifier(max_depth=5, random_state=1)

[38]:

balanced_accuracy_score(y_test, dt_binary_dataset.predict(X_test_transformed))

[38]:

0.7807425259654559

Looking at the results obtained by the decision tree trained on the created binary set, we can see that it obtains very similar (even slightly better) results than on the original dataset.

5.4 Creation of a binary dataset based on top conditions¶

RuleXAI allows you to create a binary dataset with a selected percentage of the most important conditions.

[39]:

X_train_tranformed = explainer.fit_transform(X_train, selector=0.25)

X_train_tranformed.head(5)

[39]:

	Pclass_3 = (-inf, 0.5>	Sex_female = (-inf, 0.5>	Sex_female = (0.5, inf)	Age = (3.5, 7.0>	Age = (-inf, 2.5>	Age = (-inf, 6.5>	Pclass_3 = (0.5, inf)	Fare = (23.350000381469727, inf)
0	1	1	0	1	0	1	0	1
1	0	1	0	0	0	0	1	0
2	0	0	1	0	1	1	1	0
3	1	1	0	0	0	0	0	1
4	1	1	0	0	0	0	0	1

[44]:

X_test_transformed = explainer.transform(X_test)

X_test_transformed.head(5)

[44]:

	Pclass_3 = (-inf, 0.5>	Sex_female = (-inf, 0.5>	Sex_female = (0.5, inf)	Age = (3.5, 7.0>	Age = (6.5, 22.0>	Age = (-inf, 6.5>	Pclass_3 = (0.5, inf)	Fare = (23.350000381469727, inf)
0	0	1	0	0	0	0	1	0
1	1	1	0	0	0	0	0	0
2	0	1	0	0	1	0	1	0
3	1	0	1	1	0	1	0	1
4	0	0	1	0	1	0	1	0

[41]:

dt_binary_dataset_with_TOP_conditions = DecisionTreeClassifier(random_state=1, max_depth=5)
cv = cross_val_score(dt_binary_dataset_with_TOP_conditions,
                     X_train_tranformed, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.77414075 0.86419923 0.76004403 0.79849498 0.81633222]
0.8026422425531315

[42]:

dt_binary_dataset_with_TOP_conditions.fit(X_train_tranformed,y_train)

[42]:

DecisionTreeClassifier(max_depth=5, random_state=1)

[45]:

balanced_accuracy_score(y_test, dt_binary_dataset_with_TOP_conditions.predict(X_test_transformed))

[45]:

0.7701841969357893

The results obtained on a binary set containing only 25% of all conditions are very similar to those obtained on the entire set. We can see that with fewer data dimensions, similar results can be obtained. Under appropriate conditions, a binary set can be used also to reduce the dimensionality of the set.

5.5 Condition importance based on non-overlapping rule conditions¶

Below we present another analysis of the importance of the conditions from rules derived from a decision tree. This time the analysis focuses on conditions in rules splitted into base conditions so that individual conditions do not overlap.

[47]:

explainer.explain(basic_conditions=True)
explainer.condition_importances_

[47]:

	0 \| conditions_names	0 \| importances	1 \| conditions_names	1 \| importances
0	Sex_female = (-inf, 0.5)	0.709511	Sex_female = <0.5, inf)	0.593139
1	SibSp = <3.0, inf)	0.365199	Age = <77.0, inf)	0.427564
2	Pclass_3 = <0.5, inf)	0.359394	Age = <3.5, 6.5)	0.306936
3	Fare = <149.035400390625, 152.5062484741211)	0.345837	Fare = <52.277099609375, 59.08749961853027)	0.243723
4	Age = <7.0, 22.0)	0.205906	Age = <2.5, 3.5)	0.217869
5	Fare = (-inf, 15.372900009155273)	0.201038	Pclass_3 = (-inf, 0.5)	0.129432
6	Age = (-inf, 2.5)	0.188995	Age = (-inf, 2.5)	0.127256
7	Pclass_2 = (-inf, 0.5)	0.186594	Pclass_2 = <0.5, inf)	0.126684
8	Fare = <23.350000381469727, 31.331250190734863)	0.094355	Fare = <59.08749961853027, 149.035400390625)	0.092037
9	Pclass_3 = (-inf, 0.5)	0.086410	Fare = <15.372900009155273, 23.350000381469727)	0.09141
10	Age = <36.5, 49.5)	0.083850	Fare = <152.5062484741211, inf)	0.079653
11	Age = <22.0, 36.5)	0.074990	Fare = <23.350000381469727, 31.331250190734863)	0.075764
12	Fare = <31.331250190734863, 52.277099609375)	0.045709	SibSp = (-inf, 3.0)	0.053725
13	Embarked_C = (-inf, 0.5)	0.036915	Age = <22.0, 36.5)	0.051714
14	Parch = (-inf, 1.5)	0.035371	Embarked_C = <0.5, inf)	0.040629
15	Age = <49.5, 77.0)	0.031508	Age = <49.5, 77.0)	0.026622
16	Pclass_1 = (-inf, 0.5)	0.015372	Fare = <31.331250190734863, 52.277099609375)	0.025254
17	Age = <6.5, 7.0)	0.000000	Age = <6.5, 7.0)	0.0
18	Parch = <1.5, inf)	-0.049627	Age = <36.5, 49.5)	-0.001405
19	Pclass_1 = <0.5, inf)	-0.065743	Age = <7.0, 22.0)	-0.035978
20	Fare = <15.372900009155273, 23.350000381469727)	-0.104907	Fare = (-inf, 15.372900009155273)	-0.052213
21	Fare = <59.08749961853027, 149.035400390625)	-0.137746	Fare = <149.035400390625, 152.5062484741211)	-0.05625
22	Fare = <152.5062484741211, inf)	-0.191079	Pclass_3 = <0.5, inf)	-0.060642
23	Age = <3.5, 6.5)	-0.222782	Sex_female = (-inf, 0.5)	-0.151878
24	Fare = <52.277099609375, 59.08749961853027)	-0.266055	-	-
25	Sex_female = <0.5, inf)	-0.330555	-	-
26	Age = <77.0, inf)	-0.340750	-	-
27	Age = <2.5, 3.5)	-0.411382	-	-

Looking at the ranking of conditions that do not overlap, we can come to conclusions similar to the ones drawn from the ranking of overlaping conditions. However, differences can also be seen, e.g. the condition Age = (-inf, 6.5> was high in the ranking of overlapping conditions, suggesting the conclusion that all children under 6.5 were more likely to survive. Nevertheless, when we look at the assessment of the basic conditions, we can see that children aged between 2.5 and 6.5 years [Age = <2.5, 3.5), Age = <3.5, 6.5)] had a good chance of survival, and the condition for children under 2.5 is lower in the ranking. This suggests that infants (under 2.5 years of age) were less likely to survive, possibly due to their dependency.

5.3 Presentation of the rules consisting of non-overlapping base conditions¶

[48]:

rules = explainer.get_rules_with_basic_conditions()
for rule in rules:
    print(rule)

IF [Sex_female = (-inf, 0.5)] AND [Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375)] AND [Pclass_1 = (-inf, 0.5)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = <2.5, 3.5) OR Age = <3.5, 6.5) OR Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625)] AND [Pclass_3 = (-inf, 0.5)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5) OR Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5)] AND [Fare = (-inf, 15.372900009155273)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {1}
IF [Sex_female = (-inf, 0.5)] AND [Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375)] AND [Pclass_1 = <0.5, inf)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5) OR Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5)] AND [Fare = <15.372900009155273, 23.350000381469727)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {1}
IF [Sex_female = (-inf, 0.5)] AND [Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [Embarked_C = (-inf, 0.5)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {0}
IF [Sex_female = (-inf, 0.5)] AND [Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [Embarked_C = <0.5, inf)] THEN Survived = {1}
IF [Sex_female = (-inf, 0.5)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5)] AND [SibSp = (-inf, 3.0)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625)] AND [Pclass_3 = (-inf, 0.5)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = <2.5, 3.5) OR Age = <3.5, 6.5) OR Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <152.5062484741211, inf)] AND [Pclass_3 = (-inf, 0.5)] THEN Survived = {1}
IF [Sex_female = (-inf, 0.5)] AND [Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <52.277099609375, 59.08749961853027)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {0}
IF [Sex_female = (-inf, 0.5)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5)] AND [SibSp = <3.0, inf)] AND [Parch = (-inf, 1.5)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5)] AND [Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {0}
IF [Sex_female = (-inf, 0.5)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5)] AND [Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [SibSp = <3.0, inf)] AND [Parch = <1.5, inf)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = (-inf, 2.5)] AND [Pclass_3 = (-inf, 0.5)] AND [Pclass_2 = <0.5, inf)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = <3.5, 6.5) OR Age = <6.5, 7.0)] AND [Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375) OR Fare = <52.277099609375, 59.08749961853027) OR Fare = <59.08749961853027, 149.035400390625) OR Fare = <149.035400390625, 152.5062484741211) OR Fare = <152.5062484741211, inf)] AND [Pclass_3 = <0.5, inf)] THEN Survived = {1}
IF [Sex_female = <0.5, inf)] AND [Age = <2.5, 3.5) OR Age = <3.5, 6.5) OR Age = <6.5, 7.0) OR Age = <7.0, 22.0) OR Age = <22.0, 36.5) OR Age = <36.5, 49.5) OR Age = <49.5, 77.0) OR Age = <77.0, inf)] AND [Fare = <149.035400390625, 152.5062484741211)] AND [Pclass_3 = (-inf, 0.5)] THEN Survived = {0}
IF [Sex_female = (-inf, 0.5)] AND [Age = <6.5, 7.0) OR Age = <7.0, 22.0)] AND [Fare = <52.277099609375, 59.08749961853027)] THEN Survived = {0}
IF [Sex_female = (-inf, 0.5)] AND [Age = <77.0, inf)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863) OR Fare = <31.331250190734863, 52.277099609375)] THEN Survived = {1}
IF [Sex_female = (-inf, 0.5)] AND [Age = (-inf, 2.5) OR Age = <2.5, 3.5) OR Age = <3.5, 6.5)] AND [Fare = (-inf, 15.372900009155273) OR Fare = <15.372900009155273, 23.350000381469727) OR Fare = <23.350000381469727, 31.331250190734863)] AND [SibSp = <3.0, inf)] AND [Parch = <1.5, inf)] THEN Survived = {0}
IF [Sex_female = <0.5, inf)] AND [Age = (-inf, 2.5)] AND [Pclass_3 = (-inf, 0.5)] AND [Pclass_2 = (-inf, 0.5)] THEN Survived = {0}

IV. Using the RuleKit library - a versatile tool for rule learning - to generate rules¶

In the previous section, the rules obtained from the decision tree were analysed. In this section the analysis is based on rules obtained using the algorithm dedicated for rule-based learning. A set of such algorithms is provided by the RuleKit library.

1. Data preparation for RuleKit¶

RuleKit supports missing values and categorical data, so the step of preparing data specifically for this algorithm is not necessary - what was done in Section I is sufficient. The only thing to remember is that RuleKit accepts missing values for numeric columns as nan, while for categorical columns as None. For this reason it was necessary to change the missing values in the Embarked column from nan to None.

[49]:

import numpy as np

X = data.drop(["Survived"], axis=1)
X.Embarked.replace(np.nan, None, inplace = True)
y = data.Survived

X.head(5)

[49]:

	Pclass	Sex	Age	SibSp	Fare	Embarked
0	3	male	22.0	1	7.2500	S
1	1	female	38.0	1	71.2833	C
2	3	female	26.0	0	7.9250	S
3	1	female	35.0	1	53.1000	S
4	3	male	35.0	0	8.0500	S

2. Data split for training and test datasets¶

[50]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

3. Building and testing the model¶

[51]:

from sklearn.model_selection import cross_val_score
from rulekit import RuleKit
from rulekit.classification import RuleClassifier
from rulekit.params import Measures

RuleKit.init()

rc = RuleClassifier(
    induction_measure=Measures.C2,
    pruning_measure=Measures.C2,
    voting_measure=Measures.C2,
    min_rule_covered=5
)
cv = cross_val_score(rc, X_train, y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.73958333 0.88666667 0.70763889 0.8027027  0.82383191]
0.7920847009219034

[52]:

rc.fit(X_train, y_train)

[52]:

<rulekit.classification.RuleClassifier at 0x29d6da16fd0>

[53]:

balanced_accuracy_score(y_test, rc.predict(X_test))

[53]:

0.749756125552304

The quality of the rule model is similar to that obtained for the decision tree. By controlling the rule model induction parameters even better results could be obtained, but this is not the subject of this notebook.

4. Presentation of the rules obtained by RuleKit¶

[54]:

for rule in rc.model.rules:
    print(rule)

IF Age = (-inf, 0.92) THEN Survived = {1}
IF Pclass = {2} AND Sex = {female} THEN Survived = {1}
IF Parch = (-inf, 1.50) AND Sex = {female} AND Fare = <29.36, inf) THEN Survived = {1}
IF Pclass = {1} AND Sex = {female} AND Age = <8, inf) AND Fare = <29.36, inf) THEN Survived = {1}
IF Parch = (-inf, 1.50) AND Sex = {female} AND SibSp = (-inf, 2.50) AND Fare = <21.72, inf) THEN Survived = {1}
IF Sex = {female} AND SibSp = <0.50, inf) AND Age = <27.50, inf) AND Fare = <15.98, 22.34) THEN Survived = {1}
IF Parch = (-inf, 1.50) AND Sex = {female} AND SibSp = (-inf, 2.50) AND Fare = <10.48, inf) THEN Survived = {1}
IF Parch = (-inf, 3.50) AND Sex = {female} AND SibSp = (-inf, 0.50) AND Fare = <10.83, inf) THEN Survived = {1}
IF Parch = (-inf, 1.50) AND Sex = {female} AND Fare = <6.99, inf) THEN Survived = {1}
IF SibSp = (-inf, 2) AND Age = <8, 43.50) AND Fare = <82.66, inf) THEN Survived = {1}
IF SibSp = (-inf, 1.50) AND Age = <3, 62) AND Fare = <74.38, inf) THEN Survived = {1}
IF SibSp = (-inf, 1.50) AND Age = <22.50, 44.50) AND Fare = <52.28, 143.59) THEN Survived = {1}
IF Pclass = {2} AND Age = (-inf, 6.50) THEN Survived = {1}
IF Embarked = {S} AND Age = <2.50, 6.50) THEN Survived = {1}
IF Parch = (-inf, 0.50) AND Fare = <29.85, inf) THEN Survived = {1}
IF Pclass = {1} AND Parch = (-inf, 0.50) AND Age = (-inf, 48.50) AND Fare = <26.14, 30.75) THEN Survived = {1}
IF Embarked = {C} AND Age = (-inf, 29.50) AND Fare = <7.56, 135.07) THEN Survived = {1}
IF Parch = (-inf, 3.50) AND SibSp = (-inf, 2.50) AND Age = (-inf, 51.50) AND Fare = <18.38, inf) THEN Survived = {1}
IF Age = (-inf, 42.50) AND Fare = <10.48, inf) THEN Survived = {1}
IF SibSp = (-inf, 0.50) AND Age = <30.50, 44.50) AND Fare = <7.91, inf) THEN Survived = {1}
IF Parch = (-inf, 3.50) AND SibSp = (-inf, 2.50) AND Fare = <7.91, inf) THEN Survived = {1}
IF Age = <20, 27.50) AND Fare = <7.13, 7.80) THEN Survived = {1}
IF Parch = (-inf, 3.50) AND SibSp = (-inf, 4.50) AND Age = (-inf, 60.50) AND Fare = <7.69, inf) THEN Survived = {1}
IF SibSp = <4.50, inf) THEN Survived = {0}
IF Parch = <3.50, inf) THEN Survived = {0}
IF Age = <26.50, inf) AND Fare = (-inf, 7.13) THEN Survived = {0}
IF Fare = <2.01, 7.13) THEN Survived = {0}
IF Sex = {male} AND Age = <13, 60.50) AND Fare = (-inf, 26.27) THEN Survived = {0}
IF Sex = {male} AND Age = <13, 77) AND Fare = (-inf, 52.28) THEN Survived = {0}
IF Embarked = {S} AND Sex = {male} AND Age = <13, 30.50) AND Fare = <7.80, inf) THEN Survived = {0}
IF Sex = {male} AND Age = <13, 77) AND Fare = (-inf, 86.29) THEN Survived = {0}
IF Sex = {male} AND Age = <13, 77) AND Fare = (-inf, 387.66) THEN Survived = {0}
IF Pclass = {3} AND Sex = {male} AND Age = <6.50, 25.50) AND Fare = <16, inf) THEN Survived = {0}
IF Pclass = {3} AND Fare = <23.35, 52.28) THEN Survived = {0}
IF Fare = <9.41, 10.48) THEN Survived = {0}
IF Embarked = {S} AND Age = <20.50, 30.50) AND Fare = <7.87, 8.08) THEN Survived = {0}
IF Embarked = {S} AND Parch = (-inf, 0.50) AND Fare = (-inf, 10.48) THEN Survived = {0}
IF Age = <17.50, inf) AND Fare = (-inf, 10.48) THEN Survived = {0}
IF Embarked = {S} AND Pclass = {3} AND Parch = (-inf, 0.50) AND SibSp = (-inf, 2.50) AND Age = <16.50, inf) AND Fare = <7.99, 51.70) THEN Survived = {0}
IF Pclass = {3} AND Age = <6.50, inf) AND Fare = <13.29, 15.17) THEN Survived = {0}
IF Parch = (-inf, 2.50) AND Age = <2.50, inf) AND Fare = (-inf, 29.41) THEN Survived = {0}

V. Analysis with RuleXAI of rules derived with RuleKit¶

1. Initialisation and explaination¶

[55]:

from rulexai.explainer import RuleExplainer
explainer = RuleExplainer(model=rc, X=X_train, y=y_train, type="classification")
explainer.explain()

[55]:

<rulexai.explainer.RuleExplainer at 0x29d6e38f3a0>

2. Feature importance determined by RuleXAI¶

[56]:

explainer.feature_importances_

[56]:

	1 \| attributes	1 \| importances	0 \| attributes	0 \| importances
0	Sex	2.153698	Fare	2.745895
1	Fare	1.834094	Sex	1.481201
2	Age	1.518890	Pclass	0.731226
3	Pclass	0.437313	Parch	0.571222
4	SibSp	0.161694	Age	0.550799
5	Embarked	0.135499	SibSp	0.513230
6	Parch	0.089201	Embarked	0.205367

3. Rule condition importance¶

[57]:

explainer.condition_importances_

[57]:

	1 \| conditions_names	1 \| importances	0 \| conditions_names	0 \| importances
0	Sex = {female}	2.153698	Sex = {male}	1.481201
1	Age = (-inf, 0.92)	0.510823	Pclass = {3}	0.731226
2	Age = <2.5, 6.5)	0.426003	Fare = <2.01, 7.13)	0.519133
3	Age = (-inf, 6.5)	0.359902	Fare = <9.41, 10.48)	0.517857
4	Pclass = {2}	0.275112	Fare = (-inf, 10.48)	0.516811
5	Fare = <82.66, inf)	0.267579	SibSp = <4.5, inf)	0.515306
6	Fare = <52.28, 143.59)	0.257940	Parch = <3.5, inf)	0.508929
7	Fare = <29.85, inf)	0.247359	Fare = (-inf, 7.13)	0.482425
8	Fare = <74.38, inf)	0.239459	Embarked = {S}	0.205367
9	Fare = <10.48, inf)	0.213492	Fare = <7.87, 8.08)	0.20006
10	Pclass = {1}	0.162201	Age = <13.0, 77.0)	0.12964
11	Embarked = {C}	0.154946	Fare = <13.29, 15.17)	0.123496
12	Fare = <29.36, inf)	0.149020	Fare = <23.35, 52.28)	0.112628
13	Fare = <7.91, inf)	0.125088	Fare = (-inf, 29.41)	0.110282
14	Fare = <18.38, inf)	0.098349	Age = <20.5, 30.5)	0.087622
15	Parch = (-inf, 1.5)	0.071067	Fare = (-inf, 26.27)	0.071871
16	SibSp = (-inf, 1.5)	0.066773	Fare = (-inf, 52.28)	0.068313
17	Fare = <21.72, inf)	0.065569	Age = <6.5, 25.5)	0.062968
18	Fare = <10.83, inf)	0.062618	Parch = (-inf, 0.5)	0.060664
19	Age = <20.0, 27.5)	0.059567	Age = <13.0, 30.5)	0.055904
20	Age = <30.5, 44.5)	0.057521	Age = <6.5, inf)	0.052733
21	SibSp = (-inf, 2.5)	0.056610	Age = <17.5, inf)	0.048465
22	Age = <27.5, inf)	0.043101	Age = <13.0, 60.5)	0.041519
23	Fare = <15.98, 22.34)	0.035205	Age = <26.5, inf)	0.040535
24	Fare = <7.56, 135.07)	0.032317	Fare = (-inf, 86.29)	0.03164
25	Age = (-inf, 48.5)	0.030201	Age = <2.5, inf)	0.019917
26	SibSp = <0.5, inf)	0.028345	Fare = <7.99, 51.7)	0.015628
27	Fare = <26.14, 30.75)	0.026043	Age = <16.5, inf)	0.011496
28	Age = <22.5, 44.5)	0.024332	Fare = (-inf, 387.66)	0.006411
29	Parch = (-inf, 3.5)	0.024229	Parch = (-inf, 2.5)	0.00163
30	Fare = <7.69, inf)	0.018042	SibSp = (-inf, 2.5)	-0.002076
31	SibSp = (-inf, 2.0)	0.008422	Fare = <7.8, inf)	-0.014855
32	Fare = <6.99, inf)	0.007874	Fare = <16.0, inf)	-0.015803
33	SibSp = (-inf, 4.5)	0.006384	-	-
34	Age = <3.0, 62.0)	0.004901	-	-
35	Age = <8.0, 43.5)	0.004844	-	-
36	Age = (-inf, 51.5)	0.002246	-	-
37	Age = (-inf, 60.5)	0.002245	-	-
38	Age = (-inf, 42.5)	0.000703	-	-
39	Age = <8.0, inf)	-0.003570	-	-
40	Age = (-inf, 29.5)	-0.003928	-	-
41	SibSp = (-inf, 0.5)	-0.004841	-	-
42	Parch = (-inf, 0.5)	-0.006095	-	-
43	Fare = <7.13, 7.8)	-0.011860	-	-
44	Embarked = {S}	-0.019447	-	-

Looking at the importance of the features and conditions obtained from the rules determined by RuleKit, one can come to conclusions similar to those obtained for rules determined from the decision tree. The main difference is that in rules generated by RuleKit, the fare also plays an important role - the lower the fare, the person traveled in a lower class what determined their survival.

4. Creation of a binary dataset based on top conditions¶

[59]:

X_train_tranformed = explainer.fit_transform(X_train, selector=25)

X_train_tranformed.head(5)

[59]:

	Fare = <52.28, 143.59)	Fare = <29.85, inf)	Pclass = {1}	Fare = <74.38, inf)	Age = <2.5, 6.5)	Sex = {male}	Pclass = {3}	Fare = <10.48, inf)	Fare = (-inf, 10.48)	Sex = {female}	Pclass = {2}	Age = (-inf, 6.5)
0	1	1	1	1	1	1	0	1	0	0	0	1
1	0	0	0	0	0	1	1	0	1	0	0	0
2	0	0	0	0	0	0	1	1	0	1	0	1
3	0	0	0	0	0	1	0	1	0	0	1	0
4	0	0	0	0	0	1	0	1	0	0	1	0

[60]:

X_test_transformed = explainer.transform(X_test)

X_test_transformed.head(5)

[60]:

	Fare = <29.85, inf)	Age = <2.5, 6.5)	Sex = {male}	Pclass = {3}	Fare = <10.48, inf)	Fare = (-inf, 10.48)	Sex = {female}	Pclass = {2}	Age = (-inf, 6.5)
0	0	0	1	1	1	0	0	0	0
1	0	0	1	0	1	0	0	1	0
2	0	0	1	1	0	1	0	0	0
3	1	1	0	0	1	0	1	1	1
4	0	0	0	1	1	0	1	0	0

A binary dataset created in this way can be used to create a classifier using another machine learning algorithm. The advantage of this approach is that RuleKit has created rules on a set containing null values and containing both numeric and categorical variables. However, the prepared dataset consists only of numerical values 0 and 1 determining whether a given condition has been met and does not contain empty values. Therefore, you can easily use algorithms that deal only with numerical values and do not handle missing values, such as RandomForest as shown below.

[62]:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1)
cv = cross_val_score(rf, X_train_tranformed,
                     y_train, cv=5, scoring = "balanced_accuracy")
print(cv)
print(cv.mean())

[0.7570922  0.86241057 0.703082   0.73188406 0.8113155 ]
0.7731568645642581

[64]:

rf.fit(X_train_tranformed, y_train)

[64]:

RandomForestClassifier(random_state=1)

[65]:

balanced_accuracy_score(y_test, rf.predict(X_test_transformed))

[65]:

0.7861077638147702

VI. Summary¶

The presented analysis shows how the RuleXAI library may be used for data analysis and model explanation. Explanations, both global and local, are performed using the generated rule-based model representation.